Trading Bitcoin with Reinforcement Learning

Introduction to Algorithmic Trading

Algorithmic trading has been around for decades and has, for the most part, enjoyed a fair amount of success in its varied forms. Traditionally, algorithmic trading involves selecting trading rules that are carefully designed, optimized, and tested by humans. While these strategies have the advantage of being systematic and able to operate at speeds and frequencies beyond human traders, they are susceptible to all kinds of selection biases and are unable to adapt to changing market conditions.

Reinforcement learning (RL), on the other hand, is much more "hands off." In RL, an "agent" simply aims to maximize its reward in a given environment and improves its decision making through trial and error as it experiences more examples. It can also learn to make decisions based not only on its beliefs about the environment one step ahead but also on how the market plays out farther down the road. In most traditional trading algorithms, there are separate processes for prediction, turning that prediction into an action, and determining the frequency of the action based on transaction costs. RL supports an approach that integrates these processes. For all these reasons, RL may discover actions that humans normally would not find.

As a proof of concept, we designed and implemented a trading system for bitcoin, since trade data is readily available. To evaluate the efficacy of our reinforcement learning agent, we compare its out-of-sample investment performance against a buy-and-hold strategy and a momentum strategy. We believe this framework could easily be expanded and applied to other investment assets.

Reinforcement Learning Basics

Reinforcement learning is appropriate when the state space (the quantitative description of the environment) is large or even continuous. It may be especially useful when it is impractical to obtain labels for supervised learning. Trading is a good example: the correct actions aren't known, and even if they were, it would be nearly impossible to label every situation in which the agent has to act. RL is also appropriate when, as in trading, actions have long-term consequences and rewards may be delayed.

The essential ingredients of reinforcement learning are states, actions, rewards, and an action selection policy. In a given problem, an agent is supposed to select the best action given its current state. This action produces an observation of the new state as well as a reward, and the cycle repeats in what is known as a Markov Decision Process. For the agent to learn its behavior, or policy, the reward feedback for this sequence of actions is used to tune the parameters of the model.

There are two main ways of formulating the problem: value based and policy based. In a value-based approach, the value of each state or state-action pair is estimated. The policy is generated by estimating these values accurately and then selecting the action with the highest value. In a policy-based approach, which is our chosen method, we directly parametrize the policy and then find the parameters that maximize expected rewards.


We downloaded the price and volume of each transaction on the GDAX exchange (formerly Coinbase Exchange) from December 1, 2014 to June 14, 2017, and aggregated them into 15-minute candles (or intervals). We then split the data into a 70%/30% train/test set.
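The aggregation and split can be sketched in plain Python as follows. This is a minimal illustration, not the original pipeline: the `(timestamp, price, size)` trade layout, the function names, and the chronological (rather than shuffled) split are assumptions, although a chronological split is consistent with the out-of-sample period reported later.

```python
from collections import OrderedDict

CANDLE_SECONDS = 15 * 60  # 15-minute candles, as in the text


def build_candles(trades, width=CANDLE_SECONDS):
    """Aggregate (timestamp, price, size) trades into OHLCV candles.

    Assumes `trades` is sorted by timestamp (seconds since epoch).
    Intervals without any trade produce no candle in this sketch.
    """
    candles = OrderedDict()
    for ts, price, size in trades:
        start = int(ts // width) * width  # start of the bucket for this trade
        candle = candles.get(start)
        if candle is None:
            candles[start] = {"start": start, "open": price, "high": price,
                              "low": price, "close": price, "volume": size}
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # last trade seen so far in the bucket
            candle["volume"] += size
    return list(candles.values())


def chronological_split(candles, train_fraction=0.7):
    """Split candles into an earlier training part and a later test part."""
    cut = int(len(candles) * train_fraction)
    return candles[:cut], candles[cut:]
```

Keeping the test set strictly later in time than the training set avoids leaking future prices into training.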

Each 15-minute candle is one step, and an episode is defined as 96 steps, or roughly one day of trading. During training, a random block of 96 contiguous candles is selected to be played as an episode, and a random number of bitcoins between 0 and 4 is selected to start the sequence. At each step the agent decides to buy, sell, or hold, subject to a lower limit of 0 and an upper limit of 4 bitcoins. The bitcoin holdings at each step are calculated, as well as the returns based on those holdings. The return at step t is calculated as (number of bitcoins) * [p(t)/p(t-1) - 1], where p(t) is the price at step t. At the end of each episode, we collect all the inputs, actions taken, and returns.
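A minimal Python sketch of these episode mechanics follows. Several details are assumptions the text does not specify: each buy or sell changes the holdings by exactly one bitcoin, the action chosen at step t sets the holdings that earn the return from t to t+1, and an episode therefore needs 97 prices to produce 96 returns. The function names are hypothetical.

```python
import random

EPISODE_STEPS = 96
MIN_BTC, MAX_BTC = 0, 4
ACTION_DELTA = {"sell": -1, "hold": 0, "buy": 1}  # assumed one-bitcoin trade size


def apply_action(holdings, action):
    """Update holdings, clipped to the allowed range of 0 to 4 bitcoins."""
    return max(MIN_BTC, min(MAX_BTC, holdings + ACTION_DELTA[action]))


def step_return(holdings, price_prev, price_now):
    """Return for one step: holdings * [p(t) / p(t-1) - 1]."""
    return holdings * (price_now / price_prev - 1.0)


def sample_episode(prices, rng):
    """Pick a random contiguous price window and random starting holdings."""
    start = rng.randrange(len(prices) - EPISODE_STEPS)
    window = prices[start:start + EPISODE_STEPS + 1]
    return window, rng.randint(MIN_BTC, MAX_BTC)


def play_episode(prices, choose_action, start_holdings):
    """Run one episode and collect the actions taken and per-step returns."""
    holdings = start_holdings
    actions, returns = [], []
    for t in range(len(prices) - 1):
        action = choose_action(holdings, t)
        holdings = apply_action(holdings, action)
        actions.append(action)
        returns.append(step_return(holdings, prices[t], prices[t + 1]))
    return actions, returns
```

Here `choose_action` stands in for the agent's policy; a random policy such as `lambda h, t: random.choice(list(ACTION_DELTA))` is enough to exercise the loop.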

For our RL agent to learn a proper policy, it needs inputs that are representative of the state of the market and are somewhat predictive in aggregate. We use 18 different technical indicators that express where the current price and volume are in relation to their past history, along with 5 state variables that represent the 5 possible bitcoin holdings between 0 and 4 bitcoins.
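One plausible way to assemble the resulting 23-element input is shown below. Reading the 5 holding variables as a one-hot encoding is an assumption, as are the function names; the indicator values themselves are taken as given.

```python
N_INDICATORS = 18
N_HOLDING_LEVELS = 5  # holdings of 0, 1, 2, 3, or 4 bitcoins


def one_hot_holdings(holdings, n_levels=N_HOLDING_LEVELS):
    """Encode the current holdings as a one-hot vector (assumed encoding)."""
    vec = [0.0] * n_levels
    vec[holdings] = 1.0
    return vec


def build_state(indicators, holdings):
    """Concatenate 18 indicator values with the 5 holding variables."""
    if len(indicators) != N_INDICATORS:
        raise ValueError("expected %d indicators" % N_INDICATORS)
    return list(indicators) + one_hot_holdings(holdings)
```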

The indicators used are fairly generic momentum- and reversion-type signals, and their details are provided in Table 1. As an illustrative example of how these indicators might work, the agent may learn that rising prices along with steady volume are a bullish sign and adjust its weights so that it has a higher tendency to buy more bitcoins.

Table 1: List of indicators used (r is return, p is price, and v is volume).




We chose a policy gradient agent, which directly learns an action policy over the state space. Our multilayer perceptron (MLP) has one hidden layer and an output layer, as shown in Figure 1. The hidden layer contains 23 neurons with a ReLU activation, followed by dropout with a rate of 0.5 to avoid overfitting. The output layer has 3 neurons and a softmax activation in order to produce action probabilities. All layers are fully connected and contain biases. The weights are initialized using Xavier initialization and the biases are initialized to 0. Our implementation is developed in Python using TensorFlow as the computational backend.
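The forward pass of such a network can be sketched without any framework, as below. This is an illustration in plain Python rather than the TensorFlow implementation: the uniform variant of Xavier initialization, the inverted-dropout scaling, the input size of 23 (18 indicators plus 5 holding variables), and all names are assumptions.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_ACTIONS = 23, 23, 3


def xavier_matrix(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform initialization for a fan_in x fan_out matrix."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def dense(x, weights, biases):
    """Fully connected layer: x @ weights + biases."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(seed=0):
    rng = random.Random(seed)
    return {
        "w1": xavier_matrix(N_INPUTS, N_HIDDEN, rng), "b1": [0.0] * N_HIDDEN,
        "w2": xavier_matrix(N_HIDDEN, N_ACTIONS, rng), "b2": [0.0] * N_ACTIONS,
    }


def action_probabilities(state, params, keep_prob=0.5, training=False, rng=random):
    """Return probabilities for the three actions (buy, sell, hold)."""
    hidden = [max(0.0, h) for h in dense(state, params["w1"], params["b1"])]  # ReLU
    if training:
        # Inverted dropout: zero units at random and rescale the survivors.
        hidden = [h / keep_prob if rng.random() < keep_prob else 0.0
                  for h in hidden]
    return softmax(dense(hidden, params["w2"], params["b2"]))
```

During training an action would be sampled from these probabilities; dropout is switched off at evaluation time.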

The reward is calculated as the sum of the discounted returns from the step in question to the end of the episode.
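A short sketch of this computation follows. The discount factor `gamma` is a hypothetical value; the text does not state the one used.

```python
def discounted_rewards(returns, gamma=0.99):
    """Sum of discounted returns from each step to the end of the episode.

    `gamma` is an assumed discount factor, not a value from the article.
    """
    out = [0.0] * len(returns)
    running = 0.0
    for t in reversed(range(len(returns))):
        running = returns[t] + gamma * running
        out[t] = running
    return out
```

Working backwards through the episode lets each step reuse the discounted sum already computed for the step after it.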

The loss function is -mean(log(responsible outputs) * discounted rewards), where the responsible output at each step is the probability the network assigned to the action that was actually chosen. We then minimize this loss with the Adam optimizer.
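Evaluated on one episode, the loss could look like the sketch below. It only computes the scalar value; in the actual training setup the gradient and the Adam update would be handled by the framework. The argument layout (one probability vector and one action index per step) is an assumption.

```python
import math


def policy_gradient_loss(action_probs, chosen_actions, rewards):
    """-mean(log(probability of chosen action) * discounted reward).

    action_probs:   list of per-step probability vectors from the network
    chosen_actions: list of per-step indices of the action actually taken
    rewards:        list of per-step discounted rewards
    """
    terms = [math.log(probs[action]) * reward
             for probs, action, reward in zip(action_probs, chosen_actions, rewards)]
    return -sum(terms) / len(terms)
```

Minimizing this value raises the probability of actions that were followed by positive discounted rewards and lowers it for actions followed by negative ones.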


Figure 2 shows losses and rewards through 3 million episodes of training.  Both of these metrics are smoothed by taking the running average over 100,000 episodes as they are naturally extremely noisy.  
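The smoothing can be illustrated with a trailing running average such as the one below. Whether the original plots used a trailing or a centered window is not stated, so the trailing form is an assumption.

```python
from collections import deque


def running_average(values, window):
    """Trailing running average; early points average over what is available."""
    out = []
    buf = deque()
    total = 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that left the window
        out.append(total / len(buf))
    return out
```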

Figure 2: Average loss on the left and average reward on the right as a function of episodes.


To get a baseline for the out-of-sample performance of our model, we compare it against two other strategies. The first is a buy-and-hold strategy that maintains 2 bitcoins. The second is a momentum strategy that holds 4 bitcoins if the price is above the average price over the previous 30 periods and 0 bitcoins otherwise.
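Both baselines reduce to a rule for the holdings at each step, sketched below. Two details are assumptions: the 30-period average excludes the current price, and the momentum strategy holds nothing until a full 30-period history is available.

```python
def buy_and_hold_holdings(n_steps, btc=2):
    """Constant position of 2 bitcoins at every step."""
    return [btc] * n_steps


def momentum_holdings(prices, lookback=30, max_btc=4):
    """Hold 4 bitcoins when price exceeds the mean of the previous `lookback` prices."""
    holdings = []
    for t in range(len(prices)):
        if t < lookback:
            holdings.append(0)  # not enough history yet (assumed behavior)
            continue
        window = prices[t - lookback:t]  # previous prices, current one excluded
        average = sum(window) / lookback
        holdings.append(max_btc if prices[t] > average else 0)
    return holdings
```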

The summary statistics for the performance of the three strategies during the out-of-sample period from September 25, 2016 to June 14, 2017 are shown in Table 2. The cumulative log returns for the three strategies are shown in Figure 3. Although bitcoin prices shot up during the test period, giving the buy-and-hold strategy very good performance, the RL strategy still significantly outperforms this static strategy, even on a risk-adjusted basis. And while RL has a higher volatility due to its use of leverage, it still has a better overall drawdown profile, as shown in Figure 4.
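For readers who want to reproduce this kind of comparison, cumulative log returns and drawdowns can be computed from a series of per-step returns roughly as follows. This sketch assumes the returns are simple fractional returns that compound from step to step; how the original study normalized returns across different holding sizes is not stated.

```python
import math


def cumulative_log_returns(returns):
    """Running sum of log(1 + r) for a series of simple returns."""
    total = 0.0
    out = []
    for r in returns:
        total += math.log1p(r)
        out.append(total)
    return out


def drawdowns(returns):
    """Fractional decline from the running peak of compounded wealth."""
    wealth = 1.0
    peak = 1.0
    out = []
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        out.append(wealth / peak - 1.0)  # 0 at a new high, negative otherwise
    return out
```

The maximum drawdown of a strategy is then `min(drawdowns(returns))`.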

Table 2: Summary of performance statistics by strategy.


Figure 3: Cumulative log returns for the three different strategies over the test time period.


Figure 4: Drawdown for the three strategies over the test time period


Since many of the inputs to the RL agent are momentum-like signals, it's important to note the relatively low correlation of 0.63 between RL and momentum returns. This indicates that our RL strategy is not just replicating a much simpler yet effective strategy. It suggests that, by combining the momentum indicators with the volume and volatility indicators, RL is able to form a more complete view of the environment and then take the appropriate action.
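The correlation in question is the ordinary Pearson correlation between the two per-step return series, which can be computed as in the sketch below. That the reported figure is a Pearson correlation of per-step returns is an assumption; the text does not name the estimator.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```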


Reinforcement learning has been shown to be effective in many diverse fields, from robotics to beating humans at various games. We show that RL can also be applied to algorithmic trading, producing a strategy that is distinct from, and outperforms, common baseline techniques.



Vincent Poon