
Deep Reinforcement Learning for Stock Trading

1. Deep Reinforcement Learning

This is adapted from a Medium post (Link).

1.1 Concepts

Reinforcement Learning is one of the three main approaches to machine learning. It trains an agent to interact with the environment by sequentially receiving states and rewards from the environment and taking actions to earn higher rewards.

Deep Reinforcement Learning approximates the Q-value with a neural network. Using a neural network as a function approximator allows reinforcement learning to be applied to large-scale data.

The Bellman equation is the guiding principle for designing reinforcement learning algorithms.
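
Written out (in LaTeX notation), the Bellman optimality equation for the action-value function is

    Q^{*}(s, a) = \mathbb{E}\big[\, r(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \,\big],

where \gamma \in [0, 1) is the discount factor. The algorithms discussed below can all be read as different ways of approximating a solution to this equation.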

Markov Decision Process (MDP) is used to model the environment.

1.2 Related works

Recent applications of deep reinforcement learning in financial markets consider discrete or continuous state and action spaces and employ one of three learning approaches: the critic-only approach, the actor-only approach, or the actor-critic approach.

1. Critic-only approach: the critic-only learning approach, which is the most common, solves a discrete action space problem using, for example, Q-learning, Deep Q-learning (DQN) and its improvements, and trains an agent on a single stock or asset. The idea of the critic-only approach is to use a Q-value function to learn the optimal action-selection policy that maximizes the expected future reward given the current state. Instead of calculating a state-action value table, DQN uses a neural network as a function approximator and minimizes the mean squared error between the target Q-values and the estimated Q-values. The major limitation of the critic-only approach is that it only works with discrete and finite state and action spaces, which is not practical for a large portfolio of stocks, since prices are of course continuous.

  • Q-learning: a value-based reinforcement learning algorithm that finds the optimal action-selection policy using a Q function (a tabular update sketch follows this list).
  • DQN: In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input, and the Q-values of the allowed actions are the predicted output.

2. Actor-only approach: The idea here is that the agent directly learns the optimal policy itself. Instead of having a neural network learn the Q-value, the neural network learns the policy. The policy is a probability distribution that is essentially a strategy for a given state, namely the likelihood of taking each allowed action. The actor-only approach can handle continuous action space environments.

  • Policy Gradient: aims to maximize the expected total reward by directly learning the optimal policy itself.

3. Actor-Critic approach: The actor-critic approach has been recently applied in finance. The idea is to simultaneously update the actor network that represents the policy, and the critic network that represents the value function. The critic estimates the value function, while the actor updates the policy probability distribution guided by the critic with policy gradients. Over time, the actor learns to take better actions and the critic gets better at evaluating those actions. The actor-critic approach has proven to be able to learn and adapt to large and complex environments, and has been used to play popular video games, such as Doom. Thus, the actor-critic approach fits well in trading with a large stock portfolio.

  • A2C: A2C is a typical actor-critic algorithm. A2C uses copies of the same agent working in parallel to update gradients with different data samples. Each agent works independently to interact with the same environment.
  • PPO: PPO is introduced to control the policy gradient update and ensure that the new policy will not be too different from the previous one.
  • DDPG: DDPG combines the frameworks of both Q-learning and policy gradient, and uses neural networks as function approximators.
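
To make the critic-only idea concrete, here is a minimal tabular Q-learning update sketch in Python; the toy table sizes, learning rate, and discount factor are illustrative choices, not the setup used in the trading environment below.

    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # One-step Q-learning: move Q[s, a] toward the Bellman target
        # r + gamma * max over a' of Q[s_next, a'].
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q

    # Toy usage: 5 states, 3 actions.
    Q = np.zeros((5, 3))
    Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)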

2. How to Trade


2.1 Data

We track and select the Dow Jones 30 stocks and use historical daily data from 01/01/2009 to 05/08/2020 to train the agent and test its performance. The dataset is downloaded from the Compustat database accessed through Wharton Research Data Services (WRDS).

The whole dataset is split as shown in the following figure. Data from 01/01/2009 to 12/31/2014 is used for training, and data from 10/01/2015 to 12/31/2015 is used for validation and parameter tuning. Finally, we test our agent’s performance on trading data, which is the unseen out-of-sample data from 01/01/2016 to 05/08/2020. To better exploit the trading data, we continue training our agent during the trading stage, since this helps the agent adapt to market dynamics.
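
A minimal sketch of that split, assuming the downloaded data sits in a pandas DataFrame df with a "date" column (both the variable and the column name are assumptions, not the original code):

    import pandas as pd

    def slice_period(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
        # Keep only rows whose 'date' falls inside [start, end].
        return df[(df["date"] >= start) & (df["date"] <= end)]

    train_df = slice_period(df, "2009-01-01", "2014-12-31")  # training
    valid_df = slice_period(df, "2015-10-01", "2015-12-31")  # validation / tuning
    trade_df = slice_period(df, "2016-01-01", "2020-05-08")  # out-of-sample trading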

2.2 MDP model for stock trading

• State 𝒔 = [𝒑, 𝒉, 𝑏]: a vector that includes stock prices 𝒑 ∈ R+^D, the stock shares 𝒉 ∈ Z+^D, and the remaining balance 𝑏 ∈ R+, where 𝐷 denotes the number of stocks and Z+ denotes non-negative integers.

• Action 𝒂: a vector of actions over 𝐷 stocks. The allowed actions on each stock include selling, buying, or holding, which result in decreasing, increasing, and no change of the stock shares 𝒉, respectively.

• Reward 𝑟(𝑠, 𝑎, 𝑠′): the direct reward of taking action 𝑎 at state 𝑠 and arriving at the new state 𝑠′.

• Policy 𝜋(𝑠): the trading strategy at state 𝑠, which is the probability distribution of actions at state 𝑠.

• Q-value 𝑄𝜋(𝑠, 𝑎): the expected reward of taking action 𝑎 at state 𝑠 following policy 𝜋.

The state transition of our stock trading process is shown in the following figure. At each state, one of three possible actions is taken on stock 𝑑 (𝑑 = 1, …, 𝐷) in the portfolio.

  • Selling 𝒌[𝑑] ∈ [1, 𝒉[𝑑]] shares results in 𝒉𝒕+1[𝑑] = 𝒉𝒕[𝑑] − 𝒌[𝑑], where 𝒌[𝑑] ∈ Z+ and 𝑑 = 1, …, 𝐷.
  • Holding results in 𝒉𝒕+1[𝑑] = 𝒉𝒕[𝑑].
  • Buying 𝒌[𝑑] shares results in 𝒉𝒕+1[𝑑] = 𝒉𝒕[𝑑] + 𝒌[𝑑].

At time 𝑡 an action is taken and the stock prices update at 𝑡 + 1; accordingly, the portfolio value may change from “portfolio value 0” to “portfolio value 1”, “portfolio value 2”, or “portfolio value 3”, respectively, as illustrated in Figure 2. Note that the portfolio value is 𝒑ᵀ𝒉 + 𝑏.

2.3 Constraints

  • Market liquidity: orders can be rapidly executed at the close price. We assume that the stock market will not be affected by our reinforcement learning trading agent.
  • Nonnegative balance: the allowed actions should not result in a negative balance.
  • Transaction cost: transaction costs are incurred for each trade. There are many types of transaction costs, such as exchange fees, execution fees, and SEC fees, and different brokers charge different commissions. Despite these variations, we assume our transaction cost to be 0.1% (1/1000) of the value of each trade (either buy or sell).
  • Risk-aversion for market crash: sudden events such as wars, the collapse of stock market bubbles, sovereign debt defaults, and financial crises may cause a stock market crash. To control the risk in a worst-case scenario like the 2008 global financial crisis, we employ the financial turbulence index turbulenceₜ, which measures extreme asset price movements; a computation sketch follows this list.
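
The article does not spell out the formula, but the financial turbulence index is commonly computed as the Mahalanobis distance of today's returns from their historical distribution. A minimal sketch, assuming a pandas DataFrame of daily returns with one column per stock (the function name and window length are illustrative):

    import numpy as np
    import pandas as pd

    def turbulence_index(returns: pd.DataFrame, lookback: int = 252) -> pd.Series:
        # turbulence_t = (y_t - mu)' Sigma^{-1} (y_t - mu), where mu and Sigma are
        # the mean vector and covariance matrix of returns over the trailing window.
        turb = pd.Series(np.nan, index=returns.index)
        for i in range(lookback, len(returns)):
            hist = returns.iloc[i - lookback:i]
            mu = hist.mean().to_numpy()
            cov_inv = np.linalg.pinv(np.cov(hist.to_numpy(), rowvar=False))
            diff = returns.iloc[i].to_numpy() - mu
            turb.iloc[i] = float(diff @ cov_inv @ diff)
        return turb

When turbulenceₜ exceeds a chosen threshold, the agent can be forced to stop buying and sell out, which is one way to enforce the risk-aversion constraint.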

2.4 Return maximization as trading goal

We define our reward function as the change of the portfolio value when action 𝑎 is taken at state 𝑠 and the agent arrives at the new state 𝑠′.

The goal is to design a trading strategy that maximizes the change of the portfolio value 𝑟(𝑠𝑡,𝑎𝑡,𝑠𝑡+1) in the dynamic environment, and we employ the deep reinforcement learning method to solve this problem.
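
Consistent with the portfolio value 𝒑ᵀ𝒉 + 𝑏 and the transaction cost defined above, this reward can be written (in LaTeX notation) as

    r(s_t, a_t, s_{t+1}) = \big(b_{t+1} + \boldsymbol{p}_{t+1}^{\top}\boldsymbol{h}_{t+1}\big) - \big(b_t + \boldsymbol{p}_t^{\top}\boldsymbol{h}_t\big) - c_t,

where c_t is the transaction cost incurred at step t.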

2.5 Environment for multiple stocks

State Space: We use a 181-dimensional vector (30 stocks × 6 features + 1) that consists of seven parts of information to represent the state space of the multiple-stock trading environment; a small construction sketch follows the list below.

  1. Balance: available amount of money left in the account at the current time step.
  2. Price: current adjusted close price of each stock.
  3. Shares: shares owned of each stock.
  4. MACD: Moving Average Convergence Divergence (MACD) is calculated using close price.
  5. RSI: Relative Strength Index (RSI) is calculated using close price.
  6. CCI: Commodity Channel Index (CCI) is calculated using high, low and close price.
  7. ADX: Average Directional Index (ADX) is calculated using high, low and close price.
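
A minimal sketch of how such a 181-dimensional state could be assembled; the function name and argument layout are illustrative, not the original implementation, and each per-stock argument is assumed to be a length-30 array:

    import numpy as np

    def build_state(balance, prices, shares, macd, rsi, cci, adx):
        # 1 (balance) + 6 features x 30 stocks = 181 dimensions; the per-stock
        # arrays must use the same stock ordering across all features.
        return np.concatenate(([balance], prices, shares, macd, rsi, cci, adx))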

Action Space:

  1. For a single stock, the action space is defined as {−k, …, −1, 0, 1, …, k}, where k and −k represent the number of shares we can buy and sell, respectively, and k ≤ h_max, where h_max is a predefined parameter that sets the maximum number of shares for each buying action.
  2. For multiple stocks, the size of the entire action space is therefore (2k+1)^30.
  3. The action space is then normalized to [-1, 1], since the RL algorithms A2C and PPO define the policy directly on a Gaussian distribution, which needs to be normalized and symmetric.
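
A minimal sketch of the normalization in item 3, mapping a policy output in [−1, 1] back to integer share counts (H_MAX = 100 is a hypothetical value, not taken from the article):

    import numpy as np

    H_MAX = 100  # hypothetical cap on shares traded per stock in one step

    def decode_actions(raw_actions: np.ndarray) -> np.ndarray:
        # raw_actions holds one value in [-1, 1] per stock; negative entries sell,
        # positive entries buy, and zero holds.
        return (raw_actions * H_MAX).astype(int)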

2.6 Trading agent based on deep reinforcement learning

A2C

A2C is a typical actor-critic algorithm which we use as a component in the ensemble method. A2C is introduced to improve the policy gradient updates. It utilizes an advantage function to reduce the variance of the policy gradient: instead of estimating only the value function, the critic network estimates the advantage function. Thus, the evaluation of an action depends not only on how good the action is, but also on how much better it is than the average action at that state. This reduces the high variance of the policy network and makes the model more robust.

A2C uses copies of the same agent working in parallel to update gradients with different data samples. Each copy interacts independently with the same environment. After all of the parallel agents finish calculating their gradients, A2C uses a coordinator to pass the average gradients over all the agents to a global network, so that the global network can update the actor and the critic networks. The parallel agents increase the diversity of the training data, and the synchronized gradient update is more cost-effective, faster, and works better with large batch sizes. A2C is a good model for stock trading because of its stability.
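
The advantage function mentioned above measures how much better an action is than the state's baseline value; it is usually estimated with a one-step bootstrap:

    A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t).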

DDPG

DDPG is an actor-critic based algorithm which we use as a component in the ensemble strategy to maximize the investment return. DDPG combines the frameworks of both Q-learning and policy gradient, and uses neural networks as function approximators. In contrast with DQN, which learns indirectly through Q-value tables and suffers from the curse of dimensionality, DDPG learns directly from the observations through policy gradient. It deterministically maps states to actions to better fit continuous action space environments.
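
For reference, the deterministic actor \mu_\theta in DDPG is updated with the deterministic policy gradient, which pushes the actor's output in the direction that increases the critic's value:

    \nabla_{\theta} J(\theta) \approx \mathbb{E}_{s}\big[\, \nabla_{a} Q(s, a)\,\big|_{a=\mu_{\theta}(s)}\; \nabla_{\theta}\, \mu_{\theta}(s) \,\big].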

PPO

We explore and use PPO as a component in the ensemble method. PPO is introduced to control the policy gradient update and ensure that the new policy will not be too different from the older one. PPO tries to simplify the objective of Trust Region Policy Optimization (TRPO) by introducing a clipping term to the objective function.

The objective function of PPO takes the minimum of the clipped and unclipped objectives, so PPO discourages large policy changes that move outside of the clipped interval. Therefore, PPO improves the stability of policy network training by restricting the policy update at each training step. We select PPO for stock trading because it is stable, fast, and simple to implement and tune.
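
The clipped surrogate objective referred to above is

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},

where \hat{A}_t is the estimated advantage and \epsilon is the clipping parameter.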

Ensemble strategy

Our purpose is to create a highly robust trading strategy, so we use an ensemble method to automatically select the best-performing agent among PPO, A2C, and DDPG to trade, based on the Sharpe ratio. The ensemble process is described as follows:

Step 1. We use a growing window of 𝑛 months to retrain our three agents concurrently. In this paper, we retrain our three agents every three months.

Step 2. We validate all three agents using a 3-month rolling validation window that follows the training window, and pick the best-performing agent, i.e., the one with the highest Sharpe ratio. We also adjust risk aversion by using the turbulence index in the validation stage.

Step 3. After validation, we only use the best model with the highest Sharpe ratio to predict and trade for the next quarter.
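
A minimal sketch of the selection rule in Steps 2 and 3, assuming we already have each agent's daily returns on the validation window (the helper names are illustrative):

    import numpy as np

    def sharpe_ratio(daily_returns: np.ndarray, periods_per_year: int = 252) -> float:
        # Annualized Sharpe ratio with a zero risk-free rate.
        return float(np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std())

    def pick_best_agent(validation_returns: dict) -> str:
        # validation_returns maps an agent name ("A2C", "PPO", "DDPG") to the array
        # of daily returns it produced on the 3-month validation window; the winner
        # trades the next quarter.
        return max(validation_returns, key=lambda name: sharpe_ratio(validation_returns[name]))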

2.7 Performance evaluations

We use Quantopian’s pyfolio to do the backtesting. The charts look pretty good, and it takes literally one line of code to produce them. You just need to convert everything into daily returns first.
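
A sketch of that one-line backtest, assuming daily_returns is a pandas Series of the strategy's daily returns indexed by date:

    import pyfolio as pf

    # Full tear sheet of performance statistics and charts built from daily returns.
    pf.create_full_tear_sheet(returns=daily_returns)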

