
Introduction

This document is a brief introduction to deep Q learning (DQN) and some of its most important improvements. It also covers learning from examples, multi-agent environments, and using DQN to solve natural language processing problems. The last part introduces possible applications of DQN and its variants in trading.

Deep reinforcement learning and DQN

#### Markov Decision Process (MDP)

A Markov decision process is a stochastic control process, usually represented by a 5-tuple $(S, A, P, R, \gamma)$ (Markov decision process, n.d.):

$S$ is the state set

$A$ is the action set

$P$ is the transition probability (given the current state and action, the probability distribution of the next state)

$R$ is the immediate reward

$\gamma$ is the discount factor

Sequential optimization problems can be formulated as MDPs and then solved using dynamic programming or reinforcement learning methods.
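As a minimal sketch of the dynamic programming route, the Python snippet below runs value iteration on a tiny two-state MDP. The MDP itself, the names `P`, `GAMMA`, `q_value` and `value_iteration`, and the stopping tolerance are illustrative assumptions, not part of the original report.

```python
# Hypothetical two-state MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def q_value(P, V, s, a, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-8):
    """Return approximate optimal state values and a greedy policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once no state value changed noticeably
            break
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(P, GAMMA)
print(V, policy)
```

In this toy problem the greedy policy moves from state 0 to state 1 and then stays there, because staying in state 1 keeps paying the larger reward.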

Deep reinforcement learning and DQN

Reinforcement learning uses an agent to interact with the environment and take actions in different states to maximize cumulative reward (Silver).

The key components of reinforcement learning include agent, action, discount factor, environment, state, reward, policy, value, Q-value or action-value and trajectory (A Beginner's Guide to Deep Reinforcement Learning, n.d.).

Agent: The agent interacts with the environment and takes actions. For example, in the stock market, an agent can be seen as a trader making decisions in different situations.

Action ($A$): An action is what an agent can do in the environment. For example, in the stock market, actions can be buying, selling or holding a stock. The action set is usually finite and discrete, but continuous actions can also be handled by modifying the Q learning model or by using policy gradient methods.

Discount factor ($\gamma$): The discount factor weights rewards received at different times. If the discount factor is large, the agent gives more weight to future rewards. If it is small, the agent focuses more on immediate rewards.

Environment: The environment takes the agent’s actions as input and returns the agent’s next state and reward as output.

State ($S$): The state is the current situation the agent is facing. A state can be fully or partially observable. For example, the current state of the stock market can be partially observed through price, volume, etc.

Reward ($R$): A reward is given by the environment to the agent at each step. A carefully designed reward guides the agent's exploration and helps the policy converge to the optimal one.

Policy ($\pi$): The policy specifies what the agent should do given its current state. It is a distribution over actions, and the agent chooses its action according to that distribution.

Value ($V$): The value function is the expected discounted future reward from a given state. It can also be defined for a specific policy, as the expected discounted future reward from that state when the agent follows the policy.

Q-value or action-value ($Q$): The action-value function is the expected discounted future reward from taking a given action in a given state. It can also be defined for a specific policy $\pi$, as the expected discounted future reward after taking that action in that state and then following $\pi$.

Trajectory: A trajectory is the sequence of states and actions the agent goes through over a period of time.
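For reference, the value and action-value of a policy $\pi$ described above are commonly written as follows. The notation is an assumption rather than the report's own: $r_{t+k+1}$ denotes the reward received $k+1$ steps after time $t$, and $\gamma$ is the discount factor.

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]
$$

The two are linked by $V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)$, since the value of a state averages the action values under the policy's action distribution.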

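The goal of reinforcement learning is to maximize the expected return, which can be defined as:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$

which is the discounted accumulated future reward, where $\gamma$ is the discount factor and $r_{t+k+1}$ is the reward received $k+1$ steps after time $t$.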
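To make the discounted sum concrete, here is a small Python sketch that computes it for a finite list of rewards. The function name `discounted_return` and the sample numbers are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k].

    The sum is accumulated from the last reward backwards, so each step
    is a single multiply-and-add.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
g = discounted_return(rewards, gamma=0.5)
print(g)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

With a smaller `gamma` the last reward contributes less, which matches the description of the discount factor above.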

Q Learning

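Q learning is a value-based, off-policy temporal-difference (TD) control algorithm for solving reinforcement learning problems (Watkins, 1989). Its update rule is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate, $r$ is the immediate reward and $s'$ is the next state. By updating the Q value iteratively, the Q value can converge to the optimal $q^*$, which can be used to choose the optimal policy for the agent (Barto, Reinforcement Learning: An Introduction).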

After we get the optimal $q^*$, the agent can simply choose the action with the highest Q value in each state. During learning, the $\varepsilon$-greedy method is usually used to encourage exploration: the agent takes the action with the highest Q value with probability $1 - \varepsilon$ and a random action with probability $\varepsilon$.
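The following is a minimal sketch of tabular Q learning with epsilon-greedy exploration. The five-state chain environment, the helper names (`step`, `epsilon_greedy`, `q_learning`) and all hyperparameter values are hypothetical and chosen only to keep the example short.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain: reward 1.0 only when the terminal state is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    # Break ties at random so the untrained agent does not get stuck.
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Terminal states contribute no future value to the target.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

The update is off-policy because the target uses the maximum over next actions, regardless of which action the epsilon-greedy behaviour actually takes next.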

Deep Q Learning (DQN)

The action value function $Q(s, a)$ takes two arguments, $s$ and $a$, so traditionally it can be represented by a two-dimensional table. However, a table requires the state and action sets to be finite and reasonably small. In real-world problems, the state space is usually high-dimensional. For example, in Atari games the states are the images in each game frame.

One way to address this problem is to approximate the Q values with a function that has a set of parameters $\theta$, rewriting $Q(s, a)$ as $Q(s, a; \theta)$ (Barto, Introduction to reinforcement learning, 1998). To let the function approximator generalize without feature engineering, we can use a deep neural network to represent the Q values. This is the so-called DQN method (Mnih, 2015).

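The loss function for DQN is:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]$$

where $\theta$ are the parameters of the network being trained and $\theta^{-}$ are the parameters used to compute the target value.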
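A minimal NumPy sketch of this loss, assuming it is the mean squared difference between the target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate $Q(s, a)$ over a batch of transitions. The function `dqn_loss`, the batch layout and the linear stand-in `linear_q` are assumptions for illustration. In practice the two callables would be neural networks.

```python
import numpy as np


def dqn_loss(q_online, q_target, batch, gamma):
    """Mean squared TD error over a batch of transitions.

    q_online and q_target are callables mapping states of shape (B, d)
    to Q values of shape (B, n_actions). They stand in for networks.
    """
    states, actions, rewards, next_states, dones = batch
    # Q value of the action that was actually taken in each transition.
    q_sa = q_online(states)[np.arange(len(actions)), actions]
    # Best next-state value according to the network that supplies targets.
    next_max = q_target(next_states).max(axis=1)
    # No bootstrapping from terminal transitions (dones == 1).
    targets = rewards + gamma * (1.0 - dones) * next_max
    return float(np.mean((targets - q_sa) ** 2))


W = np.eye(2)  # hypothetical weights of a linear "network"


def linear_q(states):
    return states @ W


batch = (
    np.array([[1.0, 2.0]]),  # states
    np.array([0]),           # actions
    np.array([1.0]),         # rewards
    np.array([[3.0, 4.0]]),  # next states
    np.array([0.0]),         # done flags
)
loss = dqn_loss(linear_q, linear_q, batch, gamma=0.5)
print(loss)  # target = 1 + 0.5 * 4 = 3, estimate = 1, squared error = 4.0
```

Passing the same callable for both arguments reproduces the naive setup in which targets move with every update. Passing a separate, less frequently updated callable as `q_target` keeps the targets fixed between updates.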

However, reinforcement learning is unstable when a nonlinear function approximator (e.g. a deep neural network) is used to represent the Q function.

This instability has several causes (Tsitsiklis, 1997):

The correlations of the sequence of observations

Small updates to Q may significantly change the policy and change the data distribution

The correlations between the action values $Q(s, a)$ and the target values $r + \gamma \max_{a'} Q(s', a')$

Two modifications to the learning process address these problems and give more stable results (Mnih, 2015):

Experience replay: This concept is inspired by biology. When people make decisions, they refer to their memory and use past experience to choose the best action. Experience replay stores the agent's past experience (MDP transition tuples of state, action, reward and next state) in a memory and samples some of it at random to train the model. Sampling randomizes over the data, removes the correlations in the sequence of observations and smooths over changes in the data distribution, thus addressing the first two problems.

Target network: A target network is a copy of the online network that is updated only periodically. The online network is the original network, which is updated at every step. Using a target network reduces the correlations between the action values and the target values, which addresses the third problem.
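Both modifications can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the implementation from (Mnih, 2015): the class `ReplayBuffer`, the function `maybe_sync_target`, the plain-dict parameters and the sync interval are all hypothetical.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal correlation.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def maybe_sync_target(step, sync_every, online_params, target_params):
    """Copy the online parameters into the target every `sync_every` steps."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(copy.deepcopy(online_params))
        return True
    return False


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):  # only the last 3 transitions are kept
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
minibatch = buffer.sample(2)

online = {"w": 1.0}  # stand-ins for network weights
target = {"w": 0.0}
synced = maybe_sync_target(step=10, sync_every=5, online_params=online, target_params=target)
```

Between syncs, the targets are computed from `target` while only `online` is trained, so the values being fitted do not shift at every step.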