# Introduction to Deep Q Learning

Download Full Report: **here**

**Introduction**

This document is a brief introduction to deep Q learning (DQN) and some of its most important improvements. It also covers topics including learning from examples, multi-agent environments, and using DQN to solve natural language processing problems. The last part of this document introduces possible applications of DQN and its variants in trading.

**Deep reinforcement learning and DQN**

**Markov Decision Process (MDP)**

A Markov decision process is a stochastic control process, usually represented by a 5-tuple $(S, A, P, R, \gamma)$ (Markov decision process, n.d.):

- *S* is the state set
- *A* is the action set
- *P* is the transition probability (given the current state and action, the probability distribution of the next state)
- *R* is the immediate reward
- *γ* is the discount factor

Optimization problems can be formulated as MDPs and thus solved using dynamic programming or reinforcement learning methods.
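As a concrete illustration, the sketch below solves a toy two-state MDP with value iteration, one of the dynamic programming methods mentioned above. The states, actions, transition probabilities, and rewards are invented for this example.

```python
# Toy MDP: the agent is in a "low" or "high" productivity state; "act" costs
# reward now but tends to move it to (or keep it in) the high state.
S = ["low", "high"]
A = ["wait", "act"]
# P[(s, a)] -> list of (next_state, probability); R[(s, a)] -> immediate reward
P = {
    ("low", "wait"):  [("low", 1.0)],
    ("low", "act"):   [("high", 0.7), ("low", 0.3)],
    ("high", "wait"): [("high", 0.9), ("low", 0.1)],
    ("high", "act"):  [("high", 1.0)],
}
R = {("low", "wait"): 0.0, ("low", "act"): -1.0,
     ("high", "wait"): 2.0, ("high", "act"): 1.0}
gamma = 0.9  # discount factor

def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                        for a in A)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new
```

Because the backup is a γ-contraction, the loop is guaranteed to converge; the resulting values show that the high state is worth more than the low state.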

**Deep reinforcement learning and DQN**

Reinforcement learning uses an agent that interacts with the environment and takes actions in different states to maximize cumulative reward (Silver).

The key components of reinforcement learning include agent, action, discount factor, environment, state, reward, policy, value, Q-value or action-value and trajectory (A Beginner's Guide to Deep Reinforcement Learning, n.d.).

**Agent:** The agent in reinforcement learning interacts with the environment and takes actions. For example, in the stock market, an agent can be seen as a trader making decisions in different situations.

**Action (*A*):** An action is what an agent can do in the environment. For example, in the stock market, actions can be buying, selling, or holding a stock. Usually, the action set is finite and discrete; continuous actions can also be handled in Q learning by modifying the model or by using policy gradient methods.

**Discount factor (*γ*):** The discount factor weights rewards at different times. If the discount factor is large, the agent gives more consideration to future rewards; if it is small, the agent focuses more on immediate rewards.

**Environment:** The environment takes the agent's actions as input and produces the agent's states and rewards as output.

**State (*S*):** A state is the current situation the agent is facing. States can be fully or partially observable. For example, the current state of the stock market can be partially observed through price, volume, etc.

**Reward (*R*):** A reward is given by the environment to the agent at each step. A carefully designed reward guides the agent's exploration and helps the policy converge to the optimum.

**Policy (*π*):** A policy specifies what the agent's action should be given its current state. A policy is a distribution over actions, and the agent can choose its strategy based on this distribution.

**Value (*V*):** The value function is the discounted expected future reward in a certain state. It can also be defined as the discounted expected future reward in a certain state under a specific policy.

**Q-value or action-value (*Q*):** The action value function is the discounted expected future reward after taking a certain action in a certain state. It can also be defined as the discounted expected future reward for a certain state and action under a specific policy.

**Trajectory:** A trajectory is the series of states and actions taken by the agent over a time period.

The goal of reinforcement learning can be defined as maximizing the return

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

which is the discounted accumulated future reward.

**Q Learning**

Q learning is a value-based, off-policy TD control algorithm for solving reinforcement learning problems (Watkins, 1989). Q learning can be defined by the update rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$

By updating the *Q* value iteratively, the *Q* value converges to the optimal *q\**, which can be used to choose the optimal policy for the agent (Barto, Reinforcement Learning: An Introduction).

After we get the optimal *q\**, the agent can simply choose the action with the highest *Q* value in a given state. Usually, to encourage exploration, the ε-greedy method is used: the agent takes the action with the highest *Q* value with probability 1 − ε and takes a random action with probability ε.
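The tabular algorithm with ε-greedy exploration can be sketched as follows. The five-state corridor environment and all hyperparameters are invented for illustration: the agent starts in state 0 and receives reward 1 for reaching state 4.

```python
import random
from collections import defaultdict

N_STATES, ACTIONS = 5, (-1, 1)            # actions: move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1     # step size, discount, exploration

Q = defaultdict(float)                    # Q[(state, action)], defaults to 0

def env_step(s, a):
    """One environment transition: clip to the corridor, reward 1 at the goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:     # explore with probability epsilon
            a = random.choice(ACTIONS)
        else:                             # otherwise act greedily (random tie-break)
            a = max(ACTIONS, key=lambda a_: (Q[(s, a_)], random.random()))
        s2, r, done = env_step(s, a)
        # Q learning update toward the TD target r + gamma * max_a' Q(s', a')
        boot = 0.0 if done else gamma * max(Q[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += alpha * (r + boot - Q[(s, a)])
        s = s2
```

After training, the greedy policy walks right toward the goal, and the learned values decay by a factor of γ per step of distance from the reward.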

**Deep Q Learning (DQN)**

The action value function *Q(s, a)* has two parameters, *s* and *a*, so traditionally we can represent it as a two-dimensional table. However, a table requires the state and action spaces to be finite and not too large. In real-world problems, the state space is usually high-dimensional; for example, in Atari games the states are the images of each game frame.

One way to address this problem is to approximate the *Q* values with a parameterized function, rewriting *Q(s, a)* as *Q(s, a; θ)* (Barto, Introduction to reinforcement learning, 1998). To generalize the function approximator without feature engineering, we can use a deep neural network to represent the *Q* values; this is the so-called DQN method (Mnih, 2015).

The loss function for DQN is:

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]$$

However, reinforcement learning is unstable when a nonlinear function approximator (e.g., a deep neural network) is used to represent the *Q* function.

This instability has several causes (Tsitsiklis, 1997):

- the correlations in the sequence of observations
- small updates to *Q* may significantly change the policy and hence the data distribution
- the correlations between the *Q* values and the target values $r + \gamma \max_{a'} Q(s', a')$

There are two modifications to the learning process that address these problems and provide more stable results (Mnih, 2015):

**Experience replay:** This concept is inspired by biology: when people make decisions, they refer to their memory and use past experience to choose the best action. Experience replay stores the agent's past experience (MDP tuples) and samples some of it from memory to train the model. Experience replay randomizes over the data, removing the correlations in the sequence of observations and smoothing over changes in the data distribution, thus addressing the first two problems.

**Target network:** A target network is a copy of the online network that is updated only periodically, while the online network is the original network updated at every step. Using a target network reduces the correlations between the *Q* values and the target values $r + \gamma \max_{a'} Q(s', a')$, thus addressing the third problem.
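A minimal sketch of the two stabilizers, assuming no deep learning framework: the replay memory stores MDP tuples and samples them uniformly, and a plain dict of Q values stands in for the network weights so the periodic target copy can be shown. All names and numbers are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # old transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between consecutive steps
        return random.sample(self.memory, batch_size)

online_q = {("s0", "a0"): 0.0}     # stand-in for the online network weights
target_q = dict(online_q)          # target network: a lagged copy
TARGET_SYNC = 100                  # sync period in steps (illustrative)

for step_count in range(1, 301):
    online_q[("s0", "a0")] += 0.01          # stand-in for one gradient step
    if step_count % TARGET_SYNC == 0:
        target_q = dict(online_q)           # periodic hard update
```

In a real DQN, the TD targets are computed from `target_q` while gradients flow only into `online_q`, so the targets stay fixed between syncs.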

**Extensions to DQN**

**Double Q-Learning**

Standard DQN suffers from substantial overestimation problems in some of the Atari games (Hasselt, 2015). Double DQN can reduce the overestimation problem and lead to better performance in several games.

The main idea of double DQN is to reduce overestimation by decoupling action selection from action evaluation in the max operation of the target.

Standard DQN target:

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$

Double DQN target:

$$Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma\, Q\!\left(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a; \theta_t);\; \theta_t^-\right)$$
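The difference between the two targets can be illustrated with plain dicts standing in for the online network (θ) and the target network (θ⁻); the states, actions, and numbers below are invented.

```python
gamma, r = 0.9, 1.0
actions = ["left", "right"]
online = {"left": 0.2, "right": 0.5}   # Q(s', a; theta)
target = {"left": 0.4, "right": 0.3}   # Q(s', a; theta^-)

# Standard DQN: select AND evaluate the next action with the target network
dqn_target = r + gamma * max(target[a] for a in actions)

# Double DQN: select with the online network, evaluate with the target network
a_star = max(actions, key=lambda a: online[a])
double_dqn_target = r + gamma * target[a_star]
```

Here the standard target picks whichever target-network value is largest (an upward-biased estimate), while double DQN evaluates the online network's preferred action, yielding a smaller, less biased target.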

**Prioritized replay**

Standard DQN samples minibatches from the experience replay memory uniformly, regardless of their significance. Prioritized experience replay assigns different priorities to different transitions and replays important transitions more frequently, therefore learning more efficiently (Schaul, 2016).

The ideal criterion for prioritizing replay is how much the RL agent can learn from a transition. This can be approximated by the magnitude of the transition's TD error, which indicates how far the value is from its target estimate.

**Stochastic Prioritization**

Stochastic prioritization makes sure that every transition has a nonzero probability of being chosen no matter how small its TD error is:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

where $p_i > 0$ is the priority of transition $i$ and the exponent $\alpha$ determines how much prioritization is used.

**Annealing the Bias**

Prioritized experience replay changes the sample distribution away from the true experience distribution, so we need importance-sampling weights to correct this bias:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$$
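Proportional prioritization and the importance-sampling correction can be sketched together. The values of α, β, and the TD errors below are illustrative, and real implementations typically store priorities in a sum-tree rather than a flat list for efficient sampling.

```python
import random

td_errors = [2.0, 0.5, 0.1, 1.0]       # illustrative TD errors
alpha, beta, eps = 0.6, 0.4, 1e-6      # eps keeps every priority nonzero

# p_i = (|delta_i| + eps)^alpha, then P(i) = p_i / sum_k p_k
priorities = [(abs(d) + eps) ** alpha for d in td_errors]
total = sum(priorities)
probs = [p / total for p in priorities]

# importance-sampling weights w_i = (1 / (N * P(i)))^beta,
# normalized by the max weight for stability
N = len(td_errors)
weights = [(1.0 / (N * p)) ** beta for p in probs]
max_w = max(weights)
weights = [w / max_w for w in weights]

random.seed(0)
idx = random.choices(range(N), weights=probs)[0]   # sample one transition
```

The transition with the largest TD error gets the highest sampling probability, and correspondingly the smallest importance weight, so frequently replayed transitions are down-weighted in the gradient.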

**Dueling Networks**

The dueling network splits the *Q* value estimation into two streams: one for the state value function and the other for the state-dependent action advantage function. This network structure can generalize learning across actions without imposing any change to the underlying RL algorithm (Wang, 2016).

Figure 1. A popular single-stream Q-network (top) and the dueling Q-network (bottom). The dueling network has two streams to separately estimate the (scalar) state value and the advantages for each action; the green output module combines them. Both networks output Q-values for each action. (Wang, 2016)

Figure 2. See, attend and drive: value and advantage saliency maps (red-tinted overlay) on the game Enduro, for a trained dueling architecture. The value stream learns to pay attention to the road. The advantage stream pays attention when there are cars immediately in front, to avoid collisions. (Wang, 2016)

The insight of the dueling network is that for many states it is unnecessary to estimate the value of each action choice. For example, in the car driving game above, the agent should focus on the road regardless of its action, while when there is a car nearby, additional attention is needed to choose the right action.

The dueling network can be formulated as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$

However, this equation is unidentifiable in the sense that given *Q* we cannot recover *V* and *A* uniquely. For example, we can add a constant to *V* and subtract the same constant from *A*, and *Q* will remain the same. To address this issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha)\right)$$
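The aggregation for a single state can be sketched with illustrative numbers standing in for the outputs of the value stream and the advantage stream:

```python
# stand-ins for the two stream outputs at one state (invented numbers)
V = 1.5
A = {"left": 0.3, "right": 0.8, "stay": -0.2}

max_A = max(A.values())
# subtracting max_A forces zero advantage at the greedy action,
# so V and A can be recovered uniquely from Q
Q = {a: V + (A[a] - max_A) for a in A}
```

With this aggregation, the greedy action's Q value equals the state value V exactly, and every other action's Q value sits below it by its advantage gap.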

**Multi-step Learning**

Standard DQN uses only the one-step reward. Alternatively, we can use a multi-step reward to obtain a longer-term forward view, which can also lead to faster learning (Barto, Introduction to reinforcement learning, 1998).
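The n-step return used in multi-step learning can be shown as a short worked example; the rewards, γ, and the bootstrapped value below are invented for illustration.

```python
gamma, n = 0.9, 3
rewards = [1.0, 0.0, 2.0]   # R_{t+1}, ..., R_{t+n}
Q_boot = 5.0                # stand-in for max_a Q(S_{t+n}, a)

# G_t^(n) = sum_{k=0}^{n-1} gamma^k * R_{t+k+1} + gamma^n * max_a Q(S_{t+n}, a)
G = sum(gamma ** k * r for k, r in enumerate(rewards)) + gamma ** n * Q_boot
```

The first n rewards come straight from the environment, so the target leans less on the (possibly inaccurate) bootstrapped Q value than the one-step target does.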

**Distributional RL**

The standard *Q* function considers the expected value of the discounted future reward:

$$Q^\pi(x, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)\right]$$

However, if we model the distribution of the future reward instead of only its expected value, we can obtain a better approximation of the *Q* value and more stable learning (Bellemare, 2017). For example, in the bimodal distribution of commute times below (Yu, n.d.), using only the expected commute time fails to give a good representation of the outcome.

Bellman equations for a random transition $(x, a) \to (X', A')$:

Standard *Q* learning: $Q(x, a) = \mathbb{E}\,R(x, a) + \gamma\, \mathbb{E}\, Q(X', A')$

Distributional *Q* learning: $Z(x, a) \overset{D}{=} R(x, a) + \gamma\, Z(X', A')$, where *Z* is the value distribution.

We can model the distribution of *Z* with a discrete support *z*, where *z* is a vector of $N$ atoms, $z_i = V_{\min} + (i - 1)\,\Delta z$ with $\Delta z = (V_{\max} - V_{\min})/(N - 1)$.
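The discrete support can be constructed as follows; the values of V_min, V_max, and the number of atoms are illustrative choices (the original C51 agent uses 51 atoms).

```python
V_min, V_max, n_atoms = -10.0, 10.0, 51
dz = (V_max - V_min) / (n_atoms - 1)
z = [V_min + i * dz for i in range(n_atoms)]     # the support vector

# given atom probabilities p_i, the Q value is the expectation sum_i z_i * p_i
p = [1.0 / n_atoms] * n_atoms                     # uniform, for illustration
Q = sum(zi * pi for zi, pi in zip(z, p))
```

The network outputs a probability for each atom, and the scalar Q value needed for action selection is recovered as the expectation over the support.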

To read the remaining report click: **here**

**Important Disclaimer and Disclosure Information**

Algo Depth makes no representations of any kind regarding this report. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the absence of errors, whether or not known or discoverable. In no event shall the author(s), Algo Depth or any of its officers, employees, or representatives, be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any claims, losses, costs or damages of any kind, including direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages, arising out of the use of the report, including the information contained herein.

This report is prepared for informational and educational purposes only, and is not an offer to sell or the solicitation of an offer to buy any securities. The recipient is reminded that an investment in any security is subject to many risks, including the complete loss of capital, and other risks that this report does not contain. As always, past performance is no indication of future results. This report does not constitute any form of invitation or inducement by Algo Depth to engage in investment activity.

Algo Depth has not independently verified the information provided by the author(s) and provides no assurance to its accuracy, reliability, suitability, or completeness. Algo Depth may have opinions that materially differ from those discussed, and may have significant financial interest in the positions mentioned in the report.

This report may contain certain projections and statements regarding anticipated future performance of securities. These statements are subject to significant uncertainties that are not in our control and are subject to change.

Algo Depth makes no representations, express or implied, regarding the accuracy or completeness of this information, and the recipient accepts all risks in relying on this report for any purpose whatsoever. This report shall remain the property of Algo Depth and Algo Depth reserves the right to require the return of this report at any time.