Introduction to Deep Q Learning

June 27, 2019





This document is a brief introduction to deep Q learning (DQN) and some of its most important improvements. It also covers learning from examples, multi-agent environments, and using DQN to solve natural language processing problems. The last part introduces possible applications of DQN and its variants in trading.


Deep reinforcement learning and DQN


Markov Decision Process (MDP)

A Markov decision process is a stochastic control process, usually represented by the 5-tuple $(S, A, P, R, \gamma)$ (Markov decision process, n.d.):

  • $S$ is the state set

  • $A$ is the action set

  • $P$ is the transition probability (given the current state and action, the probability distribution of the next state)

  • $R$ is the immediate reward

  • $\gamma$ is the discount factor



Optimization problems can be formulated as MDPs and can therefore be solved using dynamic programming or reinforcement learning methods.
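As a rough illustration of the dynamic programming route, the sketch below encodes a tiny MDP as a nested dictionary and solves it with value iteration. The two states, the two actions, and every probability and reward are made up for this example; only the overall structure (states, actions, transition probabilities, immediate rewards, discount factor) follows the definition above.

```python
# Hypothetical two-state MDP; all names and numbers are illustrative only.
GAMMA = 0.9  # discount factor

# P[state][action] -> list of (probability, next_state, immediate_reward)
P = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "work": [(0.8, "high", 1.0), (0.2, "low", 0.0)],
    },
    "high": {
        "wait": [(1.0, "high", 2.0)],
        "work": [(1.0, "low", 3.0)],
    },
}


def action_value(P, V, s, a, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-8):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(action_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P, V, gamma):
    """Pick, in every state, the action with the highest backed-up value."""
    return {s: max(P[s], key=lambda a: action_value(P, V, s, a, gamma)) for s in P}


V = value_iteration(P, GAMMA)
print(V, greedy_policy(P, V, GAMMA))
```

Value iteration needs the full transition model, which is why it counts as dynamic programming; reinforcement learning methods such as Q learning estimate values from sampled interaction instead.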


Deep reinforcement learning and DQN


Reinforcement learning uses an agent that interacts with an environment and takes actions in different states to maximize cumulative reward (Silver).

The key components of reinforcement learning include agent, action, discount factor, environment, state, reward, policy, value, Q-value or action-value and trajectory (A Beginner's Guide to Deep Reinforcement Learning, n.d.).


  • Agent: The agent in reinforcement learning interacts with the environment and takes actions. For example, in the stock market, an agent can be seen as a trader making decisions in different situations.

  • Action ( $A$ ): An action is what an agent can do in the environment. For example, in the stock market, actions can be buying, selling or holding a stock. The action set is usually finite and discrete, but continuous actions can also be handled by modifying the Q learning model or by using policy gradient methods.

  • Discount factor ( $\gamma$ ): The discount factor weights rewards received at different times. If the discount factor is large (close to 1), the agent gives more weight to future rewards. If it is small (close to 0), the agent focuses more on immediate rewards.

  • Environment: The environment takes the agent’s actions as input and returns the agent’s next state and reward as output.

  • State ( $S$ ): The state is the current situation the agent is facing. A state can be fully or partially observable. For example, the current state of the stock market can be partially observed through price, volume, etc.

  • Reward ( $R$ ): A reward is given by the environment to the agent at each step. A carefully designed reward guides the agent’s exploration and helps the policy converge to the optimal policy.

  • Policy ( $\pi$ ): The policy specifies which action the agent should take given its current state. A policy is a distribution over actions, and the agent chooses its action based on that distribution.

  • Value ( $V$ ): The value function $V(s)$ is the expected discounted future reward from a certain state. It can also be defined with respect to a specific policy $\pi$, written $V_\pi(s)$: the expected discounted future reward from that state when following $\pi$.

  • Q-value or action-value ( $Q$ ): The action value function $Q(s, a)$ is the expected discounted future reward from taking a certain action in a certain state. It can also be defined with respect to a specific policy $\pi$, written $Q_\pi(s, a)$: the expected discounted future reward from taking that action in that state and following $\pi$ afterwards.

  • Trajectory: A trajectory is the sequence of states and actions experienced by the agent over a period of time.
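To make these components concrete, the following minimal sketch wires an agent, a policy and an environment into an interaction loop and records the resulting trajectory. `LineWorld`, `random_policy` and `run_episode` are hypothetical names invented for this illustration, and the toy environment (five positions on a line, reward 1 for reaching the right end) is an assumption, not something taken from a library.

```python
import random


class LineWorld:
    """Toy environment (hypothetical): positions 0..4, episode ends at either end."""

    def __init__(self, size=5):
        self.size = size
        self.state = size // 2

    def reset(self):
        # Every episode starts in the middle of the line.
        self.state = self.size // 2
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.state += action
        done = self.state in (0, self.size - 1)
        reward = 1.0 if self.state == self.size - 1 else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A uniform distribution over the two actions that ignores the state.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Let the agent interact with the environment and record the trajectory."""
    trajectory = []  # list of (state, action, reward) tuples
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)
print(run_episode(LineWorld(), random_policy, rng))
```

The environment owns the state transition and the reward, while the policy only maps a state to an action. Learning algorithms such as Q learning replace `random_policy` with something that improves from the recorded rewards.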


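The goal of reinforcement learning can be defined as maximizing the expected return

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

which is the discounted accumulated future reward.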
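For a finite sequence of rewards, the discounted accumulated reward can be computed with a single backward pass. The sketch below is only an illustration; the function name `discounted_return` and the sample rewards are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    g = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # discounts later rewards
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
```

With a discount factor of 0 only the first reward matters, while a discount factor of 1 reduces to the plain sum, which matches the role of the discount factor described earlier.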


Q Learning


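Q learning is a value-based, off-policy temporal-difference (TD) control algorithm for solving reinforcement learning problems (Watkins, 1989). Q learning can be defined by:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

where $\alpha$ is the learning rate. By updating the Q value iteratively, the value can converge to the optimal $q_*$, which can be used to choose the optimal policy for the agent (Barto, Reinforcement Learning: An Introduction).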
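A minimal tabular sketch of this iterative update is shown below. The chain environment, the constants `ALPHA` and `GAMMA`, and the function names are all assumptions made for this illustration. Because Q learning is off-policy, the sketch collects experience with a purely random behaviour policy and still learns the values of the greedy policy.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 1)        # move left / move right
N_STATES = 5             # positions 0..4; positions 0 and 4 are terminal
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (illustrative values)


def step(state, action):
    """Hypothetical deterministic chain: reward 1 only when reaching the right end."""
    next_state = state + action
    done = next_state in (0, N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        state, done = N_STATES // 2, False
        while not done:
            action = rng.choice(ACTIONS)  # random behaviour policy
            next_state, reward, done = step(state, action)
            # Terminal states have no future value.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            td_target = reward + GAMMA * best_next
            # Move the current estimate a fraction ALPHA towards the TD target.
            Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
            state = next_state
    return Q


def greedy_action(Q, state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])


Q = q_learning()
print({s: greedy_action(Q, s) for s in range(1, N_STATES - 1)})
```

The `max` over next-state actions in the target is what makes the update estimate the greedy policy regardless of how the actions were actually chosen.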


Once we have the optimal $q_*$, the agent can simply choose the action with the highest Q value in a given state. During training, the $\epsilon$-greedy method is usually used to encourage exploration: the agent takes the action with the highest Q value with probability $1 - \epsilon$ and takes a random action with probability $\epsilon$.
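The action selection rule just described can be sketched in a few lines. The function name `epsilon_greedy` and the example Q values are assumptions for illustration; the Q values of the current state are passed in as a plain list indexed by action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]  # hypothetical Q values for three actions
print([epsilon_greedy(q_values, 0.1, rng) for _ in range(10)])
```

Note that the random branch can also return the greedy action, so with three actions and an exploration rate of 0.1 the greedy action is chosen slightly more often than 90% of the time.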


Deep Q Learning (DQN)


The action value function $Q(s, a)$ has two arguments, $s$ and $a$, so traditionally we can use a two-dimensional matrix to represent it. However, using a matrix requires the state and action sets to be finite and not too large. In real-world problems, the state space is usually high-dimensional. For example, in Atari g