
In the era of big data, the increased volume and complexity of data have exposed the limited performance of traditional statistical learning tools in classification and feature extraction tasks across a range of domains, including computer vision, healthcare, and language processing (LeCun et al., 2015). Recently, cutting-edge deep learning methods have demonstrated remarkable performance in processing image, speech, and DNA expression data by allowing the architecture to learn representations of the data with multiple levels of abstraction. However, although the amount and complexity of financial data have also been increasing, deep learning methods for predicting stock returns and generating investment returns have not been widely applied or researched. This introduction provides a detailed explanation of the basics of deep learning and of how deep learning can be applied to stock performance prediction and alpha generation.

1. Supervised learning and unsupervised learning

The most basic distinction in statistical learning, deep or not, is between supervised and unsupervised learning. Supervised learning is the task of learning a function that maps an input to an output using "labeled" data, while unsupervised learning infers a function that describes hidden structure in "unlabeled" data.

An example of the difference between supervised and unsupervised learning is shown in Figure 1 below. In the supervised learning setting, the features of the objects (corn, bananas, apples, etc.), as well as the object labels (vegetables/fruits), are given. The machine-learning model is trained to classify the objects into the labeled groups. In the unsupervised learning setting, only the features of the objects are given, and their labels are unknown. The machine-learning model is trained to separate the objects based on the similarities and differences of their features.

Figure 1.

Supervised learning is the most common form of deep learning. Most alpha generation tasks, such as predicting whether the return of a stock will move up or down, are supervised learning tasks because stock return movements are labeled before model training.

2. Definition of Deep Learning

LeCun et al. (2015) define deep learning methods as representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform a representation at one level into a representation at a higher, slightly more abstract level. In plain English, deep learning models are typically deeper and far more complex than traditional machine learning models, and they generate new data representations that perform well in prediction tasks.

Deep learning methods have brought about breakthroughs in image recognition (e.g., determining whether an image contains a cat) (Krizhevsky et al., 2012; Redmon et al., 2016), speech recognition (e.g., transcribing spoken language into text) (Hinton et al., 2012), machine translation (e.g., automatically translating French text into English) (Cho et al., 2014), and human gene expression prediction (e.g., predicting the effects of mutations in non-coding DNA on gene expression and disease) (Leung et al., 2014; Xiong et al., 2015). An illustration of an efficient deep learning object detection algorithm that has been applied to build self-driving cars is shown in Figure 2.

Figure 2: An illustration of the YOLO object detection algorithm: The YOLO algorithm developed by Redmon et al. (2016) uses a deep convolutional neural network with a combination of filters, anchor boxes, and pooling layers to detect objects, as well as their boundaries, in an image.

Deep learning typically means using an Artificial Neural Network (ANN) architecture containing more than three hidden layers. ANNs are defined in the following section.

3. Introduction to Artificial Neural Networks (ANNs)

The main building block of all deep learning algorithms is the Artificial Neural Network (ANN). Understanding how ANNs are constructed and trained is the first step to understanding deep learning methods. This section provides a concise review of ANNs, training ANNs, normalizing ANN inputs, regularizing ANN parameters, and hyper-parameter tuning.

a. ANNs

Artificial Neural Networks (ANNs) are computing systems inspired by the biological neural networks that constitute brains. Typically, an ANN is a network of nodes arranged in multiple layers (one input layer, one output layer, and several hidden internal layers) (Figure 3, Figure 4). Each node stores a value and each edge has a weight. The value of a node on a given layer, except for the first layer (i.e., the input layer), is a function of a bias plus the weighted sum of the values of all nodes on the previous layer. This function is called the activation function. Activation functions, such as the Sigmoid, the Rectified Linear Unit (ReLU) (Nair et al., 2010), and the Hyperbolic Tangent, are usually non-linear.
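As a minimal sketch (not taken from the report, with toy weight and bias values chosen purely for illustration), a single layer's computation, a non-linear activation applied to the bias plus the weighted sum of the previous layer's node values, can be written in a few lines of NumPy:

```python
import numpy as np

# Two common non-linear activation functions.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# One layer's computation: each node's value is the activation of a bias
# plus the weighted sum of all node values on the previous layer.
def layer_forward(x, W, b, activation=relu):
    return activation(W @ x + b)

# Toy example: 3 input nodes feeding 2 hidden nodes.
x = np.array([0.5, -1.0, 2.0])      # input layer values
W = np.array([[0.2, 0.4, -0.1],     # edge weights (2 x 3)
              [0.7, -0.3, 0.5]])
b = np.array([0.1, -0.2])           # one bias per hidden node
h = layer_forward(x, W, b)
print(h)                            # values of the two hidden nodes
```

Stacking calls to `layer_forward`, with the output of one layer as the input of the next, gives the full forward pass of a multiple-layer ANN such as the one in Figure 4.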

Figure 4: A multiple-layer ANN: This multiple-layer ANN has 1 input layer, 2 hidden layers, and 1 output layer, with each layer connected to the previous layer. An activation function ƒ is applied to each node in the hidden layers and the output layer.

b. Training ANNs

A training set and a validation set, in which the values of the nodes in the output layer are known (e.g., 1 for a positive outcome and 0 for a negative outcome), are needed to estimate the optimal values of the biases and edge weights (i.e., to train the ANN). The idea is to find a set of biases and edge weights that minimizes the difference between the true and predicted values of the nodes in the output layer. This difference is a function of the biases and edge weights and is usually called the loss function. Once an ANN is trained, it can be used to predict the values of the output-layer nodes for a testing set, in which those values are not yet known.
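The report does not specify a particular loss function; as a hedged illustration, two widely used choices for a 0/1 output label are mean squared error and binary cross-entropy, sketched below with made-up prediction values:

```python
import numpy as np

# Two common loss functions measuring the gap between the predicted and
# true output-layer values (labels of 1 for positive, 0 for negative).
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # known output labels
y_pred = np.array([0.9, 0.2, 0.7, 0.1])   # hypothetical ANN predictions
print(mean_squared_error(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```

Either quantity is a function of the biases and edge weights through `y_pred`, which is what the training procedure in the next subsection minimizes.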

Back propagation is a way to compute the loss contribution of each node. Training starts with random or predetermined weights and uses forward propagation to make a prediction. A loss measures the difference between the predicted output value and the true output value. Since the predicted value is a function of the biases and edge weights, the loss is also a function of the biases and edge weights. Back propagation can then be used to compute the slope (i.e., the rate of change) of the loss function with respect to each edge weight. The next step of training is to adjust the edge weights by a small amount in the direction opposite the slope and to repeat this process until a minimum of the loss is reached (ideally the global minimum). Because the slope of the loss function is called the "gradient" and the minimum of the loss is achieved where the gradient is flat, as shown in Figure 5, this method of training ANNs is called "Gradient Descent."
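The update rule described above can be sketched on a deliberately simple convex loss (a toy stand-in for the full ANN loss, not the report's actual model), where the gradient is available in closed form:

```python
# Gradient Descent on a toy convex loss, Loss(w) = (w - 3)^2, whose
# gradient is dLoss/dw = 2 * (w - 3); the gradient is flat at the minimum.
def gradient_descent(grad, w0, learning_rate=0.1, steps=200):
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)   # step against the slope
    return w

w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)   # converges toward the minimizer w = 3.0
```

In a real ANN, `grad` is the vector of slopes with respect to every bias and edge weight, computed by back propagation, and the learning rate plays exactly the role shown in Figure 5.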

Figure 5: Gradient Descent Training: Loss(w) is the loss function. In Gradient Descent optimization, the learning rate determines by how much the edge weights are adjusted in each step before the minimum is reached.

c. Normalization

Normalization means adjusting input feature values measured on different scales to a notionally common scale. Most of the time, the input features of a deep learning model are measured on different scales, so the ranges of their values can differ widely. For example, a financial debt-to-asset ratio is always between 0 and 1, while financial earnings such as EBITDA can be in the thousands or millions. Since it is not reasonable to feed features on such different scales into the model directly, input feature normalization must be performed before model training. Commonly used normalization methods include the standard score (z-score) and feature scaling (min-max scaling).
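Both methods can be sketched in a few lines (the EBITDA figures below are made up purely to show the scale mismatch with ratio-type features):

```python
import numpy as np

def standard_score(x):
    # z-score: subtract the mean, divide by the standard deviation,
    # so the result has mean 0 and standard deviation 1.
    return (x - x.mean()) / x.std()

def feature_scaling(x):
    # min-max scaling: map values onto the [0, 1] range.
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical EBITDA values (dollars), on a vastly larger scale than a
# debt-to-asset ratio, which always lies in [0, 1].
ebitda = np.array([1.2e6, 3.5e6, 0.8e6, 2.1e6])
print(standard_score(ebitda))
print(feature_scaling(ebitda))   # now comparable to ratio-scale features
```

After either transformation, EBITDA and debt-to-asset ratio can be fed into the same model without one feature's scale dominating the other's.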

d. Regularization

One of the common problems in deep learning is the "large variance problem." Feeding too many irrelevant or useless features into the model can lead to over-fitting, which means that the model does not generalize well to unseen testing data, so the gap between training-set and testing-set performance becomes very large. Regularization addresses over-fitting, or the "large variance problem," by slightly increasing the training-set error while reducing the performance gap between the training set and the testing set. There is a range of regularization methods to choose from for deep learning models. The most classic is to add the L-2 norm of the weights (the sum of the squared weights) to the loss function so that irrelevant weights shrink toward very small values. Dropout regularization randomly drops a group of nodes in each gradient descent iteration to prevent over-fitting (Figure 6). In addition, early stopping monitors the error on both the training set and the validation set: gradient descent training stops when the validation-set error starts to increase while the training-set error keeps decreasing (Figure 7).