Financial News to Predict the Stock Market

July 22, 2016




Welcome to Natural Language Processing

Financial news provides information to the general public. Consumers rely on information they read or hear before buying a product. The internet makes content easily accessible and more relevant than ever.   


Remember the game where you guess how many jelly beans are in the jar? Your guess and your friends’ guesses probably spanned a wide range. The wisdom of crowds holds that the average of a large population’s guesses will be closer to reality than an individual expert’s guess. This phenomenon holds for counting jelly beans and for gauging sentiment around a stock.

Consumer sentiment plays an important role in financial markets. We apply natural language processing to analyze text and understand what consumers are reading or talking about. We show that using financial articles to predict stock movements outperforms a random-guess investment strategy.


We test three methods in our analysis. First, we use deep learning to test article- and sentence-level prediction. Next, we create features from characters, words, and sentences in each data source. Finally, we create sentiment scores from each article and use them to predict stock returns.


Our research is organized as follows:

1.) Data explained.
2.) Method 1: Computer vision applied to sentiment analysis.
3.) Method 2: Decomposing each article into features for stock return prediction.
            2.1: Feature construction
            2.2: Hyperparameter optimization
            2.3: Model training
            2.4: Experiment results
4.) Method 3: Using our features to predict sentiment of each article.
5.) Future work.


We collect articles from ten unique financial data sources, as listed in Table 1. In this research we focus exclusively on Seeking Alpha articles.   


We filter Seeking Alpha to find news related to companies in the S&P 500 from 2010-present. Our experiment contains over a million relevant articles for analysis.  


Format Data 
From each article we extract all text from the URL, the date of publication, and the primary stock ticker the article relates to.


We use information from today and prior days to predict future stock returns. We calculate a return as the change in stock price from a prior period (Pn) to the current period (P).


Returns are then classified as either positive or negative. In this experiment we predict the direction of stock movement rather than the percent return.
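The return calculation and the positive/negative labeling can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes the return is expressed as a fractional change, it assumes a return of exactly zero is grouped with the negative class, and the prices are made-up values.

```python
def simple_return(prior_price, current_price):
    """Fractional change in price from the prior period (Pn) to the current period (P)."""
    return (current_price - prior_price) / prior_price


def direction_label(ret):
    """Map a return to a binary class: 1 if positive, 0 otherwise.

    Treating a zero return as 0 is an assumption made for this sketch.
    """
    return 1 if ret > 0 else 0


# Hypothetical closing prices, for illustration only.
prices = [100.0, 102.0, 101.0, 101.0]
returns = [simple_return(p0, p1) for p0, p1 in zip(prices, prices[1:])]
labels = [direction_label(r) for r in returns]
```

Only the labels are passed to the classifier, so the size of a move is discarded and only its sign is kept.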


Article Collection 
News articles are released at all hours, but US financial markets are open only from 9:30am to 4:00pm Eastern Time (ET). We begin the data collection process at 3:00pm ET to capture the past 24 hours of news releases. This leaves enough time to organize our data, run our models, output predictions, and tailor a portfolio strategy before markets close.


Non-Trading Days 
On the day after a non-trading day, we have more than 24 hours of unused news. To incorporate all meaningful news releases into our predictions, we collect every release published since 3:00pm on the last trading day and use it to predict the next trading day’s stock returns.
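The collection window described above can be sketched as follows. This is an illustrative sketch under stated assumptions: datetimes are naive and taken to be in Eastern Time, weekends are treated as non-trading days, and market holidays are supplied by the caller through a hypothetical `holidays` set. The function names are ours for this example.

```python
from datetime import date, datetime, time, timedelta

CUTOFF = time(15, 0)  # 3:00pm; naive datetimes are assumed to be Eastern Time


def is_trading_day(day, holidays=frozenset()):
    """A weekday that is not in the caller-supplied holiday set."""
    return day.weekday() < 5 and day not in holidays


def previous_trading_day(day, holidays=frozenset()):
    """Walk backwards one day at a time until a trading day is found."""
    prev = day - timedelta(days=1)
    while not is_trading_day(prev, holidays):
        prev -= timedelta(days=1)
    return prev


def collection_window(today, holidays=frozenset()):
    """Return (start, end): 3:00pm on the last trading day to 3:00pm today."""
    start = datetime.combine(previous_trading_day(today, holidays), CUTOFF)
    end = datetime.combine(today, CUTOFF)
    return start, end


# A Monday: the window reaches back to the previous Friday at 3:00pm.
monday_start, monday_end = collection_window(date(2016, 7, 18))
```

On an ordinary Tuesday through Friday the window is exactly 24 hours; after a weekend or holiday it widens to cover every article published while markets were closed.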


Model Testing 

Method 1: Deep Learning Algorithm 
In our first experiment we use a neural network to test whether article-level information has predictive power for next-day stock returns.


We select a convolutional neural network (CNN), an algorithm traditionally used for image recognition, to identify article-level predictive power. A CNN takes data, in our case the text from each news feed, and converts it into pixels. Each character, word, and sentence is input into the CNN; patterns are identified by the position of one data point (pixel) relative to the others, and are then related to the stock return. CNNs are complex deep learning models that do not tell us why a pattern exists, only what the pattern is.
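To make the idea of text as pixels concrete, here is one common way to turn a string into an image-like grid: character-level one-hot encoding. The exact encoding used in our model is not described here, so treat this as an assumption. The alphabet, the `max_len` value, and the function name are all illustrative choices.

```python
import string

# Assumed alphabet: lowercase letters, digits, punctuation, and space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_text(text, max_len=64):
    """One-hot encode text into a max_len x len(ALPHABET) grid of 0/1 values.

    Each row is one character position and each column is one alphabet symbol.
    Characters outside the alphabet and padding positions stay as all-zero rows.
    """
    grid = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            grid[pos][idx] = 1
    return grid


grid = encode_text("Google revenue increased", max_len=32)
```

Each cell of the grid plays the role of a pixel, so a convolutional filter sliding down the rows sees short runs of neighboring characters in the same way an image filter sees neighboring pixels.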


We use TensorFlow, an open-source machine intelligence library initially created by the Google Brain Team, to run our computation. We select 2010-2014 as our training set and 2015 as our testing set.
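A chronological split like this one keeps future information out of the training data. A minimal sketch, assuming each article is stored as a dictionary with a `date` field; the record layout and sample values are hypothetical.

```python
from datetime import date


def split_by_year(records, train_years=range(2010, 2015), test_years=(2015,)):
    """Split records chronologically using the year of each record's 'date' field."""
    train = [r for r in records if r["date"].year in train_years]
    test = [r for r in records if r["date"].year in test_years]
    return train, test


# Hypothetical records; field names and values are illustrative only.
records = [
    {"ticker": "GOOG", "date": date(2012, 3, 1), "label": 1},
    {"ticker": "AAPL", "date": date(2014, 12, 31), "label": 0},
    {"ticker": "MSFT", "date": date(2015, 1, 2), "label": 1},
    {"ticker": "MSFT", "date": date(2016, 1, 4), "label": 0},
]
train, test = split_by_year(records)
```

Records dated after 2015 fall into neither set in this sketch.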


In deploying our model, we aim to predict tomorrow’s stock price return. During training, we observe a training accuracy of 52.0% and a testing accuracy of 50.1%, as seen in Figure 1. Our model is unstable and does not converge during training. We attribute this to limited GPU computing power and limited hyperparameter tuning.

Applied to Seeking Alpha at the article level, our deep learning model does not have predictive power significantly better than a random coin flip. In the future, after we increase our GPU computing power, we will tune the hyperparameters and test this model further.
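One simple way to check whether an accuracy differs from a coin flip is a normal-approximation z-test against a 50% baseline. The sketch below is illustrative only: the 50.1% figure is the testing accuracy reported above, but `n_predictions` is a hypothetical placeholder because the number of test predictions is not stated here.

```python
import math


def z_score_vs_coin_flip(accuracy, n_predictions):
    """Normal-approximation z-score for an accuracy against a 50% baseline."""
    standard_error = math.sqrt(0.25 / n_predictions)
    return (accuracy - 0.5) / standard_error


def one_sided_p_value(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# n_predictions is a made-up value chosen only to show the calculation.
z = z_score_vs_coin_flip(0.501, n_predictions=10_000)
p = one_sided_p_value(z)
```

A small z-score and a large p-value mean the observed accuracy is consistent with random guessing. Because the standard error shrinks as the number of predictions grows, the conclusion depends on the real test-set size.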


Method 2: Features to Stock Return 
In this experiment, we analyze the sentiment in news by recognizing the emotional state expressed in an article. We construct different types of features.



We start our analysis by constructing natural language processing features: 

An n-gram is a contiguous sequence of words or characters in a sentence. Metrics on syllables, letters, or words are collected and analyzed to make predictions. In our analysis we focus on word and character n-grams. These features are used to assess how the probability of word sequences relates to article sentiment. Example: Google revenue increased during their quarterly earnings announcements.


An article with “revenue increased” and “quarterly earnings” as its most frequent word pairs may exhibit a certain pattern in sentiment.
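Word and character n-grams for the example sentence can be counted with the standard library alone. This is a minimal sketch: the tokenization rule (lowercasing and keeping runs of letters, digits, and apostrophes) and the function names are assumptions made for illustration, not a description of our feature pipeline.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return all contiguous runs of n lowercase word tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous runs of n characters, spaces included."""
    cleaned = text.lower()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


sentence = "Google revenue increased during their quarterly earnings announcements."
bigram_counts = Counter(word_ngrams(sentence, 2))
trigram_char_counts = Counter(char_ngrams(sentence, 3))
```

Here `bigram_counts` contains both "revenue increased" and "quarterly earnings". Summing such counts over a whole article gives the frequency features described above.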