Welcome to Natural Language Processing
Financial news provides information to the general public. Consumers rely on information they read or hear before buying a product. The internet makes content easily accessible and more relevant than ever.
Remember the game where you guess how many jellybeans are in the jar? Your guess, and all your friends' guesses, probably spanned a wide range. The wisdom of crowds holds that the average of a large population's guesses will be closer to reality than an individual expert's guess. This phenomenon holds true for predicting jelly beans, and for sentiment around a stock.
Consumer sentiment plays an important role in financial markets. We apply natural language processing to analyze text and understand what consumers are reading or talking about. We show that using financial articles to predict stock movements outperforms a random-guess investment strategy.
We test three methods in our analysis. First, we use deep learning to test article and sentence level prediction. Next we create features from characters, words, and sentences in each data source. Finally, we create sentiment scores from each article to then predict stock returns.
Our research is organized as follows:
1.) Data explained.
2.) Method 1: Computer vision applied to sentiment analysis.
3.) Method 2: Decomposing each article into features for stock return prediction.
3.1: Feature construction
3.2: Hyperparameter optimization
3.3: Model training
3.4: Experiment results
4.) Method 3: Using our features to predict the sentiment of each article.
5.) Future work.
We collect articles from ten unique financial data sources, as listed in Table 1. In this research we focus exclusively on Seeking Alpha articles.
We filter Seeking Alpha to find news related to companies in the S&P 500 from 2010-present. Our experiment contains over a million relevant articles for analysis.
From each article we extract all text from the URL, the date of publication, and the primary stock ticker the article relates to.
We use today's and prior days' information to predict future stock returns. We calculate returns as the change in stock price from a prior period (Pn) to the current period (P).
Returns are then classified as either positive or negative rather than kept as percent returns. In this experiment we predict the direction of stock movement rather than its magnitude.
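The return-direction label described above can be sketched in a few lines. This is a minimal illustration; the function and variable names (`return_direction`, `prev_price`, `curr_price`) are hypothetical, not from the original pipeline.

```python
# Sketch of the binary label: change from the prior period price (Pn)
# to the current price (P), classified as up (1) or down (0).

def return_direction(prev_price: float, curr_price: float) -> int:
    """Classify the stock move as positive (1) or negative (0)."""
    pct_return = (curr_price - prev_price) / prev_price
    return 1 if pct_return > 0 else 0

print(return_direction(100.0, 103.5))  # up day -> 1
print(return_direction(100.0, 98.0))   # down day -> 0
```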
News articles are released at all times of day, but financial markets in the US are only open from 9:30am to 4:00pm Eastern Time (ET). We begin the data collection process at 3:00pm ET to capture the past 24 hours of news releases. This gives enough time to organize our data, run our models, output predictions, and tailor a portfolio strategy before markets close.
On the day after a non-trading day, we fall into the unique scenario that we have more than 24 hours of unused data. In order to incorporate all meaningful news releases into our predictions, we organize all news releases from the last trading day at 3:00pm and use them to predict next trading day stock returns.
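The cutoff logic above can be sketched as a bucketing function: each article feeds the first 3:00pm cutoff that follows it, and weekend buckets roll forward to the next weekday. This is a simplified sketch under stated assumptions (no market-holiday calendar, naive timestamps); `trading_day_bucket` is a hypothetical name.

```python
# Assign each article to the trading-day 3:00pm ET cutoff it feeds into.
from datetime import date, datetime, timedelta

CUTOFF_HOUR = 15  # 3:00pm

def trading_day_bucket(published: datetime) -> date:
    """Return the date of the 3:00pm cutoff this article is used for."""
    cutoff = published.replace(hour=CUTOFF_HOUR, minute=0, second=0, microsecond=0)
    day = published.date() if published < cutoff else published.date() + timedelta(days=1)
    # Roll Saturday/Sunday buckets forward to Monday (holidays ignored here).
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day
```

An article published Friday at 4:00pm, for example, lands in Monday's bucket, matching the more-than-24-hours scenario described above.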
Method 1: Deep Learning Algorithm
In our first experiment we use a neural network to test for predictive power between article level information and next day stock returns.
We select a convolutional neural network (CNN), a traditional image recognition algorithm, to identify article-level predictive power. A CNN takes data, i.e., text from each news feed, and converts the data into pixels. Each character, word, and sentence is input into the CNN; patterns are identified by the position of one data point (pixel) relative to the others, and then related to the stock return. CNNs are complex deep learning models that do not tell us why a pattern exists, only what the pattern is.
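The "text as pixels" encoding above can be illustrated as mapping characters to integer indices in a fixed-size matrix, one row per sentence. This is a hedged sketch: the alphabet, dimensions, and function name `encode_article` are illustrative assumptions, not the original preprocessing.

```python
# Sketch: encode an article as a fixed-size integer "image" a CNN could consume.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;!?$%"
CHAR2IDX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 = padding/unknown

def encode_article(sentences, width=64, height=32):
    """Encode an article as a (height, width) matrix of character indices."""
    img = np.zeros((height, width), dtype=np.int32)
    for row, sent in enumerate(sentences[:height]):          # one row per sentence
        for col, ch in enumerate(sent.lower()[:width]):       # one column per character
            img[row, col] = CHAR2IDX.get(ch, 0)
    return img

img = encode_article(["Google revenue increased.", "Shares rose 3%."])
print(img.shape)  # (32, 64)
```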
We deploy our computation with TensorFlow, an open-source machine intelligence library initially created by the Google Brain team. We select 2010-2014 as our training set and 2015 as our test set.
In deploying our model, we aim to predict tomorrow's stock price return. During the training process, we observe training accuracy of 52.0% and testing accuracy of 50.1%, as seen in Figure 1. Our model is unstable and does not converge during the training process, likely caused by limited graphical processing computing power and insufficient hyperparameter tuning.
Applying our deep learning model to Seeking Alpha at the article level does not yield predictive power significantly better than a random coin flip. In the future, after we increase our GPU computing power, we will tune the hyperparameters and test this model further.
Method 2: Features to Stock Return
In this experiment, we analyze the sentiment in news by recognizing the emotional state expressed in an article. We construct different types of features.
We start our analysis by constructing natural language processing features:
N-gram
A contiguous sequence of words or characters in a sentence. Metrics on syllables, letters, or words are collected and analyzed to make predictions. In our analysis we focus on word and character n-grams. These features are used to assess the effect of word-sequence probabilities on article sentiment. Example: Google revenue increased during their quarterly earnings announcements.
An article with "revenue increased" and "quarterly earnings" as its most frequent phrases may exhibit a certain pattern in sentiment.
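Counting word n-grams like "revenue increased" can be sketched with the standard library. Tokenization here is deliberately simplified (whitespace split, lowercasing); `word_ngrams` is a hypothetical name for illustration.

```python
# Sketch: count word n-grams so phrase frequencies can serve as features.
from collections import Counter

def word_ngrams(text, n=2):
    """Return a Counter of n-word tuples found in the text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

bigrams = word_ngrams("Google revenue increased during their quarterly earnings announcements")
print(bigrams[("revenue", "increased")])  # 1
```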
Part of Speech
The grammatical category of a word. There are eight major parts of speech in English grammar: noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection.
Noun: A person, place, thing, or event. Example: Bill Ackman is a hedge fund manager.
Preposition: The locator of place or time. Example: Wall Street is in New York.
Emotion Lexicon
Word corpora and their associations with eight basic emotions and sentiments.
Named Entity Recognition
Locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, monetary values, or percentages.
A to B
The relationship between two quantitative values separated by the word "to".
Example: Revenue increased from 20% to 30% last quarter.
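The "A to B" feature can be sketched as a regular expression capturing two quantities around the word "to". The pattern and the name `extract_a_to_b` are illustrative assumptions; a production pattern would handle currencies, ranges, and spelled-out numbers.

```python
# Sketch: extract "A to B" quantity pairs, e.g. "20% to 30%".
import re

A_TO_B = re.compile(r"(\d+(?:\.\d+)?%?)\s+to\s+(\d+(?:\.\d+)?%?)")

def extract_a_to_b(text):
    """Return all (A, B) quantity pairs separated by the word 'to'."""
    return A_TO_B.findall(text)

print(extract_a_to_b("Revenue increased from 20% to 30% last quarter."))
# [('20%', '30%')]
```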
Each article’s features are input into our machine learning models, labeled against next period stock returns. The returns are classified as positive or negative. This method assumes that our features account for all future stock returns, a process we expand upon in Method 3.
We apply grid search on our validation set to identify parameters that maximize our returns. The focus is on optimizing the parameters of a Support Vector Machine (SVM) model, then applying them across all predictions.
There are four parameters we need to tune in the SVM: the kernel ("poly", "rbf", "linear"), C (the regularization parameter, [1, 10, 100, 1000]), γ (gamma in the kernel function, [1, 2, 3, 4, 5]), and d (the degree of the kernel function, [1, 2, 3]).
To choose a good parameter set, we conducted a 5-fold cross validation on the training set using all the parameter combinations, and obtained the best parameters. Then, we tested the holdout set by using the model with best parameters.
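The grid search over the parameter combinations above can be sketched without any ML library. Here `evaluate` is a hypothetical stand-in for fitting an SVM and scoring one cross-validation fold; only the grid itself comes from the text.

```python
# Sketch: exhaustive grid search with 5-fold mean scoring.
from itertools import product

PARAM_GRID = {
    "kernel": ["poly", "rbf", "linear"],
    "C": [1, 10, 100, 1000],
    "gamma": [1, 2, 3, 4, 5],
    "degree": [1, 2, 3],
}

def grid_search(evaluate, n_folds=5):
    """Return (best_params, best_score) over all grid combinations."""
    keys = list(PARAM_GRID)
    best_params, best_score = None, float("-inf")
    for values in product(*(PARAM_GRID[k] for k in keys)):
        params = dict(zip(keys, values))
        score = sum(evaluate(params, fold) for fold in range(n_folds)) / n_folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Dummy scorer for illustration: prefers an rbf kernel with C=100, gamma=2.
def dummy_eval(params, fold):
    return (params["kernel"] == "rbf") + (params["C"] == 100) + (params["gamma"] == 2)

best, _ = grid_search(dummy_eval)
print(best["kernel"], best["C"], best["gamma"])  # rbf 100 2
```

In practice the same search is available off the shelf as `GridSearchCV(..., cv=5)` in scikit-learn.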
Set C = [2**i for i in range(-8, 21, 4)]
which is [0.00390625, 0.0625, 1, 16, 256, 4096, 65536, 1048576]
RandomForestClassifier(n_estimators=55, oob_score=True, min_samples_split=500, max_features='sqrt', n_jobs=128, max_depth=16)
Each ticker tested on Seeking Alpha has 1,000 historical news articles on average. Each article has thousands of features, causing a risk of over-fitting in our model.
We apply Principal Component Analysis (PCA) to reduce the dimensionality of XOM's feature matrix, as shown in Figure 2.
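The dimensionality reduction can be sketched as PCA via the singular value decomposition: center the (articles × features) matrix and project it onto its top-k principal components. The data and the name `pca_reduce` are illustrative; the original work presumably used a library implementation.

```python
# Sketch: PCA by SVD to shrink a wide article-feature matrix.
import numpy as np

def pca_reduce(X, k=2):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                              # scores on top-k components

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))   # ~1,000 articles, each with many features
print(pca_reduce(X, k=2).shape)    # (1000, 2)
```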