Posted on June 3, 2017

Sentiment Analysis of Financial News Headlines Using NLP

Given the explosion of unstructured data through the growth of social media, there is going to be more and more value in the insights we can derive from this data. One application of particular interest is finance. Many people (and corporations) seek to answer whether there are any exploitable relationships between this unstructured data and financial assets. Provided one could come up with a robust algorithm, there would likely be significant scope for implementation.

Here we will look at some data provided by Kaggle, and see what we can learn through frequency analysis, TF-IDF analysis and the application of some basic prediction/regression.

Data here

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk.sentiment.vader import SentimentIntensityAnalyzer # VADER https://github.com/cjhutto/vaderSentiment
from nltk import tokenize
df = pd.read_csv("D:\Downloads\Data\stocknews\Combined_News_DJIA.csv")
dj_df = pd.read_csv("D:\Downloads\Data\stocknews\DJIA_table.csv")
reddit_df = pd.read_csv("D:\Downloads\Data\stocknews\RedditNews.csv")
df.describe()
df.Date = pd.to_datetime(df.Date)
df.head()
df.index = df.Date
dj_df.describe()
dj_df.Date = pd.to_datetime(dj_df.Date)
dj_df.index = dj_df.Date
dj_df = dj_df.sort_values(by = 'Date', ascending=True)
dj_df.head()
Date Open High Low Close Volume Adj Close
Date
2008-08-08 2008-08-08 11432.089844 11759.959961 11388.040039 11734.320312 212830000 11734.320312
2008-08-11 2008-08-11 11729.669922 11867.110352 11675.530273 11782.349609 183190000 11782.349609
2008-08-12 2008-08-12 11781.700195 11782.349609 11601.519531 11642.469727 173590000 11642.469727
2008-08-13 2008-08-13 11632.809570 11633.780273 11453.339844 11532.959961 182550000 11532.959961
2008-08-14 2008-08-14 11532.070312 11718.280273 11450.889648 11615.929688 159790000 11615.929688
reddit_df.index = pd.to_datetime(reddit_df.Date)

Frequency Analysis

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

# Create a single string for each date (since we only want to look at word counts)
news_combined = ''
for row in range(0,len(df.index)):
    news_combined += ' '.join(str(x).lower().strip() for x in df.iloc[row,2:27]) + ' '  # trailing space so words don't fuse across days
    
vectorizer = CountVectorizer()
news_vect = vectorizer.build_tokenizer()(news_combined)
word_counts = pd.DataFrame([[x,news_vect.count(x)] for x in set(news_vect)], columns = ['Word', 'Count'])
from wordcloud import WordCloud
wordcloud = WordCloud().generate(news_combined)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(news_combined)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

[Word cloud of the combined headlines]

[Word cloud with max_font_size=40]

# Remove stopwords from the word counts, then index by word for plotting
word_counts_adj = word_counts.reset_index(drop=True)
word_counts_adj = word_counts_adj[~word_counts_adj['Word'].isin(stop)]
word_counts_adj.index = word_counts_adj['Word']
counts = word_counts_adj.sort_values(by='Count', ascending=False)[0:100].plot(kind='barh', figsize = (16,15))
plt.show()

[Bar chart of the 100 most frequent words after removing stopwords]

Sentiment Analysis

A large area of research is devoted to understanding unstructured data, and in particular to seeing whether we can harvest the constant streams of social media text coming from sources like Twitter, Reddit, news headlines, etc. Here we'll have a look at some basic sentiment analysis and then see if we can classify changes in the Dow Jones Industrial Average (DJIA) by looking at changes in the sentiment.

Financial News Headlines

The data provided consists of the top 25 headlines on Reddit's r/worldnews each day from 2008-08-08 to 2016-07-01.
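Before aggregating across all 25 headlines, it's worth seeing what VADER actually returns for a single piece of text. Below is a minimal sketch with a made-up headline (not taken from the dataset): the output is a dictionary with 'neg', 'neu' and 'pos' proportions plus a normalised 'compound' score between -1 and 1.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# (requires the VADER lexicon: run nltk.download('vader_lexicon') once if needed)

analyzer = SentimentIntensityAnalyzer()

# Illustrative headline only; the loop below applies the same call to every Top1-Top25 column
example = "markets tumble as trade tensions escalate"
print(analyzer.polarity_scores(example))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}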

scores = pd.DataFrame(index = df.Date, columns = ['Compound', 'Positive', 'Negative', "Neutral"])

analyzer = SentimentIntensityAnalyzer() # Use the VADER Sentiment Analyzer

# Sum the VADER scores across the 25 headlines for each day
# (the loop starts at 1, so the first row is left as NaN and dropped below)
for j in range(1,df.shape[0]):
    tmp_neu = 0
    tmp_neg = 0
    tmp_pos = 0
    tmp_comp = 0
    for i in range(2,df.shape[1]):
        text = df.iloc[j,i]
        if(str(text) == "nan"):
            tmp_comp +=  0
            tmp_neg += 0
            tmp_neu += 0
            tmp_pos += 0
        else:
            vs = analyzer.polarity_scores(df.iloc[j,i])
            tmp_comp +=  vs['compound']
            tmp_neg += vs['neg']
            tmp_neu += vs['neu']
            tmp_pos += vs['pos']
    
    scores.iloc[j,] = [tmp_comp, tmp_pos, tmp_neg, tmp_neu]

scores.head()
scores = scores.dropna()

We can see that on average the news headlines in r/worldnews tend to be quite negative. This is not unexpected given the often political nature of the subreddit, and it corroborates the frequency analysis, where words like ‘china’, ‘israel’, ‘government’, ‘war’, ‘nuclear’ and ‘court’ were among the most frequently found.
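A quick sanity check on that claim (a minimal sketch using the scores DataFrame built above; the columns are cast to float because they were created as empty object-dtype columns):

# Mean of each (summed) VADER component across all days; if headlines really do skew
# negative, the average 'Negative' component should exceed the average 'Positive' one.
print(scores[['Positive', 'Negative', 'Neutral', 'Compound']].astype(float).mean())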

scores.index = pd.to_datetime(scores.index)
plt.plot(scores.Compound)
plt.show()

[Daily compound sentiment score over time]

We can also start to see if there's any relationship between movements in the Dow Jones Index and our sentiment scores. We first notice that if we look at returns, there are some extreme outliers in the sentiment series. This is likely a product of an essentially zero sentiment score being assigned to a day, which blows out any subsequent percentage change. For now, we'll simply work on a difference basis to see what we can draw out of the data.
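As a minimal illustration of why differences are used here: pandas' diff() stays bounded as the compound score passes through zero, whereas pct_change() divides by the previous value and can explode.

# First differences of the compound score remain well behaved near zero...
compound_diff = scores.Compound.astype(float).diff()

# ...while percentage changes blow up whenever the previous day's score is close to zero.
compound_pct = scores.Compound.astype(float).pct_change()

print(compound_diff.abs().max(), compound_pct.abs().max())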

plt.plot(scores.index, scores.Compound.shift(1)/scores.Compound-1)
plt.show()

plt.plot(dj_df.Date, dj_df.Close.shift(1)/dj_df.Close-1)
plt.show()

# Align the two series on their common dates before differencing
# (their lengths differ because the first day's sentiment row was dropped above)
aligned = pd.concat([dj_df.Close, scores.Compound.astype(float)], axis=1).dropna()
plt.scatter(np.diff(aligned.Close), np.diff(aligned.Compound))
plt.show()

[Day-on-day change in compound sentiment]

[Day-on-day change in DJIA close]

[Scatter of daily changes: DJIA close vs. compound sentiment]

…and our answer is… not much. It doesn't look like there's any strong relationship between our sentiment changes and DJIA changes. This is not surprising, given that we're using a very specific microcosm of the internet (Reddit's r/worldnews), which likely contains biased information. Perhaps something more interesting we could explore is a time-series decomposition of the data, but first we'll implement some basic binary classifiers to see just how poor the relationship is.

# Import classifiers
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import  LinearRegression,  LogisticRegression

#Import helper functions
from sklearn.model_selection import learning_curve, train_test_split

import time

# Names of classifiers we want to use (one per entry in clfs below)
clf_names = ["KNN",
             "Linear SVM",
             "RBF SVM",
             "Naive Bayes",
             "LDA",
             "Logistic Regression",
            ]

# Implementation of each classifier we want to use; there is large scope in here for parameter tuning etc.
# Note: KNeighborsRegressor is a regressor, so its .score() is R^2 (which can go negative),
# while the remaining models report mean classification accuracy.
clfs = [KNeighborsRegressor(2),
        SVC(kernel="linear", C=0.025),
        SVC(gamma=2, C=1),
        GaussianNB(),
        LinearDiscriminantAnalysis(),
        LogisticRegression(),
       ]

# Create test/train splits, and initialise plotting requirements
# We won't apply any feature reduction here, but it could be explored.
merged = scores.join(df)
merged = merged.iloc[:, 0:6]
merged = merged.iloc[2:,]
merged = merged.dropna()
train = merged[merged.index < '2015-01-01']
test = merged[merged.index > '2014-12-31']

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)
X_train = train[['Compound']]#.reshape(-1,1)
X_test = test[['Compound']]#.reshape(-1,1)
y_train = train['Label'].values.reshape(-1,1)
y_test = test['Label'].values.reshape(-1,1)

regressor_data = pd.DataFrame(columns = ["Name", "Score", "Training_Time"])
fig = plt.figure(figsize = (15,60))
i = 0

# Iterate over each regressor (no cross validation/KFolds yet)
for name, clf in zip(clf_names, clfs):
    print("#" * 80)
    print("Fitting '%s' regressor." % name)
    
    # Time required to fit the regressor
    t0 = time.time()
    clf.fit(X_train, y_train.ravel())
    t1 = time.time()

    score = clf.score(X_test, y_test.ravel())
    print("Name: %s Score: %.2f Time %.4f secs" % (name, score, t1-t0))
    
    # Store results
    regressor_data.loc[i] = [name, score, t1-t0]
    i += 1
################################################################################
Fitting 'KNN' regressor.
Name: KNN Score: -0.50 Time 0.0020 secs
################################################################################
Fitting 'Linear SVM' regressor.
Name: Linear SVM Score: 0.51 Time 0.0350 secs
################################################################################
Fitting 'RBF SVM' regressor.
Name: RBF SVM Score: 0.47 Time 0.1041 secs
################################################################################
Fitting 'Naive Bayes' regressor.
Name: Naive Bayes Score: 0.47 Time 0.0010 secs
################################################################################
Fitting 'LDA' regressor.
Name: LDA Score: 0.48 Time 0.0010 secs
################################################################################
Fitting 'Logistic Regression' regressor.
Name: Logistic Regression Score: 0.48 Time 0.0010 secs

As expected, we see scores on the order of 0.50, i.e. we're no better off than flipping a coin when using Reddit's r/worldnews to guess the direction of the market. Importantly, we note that we've used the sentiment on a given day to classify whether that same day has gone up or down relative to the previous day. More usefully, we should look at the previous day's headlines and see if they can be used to predict today's move, since then we might gain an information advantage that we could exploit.
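A minimal sketch of that lag alignment, using the merged DataFrame built above: shift the Compound column down one row so that yesterday's sentiment sits alongside today's Label, then drop the unmatched first row. (Compound_lag1, X_lagged and y_lagged are just illustrative names.)

# Pair yesterday's sentiment with today's up/down label by lagging the feature column
lagged = merged[['Compound', 'Label']].copy()
lagged['Compound_lag1'] = lagged['Compound'].shift(1)   # previous day's compound score
lagged = lagged.dropna()                                # the first row has no previous day

X_lagged = lagged[['Compound_lag1']]
y_lagged = lagged['Label'].astype(int)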


# Create test/train splits, and initialise plotting requirements
# We won't apply any feature reduction here, but it could be explored.
merged = scores.join(df)
merged = merged.iloc[:, 0:6]
merged = merged.iloc[2:,]
merged = merged.dropna()
train = merged[merged.index < '2015-01-01']
test = merged[merged.index > '2014-12-31']

# Here we adjust our train/test sets so that we're using the current day's sentiment to predict tomorrow's change
X_train = train[['Compound']].shift(1)  # row at date t now holds the previous day's Compound score
X_test = test[['Compound']].shift(1)
X_train = X_train.dropna()              # drop the first row, which has no previous day to draw on
X_test = X_test.dropna()
y_train = train['Label'].values.reshape(-1,1)
y_train = y_train[1:]                   # drop the first label so the lagged features and labels stay aligned
y_test = test['Label'].values.reshape(-1,1)
y_test = y_test[1:]

regressor_data = pd.DataFrame(columns = ["Name", "Score", "Training_Time"])
fig = plt.figure(figsize = (15,60))
i = 0

# Iterate over each regressor (no cross validation/KFolds yet)
for name, clf in zip(clf_names, clfs):
    print("#" * 80)
    print("Fitting '%s' regressor." % name)
    
    # Time required to fit the regressor
    t0 = time.time()
    clf.fit(X_train, y_train.ravel())
    t1 = time.time()

    score = clf.score(X_test, y_test.ravel())
    print("Name: %s Score: %.2f Time %.4f secs" % (name, score, t1-t0))
    
    # Store results
    regressor_data.loc[i] = [name, score, t1-t0]
    i += 1
################################################################################
Fitting 'KNN' regressor.
Name: KNN Score: -0.49 Time 0.0020 secs
################################################################################
Fitting 'Linear SVM' regressor.
Name: Linear SVM Score: 0.51 Time 0.0240 secs
################################################################################
Fitting 'RBF SVM' regressor.
Name: RBF SVM Score: 0.47 Time 0.1041 secs
################################################################################
Fitting 'Naive Bayes' regressor.
Name: Naive Bayes Score: 0.48 Time 0.0010 secs
################################################################################
Fitting 'LDA' regressor.
Name: LDA Score: 0.48 Time 0.0020 secs
################################################################################
Fitting 'Logistic Regression' regressor.
Name: Logistic Regression Score: 0.48 Time 0.0010 secs

TF-IDF Analysis

An awesome tutorial here by Steven Loria gave me the idea of seeing whether we can identify the key global trends in r/worldnews over the past few years. The tool being used here is TF-IDF (Term Frequency–Inverse Document Frequency), which is a way of scoring/ranking words by looking at their frequency both within a “document” and across other documents. Basically, it trades off how frequently a word occurs in a document against the global “uniqueness” of that word across the collection of documents. The more frequent and more unique a word is, the higher its score.
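As a tiny numeric illustration (toy documents, not the Reddit headlines), using the same tf × idf definition as the helper functions below: a word that appears in most documents gets an idf near zero, while a word unique to one document keeps a high score.

import math

# Three toy "documents"
toy_docs = [d.split() for d in ["the war in georgia",
                                "the eurozone crisis",
                                "the war on drugs"]]

def tfidf_toy(word, doc, all_docs):
    tf = doc.count(word) / len(doc)                        # within-document frequency
    n_with_word = sum(1 for d in all_docs if word in d)
    idf = math.log(len(all_docs) / (1 + n_with_word))      # rarity across the collection
    return tf * idf

# 'war' appears in two of the three documents, so it scores lower than
# 'georgia', which only appears in the first document.
print(tfidf_toy("war", toy_docs[0], toy_docs))
print(tfidf_toy("georgia", toy_docs[0], toy_docs))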

import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

docs = pd.DataFrame(index=df.Date, columns=['Comb'])
for row in range(0,len(df.index)):
    docs.iloc[row,] = ' '.join(str(x).lower().strip().replace("b'", "").replace('b"', "") for x in df.iloc[row,2:27])
    

# To keep the TF-IDF computation manageable we only use a slice of each year:
# all of 2008 (the data starts in August), November-December for 2009-2015, and June 2016 (the data ends 2016-07-01).
doc_2008 = docs[docs.index < "2009-01-01"].Comb.str.cat(sep = ' ').lower()
doc_2009 = docs[(docs.index >= "2009-11-01") & (docs.index < "2010-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2010 = docs[(docs.index >= "2010-11-01") & (docs.index < "2011-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2011 = docs[(docs.index >= "2011-11-01") & (docs.index < "2012-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2012 = docs[(docs.index >= "2012-11-01") & (docs.index < "2013-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2013 = docs[(docs.index >= "2013-11-01") & (docs.index < "2014-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2014 = docs[(docs.index >= "2014-11-01") & (docs.index < "2015-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2015 = docs[(docs.index >= "2015-11-01") & (docs.index < "2016-01-01")].Comb.str.cat(sep = ' ').lower()
doc_2016 = docs[(docs.index >= "2016-06-01") & (docs.index < "2017-01-01")].Comb.str.cat(sep = ' ').lower()

bloblist = [tb(doc_2008), tb(doc_2009), tb(doc_2010), tb(doc_2011), tb(doc_2012)
           , tb(doc_2013), tb(doc_2014), tb(doc_2015), tb(doc_2016)]

for i, blob in enumerate(bloblist):
    print("Top words in document year {}".format(i + 2008))
    scores = {word: tfidf(word, blob, bloblist) for word in (set(blob.words)-stop)}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
Top words in document year 2008
	Word: georgia, TF-IDF: 0.00244
	Word: georgian, TF-IDF: 0.0013
	Word: ossetia, TF-IDF: 0.00102
	Word: mugabe, TF-IDF: 0.00036
	Word: hindu, TF-IDF: 0.00029
Top words in document year 2009
	Word: minarets, TF-IDF: 0.00054
	Word: copenhagen, TF-IDF: 0.00045
	Word: libel, TF-IDF: 0.00039
	Word: r\n, TF-IDF: 0.00039
	Word: swine, TF-IDF: 0.00039
Top words in document year 2010
	Word: cables, TF-IDF: 0.00133
	Word: assange, TF-IDF: 0.00109
	Word: manning, TF-IDF: 0.0007
	Word: mastercard, TF-IDF: 0.00063
	Word: julian, TF-IDF: 0.00056
Top words in document year 2011
	Word: hormuz, TF-IDF: 0.00052
	Word: strait, TF-IDF: 0.00045
	Word: eurozone, TF-IDF: 0.00044
	Word: tahrir, TF-IDF: 0.00043
	Word: homs, TF-IDF: 0.00038
Top words in document year 2012
	Word: morsi, TF-IDF: 0.00062
	Word: mayan, TF-IDF: 0.00054
	Word: mali, TF-IDF: 0.0004
	Word: non-member, TF-IDF: 0.00039
	Word: halappanavar, TF-IDF: 0.00031
Top words in document year 2013
	Word: nsa, TF-IDF: 0.00237
	Word: snowden, TF-IDF: 0.0018
	Word: edward, TF-IDF: 0.00108
	Word: pussy, TF-IDF: 0.00061
	Word: fukushima, TF-IDF: 0.00058
Top words in document year 2014
	Word: isis, TF-IDF: 0.0018
	Word: ruble, TF-IDF: 0.00117
	Word: sony, TF-IDF: 0.00098
	Word: airasia, TF-IDF: 0.0005
	Word: mistral, TF-IDF: 0.0005
Top words in document year 2015
	Word: isis, TF-IDF: 0.00373
	Word: ramadi, TF-IDF: 0.0005
	Word: tpp, TF-IDF: 0.00049
	Word: downed, TF-IDF: 0.00043
	Word: daesh, TF-IDF: 0.00042
Top words in document year 2016
	Word: brexit, TF-IDF: 0.00298
	Word: isis, TF-IDF: 0.0013
	Word: ramadan, TF-IDF: 0.00071
	Word: farage, TF-IDF: 0.00071
	Word: bookseller, TF-IDF: 0.00057

Some quite interesting results. Note that for most years we've only looked at November and December (plus whatever months are available for 2008 and 2016) due to the increasing computational time when looking at an entire year of data. We see that some of the key events of that period are called out:

  • 2009: Swiss minaret referendum, swine flu pandemic
  • 2010: Julian Assange's extradition case, the US diplomatic cables leak
  • 2016: Brexit, Nigel Farage

So we can see that this is a pretty interesting tool for highlighting key topics and points of interest from large sets of information. It could be particularly useful for, say, deconstructing analyst equity reports across periods of time to pull out trends and thematic concerns.