High Frequency and Unstructured Data in Finance: An Exploratory Study of Twitter

William Sanger¹, Thierry Warin²*

¹Ph.D. Candidate, Polytechnique Montreal, Canada
²HEC Montreal, Department of International Business, 3000 Cote-Sainte-Catherine Road, Montreal, Quebec, H3T 2A7, Canada

*Corresponding Author: Thierry Warin, HEC Montreal, Department of International Business, 3000 Cote-Sainte-Catherine Road, Montreal, Quebec, H3T 2A7, Canada

Visit for more related articles at Journal of Global Research in Computer Sciences

Abstract

Objective: In this paper, we investigate whether information spread over Twitter can be useful for designing investment strategies on financial markets.

Methods: We compare the influence of two kinds of messages sent on Twitter on two types of returns for firms listed on the S&P500. We use logistic-based models to assess the probability of having certain types of returns based on messages published on Twitter.

Results: Financial tweets are positively correlated with higher intraday and overnight returns (1 to 5% returns) while being negatively correlated with lower returns (0 to 1% returns). Non-financial tweets are not significantly related to such returns.

Conclusion: From a practical standpoint, investment strategies could be designed following these findings to optimize some gain opportunities depending on the investment day, the targeted industry and live activity on Twitter.

Keywords

Social media, Stock prices, Big data, S&P500, Twitter, High frequency, Unstructured data

INTRODUCTION

“Breaking: Two Explosions in the White House and Barack Obama is injured.” (The Associated Press, 10:07, April 23rd 2013). 72 characters later, the S&P500 index had lost more than 121 billion US dollars, a loss that lasted until the tweet was proven false and confirmed to have been sent from a hacked Associated Press account: a 1.68 billion-dollar bill for each character written. This particular event shed light on the implications for the stock markets of news spreading, especially through social media.

Our research question is whether information running through Twitter explains some of the stock price variations on the S&P500. Twitter is a new way of spreading information, not only through short sentences, but also by allowing users to emphasize one particular piece of information by retweeting. As such, is Twitter a complement to traditional ways of spreading news, or is it a revolution? Applied to a particular market - the financial market - this question takes on a whole different dimension. Indeed, information is at the core of finance. In theory, there is no way to beat the market return. However, when one investor has more information than another, she can beat the market. In this regard, can we extract extra information from messages on Twitter that would allow some investors to beat the market consistently?

Since the advent of modern finance, information has played a central role in every mechanism of the industry. Markowitz established the theoretical framework upon which the Capital Asset Pricing Model relies, but this framework rests on strong assumptions. Among them is the assumption that an investor will act rationally in order to maximize her returns while minimizing her risks [1]. Twenty years later, [2] characterized market efficiency with respect to information: there should be no gain opportunity on the stock markets because prices follow random walk patterns and instantly reflect new events, so that all information is known at any time.

These two assumptions have shown their limitations, especially during financial crises (October 1997, the Internet bubble or the US financial crisis of 2007, to name a few examples). If crises cannot be explained by rational thinking à la Markowitz, perhaps the emotional involvement of people is a potential answer. Behavioral finance interprets financial markets as a proxy for the social mood [3]: crises could be provoked by what [4] describe as “animal spirits”.

With the advent of social media and the democratization of low-cost and efficient computer systems, the wisdom of crowds has never been so accessible. Every day, 500 million tweets are publicly sent around the world. More than a billion users are connected on Facebook. And this is only the tip of the iceberg. Big Data is considered one of the most promising avenues for finance, especially for risk management [5]. By using a robust framework inherited from the financial industry, [6] developed dashboards to manage risks based on social media scrutiny. We define Big Data as a mix of structured and unstructured data (stock prices, heartbeats and pictures for example), produced in real-time as part of longitudinal data [7]. [8] highlighted the fact that qualitative data can reflect information that is not optimally incorporated into stock prices. In that sense, Twitter messages fit the definition of massively produced unstructured data, and the social network has become the subject of several studies in finance and social sciences.

While many have used sentiment analysis techniques or machine learning approaches, few econometric analyses have yet been published. “Twitter hedge funds” received wide media coverage but did not earn the returns that the studies had suggested. Here lies our challenge: how can the massive flow of information be assessed efficiently for the financial markets? The main objective of this research is to quantify investment opportunities following an increase in Twitter messages depending on (1) weekdays and (2) industry types. We expect that intraday and overnight returns could be correlated with the volume of tweets being published. To summarize, our overall research question is whether new forms of information diffusion such as Twitter can add to the current financial information flows and explain some changes in stock returns. This leads to two sub-questions and two hypotheses:

RQ1. Is there a difference between tweets exchanged after the markets are closed and tweets written during the day, when markets are open, in explaining stock price changes? For that matter, we use two dependent variables: intraday and overnight returns.

RQ2. Is there a difference between tweets written by financially literate people (using company tickers) and tweets written by the layman (using company names) in explaining stock price changes?

H1. The first hypothesis is related to the first research question in the sense that we assume that professionals tweet after they are done at work, and their comments may be interesting for investment decisions implemented on the next day.

H2. The second hypothesis is related to the second research question in the sense that we assume that tweets written by professionals (using tickers) provide more useful financial information than tweets written by the layman (using company names).

The rest of the paper is organized as follows: section (2) presents the literature review concerning the use of the Internet as a source of information for the financial markets, including forums, search engines and, in recent years, social media. In section (3), we describe our dataset composed of stock prices and the number of messages sent on Twitter about 71 firms of the S&P500. We use logistic-based estimations, which are detailed in section (4), while results are interpreted in the last section of the paper.

LITERATURE REVIEW

Forums and Blogging Websites

In the early 2000s, financial blogs were used to discuss stock performance. [9] studied Yahoo! Finance comments to understand the characteristics of the most discussed firms on the website. He found that when the number of messages doubles overnight, the following daily return is on average 0.18% higher. RagingBull.com, one of the first financial forums, was also the subject of a few studies. Event studies showed that the impact of messages cannot be anticipated more than a day in advance [10]. Trading volumes have also been correlated with the number of messages written about the corresponding companies, and thus Internet messages cannot be considered as noise [11]. By investigating topics shared on Engadget.com, a technology blog, [12] were able to predict the magnitude of stock performance in 78% of the cases (and the sign of the returns in 87%). The propagation of rumors on the Internet has been linked to trading volumes using the HotCopper.com website [13].

A second part of the studies regarding forums and blogging websites focuses on the concept of opinion leadership. It is opposed to the conformity of the mass and concerns how false information interferes with the wisdom of crowds: more precisely, [14] interpret noise as the distortion of a signal. An opinion leader is able to influence the perception of others by accentuating his own positions to balance the public’s inertia [15]. This capacity is a key element in understanding the importance of opinion leaders inside a network. How connected a person is and how novel the transmitted information is determine whether that person can be considered an opinion leader on the Internet [16]. These results can be used for predicting potential sales of a product or how inclined a person’s network will be to buy a similar object [17,18]. Finally, [19] differentiate two behaviors of influencers: the “agitator”, a person who stimulates discussions, and the “summarizer”, a person who tries to give a clear picture of a situation.

Search Engines

Search engines, being the medium through which users access information on the Internet, have become a valuable source of financial data. Google has released Google Trends, a tool aggregating the volume of search queries to interpret its users’ behavior. Several billion queries are made every day, and thus represent a huge opportunity in terms of Big Data analysis [20]. For example, relationships linking search queries and car sales have been identified [21].

In finance, trading volumes of S&P500 firms have been positively correlated with search queries through Google Trends [22]. This result was confirmed with other stock markets (Dow Jones, CAC40, DAX and FTSE). By adding Google Trends data, predictive models have shown greater explanatory power. However, search queries and stock volatility seem to influence each other, since “the investors’ attention to the stock market rises in times of strong market movements [and] a rise in investors’ attention is followed by higher volatility” [23]. In fact, search engines also appear to reduce the information gap between investors: search engines could be used as a proxy for “naïve” investors’ behavior [24]. Only a few companies are widely searched for on the Internet, but this non-expert point of view could be interpreted as the wisdom of crowds regarding stock markets [25].

Besides trading volumes, researchers have tried to anticipate stock returns using a similar methodology. Sudden announcements by firms with high growth and a narrow product offering are anticipated using Google Trends [26]. [27] found that more queries about a company lead to an increase in trading volumes, but also to a decrease in expected returns. However, when maximum (minimum) yearly values of stock prices have been reached, the predictive power of Google Trends is increased (decreased). Abnormal returns of Chinese stocks were studied using a similar tool, Baidu Index, which counts the raw number of queries made on Baidu’s servers [28].

The main limitation of Google Trends lies in the results given to its users: only aggregated data is provided, on a weekly level. Social media, by providing real-time and open information, could become a more appropriate tool for financial predictions [29].

Social Media

Even though Facebook gathers more than a billion users, only a few studies have been conducted in finance. [30] used the social network as a proxy for assessing the emotions of users. He created an index called Facebook’s Gross National Happiness (GNH), which takes into account updates of users’ statuses. A standard deviation of the GNH is correlated with an increase in stock returns the next day. In a now famous study, [31] showed that emotions spread between users and could be manipulated on the social network. Ethical questions aside, these results reinforce the notion of a highly connected network that can be influenced by its users, much like herd behavior on the financial markets.

eToro was part of several studies released by MIT’s Media Lab. Described as a “fun and accessible” way to democratize trading [32], the social network was able to shed light on how likely a piece of information is to become trending [33]. Also, researchers found that the reputation of a user is due not to his past trading performance but rather to his links with other users [34].

Since its release in 2006, Twitter has been the subject of studies in various fields, including election outcomes, crisis management and disease outbreaks. Few rules regulate the messages sent on the social network: a message cannot be longer than 140 characters and can be referenced with a “hashtag” (“#”) in order to facilitate searches on a particular topic. For financial purposes, the ticker symbol preceded by the “$” symbol is commonly used (e.g. $AAPL to write a message about Apple, or $NFLX for Netflix).

In order to process the number of messages sent every day, several techniques have been employed, in particular sentiment analysis. [35] investigated the predictive power of emotion in tweets using a 6-level categorization method. They found that messages associated with the emotion “calm” could influence stock returns, so that results could be anticipated as far as 4 days in advance. Stock performance (returns and trading volumes) is also linked to Twitter metrics such as sentiment or message volume [36]. Stock prices seem not to fully reflect publicly available information, even though the information gap tends to be filled quickly [37]. In a detailed econometric study, [38] could not reproduce the impressive results of [35] but were able to predict the DJIA’s direction 70% of the time, the NASDAQ’s direction 58.08% of the time and the S&P500’s outcomes 68.63% of the time.

Due to the massive adoption of social media by Internet users, firms have never been so exposed to the “buzz”. Adopting a game-theory approach, [39] illustrated the importance of taking into account words spread on the Internet and how to react to a reputation crisis. A message could spiral out of control and thus harm the firm’s stock price [40]. However, researchers have also used Twitter as an opportunity for trading activities, since the information is not instantly incorporated into prices [41,42]. Derwent Absolute Return, commonly called the “Twitter hedge fund” at its beginning, used the methodology developed by Bollen et al. but could not transform their theoretical results into practical success [43].

This paper relies on a methodology inherited from the studies using search engines, which is a volume-based methodology rather than a sentiment analysis methodology. Our hypothesis stipulates that investment strategies could be implemented in order to maximize an investor’s gains on the stock market.

DATA

Structured Data: Stock Information

Financial data come from the Yahoo! Finance database. We extracted the following information regarding the 500 firms composing the S&P500: opening price and closing price. These metrics were established on a daily basis, for 251 trading days, starting on May 1st 2012 and ending on May 1st 2013. From these data, we define two types of returns, intraday and overnight:

(1)

(2)
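The two return definitions can be illustrated with a short sketch. It assumes the usual simple-return conventions, which are an assumption here and not a restatement of equations (1) and (2): the intraday return compares a day's closing price with the same day's opening price, and the overnight return compares a day's opening price with the previous day's closing price. The function names and the sample prices are hypothetical.

```python
def intraday_return(open_price, close_price):
    """Simple return from the open to the close of the same trading day."""
    return (close_price - open_price) / open_price


def overnight_return(prev_close, open_price):
    """Simple return from the previous day's close to today's open."""
    return (open_price - prev_close) / prev_close


def daily_returns(opens, closes):
    """Return a list of (intraday, overnight) pairs, one per trading day.

    The first day has no previous close, so its overnight return is None.
    """
    rows = []
    for t in range(len(opens)):
        intraday = intraday_return(opens[t], closes[t])
        overnight = overnight_return(closes[t - 1], opens[t]) if t > 0 else None
        rows.append((intraday, overnight))
    return rows


# Hypothetical prices for three consecutive trading days.
opens = [100.0, 102.0, 101.0]
closes = [101.0, 100.0, 103.0]
returns = daily_returns(opens, closes)
```

Under these assumed conventions, every trading day after the first yields one intraday and one overnight observation per firm.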

We create dummy variables in order to control for weekdays (Monday to Friday) and industry types following the Global Industry Classification Standard (Energy, Materials, Industrials, Consumer Discretionary, Consumer Staples, Health Care, Financials, Information Technology, Telecommunication Services, Utilities).
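As an illustration of this encoding, the sketch below builds 0/1 indicators for one observation. It assumes that one category per factor is dropped as the reference (Monday and Energy, the reference points used in the logistic estimations); the helper name `dummies` and the sample observation are hypothetical.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
SECTORS = [
    "Energy", "Materials", "Industrials", "Consumer Discretionary",
    "Consumer Staples", "Health Care", "Financials",
    "Information Technology", "Telecommunication Services", "Utilities",
]


def dummies(value, categories, reference):
    """One 0/1 indicator per category except the reference, which is dropped."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {c: int(c == value) for c in categories if c != reference}


# Hypothetical observation: a Utilities firm on a Wednesday.
row = {}
row.update(dummies("Wednesday", WEEKDAYS, reference="Monday"))
row.update(dummies("Utilities", SECTORS, reference="Energy"))
```

With the reference categories dropped, each observation carries 4 weekday indicators and 9 industry indicators, and an Energy firm on a Monday has all of them set to zero.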

Unstructured Data: Tweets

Two types of messages were collected regarding the same firms. (1) We collected the number of times a company was named on Twitter per day (e.g. “Google” for Google Inc. or “Microsoft” for Microsoft Inc.). (2) The second type of tweet is called financial tweets, characterized by the presence of the “$” symbol before the ticker of a listed firm (e.g. “$GOOG” for Google Inc. or “$MSFT” for Microsoft Inc.). Data regarding these two types were aggregated by day and also collected from May 1st 2012 to May 1st 2013 using the PeopleBrowsr website. For clarity, the first type of tweets is later called Name and then Name.Index, while the second type of tweets is later called Ticker and then Ticker.Index.
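The distinction between the two message types can be sketched as a simple daily count. This is only an illustration on raw message text, not the query mechanism of the data provider: the regular expression, the function `count_mentions` and the sample messages are hypothetical, and the sketch assumes that a message containing both the name and the ticker counts towards both series.

```python
import re
from collections import Counter

# A "$" followed by 1 to 5 letters, e.g. $MSFT (assumed ticker format).
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")


def count_mentions(tweets, ticker, name):
    """Count daily Ticker and Name mentions for one firm.

    tweets: iterable of (day, text) pairs.
    Returns two Counters keyed by day: (ticker_counts, name_counts).
    """
    name_pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    ticker_counts, name_counts = Counter(), Counter()
    for day, text in tweets:
        cashtags = {tag.upper() for tag in CASHTAG.findall(text)}
        if ticker.upper() in cashtags:
            ticker_counts[day] += 1
        if name_pattern.search(text):
            name_counts[day] += 1
    return ticker_counts, name_counts


# Hypothetical sample messages.
sample = [
    ("2012-05-01", "Buying more $MSFT before earnings"),
    ("2012-05-01", "My Microsoft laptop just crashed again"),
    ("2012-05-02", "$msft and $GOOG both look strong today"),
    ("2012-05-02", "Microsoft announces a new tablet, $MSFT up"),
]
ticker_counts, name_counts = count_mentions(sample, "MSFT", "Microsoft")
```

Matching on the company name alone also captures messages unrelated to investing, such as the second sample message, which is one reason the two series can behave differently.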

From the 500 companies of the S&P500, we selected 71 in order to keep only firms that have on average at least 30 financial tweets per day. The list of queries used for the selected firms is provided in Tables 1a and 1b.

In Table 2, we provide some descriptive statistics for each variable.

Finally, we test the correlation coefficients between the variables and provide the correlation matrix in Table 3.

METHODOLOGY

Our research question is whether tweets can explain some of the variation in stock prices. In other words, is Twitter one of the channels of financial information? Is it an evolution or a revolution in finance? It is particularly interesting to study Twitter and its impact on the financial market, because Twitter (1) provides short messages forcing users to be to the point, (2) is real-time, (3) allows users to emphasize a message by retweeting (hence creating a buzz) and (4) allows users to tweet specifically about financial products, for instance by using a $ sign before the company ticker. The latter point is of particular interest from a statistical perspective, because it allows us to have a large sample, covering almost all of the population tweeting about finance.

Again, our research questions are as follows:

RQ2. Is there a difference between tweets written by financially literate people (using company tickers) and tweets written by the layman (using company names) in explaining stock price changes?

H2. The second hypothesis is related to the second research question in the sense that we assume that tweets made by professionals (using tickers) provide more useful financial information than tweets written by the layman (using company names).

From an econometric perspective, the methodology used for this study relies on logistic estimations. Logistic models evaluate the impact of a variable on the probability that the studied variable changes state. The dependent variable is reduced to a binary variable that can be adapted to different models. In our case, we evaluate the two different returns and implement three models concerning the magnitude of these returns, such as:

(3)

(4)

(5)
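The three binary dependent variables can be sketched as indicator functions of a return. The ranges follow those named in the results (a positive return for Model 1, a return between 0 and 1% for Model 2, and a return between 1 and 5% for Model 3); whether each bound is inclusive or exclusive is an assumption made here, and the function name is hypothetical.

```python
def model_indicators(r):
    """Map a return r (a fraction, e.g. 0.012 for 1.2%) to three 0/1 outcomes."""
    return {
        "model_1": int(r > 0.0),           # positive return
        "model_2": int(0.0 < r <= 0.01),   # return between 0 and 1%
        "model_3": int(0.01 < r <= 0.05),  # return between 1 and 5%
    }


# Hypothetical returns: a loss, a small gain, a larger gain, a gain above 5%.
examples = {r: model_indicators(r) for r in (-0.004, 0.006, 0.023, 0.08)}
```

Each model is then estimated separately, with the corresponding indicator as the dependent variable.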

In order to assess the effect of Twitter messages on firms, we control for two different types of fixed effects (weekdays on the one hand, and industry types on the other hand). This allows us to evaluate the specific impact of an increase (decrease) of tweets depending on the weekday or the kind of industry, such as:

(6)

where the dependent variable is the model used for the selected return; the tweet variable, taken from {ticker; name}, is the number of tweets published about a company mentioning either its ticker or its name the day before the studied return; ω_t denotes the weekdays; and σ_t is a value corresponding to the Global Industry Classification Standard.

In terms of methodology, we compute the log-odds, odds-ratios and predicted probabilities of our models, and test the significance of our variables with a Wald-test. Finally, in our logistic estimations, we use Monday and the Energy sector as our reference points.
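One way to write a specification consistent with this description is given below. It is a sketch under stated assumptions, not a transcription of equation (6): the symbols Y, β, the firm index i and the model index m are introduced here for illustration, and the lagged return is included because the results hold a previous-day return at its mean when computing predicted probabilities.

```latex
\ln\!\left(\frac{P\left(Y^{m}_{i,t}=1\right)}{1-P\left(Y^{m}_{i,t}=1\right)}\right)
  = \beta_0
  + \beta_1\,\mathrm{Tweets}^{k}_{i,t-1}
  + \beta_2\,\mathrm{Return}_{i,t-1}
  + \omega_t + \sigma_t,
\qquad k \in \{\text{ticker},\ \text{name}\},\quad m \in \{1, 2, 3\}
```

In this form, β_1 is the change in the log-odds for one additional tweet on the previous day, exp(β_1) is the corresponding odds-ratio, and the predicted probability is recovered by applying the inverse logit to the right-hand side.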

RESULTS

We provide a summary of the results of the logistic estimations in the following tables. Again, results have to be interpreted relative to the reference points. We should also pay attention to the low R-squared values, which should be considered in the context of an exploratory study. Although the values are low, they can be interpreted as a measure of how much we can add to the level of information already available in the financial markets.

Overnight returns and financial tweets

In Table 4, we compare the influence of financial tweets across the three models of overnight returns. Firstly, Ticker_t-1 is highly significant (as confirmed by a Wald-test on this variable), but its effects on the probability of having such returns go in opposite directions. For a unit increase in Ticker_t-1, the log-odds of having an overnight return between 0 and 1% decrease by 2.04 × 10^-4, whereas for the third model (from 1 to 5%) the log-odds increase by 1.846 × 10^-4.
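To give a sense of scale, the two log-odds coefficients quoted above can be converted into odds-ratios by exponentiation. The sketch below uses only those two reported coefficients; the function name and the choice of a 1,000-tweet increase are illustrative assumptions.

```python
import math


def odds_ratio(log_odds_coef, delta=1.0):
    """Multiplicative change in the odds for an increase of `delta` in the regressor."""
    return math.exp(log_odds_coef * delta)


COEF_MODEL_2 = -2.04e-4   # reported: overnight return between 0 and 1%
COEF_MODEL_3 = 1.846e-4   # reported: overnight return between 1 and 5%

or_model_2_per_tweet = odds_ratio(COEF_MODEL_2)
or_model_2_per_1000 = odds_ratio(COEF_MODEL_2, 1000)
or_model_3_per_1000 = odds_ratio(COEF_MODEL_3, 1000)
```

A single additional tweet leaves the odds almost unchanged, but 1,000 additional tweets multiply the odds of a 0 to 1% overnight return by about 0.82 and the odds of a 1 to 5% overnight return by about 1.20, all else being equal.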

Secondly, holding all variables at their means (Ticker_t-1 equal to 127 tweets, Weekdays being Wednesday and Overnight_return_t-1 being 0.00012), the predicted probabilities of having the three different overnight returns can be computed based on the industry classification (Table 5). For the first model, Industrials (54%), Health Care (51.3%) and Materials (50.8%) are the industry types with the highest probabilities of having such returns; for the second model, they are Telecommunication Services (52.4%), Information Technology (48.7%) and Consumer Discretionary (46.5%); for the third model, they are Utilities (14.8%), Health Care (11.9%) and Industrials (10.8%). Thirdly, we can compare these predicted probabilities for different levels of tweeting activity. Figure 1 provides such an analysis, with results depending on (1) the day of the week (1 to 5), (2) the type of industry (1 to 10) and (3) the number of financial tweets published (from 0 to 10,000). Here, the more a company is tweeted about, the less likely it is to have an overnight return between 0 and 1% (while it has a higher chance of having an overnight return between 1 and 5%). Finally, weekdays and industry types also influence these predicted probabilities.
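The dependence of a predicted probability on tweeting activity can be sketched from the figures reported in the text alone. The example below takes the 14.8% probability for Utilities in the third model at 127 tweets and the 1.846 × 10^-4 log-odds coefficient for that model, and shifts only the tweet count. It assumes that every other covariate stays fixed and treats the rounded published figures as exact, so the output is illustrative and not a reproduction of Figure 1; the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability corresponding to log-odds z."""
    return 1.0 / (1.0 + math.exp(-z))


def shifted_probability(p_base, x_base, x_new, coef):
    """Predicted probability at x_new, given probability p_base at x_base,
    when only the regressor x changes and its log-odds coefficient is coef."""
    return inv_logit(logit(p_base) + coef * (x_new - x_base))


P_BASE = 0.148    # reported: Utilities, third model, variables at their means
X_BASE = 127      # reported mean of Ticker_t-1
COEF = 1.846e-4   # reported log-odds coefficient for the third model

curve = {
    x: shifted_probability(P_BASE, X_BASE, x, COEF)
    for x in (0, 127, 1000, 5000, 10000)
}
```

Under these assumptions the probability rises with the tweet count, from 14.8% at the mean to roughly 52% at 10,000 tweets, which matches the direction of the effect described for returns between 1 and 5%.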

Overnight returns and non-financial tweets

Our hypothesis stipulates that financial tweets (Ticker) provide more information than non-financial tweets (Name). Name_t-1 has a p-value of 0.14 for Model 1 and is not significant for Model 2 (Table 6). The Wald-test confirms these results, while Name_t-1 is significant for Model 3. The difference in the magnitude of the log-odds between financial and non-financial tweets should also be noted.
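For reference, a Wald-test on a single coefficient can be sketched as follows. It assumes the standard form of the test, in which the squared ratio of the estimate to its standard error is compared with a chi-squared distribution with one degree of freedom. The coefficient and standard error used here are hypothetical and are not taken from Table 6.

```python
import math


def wald_test(coef, std_err):
    """Wald test of H0: coefficient = 0 for a single parameter.

    Returns (statistic, p_value), where the statistic (coef / std_err) ** 2
    is compared with a chi-squared distribution with one degree of freedom.
    """
    statistic = (coef / std_err) ** 2
    # For one degree of freedom, P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical coefficient and standard error.
stat, p = wald_test(coef=1.5e-6, std_err=1.0e-6)
```

With these hypothetical inputs the statistic is 2.25 and the p-value is about 0.13, so the coefficient would not be considered significant at the usual 5% or 10% levels.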

Again, holding all variables at their means (Name_t-1 equal to 16288 tweets, Weekdays being Wednesday and Overnight_return_t-1 being 0.00012), we can identify preferential industrial sectors for having positive overnight returns, which are similar to those previously mentioned in Table 4 (Appendix 1a). When analyzing the impact of weekdays, industry types and the number of mentions of a company simultaneously (Figure 2), the impact of non-financial tweets seems to be limited compared to financial tweets.

Intraday returns and financial tweets

Mirroring the previous results, the impact of financial tweets on intraday returns is significant (both for Models 1 and 2, as confirmed by a Wald-test). The magnitude of the Ticker_t-1 coefficient is lower for intraday returns than for overnight returns, meaning that for an increase of one unit in Ticker_t-1, the log-odds of having such returns fluctuate less during trading time on the stock markets. However, the sign of Ticker_t-1 remains the same (Table 7).

In Appendix 1b, we provide results for the predicted probabilities of having the three models of intraday returns while holding all variables at their means: Ticker_t-1 equal to 127 tweets, Weekdays being Wednesday and Intraday_return_t-1 being 0.00051. In Figure 3, the complete influence of weekdays, industry types and financial tweets can be assessed. For intraday returns between 0 and 1% (Model 2), five industry types present higher probabilities of having such returns, namely the (1) Energy, (2) Materials, (3) Consumer Discretionary, (4) Information Technology and (5) Telecommunication Services sectors. Finally, Friday, followed by Monday, is the day of the week presenting the highest intraday returns across all industries (Figure 3).

Intraday returns and non-financial tweets

In Table 8, we compute the log-odds and odds-ratios for each model and find that the impact of Name_t-1 is not significant for positive intraday returns (Model 1) or for intraday returns between 0 and 1% (Model 2). Again, the influence of non-financial tweets is lower than that of financial tweets, as illustrated by the magnitude of the log-odds for each type of variable.

After computing the predicted probabilities of having the three returns while holding all variables at their means (Appendix 1c), Figure 4 confirms that using non-financial tweets for predicting such returns is not effective.

CONCLUSION

Two main findings emerge from this study. Firstly, financial tweets are more relevant than non-financial tweets for intraday and overnight returns. While the former have a negative effect on the probability of lower returns (0 to 1%) and a positive effect on the probability of higher returns (1 to 5%), non-financial tweets did not help in explaining these returns. Secondly, overnight returns are more affected by financial tweets than intraday returns, confirming our first hypothesis: tweets contain information that seems to be reflected on stock markets as soon as they open on the next day, and that is less well captured during trading hours. From these results, investment dashboards could be built, controlling for industry types and weekdays and monitoring Twitter data live.

Being exploratory, our research has some limitations, and further work could be done in order to improve the quality of our findings. Firstly, we only selected firms that are already popular on Twitter (or at least have 30 financial messages per day on average). Secondly, the impact of a single tweet is low compared to traditional variables (returns of the previous day, for example). This suggests that Twitter messages cannot fully explain returns on the stock market. However, these unstructured data can add value to traditional investment strategies based on more regular sources of information. In this regard, Twitter can add some information, being an evolution, but maybe not a revolution, in finance.

In terms of estimation techniques, we think further work should be done, for instance using a straight OLS while identifying and correcting potential problems with unstructured data such as potential non-linearity of the interacted variables, weak extrapolation and severe interpolation. We could also implement machine learning techniques such as neural networks or cluster-based analyses.