By "Group 4"

(1) First Steps

Objective:
We're curious about the degree of impact that retail investors have on the stock market. With the rise in popularity of trading platforms designed for the average Joe, like Robinhood and Webull, we've also seen an unprecedented amount of retail trading volume in the market, with retail investors reportedly overtaking quant hedge funds in trading volume in 2019. In line with our curiosity, we've designed the following objective for this project:

To find whether popular sentiment towards stocks on the r/wallstreetbets subreddit positively correlates with the stocks' direction of price change at market close after various periods (end-of-day, end-of-week, and end-of-month). We will focus on the US stock market, i.e. tickers available for trading on NASDAQ and NYSE, over a period of a month.

Details:
In particular, we want to look at posts whose titles contain the string "What Are Your Moves Tomorrow", as user activity on this topic purportedly predicts movement in the stock market. We then run a regression on the actual market activity of the mentioned stocks one day after the post.
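To make that regression step concrete, here is a minimal sketch of what it might look like, assuming we end up with a table of per-ticker mention counts alongside the next trading day's close-to-close return. The column names, the toy numbers, and the use of statsmodels are our own illustration, not finalized code:

import pandas as pd
import statsmodels.api as sm

# Hypothetical table: one row per ticker per post day.
# "mentions" = number of top-level comments naming the ticker,
# "next_day_return" = close-to-close return on the following trading day.
df = pd.DataFrame({
    "mentions": [120, 45, 8, 3],
    "next_day_return": [0.021, -0.004, 0.010, -0.013],
})

# Regress next-day return on mention count to gauge any linear relationship.
X = sm.add_constant(df["mentions"])
model = sm.OLS(df["next_day_return"], X).fit()
print(model.summary())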

Time period:
28 February 2023 00:00:00 GMT – 31 March 2023 00:00:00 GMT
Rationale: Most recently completed month. Going back further in time runs the risk of coming across mentions of companies that are no longer listed on NASDAQ or NYSE. Publicly available data on such delisted companies is sparse and unreliable.

Hypothesis:
The sentiment of the community active on r/wallstreetbets does not constitute a force coordinated enough to move the market against the currents generated by other retail investors, let alone bulge bracket firms, quantitative hedge funds, and market makers. Instead, we predict that by regressing the biggest movers of the previous day against the most commonly discussed stocks the next day, we will see that Redditors' sentiments are influenced by historical market activity.

(1 and a ½) Meet the Team (Bonus Section!)

Our team is composed of Ryan, Carson, and Sami. If I (Ryan) recall correctly, only Sami has any real programming experience from school. The rest of us were only tangentially interested in FinTech and decided to take this course on a whim. Ironically, Ryan ended up writing the bulk of the code, which is why it is so clunky and inelegant.

Back to regular blog programming in 3, 2, 1...

(2) Data Collection

PRAW was instrumental in collecting our data. We considered other methods of obtaining Reddit data, such as pre-compiled archives of Reddit via Pushshift.io, an abstracted scraper like Reddit Data Extractor, or building our own headless browser-based program (e.g., with Puppeteer). However, we found that PRAW was the most reliable, as the data was "live" and came directly from Reddit. ("Live" in the sense that it was not a snapshot of user-generated content, unlike the archived data available through Pushshift.) That said, since Reddit does not store edit history or deleted comments, the live nature of the data meant that we encountered some instances where a comment no longer made sense, or was no longer available. Additionally, we found that there was minimal difference in scraping time between PRAW and aPRAW, as we were rate-limited either way. (We did not use multiple accounts, VPNs, or proxies, as that felt like a violation of some sort of Terms of Service.)

Let's get into our requirements and how that translated into code. For each post, we wanted to examine all stocks mentioned in the top-level comments, keeping Title, Author, Body (the actual comment), and Score (number of upvotes). With the PRAW interface, this was simple enough to do:

import praw
from datetime import datetime

# Assumes an authenticated praw.Reddit instance, subreddit = reddit.subreddit("wallstreetbets"),
# start_date/end_date as datetime objects, and empty lists: titles, authors, comments, upvotes.
for submission in subreddit.search('flair_name:"Daily Discussion"', limit=None):
    # Keep only the discussion threads posted within our study window.
    if start_date <= datetime.fromtimestamp(submission.created_utc) <= end_date:
        if "What Are Your Moves" in submission.title:
            for comment in submission.comments:
                # Skip "load more comments" placeholders instead of expanding them.
                if isinstance(comment, praw.models.MoreComments):
                    continue

                titles.append(submission.title)
                authors.append(comment.author)
                comments.append(comment.body)
                upvotes.append(comment.score)

Then we wrote this data down into a .csv file, as it was small and manageable enough to forgo any need for a database. Overall, data collection was pretty simple, except for the time we decided that we also wanted upvotes in addition to the comment body. (The first set of scraped data omitted this.) We had to run the slow scraping process again, which took about an hour to complete.
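The write-out itself is only a few lines; here is a minimal sketch with pandas (one option among several), assuming the four lists populated by the scraping loop above. The file name is just an illustration:

import pandas as pd

# Collect the scraped fields into one table and dump it to disk.
df = pd.DataFrame({
    "Title": titles,
    "Author": authors,
    "Body": comments,
    "Score": upvotes,
})
df.to_csv("wsb_comments.csv", index=False)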

(3) Data Cleaning — Terrible, Terrible, Grunt Work (made surprisingly easy with NLTK!)

We had to remove all sorts of gunk from the comments, including emojis, GIFs, and embedded media. Redditors really like to add lots of colour to their online quips. After this, it was pretty smooth sailing in the pre-processing stage. Tokenization and lemmatization were simple enough with these magical tools:

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
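For the curious, here is a rough sketch of how these pieces fit together on a single comment. The emoji-stripping regex, the sample comment, and the exact pipeline order are our own simplification rather than the final cleaning script:

import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# (Assumes the NLTK "stopwords" and "wordnet" corpora have been downloaded.)

# Strip anything outside the basic ASCII range (emojis and other embedded symbols).
raw = "YOLO'd my savings into $GME again 🚀🚀 diamond hands"
ascii_only = re.sub(r"[^\x00-\x7F]+", " ", raw)

# Keep word-like tokens only, lowercased, then drop common English stopwords.
tokenizer = RegexpTokenizer(r"\w+")
tokens = [t.lower() for t in tokenizer.tokenize(ascii_only)]
tokens = [t for t in tokens if t not in stopwords.words("english")]

# Reduce each remaining token to its dictionary form.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)  # roughly: ['yolo', 'd', 'saving', 'gme', 'diamond', 'hand']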

Seeing the tokenized and lemmatized comments was pretty impressive to me. Of course, the journey doesn't end here. Now, the data was ready for us to apply other cool processes to it, like entity recognition and sentiment analysis.

