The goal of our project is to examine the correlation between the price of cryptocurrency and market sentiment. Since crypto has little physical asset as a pricing basis, its price is, in theory, determined by market consensus, which makes market sentiment the major pricing factor. We choose Bitcoin (BTC) as our target because it is the most popular cryptocurrency and has the highest trading volume, so its price is the least susceptible to manipulation. To measure market sentiment around BTC, we initially chose three major forums as our sample: Reddit, Bitcoin Talk, and X (Twitter). However, X has deployed anti-crawling measures since its acquisition by Elon Musk, so we ultimately use Reddit and Bitcoin Talk as our data sources.

BTC Price Data

We download the BTC price data directly from Yahoo Finance and save it as a .csv file.
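
The same history could also be pulled programmatically; the following is a minimal sketch that assumes the third-party yfinance package, and the ticker, date range, and file name are illustrative rather than part of our actual workflow.

import yfinance as yf

# Download daily BTC-USD candles from Yahoo Finance and save them as a CSV,
# mirroring the manual export described above.
btc = yf.download('BTC-USD', start='2015-01-01', end='2024-01-01')
btc.to_csv('BTC-USD.csv')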

Reddit

Reddit is a large social news and discussion website founded in 2005 by Steve Huffman and Alexis Ohanian. Users submit content such as links, text posts, and images, which other users then vote on. Because of its huge user base, it hosts some of the most active investor communities in the world, making it an ideal source for judging market sentiment toward BTC. We assume that the key information is contained in post headlines, so we collect Reddit data by directly searching for "BTC" and "bitcoin".

In the following code, our group scrapes data from Reddit. Through recursive calls and time intervals between requests, the code navigates through multiple pages of search results. We scrape the titles of all posts containing the keywords "BTC" or "Bitcoin", together with the posts' timestamps, for use in the sentiment analysis model.

import requests
import re
import time
from datetime import datetime


def parse_time(gmt_time):
    # Convert the ISO-8601 timestamp scraped from the page into
    # a "YYYY-MM-DD HH:MM:SS" string.
    dt = datetime.fromisoformat(gmt_time)
    standard_time = dt.strftime("%Y-%m-%d %H:%M:%S")
    return standard_time


def write(title_list, publish_time_list):
    # Append "timestamp:title" records, separated by dashed lines, to the
    # output file F opened in the __main__ block below.
    assert len(title_list) == len(publish_time_list)
    for title, tim in zip(title_list, publish_time_list):
        F.write(parse_time(tim) + ':' + title)
        F.write('\n-----------\n')
    F.flush()


def extract(html):
    # Pull the post titles, the link to the next results page, and the post
    # timestamps out of the raw HTML with regular expressions.
    content_list = re.findall(r'"title":"(.*?)"', html)
    next_page_url = re.findall(r'loading="lazy" src="(.*?)">', html)
    publish_time_list = re.findall(r'<faceplate-timeago.*?ts="(.*?)"', re.sub(r'\s', '', html))
    if content_list:
        write(content_list, publish_time_list)
    if next_page_url:
        next_page_url = 'https://www.reddit.com' + next_page_url[0].replace('amp;', '')
        print(next_page_url + '\n-----------')
        return next_page_url
    else:
        print('No more pages ~')


def get_all_title(url):
    # Fetch one page of search results, extract its titles, then recurse into
    # the next page. A browser-like User-Agent is sent because Reddit often
    # rejects the default python-requests one.
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    next_page_url = extract(response.text)
    if next_page_url:
        time.sleep(1)  # pause between requests before fetching the next page
        get_all_title(next_page_url)


if __name__ == '__main__':
    query = 'bitcoin'  # search query

    with open(f'{query}.txt', 'a', encoding='utf-8') as F:
        get_all_title(f'https://www.reddit.com/search/?q={query}')
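
Each appended record in the output file has the form "YYYY-MM-DD HH:MM:SS:title" followed by a dashed separator. The snippet below is a minimal sketch, assuming that layout, of how the saved records could be read back as (timestamp, title) pairs for the sentiment analysis step; the helper name load_titles is ours, not part of the scraper.

def load_titles(path='bitcoin.txt'):
    # Split on the '-----------' separators written by write() above, then cut
    # each record at character 19 (the fixed-width timestamp) to recover the pair.
    records = []
    with open(path, encoding='utf-8') as f:
        for chunk in f.read().split('\n-----------\n'):
            chunk = chunk.strip()
            if len(chunk) > 20:
                records.append((chunk[:19], chunk[20:]))
    return records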

Bitcoin Talk

Bitcoin Talk is an online forum dedicated to Bitcoin and related cryptocurrencies, and it is one of the oldest and most influential forums in the cryptocurrency community. It is widely regarded as a platform where cryptocurrency enthusiasts, miners, developers, and investors exchange information, discuss technical issues, share news, and promote new projects. The code below shows how we scrape the data from Bitcoin Talk.

In the following code, our group scrapes data from the Bitcoin Talk website. We enter the Bitcoin Forum board and scrape the headlines of all posts. The code extracts post titles and date information from successive pages of the forum and appends them to final_list, so that the text can later be fed into the market sentiment analysis model.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# CREATE EMPTY LIST TO ADD POSTS AND DATES TO
final_list = []

# SCRAPER BEGINS HERE
# bitcointalk board URLs end in this format: the number before the dot is the
# board id, and the number after it is a page offset (40 topics per page),
# so x.40 is page 2 of board x, x.80 is page 3, x.120 is page 4, etc.

url_end = '1.40'

while url_end != '1.47760':  # set this to the LAST page you want to scrape (1.47760 is the first page of the board in 2024)
    url = 'https://bitcointalk.org/index.php?board={}'.format(url_end)
    page = requests.get(url)
    page_content = BeautifulSoup(page.content,'html.parser')

    # GET POST TITLES
    all_td = page_content.find_all('td', {'class': 'windowbg'})
    cleanlist = []
    for row in all_td[::3]:  # select only every third row because some empty rows have this class
        text = row.getText()
        text = text.replace('\n', '')
        text = text.replace('«', '')
        text = text.replace('»', '')
        text = text.replace(' \xa0All ', '')
        text = text.replace(' 1 2', '')
        text = text.replace(' ... ', '')
        cleanlist.append(text)

    #GET POST DATES
    all_date = page_content.find_all('td', {'class': 'windowbg2 lastpostcol'})
    cleanlist2 = []
    for row in all_date:
        text = row.getText()
        text = text.replace('\n', '')
        text = text.replace('\t', '')
        text = text.split('by', 1)
        realtext = text[0]
        cleanlist2.append(realtext)

    # MERGE TITLES AND DATES INTO (title, date) TUPLES
    for row in zip(cleanlist, cleanlist2):
        final_list.append(row)

    #SET TO NEXT URL TO SCRAPE
    url_end = url_end.split('.', 1)
    url_end0 = url_end[0]
    url_end1 = url_end[1]
    url_end1 = int(url_end1)
    url_end1 += 40
    url_end = url_end0 + "." + str(url_end1)

    #PRINT FEEDBACK AND PAUSE FOR SERVER RULES
    print("Processing, page scraped, new URL: " + url_end)
    time.sleep(1)
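
When the loop finishes, final_list holds one (title, date) tuple per post. As a small sketch of the final step, the tuples can be written out with pandas (already imported above); the column names and file name here are illustrative, not taken from our original code.

# Convert the scraped (title, date) tuples into a table and save them for the
# sentiment analysis model.
df = pd.DataFrame(final_list, columns=['title', 'last_post_date'])
df.to_csv('bitcointalk_posts.csv', index=False)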
