Machine Learning Model Training 2 (Group Nebula)

By Group "Nebula"

Blog 6: Machine Learning Model Training

Introduction

After extracting non-neutral sentences using FinBERT, we prepare the datasets for model training. These datasets are used to train different models. Finally, we compare the performance of the models trained on the different datasets.

Data for Model Training

Dataset 1: One-hot Encoding (10000 features)
The text is prepared using the approach described in Section 3.2.1. Each text is represented as a binary vector that marks which tokens occur in it, as described in Section 4.2. The mapping dictionary contains only the 10000 most frequent unique tokens in the text.
Dataset 2: One-hot Encoding (10000 features) (BERT Cleaned)
The text is prepared using the approach described in Section 3.2.2. Each text is represented as a binary vector that marks which tokens occur in it, as described in Section 4.2. The mapping dictionary contains only the 10000 most frequent unique tokens in the text.
Dataset 3: Bag-of-words (10000 features)
The text is prepared using the approach described in Section 3.2.1. Each text is represented as a bag-of-words vector that records how often each token occurs in it, as described in Section 4.3. The mapping dictionary contains only the 10000 most frequent unique tokens in the text.
Dataset 4: Bag-of-words (10000 features) (BERT Cleaned)
The text is prepared using the approach described in Section 3.2.2. Each text is represented as a bag-of-words vector that records how often each token occurs in it, as described in Section 4.3. The mapping dictionary contains only the 10000 most frequent unique tokens in the text.
Dataset 5: Sentiment Scores (Positive, Neutral, Negative)
The text is prepared using the approach described in Section 3.2.1. Each text is represented by its positive, neutral and negative sentiment scores, as described in Section 4.4.
Dataset 6: Sentiment Scores (Positive, Neutral, Negative) (BERT Cleaned)
The text is prepared using the approach described in Section 3.2.2. Each text is represented by its positive, neutral and negative sentiment scores, as described in Section 4.4.
Dataset 7: Sentiment Scores (Compound)
The text is prepared using the approach described in Section 3.2.1. Each text is represented by its compound sentiment score, as described in Section 4.4.
Dataset 8: Sentiment Scores (Compound) (BERT Cleaned)
The text is prepared using the approach described in Section 3.2.2. Each text is represented by its compound sentiment score, as described in Section 4.4.
Dataset 9: Tokenized (10000 features) (All words)
The text is prepared using the approach described in Section 3.2.1. Each text is represented as a sequence of tokens, as described in Section 4.1. The mapping dictionary contains only the 10000 most frequent unique tokens in the text.
Dataset 10: Tokenized (10000 features) (BERT Cleaned) (All words)
The text is prepared using the approach described in Section 3.2.2. Each text is represented as a sequence of tokens, as described in Section 4.1. The mapping dictionary contains only the 10000 most frequent unique tokens in the text.
Dataset 11: Tokenized (10000 features) (10000 words)
The text is prepared using the approach described in Section 3.2.1. Each text is represented as a sequence of tokens, as described in Section 4.1. The mapping dictionary contains only the 10000 most frequent unique tokens in the text. Each text is shortened to 10000 words.
Dataset 12: Tokenized (10000 features) (5949 words)
The text is prepared using the approach described in Section 3.2.1. Each text is represented as a sequence of tokens, as described in Section 4.1. The mapping dictionary contains only the 10000 most frequent unique tokens in the text. Each text is shortened to 5949 words.
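
To make the difference between these representations concrete, here is a minimal sketch using only the Python standard library. It is not the code from Sections 4.1 to 4.3. The function names, the whitespace tokenizer, the choice to reserve id 0 for unknown or padding tokens, and the choice to shorten a text by keeping its first words are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(texts, max_features):
    """Map the max_features most frequent tokens to ids 1..max_features.

    Id 0 is left free for unknown/padding tokens (an assumption of this sketch).
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = counts.most_common(max_features)
    return {tok: i + 1 for i, (tok, _) in enumerate(most_common)}


def one_hot(text, vocab):
    """Binary vector: 1 if the token occurs in the text, else 0."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok] - 1] = 1
    return vec


def bag_of_words(text, vocab):
    """Count vector: how many times each token occurs in the text."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok] - 1] += 1
    return vec


def token_ids(text, vocab, max_len=None):
    """Sequence of token ids; tokens outside the dictionary are dropped.

    If max_len is given, only the first max_len ids are kept (assumed
    shortening rule, the post does not state which words are kept).
    """
    ids = [vocab[tok] for tok in text.lower().split() if tok in vocab]
    return ids if max_len is None else ids[:max_len]


# Toy corpus (hypothetical); the blog uses a 10000-token dictionary instead of 3.
docs = [
    "profit rose and profit margin rose",
    "loss widened and debt rose",
]
vocab = build_vocab(docs, max_features=3)

print(one_hot(docs[0], vocab))
print(bag_of_words(docs[0], vocab))
print(token_ids(docs[0], vocab, max_len=3))
```

The one-hot and bag-of-words vectors always have one entry per dictionary token, so they can be fed directly to Logistic Regression or a Multi-layer Perceptron. The token id sequence keeps word order and varies in length, which is why it is paired with the Recurrent Neural Network.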

Machine Learning Models

Logistic Regression (LR)
Multi-layer Perceptron (MLP)
Recurrent Neural Network (RNN)
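
As a rough illustration of how the first two models can be trained with sklearn, the sketch below fits Logistic Regression and a Multi-layer Perceptron on a small synthetic count matrix. It is a minimal sketch and not the experiment code: the random data, the binary label rule, the hidden layer size and the train/test split ratio are all hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for a bag-of-words matrix: 200 texts, 20 token counts.
X = rng.poisson(lam=2.0, size=(200, 20))
# Hypothetical binary label: 1 when token 0 occurs more often than token 1.
y = (X[:, 0] > X[:, 1]).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
}

# Fit each model and record (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, "train=%.2f test=%.2f" % scores[name])
```

Reporting both train and test accuracy makes it easier to see whether a model is overfitting a given dataset.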

Experiment Results

Logistic Regression

[Figures: Logistic Regression results, one panel each for Datasets 1 to 8]

Multi-layer Perceptron

[Figures: Multi-layer Perceptron results, one panel each for Datasets 1 to 8]

Recurrent Neural Network

[Figures: Recurrent Neural Network results for Datasets 9 to 12, with Train and Test panels for each dataset]

Summary

Experiment Findings

Sentiment scores from the NLTK VADER package do not work well. Possible reasons:
  The text is too long
  Inconsistent context: VADER is tuned for social media text, but we use it for financial sentiment analysis
  Insufficient data cleaning
  Limited training data
FinBERT data cleaning helps boost performance
Bag-of-words performs better than binary vectorization (one-hot encoding)
The Recurrent Neural Network performs best among the three models

Future work

Explore Recurrent Neural Networks further
Explore data cleaning further
Prepare more data
Use a sentiment analysis package built for financial text

Difficulties encountered

The biggest challenge in this step is the runtime of sentiment analysis. Because the text data is very long, sentiment analysis using the NLTK package takes some time.

Another challenge is RNN training. An RNN is more complicated than the other machine learning models. Although TensorFlow 2 provides a very high-level API for RNNs, it is still more complicated than sklearn, so some time is needed to explore TensorFlow 2. Moreover, the training time for an RNN is very long: training an RNN on Dataset 9 takes around an hour. Hence, it is hard to experiment with neural networks on an ordinary computer without GPUs.
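
For readers who have not used the TensorFlow 2 Keras API, the following is a minimal sketch of what an RNN classifier over token id sequences can look like. It is not the network used in the experiments: the layer sizes, the `SimpleRNN` layer, the binary label, the toy sequence length and the random data are all assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

vocab_size = 10000  # size of the mapping dictionary used for Datasets 9 to 12
max_len = 50        # toy sequence length (assumption); real texts are far longer

# Embedding -> recurrent layer -> sigmoid output for an assumed binary label.
# input_dim is vocab_size + 1 because id 0 is assumed to be reserved for padding.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size + 1, output_dim=16),
    tf.keras.layers.SimpleRNN(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random stand-in data: 8 sequences of token ids with random 0/1 labels.
rng = np.random.default_rng(0)
x = rng.integers(1, vocab_size + 1, size=(8, max_len))
y = rng.integers(0, 2, size=(8,))

model.fit(x, y, epochs=1, batch_size=4, verbose=0)
probs = model.predict(x, verbose=0)
print(probs.shape)
```

The recurrent layer processes each sequence one token at a time, so the cost of every training step grows with the sequence length. This is one reason why shortening the text, as in Datasets 11 and 12, reduces training time.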