Alex Henning

Naive Bayesian Sentiment Analysis of Live Journal

For my science fair project as a sophomore in high school, I trained multiple naive Bayes classifiers on approximately 3.5 million LiveJournal posts tagged with the top 100 emotions. Once the models were trained, they tried to guess the emotion of 10000 other posts they had not been trained on.
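The train-then-guess loop described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `NaiveBayes`, the helper `top_k_accuracy`, and the use of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # posts seen per mood
        self.word_counts = defaultdict(Counter)  # mood -> word -> count
        self.vocab = set()                       # every word seen in training

    def train(self, tokens, label):
        self.label_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def top_guesses(self, tokens, k=1):
        """Return the k most probable moods, best first."""
        total_posts = sum(self.label_counts.values())
        scores = {}
        for label, n_posts in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Sum log probabilities to avoid floating-point underflow.
            score = math.log(n_posts / total_posts)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(model, test_posts, k):
    """Fraction of (tokens, mood) pairs whose mood is among the top k guesses."""
    hits = sum(label in model.top_guesses(tokens, k) for tokens, label in test_posts)
    return hits / len(test_posts)
```

Allowing more than one guess per post means a post counts as right if its true mood appears anywhere among the model's k most probable moods.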

Hypothesis: Different methods of tokenizing will change the accuracy of the naive Bayes classifiers' guesses.

Methods: I varied how the posts were tokenized (case-sensitive or case-insensitive, counting or ignoring repeated words) and tested each variant on two different scales: one with the top 100 moods, the other a positive-neutral-negative scale.
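The four tokenizing variants can be pictured as two switches on one tokenizer. The sketch below is an assumption about how such a tokenizer might look: the function name `tokenize`, its parameter names, and the word-splitting regular expression are hypothetical and not taken from the project.

```python
import re


def tokenize(post, ignore_case=True, ignore_repeats=True):
    """Split a post into word tokens under the chosen settings."""
    # Assumed word pattern: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", post)
    if ignore_case:
        words = [w.lower() for w in words]
    if ignore_repeats:
        # Keep only the first occurrence of each word, preserving order.
        words = list(dict.fromkeys(words))
    return words
```

For example, `tokenize("So HAPPY happy so happy")` yields `["so", "happy"]`, while the case-sensitive, count-repeats setting keeps all five tokens as written.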

Results

Case sensitivity

Data: (Numbers represent percent right out of 10000)
                                                    1 Guess  3 Guesses  5 Guesses  10 Guesses
Bayes ignore-case count-repeats positive-scale      0.55     100
Bayes case-sensitive count-repeats positive-scale   0.56     100
Bayes ignore-case ignore-repeats positive-scale     0.57     100
Bayes case-sensitive ignore-repeats positive-scale  0.58     100
Bayes ignore-case count-repeats mood-scale          0.06     0.14       0.20       0.33
Bayes case-sensitive count-repeats mood-scale       0.06     0.14       0.20       0.33
Bayes ignore-case ignore-repeats mood-scale         0.07     0.15       0.22       0.34
Bayes case-sensitive ignore-repeats mood-scale      0.07     0.15       0.21       0.34

Case sensitivity offers no advantage when guessing the sentiment out of a hundred distinct moods. On the positive-neutral-negative scale, however, it provides a slight increase in accuracy. Given that the gain is only 1%, it may not be worth the significant increase in RAM usage, which resulted in paging and a significant slowdown. That depends on the use case, though.
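One plausible reason for the extra memory, assumed here rather than stated in the results, is that a case-sensitive model stores "So", "SO" and "so" as separate words, so its vocabulary grows. A small sketch of that effect, using a hypothetical helper `vocabulary_size` and made-up posts:

```python
def vocabulary_size(posts, ignore_case):
    """Count the distinct whitespace-separated words across all posts."""
    vocab = set()
    for post in posts:
        for word in post.split():
            vocab.add(word.lower() if ignore_case else word)
    return len(vocab)


# Made-up sample posts, for illustration only.
posts = ["I am SO happy", "so Happy today", "So tired"]
print(vocabulary_size(posts, ignore_case=True))   # 6
print(vocabulary_size(posts, ignore_case=False))  # 9
```

Every extra vocabulary entry needs its own count for each mood, so the growth is multiplied by the number of moods being tracked.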

Ignoring multiple repeats of the same word in the same post

Data: (Numbers represent percent right out of 10000)
                                                    1 Guess  3 Guesses  5 Guesses  10 Guesses
Bayes ignore-case count-repeats positive-scale      0.55     100
Bayes ignore-case ignore-repeats positive-scale     0.57     100
Bayes case-sensitive count-repeats positive-scale   0.56     100
Bayes case-sensitive ignore-repeats positive-scale  0.58     100
Bayes ignore-case count-repeats mood-scale          0.06     0.14       0.20       0.33
Bayes ignore-case ignore-repeats mood-scale         0.07     0.15       0.22       0.34
Bayes case-sensitive count-repeats mood-scale       0.06     0.14       0.20       0.33
Bayes case-sensitive ignore-repeats mood-scale      0.07     0.15       0.21       0.34

Always ignore repeats. Counting a word more than once per post adds nothing to the naive Bayes classifiers' ability to guess the sentiment of the poster, and actually reduces their accuracy.

Other Results

The reduced positive-neutral-negative scale is significantly more accurate, but provides less information than the mood scale.
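The loss of information comes from several distinct moods collapsing onto the same bucket. The sketch below illustrates this with a hypothetical mapping: the moods listed and their assignments are invented for the example, since the actual grouping used in the project is not given here.

```python
# Hypothetical mapping, for illustration only.
MOOD_TO_POLARITY = {
    "happy": "positive",
    "excited": "positive",
    "sad": "negative",
    "angry": "negative",
    "tired": "neutral",
}


def to_polarity(mood):
    """Collapse a fine-grained mood tag onto the three-way scale."""
    return MOOD_TO_POLARITY[mood]


# "sad" and "angry" become indistinguishable after the reduction.
print(to_polarity("sad"), to_polarity("angry"))  # negative negative
```

With only three possible answers, a random guess is already right about a third of the time, which is part of why accuracy on the reduced scale looks so much higher.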