For my science fair project as a sophomore in high school, I trained multiple naive Bayes classifiers on approximately 3.5 million LiveJournal posts tagged with the top 100 emotions. Once the classifiers were trained, they tried to guess the emotion of 10,000 other posts they had not been trained on.

**Hypothesis:** Different methods of tokenizing will change
the accuracy of the naive Bayes guesses.

I tested four methods of tokenizing the posts, each on two different scales: one with the top 100 moods, the other a positive-neutral-negative scale.

- Case insensitive, counting repeated uses of the same word.
- Case sensitive, counting repeated uses of the same word.
- Case insensitive, not counting repeated uses of the same word.
- Case sensitive, not counting repeated uses of the same word.
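The four variants above can be sketched with two switches on a single tokenizer. This is only an illustration: the project's actual tokenizer is not shown, so the function name `tokenize`, its parameters, and the choice to split on whitespace are all assumptions.

```python
from collections import Counter


def tokenize(text, ignore_case=True, count_repeats=True):
    """Turn a post into a bag of words (hypothetical helper, not the original code).

    ignore_case:   fold everything to lowercase before counting.
    count_repeats: keep per-word counts; otherwise record presence only.
    """
    if ignore_case:
        text = text.lower()
    # Assumption: words are separated by whitespace; punctuation is left attached.
    words = text.split()
    if count_repeats:
        return Counter(words)
    # Presence only: every distinct word gets a count of exactly 1.
    return Counter(set(words))


if __name__ == "__main__":
    post = "Happy happy happy but tired"
    for ignore_case in (True, False):
        for count_repeats in (True, False):
            bag = tokenize(post, ignore_case, count_repeats)
            print(ignore_case, count_repeats, dict(bag))
```

Returning a `Counter` in every case keeps the four variants interchangeable for whatever consumes the tokens.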

| Classifier | 1 Guess | 3 Guesses | 5 Guesses | 10 Guesses |
|---|---|---|---|---|
| Bayes ignore-case count-repeats positive-scale | 0.55 | 1 | 0 | 0 |
| Bayes case-sensitive count-repeats positive-scale | 0.56 | 1 | 0 | 0 |
| Bayes ignore-case ignore-repeats positive-scale | 0.57 | 1 | 0 | 0 |
| Bayes case-sensitive ignore-repeats positive-scale | 0.58 | 1 | 0 | 0 |
| Bayes ignore-case count-repeats mood-scale | 0.06 | 0.14 | 0.20 | 0.33 |
| Bayes case-sensitive count-repeats mood-scale | 0.06 | 0.14 | 0.20 | 0.33 |
| Bayes ignore-case ignore-repeats mood-scale | 0.07 | 0.15 | 0.22 | 0.34 |
| Bayes case-sensitive ignore-repeats mood-scale | 0.07 | 0.15 | 0.21 | 0.34 |
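The "N Guesses" columns read like a top-N accuracy: a post counts as correct when its true label is among the classifier's N most likely guesses. That reading is an assumption, as is the function name `top_k_accuracy`. A minimal sketch of the calculation:

```python
def top_k_accuracy(ranked_guesses, true_labels, k):
    """Fraction of posts whose true label is among the first k ranked guesses."""
    hits = sum(
        1
        for guesses, truth in zip(ranked_guesses, true_labels)
        if truth in guesses[:k]
    )
    return hits / len(true_labels)


if __name__ == "__main__":
    # Made-up rankings for three posts on a three-label scale.
    ranked = [
        ["positive", "neutral", "negative"],
        ["neutral", "positive", "negative"],
        ["negative", "neutral", "positive"],
    ]
    truth = ["positive", "positive", "positive"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(ranked, truth, k))
```

Under this reading, three guesses on a three-label scale always include the true label, which is consistent with the 1 in the "3 Guesses" column for the positive-scale rows.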

Case sensitivity offers **no** advantage for guessing the
sentiment out of a hundred distinct moods. On the
positive-neutral-negative scale, however, case sensitivity provides
a *slight* increase in first-guess accuracy (0.55 to 0.56 when counting
repeats, 0.57 to 0.58 when ignoring them). Given that the gain is *only*
one percentage point, it may not be worth the significant increase in RAM
use, which resulted in paging and a significant slowdown. That depends on
the use case, though.
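One likely reason for the extra memory is that a case-sensitive vocabulary keeps separate entries for "So", "SO" and "so", so every per-class word table grows. The sketch below illustrates the effect on a made-up corpus; the helper name `vocabulary` and the whitespace split are assumptions, not the project's code.

```python
def vocabulary(posts, ignore_case):
    """Collect the set of distinct tokens across all posts."""
    vocab = set()
    for post in posts:
        if ignore_case:
            post = post.lower()
        vocab.update(post.split())
    return vocab


if __name__ == "__main__":
    # Tiny invented corpus, only to show the effect of case folding.
    posts = ["I am SO happy", "so Happy today", "So tired"]
    print("ignore-case:   ", len(vocabulary(posts, ignore_case=True)))
    print("case-sensitive:", len(vocabulary(posts, ignore_case=False)))
```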

| Classifier | 1 Guess | 3 Guesses | 5 Guesses | 10 Guesses |
|---|---|---|---|---|
| Bayes ignore-case count-repeats positive-scale | 0.55 | 1 | 0 | 0 |
| Bayes ignore-case ignore-repeats positive-scale | 0.57 | 1 | 0 | 0 |
| Bayes case-sensitive count-repeats positive-scale | 0.56 | 1 | 0 | 0 |
| Bayes case-sensitive ignore-repeats positive-scale | 0.58 | 1 | 0 | 0 |
| Bayes ignore-case count-repeats mood-scale | 0.06 | 0.14 | 0.20 | 0.33 |
| Bayes ignore-case ignore-repeats mood-scale | 0.07 | 0.15 | 0.22 | 0.34 |
| Bayes case-sensitive count-repeats mood-scale | 0.06 | 0.14 | 0.20 | 0.33 |
| Bayes case-sensitive ignore-repeats mood-scale | 0.07 | 0.15 | 0.21 | 0.34 |

**Always** ignore repeats. Counting a word more than
once adds nothing to the naive Bayes classifier's ability to guess
the sentiment of the poster, and actually reduces its accuracy.
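To see where repeat counts enter the calculation, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is not the project's implementation: the class name `NaiveBayes`, its methods, and the smoothing choice are assumptions. In this formulation a word that appears n times contributes n times its log-probability, so a single repeated word can dominate the score; ignoring repeats caps each word's contribution at one.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()            # posts seen per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, tokens, label):
        """tokens: a mapping of word -> count for one post."""
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def rank(self, tokens):
        """Return all labels, most likely first."""
        total_posts = sum(self.class_counts.values())
        scores = {}
        for label, n_posts in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_posts / total_posts)  # log prior
            for word, n in tokens.items():
                p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                score += n * math.log(p)  # repeats multiply the word's weight
            scores[label] = score
        return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train(Counter("so happy and great".split()), "positive")
    nb.train(Counter("so sad and awful".split()), "negative")
    print(nb.rank(Counter("happy happy day".split())))
```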

The reduced positive-neutral-negative scale was significantly more accurate on the first guess (0.55 to 0.58, versus 0.06 to 0.07 on the mood scale), but it provides less information than the mood scale.