Below are the rating distributions for several review corpora in which each text carries a star rating from 1 to 5 (1 negative, 5 positive).
(This problem concerns sentiment analysis, but the underlying issues are common wherever one is dealing with naturalistic corpora.)
[Figure: distribution of star ratings, English product reviews]
It's common for features in a model to have a kind of split personality due to sources of variation that have not been isolated. Very often, identifying these hidden factors can lead to better performance and increased interpretability of the model.
The following plots are derived from data at the Experience Project website. At the site, community members can post confessional texts, and others can react to them by clicking on a set of reaction categories: 'Sorry, hugs' (sympathy), 'You rock' (positive enthusiasm), 'Teehee' (amusement), 'I understand' (solidarity), and 'Wow, just wow' (disapproval and shock). The plots depict probability distributions over these categories for four words: bad, angry, depressed, and arrested. You can think of the distributions as P(reaction | word): the probability of each kind of reaction given that the text contains the word in question.
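To make P(reaction | word) concrete, here is a minimal sketch of one way such a distribution could be estimated. It assumes each text is stored alongside a count of the clicks it received in each category. The function name `reaction_distribution`, the data layout, and the toy texts and counts are all hypothetical and are not the actual Experience Project data or the method used to produce the plots.

```python
from collections import Counter

# The five reaction categories described in the text.
CATEGORIES = ["Sorry, hugs", "You rock", "Teehee", "I understand", "Wow, just wow"]


def reaction_distribution(texts, word):
    """Estimate P(reaction | word) from (text, reaction_counts) pairs.

    Pools the reaction clicks of every text containing `word` and
    normalizes them to sum to 1. Returns {} if no matching text has
    any reactions.
    """
    totals = Counter()
    for text, reactions in texts:
        # Naive whitespace tokenization (an assumption): tokens with
        # attached punctuation, such as "arrested.", will not match.
        if word in text.lower().split():
            totals.update(reactions)
    n = sum(totals.values())
    if n == 0:
        return {}
    return {cat: totals[cat] / n for cat in CATEGORIES}


# Invented toy data, for illustration only.
toy_corpus = [
    ("I was arrested last night", {"Sorry, hugs": 3, "Wow, just wow": 1}),
    ("My brother got arrested for theft", {"Wow, just wow": 4, "I understand": 1}),
    ("I feel depressed today", {"Sorry, hugs": 5, "I understand": 2}),
]

dist = reaction_distribution(toy_corpus, "arrested")
for category, prob in dist.items():
    print(f"{category}: {prob:.2f}")
```

Because the clicks are pooled across every text containing the word, a single distribution can blend reactions to quite different kinds of texts.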
Your task: focus on the rightmost plot, for arrested. The other plots are there to help you contextualize this one. The fact that the two most probable categories are 'Sorry, hugs' and 'Wow, just wow' is unusual. What might be causing the split between sympathetic and shocked reactions? (2-3 sentence response.)