# Data homework 2: Sentiment analysis

Distributed Sep 28; due before class on Oct 5

## Problem 1

Figures 1 and 2 depict the distribution of some scalar modifiers relative to categories at IMDB and Experience Project. (The leftmost panels show *good* outside the scope of negation.) In both figures, the x-axis gives the categories, and the y-axis gives the probability of the category given the word. The vertical bars mark 95% confidence intervals (tiny for IMDB), and the horizontal gray line is the value we would expect if the word were equally frequent in all categories. (For more on the calculations and visualizations, see the Sep 28 handout on classifiers.)
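As a rough illustration of the y-axis quantity and the gray baseline, here is a minimal sketch. It assumes that the probability of a category given a word is obtained by dividing the word's count in each category by that category's total token count and then normalizing those relative frequencies across categories; the handout's actual calculation may differ. The function names and the counts in the usage example are hypothetical.

```python
def category_given_word(word_counts, category_totals):
    """Estimate Pr(category | word) for one word.

    word_counts: dict mapping category -> count of the word in that category
    category_totals: dict mapping category -> total token count in that category

    Assumption: relative frequencies count(w, c) / count(c) are normalized
    across categories so that the returned values sum to 1.
    """
    rel_freq = {
        c: word_counts.get(c, 0) / category_totals[c] for c in category_totals
    }
    total = sum(rel_freq.values())
    if total == 0:
        # The word never occurs, so no category is favored.
        return {c: 0.0 for c in rel_freq}
    return {c: f / total for c, f in rel_freq.items()}


def uniform_baseline(category_totals):
    """Value expected if the word were equally frequent in every category."""
    return 1.0 / len(category_totals)


if __name__ == "__main__":
    # Hypothetical counts for a two-category toy corpus.
    counts = {"low": 10, "high": 20}
    totals = {"low": 100, "high": 100}
    print(category_given_word(counts, totals))
    print(uniform_baseline(totals))
```

Under this assumption, a word whose relative frequency is the same in every category lands exactly on the baseline, whatever the category sizes are.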

Your task: First, describe the basic pattern that you see, noting any linguistically interesting sub-patterns. A lot could be said, but we're imagining one medium-sized paragraph on this. Second, offer a hypothesis about the underlying causes of these distributions. If all is going well, this will tie in with your description. Third, speculate about what the corresponding negative data might look like for the Experience Project corpus (say, *good* in the scope of negation, *depressing*, *bad*, and *terrible*).

## Problem 2

The file imdb-advadj-with-ratings.csv contains 91,713 adverb–adjective pairs from the short summaries attached to user-supplied movie reviews at IMDB.com, along with the associated star rating (1 to 10 stars). In addition, column 2 collapses the ratings into three categories: for a rating R, the category is Pos if R ≥ 8, Neg if R ≤ 3, and Neutral otherwise. Finally, column 5 gives the classification of the adjective according to the Harvard Inquirer: Positiv, Negativ, or Objectiv (if the adjective is listed as neither Positiv nor Negativ). Here's a sample to illustrate the format: