CSCI 150 - Lab 9 - Sentiment Analysis


Materials

Python dictionaries documentation.
Movie review data

Overview

Sentiment Analysis is a Big Data problem which seeks to determine the general attitude of a writer given some text they have written. For instance, we would like to have a program that could look at the text "The film was a breath of fresh air" and realize that it was a positive statement, while "It made me want to poke out my eyeballs" is negative.

One algorithm that we can use for this is to assign a numeric value to any given word based on how positive or negative that word is and then score the statement based on the values of the words. But, how do we come up with our word scores in the first place?

That's the problem that we'll solve in this assignment. You are going to search through a file containing movie reviews from the Rotten Tomatoes website, each of which has both a numeric score and text. You'll use this to learn which words are positive and which are negative. The data file looks like this:

1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
4 This quiet , introspective and entertaining independent is worth seeking .
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .
3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .
1 Aggressive self-glorification and a manipulative whitewash .
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .
1 Narratively , Trouble Every Day is a plodding mess .
3 The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations
1 But it does n't leave you with much .
1 You could hate it for the same reason .
...

Note that each review starts with a number 0 through 4 with the following meaning:

0 : negative
1 : somewhat negative
2 : neutral
3 : somewhat positive
4 : positive
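
The line format above can be taken apart with a single split. Here is a minimal sketch, assuming the score and the words are separated by whitespace as in the sample lines; the helper name parse_review is hypothetical and not part of the assignment.

```python
from typing import List, Tuple


def parse_review(line: str) -> Tuple[int, List[str]]:
    """Split one review line into its numeric score and its lowercased words.

    parse_review is a hypothetical helper used only for illustration.
    """
    tokens = line.split()
    score = int(tokens[0])  # the first token is the 0-4 rating
    words = [token.lower() for token in tokens[1:]]
    return score, words


score, words = parse_review("1 But it does n't leave you with much .")
print(score)  # 1
print(words)  # ['but', 'it', 'does', "n't", 'leave', 'you', 'with', 'much', '.']
```

Notice that punctuation marks are separate tokens in the data, so they show up as "words" too.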

Step 1: Word scoring

To start, you will write a program (sentiment.py) that simply prompts the user for a word, then reads through the entire review database and computes the average score of all reviews containing the word.

Specifically, you should write two functions: score_word(word: str) -> float and score_user_word().

The first function, score_word(word: str) -> float, should:

Open the movieReviews.txt file, loop through it, and compute the average rating for reviews containing the given word.
Return the average rating, or the special value None if the word does not occur in any reviews.

Some hints:

score_word should not prompt the user or print anything!
Use the string split() method to split each line of the file into words. If words = line.split(), then words[0] will be the review score, and words[1:] will be all the words in the review.
The average rating for a word can be computed as the total score of reviews containing the word, divided by the number of reviews containing the word.
Be careful to handle words with different capitalizations, for example, by converting everything to lowercase.
Be careful to look only for entire words. For example, the word "bad" does not occur in the review "This is a movie about a troubadour.", even though the letters "bad" do occur as a substring. The easiest way to deal with this is to split the review into a list of words, and check whether the given word exists as an element of the list (for example, using Python's in operator).
Even if the word occurs multiple times in a review, you should only count the review once. (You probably don't have to do anything special to make this work; you will get this behavior by default if you just check for the presence of the word in the review.)
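
The hints above can be sketched as follows. This is only an illustration under some assumptions: it works on a small list of strings rather than on movieReviews.txt, and the name average_score is made up for the example. Your score_word will need to read the file instead.

```python
from typing import List, Optional


def average_score(lines: List[str], word: str) -> Optional[float]:
    """Average the scores of the review lines containing word as a whole word.

    average_score is a hypothetical stand-in that takes the reviews as a list.
    """
    target = word.lower()
    total = 0
    count = 0
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        if target in tokens[1:]:  # list membership matches whole words only
            total += int(tokens[0])
            count += 1  # one count per review, however often the word repeats
    if count == 0:
        return None
    return total / count


sample = [
    "0 A bad bad movie .",
    "4 This is a movie about a troubadour .",
    "3 Not bad at all .",
]
print(average_score(sample, "bad"))    # 1.5
print(average_score(sample, "BAD"))    # 1.5
print(average_score(sample, "koala"))  # None
```

The troubadour review is not counted for "bad" because the check is against a list of words, not against the line as a string.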

The second function, score_user_word(), should:

input a word from the user.
Call score_word to compute the average score for reviews containing the user's word.
Report the average rating to the user (via print), along with an assessment of whether the word is positive (score greater than 2) or negative (score less than or equal to 2).

For example, here is some sample output from calling score_user_word four times:

>>> score_user_word()
Please enter a word: cat
Score: 3.16666666667
Positive! =)
>>> score_user_word()
Please enter a word: CAt
Score: 3.16666666667
Positive! =)
>>> score_user_word()
Please enter a word: dog
Score: 1.631578947368421
Negative. =(
>>> score_user_word()
Please enter a word: koala
Sorry, that word is not in the database.
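
One detail in the output above deserves care: a word can legitimately have an average score of 0.0, which Python treats as false. The sketch below, which assumes a hypothetical helper named describe_score, therefore tests for None explicitly instead of writing "if not score".

```python
from typing import Optional


def describe_score(score: Optional[float]) -> str:
    """Build the message for a word score (hypothetical helper for illustration)."""
    if score is None:  # "is None" keeps a real score of 0.0 from being mistaken for missing
        return "Sorry, that word is not in the database."
    if score > 2:
        verdict = "Positive! =)"
    else:
        verdict = "Negative. =("
    return "Score: " + str(score) + "\n" + verdict


print(describe_score(3.5))
print(describe_score(0.0))
print(describe_score(None))
```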

Step 2: Phrase scoring

In this step, you will score an entire phrase instead of a single word. Again, you should write two functions:

score_phrase(phrase: str) -> float should split the given phrase into words, call score_word on each, and compute the average score of the whole phrase. (Be careful to handle words for which score_word returns None: they should not contribute to the average.)
score_user_phrase(), similarly to score_user_word(), should prompt the user for a phrase, and tell the user its average score along with whether it is positive or negative.

For example:

>>> score_user_phrase()
Please enter a word or phrase: My helicopter is full of eels
Score: 2.12112861734
Positive! =)
>>> score_user_phrase()
Please enter a word or phrase: It made me want to poke out my eyeballs
Score: 1.76536704587
Negative. =(
>>> score_user_phrase()
Please enter a word or phrase: I am flying higher than a kite
Score: 2.42411951303
Positive! =)
>>> score_user_phrase()
Please enter a word or phrase: Koala refectory disentanglements
Sorry, none of those words are in the database.
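
The averaging rule for phrases can be sketched on its own, separately from the file reading. The function below is an illustration only: it assumes the word scores have already been collected into a list, and the name average_ignoring_none is hypothetical.

```python
from typing import List, Optional


def average_ignoring_none(scores: List[Optional[float]]) -> Optional[float]:
    """Average the scores that are not None; return None if there are none."""
    total = 0.0
    known = 0
    for score in scores:
        if score is not None:
            total += score
            known += 1  # only known words count toward the divisor
    if known == 0:
        return None
    return total / known


print(average_ignoring_none([3.0, None, 2.0]))  # 2.5
print(average_ignoring_none([None, None]))      # None
```

Dividing by the number of known words, rather than by the number of words in the phrase, is what keeps unknown words from dragging the average toward zero.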

You will probably notice a slight lag between entering a phrase and seeing its score. This is because your program has to re-open and re-read the entire movie review database for each word! As an experiment, try entering a phrase like "a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a", with many small words. The more words you enter, the longer your program will take, even if the words are all the same. There has to be a better way!

Step 3: Dictionaries to the rescue!

Instead of reading through the whole review database for each word, we would like to read the database only once, when the program starts, and pre-compute the score for every word that appears. We can then save these word scores in a dictionary, so we can quickly look up the score for any word without re-reading the entire database.

First, write a function read_movie_reviews() -> Dict[str, float], which reads the file movieReviews.txt and returns a dictionary mapping from words to their average scores. The basic idea is to loop through the reviews in the file while keeping track of two dictionaries: one maps from words to a count of how many reviews each word has appeared in; the other maps from words to the total score of reviews the word has appeared in.

Hints:

For each line in the database, you want to count each unique word only once. You may find it helpful to have a list of words that is re-initialized for each new line of input. This list will have an entry for each word from that line that has already been seen. For each new word, you can make sure it is not in the list before counting it (and adding it to the list).
Be careful when testing read_movie_reviews. Calling the function from the Python shell produces such a large output that it could crash PyCharm! Instead, store the dictionary in a variable, from which you can test it, like this:
```
>>> d = read_movie_reviews()
>>> d['cat']
3.1666666666666665
>>> d['dog']
1.631578947368421
>>>
```

After going through the whole file in this way, divide the total score for each word by the number of reviews in which it appears to derive the average score for the word.
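
The two-dictionary idea can be sketched like this. As before, this is an illustration under assumptions: it takes the reviews as a list of strings instead of reading movieReviews.txt, and the name average_scores is made up for the example.

```python
from typing import Dict, List


def average_scores(lines: List[str]) -> Dict[str, float]:
    """Map each word to the average score of the review lines it appears in."""
    counts: Dict[str, int] = {}  # word -> number of reviews containing it
    totals: Dict[str, int] = {}  # word -> sum of the scores of those reviews
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        score = int(tokens[0])
        seen: List[str] = []  # words already counted for this line
        for word in tokens[1:]:
            if word not in seen:
                seen.append(word)
                counts[word] = counts.get(word, 0) + 1
                totals[word] = totals.get(word, 0) + score
    averages: Dict[str, float] = {}
    for word in counts:
        averages[word] = totals[word] / counts[word]
    return averages


sample = ["4 a good good film .", "1 a dull film ."]
scores = average_scores(sample)
print(scores["good"])  # 4.0
print(scores["film"])  # 2.5
```

Here "good" appears twice in the first review but is counted once, so its average is 4.0 rather than being skewed by the repetition.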

Next, write a function score_phrase_dict(word_scores: Dict[str, float], phrase: str) -> float, which works similarly to score_phrase, but uses the dictionary word_scores of computed scores instead of looking through the review data each time. Hint: you can just copy your code from score_phrase and modify it to use the dictionary instead of calling score_word.

Here is an example of testing score_phrase_dict:

>>> d = read_movie_reviews()
>>> score_phrase_dict(d, 'My helicopter is full of eels')
2.121128617338458
>>> score_phrase_dict(d, 'It made me want to poke out my eyeballs')
1.7653670458714505
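
To see why the dictionary helps, here is a small sketch of a phrase average that never touches the review file. The toy dictionary and the name average_from_dict are assumptions for the example; the real scores come from read_movie_reviews.

```python
from typing import Dict, Optional


def average_from_dict(word_scores: Dict[str, float], phrase: str) -> Optional[float]:
    """Average the scores of the words in phrase that appear in word_scores."""
    total = 0.0
    known = 0
    for word in phrase.lower().split():
        if word in word_scores:  # a dictionary lookup, not a pass over the file
            total += word_scores[word]
            known += 1
    if known == 0:
        return None
    return total / known


toy_scores = {"fresh": 3.0, "air": 2.0}  # made-up values for illustration
print(average_from_dict(toy_scores, "Fresh air koala"))  # 2.5
print(average_from_dict(toy_scores, "koala"))            # None
```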

Step 4: Main

Write a main() function which calls read_movie_reviews to create a word score dictionary, then repeatedly prompts the user for phrases and scores them. Include a call to main() at the bottom of your program so it runs automatically.

Here is an example of what the output from your program might look like:

Reading movie reviews...
Please enter a phrase, or quit: My helicopter is full of eels
Score: 2.12112861734
Positive! =)
Please enter a phrase, or quit: It made me want to poke out my eyeballs
Score: 1.76536704587
Negative. =(
Please enter a phrase, or quit: I am flying higher than a kite
Score: 2.42411951303
Positive! =)
Please enter a phrase, or quit: Koala refectory disentanglements
Sorry, none of those words are in the database.
Please enter a phrase, or quit: quit
Goodbye!
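
The prompting loop in the output above follows a common pattern: keep asking until a special "quit" value is entered. The sketch below shows one way to structure it. The names run_prompt_loop, handle_phrase and read_line are hypothetical, and treating "quit" case-insensitively is an assumption. The read_line parameter defaults to input, and it is replaced by scripted answers here only so the example runs without typing.

```python
from typing import Callable


def run_prompt_loop(handle_phrase: Callable[[str], object],
                    read_line: Callable[[str], str] = input) -> int:
    """Ask for phrases until "quit" is entered; return how many were handled."""
    handled = 0
    while True:
        phrase = read_line("Please enter a phrase, or quit: ")
        if phrase.strip().lower() == "quit":
            print("Goodbye!")
            return handled
        handle_phrase(phrase)  # in the lab, this is where the phrase gets scored
        handled += 1


# Scripted answers stand in for a user at the keyboard.
scripted = iter(["fresh air", "quit"])
run_prompt_loop(print, read_line=lambda prompt: next(scripted))
```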

Step 5: Word score caching

There may still be a noticeable lag when your program starts, while it reads the movie reviews and constructs the dictionary of word scores. We can make it even faster by caching the word score data in another file so it does not have to be recomputed next time the program runs. For this step, modify your main function so it does the following:

Look to see whether the file movieReviews.cache.txt exists. You can do this by adding import os at the top of your program, and then using the os.path.isfile() function, which takes a file name as an argument and returns True if it exists and False otherwise.
If the cache file does not exist, read the movie review database and construct the word score dictionary using the read_movie_reviews() function as usual. Then use a loop to write() the contents of the dictionary to the cache file using the format
```
word1 score1
word2 score2
word3 score3
...
  
```

For example, the first few lines of the cache file might look like this:

aided 4.0
writings 1.0
bad-boy 2.0
ryoko 2.0
yellow 1.5
pony 0.0
four 2.08695652174

On the other hand, if the cache file does exist, then read it instead of the movie reviews, putting the data directly into a dictionary of word scores.
In either case, return the created dictionary of word scores.

(Of course, you probably don't want to do all of that directly in your main function; you should decompose the behavior into more functions as appropriate.)

This procedure means that the work to compute the dictionary of word scores from the movie reviews will be done only once, the first time your program runs. On subsequent runs, the word scores will be read directly from the cache file. This should be noticeably faster, since the cache file is smaller than the movie review file and reading it requires no further computation.
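
One possible decomposition of the caching logic is sketched below. The function names write_cache, read_cache and load_or_build are hypothetical, and the demonstration uses a temporary folder and a made-up dictionary rather than movieReviews.cache.txt, so treat it as an outline under those assumptions.

```python
import os
import tempfile
from typing import Callable, Dict


def write_cache(word_scores: Dict[str, float], path: str) -> None:
    """Write one "word score" pair per line."""
    with open(path, "w", encoding="utf-8") as cache_file:
        for word, score in word_scores.items():
            cache_file.write(word + " " + str(score) + "\n")


def read_cache(path: str) -> Dict[str, float]:
    """Rebuild the dictionary from a file written by write_cache."""
    word_scores: Dict[str, float] = {}
    with open(path, encoding="utf-8") as cache_file:
        for line in cache_file:
            parts = line.split()
            if len(parts) == 2:  # ignore blank or malformed lines
                word_scores[parts[0]] = float(parts[1])
    return word_scores


def load_or_build(path: str, build: Callable[[], Dict[str, float]]) -> Dict[str, float]:
    """Use the cache if it exists; otherwise build the scores and save them."""
    if os.path.isfile(path):
        return read_cache(path)
    word_scores = build()
    write_cache(word_scores, path)
    return word_scores


with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "demo.cache.txt")
    first = load_or_build(cache_path, lambda: {"pony": 0.0, "yellow": 1.5})
    second = load_or_build(cache_path, lambda: {})  # cache exists, so build is not called
    print(first == second)  # True
```

In the lab itself, the build step would be your read_movie_reviews function.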

Step 6: Reducing code duplication

Examining the code you wrote in the first five steps, you may see some opportunities to reduce code duplication. Find at least two such opportunities, write appropriate functions that factor out the commonalities, and then use the functions to eliminate the duplication.

In your Lab Evaluation Document, identify the functions you wrote for the purpose of reducing code duplication. Discuss the reasons that the duplication originally occurred, and also discuss the strategy you employed for eliminating the duplication.

What to Hand In

Make sure you have followed the Python Style Guide, and have run your project through the Automated Style Checker.

You must hand in:

sentiment.py
Lab Evaluation Document

Grading

To earn a D on this lab, complete Steps 1 and 2 (score_word, score_user_word, score_phrase, score_user_phrase).
To earn a C on this lab, complete Step 3 (read_movie_reviews, score_phrase_dict).
To earn a B on this lab, complete Step 4 (main).
To earn an A on this lab, complete Step 5 (word score caching).
To earn 100 on this lab, complete Step 6 and the Lab Evaluation Document.
