CSCI 150 - Lab 9 - Sentiment Analysis


Materials

Overview

Sentiment Analysis is a Big Data problem: determining the general attitude of a writer from some text they have written. For instance, we would like to have a program that could look at the text "The film was a breath of fresh air" and recognize that it is a positive statement, while "It made me want to poke out my eyeballs" is negative.

One algorithm that we can use for this is to assign a numeric value to each word based on how positive or negative that word is, and then score the statement based on the values of its words. But how do we come up with our word scores in the first place?

That's the problem that we'll solve in this assignment. You are going to search through a file containing movie reviews from the Rotten Tomatoes website, each of which has both a numeric score and text. You'll use this to learn which words are positive and which are negative. The data file looks like this:

1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
4 This quiet , introspective and entertaining independent is worth seeking .
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .
3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .
1 Aggressive self-glorification and a manipulative whitewash .
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .
1 Narratively , Trouble Every Day is a plodding mess .
3 The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations
1 But it does n't leave you with much .
1 You could hate it for the same reason .
...

Note that each review starts with a number 0 through 4 with the following meaning:

Step 1: Word scoring

To start, you will write a program (sentiment.py) that simply prompts the user for a word, then reads through the entire review database and computes the average score of all reviews containing the word.

Specifically, you should write two functions: score_word(word: str) -> float and score_user_word().

The first function, score_word(word: str) -> float, should:

Some hints:

The second function, score_user_word(), should:

For example, here is some sample output from calling score_user_word four times:

>>> score_user_word()
Please enter a word: cat
Score: 3.16666666667
Positive! =)
>>> score_user_word()
Please enter a word: CAt
Score: 3.16666666667
Positive! =)
>>> score_user_word()
Please enter a word: dog
Score: 1.631578947368421
Negative. =(
>>> score_user_word()
Please enter a word: koala
Sorry, that word is not in the database.
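
To make the idea of "average score of all reviews containing the word" concrete, here is a minimal sketch, not a reference solution. It makes several assumptions that are not stated above: the helper name average_score_for_word and its filename parameter are hypothetical, a review "contains" a word when the word matches one of its space-separated tokens, matching ignores case (as the cat / CAt sample suggests), and None is returned when no review contains the word.

```python
from typing import Optional


def average_score_for_word(word: str,
                           filename: str = "movieReviews.txt") -> Optional[float]:
    """Average the scores of all reviews whose tokens include `word`.

    Returns None when no review contains the word (an assumed convention).
    """
    target = word.lower()
    total = 0
    count = 0
    with open(filename) as reviews:
        for line in reviews:
            parts = line.split()
            # Skip blank lines and lines that do not start with a numeric score.
            if len(parts) < 2 or not parts[0].isdigit():
                continue
            score = int(parts[0])
            # Compare case-insensitively against the review's tokens.
            tokens = [token.lower() for token in parts[1:]]
            if target in tokens:
                total += score
                count += 1
    if count == 0:
        return None
    return total / count
```

Note that this opens and reads the whole file on every call, which matters in the next step.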

Step 2: Phrase scoring

In this step, you will score an entire phrase instead of a single word. Again, you should write two functions:

For example:

>>> score_user_phrase()
Please enter a word or phrase: My helicopter is full of eels
Score: 2.12112861734
Positive! =)
>>> score_user_phrase()
Please enter a word or phrase: It made me want to poke out my eyeballs
Score: 1.76536704587
Negative. =(
>>> score_user_phrase()
Please enter a word or phrase: I am flying higher than a kite
Score: 2.42411951303
Positive! =)
>>> score_user_phrase()
Please enter a word or phrase: Koala refectory disentanglements
Sorry, none of those words are in the database.

You will probably notice a slight lag between entering a phrase and seeing its score. This is because your program has to re-open and re-read the entire movie review database for each word! As an experiment, try entering a phrase like "a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a", with many small words. The more words you enter, the longer your program will take, even if the words are all the same. There has to be a better way!

Step 3: Dictionaries to the rescue!

Instead of reading through the whole review database for each word, we would like to read the database only once, when the program starts, and pre-compute the score for every word that appears. We can then save these word scores in a dictionary, so we can quickly look up the score for any word without re-reading the entire database.

First, write a function read_movie_reviews() -> Dict[str, float], which reads the file movieReviews.txt and returns a dictionary mapping from words to their average scores. The basic idea is to loop through the reviews in the file while keeping track of two dictionaries: one maps from words to a count of how many reviews each word has appeared in; the other maps from words to the total score of reviews the word has appeared in.

Hints:

After going through the whole file in this way, divide the total score for each word by the number of reviews in which it appears to derive the average score for the word.
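
The two-dictionary idea can be sketched as follows. This is only an illustration under stated assumptions: the function name average_word_scores is hypothetical, it takes an iterable of lines rather than opening movieReviews.txt itself (so it can be tried on a small list), words are lowercased, and a word that appears several times in one review is counted once for that review.

```python
from typing import Dict, Iterable


def average_word_scores(lines: Iterable[str]) -> Dict[str, float]:
    """Map each word to the average score of the reviews it appears in."""
    counts: Dict[str, int] = {}   # word -> number of reviews containing it
    totals: Dict[str, int] = {}   # word -> sum of those reviews' scores
    for line in lines:
        parts = line.split()
        # Skip blank lines and lines that do not start with a numeric score.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        score = int(parts[0])
        # A set ensures a word repeated within one review is counted once.
        for word in set(token.lower() for token in parts[1:]):
            counts[word] = counts.get(word, 0) + 1
            totals[word] = totals.get(word, 0) + score
    # Divide each total by its count to get the average.
    return {word: totals[word] / counts[word] for word in counts}
```

Since an open file is itself an iterable of lines, a call such as average_word_scores(open_file) inside a with block would process the real data in a single pass.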

Next, write a function score_phrase_dict(word_scores: Dict[str, float], phrase: str) -> float, which works similarly to score_phrase, but uses the dictionary word_scores of computed scores instead of looking through the review data each time. Hint: you can just copy your code from score_phrase and modify it to use the dictionary instead of calling score_word.

Here is an example of testing score_phrase_dict:

>>> d = read_movie_reviews()
>>> score_phrase_dict(d, 'My helicopter is full of eels')
2.121128617338458
>>> score_phrase_dict(d, 'It made me want to poke out my eyeballs')
1.7653670458714505
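
A dictionary lookup replaces the file scan entirely. The sketch below shows the lookup pattern on a toy dictionary; it assumes (the handout text here does not say) that a phrase's score is the mean of the scores of its words that are in the dictionary, that unknown words are skipped, and that None signals "no known words". The name average_known_words and the toy scores are hypothetical.

```python
from typing import Dict, Optional


def average_known_words(word_scores: Dict[str, float],
                        phrase: str) -> Optional[float]:
    """Average the dictionary scores of the words in `phrase`.

    Words missing from the dictionary are ignored (an assumption);
    returns None if no word in the phrase has a score.
    """
    known = [word_scores[word]
             for word in phrase.lower().split()
             if word in word_scores]
    if not known:
        return None
    return sum(known) / len(known)


toy_scores = {"good": 3.0, "bad": 1.0}   # made-up values for illustration
print(average_known_words(toy_scores, "Good bad koala"))
```

Each `word in word_scores` test and each lookup takes roughly constant time, so the cost now depends on the length of the phrase rather than the size of the review file.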

Step 4: Main

Write a main() function which calls read_movie_reviews to create a word score dictionary, then repeatedly prompts the user for phrases and scores them. Include a call to main() at the bottom of your program so it runs automatically.

Here is an example of what the output from your program might look like:

Reading movie reviews...
Please enter a phrase, or quit: My helicopter is full of eels
Score: 2.12112861734
Positive! =)
Please enter a phrase, or quit: It made me want to poke out my eyeballs
Score: 1.76536704587
Negative. =(
Please enter a phrase, or quit: I am flying higher than a kite
Score: 2.42411951303
Positive! =)
Please enter a phrase, or quit: Koala refectory disentanglements
Sorry, none of those words are in the database.
Please enter a phrase, or quit: quit
Goodbye!
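
The prompt-until-quit pattern might look roughly like the sketch below. Several things here are assumptions rather than requirements: the function name prompt_loop, the read_line and write parameters (added only so the loop can be exercised without a keyboard), the averaging rule for phrases, and the cutoff of 2 between Positive and Negative, which is merely consistent with the sample scores shown above.

```python
from typing import Callable, Dict


def prompt_loop(word_scores: Dict[str, float],
                read_line: Callable[[str], str] = input,
                write: Callable[[str], None] = print) -> None:
    """Repeatedly read a phrase and report its score until the user quits."""
    while True:
        phrase = read_line("Please enter a phrase, or quit: ")
        if phrase.strip().lower() == "quit":
            write("Goodbye!")
            return
        # Collect scores for the words we know; skip the rest (assumption).
        known = [word_scores[word]
                 for word in phrase.lower().split()
                 if word in word_scores]
        if not known:
            write("Sorry, none of those words are in the database.")
            continue
        score = sum(known) / len(known)
        write(f"Score: {score}")
        # Assumed cutoff: scores of 2 or more count as positive.
        write("Positive! =)" if score >= 2 else "Negative. =(")
```

With the defaults, prompt_loop(d) reads from the keyboard and prints to the screen, where d is a dictionary of word scores.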

Step 5: Word score caching

There may still be a noticeable lag when your program starts, while it reads the movie reviews and constructs the dictionary of word scores. We can make it even faster by caching the word score data in another file so it does not have to be recomputed next time the program runs. For this step, modify your main function so it does the following: (Of course, you probably don't want to do all of that directly in your main function; you should decompose the behavior into more functions as appropriate.)

This procedure means that the work to compute the dictionary of word scores from the movie reviews will be done only once, the first time your program runs. On subsequent runs, the word scores will be read directly from the cache file. This should be noticeably faster, since the cache file will be smaller than the movie review file and reading it requires no averaging, only converting each stored score back to a number.
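
The "compute once, then reuse" behavior can be sketched as below. Everything specific here is an assumption made for illustration: the function name load_or_build_scores, the build parameter (any zero-argument function that returns the dictionary, such as read_movie_reviews), and the cache format of one "word score" pair per line.

```python
import os
from typing import Callable, Dict


def load_or_build_scores(cache_path: str,
                         build: Callable[[], Dict[str, float]]) -> Dict[str, float]:
    """Return word scores from the cache file, creating it if needed."""
    if os.path.exists(cache_path):
        # Cache hit: read "word score" pairs back into a dictionary.
        scores: Dict[str, float] = {}
        with open(cache_path) as cache:
            for line in cache:
                parts = line.split()
                if len(parts) == 2:
                    scores[parts[0]] = float(parts[1])
        return scores
    # Cache miss: do the expensive computation, then save the result.
    scores = build()
    with open(cache_path, "w") as cache:
        for word, score in scores.items():
            cache.write(f"{word} {score}\n")
    return scores
```

This format relies on words containing no spaces, which holds if they came from splitting review text on whitespace. If the review file ever changes, the cache file must be deleted by hand so that it is rebuilt.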

Step 6

Examining the code you wrote in the first five steps, you may see some opportunities to reduce code duplication. Find at least two such opportunities, write appropriate functions that factor out the commonalities, and then use the functions to eliminate the duplication.

In your Lab Evaluation Document, identify the functions you wrote for the purpose of reducing code duplication. Discuss the reasons that the duplication originally occurred, and also discuss the strategy you employed for eliminating the duplication.

What to Hand In

Make sure you have followed the Python Style Guide, and have run your project through the Automated Style Checker.

You must hand in:

Grading


© Eric D. Manley and Timothy M. Urness, Drake University (SIGCSE Nifty Assignments 2016); adapted and extended by Brent Yorgey, Hendrix College