CSCI 150 - Lab 9
Sentiment Analysis
Materials
Overview
Sentiment Analysis is a Big Data problem which seeks to determine the general attitude of a writer given some text they have written. For instance, we would like to have a program that could look at the text "The film was a breath of fresh air" and realize that it was a positive statement, while "It made me want to poke out my eyeballs" is negative.
One algorithm that we can use for this is to assign a numeric value to any given word based on how positive or negative that word is and then score the statement based on the values of the words. But how do we come up with our word scores in the first place?
That's the problem that we'll solve in this assignment. You are going to search through a file containing movie reviews from the Rotten Tomatoes website, each of which has both a numeric score and text. You'll use this to learn which words are positive and which are negative. The data file looks like this:
1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
4 This quiet , introspective and entertaining independent is worth seeking .
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .
3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .
1 Aggressive self-glorification and a manipulative whitewash .
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .
1 Narratively , Trouble Every Day is a plodding mess .
3 The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations
1 But it does n't leave you with much .
1 You could hate it for the same reason .
...
Note that each review starts with a number 0 through 4 with the following meaning:
- 0 : negative
- 1 : somewhat negative
- 2 : neutral
- 3 : somewhat positive
- 4 : positive
Step 1: Word scoring
To start, you will write a program (sentiment.py) that simply prompts the user for a word, then reads through the entire review database and computes the average score of all reviews containing the word.
Specifically, you should write two functions: score_word(word: str) -> float and score_user_word().
The first function, score_word(word: str) -> float, should:
- Open the movieReviews.txt file, loop through it, and compute the average rating for reviews containing the given word.
- Return the average rating, or the special value None if the word does not occur in any reviews.
Some hints:
- score_word should not prompt the user or print anything!
- Use the string split() method to split each line of the file into words. If words = line.split(), then words[0] will be the review score, and words[1:] will be all the words in the review.
- The average rating for a word can be computed as the total score of reviews containing the word, divided by the number of reviews containing the word.
- Be careful to handle words with different capitalizations, for example, by converting everything to lowercase.
- Be careful to look only for entire words. For example, the word "bad" does not occur in the review "This is a movie about a troubadour.", even though the letters "bad" do occur as a substring. The easiest way to deal with this is to split the review into a list of words, and check whether the given word exists as an element of the list (for example, using Python's in operator).
- Even if the word occurs multiple times in a review, you should only count the review once. (You probably don't have to do anything special to make this work; you will get this behavior by default if you just check for the presence of the word in the review.)
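The hints above can be sketched on a small in-memory list of review lines rather than the real file. This is only an illustration, not the required solution: the helper name average_score_for_word and the sample_reviews list are made up for this example, and the score attached to the "troubadour" sentence is invented.

```python
from typing import List, Optional


def average_score_for_word(word: str, review_lines: List[str]) -> Optional[float]:
    """Average the scores of the reviews that contain `word` as a whole word."""
    target = word.lower()
    total = 0
    count = 0
    for line in review_lines:
        # Lowercase before splitting so that "This" and "this" match.
        words = line.lower().split()
        if not words:
            continue  # skip blank lines
        # words[0] is the review score; words[1:] are the words of the review.
        if target in words[1:]:
            # A review is counted once, however often the word appears in it.
            total += int(words[0])
            count += 1
    if count == 0:
        return None
    return total / count


# Hypothetical sample data; the last line's score is invented.
sample_reviews = [
    "4 This quiet , introspective and entertaining independent is worth seeking .",
    "1 Narratively , Trouble Every Day is a plodding mess .",
    "1 This is a movie about a troubadour .",
]
print(average_score_for_word("This", sample_reviews))  # 2.5
print(average_score_for_word("bad", sample_reviews))   # None
```

Because the check is made against a list of words, "bad" is not found inside "troubadour". Your score_word would apply the same idea to the lines of movieReviews.txt.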
The second function, score_user_word(), should:
- Input a word from the user.
- Call score_word to compute the average score for reviews containing the user's word.
- Report the average rating to the user (via print), along with an assessment of whether the word is positive (score greater than 2) or negative (score less than or equal to 2).
>>> score_user_word()
Please enter a word: cat
Score: 3.16666666667
Positive! =)
>>> score_user_word()
Please enter a word: CAt
Score: 3.16666666667
Positive! =)
>>> score_user_word()
Please enter a word: dog
Score: 1.631578947368421
Negative. =(
>>> score_user_word()
Please enter a word: koala
Sorry, that word is not in the database.
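One possible way to turn a score into the report shown above is sketched below. The helper name describe_score is an assumption made for this example; nothing in the lab requires a separate function for this.

```python
from typing import Optional


def describe_score(score: Optional[float]) -> str:
    """Build the text reported to the user for a word score."""
    # Compare with None explicitly: 0.0 is a legitimate score,
    # so a test like "if not score" would give the wrong message.
    if score is None:
        return "Sorry, that word is not in the database."
    if score > 2:
        verdict = "Positive! =)"
    else:
        verdict = "Negative. =("
    return "Score: " + str(score) + "\n" + verdict


print(describe_score(3.5))
print(describe_score(None))
```

Note that a score of exactly 2 falls on the negative side, matching the rule stated above.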
Step 2: Phrase scoring
In this step, you will score an entire phrase instead of a single word. Again, you should write two functions:
- score_phrase(phrase: str) -> float should split the given phrase into words, call score_word on each, and compute the average score of the whole phrase. (Be careful to handle words resulting in None: they should not contribute to the average.)
- score_user_phrase(), similarly to score_user_word(), should prompt the user for a phrase, and tell the user its average score along with whether it is positive or negative.
>>> score_user_phrase()
Please enter a word or phrase: My helicopter is full of eels
Score: 2.12112861734
Positive! =)
>>> score_user_phrase()
Please enter a word or phrase: It made me want to poke out my eyeballs
Score: 1.76536704587
Negative. =(
>>> score_user_phrase()
Please enter a word or phrase: I am flying higher than a kite
Score: 2.42411951303
Positive! =)
>>> score_user_phrase()
Please enter a word or phrase: Koala refectory disentanglements
Sorry, none of those words are in the database.
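The trickiest part of phrase scoring is leaving out the words whose score is None. A minimal sketch of just that averaging step follows, assuming the per-word scores have already been collected into a list; the function name average_ignoring_none is hypothetical.

```python
from typing import List, Optional


def average_ignoring_none(scores: List[Optional[float]]) -> Optional[float]:
    """Average the scores that are not None; None if there are no such scores."""
    known = [score for score in scores if score is not None]
    if not known:
        # Every word was missing from the database (or the list was empty).
        return None
    return sum(known) / len(known)


print(average_ignoring_none([3.0, None, 1.0]))  # 2.0
print(average_ignoring_none([None, None]))      # None
```

Dividing by the number of known scores, rather than by the number of words in the phrase, is what keeps unknown words from dragging the average down.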
You will probably notice a slight lag between entering a phrase and seeing its score. This is because your program has to re-open and re-read the entire movie review database for each word! As an experiment, try entering a phrase like "a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a", with many small words. The more words you enter, the longer your program will take, even if the words are all the same. There has to be a better way!
Step 3: Dictionaries to the rescue!
Instead of reading through the whole review database for each word, we would like to read the database only once, when the program starts, and pre-compute the score for every word that appears. We can then save these word scores in a dictionary, so we can quickly look up the score for any word without re-reading the entire database.
First, write a function read_movie_reviews() -> Dict[str, float], which reads the file movieReviews.txt and returns a dictionary mapping from words to their average scores. The basic idea is to loop through the reviews in the file while keeping track of two dictionaries: one maps from words to a count of how many reviews each word has appeared in; the other maps from words to the total score of reviews the word has appeared in.
Hints:
- For each line in the database, you want to count each unique word only once. You may find it helpful to have a list of words that is re-initialized for each new line of input. This list will have an entry for each word from that line that has already been seen. For each new word, you can make sure it is not in the list before counting it (and adding it to the list).
- Be careful when testing read_movie_reviews. Calling the function from the Python shell produces such a large output that it could crash PyCharm! Instead, store the dictionary in a variable, from which you can test it, like this:
>>> d = read_movie_reviews()
>>> d['cat']
3.1666666666666665
>>> d['dog']
1.631578947368421
>>>
After going through the whole file in this way, divide the total score for each word by the number of reviews in which it appears to derive the average score for the word.
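The two-dictionary idea can be sketched on an in-memory list of lines as follows. This is an illustration under assumptions, not the required implementation: the name average_scores and the toy reviews are made up, and a set is used to drop repeated words within a line where the hint above suggests a list of already-seen words. Either approach counts each word once per review.

```python
from typing import Dict, List


def average_scores(review_lines: List[str]) -> Dict[str, float]:
    """Map each word to the average score of the reviews it appears in."""
    counts: Dict[str, int] = {}   # word -> number of reviews containing it
    totals: Dict[str, int] = {}   # word -> total score of those reviews
    for line in review_lines:
        words = line.lower().split()
        if not words:
            continue  # skip blank lines
        score = int(words[0])
        # set(...) keeps one copy of each word, so a word repeated
        # within a review is only counted once for that review.
        for word in set(words[1:]):
            counts[word] = counts.get(word, 0) + 1
            totals[word] = totals.get(word, 0) + score
    # Final pass: total divided by count gives the average for each word.
    return {word: totals[word] / counts[word] for word in counts}


# Made-up reviews, not taken from movieReviews.txt.
toy_reviews = ["4 good good movie", "1 bad movie"]
scores = average_scores(toy_reviews)
print(scores["good"])   # 4.0
print(scores["movie"])  # 2.5
```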
Next, write a function score_phrase_dict(word_scores: Dict[str, float], phrase: str) -> float, which works similarly to score_phrase, but uses the dictionary word_scores of computed scores instead of looking through the review data each time. Hint: you can just copy your code from score_phrase and modify it to use the dictionary instead of calling score_word.
Here is an example of testing score_phrase_dict:
>>> d = read_movie_reviews()
>>> score_phrase_dict(d, 'My helicopter is full of eels')
2.121128617338458
>>> score_phrase_dict(d, 'It made me want to poke out my eyeballs')
1.7653670458714505
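To see why the dictionary helps, here is a small sketch in which every word score comes from a dictionary lookup instead of a pass over the file. The function name average_known_scores and the values in toy_scores are invented for illustration; real scores come from read_movie_reviews().

```python
from typing import Dict, Optional


def average_known_scores(word_scores: Dict[str, float], phrase: str) -> Optional[float]:
    """Average the dictionary scores of the words in `phrase` that have one."""
    total = 0.0
    count = 0
    for word in phrase.lower().split():
        if word in word_scores:   # a dictionary lookup; no file is opened
            total += word_scores[word]
            count += 1
    if count == 0:
        return None
    return total / count


toy_scores = {"my": 2.0, "full": 3.0, "of": 1.0}   # made-up values
print(average_known_scores(toy_scores, "My helicopter is full of eels"))  # 2.0
```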
Step 4: Main
Write a main() function which calls read_movie_reviews to create a word score dictionary, then repeatedly prompts the user for phrases and scores them until the user enters quit. Include a call to main() at the bottom of your program so it runs automatically.
Here is an example of what the output from your program might look like:
Reading movie reviews...
Please enter a phrase, or quit: My helicopter is full of eels
Score: 2.12112861734
Positive! =)
Please enter a phrase, or quit: It made me want to poke out my eyeballs
Score: 1.76536704587
Negative. =(
Please enter a phrase, or quit: I am flying higher than a kite
Score: 2.42411951303
Positive! =)
Please enter a phrase, or quit: Koala refectory disentanglements
Sorry, none of those words are in the database.
Please enter a phrase, or quit: quit
Goodbye!
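The repeat-until-quit structure might look something like the sketch below. The function run_prompt_loop and its parameters are hypothetical: passing in the reading and writing functions is a choice made here so the loop can be tried with scripted input, and your main() can simply call input and print directly.

```python
from typing import Callable


def run_prompt_loop(handle_phrase: Callable[[str], str],
                    read_line: Callable[[str], str] = input,
                    write_line: Callable[[str], None] = print) -> None:
    """Keep asking for phrases and reporting on them until the user types quit."""
    while True:
        phrase = read_line("Please enter a phrase, or quit: ")
        if phrase == "quit":
            write_line("Goodbye!")
            return
        write_line(handle_phrase(phrase))


# Scripted answers stand in for a user typing at the keyboard.
scripted = iter(["I am flying higher than a kite", "quit"])
run_prompt_loop(lambda phrase: "You typed " + str(len(phrase.split())) + " words",
                read_line=lambda prompt: next(scripted))
```

In the real program, handle_phrase would score the phrase with the word score dictionary and build the report text.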
Step 5: Word score caching
There may still be a noticeable lag when your program starts, while it reads the movie reviews and constructs the dictionary of word scores. We can make it even faster by caching the word score data in another file so it does not have to be recomputed next time the program runs. For this step, modify your main function so it does the following:
- Look to see whether the file movieReviews.cache.txt exists. You can do this by adding import os at the top of your program, and then using the os.path.isfile() function, which takes a file name as an argument and returns True if the file exists and False otherwise.
- If the cache file does not exist, read the movie review database and construct the word score dictionary using the read_movie_reviews() function as usual. Then use a loop to write() the contents of the dictionary to the cache file using the format
word1 score1
word2 score2
word3 score3
...
For example, the first few lines of the generated cache file might look like
aided 4.0
writings 1.0
bad-boy 2.0
ryoko 2.0
yellow 1.5
pony 0.0
four 2.08695652174
This procedure means that the work to compute the dictionary of word scores from the movie reviews will be done only once, the first time your program runs. On subsequent runs, when the cache file already exists, your program should skip read_movie_reviews() and read the word scores directly from the cache file. This should be noticeably faster, since the cache file is smaller than the movie review file and reading it requires no computation.
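The writing and reading halves of the cache can be sketched as follows, assuming the cache holds one word and one score per line. The function names write_cache and read_cache are hypothetical, and the demonstration uses a throwaway temporary directory and file name rather than the real movieReviews.cache.txt.

```python
import os
import tempfile
from typing import Dict


def write_cache(word_scores: Dict[str, float], cache_path: str) -> None:
    """Write one "word score" pair per line."""
    with open(cache_path, "w") as cache_file:
        for word, score in word_scores.items():
            cache_file.write(word + " " + str(score) + "\n")


def read_cache(cache_path: str) -> Dict[str, float]:
    """Rebuild the word score dictionary from a file written by write_cache."""
    word_scores: Dict[str, float] = {}
    with open(cache_path) as cache_file:
        for line in cache_file:
            parts = line.split()
            if len(parts) == 2:   # ignore blank or malformed lines
                word_scores[parts[0]] = float(parts[1])
    return word_scores


# Demonstration in a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.cache.txt")
    print(os.path.isfile(path))   # False: nothing has been cached yet
    write_cache({"aided": 4.0, "yellow": 1.5}, path)
    print(os.path.isfile(path))   # True
    print(read_cache(path))       # {'aided': 4.0, 'yellow': 1.5}
```

Scores are converted back with float() when reading, because everything read from a text file starts out as a string.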
Step 6
Examining the code you wrote in the first five steps, you may see some opportunities to reduce code duplication. Find at least two such opportunities, write appropriate functions that factor out the commonalities, and then use the functions to eliminate the duplication.
In your Lab Evaluation Document, identify the functions you wrote for the purpose of reducing code duplication. Discuss the reasons that the duplication originally occurred, and also discuss the strategy you employed for eliminating the duplication.
What to Hand In
Make sure you have followed the Python Style Guide, and have run your project through the Automated Style Checker.
You must hand in:
- sentiment.py
- Lab Evaluation Document
Grading
- To earn a D on this lab, complete Steps 1 and 2 (score_word, score_user_word, score_phrase, score_user_phrase)
- To earn a C on this lab, complete Step 3 (read_movie_reviews, score_phrase_dict)
- To earn a B on this lab, complete Step 4 (main)
- To earn an A on this lab, complete Step 5 (word score caching)
- To earn 100 on this lab, complete Step 6 and the Lab Evaluation Document.
© Eric D. Manley and Timothy M. Urness, Drake University (SIGCSE Nifty Assignments 2016); adapted and extended by Brent Yorgey, Hendrix College