CSCI 335 - Artificial Intelligence
Fall 2011
Programming Assignment #4: Spam Filtering with Decision Trees
Overview
You will implement decision tree learning to filter spam. Given a set of emails labeled either "spam" or "ham", your implementation should identify spam with high probability.
Programming Assignment
- Implement the decision tree learning algorithm from class. You may use any programming language that you would like.
- Test your implementation using an email archive containing quite a bit of spam. For training purposes, represent each email as a histogram of word counts.
- Experiment with the following variations:
  - Use both the entropy metric and the Gini coefficient (Gini impurity) as the splitting criterion.
  - Use 10%, 25%, and 50% of the emails for training, reserving the remainder for testing.
  - Try at least one representation for emails other than a histogram of word counts.
  - Create your own spam data set from your own emails, and assess performance on it.
  - Create an additional data set with at least three labels.
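As a starting point, the word-count representation, the two split metrics, and the training/testing split might be sketched in Python as follows. This is only one possible layout: all function names are illustrative, splitting on whitespace is an assumed tokenization, and the Gini measure is computed here as Gini impurity, which may differ from the definition given in class.

```python
import math
import random
from collections import Counter


def word_histogram(text):
    """Count lowercase, whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance that two draws with replacement disagree."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def impurity_reduction(parent, branches, metric):
    """Impurity of the parent minus the size-weighted impurity of its branches."""
    total = len(parent)
    weighted = sum(len(b) / total * metric(b) for b in branches)
    return metric(parent) - weighted


def split_examples(examples, train_fraction, seed=0):
    """Shuffle a copy of the examples and return (training, testing) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Passing either `entropy` or `gini` as `metric` lets the same tree-building code run both variations, and the fixed `seed` keeps each split reproducible across experiments.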
For each experiment you run, record:
- Tree size
- Tree depth
- Number of training examples
- Number of testing examples
- Number and percentage of correctly classified tests
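The measurements above could be collected along these lines. The `Node` class is a hypothetical tree layout (binary tests on whether a word's count exceeds a threshold), not a required design. Size is assumed to count every node including leaves, and depth is assumed to count edges, so a single leaf has depth 0. Adjust both conventions to match the ones used in class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical layout: a leaf sets `label`; an internal node tests
    # whether the count of `word` is greater than `threshold`.
    label: Optional[str] = None
    word: Optional[str] = None
    threshold: int = 0
    low: Optional["Node"] = None   # branch taken when count <= threshold
    high: Optional["Node"] = None  # branch taken when count > threshold


def classify(node, histogram):
    """Walk from the root to a leaf and return that leaf's label."""
    while node.label is None:
        count = histogram.get(node.word, 0)
        node = node.high if count > node.threshold else node.low
    return node.label


def tree_size(node):
    """Total number of nodes, counting leaves."""
    if node.label is not None:
        return 1
    return 1 + tree_size(node.low) + tree_size(node.high)


def tree_depth(node):
    """Number of edges on the longest root-to-leaf path."""
    if node.label is not None:
        return 0
    return 1 + max(tree_depth(node.low), tree_depth(node.high))


def evaluate(tree, tests):
    """Return (number correct, percentage correct) over (histogram, label) pairs."""
    correct = sum(1 for hist, label in tests if classify(tree, hist) == label)
    return correct, 100.0 * correct / len(tests)
```

For example, a one-test tree built as `Node(word="free", low=Node(label="ham"), high=Node(label="spam"))` has size 3 and depth 1 under these conventions.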
Paper
When you are finished with your experiments, you will write a short paper
summarizing your findings. Include the following details in your paper:
- An analysis and discussion of your data.
- Be sure to examine several trees devised by your implementation, and discuss how they decide to classify their data. Qualitatively assess the degree to which each tree "understands" the concept of spam.
Submit your code as well as your paper using Sauron.
Grading criteria
| Grade | Content |
| --- | --- |
| A | Program is working; paper is complete; analysis properly characterizes the basis of each claim |
| B | Small bugs in program; the analysis is somewhat flawed |
| C | Problematic bugs and/or somewhat incomplete or moderately flawed paper |
| D | Severe bugs and/or multiple parts missing from paper |
| F | Program does not work at all and/or paper is not seriously attempted |