CSCI 335 - Artificial Intelligence

Fall 2011

Programming Assignment #7: Spam Filtering with Boosted Decision Trees

Overview

You will implement boosting to filter spam. Given a set of emails labeled "spam" or "ham", your implementation should identify spam with high probability.

Programming Assignment

Implement the boosting algorithm from class. You may use any programming language you like, although it may be to your advantage to continue using the language you employed for Project 4.
Test your implementation using the same email archive that was used before, which contains quite a bit of spam. For training purposes, represent each email as a histogram of word counts.
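As a rough illustration of the histogram representation, the sketch below counts words in a single email body. The function name `word_histogram` and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for this example; the assignment does not prescribe either.

```python
import re
from collections import Counter

def word_histogram(text):
    """Return a Counter mapping each lowercased word to its count in text."""
    # Assumed tokenization: runs of letters and apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

hist = word_histogram("Free money! Claim your FREE prize now.")
print(hist["free"], hist["money"], hist["viagra"])
```

A `Counter` returns 0 for words that never appear, which is convenient when a decision stump tests a word that is missing from a particular email.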
Pick either the Gini coefficient or the entropy metric and use it throughout.
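For reference, here is a minimal sketch of both metrics for a two-class node. It assumes the Gini metric meant here is the Gini impurity commonly used in decision-tree learning, and that entropy is measured in bits; check both against the definitions given in class. The function names are hypothetical. The counts may be weighted sums rather than integers, which is what boosting will supply.

```python
import math

def gini(num_spam, num_ham):
    """Gini impurity of a node: 1 - p^2 - (1 - p)^2, which equals 2p(1 - p)."""
    total = num_spam + num_ham
    if total == 0:
        return 0.0
    p = num_spam / total
    return 2.0 * p * (1.0 - p)

def entropy(num_spam, num_ham):
    """Entropy of a node in bits; an empty class contributes nothing."""
    total = num_spam + num_ham
    if total == 0:
        return 0.0
    result = 0.0
    for count in (num_spam, num_ham):
        if count > 0:
            q = count / total
            result -= q * math.log2(q)
    return result
```

Both metrics are zero for a pure node and reach their maximum (0.5 for Gini, 1 bit for entropy) when spam and ham are evenly mixed.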
Experiment with the following variations:
- For the hypothesis representation:
  - Decision stumps
  - Pruned decision trees
- Use 10%, 25%, and 50% of the emails for training purposes, reserving the remainder for testing.
- For each of your experiments, run boosting for 100 iterations. Boosting can be time-consuming, so do not be surprised if this takes a while.
- After each iteration of boosting, output the following data:
  - Number of misclassified training examples.
  - Success rate on the testing examples.
  - Minimum, maximum, and mean values for the margin.
- As you perform your analysis, feel free to perform additional experiments to clarify any issues that may arise.
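
The per-iteration bookkeeping described above can be sketched as follows. This is not necessarily the algorithm from class: it assumes an AdaBoost-style update with decision stumps, labels encoded as +1 (spam) and -1 (ham), and a normalized margin defined as y * sum(alpha_t * h_t(x)) / sum(|alpha_t|). All function names are hypothetical, and the exhaustive stump search is written for clarity rather than speed.

```python
import math

def stump_predict(stump, row):
    """Predict +1 or -1 by thresholding a single word count."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity

def train_stump(X, y, w):
    """Return (weighted_error, stump) for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted(set(row[feature] for row in X)):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best

def boost(X, y, X_test, y_test, iterations):
    """Run AdaBoost-style boosting; return one log tuple per iteration.

    Assumes X and X_test are non-empty lists of equal-length count vectors.
    Each tuple is (iteration, training mistakes, test success rate,
    min margin, max margin, mean margin).
    """
    n = len(X)
    w = [1.0 / n] * n                   # example weights, initially uniform
    train_score = [0.0] * n             # running sum of alpha_t * h_t(x)
    test_score = [0.0] * len(X_test)
    alpha_total = 0.0
    log = []
    for t in range(1, iterations + 1):
        err, stump = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        alpha_total += abs(alpha)
        preds = [stump_predict(stump, row) for row in X]
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
        train_score = [s + alpha * p for s, p in zip(train_score, preds)]
        test_score = [s + alpha * stump_predict(stump, row)
                      for s, row in zip(test_score, X_test)]
        norm = alpha_total or 1.0       # guard against an all-zero ensemble
        margins = [yi * s / norm for yi, s in zip(y, train_score)]
        mistakes = sum(1 for m in margins if m <= 0)
        correct = sum(1 for yi, s in zip(y_test, test_score) if yi * s > 0)
        log.append((t, mistakes, correct / len(y_test),
                    min(margins), max(margins), sum(margins) / n))
    return log

# Toy data (made up for illustration): columns are counts of two words.
X_train = [[3, 0], [2, 1], [0, 2], [0, 3]]
y_train = [1, 1, -1, -1]
X_test = [[1, 0], [0, 1]]
y_test = [1, -1]
for entry in boost(X_train, y_train, X_test, y_test, 3):
    print(entry)
```

Keeping running score sums for the training and testing sets avoids re-evaluating every earlier hypothesis at each iteration, which matters when boosting runs for 100 iterations on a large archive.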

Paper

When you are finished with your experiments, you will write a short paper summarizing your findings. Include the following details in your paper:

- An analysis and discussion of your data.
- Focus on the dynamics of the learning process as the boosting iteration increases:
  - How many iterations are required before all training examples are correctly classified?
  - Is the improvement in performance strictly monotonic, or does it degrade at certain points?
  - Once boosting has converged, how does its performance on the test data change given additional iterations?
  - How does the margin change over time? What relationship does it appear to have relative to performance on the testing set?
- For each of the three proportions of training data (10%, 25%, 50%):
  - How does the performance of each boosting algorithm compare to the original decision tree algorithm?
  - How do their execution times compare? In other words, does any improved classification performance from decision trees compensate for the additional execution time they require?
- Overall, what advantages and disadvantages does boosting have over the original decision-tree learning algorithm?
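
Two of the questions above (when the training error reaches zero, and whether test performance ever degrades) can be answered directly from your per-iteration output. The helpers below are one possible sketch; the function names are hypothetical, and the sample numbers are made up purely to show the calls and are not real results.

```python
def first_perfect_iteration(training_errors):
    """Return the 1-based iteration at which training errors first hit zero, or None."""
    for iteration, errors in enumerate(training_errors, start=1):
        if errors == 0:
            return iteration
    return None

def degradation_points(test_accuracy):
    """Return the 1-based iterations whose test accuracy fell below the previous one."""
    return [i + 1 for i in range(1, len(test_accuracy))
            if test_accuracy[i] < test_accuracy[i - 1]]

# Made-up illustrative values, one entry per boosting iteration.
sample_errors = [5, 3, 1, 0, 0]
sample_accuracy = [0.80, 0.85, 0.84, 0.90, 0.90]
print(first_perfect_iteration(sample_errors))
print(degradation_points(sample_accuracy))
```

An empty list from `degradation_points` means test accuracy never decreased, although it may still have plateaued, so it does not by itself establish strict monotonic improvement.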

Submit your code as well as your paper using Sauron.

Grading criteria

Grade	Content
A	Program is working; paper is complete; analysis properly characterizes the basis of each claim
B	Small bugs in program; the analysis is somewhat flawed
C	Problematic bugs and/or somewhat incomplete or moderately flawed paper
D	Severe bugs and/or multiple parts missing from paper
F	Program does not work at all and/or paper is not seriously attempted