Due: 8:00am, Thursday, April 8. Value: 30 pts. Submit to Sauron.
A Web crawler is a program that automatically loads a Web page, discovers new pages from its links, and then loads those pages. Search engines like Google and Bing use Web crawlers to build up their database of pages. This assignment will deal with a simple Web crawler, which you should use for the starting code for this assignment. I've also built a very small Web so the crawler can crawl within a constrained space. (Actually, it's based on a small Facebook network — in fact, mostly students who were/are computer science majors at Hendrix, though I have changed their names to keep some measure of anonymity.)
You will need to be familiar with two important variables
that the crawler uses to track the URLs it discovers:
queue is a list of URLs (each a string)
that have been discovered and should be loaded in the future,
and found is a dictionary whose keys include
each URL that has been discovered, whether it has already been
loaded or it is in queue to be loaded later.
Because the Web is vast and expands rapidly, a crawler can never hope to load all the pages in the Web. So a good crawler must be careful in how it selects which pages to load. My simple crawler is quite naive about this: It simply loads the pages in the order they are discovered.
Your assignment is to modify the crawler to use a
more sophisticated page selection algorithm: Each time, it should load
the page to which the most previously-loaded pages have linked.
To do this, you will modify the first three functions,
create_found, select_from_queue,
and add_to_found. In your implementation,
the found dictionary should maps URLs to
integers rather than always mapping discovered URLs to True. The add_to_found
function will
increment the number associated with a URL if it happens already to
be in the dictionary. And select_from_queue should
return the URL in queue for which the number associated with
it within found is greatest.
If your program works properly, you should find that it loads the pages roughly in descending order of number of friends. It won't be in exact order: At the beginning, the crawler won't know who has the most friends yet. But the pages with the fewest links (christopher.html, james.html, sarah.html) should end up being the last to be loaded.