CS 201 - Assignment 4

Due: Friday, April 11th by 11:59 PM

Updated 4/7: extend due date, add hints for URL methods

Web Spider

Acknowledgment: the idea for this assignment comes from Bill Pugh at the University of Maryland.

A web spider is a program that, starting from a particular web page, visits every page transitively reachable from that page.  Search engines, such as Google, use web spiders to build an index of web documents.

For example, consider the following collection of web pages:

A.html is the start page

A.html links to B.html and C.html

B.html links to D.html

C.html links to E.html

D.html links to B.html

E.html links to A.html and F.html

F.html does not contain any links

A web spider starting at A.html should be able to find all of these pages and links.

Graphs

A graph is a data structure consisting of vertices and edges.  The vertices are "locations" in the graph; the edges are "arrows" indicating how the locations (vertices) are connected to each other.

We can represent the collection of web pages above as a graph:

The web pages are the vertices and the links from one page to another are the edges.
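As a concrete illustration (not part of the assignment's starter code), the example collection of pages could be represented in Java with a map from each page to the list of pages it links to:

import java.util.*;

// Each key is a vertex (a page); each value lists the destinations
// of that page's outgoing edges (its links).
Map<String, List<String>> graph = new HashMap<String, List<String>>();
graph.put("A.html", Arrays.asList("B.html", "C.html"));
graph.put("B.html", Arrays.asList("D.html"));
graph.put("C.html", Arrays.asList("E.html"));
graph.put("D.html", Arrays.asList("B.html"));
graph.put("E.html", Arrays.asList("A.html", "F.html"));
graph.put("F.html", new ArrayList<String>());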

Breadth-First Search

So, how does a web spider decide the order in which pages and links should be traversed?

A common solution for the problem of visiting vertices and edges is to use a breadth-first search.  Here is the pseudo-code for the algorithm:

s = the start vertex
q = new Queue
seen = new Collection (possibly a linked list or array list)

q.enqueue(s)

while ( q is not empty ) {
    v = q.dequeue()

    if ( v has not been added to seen collection ) {
        seen.add(v)

        visitVertex(v)

        for ( each edge e originating at v ) {
            visitEdge(e)
            dv = the destination vertex of e (the vertex the arrow is pointing to)
            q.enqueue(dv)
        }
    }
}

The idea is that the queue contains vertices "discovered" by the algorithm.  The algorithm processes each vertex in the order discovered, starting with the start vertex.  Each edge leading away from a processed vertex is used as an opportunity to discover more vertices, which are added to the queue.

The visitVertex and visitEdge operations represent arbitrary processing that we might want to perform as vertices and edges are discovered.  For example, the Google search engine's web spider would use the visitVertex operation to build an index of the web page (recall that we have modeled web pages as vertices).

Note that when an edge is encountered, it may lead to a vertex that has already been processed.  The seen collection is used to keep track of vertices already processed, so we can avoid processing any vertex more than once.  This is especially important if the graph contains cycles, meaning that there is a loop of edges and vertices.
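To make the pseudo-code concrete, here is a minimal Java sketch of the same algorithm (not part of the assignment's starter code), using the map representation sketched earlier.  The println calls stand in for the visitVertex and visitEdge operations:

import java.util.*;

// Breadth-first traversal of a graph represented as a map from each
// vertex to the list of destination vertices of its outgoing edges.
// Assumes every vertex has an entry in the map.
static void bfs(String start, Map<String, List<String>> graph) {
    Queue<String> q = new LinkedList<String>();
    Set<String> seen = new HashSet<String>();  // a set makes the "already seen?" check fast

    q.add(start);
    while (!q.isEmpty()) {
        String v = q.remove();
        if (!seen.contains(v)) {
            seen.add(v);
            System.out.println("visit vertex " + v);  // visitVertex(v)
            for (String dv : graph.get(v)) {
                System.out.println("visit edge " + v + " -> " + dv);  // visitEdge(e)
                q.add(dv);
            }
        }
    }
}

Running bfs("A.html", graph) on the example map prints exactly the visit order listed next.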

So, starting from the start page, A.html, a breadth-first search would visit the vertices and edges of the graph in the following order (assuming each page's links are followed in the order listed earlier):

visit A.html, then edges A.html->B.html and A.html->C.html

visit B.html, then edge B.html->D.html

visit C.html, then edge C.html->E.html

visit D.html, then edge D.html->B.html

visit E.html, then edges E.html->A.html and E.html->F.html

visit F.html (no outgoing edges)

Getting Started

Download CS201_Assign4.zip and import it into your Eclipse workspace.  (File->Import...->General->Existing Projects into Workspace->Archive File).  You should see a new project, CS201_Assign4, in the package explorer.

Your Task

You have two tasks:

1. Implement the methods of the URL class

2. Implement the createSpider() method in the WebSpiderFactory class

This second task will require you to write a class that implements the WebSpider interface

URLs

A URL is a string which identifies the location of a web page.

In this assignment, we will use file names as URLs.  This will allow us to implement a web spider that performs a web crawl of HTML files in the local filesystem.

Example file name URLs:

index.html

foo/bar/

foo/bar/baz.html

foo/bar/./baz.html

foo/bar/thud/../baz.html

/www/somePage.html

A URL can be relative or absolute.  A URL that begins with a "/" character is absolute.  All other URLs are relative.  For example, "index.html" and "foo/bar/" are relative URLs, while "/www/somePage.html" is an absolute URL.

A directory URL is one which ends with a "/" character, or which is the empty string "".  For any URL, we can extract its directory part as follows:

if a URL does not contain any "/" characters, then its directory part is "" (the empty string)

if a URL does contain at least one "/" character, then its directory part is the prefix of the string up to and including the last occurrence of "/"

For example, the directory part of "index.html" is "", and the directory part of "/www/somePage.html" is "/www/".
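As a sketch, those two rules can be implemented with lastIndexOf.  The helper name getDirectoryPart is invented here for illustration; the assignment's URL class documents the real method names:

// Hypothetical helper: returns the directory part of a URL, per the
// two rules above.
static String getDirectoryPart(String url) {
    int lastSlash = url.lastIndexOf('/');
    if (lastSlash < 0) {
        return "";  // no "/" at all: the directory part is the empty string
    }
    return url.substring(0, lastSlash + 1);  // up to and including the last "/"
}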

A URL is in canonical form if it does not contain any "." or ".." components.  A "." component means "the current directory", while a ".." component means "the parent directory".  Therefore the three URLs

foo/bar/baz.html

foo/bar/./baz.html

foo/bar/thud/../baz.html

all have the same canonical form, which is "foo/bar/baz.html".  Being able to convert a URL to canonical form is important, because it ensures that the web spider will be able to detect when it encounters a page visited previously.

There is a simple stack-based algorithm for finding the canonical form of a URL!

When one web page (the "base" page) contains a reference to another web page (the "referenced" page), the URL of the referenced page may be relative.  In this case, the complete URL of the referenced page is determined by extracting the directory from the base page, and appending the URL of the referenced page.  For example, if the URL "foo/bar/baz.html" is referenced from the page "/www/somePage.html", the complete URL of the referenced page is "/www/foo/bar/baz.html".
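Here is a sketch of that resolution step, reusing the hypothetical getDirectoryPart helper above together with the makeCanonical method you will implement (this mirrors what the URL.getReferencedURL method, described later, is meant to do):

// Sketch: compute the complete, canonical URL of a referenced page.
static String resolve(String baseURL, String referencedURL) {
    if (referencedURL.startsWith("/")) {
        return makeCanonical(referencedURL);  // already absolute
    }
    // Relative: prepend the base page's directory part, then canonicalize.
    return makeCanonical(getDirectoryPart(baseURL) + referencedURL);
}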

The URL class contains static methods which operate on URL strings.  The URLTest class contains JUnit tests which check whether or not these methods are implemented correctly.  A documentation comment explains precisely how each method is supposed to work.

Hints:

In the makeCanonical method, you will need to extract each component of the URL, in order.  For example, the components of the URL "foo/bar/baz.html" are "foo", "bar", and "baz.html".

You can do so with the following code:

while ( ! url.equals("") ) {
    int slash = url.indexOf('/');

    String component;
    if (slash >= 0) {
        component = url.substring(0, slash);
        url = url.substring(slash + 1);
    } else {
        component = url;
        url = "";
    }

    // ...do something with the URL component, using a stack!...
}

Note that a URL can contain multiple occurrences of the ".." component: e.g.,

foo/bar/baz/../../x.html

Since the ".." component means "go back to the parent directory", the canonical form of this URL is

foo/x.html

A stack of URL component strings will help you find the canonical form of the URL you are processing.

Note that when converting a URL into its canonical form, any leading and/or trailing occurrences of the slash ("/") character must be preserved.
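Putting the hints together, one possible shape for makeCanonical is sketched below.  Treat it as a starting point only, under the assumptions stated in the comments; it may not handle every case the URLTest JUnit tests check:

import java.util.*;

// Sketch of the stack-based approach.  Assumes ".." never tries to
// climb above the start of the URL; verify the exact contract in the
// URL class's documentation comment before relying on this shape.
static String makeCanonical(String url) {
    boolean leadingSlash = url.startsWith("/");
    boolean trailingSlash = url.endsWith("/");

    Deque<String> stack = new ArrayDeque<String>();
    while (!url.equals("")) {
        int slash = url.indexOf('/');
        String component;
        if (slash >= 0) {
            component = url.substring(0, slash);
            url = url.substring(slash + 1);
        } else {
            component = url;
            url = "";
        }

        if (component.equals("") || component.equals(".")) {
            // empty (produced by a leading "/") or ".": contributes nothing
        } else if (component.equals("..")) {
            stack.pollLast();           // ".." cancels the most recent component
        } else {
            stack.addLast(component);   // ordinary component: push it
        }
    }

    // Rebuild the URL, preserving any leading/trailing slash.
    StringBuilder result = new StringBuilder(leadingSlash ? "/" : "");
    for (Iterator<String> it = stack.iterator(); it.hasNext(); ) {
        result.append(it.next());
        if (it.hasNext()) {
            result.append('/');
        }
    }
    if (trailingSlash && !stack.isEmpty()) {
        result.append('/');
    }
    return result.toString();
}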

WebSpider, WebSpiderFactory

The WebSpider interface defines two methods:

public void setStartPage(String url)

Sets the URL (file name) of the HTML page at which the web crawl should start.

public void crawl(PageAndLinkVisitor linkVisitor) throws IOException

Performs the web crawl.  The implementation of this method should perform a breadth-first search of HTML pages reachable from the start page.  As each page is encountered, and as links within pages are encountered, the method should invoke an appropriate method on the PageAndLinkVisitor object passed as a parameter.

Notes:

You can use an instance of the HRefExtractor class to extract all of the linked URLs from an HTML document.  The code to perform the extraction is:

String url = ...file name of a page you want to extract links from...

HRefExtractor extractor = new HRefExtractor(url);
extractor.extract();

List<String> referencedURLs = extractor.getHRefs();

The referenced URLs returned by the HRefExtractor may be relative to the base page (the page containing the link).  Use the URL.getReferencedURL method to find the full URL of the referenced page in its canonical form.

The web spider should ignore all referenced URLs that begin with "http:" or "mailto:", since those are external links.  Also, if a referenced URL contains an occurrence of the "#" character, the web spider should remove all of the characters including and following the last occurrence of the "#" character from the URL.
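In code, those two rules might look like the hypothetical helper below (the name filterReferencedURL is invented for illustration):

// Hypothetical helper applying the two rules above; returns null for
// links the spider should ignore entirely.
static String filterReferencedURL(String url) {
    if (url.startsWith("http:") || url.startsWith("mailto:")) {
        return null;  // external link
    }
    int hash = url.lastIndexOf('#');
    if (hash >= 0) {
        url = url.substring(0, hash);  // strip "#" and everything after it
    }
    return url;
}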

All visited pages should be reported to the PageAndLinkVisitor using the visitPage method.  Each page should only be reported once, at the time of its visit in the breadth-first search.

All referenced URLs (to HTML documents and other kinds of files) should be reported as links to the PageAndLinkVisitor.  You should report the link as either good or broken.  A good link is one where the referenced URL exists as a file; a broken link is one where the referenced URL does not exist as a file.  You can test to see whether a link is good or broken as follows:

String url = ...a referenced URL...

File f = new File(url);
if (f.exists()) {
    // link is good
} else {
    // link is broken
}

Any time a good link to another HTML document is seen, it should be scheduled for processing (added to the queue).
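Putting these notes together, the heart of the crawl method might be shaped roughly like the sketch below.  This is only one possible shape: visitGoodLink and visitBrokenLink are placeholder names (use whatever methods PageAndLinkVisitor actually declares), the argument order of URL.getReferencedURL is a guess (check its documentation comment), and recognizing HTML documents by the ".html" suffix is an assumption.

import java.io.File;
import java.io.IOException;
import java.util.*;

public void crawl(PageAndLinkVisitor visitor) throws IOException {
    Queue<String> q = new LinkedList<String>();
    Set<String> seen = new HashSet<String>();
    q.add(startPage);  // assumes setStartPage stored its argument in a field

    while (!q.isEmpty()) {
        String page = q.remove();
        if (seen.contains(page)) {
            continue;  // this page was already visited
        }
        seen.add(page);
        visitor.visitPage(page);

        HRefExtractor extractor = new HRefExtractor(page);
        extractor.extract();
        for (String href : extractor.getHRefs()) {
            String url = filterReferencedURL(href);  // sketch from the notes above
            if (url == null) {
                continue;  // external (http:/mailto:) link: ignore
            }
            url = URL.getReferencedURL(page, url);  // resolve + canonicalize (argument order is a guess)
            if (new File(url).exists()) {
                visitor.visitGoodLink(page, url);    // placeholder method name
                if (url.endsWith(".html")) {
                    q.add(url);  // good link to an HTML document: schedule it
                }
            } else {
                visitor.visitBrokenLink(page, url);  // placeholder method name
            }
        }
    }
}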

Testing

Run the JUnit tests in the URLTest class to make sure your implementations of the methods in the URL class work correctly.

You can use the LinkChecker program to test your implementation of WebSpider.  A collection of HTML documents has been included within the project in the www directory.  Here is an (abbreviated) example run of the program (the file name on the second line is the user's input):

Enter filename of start page:
www/index.html
=============== Summary ===============
Number of pages found: 29
Number of good links: 793
Number of broken links: 53
=============== Details ===============
Checking page www/index.html
Broken link from www/index.html to www/findbugs.css
Page www/index.html links to www/index.html
Page www/index.html links to www/demo.html
Page www/index.html links to www/users.html
Page www/index.html links to www/factSheet.html
...many additional lines of output...

The output above is abbreviated.  The full output is 883 lines long.

Submitting

To submit, first export your project by right-clicking on the name of the project (CS201_Assign4) in the Package Explorer, and then choosing Export->General->Archive File.  Save the project as a zip file.  Upload the zip file to the Marmoset server as Project 4:

https://camel.ycp.edu:8443/

You should have received your Marmoset username and password in an email.