Homework 5: Web browser, part 2
Processing HTML
HTML (HyperText Markup Language)
is the language used to define the appearance of a web page. The primary task
of a browser is to render a document that corresponds to the downloaded HTML
instructions.
An HTML document contains tags that are interspersed with regular
text. To process an HTML document, we can proceed as follows:
- Separate the text into tag and non-tag elements.
- Determine the properties of each tag.
- Render the document by iterating through the processed text from start
to finish:
  - Some tags have a one-time effect and may be subsequently ignored.
  - Other tags change how all subsequent text is rendered, until a matching
    closing tag is encountered or a different tag cancels the effect.
  - Tags that the browser does not recognize may be ignored.
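The first step above (separating tag and non-tag elements) might be sketched as follows. This is only one possible approach, not a required design: the function name `split_elements` and the `("tag", ...)` / `("text", ...)` pair representation are assumptions made for illustration. The scan skips over quoted strings so that a closing symbol inside quotes does not end the tag. Comments are not treated specially here.

```python
def split_elements(html):
    """Split HTML source into ("tag", body) and ("text", body) pairs.

    Hypothetical helper: the pair representation is an assumption,
    not something the assignment prescribes.
    """
    elements = []
    text_start = 0              # where the current run of plain text began
    i = 0
    while i < len(html):
        if html[i] != "<":
            i += 1
            continue
        # Look for the closing '>' while skipping over quoted strings.
        j = i + 1
        quote = None            # the quote character we are inside, if any
        while j < len(html):
            ch = html[j]
            if quote:
                if ch == quote:
                    quote = None
            elif ch in "\"'":
                quote = ch
            elif ch == ">":
                break
            j += 1
        if j == len(html):      # unterminated tag: treat the rest as text
            break
        if i > text_start:
            elements.append(("text", html[text_start:i]))
        elements.append(("tag", html[i + 1:j].strip()))
        i = j + 1
        text_start = i
    if text_start < len(html):
        elements.append(("text", html[text_start:]))
    return elements
```

For example, `split_elements('Hi <a href="x>y">link</a>!')` keeps `a href="x>y"` together as a single tag element even though the quoted value contains a closing symbol.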
Tag Structure
- A tag starts with a name: alphanumeric text, terminated by a space
or the closing symbol.
- Arbitrary spaces may separate elements of a tag, including the opening and
closing characters.
- A tag may contain properties. Each property is either a single word or
an assignment.
- An assignment has a word on the left-hand side, an equals sign, and either
a word or a quoted string on the right-hand side.
- If it is a quoted string, anything can appear in the quotes, even the
closing symbol.
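As an illustration of the tag structure just described, the sketch below parses the text between the opening and closing symbols into a name and a dictionary of properties. The function name `parse_tag`, the choice to lowercase names, and the use of `True` as the value of a single-word property are all assumptions for this example. Accepting single quotes as well as double quotes is also an assumption.

```python
import re

# One property: a word, optionally followed by an assignment whose
# right-hand side is a quoted string or a bare word.
_PROPERTY = re.compile(
    r"""([^\s=]+)             # property name
        (?:\s*=\s*            # optional equals sign, spaces allowed around it
           (?:"([^"]*)"       #   double-quoted value (may contain '>')
             |'([^']*)'       #   single-quoted value
             |([^\s]+)        #   bare word value
           )
        )?
    """,
    re.VERBOSE,
)


def parse_tag(body):
    """Return (name, properties) for the text between '<' and '>'."""
    parts = body.strip().split(None, 1)
    if not parts:
        return "", {}
    name = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    properties = {}
    for match in _PROPERTY.finditer(rest):
        key = match.group(1).lower()
        quoted_double, quoted_single, bare = match.group(2, 3, 4)
        if quoted_double is not None:
            value = quoted_double
        elif quoted_single is not None:
            value = quoted_single
        elif bare is not None:
            value = bare
        else:
            value = True      # single-word property with no assignment
        properties[key] = value
    return name, properties
```

With this sketch, `parse_tag('a href="page.html" target=_blank')` yields the name `a` with properties `href` and `target`, and a closing tag such as `/ul` comes back as the name `/ul` with no properties.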
The following tags should be supported:
- Headers: <h1>, <h2>, <h3>
  - You need not change the font size; it suffices to put each header on
    a line by itself, with a blank line preceding it and another blank
    line following it.
- Hyperlinks: <a>
- Newlines: <br>
- Lists: <ul>, <ol>, <li>
- Comments: <!-- and --> (ignore all text between them)
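A minimal sketch of how some of these tags could be rendered to plain text is shown below. It assumes the document has already been reduced to `("tag", name)` and `("text", body)` pairs, which is a hypothetical representation chosen for this example. The bullet character, the indentation per nesting level, and the function name `render` are likewise assumptions. Hyperlinks and comments are left out, and any tag the sketch does not know is simply ignored.

```python
HEADER_TAGS = {"h1", "h2", "h3", "/h1", "/h2", "/h3"}


def render(elements):
    """Render ("tag", name) / ("text", body) pairs to a plain-text string."""
    out = []        # pieces of output text, joined at the end
    lists = []      # stack of open lists, each entry is [kind, item_count]
    for kind, value in elements:
        if kind == "text":
            out.append(value)
            continue
        name = value.lower()
        if name in HEADER_TAGS:
            out.append("\n\n")          # blank line before and after a header
        elif name == "br":
            out.append("\n")
        elif name in ("ul", "ol"):
            lists.append([name, 0])
        elif name in ("/ul", "/ol"):
            if lists:
                lists.pop()
            out.append("\n")
        elif name == "li":
            indent = "  " * max(len(lists) - 1, 0)
            if lists and lists[-1][0] == "ol":
                lists[-1][1] += 1       # ordered lists number their items
                bullet = f"{lists[-1][1]}. "
            else:
                bullet = "* "
            out.append("\n" + indent + bullet)
        # Any other tag is unrecognized here and is ignored.
    return "".join(out)
```

Keeping the open lists on a stack lets nested lists indent correctly and lets each ordered list keep its own item counter.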
Additional Text Processing
- Paragraphs:
<p> tags will start a new paragraph. A new paragraph is defined
to be an end-of-line followed by a blank line in the rendered output.
- Formatted Text: Wrap text so that it does not
exceed the horizontal bounds of the browser window. Replace newline
characters from the source document with spaces; also collapse all tabs and
all sequences of spaces more than one character long into a single space
character.
- Important Exception: Two or more consecutive
newline characters should start a new paragraph.
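The whitespace and wrapping rules above could be sketched like this. The function name `format_text` and the idea of measuring the window width in characters are assumptions; a graphical browser would more likely measure in pixels. The sketch relies on the standard `re` and `textwrap` modules.

```python
import re
import textwrap


def format_text(source, width=80):
    """Collapse whitespace, wrap to `width` columns, keep paragraph breaks."""
    # Two or more consecutive newlines start a new paragraph.
    paragraphs = re.split(r"\n{2,}", source)
    rendered = []
    for paragraph in paragraphs:
        # Single newlines, tabs, and runs of spaces all become one space.
        collapsed = re.sub(r"[ \t\n]+", " ", paragraph).strip()
        if collapsed:
            rendered.append(textwrap.fill(collapsed, width=width))
    # A paragraph break is an end-of-line followed by a blank line.
    return "\n\n".join(rendered)
```

Splitting on paragraph breaks first, and only then collapsing whitespace, is what keeps the exception from being swallowed by the general newline-to-space rule.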
- Special characters: Convert ampersand escape sequences to the
characters they represent. This will allow rendering of < (&lt;),
> (&gt;), & (&amp;), and " (&quot;). View the source for this page to see
the escape codes.
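One way to convert the four escape sequences is a small lookup table, sketched below. The names `ENTITIES` and `unescape` are hypothetical. Unknown sequences are left untouched, which is a design choice made for this example rather than a requirement. Python's standard library also provides `html.unescape`, which covers the full set of named and numeric character references.

```python
import re

# The four escape sequences the assignment asks for.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}

_ENTITY = re.compile(r"&(\w+);")


def unescape(text):
    """Replace each known "&name;" sequence; leave unknown ones untouched."""
    return _ENTITY.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), text)
```

Because the substitution is a single pass, `&amp;lt;` becomes the literal text `&lt;` rather than `<`. Apply it to text elements only, after tags have been separated out, so that an escaped `<` is never mistaken for the start of a tag.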
Buttons
Your browser should implement the proper behavior for the Forward, Back,
Home, Stop, and Reload buttons. Store pages visited in the current session
in an appropriate data structure. The Forward and Back buttons will then
access this data structure in order to perform their work.
Augment the user interface to enable the user to set the home page.
At your option, you may also allow the user to tell the browser to
always load this page when it starts up. Also augment the user
interface to enable the user to view the HTML source for a given page.
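For the session history behind the Forward and Back buttons, one candidate data structure is a list of URLs plus an index marking the current page, sketched below. The class name `History` and its method names are assumptions for illustration; other structures, such as a pair of stacks, would work as well.

```python
class History:
    """Pages visited in the current session, with a cursor for Back/Forward."""

    def __init__(self):
        self._pages = []    # URLs visited this session, oldest first
        self._index = -1    # position of the current page in _pages

    def visit(self, url):
        # Loading a new page discards any "forward" entries.
        del self._pages[self._index + 1:]
        self._pages.append(url)
        self._index += 1

    def back(self):
        if self._index > 0:
            self._index -= 1
        return self.current()

    def forward(self):
        if self._index < len(self._pages) - 1:
            self._index += 1
        return self.current()

    def current(self):
        return self._pages[self._index] if self._pages else None
```

In this sketch, Back and Forward at either end of the history simply return the current page, and Reload could call `current()` without moving the cursor.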
Hyperlinks
All hyperlinks should be visually differentiated from the surrounding text.
Clicking on a hyperlink with the mouse should load the hyperlinked page.
Relative hyperlinks should be handled properly. Clicking on
a hyperlink in the link window for the page should also work.
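Relative hyperlinks have to be resolved against the URL of the page that contains them. As a brief sketch, Python's standard `urllib.parse.urljoin` does this; the wrapper name `resolve` and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Turn the href of a hyperlink into an absolute URL."""
    # urljoin leaves absolute hrefs alone and resolves relative ones
    # against the URL of the page containing the link.
    return urljoin(page_url, href)


base = "http://example.com/a/b.html"
print(resolve(base, "c.html"))              # http://example.com/a/c.html
print(resolve(base, "/c.html"))             # http://example.com/c.html
print(resolve(base, "../c.html"))           # http://example.com/c.html
print(resolve(base, "http://other.org/"))   # http://other.org/
```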
URL Bar
The user should be able to enter a URL in the URL bar. When the user
performs an appropriate action, the URL in the URL bar should be loaded
into the browser.
Threading
Create a thread that will be responsible for the network communication.
At your discretion, the thread can also handle parsing and rendering.
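A possible shape for the network thread is a worker that takes URLs from a queue and hands each result to a callback, as sketched below. The class name `Loader` and the `fetch` and `on_done` parameters are assumptions for illustration. `fetch` stands in for whatever performs the download, for instance a function wrapping `urllib.request.urlopen`.

```python
import queue
import threading


class Loader:
    """Runs fetch(url) on a background thread, one request at a time."""

    def __init__(self, fetch, on_done):
        self._fetch = fetch          # hypothetical: callable that downloads a URL
        self._on_done = on_done      # hypothetical: called with (url, result)
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def load(self, url):
        # Called from the user-interface thread; returns immediately.
        self._requests.put(url)

    def _run(self):
        while True:
            url = self._requests.get()
            if url is None:          # sentinel posted by close()
                break
            try:
                result = self._fetch(url)
            except Exception as error:
                result = error       # report failures instead of dying
            self._on_done(url, result)

    def close(self):
        self._requests.put(None)
        self._thread.join()
```

Note that `on_done` runs on the worker thread in this sketch. Many GUI toolkits expect widgets to be updated only from the user-interface thread, so the callback would typically just pass the result back to that thread.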
Deadlines
For Friday, 10/23:
- Develop and implement the data structure for storing a processed HTML
document.
- Be prepared to discuss control flow and rendering strategy in class.
For Monday, 10/26:
- Demonstrate your program during class.
- It may still have a few bugs.
EXTENDED DEADLINE
For Friday, 10/30:
- Program due at noon.
- Submit via Sauron.
- Code reviews in class.
For Monday, 11/2:
- Code review due at 8:10 am.
- Code review must cover the entire browser, including the parts
implemented in the first part of the assignment.
- Submit via Sauron.