Project 1: Pages With Style

Version 1.0 of 5 January 1999 - Subject to Revision

Introduction

This is the first of two projects that you will complete as a team in CIS 422. The primary purpose of this first project is to give you practice working as a team to produce a complete software system by a deadline. You may also get some practice using existing components and tools to accelerate your development by minimizing the amount of new code that you must write.

I have attempted to make the first project technically easy, so most of the issues you will face will be non-technical - how to divide up and coordinate the work, how to work together and not hate each other, how to avoid catastrophe if one person gets sick or flakes out, etc. The project is small, but the deadline is tight.

HTML Converter

You will construct a tool for converting HTML 3.2 pages with presentational markup into HTML 3.2 or HTML 4 pages with separate CSS-1 style sheets and less presentational markup.

Background

"Presentational markup" is coding like <font color="red"> or anything else that directly describes the way content should be displayed, rather than its structure or meaning. In contrast, "structural markup" describes the (surprise surprise) structure of a document. For example, tags like <chapter> and <section> would be structural markup, if they were legal HTML.

It is generally considered best to separate structural markup from presentational markup, for several reasons. One is that it is often useful to apply different presentations to the same structure. For example, <font color=red size=+3> is not useful when I read web pages on a Palm Pilot or other device with a small, monochrome screen, and it is particularly useless to the vision-impaired reader who uses a voice browser. In fact, if you are in doubt as to whether a certain piece of markup is structural or presentational, a good test is to ask yourself, "How might that be presented in voice?"

CSS-1 is a language for describing presentational markup, very much like a "style sheet" in a word processor. A web page in HTML can contain a link to a style sheet in CSS-1, and the CSS-1 style sheet describes how presentational markup will be applied to the various entities in the web page.
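
As a rough illustration of the relationship between the two, the sketch below translates the attributes of a <font> tag into CSS-1 declarations. The function names and the size table are assumptions made for this example; in particular, the mapping from HTML font sizes to CSS-1 size keywords is a guess, not something taken from the CSS-1 standard or from the css.nu equivalence table, so check those sources before relying on it.

```python
# Hypothetical size table: HTML font sizes 1..7 mapped to CSS-1 keywords.
# This mapping is an assumption for illustration, not part of any standard.
FONT_SIZE_KEYWORDS = {
    1: "x-small", 2: "small", 3: "medium", 4: "large",
    5: "x-large", 6: "xx-large", 7: "xx-large",
}
BASE_FONT_SIZE = 3  # default size that relative values such as "+3" start from


def font_attrs_to_css(attrs):
    """Translate <font> attributes (a dict) into CSS-1 declarations (a dict)."""
    css = {}
    if "color" in attrs:
        css["color"] = attrs["color"]
    if "size" in attrs:
        raw = attrs["size"].strip()
        if raw.startswith(("+", "-")):
            size = BASE_FONT_SIZE + int(raw)   # relative size, e.g. "+3"
        else:
            size = int(raw)                    # absolute size, e.g. "5"
        size = max(1, min(7, size))            # clamp to the legal range 1..7
        css["font-size"] = FONT_SIZE_KEYWORDS[size]
    return css


def format_rule(selector, css):
    """Render one style sheet rule for the given selector."""
    body = " ".join(f"{prop}: {value};" for prop, value in css.items())
    return f"{selector} {{ {body} }}"
```

With these definitions, format_rule("span.warning", font_attrs_to_css({"color": "red", "size": "+3"})) produces the rule span.warning { color: red; font-size: xx-large; }. The selector name is made up; choosing good names is one of the open questions for your design.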

Here are some places where you can learn more about CSS-1 and how it is used:

The basic CSS-1 standard documents as well as several explanatory documents and pointers are maintained by the W3 consortium at [http://www.w3.org/Style/css/], with additional general background at [http://www.w3.org/Style/].
The CSS Pointers Group maintains pointers to a variety of useful references at [http://css.nu/index.html], including a table of CSS-1 equivalents for common HTML 3.2 presentational markup and lots of CSS-1 examples.
You should at least take a look at Dave Raggett's tool "Tidy", which does part of what I am asking you to do (as well as lots of other neat stuff). It is at [http://www.w3.org/People/Raggett/tidy/].

Basic requirements

You will construct a tool that transforms HTML 3.2 pages with presentational markup into HTML 3.2 or HTML 4.0 pages with less presentational markup, together with CSS-1 style sheets. The transformed pages should be equivalent to the original pages except for the "factoring out" of style information. The input to your tool is a set of pages; your tool should create a single CSS-1 style sheet for the whole set of pages.
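
To make the shape of the task concrete, here is a minimal sketch of one possible approach, not a reference design. It uses Python's standard html.parser module to rewrite <font color=...> tags as <span class=...> and collects one shared style sheet for a whole set of pages. The class name FontFactorer, the function convert_pages, and the generated class names style1, style2, and so on are all assumptions made for this example. The sketch converts only <font> tags whose sole attribute is color, passes everything else through unchanged, and does not add the link to the style sheet that each page would need.

```python
from html.parser import HTMLParser


class FontFactorer(HTMLParser):
    """Rewrite <font color=...> as <span class=...>, recording shared styles."""

    def __init__(self, styles):
        super().__init__(convert_charrefs=False)
        self.styles = styles     # shared across pages: colour -> class name
        self.out = []            # pieces of the rewritten page
        self.font_stack = []     # True where an open <font> became a <span>

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            attrs = dict(attrs)
            convertible = set(attrs) == {"color"} and bool(attrs["color"])
            self.font_stack.append(convertible)
            if convertible:
                color = attrs["color"].lower()
                name = self.styles.setdefault(color, f"style{len(self.styles) + 1}")
                self.out.append(f'<span class="{name}">')
                return
        self.out.append(self.get_starttag_text())   # copy other tags verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack and self.font_stack.pop():
            self.out.append("</span>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def convert_pages(pages):
    """pages maps file name -> HTML text. Returns (new pages, style sheet text)."""
    styles = {}
    converted = {}
    for name, html in pages.items():
        parser = FontFactorer(styles)
        parser.feed(html)
        parser.close()
        converted[name] = "".join(parser.out)
    sheet = "\n".join(f".{cls} {{ color: {color}; }}" for color, cls in styles.items())
    return converted, sheet
```

Because the styles dictionary is shared by every page, two pages that both use red text end up referring to the same class in the single style sheet. A real tool would also have to cope with malformed HTML, which is one reason to look at what existing components already do.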

When you are done, your project should be a high-quality freeware tool that can be distributed in source form over the internet. It must include full documentation, including installation and configuration instructions and examples.

Other requirements

This handout is not a complete statement of requirements for your product, because it is your job to complete the requirements. There are many open issues for you to resolve. Here are a few things to think about:

What platforms will your product run on?
Which particular markup will your tool translate?
Under what conditions should presentational markup be duplicated in the style sheet? Note that a style sheet is useless if there is a style description for each individual occurrence of presentational markup in the original document, but in some cases there may be structurally distinct classes of elements which happen to be mapped to the same presentation.
How will names of structural entities in the HTML be managed? It isn't very useful to have styles associated with elements like ELEMENT0098 and ELEMENT0099, but you can't generate meaningful names out of thin air. Will you support some sort of iterative process in which users can provide more meaningful names?
What is the capacity of your tool, in terms of the number of pages it can handle, the size of those pages, and the ways those pages can be organized (e.g., can it walk over a set of directories containing a hierarchy of web pages in subdirectories)?
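
Some of the questions above can be explored with very little code. The following sketch, which is only one possible starting point, shows a directory walk that finds pages in subdirectories and a naming step that gives each distinct presentation a single class name, with optional user-supplied names overriding the generated ones. The function names, the styleN naming scheme, and the form of the preferred argument are all assumptions for this example. Note that it merges elements purely by presentation, so it does not address the case of structurally distinct elements that happen to look the same.

```python
import os


def find_html_files(root):
    """Return sorted paths of .html/.htm files under root, including subdirectories."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith((".html", ".htm")):
                found.append(os.path.join(dirpath, filename))
    return sorted(found)


def assign_class_names(occurrences, preferred=None):
    """Map each distinct set of declarations to one class name.

    occurrences: iterable of dicts such as {"color": "red"}, one for each
        piece of presentational markup found in the pages.
    preferred: optional dict from a frozenset of (property, value) pairs
        to a name chosen by the user.
    """
    preferred = preferred or {}
    names = {}
    for decls in occurrences:
        key = frozenset(decls.items())
        if key not in names:
            # Identical presentations share one entry instead of one per occurrence.
            names[key] = preferred.get(key, f"style{len(names) + 1}")
    return names
```

Running the naming step once with no preferred names, showing the generated names to the user, and running it again with the user's choices is one simple form of the iterative process mentioned above.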

These issues barely scratch the surface. The bottom line is: Build a tool that people will want to use, and that they will be able to use easily. I predict that you will find deciding what to build at least as hard as building it. You should identify objectives, alternatives, and constraints for your project, as well as you are able, and begin to identify risks and the ways you will obtain more information to assess and control those risks.

Product Concept Document

When you have formed your team, prepare a document and presentation briefly describing your product concept, as if you were proposing it to management within your company. Your product concept should address some of the issues listed above, as well as your basic strategy for building your tool. Explicitly consider the major risks, technical and non-technical, faced by your project.

Reuse Guidelines

The cheapest, most dependable and least risky software components are those you don't build. You may find useful components that you can reuse, such as some components of Dave Raggett's tool "Tidy." I strongly encourage you to scavenge and reuse code whenever you can. On the other hand, you must do so in a way that is legal and ethical, and while I won't set an upper bound on how much of your project code can be reused, you must certainly provide some "value added" and not merely repackage software available elsewhere.

To be legal, you must obey all copyright restrictions in software you use. Beware that a document or file need not contain an explicit copyright statement to be protected by copyright law; you have a right to copy or reuse something only if the author has specifically granted you that right. I am absolutely firm on this, and will not hesitate to fail an individual or a whole team for unethical conduct as regards intellectual property. If you have any questions about what you may or may not do, ask me.

Your product must be freely distributable under the GNU copyleft agreement. In some cases this may mean that you cannot make use of some software which is otherwise perfect. In other cases it may mean that your product will depend on other software packages that you cannot directly distribute. (Be careful of such dependencies, especially on commercial software, as they can make your product more difficult to install and use.)

To be ethical, you must clearly document the original source of all software and other documents. Every source file must contain header comments clearly identifying its author(s). Derivative work (e.g., code written by you but adapted from a book) must clearly cite each source used in its creation. Falsely identifying yourself as the author of something that is actually someone else's work, or failing to properly cite a reference on which you based part of your work, is plagiarism and will be dealt with very severely.

It is entirely possible to follow these guidelines, making only legal and ethical use of other people's work, and still to avoid a lot of design and coding that would be required if you built this project "from scratch." Sometimes you will find that, even if you cannot directly reuse code (e.g., because it is written in a different programming language), you can still reuse design. You should properly cite the sources of reused design as well as reused code.

Schedule and Deadlines

Working to deadline is a key element of this project and this class. The deadline is firm. If I accept late projects at all, it will be with significant penalties that you really don't want to have imposed. If you reach the deadline and don't have a product to turn in, it's a disaster.

How can you avoid this disaster? The main techniques are explicit risk control, design-to-schedule, and iterative design and implementation.

Explicit risk control

It sounds obvious and trivial, but it's important: Spend some time up front thinking specifically about what might go wrong, and how you can minimize your risk exposure. Risk is often related to uncertainty, and is often addressed by ordering tasks to gain useful information early. For example, if there are certain aspects of the system that you are less confident in than others, you should make sure those parts of the system are built or at least prototyped near the beginning, not near the end of the project.

Design to schedule

The project schedule should be an explicit, primary consideration in your design. If you're not confident of being able to incorporate a feature within the schedule, leave it out. If it is taking too long to implement some feature, find a way to do without it. A useful technique here is "timeboxing," which essentially means that when some part of the project is taking longer than expected, rather than adjusting the schedule you find ways to scale back the design.

Iterative design and implementation

Iterative development combines risk control and design-to-schedule. Your motto should be, "build early, build often." As early as possible, you should get into a mode where you always have a working version of the product, even if it doesn't do much. In fact, you should always have two versions of the product: The one you just built, and the last one that is known to work. Rather than assembling and testing the product near the end of the project, you should be continually adding to it. I suggest that by the end of week 2 you should be on a schedule of building your product at least twice a week, and by the end of week 3 you should be building at least daily, perhaps twice a day.

Iterative development particularly addresses the risk of missing the deadline: As soon as you start building working products, you have something to turn in, even if it's not much. From then on it's just getting better, and the question is not whether you will have something to turn in but how good it will be.

Iterative development can also be used to address other kinds of risk. If part of the project is considered high-risk but essential, it should become part of the product very early so that you can either gain confidence that it is ok or you can start as soon as possible to find an alternative. If something is considered high-risk and non-essential, you may want to put it off until all of the essential parts of the product are working.

Finally, iterative development is useful for exposing problems early. This includes integration problems (getting the pieces to work together), but also schedule problems since each person's progress becomes visible to other team members. If someone says "I'm 80% done but nothing is working yet," don't believe it; insist that the part that is done be integrated into the product.