One of the primary foundations of the World Wide Web is the HyperText Markup Language (HTML). HTML is the primary format in which documents are distributed and viewed on the Web. Many of its features, such as platform-independent formatting, structural design, and especially hypertext, make it a very good document format for the Internet and the WWW.

This chapter gives you a basic understanding of HTML and how you can create documents in this format. A brief description of the common tags and a style guide to creating good HTML documents help you on the road to getting your information onto the WWW. A few of the more advanced features, as well as a look to the future of HTML, are also covered.

Background

As one of the foundation specifications that define the Web (along with HTTP and URLs), HTML was originally developed by Tim Berners-Lee at CERN in 1989. HTML was envisioned as a format that would enable scientists using very different computers to share information seamlessly over the network, and several features were necessary to make that possible. Platform independence, in which a document can be displayed similarly on computers with different capabilities (for example, fonts, graphics, and color), was vital to the varied audience. Hypertext, meaning that any word or phrase in one document could reference another document, would allow for easy navigation between and within the many large documents on the system. Rigorously structured documents would allow for advanced applications such as converting documents to and from other formats and searching text databases.

SGML and HTML

Berners-Lee chose to use the Standard Generalized Markup Language (SGML) as a pattern. As an emerging international standard, SGML had the advantages of structure and platform independence. Its status also ensured its long life, meaning that documents formatted in SGML would not need to be rebuilt a few years later.

SGML is platform-independent because it focuses on encoding the semantic structure, or meaning, of a document—not necessarily its appearance. Thus, a chapter title would be labeled, "Chapter Title," instead of "Helvetica 18pt Centered." Although the latter style breaks down if the document is viewed on a computer that doesn't have the Helvetica typeface or support for lettering of different sizes, the former style can be displayed (intelligently) on any system. Each reader defines the appearance of chapter titles in a way that is useful on his or her computer, and any text with that style is formatted accordingly.

Another feature of this structure is that semantically encoded text can be automatically processed more intelligently by the computer. For example, if every chapter title is marked with the label "Chapter Title," perhaps with the chapter number as an attribute, a reader could request to see just Chapter 18; the SGML software would automatically look for the Chapter 18 title and the Chapter 19 title and extract everything between them. This could not be done with the text marked with meaningless (to the computer) fonts and formatting codes.

A great advantage of SGML is its flexibility. SGML is not a format in its own right, but a specification for defining other formats. Users can create new formats to encode all the structure of certain types of documents (for example, technical manuals, phone books, and legal documents), and any SGML-capable software can understand it, simply by reading the definition first. A large number of Document Type Definitions (DTDs) have been created, both for common and very specialized documents. HTML is simply one DTD, or application, of SGML.

The Evolution of HTML

For several years, the use of HTML (and the WWW) grew slowly, despite these capabilities. This was primarily because it did not have enough features to do any kind of professional electronic publishing; it had some font control, but no graphics. Semantic encoding was not important to people when they couldn't make it look pretty.

Then everything changed. When NCSA first built Mosaic in early 1993, they added their own features to HTML, including inline graphics. This suddenly allowed people to attach logos, icons, photographs, and diagrams to their documents; the size and usage of the Web exploded. For the next year, the development of HTML happened on a very ad hoc basis. New pieces of HTML were introduced by one browser or another from time to time; some would catch on, and others would disappear. Some of the additions were poorly designed, and many were not even SGML-compliant.

By May 1994, it was apparent that HTML was growing out of control. At the first WWW conference in Geneva, Switzerland, an HTML Working Group was organized. Its primary task was to formalize HTML, as it was being used, into an SGML DTD known as HTML Level 2. (Level 1 was defined to be HTML as it was originally designed by Tim Berners-Lee.) Once standardized, it could then be safely extended to future levels, and still take advantage of the capabilities of true SGML and its formal structure. At the time of this writing, HTML Level 2 is nearing completion, having gone through several drafts, and is becoming the standard format that all WWW browsers can understand.

Even though it isn't standard, HTML 3.0 is already in wide use today and adds many needed features to the HTML 2.0 specification. Chapter 12, "Netscape Extensions and HTML 3.0," and Appendix H, "HTML Encyclopedia," give you the rundown on which features are in which versions. This chapter sticks to the basics of HTML 1.0 and 2.0 so you can ease into it.

HTML documents are in ASCII text format and can be created using most text editors. There are some Windows editors available specifically for HTML editing. We have included one on the CD that is very simple and intuitive: WebEdit.

A Basic Document

Let's first take a look at a simple HTML document to see how one normally appears. The easiest way to look at HTML is to let a Web browser interpret the file for us. Figure 9.1 shows a very simple HTML file as it would appear on the Web.

Figure 9.1. A simple HTML page showing text and graphics.

Listing 9.1 is the HTML code used to display Figure 9.1. As you look at the HTML code, you should notice that it isn't too difficult to match up the text with the appearance of that text in the Web browser. You can learn many things about the HyperText Markup Language from this basic document.

Listing 9.1. The HTML code from our simple Web page.

<HTML>

<HEAD>

<TITLE>Boston's Story</TITLE>

</HEAD>

<BODY>

        <H1>Welcome to Boston's Life</H1>

        Hi, my name is Boston. Here is a picture of me:<P>

        <IMG SRC="boston.jpg"><P>

        <H4>A Brief Autobiography</H4>

        <UL>

               <LI>Born in Bonsall, CA March 5, 1995.

               <LI>Got my shots and went to new home in San Diego, April 30, 1995.

               <LI>Now spend time catching Frisbees and looking out the window.

        </UL>

        <HR>

        <ADDRESS>Okay, so e-mail me: boston@xyz.com

        </ADDRESS>

</BODY>

</HTML>

It is always important to remember that HTML (as an application of SGML) encodes only the structure of the document. Much of the appearance of the document, such as type styles, color, and the window size, is under the ultimate control of the browser and the people using it. However, most browsers render things similarly; as different parts of HTML are described, their normal rendering is also given.

Basic HTML Syntax

An HTML document consists of two types of contents: normal document text and codes, or tags. Tags are text strings surrounded by a less-than and greater-than sign, such as <HTML> in the first line. Tags usually have the following structure:

<tagname attribute=value attribute=value . . . >

The tagname is the type of text being defined by the tag; the attributes (some tags have none, some have several, but most are optional) give additional information about how the element should behave.

For example, in the <HEAD> tag in the second line of the sample HTML file, HEAD is the tagname and has no associated attributes. Farther down in the file is a tag with the tagname IMG and a single attribute SRC that has the value "boston.jpg". It is important to remember that tagnames and attribute names are not case-sensitive; you can use uppercase and lowercase letters as you want. The values assigned to the attributes may be case-sensitive, depending on the attribute.
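
As a small illustration of the case rule, the two tags in the first group below should be equivalent to a browser. The file name comes from the sample listing. The differently capitalized file name in the last tag is a hypothetical variation, and whether it would still find the picture depends on the server.

```html
<!-- Tag and attribute names may be typed in any case. -->
<IMG SRC="boston.jpg">
<img src="boston.jpg">

<!-- The attribute value is a different matter: on a case-sensitive
     server this might point to a different file, or to none. -->
<IMG SRC="Boston.JPG">
```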

The tags and text combine to form elements. Each element represents an object in the document, such as a heading, paragraph, or picture. An element consists of one or two tags and usually some associated text.

There are two types of elements: containers and empty elements. Container elements represent a section of text and consist of body text (or other elements) delimited by a tag at the beginning and the end. (The end tag is identified by a / before the tagname and never carries any attributes.) For example, in the third line of the sample file, the <TITLE> and </TITLE> tags define the text between them as a title.

On the other hand, an empty element consists of a single tag that does not alter any text; instead, it inserts something into the document. For example, the <IMG SRC=...> tag/element places the picture in the document.

Together, container elements and empty elements completely define how a document is to be formatted and displayed. Other things normally used to format text (such as tabs, extra spaces, and carriage returns) are treated as a single space in HTML. For example, the sample HTML file could have been typed with three blank lines after every tag and ten spaces between each word, and it would appear exactly the same (just as it would if the entire file had been typed on a single line). Although this might make simple formatting more difficult, it enables writers to make the HTML document more readable by using programming style techniques such as extra blank spaces and tabs (as are used in the sample file), without affecting the display of the final document.
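
To make the whitespace rule concrete, here is a sketch of two fragments that a browser should display identically. The wording is taken from the sample listing; only the spacing differs.

```html
<!-- Typed compactly on one line -->
<H1>Welcome to Boston's Life</H1> Hi, my name is Boston.<P>

<!-- Typed with extra line breaks, tabs, and spaces -->
<H1>
        Welcome   to
        Boston's     Life
</H1>

        Hi,   my   name
        is   Boston.
<P>
```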

Description of Elements in a Sample Document

This section looks at the elements used in the sample document. The sample file contains the common tags used in most documents. (More thorough definitions of each element are given later in the chapter.)

First, three container elements should appear in every HTML file. You might imagine these container elements as sandwiches—like pieces of bread, each opening tag must be followed by the corresponding closing tag.

<HTML> text </HTML>: This element contains the entire file (that is, the first tag appears at the beginning of the file, and the second tag appears at the end of the file) and defines the enclosed text as an HTML document. This element, the largest container, holds the following two container elements, in order.
<HEAD> text </HEAD>: This element is the header and contains information about the document (usually one to three lines) that is not part of the text. It plays the same role as the running head on each page of this book: It gives context and position to the text but is not part of the narrative.
<BODY> text </BODY>: This element contains other elements representing the body text of the document, normally almost all the file's length.

Together, these three elements create a template, which all HTML documents should follow:

<HTML>

<HEAD>

   Header Elements

</HEAD>

<BODY>

   Body of Document

</BODY>

</HTML>

The <HEAD> element can contain several unique elements; however, most documents contain only the one shown in the example:

<TITLE> text </TITLE>: This element is the title of the document. The title is normally shown in the browser separate from the text page (for example, in the window frame or in a part of the window separate from the document).

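The <BODY> element in the sample file contains several common elements. They are listed here along with two others, <A> and <BR>, that do not appear in the sample but are used in most documents: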

<H1> text </H1>: This element identifies the enclosed text as a major heading (for example, the title at the beginning of a document). You can have up to six levels of headings by using the tags <H1>, <H2>, and so on up to <H6>. (The lower numbers signify headings of greater importance.) Headings are normally rendered in a larger type (more important headings are in a larger type) with a blank space above and below.
<P>: This tag marks the separation between two paragraphs of body text (that is, text not part of some other element).
<IMG SRC="boston.jpg">: This element places an image in the document, which can be found at the URL given in the SRC attribute. (See Chapter 1, "Internet Technology Primer," for an explanation of URLs.)
<UL> text <LI> text </UL>: This construction provides an unordered list of items; the <LI> tag begins each item. Normally, a bullet is placed at the beginning of each entry.
<A HREF="http://www.mtsmith.vt.us/">text</A>: This kind of element marks a hypertext anchor, also known as a hyperlink. The text is highlighted in some way on-screen (in color, with an underline, or something similar); when that text is selected on-screen (for example, by clicking it with the mouse), the document given by the URL in the HREF (Hypertext Reference) attribute is retrieved.
<HR>: This element places a horizontal rule, or line, across the window, normally with a space above and below.
<ADDRESS> text </ADDRESS>: This element marks a block of text that serves as a postal or electronic mail address. The address is normally rendered in a slightly different font than body text (for example, smaller italic type) and does not use the extra space placed between body paragraphs (formatted with the <P> element).
<BR>: This element forces a line break in the text so that any succeeding text is placed on the next line.
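
The sample file does not use <A> or <BR>, so here is a minimal sketch of how they might appear in it. The link text and the street address are invented for illustration; the URL is the one shown in the list above.

```html
<BODY>
        <H1>Welcome to Boston's Life</H1>

        <!-- A hyperlink: the enclosed words become the highlighted text -->
        My favorite place to hike is
        <A HREF="http://www.mtsmith.vt.us/">Mount Smith</A>.<P>

        <HR>

        <!-- BR keeps each part of the address on its own line -->
        <ADDRESS>
                Boston<BR>
                123 Example Street<BR>
                San Diego, CA
        </ADDRESS>
</BODY>
```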

These elements are described in more detail, along with many other valid elements, later in this chapter.

Writing Documents

Now that you have seen an HTML document in action, you're probably wondering, "How can I make one of these?" There are several options for creating HTML files, ranging from the powerful and difficult to the easy and simplistic. Most of the current HTML tools are not as useful as they could be, but the large demand for easy and powerful HTML tools ensures that they will become more robust in the near future.

Text Editors

Because HTML documents are really plain text files, the first (and currently most common) solution is to create them using a garden-variety text editor, such as Notepad. You create the HTML document by typing it exactly as it is to appear—including typing the tags by hand—and you finish with a file that looks just like the sample file shown earlier in this chapter.

The drawback of this approach is that because these editors are ignorant of the type of file you are entering, they cannot help you at all. They cannot correct poor syntax, offer any suggestions on element usage, or show how the finished product will appear in a WWW browser. You have to be careful to get the document right and often have to edit it many times to correct mistakes. If you decide to use a text editor to create HTML, you should also have a WWW browser available to check the document often and find any problems to be fixed.

HTML Editors

Between the two of us, we have tried over a dozen methods of HTML file creation. The one we agreed was the easiest is a simple but powerful program called WebEdit by Ken Nesbitt. WebEdit is included on the CD, and we discussed the installation process in Chapter 4, "Up and Running Fast." WebEdit is shareware. There are many other shareware and freeware HTML Editors available on the Internet, but after observing the difficulty of using some of the other packages, we welcomed WebEdit as a companion for most of our HTML editing tasks.

Word Processor Templates

Tools in this category are not programs in their own right, but exist as macros or accessories that operate within your favorite word processor or desktop publishing program. The advantage of these templates is that they enable you to create HTML documents using the same tools and interface you use for creating normal documents; they output files in HTML instead of the program's normal format. The disadvantages are that the templates are not currently available for most word processing software and that using a large word processor to create a small, one-page document can be slow and cumbersome. However, these templates are probably very good for working on large HTML documents. Here are some currently available:

Internet Assistant (for Microsoft Word), from Microsoft:

http://www.microsoft.com/msoffice/freestuf/msword/download/ia/default.htm

GT_HTML.DOT (for Microsoft Word), from Georgia Tech University:

http://www.gatech.edu/word_html/release.htm

HTML Converters

Many of the documents you want to contribute to the WWW likely already exist on your computer. Most people have a large number of documents previously created using a word processor or desktop publishing program; they do not want to have to re-create the documents or convert them to HTML by hand. To assist in this process, several tools can convert existing documents to HTML. They simply take the codes from the software's internal format and convert them into HTML elements.

For these converters to work cleanly, your original document should be constructed with the same philosophy used with HTML and SGML: using a clear, semantic structure. For example, if named styles (such as Chapter Title and List Item) are used in the original document, these styles can be converted directly into corresponding HTML elements (Chapter Title = <H1>, List Item = <LI>, and so on). On the other hand, nonsemantic markup (such as "Helvetica 14pt centered") is difficult or impossible for the converter to interpret. Almost every word processor and desktop publishing program has a styles feature.

Two types of HTML converter tools are discussed in the following sections.

Word Processor Macros

These operate within the word processor or desktop publisher program, going through the document line by line and converting each code to an HTML equivalent. In the end, the user sees a raw HTML file that can be saved as plain text. Here is one package that does this for Word:

ANT_HTML.DOT (for Microsoft Word), by Jill Swift:

http://www.w3.org/hypertext/WWW/Tools/Ant.html

Stand-Alone Conversion Programs

These tools are used outside the originating software. They read the original document from the disk, converting it and saving the result as an HTML document. Here are a few of them; if your software is not represented, you can probably convert the file into a format that can be used by one of these tools. (For example, you can convert the file into RTF format and then convert that into an HTML file.)

RTFTOHTML (for Rich Text Format files):

ftp://ftp.cray.com/src/WWWstuff/RTF/rtftohtml_overview.html

qt2www (for Quark Xpress), by Jeremy Hylton:

http://the-tech.mit.edu/~jeremy/qt2www.html

Document Style and Organization

As you begin to write HTML documents, it is important that you keep in mind the following tips. Having your documents obey these general style rules should make them better looking, more useful to readers and more frequently used, and easier for you to maintain:

Thoroughly plan your information. The only reason people put information on the WWW is that they hope others can use it. (Often, the contributor expects to subsequently benefit from this use.) Thus, your primary goal in organizing the documents and files you place on your server is to make your information easy for users to access.

Although this organization differs for every site, you should keep some things in mind. Use hypertext liberally; the more possible avenues people have to navigate through your information, the better the chance they find what they want. Create navigational pages such as directories and tables of contents to aid people in searching for information. Also, be very clear when describing links and menu choices; this decreases the number of wrong roads your users take.
Use valid HTML. In the early days of the World Wide Web, HTML was not well-defined, and neither was the way it was to be rendered. Many tags mutated into several forms, and browsers were written to be lax in parsing documents so that they could handle the several forms in which each tag appeared. Although HTML has become more structured and stable, the browsers often still allow for variant syntaxes so that they can read the large number of old documents out there.
Although many "cheater" syntaxes (for example, using <UL> without <LI> to make indented paragraphs) might produce a pleasing result on your browser, their appearance varies wildly from one browser to another (some browsers ignore the lone <UL> altogether) and might produce a very poor display on somebody else's screen. Although you cannot have complete control over what appears on each user's screen, your best bet for creating fairly uniform-looking documents is to use HTML as it was designed.
Use small files. When a document is created on paper, it normally consists of one large file and is distributed as a single stack of paper. This approach is often undesirable on the World Wide Web. People generally don't like to read large quantities of text on a screen. A reader would also be very hesitant to download a 1 MB file when he or she is looking for a single paragraph.

The great advantage of hypertext is that it allows for nonlinear text: Readers can bounce around inside and between documents, reading and understanding pieces in the order and method that best suits them individually. A good document is broken into many small files, each no more than a screen or two in length, interconnected with the <A> tag to produce hyperlinks at appropriate places. Good subdocuments for a table of contents and index allow users to find and retrieve just those pieces of the document that they need.
Keep in mind that some users will naturally want to print certain documents. If your document is contained in many separate HTML files, the reader will have to link to each of them in order to print or save the whole topic. You can decide on a case-by-case basis what is best for each document you present.
Date the page. It is a good idea to include the date a document was last edited. This lets people returning to your page know if it has been updated.
Don't overdo graphics. Although displaying graphics as part of a document is one of the most powerful capabilities of HTML, it is often abused. Images use much more bandwidth than normal text, so a page with many large graphics takes much longer to download than one without. In fact, many users, such as those connecting over slow telephone lines and those using text-only terminals (still a significant part of the Internet audience), will not even see your graphics. Graphics also increase the space your document takes on the screen, forcing people to scroll down to see the rest of the page. Here are a few good rules of thumb when dealing with graphics:

1. Concentrate your graphics where they do the most
good, such as illustrations, logos, and mastheads
(the large images that appear at the top of home
pages to give the service a corporate image).

2. Cut down the number of colors in each image. Most
monitors display only 256 colors at once, so the
colors of all images on the page must fit in this
number. If you don't trim the images, the browser
will, and it rarely does an acceptable job. If
you're including photographs, they should use about
50 to 100 colors each. (You can set this limit with
most graphics software.) Limiting colors also
reduces the size of the file to be downloaded.

3. Make graphics as small (in memory size) as possible.
For example, if you want to include a photograph,
put a thumbnail (a smaller replica) of the photo in
the document, which is linked to the full-size
graphic that people can download if they really want
to view it. Or link a larger graphic with a text
reference.

4. Never rely on the graphics to communicate your
message. Any important information (titles, menu
choices, and so on) that appears in the graphics
should also appear in the text. This might mean
using the ALT attribute in the <IMG> tag or having a
duplicate page that text-only browsers can use.

5. You might supply a link to the graphic that tells
the user how large it is before he or she decides to
download it.
Test your document with multiple browsers. Browsers vary markedly in how they render HTML. Also, some browsers (such as Netscape) use additional elements that are not part of "true" HTML and that other browsers do not support. The Web has many documents that were obviously written with a single browser in mind because they look awful on all the rest.

If possible, gain access to at least two browsers (preferably a graphics one and a text-only one) that you can use to view your documents. Although the documents you create with this method might not look as good on your favorite browser as they could, they will look fairly good on all browsers.
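
Several of the tips above can be sketched in a single small page. Everything specific here is hypothetical: the file names (contents.html, boston-small.gif, boston-large.gif), the file size, and the date are placeholders, not values from this chapter.

```html
<HTML>
<HEAD>
<TITLE>Boston's Story: Photo Album</TITLE>
</HEAD>
<BODY>
        <H1>Photo Album</H1>

        <!-- Valid list markup: every entry inside UL begins with LI -->
        <UL>
                <LI><A HREF="contents.html">Back to the table of contents</A>
                <!-- A small thumbnail links to the full-size picture,
                     and the text states the download size up front -->
                <LI><A HREF="boston-large.gif"><IMG SRC="boston-small.gif"
                    ALT="Boston catching a Frisbee"></A>
                    Full-size picture (about 120 KB)
        </UL>

        <HR>

        <!-- Date the page so returning readers can spot updates -->
        <ADDRESS>Last edited November 1, 1995</ADDRESS>
</BODY>
</HTML>
```

Because the ALT text and the caption carry the same information as the picture, a reader on a text-only browser should still be able to use the page.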

Element Reference

The following sections provide a brief guide to almost all the elements used in HTML Level 2. For a more comprehensive reference, see the official HTML 2.0 specification at http://www.w3.org/hypertext/WWW/MarkUp/html-spec/index.html. Remember that the tag and attribute names are not case-sensitive and can be in uppercase or lowercase letters.

<HEAD> Elements

The following tags are allowed in the header part of the HTML document.

Document Title

This is the name of the document. The title is generally written with a broader scope than the current document in order to give the user a frame of reference. For example, if the document is a chapter of a book, the <TITLE> would probably contain the title of the book as well as the chapter title. Thus, if someone followed a hyperlink from somewhere else directly to this chapter, he or she would not be lost, but would know that this file is part of a certain book. For example:

<TITLE>text</TITLE>

External Link

This establishes a relationship between the current document and another document. The name attribute gives the link a name, such as Mail to Author. The rel attribute describes the type of link, such as "made" (the author), "parent" (a larger document of which this is a part), "next" (the succeeding section of a multifile document), and "prev" (the previous section). The HREF attribute points to the related document. Currently, most browsers don't make use of this tag, but future browsers will likely add a new button to the screen for each <LINK> to allow users to easily jump to the related document. For example:

<LINK name="text" rel="text" href="URL">
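
As a hedged sketch, the header of one chapter in a multifile document might use <LINK> as follows. The file names are hypothetical, the mail address is the one from the sample listing, and the rel values are the ones described above; the optional name attribute is left out for brevity.

```html
<HEAD>
<TITLE>Boston's Story: Chapter 2</TITLE>
<!-- One LINK per relationship; all file names are placeholders -->
<LINK rel="parent" href="index.html">
<LINK rel="prev" href="chapter1.html">
<LINK rel="next" href="chapter3.html">
<LINK rel="made" href="mailto:boston@xyz.com">
</HEAD>
```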

Document Meta-Information

This allows for extra information about a document, such as its modification date, copyright, or abstract. This is done by setting a name and value, such as <META NAME="copyright" CONTENT="1995, Sams.Net Publishing">. Separate <META> tags are included for each item of information. Currently, this tag is seldom used in browsers. For example:

<META NAME="text" CONTENT="text">

Location of Current Document

This lets you specify the full URL of this document. Although it might seem redundant, this information is useful if you use relative URLs in the hyperlinks. Using this base, the hyperlinks are resolved correctly even if this document is requested with a different URL than you expect (for example, if users save it on their local disk and try to use it there). For example:

<BASE HREF="url">
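
The following sketch shows how a relative hyperlink is resolved against the base. The host name www.xyz.com and the file names are assumptions made up for this illustration.

```html
<HEAD>
<TITLE>Boston's Story</TITLE>
<!-- Hypothetical home of this document on the Web -->
<BASE HREF="http://www.xyz.com/dogs/boston.html">
</HEAD>
<BODY>
        <!-- Relative URL: resolved against the BASE above, giving
             http://www.xyz.com/dogs/photos.html -->
        <A HREF="photos.html">More photos</A>
</BODY>
```

Without the <BASE> tag, a copy of this file saved to a local disk would look for photos.html in the local directory instead.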

Searchable Document

This places a search field either in the document or elsewhere on the screen, enabling users to enter keywords to search through this document. You can't just add this tag to any arbitrary document and expect it to work. Your server must be set up to process this query, using a back-end search engine such as WAIS. For a full discussion of this topic, see Chapter 19, "Databases and the Web." An example follows:

<ISINDEX>

Empty Elements

As stated earlier in this chapter, empty elements are elements that insert objects into the document by themselves, regardless of the surrounding text. They each consist of a single tag. For example:

<IMG SRC="graphic.gif">

Horizontal Rule

This places a horizontal line across the page, with a blank line above and below, and is normally used to separate major sections of a document (for example, before an <H1> or <H2>). Some graphical browsers give the rule a 3-D chiseled look. For example:

<HR>

Line Break

This forces subsequent text to the next line. Unlike the <P> tag, the text before and after the <BR> tag is still considered a single paragraph. The <BR> tag is normally used to create tight blocks of short-line information, such as mailing addresses. For example:

<BR>

Inline Image

This places an image within the document, as found at the URL specified in the src attribute (which is mandatory). The most common format for these images is CompuServe's Graphics Interchange Format, or GIF. If the browser doesn't support inline images (for example, the Lynx browser does not), the text given in the optional alt attribute is displayed. If no alt attribute is given, a default placeholder such as [IMAGE] may be displayed in this situation. (To ensure that nothing is displayed if the graphic cannot be shown, use the alt="" attribute.) The optional align attribute specifies how the image is to be aligned vertically with the current line of text. (The default alignment is most often BOTTOM, but this varies by browser.)

The ISMAP attribute lets you create interactive graphics, or imagemaps. If the syntax <A href="http://URL1"><IMG src="URL2" ismap></A> is used and you point to a spot on the image, the x and y coordinates are passed to the hyperlink (for example, http://URL1?x,y). However, the HTTP server must be able to handle imagemap queries. Chapter 10 gives step-by-step details for doing this with Purveyor and/or FolkWeb. For more information on imagemaps, look for The World Wide Web Unleashed, Second Edition, published by Sams.Net. For example:

<IMG src="URL" alt="text" align=TOP/MIDDLE/BOTTOM ISMAP>
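
The attribute combinations described above might be written as follows. The file names bullet.gif and menu.gif, the alt text, and the imagemap URL are placeholders; as noted, the imagemap works only if the server is set up to handle the query.

```html
<!-- Text-only browsers show the ALT text instead of the picture -->
<IMG src="boston.jpg" alt="Photograph of Boston" align=MIDDLE>

<!-- Purely decorative image: an empty ALT displays nothing at all -->
<IMG src="bullet.gif" alt="">

<!-- Imagemap: pointing at a spot sends its x,y position to the link,
     for example http://www.xyz.com/maps/menu?42,17 -->
<A href="http://www.xyz.com/maps/menu"><IMG src="menu.gif" alt="Site menu" ismap></A>
```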

Comment

Any text inside this element is ignored. This element is used to include notes that can be read by the writer but that are not part of the text of the document (which is especially useful if several writers work on the same document). The useful programming technique of temporarily commenting out sections of code cannot be relied on here: many older browsers treat a single > as the closing character of the comment, so any tag included in the comment (such as <BR>) causes the comment to end early, and the browser interprets the remaining comment text as body text. For example:

<!-- text -->
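
Here is a short sketch of the safe and the risky way to use comments, following the warning above. The note text is invented.

```html
<!-- Safe: plain words only. Remember to update the photo in May. -->

<!-- Risky on older browsers: the tag inside may end the comment early
<BR> so this line could show up as body text. -->
```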

Character Containers

Character containers enable you to format or describe words and phrases within paragraphs. Although they can be used inside non-body blocks as well as in normal text, all but the <A> tag can produce unattractive results on some browsers.

Hypertext Links

Hypertext links are the heart of HTML. These links let you, with a single mouse click, move from place to place within a document or even to an entirely different document anywhere on the Internet. This use of hyperlinks is how the World Wide Web gets its name—links form a spider's web of documents that covers the globe.

Hypertext Anchor

This is used to mark the reference or the target of a hypertext link. Either the href or name attribute must be included. (Both are allowed, but they don't appear together very often.) The href attribute specifies a URL to which the enclosed text is linked. (The text is highlighted; selecting it requests the new object.) href can reference another HTML document, an image, or anything else that can be addressed using a URL. The hypertext anchor can also enclose an <IMG> tag, allowing inline graphics (such as icons) to become links.

The name attribute gives a unique name to the enclosed text, allowing users and other HTML documents to point directly to this part of the document. For example, a URL such as http://.../thisdoc.html#part1 loads thisdoc.html and attempts to place the text marked with <a name="part1"> at the top of the screen. For example:

<A href="URL" name="text">text</A>
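
The following sketch shows the two attributes in typical use: a named target, links that point to it, and an image acting as a link. The host name, file names, and link text are all hypothetical.

```html
<!-- All file names and the URL below are hypothetical -->

<!-- A named target elsewhere in this document -->
<H2><A name="part1">Part 1: Getting Started</A></H2>

<!-- A link to that target from within the same document -->
<P>See <A href="#part1">Part 1</A> for the basics.</P>

<!-- A link to the same target from another document -->
<P>Read <A href="http://www.example.com/thisdoc.html#part1">Part 1 of the guide</A>.</P>

<!-- An inline image used as a link -->
<A href="index.html"><IMG src="home.gif" alt="Home"></A>
```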

Logical Styles

Logical styles let you give real meaning to sections of text. Currently, they are used only for formatting, but they could also support more intelligent types of processing, such as automatic footnoting.

Emphasis. Used to highlight sections of text for miscellaneous reasons. Normally rendered in italics.

<em>text</em>

Strong emphasis. Another form of generic highlighting. Normally rendered in bold.

<strong>text</strong>

Citation. Used to mark a citation to another document, such as a printed book (for example, Great Expectations). Normally rendered in italics.

<cite>text</cite>

Computer code. Used to mark text from a computer (for example, hit any key). Normally rendered in a fixed-width font such as Courier.

<code>text</code>

Variable. Used to mark a variable used in a mathematical formula or computer program (for example, z = x + y). Normally rendered in italics.

<var>text</var>

Keyboard input. Used to mark text that is to be typed at the keyboard by a user (for example, hit the Enter key). Normally rendered in a fixed-width font such as Courier.

<kbd>text</kbd>
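
To see how the logical styles combine in running text, consider this short sketch. The sentences are invented for illustration, and the exact rendering depends on the browser.

```html
<!-- Sample sentences invented for illustration -->
<P>The novel <cite>Great Expectations</cite> is <em>long</em>,
but it is <strong>not</strong> difficult to read.</P>

<P>When the program prints <code>hit any key</code>, type
<kbd>y</kbd> to continue. In the formula <var>z</var> =
<var>x</var> + <var>y</var>, each letter is a variable.</P>
```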

Physical Styles

Originally considered cheater versions of the logical styles, physical style elements have become very popular because they are similar to the way people are used to highlighting text (that is, literally instead of semantically).

Bold:

<b>text</b>

Italics:

<i>text</i>

Typewriter text, rendered in a fixed-width font such as Courier:

<tt>text</tt>

Block Containers

In HTML, a block is defined as a piece of marked text that by itself occupies a certain amount of vertical space in a document, such as a paragraph or a heading. The following elements can be adjacent to each other but cannot be nested; for example, you can't have a <P> inside an <H1>, because they represent different types of blocks.

Headings (1 Through 6)

This acts as a title for a section of the document. The lower-numbered headings are more important and are generally rendered in larger text. Because of a mixup in the distributed default settings, some browsers erroneously display <H5> and <H6> smaller than the body text. Until these two elements are displayed more consistently, they should probably be avoided when possible. In the following syntax, # is a number from 1 through 6:

<H#>text</H#>
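
A minimal sketch of a heading hierarchy might look like the following. The titles are placeholders, and only the first three levels are used, which sidesteps the inconsistent rendering of the smallest headings.

```html
<!-- Section titles are placeholders -->
<H1>Chapter Title</H1>
<P>Opening text for the chapter.</P>

<H2>Main Section</H2>
<P>Text for the section.</P>

<H3>Subsection</H3>
<P>Text for the subsection.</P>
```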

Paragraph

In most current browsers, this tag is used in the first form as a paragraph separator. Thus, it marks the boundary between two paragraphs of normal body text. You should not use this tag between body text and another element. (For example, do not use ...<p><h1>....) Because the second element implies a line break, some browsers put too much space between the elements. The second form (a container for each paragraph) represents a more valid SGML structure and will soon be the standard. However, the end tag </P> will be optional, so most documents that have been created using the first form will still work.

text<P>text

<P>text</P>
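
The two forms can be compared side by side. This is only an illustration with placeholder text; both versions should display as two paragraphs in most browsers.

```html
<!-- First form: the tag separates two paragraphs -->
This is the first paragraph of body text.
<P>
This is the second paragraph.

<!-- Second form: each paragraph is a container -->
<P>This is the first paragraph of body text.</P>
<P>This is the second paragraph.</P>
```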

Extended quotation

Used for long quotations that exist as separate paragraphs. This is normally rendered similarly to a normal paragraph, but with both margins indented.

<BLOCKQUOTE>text</BLOCKQUOTE>

Mailing address

Specifically targeted to postal addresses, this tag is commonly used to mark bylines (name of the author) and e-mail addresses. It is normally rendered in a smaller font or in italics, and usually uses the <BR> tag to separate the individual lines of the address.

<ADDRESS>text</ADDRESS>
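
As a sketch, a byline with a postal and e-mail address could be marked up as follows. The name and both addresses are fictitious.

```html
<!-- The name and addresses are fictitious -->
<ADDRESS>
Pat Author<BR>
123 Main Street<BR>
Anytown, USA<BR>
pat@example.com
</ADDRESS>
```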

Preformatted text

Because extra spaces and tabs are ignored in HTML, some kinds of text, such as poetry, tables, and computer program listings, are difficult to encode. The <PRE> element is used with those types of text by formatting everything it contains exactly as it appears, including spaces, tabs, and line feeds. This is also useful for getting fields to line up in forms.

<PRE>text</PRE>
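
For instance, a small table can be lined up with spaces inside a preformatted block. The data here is invented, and the columns align only because the text is shown in a fixed-width font.

```html
<!-- Sample data invented for illustration -->
<PRE>
Item        Quantity    Price
Widget             3     4.50
Gadget            12    10.00
</PRE>
```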

Lists

HTML provides several tags that make it convenient to display lists of items. Lists can be ordered (numbered), unordered (displayed as bulleted items), or arranged as columns of terms and definitions. Also, list items can be hyperlinks to other documents on the Web.

Itemized List

This creates a list containing several items, each beginning with <LI>; each item is normally indented one tab position. There are four types: <UL> is an unordered list (each entry is normally preceded by a bullet); <OL> is an ordered list (each entry is numbered); <MENU> is a menu of choices (similar to <UL> but sometimes rendered more compactly); <DIR> is a directory (designed to be a list broken into two or three columns like a disk directory; in most current browsers, the <DIR> element is rendered the same as <UL>). These lists can be nested within each other, allowing for complex list hierarchies such as outlines. In the following syntax, TYPE stands for UL, OL, MENU, or DIR:

<TYPE>

        <LI>text

        <LI>text

        <LI>text

        ...

</TYPE>
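
As an illustration of nesting, the sketch below places an unordered list inside the first item of an ordered list to form a simple outline. The topics are placeholders.

```html
<!-- Outline topics are placeholders -->
<OL>
        <LI>Planning the site
        <UL>
                <LI>Audience
                <LI>Content
        </UL>
        <LI>Writing the pages
        <LI>Testing with several browsers
</OL>
```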

Definition List

This syntax builds a list in which each entry has two parts, as in a glossary: a term (which follows the <DT>) and a definition (which follows the <DD>). It is normally rendered exactly the same as this section of this chapter, with the definition indented below the term. The optional COMPACT attribute was designed to produce a more vertically compact list in which the terms and definitions are placed in side-by-side columns, but it is ignored by most current browsers.

<DL COMPACT>

        <DT>term text

               <DD>definition text

        <DT>term text

               <DD>definition text

        ...

</DL>
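
A short glossary gives a feel for the result. The entries below are only an example, and COMPACT is left out because most current browsers ignore it.

```html
<DL>
        <DT>HTML
                <DD>HyperText Markup Language, the document format of the Web
        <DT>SGML
                <DD>Standard Generalized Markup Language, the pattern for HTML
        <DT>URL
                <DD>The address of an object on the Internet
</DL>
```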

Forms

The forms feature of HTML is one of the things that gives the Web real power for doing live, interactive applications. The HTML form, however, is only half of this feature. After the user fills out the form, it is submitted to a specialized program, or script, which takes the information and does something useful with it (for example, e-mail it to you). You must either write the script yourself (that means programming) or find a prewritten script that will suit your needs. This gets into the topic of the Common Gateway Interface (CGI), which is explored in detail in Chapter 11 and all through Part V. In this chapter, we stick to the HTML side of the process.

Form

The <FORM> element encloses the entire form and gives some basic definitions. The form might take up only part of the HTML document; in fact, a single document can contain several separate forms that perform different functions. The method attribute specifies the way in which information is sent to the HTTP server; the action attribute gives the URL of the script that is to process the submitted information (usually http://.../cgi-bin/scriptname).

<FORM method="[GET|POST]" action="URL">form body</FORM>

Form Input

This empty tag is used to place different fields in the form to enable users to enter information. The name attribute gives a unique name to the field; the optional value attribute gives a default value for this tag. When the form is submitted, the information is returned as a set of name-value pairs separated by ampersands, such as http://.../cgi-bin/script?name=me&address=here&time=now. The type attribute gives the style of object to be used. (See the following bulleted list.)

<INPUT name="text" type="" size=## value="text" CHECKED>

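CHECKBOX uses a simple on or off button. When the box is checked, its value attribute is returned (most browsers send on if no value is given); an unchecked box is not included in the submitted data.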
RADIO is similar to CHECKBOX, but allows you to pick one choice from many by having several radio tags with the same name but different values. RADIO returns the value attribute of the checked tag. (value is not optional with this type.)
TEXT places a one-line window to allow users to type in something. The returned value is the text entered.
IMAGE places an image in the form, allowing users to point to it, returning the x and y pixel coordinates of the selected location. IMAGE operates similarly to <IMG ISMAP>, but within a form. For this type, the SRC and ALIGN attributes from the <IMG> element are included.
SUBMIT places a button on the form that submits the form to the action URL in the <FORM> tag. The label for the button is specified by the value attribute.
RESET clears the form, returning all fields to their default values.
HIDDEN does not display anything on the form. It allows you to pass nonchangeable information along with the rest of the form (using the name and value attributes).

The CHECKED attribute is used with the CHECKBOX and RADIO types to signify whether the button is selected by default or not. The size attribute is used to set the window size of a text field (in characters).
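
Putting several of these types together, a small comment form might be sketched as follows. The action URL, the script name, and every field name and value are hypothetical; a real form needs a matching script on the server.

```html
<!-- The action URL and all field names are hypothetical -->
<FORM method="POST" action="http://www.example.com/cgi-bin/comments">
<P>Your name: <INPUT name="username" type="TEXT" size=30></P>
<P><INPUT name="reply" type="CHECKBOX" CHECKED> Send me a reply</P>
<P>Rating:
<INPUT name="rating" type="RADIO" value="good" CHECKED> Good
<INPUT name="rating" type="RADIO" value="fair"> Fair
<INPUT name="rating" type="RADIO" value="poor"> Poor</P>
<P><INPUT name="page" type="HIDDEN" value="chapter5">
<INPUT type="SUBMIT" value="Send Comments">
<INPUT type="RESET" value="Start Over"></P>
</FORM>
```

If a user types Pat as the name and leaves the other fields at their defaults, the script would receive pairs along the lines of username=Pat&reply=on&rating=good&page=chapter5, assuming the browser sends on for a checked box that has no value attribute.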

Choice Selection

This presents a list of possible values for the field, itemized by the <OPTION> tag; normally it is displayed as a pull-down menu. The name and value attributes are the same as for <INPUT>. The text following each <OPTION> tag is displayed in the menu. If no value attribute is given, the text itself is returned when that option is selected. The multiple attribute allows more than one option to be selected, and the selected attribute identifies the default choice.

<SELECT name="text" multiple>

        <OPTION value="text" selected>text

        <OPTION value="text">text

        ...

</SELECT>

For example:

<SELECT name="text">

        <OPTION value="OPT1" selected>Option 1

        <OPTION value="OPT2">Option 2

</SELECT>

Multiline Text Input

This is similar to <INPUT TYPE="text">, but allows for many lines. The name attribute is the same as for <INPUT>, whereas the number values for the rows and cols attributes define the size. The text contained in the element is shown in the window by default.

<TEXTAREA name="text" rows=## cols=##>text</TEXTAREA>
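
A minimal sketch of a text area inside a form follows. The action URL and the field name are hypothetical, and the text between the tags is the default content the user sees.

```html
<!-- The action URL and field name are hypothetical -->
<FORM method="POST" action="http://www.example.com/cgi-bin/comments">
<P>Comments:</P>
<TEXTAREA name="comments" rows=5 cols=40>Type your comments here.</TEXTAREA>
<P><INPUT type="SUBMIT" value="Send"></P>
</FORM>
```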

Entities

Many characters that appear in documents can be impossible to enter in an HTML file, including characters that have special meaning to HTML (for example, the < and > characters) and international and typographic characters not found on most keyboards.

These characters can be included in documents using entities, pieces of text that together signify a single character. The general syntax includes an ampersand, a unique name for the character, and a semicolon. For example, Gr&ouml;ning produces Gröning. There are two general types, as described in the following sections. For a complete list, please see the Appendix.

Reserved Characters

Reserved characters are normal characters used for other purposes in HTML that can cause confusion if entered by themselves.

Entity	Displayed As
&lt;	Less-than sign (<)
&gt;	Greater-than sign (>)
&amp;	Ampersand (&)
&quot;	Quotation mark (") (usually not necessary)

International Characters

International characters are characters used in most European languages other than English, referenced by names from the ISO Latin 1 character set. A few examples follow:

Entity	Displayed As
&Aacute;	Capital A with acute accent (Á)
&ocirc;	Small o with circumflex accent (ô)
&AElig;	Capital AE ligature (Æ)
&ccedil;	Small c with a cedilla (ç)
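
Entities of both kinds can appear anywhere in body text. The sentences in this sketch are invented for illustration.

```html
<!-- Sample sentences invented for illustration -->
<P>Use &lt;BR&gt; to break a line.</P>
<P>Research &amp; Development</P>
<P>Fran&ccedil;ois visited the h&ocirc;tel in &Aacute;vila.</P>
```

A browser displays these lines as "Use <BR> to break a line.", "Research & Development", and "François visited the hôtel in Ávila."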

The Future of HTML

By the time you read this, the specification for HTML Level 2 should be complete and most browsers should be using this specification as a standard. However, Level 2 does not represent the final form of HTML. This language will continue to evolve, adding new capabilities, for years to come.

Although the current version of HTML has many powerful features, it also has its disadvantages. Suggestions are constantly being submitted to the HTML working group, which considers them for inclusion in the standard. Enhancements will likely allow a larger variety of documents to be put on the Web and make documents better looking and easier to manage and use.

The Presentation Versus Structure Debate

The primary area currently evolving is the formatting of documents. Debate is raging over how much control of a document's appearance should rest in the hands of the user and how much should be decided by the publisher. Years of research have gone into graphic design and typography, and there are varied methods of using the appearance of text and graphics to communicate a particular message. To designers and publishers who have become experts in this art, it is important that the information contributor have a large degree of control over document appearance.

However, on the World Wide Web, the user can choose fonts, window sizes, colors, and many other presentation variables. Although this frustrates many publishers, it is an important part of the Web. Not all users have the same typefaces, colors, and screen area available, so they must be able to make a WWW page fit their constraints. In addition, physical differences among users place special demands on the appearance of pages: for example, sight-impaired people might want to use very large type, and a blind user, who sees nothing at all, has the document read aloud by the computer.

A compromise must be reached. Information providers need the capability to dictate a large part of the appearance of the document when it is important. On the other hand, users need to be able to override or alter this appearance when necessary. The primary goal of the Web is the dissemination of information; the content of the documents should always be more important than their appearance. Whatever can be done to improve comprehension by users, including both dictated and alterable appearance, is important to that dissemination as well.

For more information about proposed solutions to some of these problems, see Chapter 12, "Netscape Extensions and HTML 3.0."

Alternatives to HTML

It is doubtful that HTML will ever be able to provide all the creative design and functionality that true electronic publishing demands; it was never intended to do so. New file types have emerged to address these weaknesses. These file types are generally geared toward more specialized applications, and very little software currently exists for using them. However, some of them will likely become major (perhaps equal or superior to HTML) parts of the World Wide Web.

Portable Document Format (Adobe). PDF, the format used in Adobe Acrobat, has almost every page layout capability that can be imagined. (It is based on PostScript.) Acrobat readers will soon become Internet-savvy and be able to include URL hyperlinks just as HTML does. PDF has three disadvantages: it is a closed, proprietary format owned by Adobe; PDF files are much larger than the equivalent HTML files; and the format is complex, making it difficult to generate automatically with scripts and other software. Other commercial electronic document formats, such as Folio and WordPerfect Envoy, will probably also add these capabilities. For more information, see http://www.adobe.com/Acrobat/Acrobat0.html.
Hyper-G. This distributed hypertext system is very similar in purpose to the WWW; in fact, it is probably dumb luck that one caught on and not the other. Hyper-G has a document format called HTF that in some ways is more powerful than HTML.
Simple Vector Format/HyperCGM. Many fields (including Engineering, Graphic Design, and Cartography) need a common format for distributing vector (object-based) graphics. Whether to create a new format from scratch or to alter an existing format such as CGM to allow for hyperlinks is still under debate. Conceivably, a strong vector graphics format could allow for completely graphics-based (rather than document-based) information systems on the Web. For more information, contact http://www.niiip.org/svf/.
Virtual Reality Modeling Language. There is also a niche that would like to be able to distribute virtual reality over the Web. The specification is nearing completion and will allow for the design and distribution of "scenes" that give a 3-D look to objects and places on the Web. For information, contact http://vrml.wired.com/.
The Java Language. This object-oriented language was developed by Sun Microsystems, and licensed by Netscape Communications, to give developers a method to create interactive Web applications. The language is modeled on C++ and has been designed with network security in mind. Currently, you must run either the HotJava Web browser from Sun or Netscape Navigator 2.0 to reap the benefits of the interactive capabilities of Java. For more information, see http://java.sun.com or Presenting Java, by John December, published by Sams.Net. Also, Chapter 24, "Interactive Web Programming with Java," presents a very useful Web application written in Java.
Web-savvy word processors. When you read this, there should be at least two commercial word processors with Web support: Microsoft Word (with Internet Assistant) and WordPerfect (with Internet Publisher). These add-ons will include several new capabilities:

HTML creation and conversion within the word processor.

The word processor can act as a Web browser, reading HTML (so you don't need to get a separate browser like Mosaic or Netscape).

Small stand-alone viewers that can be freely distributed, so that people can view documents in their native format (including hypertext links and all the formatting) without buying the full word processor.

Although few of these alternatives are available today, they will soon be around, increasing the flexibility (and confusion) of the WWW. They will probably become most popular in niche markets that require very specialized information types (such as maps, diagrams, and technical illustrations) and with professional publishers who need detailed presentation control that can't, or shouldn't, be part of HTML.

What's Next

Now that we understand the basics of HTML, we are ready to move into some more advanced topics. The next chapter will cover graphics in much more detail and attempt to make sense of all the hype about multimedia on the Web. We'll have much more to say about HTML in Chapter 12, "Netscape Extensions and HTML 3.0" and in Chapter 13, "Putting HTML to Work Building a Sample Site."