Chapter 21

Indexing Your Intranet with WAIS




By now, you've made a good deal of data available on your Intranet, or at least you have some ideas about what you want to put there. In all likelihood, your Intranet will eventually accumulate a substantial volume of data. The obvious next question is how your customers will find anything among all that data. The equally obvious answer is to provide searchable indexes on your Intranet. In this chapter, you'll learn how to enable your customers to both search your indexes and retrieve documents (or other data files) using their Web browsers.

This chapter talks about how WAIS works, how to install it, how to use it, and how to search with it. At the end of the chapter, I'll go over a few alternative indexing/searching technologies, including Excite for Web Servers. In case you haven't heard of Excite already, it is now available for Windows NT, it is coming on strong, and it's free!

Wide Area Information Server (WAIS)

WAIS, which stands for Wide Area Information Server, is a system for indexing large amounts of data and making them searchable over a TCP/IP network. It's misnamed, though, because it works just as well on a local network as it does over the Internet.

WAIS server software indexes data and responds to requests from WAIS clients to search the indexes and return a list of documents that match the search. Based on an ANSI standard for indexing library materials in computer systems (Z39.50), WAIS can form an important part of your Intranet. Because WAIS uses the Z39.50 protocol, you might hear the two terms used synonymously.

WAIS supports not only simple keyword searches, but also Boolean queries (for example, thiskeyword and thatkeyword) and even plain English searches. In addition, WAIS can do relevance searching: you can select part or all of a document that your WAIS search has found and ask for a new search based on the selection. In other words, WAIS will find more documents like the one it found.

WAIS was originally developed as free software at a company called Thinking Machines, Incorporated. Then the software was commercialized by WAIS, Inc., which is now owned by America Online (the Internet buying frenzy continues). At the time of this writing, AOL has not yet announced plans for WAIS, Inc. and its technology. Fortunately, WAIS software for Windows NT is available for free through EMWAC, and the good folks at Sams.net have arranged to include it on the CD-ROM with this book.

Note
In addition to the WAIS Toolkit, EMWAC has also developed a freeware Web server (HTTPS), Gopher server (GS), and SMTP server (IMS) for Windows NT. You can find more information about their highly regarded server software at this URL:
http://emwac.ed.ac.uk/

Although NCSA Mosaic has built-in WAIS client support, Netscape and Explorer don't. As a result, you must run a WAIS gateway on your Intranet to support users of those browsers. Fortunately, the EMWAC WAIS Toolkit on the CD-ROM serves this purpose nicely. You can use an HTML page as a front end to the WAIS search engine. The EMWAC WAIS Toolkit returns the results of the search in HTML format with matched documents as clickable hyperlinks. Once you learn how to set this up, it works beautifully. Web searching is definitely a slick feature to add to your Intranet. (And you will soon see it is not hard to set up at all.)

Figure 21.1 shows a demonstration WAIS search result for the keyword address. The results page is nicely formatted in HTML with hyperlinks to each of the located documents. WAIS has applied a best-guess score (maximum 1000) to each document for its potential value to the user searching for the keyword. Documents containing fewer occurrences of the word address are given lower scores. The highest scoring documents appear at the top of the list. The document with the search keyword contained in the HTML <TITLE> tag is given a perfect score. WAIS also displays the file size in bytes of each document, as that may help the user determine which hyperlink jump to take.

Figure 21.1: The results of the WAIS search for the word address.

The relative weighting of the found documents in the WAIS search results is based on a number of useful criteria, such as word frequency within the individual documents and the index as a whole. With multiword and Boolean searches, the weighting takes all the search words into account, so a document containing all your search words would get more weight than one that contained multiple instances of just one of them, for example.
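To make the idea of frequency-based weighting concrete, here is a minimal sketch in Python. It is not the algorithm WAIS uses; it is an illustrative scoring scheme that assumes only the two criteria named above (word frequency within a document and across the index as a whole) and scales the best match to 1000. The function names and the sample documents are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_documents(documents, query):
    """Return (name, score) pairs sorted best-first; the top hit scores 1000.

    documents maps a document name to its text. This is an illustrative
    frequency-based weighting, NOT the actual WAIS ranking algorithm.
    """
    query_words = set(tokenize(query))
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    total_docs = len(documents)
    raw = {}
    for name, words in counts.items():
        length = sum(words.values()) or 1
        score = 0.0
        for word in query_words:
            if words[word] == 0:
                continue
            docs_with_word = sum(1 for c in counts.values() if c[word])
            # Words that are rare across the whole index count for more.
            rarity = math.log(1 + total_docs / docs_with_word)
            score += (words[word] / length) * rarity
        if score > 0:
            raw[name] = score
    if not raw:
        return []
    best = max(raw.values())
    ranked = [(name, round(1000 * s / best)) for name, s in raw.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "a.htm": "address address book",
        "b.htm": "address of the main office building",
        "c.htm": "nothing here",
    }
    for name, score in score_documents(sample, "address"):
        print(score, name)
```

Documents that do not contain any of the search words are left out of the results, and every other score is expressed relative to the best match.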

If your Intranet is like most others, much of the data you'll want to index is in (or can be put into) plain text files of one kind or another. WAIS understands a wide variety of text formats. WAIS also knows about several kinds of image formats and can be coaxed into indexing them (or at least their filenames).

In addition, the package has special features that make it easy to integrate your data indexing into your Intranet, with a focus on Web-related capabilities. One major source of data that you may want to index is the data on your Web server itself. Finally, as if these capabilities weren't enough, you can teach WAIS to recognize and index new data formats.

Building a WAIS index is rather simple. The following is a quick overview of the steps involved in using WAIS (this process will be covered in more detail later in this chapter):

  1. Build a WAIS index of all the HTML files at your site by using the program named waisindex.exe. WAIS creates several files that comprise your index. If you name the index myindex, for example, you will end up with files named myindex.*.
  2. When using IIS, enable automatic WAIS searching by setting CheckForWAISDB = 1 in the Registry.
  3. Create a search page written in HTML in the same directory as the WAIS index and using the same base filename (for example, myindex.htm).
  4. Include the <ISINDEX> tag in the <HEAD> section of the HTML search page and provide a link from your home page to this HTML document.
  5. When a user loads the search page, he is prompted for a search keyword. After the user enters the word, the server automatically invokes waislook.exe and returns a list of matching documents. WAIS is that simple, and your Intranet users will love you for the added functionality it provides.

Installing WAIS

The EMWAC WAIS Toolkit included on the CD-ROM will help you create a database of all the text at your Web site so that users can search it by keyword. The creators of HTML designed the <ISINDEX> tag with this feature in mind. The <ISINDEX> tag causes the Web server to invoke a program named waislook to search a WAIS database and return links to the pages containing the search keyword. (The WAIS database is also referred to as an index.)

Note
The European Microsoft Windows NT Academic Centre (EMWAC) has developed several excellent freeware programs for Windows NT. Programs that are written for Windows NT on the Intel platform will usually run on Windows 95 also. This is because both Windows NT and Windows 95 support the common Win32 API, which enables programs to call functions in the operating system in a consistent manner using 32-bit parameters for integers and resource handles.

Follow these steps to install the EMWAC WAIS Toolkit:

  1. The WAIS Toolkit is distributed in four versions for the different architectures that Windows NT supports. Select the appropriate WAIS ZIP file on the CD-ROM for your processor. For example, the WAIS Toolkit for Intel is contained in the file wti386.zip.
  2. Decide which directory you are going to put the tools in so you can unzip the .EXE programs directly from the CD-ROM to your hard disk using the WinZip program. Ensure that the directory you chose is on the path so that the commands may be executed from the command line.
  3. Unzip the WAIS Toolkit. This action should leave you with the following files:

  4. If you have installed a previous version of the WAIS Toolkit, remove it by deleting the old files or by moving them to another directory (which is not referred to by the PATH environment variable) for deletion after you have validated that the new version works correctly.
  5. Determine which version of the WAIS Toolkit you have by typing these commands at the DOS Prompt:
waisserv -v
waisindx -v
waislook -v

The version number for each program will be displayed. Two version numbers will be shown for waisindx and waisserv; the first refers to the version of the freeWAIS code from which the programs were ported, and the second is the number of the Win32 version. As you can see in Figure 21.2, which shows the execution of those commands on my system, I am running version 0.73 for Windows NT. If the programs report a later version number on your system, you will find an updated manual in the files you unpacked from the ZIP archive (the information in this chapter would still be expected to work with few or no changes).

Figure 21.2: The results of checking the WAIS version numbers.

Indexing Your Intranet with WAIS

To create a WAIS database of the HTML files at your site, follow these steps: (Assume for the purposes of this discussion that d:\http is the home directory of your Web site.)

  1. Make d:\http, or the HTML root, the current directory.
  2. Execute waisindx (or waisindex, if you have renamed it to use long filenames), giving it parameters as shown in the following code. The -d parameter is used to name the index files which are created. The default name if no parameter is given is index, which I will assume is in use for the remainder of this chapter. The -r parameter tells WAIS to search all subdirectories. The -t (lowercase) parameter indicates the type of files being indexed. WAIS handles text files and HTML with ease. If you know all the files are HTML, WAIS will use the <TITLE> tags for the file headlines. The last parameter specifies the files that you want to search, which are, in this case, all HTML files in the HTML root directory.
    waisindx -d index -r -t html *.htm*
  3. Observe the messages from waisindx to check that there are no errors.
  4. Execute a dir index.* command on the d:\http directory to check that waisindx has created the seven index files, named index.* and described in the following text.
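If you rebuild the index regularly, the steps above can be wrapped in a small script. The following Python sketch is one way to do that, assuming Python is available on the server; the function names are invented for this example. It assembles the same waisindx command line shown in step 2 and lists whatever index.* files exist afterward, without assuming what their extensions are.

```python
import glob
import os


def build_waisindx_command(database="index", file_type="html",
                           pattern="*.htm*", recursive=True,
                           program="waisindx"):
    """Assemble the argument list for the indexing run described in step 2."""
    command = [program, "-d", database]
    if recursive:
        command.append("-r")
    command += ["-t", file_type, pattern]
    return command


def find_index_files(directory, database="index"):
    """Return the sorted names of database.* files found in directory."""
    matches = glob.glob(os.path.join(directory, database + ".*"))
    return sorted(os.path.basename(path) for path in matches)


if __name__ == "__main__":
    # Show the command that would be run from the HTML root directory.
    print(" ".join(build_waisindx_command()))
    # List the index files present in the assumed HTML root, d:\http.
    print(find_index_files(r"d:\http"))
```

To actually run the command, you could pass the list returned by build_waisindx_command to subprocess.run with d:\http as the working directory. Note that a search page named index.htm in the same directory also matches index.*, just as it does with the dir command.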

The following text describes the files created by waisindx:

Using <ISINDEX> with WAIS

Now that the WAIS index files are created, you need to modify your HTML code to take advantage of them. This is where the HTML <ISINDEX> tag enters the picture. Remember, the HTTP server is designed to automatically invoke waislook whenever it receives an <ISINDEX> request from the client.

This automatic invocation of waislook should not be taken for granted. I've only seen this work with three Web servers: Process Purveyor for NT, EMWAC HTTPS, and Microsoft IIS. Other Web servers, such as Alibaba, require a different procedure to take advantage of <ISINDEX>. I won't get into the details of that procedure in this chapter, but I can point you to Richard Graessler's home page, which contains thorough information about the topic. Mr. Graessler has written a very nice batch script that can be used on Windows NT to pass <ISINDEX> search parameters to waislook on Alibaba or other Web servers. He kindly provides the source code free of charge on his Web site at this URL:

http://rick.wzl.rwth-aachen.de/rickg/IsIndex/isindex.html

Because much of this book is based upon Microsoft IIS, I will assume you are using that Web server. In that case, there is a Registry setting that you must ensure is set properly. Follow these steps:

  1. Start the Registry Editor and drill down to the following key:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W3SVC\Parameters
  2. Look in the right-side window pane to see if you already have a value named CheckForWAISDB. If so, and if it has a value of 1, then IIS is ready to invoke waislook.
  3. If you don't have a value named CheckForWAISDB, choose Edit | Add Value. Type in the Value Name and choose REG_DWORD for the Data Type.
  4. After you choose OK, the DWORD Editor dialog box will prompt you for the initial value. Enter a value of 1, and choose OK again.
  5. Check that the value is entered correctly, and then exit from the Registry Editor. Now IIS will support <ISINDEX> searches using the WAIS Toolkit.

Note
These steps are not necessary with the EMWAC HTTPS Web server because it is capable of automatically invoking waislook.exe.
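If you need to make the same Registry change on several IIS servers, you may prefer to generate a registration file rather than repeat the manual steps. The following Python sketch produces text in the REGEDIT4 file format for the CheckForWAISDB value. Treat it as an assumption to verify on a test machine: the function name is invented for this example, and whether your copy of the Registry Editor imports such a file depends on your Windows NT version. The manual procedure above remains the reference.

```python
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W3SVC\Parameters"


def reg_file_text(value_name="CheckForWAISDB", value=1):
    """Return REGEDIT4-format text that sets a REG_DWORD value under KEY_PATH."""
    lines = [
        "REGEDIT4",
        "",
        "[" + KEY_PATH + "]",
        # REG_DWORD values are written as eight hexadecimal digits.
        '"%s"=dword:%08x' % (value_name, value),
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the text; redirect it to a .reg file of your choosing.
    print(reg_file_text())
```

Review the generated file before importing it, and back up the Registry first, as you would before any manual edit.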

The next step is to create a new search page named index.htm that contains the <ISINDEX> tag in the <HEAD> section. Figure 21.3 shows how the <ISINDEX> tag is interpreted by Microsoft Internet Explorer. The user is preparing a search for the keyword address, the results of which were shown earlier in the chapter. Listing 21.1 contains the HTML code for the sample page shown in Figure 21.3. You can find this HTML file on the CD-ROM.

Figure 21.3: The <ISINDEX> tag as it appears in Microsoft Internet Explorer 2.0.


Listing 21.1. The code for index.htm, a sample HTML file that uses <ISINDEX>.

<HTML>
<HEAD>
<TITLE>Search the Intranet</TITLE>
<ISINDEX>
</HEAD>
<BODY>
<H1>Search the Intranet</H1>
</BODY>
</HTML>

Now you just need to provide a link from your Intranet home page to the new index file you just created. After you do that, your site will be searchable by keyword. Use your Web browser and give it a try.
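It can help to know what the browser actually sends when a user submits an <ISINDEX> search: it requests the same page again with a question mark and the keywords appended, using plus signs between words. The short Python sketch below builds such a URL so that you can test the search without the form, or link directly to a canned search. The host name intranet and the function name are placeholders invented for this example.

```python
from urllib.parse import quote


def isindex_url(page_url, keywords):
    """Build the URL a browser requests when an <ISINDEX> query is submitted."""
    words = keywords.split()
    # Each keyword is URL-escaped; keywords are joined with plus signs.
    return page_url + "?" + "+".join(quote(word, safe="") for word in words)


if __name__ == "__main__":
    print(isindex_url("http://intranet/index.htm", "address"))
    print(isindex_url("http://intranet/index.htm", "mailing address"))
```

Typing the resulting URL into your browser should produce the same results page as entering the keywords on the search page.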

Examining waisindex

Table 21.1 lists the waisindex command-line options, annotated to indicate which are required and which are optional. After this list, you'll find a bit more detail about each option that isn't self-explanatory.

Table 21.1. waisindex command-line options.

Option  Description
-a Adds to existing WAIS index. Optional.
-d database Specifies database name for WAIS index. Optional; defaults to index.* if -d not present.
-r Recursively indexes subdirectories. Optional.
-mem mbytes Specifies the amount of memory in megabytes to use in creating the database. Optional.
-register The Windows NT version of waisindex cannot automatically register the database with the master Internet directory of servers. This option displays instructions on how to do it manually. Optional.
-export Makes database network accessible, outside the Intranet. Optional.
-e filename Logs errors in filename. Optional.
-l number Sets log level (0 through 10). Optional.
-v Prints the version of the software. Optional.
-stdin Reads filenames to be indexed from standard input. Optional.
-pos or -nopos Includes (or doesn't include) word position information. Optional.
-nopairs or -pairs Doesn't include (or includes) word pairs. Optional.
-nocat Doesn't create catalog files. Optional.
-contents Indexes the contents, even if the document type is not normally subject to such indexing. Optional.
-nocontents Indexes only the filename, not the contents, even if the contents are normally indexable. Optional.
-keywords string Uses string as keyword(s) in indexing. Optional.
-keyword_file filename Takes indexing keyword from filename. Optional.
-x filename1[,f2,...] Does not index these files. Optional.
-T type Announces "TYPE" of the document. Optional.
-M type,type Specifies multitype documents. Optional.
-t type Specifies actual type of the files. Optional.

The first two options (-a and -d) control whether a new database is created or an existing one is appended to.

By default, waisindex will index only the files you specify. If you use the -r switch, it will recursively index all the subdirectories and files underneath the starting directory.

You can speed up waisindex by giving it a large -mem parameter. Expressed in megabytes, this parameter is the amount of your system's virtual memory (not physical RAM) to be used in creating index databases. Using too high a number here might interfere with the computer's other tasks, so be careful if your system is busy. Running your indexing jobs in off hours, when the system is less busy, can enable you to use more memory. If the default memory utilization (with no -mem specification at all on the waisindex command line) slows your system down, use this argument to limit the amount of memory used rather than to increase it. This option should only be necessary for indexing large Intranets.

The two related options -register and -export might seem similar, but they do entirely different things. In order for you to make your WAIS index database fully searchable by WAIS clients, you must use the -export option. This option modifies the database.src file, making it possible for stand-alone WAIS clients to access the database over a network. Web browsers or CGI gateway scripts like waislook don't need this information.

Using the -export option does not advertise your index database to the Internet. This is what -register does. In freeWAIS, the -register option creates and sends an e-mail message to two main WAIS index registries on the Internet. The effect is a public Internet announcement that your index is available to be searched from anywhere. In EMWAC WAIS, this option only tells you how to advertise your index database yourself. If you don't want to make your index universally available, don't use this option. Also, if your network is not connected to the Internet or is behind a network security firewall, the -register option is unlikely to be of any use.

Two options, -e and -l, enable you to control whether your WAIS server will create logfiles of its transactions (all the searches that are done) on your index. In addition, you can control how much logging takes place. The first option (-e logfile) tells the server that you want a log kept in the file logfile.

By default, if you have logging enabled, the most verbose logging is done. To reduce the amount of information that's logged, use the -l option with a number between 0 and 9. (Level 10 logging is the default if -e is used alone.) The lower the number, the less verbose the logging. If you use default logging, watch the size of your logfiles to ensure that they don't fill up your disk.

Rather than typing in a list of filenames on the waisindex command line, you may want to use other command-line utilities to prepare a list of files for you based on some criteria. You can then feed that list to waisindex using the -stdin option.
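As an illustration, the following Python sketch prints the names of files under a directory that have a given extension and were modified recently, one name per line. The function name, the extension list, and the 30-day window are choices made for this example, and the one-name-per-line input format for -stdin is an assumption you should confirm against the Toolkit manual.

```python
import os
import sys
import time


def recent_files(root, extensions=(".htm", ".html", ".txt"),
                 max_age_days=30, now=None):
    """Yield paths under root with a matching extension modified recently."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for folder, _subdirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            # Keep only wanted file types changed since the cutoff time.
            if name.lower().endswith(extensions) and os.path.getmtime(path) >= cutoff:
                yield path


if __name__ == "__main__":
    # The directory to scan is the first argument; default to the current one.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in recent_files(root):
        print(path)
```

If you saved this script as listfiles.py (a hypothetical name), you could pipe its output into the indexer with a command along the lines of python listfiles.py d:\http | waisindx -d index -stdin.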

One of the files created by waisindex is known as the catalog file. This file contains the headline of every document in a WAIS index database. If your database is large, this file can get quite large. It's really nothing more than a long list of the files in your database, annotated with a descriptive headline. Failed searches may result in the headline file being returned to your customer, and a long list of headlines may or may not be helpful. The catalog file is not required for the WAIS server to function or for your customers to do searches, so you can dispense with it if you're short on disk space by running waisindex with the -nocat option.

Ordinarily, waisindex knows that there are some kinds of files whose contents can't usefully be indexed. Examples include image files and other kinds of binary data. Based on the -t option, for example, waisindex will index the contents of several kinds of text files that it knows about. If, on the other hand, you'd like to inhibit content indexing of ordinarily indexable files, use -nocontents.

If you want to make sure that your WAIS index database contains specific keywords, even if some or all of the documents don't contain them, use -keywords string and specify the keywords on the command line, or use -keyword_file filename and specify them in a file. Your extra keywords will be added to the normal indexing. This feature is useful when indexing image filenames and other binary data.

The -T and -t options are confusing because they both appear to specify a document type. The difference is subtle but important. You can think of -t as specifying a document format and -T as specifying a document type.

The waisindex program has a built-in list of the document types it recognizes. You can get this list by entering the waisindex command with no options at all on your command line. For the most part, these are types of plain text files whose internal file format waisindex understands and can interpret. Examples include Usenet news articles and e-mail messages. The program expects such files to conform to the standard format of those kinds of files, with a certain layout and structure. Thus, the -t option deals with the format of documents: how they're laid out, what divides records, and the like.

As you'll also recall, Web servers and browsers know about a list of MIME data type/subtypes. This is where the -T option to waisindex comes in. Because WAIS is built to integrate into a Web, it has MIME hooks built in. When you index data with waisindex, you can use the -T command-line option to specify a MIME type that will be announced when your index is searched by a Web browser or CGI script. When a Web browser or CGI gateway script retrieves the document, the MIME type is returned, and your Web browser deals with it appropriately. Thus, if you index JPEG image files using -T JPEG on the waisindex command line, your customers' Web browsers will know to open the files they retrieve from your WAIS server as JPEG images.

Note
In some instances, the -T and -t options appear to have the same file type specified. For example, because waisindex knows about GIF images, you might specify -T gif and -t gif on the same command line when indexing GIF files. Because the two options mean different things, their use isn't redundant.

Tip
When using both the -t and -T options with waisindex, always put the -t option first on your command line. In some cases, -t may imply a -T because the overall default for -T is TEXT, so you may not need both options.

In connection with MIME types, the -M option to waisindex enables you to specify multiple file types in a single WAIS index database. Suppose you maintain copies of common word processing documents in several formats, including Microsoft Word, WordPerfect, rich text, and plain text. Using the -M option, you can index all these documents at once using a waisindex command line, something like the following:

waisindex -d mywords -M MSWORD,WORDPERFECT5.1,RTF,TEXT *.*

In this line, the multiple file types correspond to some of the additions you have made to the mime.types file on your Web server over the course of the last several chapters. Note that you must specify them on the waisindex command line in uppercase letters.

Indexing Images and Other Document Types

Most Web servers have more than just plain text documents on them. In particular, Web servers have HTML documents, images, and other multimedia files on them. So why not extend your WAIS index database to add these important files?

Suppose your Web server includes a directory tree containing not only the text files you've already indexed, but also one or more subdirectories containing HTML files. If the contents of your images aren't indexed, you may wonder what image indexing will add to your WAIS index database. Well, all the filenames of the GIF images in your Web server's file tree are indexed (along with any associated keywords you've added with the -keywords option) so that now you and your Intranet's customers can search for image files the same way you do keyword searches.

As you create more and more HTML documents, you'll collect more and more images; running waisindex on them enables you to manage them better. CGI gateway programs like waislook can help you and your Intranet's customers search for image files just as they can help you look for text files. If you have multiple Webmasters including customers setting up Web servers of their own for your Intranet, having a searchable, retrievable collection of images can be a boon. Everyone can share the same set of images, preventing duplicate work and giving your Intranet a common look.

As you've probably guessed, the technique of indexing filenames without indexing their contents, as just discussed with images, can be used for almost any kind of binary data on your Web server. You can index any set of binary files for easy search and retrieval, saving you the time and trouble of maintaining Tables of Contents as documents change.

Tip
When using waisindex to index word processor, spreadsheet, or other data files, be sure to use the -keywords option to add key search words to your index. These documents' contents may not get fully indexed, so you'll want to use this important feature.

Using the same search form shown in Figure 21.3 (where the keyword address was used), you could obtain results showing not only the plain text versions of each file found, but also the original .doc file. Because your customers' Web browsers are already configured to use Word as a helper application (see Chapter 13, "Word Processing on the Web"), they can click the document they want and load it directly into Word.

Commercial Index-and-Retrieve Packages

A growing number of companies are coming out with commercial software packages for creating Web-searchable index databases for Intranets. The following sections sample several of the other commercial packages.

Fulcrum Surfboard

A long-time maker of full-text search technologies, Fulcrum, Inc. now has a Web-based product called Surfboard. Surfboard 2.0 for Windows NT can search both local and network indexes and can search multiple indexes in a single pass. You can use natural language, multiword phrases, fielded searches, wildcard word matching (such as comput* to match computers, computing, computation, and the like), and Boolean constructs. It also supports relevance searching. In addition, you can specify the kind of output you'd like from your search, with choices including listing or tabular arrangement, HTML, plain text, or document native format, and you have several choices for sorting. You'll find more information about Surfboard and other interesting Fulcrum products for Windows NT at this URL:

http://www.fulcrum.com/english/products/prodhome.htm

Verity Topic

Topic is another product suite consisting of eight products and including both an Enterprise and an Internet indexing/search engine. The former supports major office applications' data file formats, and the latter adds support for HTML documents on a Web. Both search engines support so-called fuzzy-logic searches, as well as concept, weighted, and Boolean searches. Following the overall structure of the Topic system, the Topic client is not a Web browser, but a stand-alone application, and is available for Windows NT. Figure 21.4 shows a demo of Topic searches. You can run the demo at this URL:

http://www.verity.com/demo/d/Topic_Demos/tisdemo.html

Figure 21.4: The Topic Internet Server search demo.

Architext Excite

Excite for Web Servers (EWS) enables users to search multiple database indexes and includes both concept and keyword searches. Queries can be natural language, with search results sorted by what Excite calls Confidence (similar to other weighted relevance searching). It provides a user-friendly fill-in search form (as do other packages mentioned in this chapter).

Excite's primary distinction is that it's available for no-cost download. You can retrieve it from this URL:

http://www.excite.com/navigate/download.cgi

Excite is available for Windows NT and several UNIX systems. The licensing document that comes with the downloadable package indicates that Excite can be used internally without any charge, although you are requested to register the package. (You need only supply your e-mail address to download it.) No support comes with the free package, but support contracts, which include future upgrades, e-mail, and phone support, are available for purchase. Currently, maintenance agreements for EWS are sold for $995 per year.

Excite supports "concept-based searching," which is a technology made possible by the way EWS goes through its indexing process. It uses probabilistic techniques to analyze the interrelationships between words within a collection of documents. This index supports concept-based capabilities such as finding relevant documents that do not even contain the words used in the query statement and improving the ranking of the returned documents so that the most important documents are shown to the user first, even when thousands of documents are found.

Currently EWS only supports ASCII and HTML documents, but Architext has stated that this restriction will be lifted in the near future. With what you now know about document conversion, that limitation can be considered an inconvenience, but not a show-stopper.

PLWeb

Another index-and-retrieval package for Windows NT, Personal Library Software's PLWeb, is available for no-cost 45-day evaluations to registered users. See http://www.pls.com for details of the offer. A demonstration is online there, but you may want to look at what some of PLS's customers are doing with the package. For example, Figure 21.5 shows AT&T's searchable Toll-Free Internet Directory at http://www.att.net/.

Figure 21.5: The AT&T Toll-Free Internet Directory gives Web junkies a quick way to search for 800 numbers.

Summary

Focusing on indexing and retrieving data on your Intranet, this chapter has covered general-purpose indexing packages that can be accessed using a Web browser. You've learned how to index your data and how to provide Web-browser interfaces to enable your customers to search and retrieve data from them. In addition, you've learned about a specialized database package that you can use to maintain an online corporate telephone directory for your customers. Finally, you surveyed the market of commercial software providing index-and-retrieval features.

The next part of the book, "Sample Applications," is geared toward typical business uses of an Intranet. The next several chapters will pull together all that you have learned about Web technologies and Web tools.