- 4 -
Site Design Considerations

Most organizations that want to add content indexing to an existing site do not have the luxury of designing a complete site from scratch. Fortunately, there are many ways to give a site an indexing capability that is both secure and easy to manage. In this chapter, you will learn how new and existing site configurations can use Index Server and Internet Information Server to provide a flexible, well-designed architecture.

Physical and Logical Design

The most overlooked component of a site's design is probably proper planning for future growth. As more data is made accessible via networked computer resources (such as local area networks, wide area networks, and the Internet), planners should prepare for dramatic increases in the computer resources needed to service this data. Another key component of a well-designed system is analyzing and addressing the operations that may degrade your Index Server implementation. As you will learn in the following sections, the resources most commonly affected by indexing-engine capabilities are your disk drives. Proper planning can result in a well-designed, fine-tuned implementation.

Handling Site Growth

When designing your Index Server site, you must plan for future site growth. The effects of site growth beyond its hardware limitations can include any of the following:

Shortages of disk space for documents and indexes
Insufficient memory to support large catalog searches
Insufficient CPU cycles to support large numbers of query requests
Limited server-expansion capabilities for additional disk drives or memory

Each of these areas can be addressed in any number of ways, but it's best to address these concerns during the design phase of your Index Server implementation. For CPU issues, simply make sure your system can accept additional CPUs or an upgrade to a faster CPU. To plan for memory expansion, make sure the system supports adding memory as well as replacing existing memory with faster chips. Disk-drive issues require a little more attention because more options are available, both for supporting growth and for tuning existing system architectures. The following sections address disk-related Index Server implementation issues.

Disk Capacity Issues

When designing your Index Server site, you'll probably wonder how much disk space you'll need to store the indexes created by Index Server for the documents that you will be servicing. The general rule of thumb is to plan for about 40 percent of the space required to store the documents.
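As a rough illustration of this rule of thumb, the following Python sketch estimates the index space for a given document set. The function name and the 40 percent default are assumptions drawn from the guideline above, not values reported by Index Server; actual index sizes depend on the filters in use.

```python
MB = 1024 * 1024


def estimate_index_bytes(corpus_bytes: int, index_percent: int = 40) -> int:
    """Estimate index storage as a percentage of the document storage.

    index_percent defaults to the 40 percent rule of thumb (an assumption,
    not a measured value).
    """
    if corpus_bytes < 0:
        raise ValueError("corpus_bytes must not be negative")
    # Integer arithmetic avoids floating-point rounding surprises.
    return corpus_bytes * index_percent // 100


corpus = 500 * MB  # hypothetical 500MB document set
index = estimate_index_bytes(corpus)
print("Documents:", corpus // MB, "MB")
print("Estimated indexes:", index // MB, "MB")
print("Total to plan for:", (corpus + index) // MB, "MB")
```

For a hypothetical 500MB document set, this works out to about 200MB of index space, or roughly 700MB in total.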

Depending on the filter(s) used to index a group of documents, the actual size of the indexes may be less than the standard 40 percent. For example, if you used the IFilter Software Developers Kit to write a filter for large documents (such as satellite imagery), the filter might index only the first few hundred bytes (generally header information that identifies the image) or the properties of the document. Like the NULL filter, which is used to index binary files, any filter that does not index the entire contents of a document requires much less index space.

Disk-space issues can be addressed in any of the following ways:

Add disk drives to the existing system
Replace existing disk drives with drives of larger capacity
Use network-accessible file servers to support file-space growth

Properly planning for disk-space requirements (for existing as well as projected site growth) is important when you design a site. While you don't have a crystal ball to tell you how much disk space you will need to support a document site forever, a good site design can allocate enough disk space for projected growth and can include notifications that are triggered when disk-space utilization reaches a predefined threshold.
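One way to picture such a notification is a small script that compares disk utilization against a threshold. The sketch below uses Python's standard shutil.disk_usage call; the 80 percent threshold, the function names, and the checked path are assumptions chosen for illustration, and a real site would more likely rely on the operating system's own alerting tools.

```python
import shutil


def usage_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the fraction of the volume that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def over_threshold(path: str, threshold: float = 0.80) -> bool:
    """Report whether the volume holding path has reached the threshold.

    The 0.80 default is a hypothetical value, not a recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage_fraction(usage.total, usage.used) >= threshold


if over_threshold("."):
    print("Warning: disk utilization has reached the threshold")
else:
    print("Disk utilization is below the threshold")
```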

If documents must be moved to other disk resources to overcome space limitations, indexes may still return data for documents that no longer exist under their original document location. Therefore, to guarantee accurate indexes, the documents must be re-indexed as soon as possible.

For sites that include hundreds of megabytes or even gigabytes of data, proper disk-capacity planning can prevent you from spending hours re-indexing documents should files need to be moved to overcome space limitations. However, when you must re-index entire document sets, plan on executing the indexing process during non-peak access times such as weekends or after normal business hours.
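If you script the start of a re-indexing job, a simple guard can keep it from running during peak hours. The following sketch assumes that business hours run from 8:00 to 18:00 on weekdays; those hours and the function name are hypothetical and should be adjusted to your site's actual access patterns.

```python
from datetime import datetime


def is_off_peak(moment: datetime, open_hour: int = 8, close_hour: int = 18) -> bool:
    """Return True on weekends or outside the assumed business hours."""
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (open_hour <= moment.hour < close_hour)


now = datetime.now()
if is_off_peak(now):
    print("Off-peak: safe to start re-indexing")
else:
    print("Peak hours: postpone re-indexing")
```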

Striping Disks for Input/Output Operations

As many seasoned systems managers can tell you, the proper layout and partitioning of your disk drives can save you painful hours of rebuilding disk drives to correct disk input/output operation bottlenecks. The term bottleneck identifies a situation in which an action or actions cannot be taken until some other action has completed. In the case of disk drives, bottlenecks can occur when a read or write action against a disk drive cannot be executed until a currently active read or write operation against that same disk drive completes.

One way of overcoming bottlenecks in input/output-bound disk operations is to use a feature known as disk striping. Disk striping is a built-in feature of Windows NT that allows partitions on multiple disks to be configured as a single partition whose data is distributed equally among the disk drives. For example, if you have three disk drives that are each one gigabyte in size, you can create one 3GB striped partition made up of a 1GB partition on each drive.
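To see how data ends up spread across the drives, consider this minimal sketch of a round-robin stripe layout. It assumes that consecutive stripe units are placed on consecutive disks; the stripe-unit size and exact layout used by Windows NT are not covered here, and the function name is hypothetical.

```python
from typing import Tuple


def locate_stripe_unit(logical_unit: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical stripe unit to (disk index, unit offset on that disk).

    Assumes a simple round-robin layout across num_disks drives.
    """
    if logical_unit < 0 or num_disks < 1:
        raise ValueError("logical_unit must be >= 0 and num_disks >= 1")
    return logical_unit % num_disks, logical_unit // num_disks


# Six consecutive stripe units on a three-disk stripe set.
for unit in range(6):
    disk, offset = locate_stripe_unit(unit, 3)
    print("unit", unit, "-> disk", disk, "offset", offset)
```

Because units 0, 1, and 2 land on three different drives, a read that spans them can be serviced by all three drives at once.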

What advantage does disk striping offer when addressing disk-bound I/O operations? For one thing, because each write operation requires a subsequent read operation to verify that the data was successfully written, read operations require fewer disk operations and occur much more quickly than do write operations. You can access (read) and retrieve information from partitions that are striped across multiple disk drives much more quickly than from a single disk drive because there are multiple disk heads seeking and retrieving the data.

Because the majority of Index Server operations are read requests (searching the indexes, returning the result set, and accessing the actual documents), disk striping is a good way to reduce disk-bound I/O operations.

Seeking and returning data with multiple heads reduces the time required to complete the disk-read action, thus reducing the occurrences of disk I/O bottlenecks.

Disk striping does have some drawbacks. Write operations against a disk drive consume more CPU time than do read operations. If the documents you will be indexing require many write operations, disk striping can degrade performance.

Striping and Mirroring Disk Partitions for Fault Tolerance

Fault tolerance is built into Windows NT through partition striping with parity or through partition mirroring. The primary purpose of designing a site to include fault tolerance is to reduce the chances of experiencing system downtime because of a failed disk drive. However, fault-tolerance capabilities may have a minimal impact on disk-write transactions.

Partition mirroring is implemented by mirroring (duplicating) the contents of one disk partition across a partition of equal size on another disk drive. The operating system handles all write-transaction mirroring so that the implementation of fault tolerance is completed behind the scenes and without end-user or system-manager intervention.

Because disk mirroring requires a second partition of equal size to the partition that is to be mirrored, the actual usable physical size of your disk capacity is reduced by 50 percent.

Partition striping with parity is implemented similarly to normal partition striping except that fault tolerance is enabled through the use of parity information stored within each partition. Parity information is used to rebuild the affected data when a failed disk drive is replaced with a new disk drive. Consider a scenario where three disk drives (diskA, diskB, and diskC) each contain a 500MB partition that is part of a striped partition. The data and parity information stored on diskA and diskB will be used to rebuild diskC in the event that diskC fails and is replaced. Likewise, the data and parity information on diskB and diskC will be used to rebuild the data contained on diskA should diskA fail.
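The rebuild idea can be sketched with exclusive-OR (XOR) parity, the usual way such parity is computed. The example below is a simplified illustration only: the exact on-disk format Windows NT uses is not described here, and the sample data and function name are made up.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: data on diskA and diskB, parity on diskC (sample data).
disk_a = b"INDEX"
disk_b = b"QUERY"
disk_c = xor_blocks(disk_a, disk_b)  # parity block

# Simulate losing diskB, then rebuild it from the two surviving disks.
rebuilt_b = xor_blocks(disk_a, disk_c)
print("diskB rebuilt correctly:", rebuilt_b == disk_b)
```

The same calculation recovers any single missing block in the stripe, which is why one failed drive can be tolerated but two cannot.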

Partition striping does not come without associated costs. Having multiple disk drives involved in a write operation requires extra system CPU time and may degrade write operations. Additional disk space is also required to support the parity information when fault-tolerance capabilities are added to standard disk striping.
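The disk-space cost of each approach can be compared with a little arithmetic. The sketch below assumes equal-size partitions, a two-disk mirror, and a striped set with parity that gives up the equivalent of one partition to parity; the scheme names and the function are hypothetical labels for this illustration.

```python
def usable_capacity_mb(partition_mb: int, disks: int, scheme: str) -> int:
    """Return usable space in MB for equal-size partitions on each disk."""
    if scheme == "stripe":
        return partition_mb * disks  # no redundancy, all space is usable
    if scheme == "mirror":
        if disks != 2:
            raise ValueError("this sketch models a two-disk mirror")
        return partition_mb  # half of the raw space
    if scheme == "stripe_with_parity":
        if disks < 3:
            raise ValueError("striping with parity needs at least three disks")
        return partition_mb * (disks - 1)  # one partition's worth holds parity
    raise ValueError("unknown scheme: " + scheme)


print("stripe:", usable_capacity_mb(500, 3, "stripe"), "MB")
print("mirror:", usable_capacity_mb(500, 2, "mirror"), "MB")
print("stripe with parity:", usable_capacity_mb(500, 3, "stripe_with_parity"), "MB")
```

With three 500MB partitions, plain striping yields 1,500MB of usable space while striping with parity yields 1,000MB; a mirrored pair of 500MB partitions yields 500MB.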

The type of documents your site will service and the ratio of read and write operations against those documents can determine how you design your fault-tolerance capability. For example, you might want to use disk mirroring for documents that are constantly being updated or written, but use multiple striped partitions with parity for documents that are generally retrieved (read) and not updated. In most cases, the benefits of adding disk-drive fault-tolerance capabilities outweigh the associated costs.

Disk Fragmentation Issues

Whenever a file is created or updated, the operating system must acquire the needed disk space within the specified partition. In the case of a document that is being updated, if the additional space is not contiguous with the existing document space, space must be used elsewhere within the partition. For newly created documents, if contiguous space is not available to contain the entire document, multiple partition spaces are used to store the contents of the document. This dispersal of data is called disk fragmentation.
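The following toy model shows how a file becomes fragmented when free space is not contiguous. It takes the first free clusters it finds and reports the resulting runs (extents); this is a deliberately simplified assumption and does not reflect how the NTFS or FAT allocators actually choose space.

```python
from typing import List, Tuple


def allocate(free_map: List[bool], clusters_needed: int) -> List[Tuple[int, int]]:
    """Claim the first free clusters and return (start, length) extents."""
    chosen = [i for i, is_free in enumerate(free_map) if is_free][:clusters_needed]
    if len(chosen) < clusters_needed:
        raise ValueError("not enough free clusters")
    extents: List[Tuple[int, int]] = []
    for i in chosen:
        free_map[i] = False  # mark the cluster as used
        if extents and extents[-1][0] + extents[-1][1] == i:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # extend the current run
        else:
            extents.append((i, 1))  # start a new, non-contiguous run
    return extents


# True = free cluster, False = cluster already in use.
partition = [True, True, False, False, True, True, True, False, True, True]
extents = allocate(partition, 4)
print("extents:", extents)
print("fragments:", len(extents))
```

Here a four-cluster file is split into two fragments because only two contiguous clusters were free at the start of the partition.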

Each time a file (document, index, system file, and so on) is accessed by the operating system, the disk heads must seek the location of the data, read the data, and return it to the calling program. If the information is spread across a disk partition in multiple locations because of fragmenting, the disk drive must work harder to find and return the requested data. If you have disk fragmenting, the only solution is to de-fragment the disk using a de-fragmenting tool. Windows NT includes de-fragmenting capabilities that you may need to utilize on a periodic basis.

Security and Intranet Versus Internet Issues

The clearest difference between intranet and Internet deployments is probably access privileges. If your Internet Information Server implementation will support both internal (intranet) and external (Internet) access, the design and layout of your Index Server implementation will require greater attention to security concerns. For example, you wouldn't want Internet users to be able to query and retrieve human-resource documents such as notices of discharge, salary information, and so on.

By implementing a design that takes full advantage of NT features such as NTFS Security, you can restrict document access to authorized end-users.

To take full advantage of the security features built into Windows NT, you must ensure that all documents serviced by Internet Information Server reside on NTFS partitions and not FAT partitions.

To further restrict specific query forms, you can specify query scopes that target only specific documents or groups of documents. For this reason, it may be beneficial to define the physical locations within your document site that would be used for querying by specific query forms. For example, you might define a virtual directory within IIS that points to documents that only corporate personnel can retrieve. Then, using NTFS Security, you can limit access to those documents or to the forms that query those documents to only NT domain users.
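The idea of a query scope can be pictured as a path-prefix test on virtual paths. The sketch below is only a conceptual model with hypothetical names and paths; Index Server applies scopes itself, and NTFS permissions, not a check like this one, are what actually protect the documents.

```python
import posixpath


def in_scope(virtual_path: str, scope: str) -> bool:
    """Return True if virtual_path falls at or below the scope directory."""
    # Normalize both paths so that "..", repeated slashes, and letter case
    # cannot be used to slip outside the scope.
    path = posixpath.normpath("/" + virtual_path.strip("/")).lower()
    root = posixpath.normpath("/" + scope.strip("/")).lower()
    return path == root or path.startswith(root.rstrip("/") + "/")


scope = "/corporate"  # hypothetical virtual directory for corporate personnel
for candidate in ("/corporate/hr/salaries.doc",
                  "/public/press/release.htm",
                  "/corporate/../public/index.htm"):
    print(candidate, "->", in_scope(candidate, scope))
```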

Summary

As with any system, proper planning and understanding of the system design can alleviate many of the pains associated with the growth of the system. With the exception of defining specific virtual directories within IIS that will be used to manage query-form access, many design issues related to Index Server are pertinent to any Windows NT installation.
