- D -
Index Server Frequently Asked Questions (FAQ)

This appendix contains answers to some of the questions most frequently asked by Index Server administrators and users. This is not an all-encompassing list of questions, but you'll find answers to some of the most common basic questions and a few advanced issues as well.

The questions answered in this appendix include

What querying capabilities are provided and performed by Index Server?
What is a virtual root?
What is a catalog and how does it differ from an index?
Is it possible to prevent documents from specific directories from being included with a result set returned from a query?
How do I establish indexing of remote UNC shares?
What is a corpus? How does it differ from a scope?
What is the main difference between scanning, indexing, and filtering?
What are word breakers used for?
Can the number of messages Index Server writes to the NT event log be limited somehow?
Does Index Server support fuzzy queries?
What are some steps I can take to improve performance?
How is it possible that the Files to be filtered counter shows a value greater than the Total # documents counter?
Why do unreadable files show up in query results? Can this be avoided?
Why is the Files to be filtered counter non-zero even though my system is sitting idle?
Why don't documents that are known to exist show up in result sets as expected?

Resource

Microsoft maintains a newsgroup dedicated to Index Server in which you can peruse threaded discussions of many issues relating to Index Server problems, workarounds, questions and answers, and unique implementation details. This can be an invaluable resource because it gives you the opportunity to learn from the successes and experiences of other administrators and developers. The newsgroup name is microsoft.public.inetserver.iis.tripoli.

Note that tripoli refers to the code name given to Index Server prior to its public release by Microsoft.

Q&A

Q: What querying capabilities are provided and performed by Index Server?

A: Index Server provides extensive querying capabilities and functionality. With Index Server, you can perform queries against the content of documents within your corpus as well as properties of those documents. Complex query restrictions can be developed that employ content and property queries simultaneously. To support these querying capabilities, Index Server provides:

The capability to develop query and administrative request scripts in the form of .idq and .ida files, respectively. These scripts can be invoked from HTML forms in a manner similar to invoking .cgi scripts
The capability to limit the scope of the query to specific virtual roots indexed by the server
The capability to customize result set reports through the use of .htx template files

Additionally, Index Server provides the capability to perform complex queries through the use of a rich query language. Using the query language, you can:

Perform searches for words and phrases within document contents
Perform proximity searches for words or phrases near another word or phrase
Perform searches for words and phrases within textual properties (such as @DocAuthor Drew)
Perform searches for properties using relational operators such as <, <=, =, >=, and > against a constant (DATE > 8/31/96, for example)
Perform searches using the boolean operators AND, OR, and NOT
Perform searches using wildcards and regular-expression constructs.

Q: What is a virtual root?

A: A virtual root is simply an alias name for a physical path to a directory on disk. For example, the virtual root /e_books could point to the physical directory E:\e_books. Note that virtual roots always start with a /. Virtual roots are also known as virtual directories.
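As a rough illustration of the aliasing idea, the following Python sketch translates a virtual path into a physical one. The VIRTUAL_ROOTS table and the resolve_virtual_path function are hypothetical names invented for this example. They are not part of Index Server or IIS, which perform this mapping internally.

```python
# Hypothetical mapping of virtual roots to physical directories.
VIRTUAL_ROOTS = {
    "/e_books": "E:\\e_books",
}

def resolve_virtual_path(vpath):
    """Translate a virtual path into a physical path, or return None."""
    # Try the longest virtual root first so that nested roots win.
    for root in sorted(VIRTUAL_ROOTS, key=len, reverse=True):
        if vpath == root or vpath.startswith(root + "/"):
            # Keep the part below the root and switch to backslashes.
            remainder = vpath[len(root):].replace("/", "\\")
            return VIRTUAL_ROOTS[root] + remainder
    return None  # the path is not under any known virtual root

print(resolve_virtual_path("/e_books/TeachHTML32/ch01.htm"))
```

A path such as /e_books2/x is not resolved, because only the root itself or paths below it (after a /) count as a match.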

Q: What is a catalog and how does it differ from an index?

A: A catalog is a directory (named Catalog.wci) of indexes and other files used internally by Index Server to locate documents that meet a query restriction. An index is a data structure used to store words and information extracted from files during filtering. Indexes can be non-persistent, in-memory, lightly compressed structures (wordlists), or they can be persistent, on-disk, highly compressed structures (shadow indexes and the master index). There are typically several indexes in a catalog.

Q: Is it possible to prevent documents from specific directories from being included with a result set returned from a query?

A: Because Index Server indexing and query scopes are based on virtual roots, it is not possible to explicitly exclude certain directories from indexing. You can, however, structure your .idq files so that documents from specific directories are excluded from the result set returned to the user.

Suppose the virtual root /e_books points to the physical directory E:\e_books. You want to exclude subdirectories E:\e_books\TeachHTML32 and E:\e_books\TeachVBScript from the result set. To do so, modify the query restriction passed to the .idq file as follows:


CiRestriction=%UserRestriction% AND NOT #path E:\e_books\TeachHTML32 AND NOT #path E:\e_books\TeachVBScript
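If the list of excluded directories changes often, the restriction line can be generated rather than typed by hand. The following Python sketch is one way to do that. The exclude_directories function is a hypothetical helper, not an Index Server facility. It wraps the user's restriction in parentheses, mirroring the parenthesized form used elsewhere in this appendix, so that OR terms typed by the user cannot change what the appended AND NOT clauses apply to.

```python
def exclude_directories(user_restriction, directories):
    """Build a restriction that drops results under the given physical paths."""
    # Parenthesize the user's text so its own operators stay grouped.
    parts = ["(" + user_restriction + ")"]
    for directory in directories:
        parts.append("AND NOT #path " + directory)
    return " ".join(parts)

restriction = exclude_directories(
    "%UserRestriction%",
    ["E:\\e_books\\TeachHTML32", "E:\\e_books\\TeachVBScript"],
)
print("CiRestriction=" + restriction)
```

The printed line can then be pasted into the .idq file.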

Q: How do I establish indexing of remote UNC shares?

A: Virtual roots that point to UNC shares are automatically indexed by Index Server. Index Server utilizes automated change notifications if they are supported by the remote share. Otherwise, the remote share is scanned periodically for changes based on the value of the registry parameter ForcedNetPathScanInterval. Also, be sure that the user ID and password specified for the virtual root are valid for the remote share.

Q: What is a corpus? How does it differ from a scope?

A: Corpus refers to the entire set of documents that are indexed and represented in a catalog. A scope, on the other hand, refers to a set of documents that will be searched during a query. A scope is specified by a virtual root. The virtual root can be defined to include the entire document corpus if desired. Likewise, scopes can be defined to include only a portion of the corpus.
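The relationship between the two terms can be pictured with a small Python sketch. The corpus list and the documents_in_scope function are hypothetical and only model the idea. Index Server resolves scopes internally.

```python
# Hypothetical corpus: every document represented in the catalog.
corpus = [
    "/e_books/TeachHTML32/ch01.htm",
    "/e_books/TeachVBScript/ch01.htm",
    "/reports/q3.doc",
]

def documents_in_scope(corpus, scope):
    """Return the documents under the virtual root that defines the scope."""
    if scope == "/":
        return list(corpus)  # the scope covers the entire corpus
    prefix = scope.rstrip("/") + "/"
    return [doc for doc in corpus if doc.startswith(prefix)]

print(documents_in_scope(corpus, "/e_books"))  # a portion of the corpus
print(documents_in_scope(corpus, "/"))         # the whole corpus
```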

Q: What is the main difference between scanning, indexing, and filtering?

A: Scanning, filtering, and indexing are closely related steps in the process of building the index used by Index Server to satisfy query requests. Scanning is the process by which Index Server identifies files within indexed virtual roots that have been modified. Filtering is a two-stage process by which (1) the CiDaemon process determines which filters are appropriate for use on a changed document and (2) the filters are used to extract information (words) for use in the index. Indexing is simply the process by which information extracted from documents is stored in wordlists and shadow indexes and eventually merged into the master index.
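The three steps can be modeled in a few lines of Python. This is a toy sketch under stated assumptions, not Index Server's implementation: modification stamps are plain integers, a single regular expression stands in for both filter selection and word breaking, and the wordlist is an ordinary dictionary. All function and variable names are invented for the example.

```python
import re

def scan(last_seen, current):
    """Scanning: report paths that are new or whose stamp has changed."""
    return [p for p, stamp in current.items() if last_seen.get(p) != stamp]

def filter_document(text):
    """Filtering: extract words from a document's content."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_words(wordlist, path, words):
    """Indexing: record which documents contain each word."""
    for word in words:
        wordlist.setdefault(word, set()).add(path)

# Hypothetical data: modification stamps and file contents.
last_seen = {"/docs/a.txt": 1, "/docs/b.txt": 1}
current = {"/docs/a.txt": 1, "/docs/b.txt": 2, "/docs/c.txt": 1}
contents = {"/docs/b.txt": "Index Server FAQ", "/docs/c.txt": "Server setup"}

wordlist = {}
for path in scan(last_seen, current):
    index_words(wordlist, path, filter_document(contents[path]))

print(sorted(wordlist["server"]))
```

Only the changed file and the new file are filtered. The unchanged file /docs/a.txt is skipped by the scan.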

Q: What are word breakers used for?

A: Word breakers are language-dependent modules that Index Server uses during the process of filtering to identify words in a document.

Q: Can the number of messages Index Server writes to the NT event log be limited somehow?

A: Specific events can have their messages enabled or disabled through the use of bit-field masking. See Appendix A, "Index Server Registry Parameters," for details.
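For readers unfamiliar with bit-field masking, the following Python sketch shows the general technique. The event names and bit values are hypothetical placeholders, and the sketch assumes that a set bit means the event is logged. The actual registry parameter and bit assignments are the ones given in Appendix A.

```python
# Hypothetical event categories; the real bit assignments are NOT shown here.
EVENT_A = 0x1
EVENT_B = 0x2
EVENT_C = 0x4

def is_logged(mask, event_bit):
    """Return True if the bit for this event is set in the mask."""
    return (mask & event_bit) != 0

def disable(mask, event_bit):
    """Clear one event's bit and leave the others untouched."""
    return mask & ~event_bit

mask = EVENT_A | EVENT_B | EVENT_C   # start with every event enabled
mask = disable(mask, EVENT_B)        # stop logging the second event
print(hex(mask), is_logged(mask, EVENT_B), is_logged(mask, EVENT_A))
```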

Q: Some indexing engines can search for text that is not an exact match, but is similar to the text of the query restriction. These types of queries are sometimes referred to as fuzzy queries. Does Index Server support fuzzy queries?

A: Index Server supports fuzzy queries by searching for words and text similar to those in the query restriction. Rather than looking for only exact matches, the query engine modifies the words in the query and looks for these modified forms. Fuzzy query support is provided in one of the following ways:

Wildcard matching
Regular-expression matching against textual properties
Linguistic stemming, or matching inflected and base forms of words in the query. For example, the word swim in a query restriction would be matched by swimming, swam, swum, and so on.
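The difference between wildcard matching and stemming can be seen in a short Python sketch. It is only an illustration: the standard fnmatch module stands in for Index Server's wildcard handling, and the small BASE_FORMS table is a hand-written placeholder for a real linguistic stemmer.

```python
from fnmatch import fnmatchcase

# Hand-written table standing in for a real linguistic stemmer.
BASE_FORMS = {"swimming": "swim", "swam": "swim", "swum": "swim"}

def wildcard_match(word, pattern):
    """Match a word against a pattern such as swim*."""
    return fnmatchcase(word.lower(), pattern.lower())

def base_form(word):
    """Look up the base form, falling back to the word itself."""
    return BASE_FORMS.get(word.lower(), word.lower())

def stem_match(word, query_word):
    """Match when both words share the same base form."""
    return base_form(word) == base_form(query_word)

print(wildcard_match("swimming", "swim*"), wildcard_match("swam", "swim*"))
print(stem_match("swam", "swim"), stem_match("swing", "swim"))
```

A wildcard such as swim* finds swimming but misses swam, whereas stemming relates both forms to swim.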

Q: What are some steps I can take to improve performance?

A: Several things can be done to improve Index Server performance:

Limit query scopes to only those parts of the corpus that need to be searched by given users
Some queries can be optimized so that they are not enumerated (CiForceUseCi=TRUE, for example), while others can be optimized so that report templates make efficient use of bookmark parameters and do not needlessly perform queries again. See Chapters 6–8 for additional information.
Multiple catalogs can be created to distribute queries. Additionally, catalogs can be placed on drives other than those that are used to store indexed documents.
Make sure there is always adequate disk space for the catalog to perform merging operations.
Add RAM. The use of an additional processor can also improve performance.
Review registry entries pertaining to wordlist behavior, merge intervals, and so on to be certain that they are set to optimize the resources on your system.

Q: How is it possible that the Files to be filtered counter shows a value greater than the Total # documents counter?

A: The Files to be filtered counter value represents the number of entries in the queue of documents that have been changed and need to be filtered. Some files may have been modified more than once and thus have multiple entries in the changed-documents queue, so the counter can exceed the number of documents.
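A small Python sketch makes the arithmetic concrete. The file names and the queue are hypothetical and only model the behavior described above.

```python
total_documents = 3   # a.doc, b.doc, and c.doc

# One queue entry is recorded per change, not per document.
change_queue = []
for changed_path in ["a.doc", "b.doc", "a.doc", "c.doc", "a.doc"]:
    change_queue.append(changed_path)

files_to_be_filtered = len(change_queue)       # counts every entry
distinct_documents = len(set(change_queue))    # counts each file once

print(files_to_be_filtered, distinct_documents, total_documents)
```

Here a.doc was modified three times, so the queue holds five entries even though only three documents exist.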

Q: Why do unreadable files show up in query results? Can this be avoided?

A: Because Index Server indexes roots that do not have read permissions but are located under a root that does, you will need to employ a workaround to prevent files in the unreadable root from showing up in a user's result set. This is easily done within .idq files. Suppose you have an unreadable root named /_myroot. To prevent any documents or files in this directory from showing up in a user's result set, extend the CiRestriction parameter in the .idq file as follows:


CiRestriction=(%UserRestriction%) &! #vpath *\_myroot\*

This tells Index Server to combine the query restriction passed by the user with a query language directive that excludes any result whose virtual path contains the string /_myroot.

Q: Why is the Files to be filtered counter non-zero even though my system is sitting idle?

A: This occurs when some files have failed to filter, which typically happens when files that are to be filtered are in use by some other process when the CiDaemon process attempts to filter the document's contents. When this occurs, the file is relegated to a lower priority queue to be filtered at a later time. The time interval between retries on these files is controlled by the registry parameter .

Q: Why don't documents that are known to exist show up in result sets as expected?

A: There are several circumstances where documents that are known to exist do not show up in a result set. The most obvious circumstance is when a query restriction is used that prevents the document(s) from appearing in the result set. Assuming this is not the case, however, the following list details some other instances where this problem may occur:

It is possible that the document has not yet been filtered and the index is not up-to-date. You can look at the value of the CiOutOfDate variable to determine whether it is set to TRUE. If so, this indicates the existence of files that need to be filtered. The NT Performance Monitor or an administrative query can also be used to determine the number of documents waiting to be filtered.
Sometimes errors occur when documents are filtered. This can happen for a variety of reasons, such as a poorly constructed filter DLL. Administrative queries can be used to list which files are unfiltered by including CiRestriction=@unfiltered=TRUE in the .idq file. This way, you can see if the file you expected in the result set indeed had problems being filtered. You should also get in the habit of checking the NT event log to see whether any error messages were entered by Index Server.
Some complex queries (such as enumerated queries) can consume a great deal of CPU time. If the amount of time spent on resolving a query exceeds the administrator-imposed limit specified in the registry parameter MaxQueryExecutionTime, the query will time out. In these instances, the CiQueryTimedOut variable is set to TRUE.
Some complex queries (such as enumerated queries) can consume a great deal of system resources. However, many of these types of queries are turned off by setting the CiForceUseCi variable to TRUE. Under these circumstances, some expected results may not be returned because the query engine only used information in the content index to resolve the query. You can check to see whether this is the case by inspecting the value of the CiQueryIncomplete variable to see if it is set to TRUE.
Make sure the user actually has permission to read the document he expects to see returned in a result set. If the document resides on an NTFS drive (highly recommended!), query results will not be returned for any document the user is not authorized to see.
There may be a mismatch between the language used to filter the document and the language used to issue a query. Document filtering and querying are both language-dependent processes. Therefore, a query issued in one language against a document filtered in another language can cause unpredictable results. Most well-designed filters use the language specified by a file (formats such as Word are marked with a language) to filter the file. However, some file formats, such as text, do not contain any language information. In these cases, filters typically default to the system locale for the files. The query locale, on the other hand, is specified by the CiLocale variable (if it is set). Otherwise, the browser locale or the default server locale is used for the query.
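The variable checks in the list above can be summarized as a simple decision routine. The Python sketch below is only a model: the diagnose function and the dictionary of values are hypothetical, and in practice these variables are inspected in the result template rather than in Python.

```python
def diagnose(variables):
    """Return hints based on the result-set variables described above."""
    hints = []
    if variables.get("CiOutOfDate"):
        hints.append("Index is out of date: files are waiting to be filtered.")
    if variables.get("CiQueryTimedOut"):
        hints.append("Query exceeded MaxQueryExecutionTime and timed out.")
    if variables.get("CiQueryIncomplete"):
        hints.append("Only the content index was used; results may be incomplete.")
    return hints

# Hypothetical values reported for one query.
for hint in diagnose({"CiOutOfDate": True, "CiQueryIncomplete": True}):
    print(hint)
```

An empty list means none of these three variables explains the missing document, which leaves filtering errors, permissions, and locale mismatches to check.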
