


- 6 -
Index Server Query Language


End users construct queries using a special query language understood by Index Server. These queries target the specific documents that will be included in a result set. Index Server's query language is similar to that of other search engines in that it supports the standard query operators. In addition, Index Server supports the use of document properties in query strings, so end users can query on values such as the document author, modification time, file size, and so on.

In this chapter you will learn how to use query language operators in conjunction with document properties to target the specific set of documents for your result set. Additionally, you'll learn about other query capabilities available in Index Server. These capabilities include the use of proximity searches, free-text searches, and vector-space queries.



All queries in this chapter use the Index Server Sample Query Form, which comes installed with Index Server. Additionally, all query strings target documents that are known to be in the Index Server online documentation, so you can test each query using your own Index Server installation.


Query Language Operators


The ultimate goal in designing a query string is to target the exact result set of documents that interests you. In most cases, a query cannot be fully satisfied with a single query term. For example, you might want to target documents with multiple text strings, a specific document author, a specific creation time, specific update times, and so on. To combine these query terms, you need to use query language operators.

Query language operators provide a means for joining test conditions in a query. These test conditions are used to validate the existence of text as well as to validate document properties against a query term. Each query operator serves a specific purpose. This section introduces each type of operator: boolean, proximity, relational and wildcard.

Using Boolean Operators


In Chapter 1, "Overview of Document Indexing, Queries, and Result Sets," you learned that boolean operators can be expressed as keywords or symbols (see Tables 1.1 and 1.2). Although these symbols are standard across all Index Server-supported languages, keywords continue to be more commonly used. This is probably due to the ease of understanding that is gained when using language-like query strings. However, for standardizing sites that require support for multiple languages, using symbol operators may well be a better solution.

As mentioned in Chapter 1, the three boolean operators are AND, OR and NOT.



Language-specific boolean operator keywords can be found in Table 1.2 of Chapter 1.

The three boolean operators can be combined in any order, mixed with other query-restriction operators, and grouped with parentheses to reduce the result set to the specific target documents. In this section you'll learn how to use each of these operators when building your query restrictions as well as how to use parentheses to override the normal query-processing logic.



The boolean-expression evaluator built into Index Server is smart enough to stop processing ANDs if a term evaluates to FALSE, and ORs if a term evaluates to TRUE.
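
The short-circuit behavior described in the preceding note can be illustrated with a small Python sketch. This is a toy model only, not Index Server's evaluator: the term, eval_and, and eval_or helpers are hypothetical names invented for this illustration, and each term is simply a function that records when it is evaluated.

```python
# Toy model only: each "term" is a function that reports whether a document
# matches and records the fact that it was evaluated.
evaluated = []

def term(name, result):
    """Build a hypothetical term test that logs its own evaluation."""
    def test():
        evaluated.append(name)
        return result
    return test

def eval_and(tests):
    # all() stops at the first test that returns False.
    return all(t() for t in tests)

def eval_or(tests):
    # any() stops at the first test that returns True.
    return any(t() for t in tests)

and_result = eval_and([term("damaged", False), term("corruption", True)])
and_trace = list(evaluated)

evaluated.clear()
or_result = eval_or([term("incremental", True), term("scanning", True)])
or_trace = list(evaluated)

print(and_result, and_trace)  # False ['damaged']
print(or_result, or_trace)    # True ['incremental']
```

In both cases the second term is never evaluated, because the first term already decides the outcome.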


The AND Operator

The AND operator is used to specify that the terms on both sides of the expression must evaluate to TRUE for the AND term itself to evaluate to TRUE. Consider a query in which you want to target all documents that include the words damaged and corruption. The query term


damaged AND corruption

specifies to Index Server that both terms (damaged and corruption) must be included in the document for the document to be included in the result set. If a document is found that includes the word damaged but does not include the word corruption, the query term damaged AND corruption evaluates as FALSE.



Just because a term evaluates to FALSE does not mean that the document will be excluded from the result set. As you will learn in the following sections, other operators exist that evaluate to TRUE only when one of their terms evaluates to FALSE.

Let's now execute a boolean AND query using the Index Server Sample Query Form. When you installed the Index Server software, a program group was created that includes the Sample Query Form. Access this form by selecting Start | Programs | Microsoft Index Server (Common) | Index Server Sample Query Form.

With the query form now displayed, enter the query string


damaged AND corruption

Figure 6.1 displays the query form with the query string entered.

Figure 6.1. Path to the Index Server Sample Query Form.

Next, select the Execute Query button to launch the query. Figure 6.2 displays the query result set and resulting page.

Figure 6.2. Query results of boolean AND operator search.

At a minimum, your result set should include three documents from the Index Server online documentation.

One of the new features of Index Server Release 1.1 is the capability to highlight (using colored text, bolding, and the like) the matching strings in your result set documents. Developers have the ability to control how highlighting is presented to the user. You'll learn more about this in Chapter 8, "HTML Extension Files."

For now, let's look at the highlighted terms for the selection Microsoft Index Server Guide: Filtering by clicking the Highlight Hits hypertext link associated with that result set document. Figure 6.3 displays the Summary Hit Highlighted Page.

Figure 6.3. Summary Hit Highlighted Page for boolean AND operator search.

Boolean operators are not limited to simple text terms like damaged and corruption. As you will see in later sections, boolean operators can be used in conjunction with other query restrictions, such as property queries, full-text queries, and vector space queries. Multiple boolean operators are simply concatenated onto the query string to further restrict the result set.

Using the previous query, let's further restrict the result set by adding the restriction to search for documents that also include the term warning. Using the sample query form, enter the query


damaged AND corruption AND warning

Figure 6.4 displays the result set of this query.



Depending on the documents indexed at your site, your result set may vary. However, at a minimum, your result set should include the result displayed in Figure 6.4.

Figure 6.4. Query results using multiple boolean AND operators.

As you can see in Figure 6.4, your result set is now reduced to just one document. Prior to adding the additional restriction AND warning, your result set included three documents. By adding the additional restriction, you've reduced the result set to a single document.

The OR Operator

The OR operator is used to specify that at least one of the terms on either side of the expression must evaluate to TRUE for the OR term itself to evaluate to TRUE. For example, say you want to target all documents that contain either of the words scanning or incremental. The query term


incremental OR scanning

specifies to Index Server that documents containing either of the query terms (incremental or scanning) should be included in the result set.

Using the sample query form, enter the following query to see the result set (see Figure 6.5):


incremental OR scanning

Figure 6.5. Query results using the boolean OR operator

Again, depending on the documents you have indexed at your site, your result set may vary. However, you should have at least six documents from the Microsoft Index Server online documentation in your result set.

To see the difference between using the OR operator and the AND operator, enter the query for the terms incremental and scanning, as you did in the previous example. However, change the OR operator to an AND operator, as in


incremental AND scanning

As you can see in Figure 6.6, the number of matching documents is dramatically reduced to just a single document.

Figure 6.6. Query results using the boolean AND operator

The NOT Operator

The NOT operator is used to specify documents that do not meet a specific restriction. In the case of content queries, the NOT operator must follow an AND operator and a previously defined restriction. For instance, in the preceding OR example, you used the query string


incremental OR scanning

to define the result set of documents that include either of the terms incremental or scanning. Using the NOT operator, you can specify an additional restriction on the content. For example, say you want to exclude from your result set any documents that contain the word corruption. Just add the NOT restriction like so:


(incremental OR scanning) AND NOT corruption

This restricts your result set to only those documents that satisfy all other terms but do not include the term corruption.



Note that you added parentheses around the original query string incremental OR scanning. Because the AND operator has greater precedence than the OR operator, the parentheses ensure that the NOT restriction applies to the whole of the original query. Had you not used the parentheses, the query string would have been interpreted as:

incremental OR (scanning AND NOT corruption)

In the next section, you'll learn how and why adding parentheses can alter your result set.

Figure 6.7. Query results using the boolean NOT operator.

By adding the NOT restriction, your result set is reduced from the original six online documentation documents down to just four documents.
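
One way to picture how the three boolean operators shape a result set is to treat each term as the set of documents that contain it. The following Python sketch is an illustration under that assumption only; the document names and their contents are made up, and nothing here reflects how Index Server stores or evaluates its index.

```python
# Hypothetical mini-corpus: document name -> set of words it contains.
docs = {
    "a.htm": {"incremental", "index"},
    "b.htm": {"scanning", "corruption"},
    "c.htm": {"incremental", "scanning"},
    "d.htm": {"corruption"},
}

def containing(word):
    """Return the names of documents that contain the given word."""
    return {name for name, words in docs.items() if word in words}

# incremental OR scanning  -> union of the two sets
either = containing("incremental") | containing("scanning")
# incremental AND scanning -> intersection of the two sets
both = containing("incremental") & containing("scanning")
# (incremental OR scanning) AND NOT corruption -> set difference
filtered = either - containing("corruption")

print(sorted(either))    # ['a.htm', 'b.htm', 'c.htm']
print(sorted(both))      # ['c.htm']
print(sorted(filtered))  # ['a.htm', 'c.htm']
```

As in the sample queries, OR widens the result set, AND narrows it, and AND NOT removes documents from a set that another restriction has already selected.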



Content queries using the NOT operator must be prefaced with another query restriction. For example, you cannot use the query string


NOT corruption

to search for all documents that do not include the word corruption. The correct syntax would be to preface the NOT restriction with another restriction, as in

(incremental OR scanning) AND NOT corruption

However, in a property query, a NOT restriction that is the first term of the query string does not need to be prefaced by the AND operator. For example, both of the following query strings are valid. Note that no AND precedes the NOT on the @size restriction in the first example, but one does in the second example.

(NOT @size > 5000) AND (incremental OR scanning) AND NOT corruption


(incremental OR scanning) AND NOT corruption AND NOT @size > 5000


Using Proximity Operators


Proximity searches allow users to search for terms within a specified number of words of another term. This is most useful when searching for documents in which pertinent words might not be located beside each other.

A good example of a proximity search would be to look for documents that include the terms President and Clinton. Sure, you could use a simple AND operator and construct your query as


President AND Clinton

However, your result set would include every document that contains both words anywhere in its text, no matter how far apart or unrelated the two words are. This would include the documents that you want, but would also include documents that you do not want in your result set.



Proximity searches in Index Server are similar to those of other search engines. However, they are limited in that the user cannot specify the proximity (in words) of the two terms. By contrast, some search engines support queries that allow the user to specify that one word be found within a given number of words of the other.

Your next thought might be to modify your query to search for documents that include the term President Clinton enclosed in double quotes as in the query


"President Clinton"

This query would include many of the documents for which you are searching. However, the result set of this type of query would not include documents with text such as President Bill Clinton, President and Mrs. Clinton, or President William Jefferson Clinton. Proximity searches can satisfy most queries in which you are uncertain of the exact text.

Through the use of the NEAR operator, Index Server provides end users with the ability to target documents in which user-specified words appear close to each other.

The NEAR Operator

The NEAR operator not only specifies documents that contain relevant words on each side of the NEAR operator, but also provides a rank value based on how close the two words are to each other. The closer the two words within the document, the higher the rank assigned to that document.



A rank of 0 is assigned to proximity searches where the two relevant words are more than 50 words apart. If a NEAR operator returns a rank value of 0, the document is still returned as a member of the result set.

Using the sample query page, launch an Index Server query with the following as your query string:


full text NEAR index

Figure 6.8 displays the results of this query.

Figure 6.8. Query results using the proximity NEAR operator.

As you can see in Figure 6.8, two documents are returned. Select the Highlight Hits hypertext link and the words that matched the query will be highlighted within the text. The default query-page results are sorted in descending rank order, meaning that documents in which the matching words appear closer together receive a higher rank (and thus sort at the top of the result set).

Looking at the default results page, there is one thing missing—the rank assigned to the document. You'll learn in Chapter 8 how to modify the formatted output to display the document properties. However, for now you're going to make a minor modification to the sample query document hypertext extension (.htx) file so that you can see the ranking assigned to NEAR operator result sets.

Find the file queryhit.htx in your Internet Information Server directory structure. If you chose all the defaults when you installed Index Server, it should be located in the directory <inetsvr_root>\scripts\samples\Search. Open the queryhit.htx file and search for the string


<b>Highlight Hits</b></a>

Modify this line to append the ranking information as follows:


<b>Highlight Hits</b></a><BR>Rank: <%rank%>

Save the file and re-launch the query (if you still have your previous query page active, simply select the refresh button for your specific Web browser).

Figure 6.9. Query results with ranking using the proximity NEAR operator.

Figure 6.9 displays document rankings. If you select the Highlight Hits hypertext link of the first result-set entry, you'll find that the term full text appears just two words apart from the term index. For this entry, a rank value of 11 was assigned to the document. In the second result-set entry, the two words are found more than 50 words apart, so a rank value of 0 is assigned to the entry.
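
To make the idea of word distance concrete, here is a minimal Python sketch that measures the smallest gap between two single words and applies the 50-word cutoff mentioned above. It is a sketch under stated assumptions: the helper names are hypothetical, the gap is measured as the difference between word positions, and the formula Index Server uses to turn distance into a rank value is not described in this chapter, so no rank is computed.

```python
import re

def word_positions(text, word):
    """0-based word indexes at which `word` occurs in `text`."""
    words = re.findall(r"\w+", text.lower())
    return [i for i, w in enumerate(words) if w == word.lower()]

def min_distance(text, left, right):
    """Smallest gap, in words, between any occurrence of the two terms.

    Returns None when either term is missing from the text.
    """
    lefts = word_positions(text, left)
    rights = word_positions(text, right)
    if not lefts or not rights:
        return None
    return min(abs(a - b) for a in lefts for b in rights)

MAX_RANKED_GAP = 50  # from the chapter: beyond 50 words the rank is 0

def has_nonzero_rank(text, left, right):
    """True when both terms occur within the assumed 50-word cutoff."""
    gap = min_distance(text, left, right)
    return gap is not None and gap <= MAX_RANKED_GAP

close = "The index supports full text queries."
far = "text " + "filler " * 60 + "index"

print(min_distance(close, "text", "index"))      # 3
print(has_nonzero_rank(close, "text", "index"))  # True
print(min_distance(far, "text", "index"))        # 61
print(has_nonzero_rank(far, "text", "index"))    # False
```

Both sample texts would still be returned by a NEAR query, because a document with a rank of 0 remains a member of the result set.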

Using Parentheses to Override Operator Precedence


Because the boolean operators maintain an order of precedence in which some operators have higher precedence than others, you must pay close attention to the specific ordering of queries that use multiple boolean operators. For example, the AND operator maintains a higher precedence than the OR operator. However, the use of parentheses can guarantee the proper processing of the query string.



The use of parentheses does not force a particular evaluation order because the query is converted to a normalized internal representation before processing.

Look at the query used earlier in the chapter to target documents containing either of the words incremental or scanning but not containing the word corruption. The query string


(incremental OR scanning) AND NOT corruption

would return the four specific online documentation documents. To see the effect of not using parentheses, remove them and run the query like so:


incremental OR scanning AND NOT corruption

Figure 6.10 displays the unintended result set returned when the parentheses are missing from the query string.

Figure 6.10. Query results with parentheses missing.

After executing the query, notice that six documents are returned in the result set. This is because Index Server interpreted the query to be


incremental OR (scanning AND NOT corruption)

which was not your intended query string.
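
Python's own boolean operators happen to follow the same relative precedence (AND binds more tightly than OR), so a short script can show exactly when the two readings of the query disagree. This is only an analogy: the three function names are hypothetical, and each argument stands for whether a document contains that word.

```python
from itertools import product

def unparenthesized(incremental, scanning, corruption):
    # Written without parentheses, as in the second query string.
    return incremental or scanning and not corruption

def intended(incremental, scanning, corruption):
    # (incremental OR scanning) AND NOT corruption
    return (incremental or scanning) and not corruption

def as_interpreted(incremental, scanning, corruption):
    # incremental OR (scanning AND NOT corruption)
    return incremental or (scanning and not corruption)

differences = []
for combo in product([False, True], repeat=3):
    # The unparenthesized form always agrees with the "interpreted" form.
    assert unparenthesized(*combo) == as_interpreted(*combo)
    if unparenthesized(*combo) != intended(*combo):
        differences.append(combo)

# Each tuple is (incremental, scanning, corruption).
print(differences)  # [(True, False, True), (True, True, True)]
```

The two readings differ only for documents that contain both incremental and corruption, which are exactly the documents the parentheses were meant to exclude.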

The Use of Double Quotes Around Keywords and Noise Words


Index Server maintains a list of keywords and noise words, which are handled differently than normal terms. Noise words are common language-specific words that appear throughout text. The list of keywords includes all boolean operator keywords (AND, OR, NOT) and the proximity keyword (NEAR). Table 6.1 displays the list of default noise words configured into Index Server.

Table 6.1. Index Server default English noise words

1 2 3 4 5
6 7 8 9 0
$ about after all also
an and another any are
as at be because been
before being between both but
by came can come could
did do each for from
get got has had he
have her here him himself
his how if in into
is it like make many
me might more most much
must my never now of
on only or other our
out over said same see
should since some still such
take than that the their
them then there these they
this those through to too
under up very was way
we well were what where
which while who with would
you your


When searching for terms that include operator keywords or noise words, be sure to enclose the entire term string within double quotes (". . ."). For example, to search for the phrase text AND properties, your query would read


"text AND properties"

Without the double quotes, the AND would be interpreted as an AND operator, and documents containing both the term text and the term properties would be included in the result set.

To see how double quotes can affect the outcome of a query, use the Sample Query Form to enter the following query string (note the lack of double quotes) and launch your query:


text AND properties

Your result set should be similar to that of Figure 6.11.

Figure 6.11. Incorrect query results due to lack of double quotes.

In this example, twenty-four documents matched the query and were returned as the result set. This is because the query was actually searching for documents containing both words rather than the text string text AND properties.

Now try executing the query with double quotes around the entire query string, like so:


"text AND properties"

Figure 6.12 displays the correct results, with just two documents in the result set.

Figure 6.12. Correct query results with double quotes.

As you can see, proper placement of double quotes around strings containing either keywords or noise words is very important when designing your query string.
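
The difference between the quoted and unquoted forms can be sketched in Python as the difference between a phrase test and an all-words test. The helper names and sample sentences below are hypothetical, and the sketch matches the word and literally inside the phrase; how Index Server itself treats a noise word inside a quoted phrase is not modeled here.

```python
import re

def words_of(text):
    """Split text into lower-case words."""
    return re.findall(r"\w+", text.lower())

def contains_all(text, terms):
    """Unquoted form: every term must appear somewhere in the text."""
    present = set(words_of(text))
    return all(t.lower() in present for t in terms)

def contains_phrase(text, phrase):
    """Quoted form: the words must appear consecutively and in order."""
    words, target = words_of(text), words_of(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

doc1 = "Filters extract text and properties from files."
doc2 = "Properties are stored separately from the text."

print(contains_all(doc1, ["text", "properties"]))      # True
print(contains_all(doc2, ["text", "properties"]))      # True
print(contains_phrase(doc1, "text AND properties"))    # True
print(contains_phrase(doc2, "text AND properties"))    # False
```

The second sample matches the unquoted query but not the quoted one, which mirrors the drop in matching documents between Figure 6.11 and Figure 6.12.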

Using Relational Operators


Relational operators evaluate the condition of a specific document property in relation to a specified query term (generally specified by the end user). Because relational operators evaluate a document property against a specified value, they are not used in content queries (queries against the text content of the document). This does not mean that content queries and property queries cannot be combined in a single query string. In many cases, content and property queries are combined to target the intended result set of documents. Later in this chapter, you'll learn more about property queries. But for now, let's focus on using relational operators (<, <=, =, !=, >=, >) and understanding how they are used to test the conditions of a document property.

Consider a query that must identify documents with the query terms text and properties and must have a rank value greater than 50. To identify the documents with the two text strings, construct the query like so:


text AND properties

Using the sample query document, execute the simple content query above. Figure 6.13 displays the result set containing the 24 matching documents.

Figure 6.13. Simple content query results.

To add the restriction of a rank greater than 50, modify the query string to add the greater-than (>) symbol and test condition values as follows:


(text AND properties) AND @rank > 50


You'll learn more about document properties, including the rank property, later in this chapter.

With the query string modified to include the relational greater-than operator, launch the query. Figure 6.14 displays the result set that now includes only documents with a ranking of greater than 50.

Figure 6.14. Content query results using the > relational operator.

One administrative use of relational operators in query strings might be to target documents that have been created during a specified period of time. Using the document property create, you can construct a query as follows:


(@create >= 96/09/15) AND (@create <= 96/09/22)

This query uses the >= and <= relational operators to target documents created during the period from 96/09/15 through 96/09/22. Figure 6.15 displays a sample result set for this type of query.



Your query results may differ depending on when you installed your Index Server software and which other additional virtual directories you have configured in your IIS World Wide Web service implementation.

Figure 6.15. Specifying time-frame queries using relational operators.
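
The time-frame restriction is an inclusive range test on a single property. The following Python sketch shows the same comparison under stated assumptions: the file names and creation dates are invented for illustration, and the two-digit year is parsed with Python's %y format code, which maps 96 to 1996.

```python
from datetime import datetime

def parse_date(text):
    """Parse a yy/mm/dd date such as 96/09/15."""
    return datetime.strptime(text, "%y/%m/%d").date()

# Hypothetical documents with made-up creation dates.
created = {
    "early.htm": parse_date("96/09/10"),
    "inside.htm": parse_date("96/09/18"),
    "edge.htm": parse_date("96/09/22"),
    "late.htm": parse_date("96/10/01"),
}

start, end = parse_date("96/09/15"), parse_date("96/09/22")

# (@create >= 96/09/15) AND (@create <= 96/09/22)
matches = sorted(name for name, day in created.items() if start <= day <= end)
print(matches)  # ['edge.htm', 'inside.htm']
```

Because both operators include the equals sign, a document created on either boundary date is part of the result.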

Using Wildcard Operators and Word Stemming


The wildcard operator (*) is used to search for words that begin like a specified word. For example, the wildcard string congress* matches words like congress, congressman, congresswoman, congressional, and so on. The word stem operator (**) takes the wildcard operator one step further by allowing users to target words that have the same stem as another word. For example, the word stem string pry** matches words like prying, pried, and so on.

Say you want to target documents containing words that begin with the string quer. Adding the wildcard operator, your query string would look like the following:


quer*

Figure 6.16 displays the results of this query.

Figure 6.16. Query results using the wildcard operator.

Select the Highlight Hits hypertext link for one of the result-set documents to see the matching words for the query. Figure 6.17 displays a sample of the highlighted hits from the preceding query.

Figure 6.17. Highlighted hits of a wildcard query operator.
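
A trailing wildcard behaves like a prefix test, which the short Python sketch below illustrates. The matches_wildcard helper is a hypothetical name, the comparison is assumed to ignore case, and the word-stem operator (**) is deliberately not modeled because it depends on language-specific stemming rules.

```python
def matches_wildcard(pattern, word):
    """Toy match for a single trailing-* pattern such as 'congress*'."""
    if not pattern.endswith("*"):
        # No wildcard: require the whole word to match.
        return pattern.lower() == word.lower()
    prefix = pattern[:-1].lower()
    return word.lower().startswith(prefix)

words = ["congress", "Congressman", "congressional", "progress", "query", "queries"]

congress_hits = [w for w in words if matches_wildcard("congress*", w)]
quer_hits = [w for w in words if matches_wildcard("quer*", w)]

print(congress_hits)  # ['congress', 'Congressman', 'congressional']
print(quer_hits)      # ['query', 'queries']
```

Note that progress is not matched by congress*, because the wildcard applies only to the end of the word.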

Say you want to target documents that have the same word stem as the word seeing. Using the sample query document, enter the word stem query for the text string as follows:


seeing**

A result set of at least six documents is returned that includes matching words such as seen, sees, and of course, seeing. Figure 6.18 displays the result set for the preceding query.

Figure 6.18. Query results using the word-stem operator.

Types of Queries


Index Server queries come in several different flavors, each having its own purpose. While you could just use a simple query-term search when targeting documents for your result set, other types of queries exist. In this section you'll learn how to build more flexible query strings that will allow you to use wildcards, document properties, and English-like sentences.

Building Free-Text Queries


Index Server supports free-text queries. In free-text queries, nouns and noun phrases of a free-form text string are identified and documents that include those terms are targeted. To construct a free-text query string, build a free-form query string in sentence-like form and prefix the string with the identifier $contents. For example, say you want to find information about .htx and .idq files and how they work with Index Server. Using the sample query document, generate the query string


$contents How do idq and htx files work with Index Server

and launch the query. The Index Server engine parses out the key words (idq, htx, files, Index, and Server) and looks for documents that contain as many of the matching terms (phrases) as possible. Figure 6.19 displays the result set of documents from this query string. The abstract for the result set's first entry shows that many relevant terms are found in the document.

Figure 6.19. Query results using a free-text query.



When specifying a free-text query, boolean, proximity, and wildcard operators are ignored.
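
As a rough illustration of the first step in handling a free-text string, the Python sketch below drops noise words from the sample sentence. It is a simplification under stated assumptions: NOISE_WORDS holds only a small subset of Table 6.1, strip_noise is a hypothetical helper, and the identification of nouns and noun phrases that Index Server performs is not modeled, which is why the word work survives here.

```python
# A small subset of the noise words listed in Table 6.1.
NOISE_WORDS = {"how", "do", "and", "with", "the", "of", "to", "in"}

def strip_noise(query):
    """Lower-case the query and drop any word found in NOISE_WORDS."""
    return [w for w in query.lower().split() if w not in NOISE_WORDS]

remaining = strip_noise("How do idq and htx files work with Index Server")
print(remaining)  # ['idq', 'htx', 'files', 'work', 'index', 'server']
```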


Building Vector-Space Queries


Vector-space queries allow the end user to assign rank weights to any number of query terms. As in all other document queries, the rank returned for each document is an indicator of how well the document matched the query terms. When specifying multiple vector terms, separate each term with a comma.



By adding rank weights (enclosed in brackets) to your vector-space query and ordering your result set in rank order, you can greatly increase the chances that the documents you want appear at the top of your result-set listing.

Say you want to target documents that deal with Index Server scopes and catalogs. The following query uses rank weights and produces a result set of many documents that contain the specified vector terms:


scope*[100], catalog*[300], "master merge"[75]

Figure 6.20 displays a result set that shows a higher rank value assigned to documents that are specific to catalogs.

Figure 6.20. Query results using a Vector-Space query.

By simply changing the rank weight assigned to the vector terms, you can easily adjust the order of the result set.
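
The effect of rank weights on result ordering can be sketched with a simple weighted sum. This is a toy scoring scheme chosen for illustration, not the ranking formula Index Server uses: the toy_score helper and the sample documents are hypothetical, and the "master merge" phrase is left out because the sketch handles single words only.

```python
# Weights copied from the chapter's example query; "master merge" is omitted
# because this toy model only handles single words.
weights = {"scope": 100, "catalog": 300}

def toy_score(words, weights):
    """Sum the weight of every query prefix that some word starts with."""
    total = 0
    for prefix, weight in weights.items():
        if any(w.lower().startswith(prefix) for w in words):
            total += weight
    return total

docs = {
    "scopes.htm": ["query", "scopes", "explained"],
    "catalogs.htm": ["creating", "catalogs"],
    "both.htm": ["catalog", "scope", "settings"],
}

ranked = sorted(docs, key=lambda name: toy_score(docs[name], weights), reverse=True)
print(ranked)  # ['both.htm', 'catalogs.htm', 'scopes.htm']
```

Swapping the two weights in this sketch would move scopes.htm ahead of catalogs.htm, which is the kind of reordering the preceding paragraph describes.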

Building Property-Value Queries


Property-value queries are used to find documents that match a specified set of parameters (property values). Table 1.9 in Chapter 1 displays a full table listing of all available property entries that can be queried.

The two types of property queries are relational property queries and regular-expression property queries. Relational property queries are prefixed with an at character (@), followed by a property name, a relational operator, and a property value. Regular-expression queries are prefixed with a number sign (#), followed by a property name and a regular expression for the document property.



The order of the terms in the query string has no bearing on the processing of the query. In most cases, property terms are evaluated after other terms.

Say you want to target hypertext markup (.htm extension) documents that include the terms htx and idq in close proximity to each other, that were created on or before 96/09/30, and that are at least 50,000 bytes in length. This may sound like a fairly complex query, but it's really not. The following query string shows the use of proximity searches, the implementation of relational and regular-expression property queries, and the use of boolean and relational operators:


(htx NEAR idq) AND @create <= 96/09/30 AND NOT @size < 50000 AND #filename *.htm

Figure 6.21 displays the result set for this query.

Figure 6.21. Complex query results using a property query.
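
To see how the individual restrictions in this query combine, the following Python sketch applies the same four tests to a handful of made-up documents. Everything here is an assumption made for illustration: the Doc class, its field names, and the sample data are hypothetical, the near_hit flag simply stands in for the (htx NEAR idq) content test, and the file-name pattern is approximated with Python's fnmatchcase.

```python
from dataclasses import dataclass
from datetime import date
from fnmatch import fnmatchcase

@dataclass
class Doc:
    filename: str
    size: int        # bytes
    created: date
    near_hit: bool   # stands in for the (htx NEAR idq) content test

def matches(doc):
    """Apply the four restrictions of the sample query to one document."""
    return (
        doc.near_hit
        and doc.created <= date(1996, 9, 30)             # @create <= 96/09/30
        and not doc.size < 50000                         # NOT @size < 50000
        and fnmatchcase(doc.filename.lower(), "*.htm")   # #filename *.htm
    )

docs = [
    Doc("guide.htm", 62000, date(1996, 9, 12), True),
    Doc("small.htm", 12000, date(1996, 9, 12), True),   # too small
    Doc("notes.txt", 80000, date(1996, 9, 12), True),   # wrong extension
    Doc("late.htm", 70000, date(1996, 10, 5), True),    # created too late
    Doc("exact.htm", 50000, date(1996, 9, 30), True),   # both boundaries
]

selected = [d.filename for d in docs if matches(d)]
print(selected)  # ['guide.htm', 'exact.htm']
```

The exact.htm entry shows that NOT @size < 50000 admits a document of exactly 50,000 bytes, and that the <= operator admits a document created on the boundary date.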

Summary


This chapter covers a great deal of material related to the Index Server query language and its syntax. You've learned how to use query language operators, such as boolean and proximity operators. The use of parentheses to guarantee proper processing of a query string was briefly discussed. You also learned how to construct free-text queries, vector-space queries, and property queries. With this knowledge, you should be prepared to begin building Internet data query and HTML extension files, which will help you build customized query-search forms for your Index Server document repository.
