Filter DLLs

A filter DLL “understands” one or more document formats and is capable of extracting text and properties out of those document types. A filter DLL implements the IFilter ActiveX interface. The CiDaemon process uses the IFilter interface to extract the text out of a document. To track down a problem with a filter DLL, an administrator needs to know where to look to determine the filter DLL for a particular document. Editing the registry is also a good way to avoid filtering documents with no useful content.

Pre-installed Filters

The list of document types for which filters are pre-installed is given below:

HTML Filter

The HTML filter will not index any of the contents or properties of an HTML file if the HTML file contains the following meta tag:

<meta name="robots" content="noindex">

A Webmaster can add this meta tag to selectively avoid indexing certain HTML files.
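The check described above can be sketched in Python. This is only an illustration of the rule, not the HTML filter's actual implementation; the class and function names are hypothetical, and treating the content field as a case-insensitive, comma-separated list is an assumption.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag asks for no indexing."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        fields = {name: (value or "") for name, value in attrs}
        if fields.get("name", "").lower() != "robots":
            return
        # Assumption: the content field may hold several comma-separated directives.
        directives = [d.strip().lower() for d in fields.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def should_skip_indexing(html_text):
    """Returns True if the page carries the robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex
```

For example, should_skip_indexing('<meta name="robots" content="noindex">') returns True, while a page without that tag returns False.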

If an HTML file contains the following meta tag, the content field specifies the language code:

<meta name="ms.locale" content="EN">

The file is filtered by the language resources for that particular language (if available).

The content field in the tag can also specify the locale by a decimal number, such as 1033, which is the locale ID for U.S. English.
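As a rough illustration of the two forms the content field can take, the following Python sketch reads the ms.locale meta tag and reports whether it holds a language code or a decimal locale ID. The names are hypothetical, and using only the first ms.locale tag on a page is an assumption.

```python
from html.parser import HTMLParser


class LocaleMetaParser(HTMLParser):
    """Captures the content field of the first <meta name="ms.locale"> tag."""

    def __init__(self):
        super().__init__()
        self.locale = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.locale is not None:
            return
        fields = {name: (value or "") for name, value in attrs}
        if fields.get("name", "").lower() == "ms.locale":
            content = fields.get("content", "").strip()
            if content:
                self.locale = content


def read_ms_locale(html_text):
    """Returns ("lcid", number), ("code", text), or None if no tag is present."""
    parser = LocaleMetaParser()
    parser.feed(html_text)
    if parser.locale is None:
        return None
    if parser.locale.isascii() and parser.locale.isdigit():
        return ("lcid", int(parser.locale))   # for example 1033 for U.S. English
    return ("code", parser.locale.upper())    # for example "EN"
```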

Some meta tag properties are mapped onto the Microsoft® Office property sets so that users can mark HTML pages with the same properties found in the Office property sets. The mapped properties are:

Property example                                  Mapped to
<meta name="author" content="ruth">               The author property in the summary information property set.
<meta name="subject" content="word processing">   The subject property in the summary information property set.
<meta name="keywords" content="fonts, serif">     The keyword property in the summary information property set.
<meta name="ms.category" content="fiction">       The category property in the document summary information property set.

The HTML filter extracts text from the content field of a meta element. For example, if an HTML file has this line:

<META NAME="DESCRIPTION" CONTENT="Sample query form for Microsoft Index Server">

Then a user can query the information in the content field, namely “Sample query form for Microsoft Index Server”, by using the HTML meta property. The GUID for the meta property is D1B5D3F0-C0B3-11CF-9A92-00A0C908DBF1 and the property name is specified by the name field, or the HTTP-EQUIV field. In the above example, the property name is DESCRIPTION. Thus a friendly name, for example MetaDescription, for the meta property can be defined as

MetaDescription(DBTYPE_WSTR|DBTYPE_BYREF) = D1B5D3F0-C0B3-11CF-9A92-00A0C908DBF1 description
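To make the naming rule concrete, here is a small Python sketch that collects meta properties the way the text describes: the property name comes from the name field or, failing that, the HTTP-EQUIV field, and the text comes from the content field. The function and class names are hypothetical, and lowercasing the property name is an assumption made so that DESCRIPTION and description compare equal.

```python
from html.parser import HTMLParser

# GUID of the HTML meta property, as given in the text above.
META_PROPERTY_GUID = "D1B5D3F0-C0B3-11CF-9A92-00A0C908DBF1"


class MetaPropertyParser(HTMLParser):
    """Collects (guid, property name, text) triples from meta elements."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = {name: (value or "") for name, value in attrs}
        # The property name is taken from NAME, or from HTTP-EQUIV if NAME is absent.
        prop_name = fields.get("name") or fields.get("http-equiv")
        if prop_name and "content" in fields:
            self.properties.append(
                (META_PROPERTY_GUID, prop_name.lower(), fields["content"])
            )


def extract_meta_properties(html_text):
    parser = MetaPropertyParser()
    parser.feed(html_text)
    return parser.properties
```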

The GUID for the meta property is a registry parameter located at:

HKEY_LOCAL_MACHINE
 \System
  \CurrentControlSet
   \Control\HtmlFilter
    \MetaTagClsid

The HTML filter emits scripting code embedded in an HTML page as a script property with the GUID 31F400A0-FD07-11CF-B9BD-00AA003DB18E. The property name of the script is specified by the language field of the script tag, for example:

<script language="vbscript">

In this example, the property name is vbscript. If no language field is specified, then the language field of an earlier script tag in the HTML page is used. If no earlier script tag is specified, then the property name defaults to javascript. The GUID for the script property is a registry parameter located at

HKEY_LOCAL_MACHINE
 \System
  \CurrentControlSet
   \Control\HtmlFilter
    \ScriptTagClsid
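The naming rule for script properties (use the language field, otherwise the language of an earlier script tag, otherwise javascript) can be sketched as follows. This is an illustrative Python model with hypothetical names, not the filter's own code; lowercasing the language value is an assumption.

```python
from html.parser import HTMLParser

# GUID of the script property, as given in the text above.
SCRIPT_PROPERTY_GUID = "31F400A0-FD07-11CF-B9BD-00AA003DB18E"


class ScriptNameParser(HTMLParser):
    """Works out the property name used for each script tag on a page."""

    def __init__(self):
        super().__init__()
        self.current_language = "javascript"  # default when nothing is specified
        self.property_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        language = dict(attrs).get("language")
        if language:
            # An explicit language field also applies to later tags without one.
            self.current_language = language.lower()
        self.property_names.append(self.current_language)


def script_property_names(html_text):
    parser = ScriptNameParser()
    parser.feed(html_text)
    return parser.property_names
```

A page with a vbscript tag followed by a script tag that has no language field yields the property name vbscript for both.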

Document types and their associated filter DLL entries are specified in the registry under the \HKEY_LOCAL_MACHINE\Software\Classes tree. To find the filter DLL associated with a particular document type, navigate through the entries in that tree.

Binary Files — NULL Filter

When a registered binary file is encountered, the NULL filter is used. The NULL filter retrieves only the system properties; the contents of a binary file are not filtered. Examples of system properties are the file name, last write time, file size, and attributes.

For more information about binary files, see Registering File Types as Binary Files.

Default Filter

In Index Server, a default filter filters both the system properties (such as file name) and the contents of a file. The default filter does not “understand” any document formats; when filtering the contents of a file, it treats the file as a sequence of characters. Index Server uses the default filter when the file-name extension of a file has no association in the registry and the value of the registry setting FilterFilesWithUnknownExtensions is 1.

Note   The default filter filters plain text and files of unknown origin. It assumes all text to be in the default code page of the server.

Corrupted Files

If a file is corrupted, the filter may not be able to properly interpret the contents of that file. When a file cannot be filtered, an event is written to the event log; to learn how to get a list of such files, see Unfiltered Files. Sometimes a file cannot be filtered because of a defective third-party filter. After verifying the contents of a file, an administrator should report the problems to the filter vendor. Files protected by passwords are not filtered.

Maximum Retries

If a document cannot be filtered, it will be retried a certain maximum number of times. If the document still cannot be filtered, then it will be considered to be an unfiltered file. The registry key FilterRetries controls the maximum number of retries for a document.
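The retry behavior can be pictured with a short Python sketch. The function name and the filter_document callable are hypothetical, and the sketch assumes that FilterRetries counts retries after the first attempt, which the text does not state.

```python
def filter_with_retries(path, filter_document, filter_retries):
    """Tries to filter a document; returns False if it ends up unfiltered.

    filter_document is a hypothetical callable that raises an exception
    when the document cannot be filtered.
    """
    # Assumption: one initial attempt plus up to filter_retries retries.
    attempts = 1 + filter_retries
    for _ in range(attempts):
        try:
            filter_document(path)
            return True
        except Exception:
            continue  # try again until the attempts run out
    return False  # the document is now considered an unfiltered file
```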

To get a list of all the files that could not be filtered:
  1. Click Start, point to Programs, point to Windows NT 4.0 Option Pack, point to Microsoft Index Server, and click Index Server Manager (HTML).
  2. In the View unfiltered documents field, click Start.

Unknown Extensions

A file with an extension that does not have an association in the registry is treated as an Unknown Extension. The behavior of Index Server depends upon the registry setting FilterFilesWithUnknownExtensions. If this value is set to 0, then the NULL Filter is used to filter those files. Otherwise, the default filter is used to filter the contents.
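Taken together with the NULL filter and default filter descriptions, the selection rule can be summarized in a Python sketch. The function, its parameters, and the string labels "null" and "default" are hypothetical; the two lookup collections stand in for what Index Server reads from the registry.

```python
def choose_filter(extension, registered_filters, binary_extensions,
                  filter_files_with_unknown_extensions):
    """Picks a filter for a file based on its file-name extension.

    registered_filters maps extensions to filter DLL names, and
    binary_extensions is the set of extensions registered as binary files.
    """
    ext = extension.lower()
    if ext in registered_filters:
        return registered_filters[ext]      # a format-specific filter DLL
    if ext in binary_extensions:
        return "null"                       # system properties only
    # Unknown extension: the registry setting decides.
    if filter_files_with_unknown_extensions == 0:
        return "null"
    return "default"                        # contents treated as plain characters
```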

Filtering Directories

By default, directories are not filtered and will not appear in query results. To filter directories, set the registry key FilterDirectories to 1. When directories are filtered, their system properties are filtered.

Characterization

The CiDaemon process can automatically generate a summary, or characterization (also called an abstract), for each document. If the registry key GenerateCharacterization is set to 1, the characterization is generated automatically. The maximum number of characters in the generated characterization is controlled by the registry key MaxCharacterization.

If the characterization is set to be generated automatically, Index Server takes by default the first 320 characters of a document and copies that block of text for the summary. You can override this automatic selection by inserting a meta tag in each document with your own customized summary. Put all meta tags within the header of an HTML file, as shown in the following example.


<head>
<META NAME="DESCRIPTION" CONTENT="This text will appear on the results page 
as the document's summary.">
</head>
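A minimal Python sketch of the selection logic follows. The function and parameter names are hypothetical, and the sketch assumes that the MaxCharacterization limit applies only to the automatically selected text, not to a custom description.

```python
def build_characterization(body_text, meta_description=None,
                           generate_characterization=1,
                           max_characterization=320):
    """Returns the summary for a document, or None if generation is off."""
    if generate_characterization != 1:
        return None
    if meta_description:
        # A DESCRIPTION meta tag overrides the automatic selection.
        return meta_description
    # Otherwise copy the first block of text from the document.
    return body_text[:max_characterization]
```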

Adding Filter DLLs

To add new filter DLLs, please refer to the documentation provided with the filter DLLs. You can register and unregister DLLs with the registry utility (Regsvr32.exe).

Removing Filter DLLs

To remove a filter DLL, the IFilter PersistentHandler entry associated with a document type and the filter DLL entry must be deleted. See Finding the Filter DLL for a Document. Once you have found the correct IFilter PersistentHandler entry, you can unregister it with the following syntax:


Regsvr32.exe /u <filter DLL>

For an example, see Removing a Filter.

Finding the Filter DLL for a Document

The following example shows how to find the filter DLL for a document. This example is for HTML files.

Step 1: Determine the CLSID

Find the CLSID associated with the document type under the registry key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes. Let this be <Value1>.

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
    htmlfile
        = Class for WWW HTML files
        CLSID
            = {25336920-03F9-11CF-8FD0-00AA00686F13}

Step 2: Determine the Persistent Handler

Using <Value1> found in Step 1, find the PersistentHandler value for the \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value1> key. Let this be <Value2>.

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID
        {25336920-03F9-11CF-8FD0-00AA00686F13}
            = WWW HTML files
            PersistentHandler
                = {EEC97550-47A9-11CF-B952-00AA0051FE20}

Step 3: Determine the IFilter Persistent Handler GUID

Using <Value2> determined in Step 2, find the IFilter Persistent Handler GUID for the document type. The value under the key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value2>\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF} yields the IFilter Persistent Handler GUID for this document type. Let this be <Value3>. Here, {89BCB740-6119-101A-BCB7-00DD010655AF} is the IFilter interface GUID.

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID
        {EEC97550-47A9-11CF-B952-00AA0051FE20}
            = REG_SZ HTML File Persistent Handler
            PersistentAddinsRegistered
                {89BCB740-6119-101A-BCB7-00DD010655AF}
                    = REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}

Step 4: Determine the Filter DLL

Using <Value3> determined in Step 3, the filter DLL can be found under the entry \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value3>\InprocServer32.

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID
        {E0CA5340-4534-11CF-B952-00AA0051FE20}
            = REG_SZ HTML Filter
            InprocServer32
                = REG_SZ nlhtml.dll

In this example, the filter DLL for HTML documents is nlhtml.dll.
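The four steps can also be expressed as a short Python sketch. To keep the example runnable on any machine, it reads from an in-memory dictionary that holds the values shown in the steps above rather than from the real registry. The helper names (key, find_filter_dll, read_default) are hypothetical.

```python
CLASSES = r"HKEY_LOCAL_MACHINE\SOFTWARE\Classes"
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"  # IFilter interface GUID


def key(*parts):
    """Joins registry key parts with backslashes."""
    return "\\".join(parts)


def find_filter_dll(doc_class, read_default):
    """Follows Steps 1 to 4; read_default returns a key's default value."""
    # Step 1: CLSID of the document type.
    value1 = read_default(key(CLASSES, doc_class, "CLSID"))
    # Step 2: persistent handler registered for that CLSID.
    value2 = read_default(key(CLASSES, "CLSID", value1, "PersistentHandler"))
    # Step 3: IFilter persistent handler GUID.
    value3 = read_default(
        key(CLASSES, "CLSID", value2, "PersistentAddinsRegistered", IFILTER_IID)
    )
    # Step 4: the filter DLL itself.
    return read_default(key(CLASSES, "CLSID", value3, "InprocServer32"))


# Stand-in for the registry, filled with the values from the HTML example.
SAMPLE_REGISTRY = {
    key(CLASSES, "htmlfile", "CLSID"):
        "{25336920-03F9-11CF-8FD0-00AA00686F13}",
    key(CLASSES, "CLSID", "{25336920-03F9-11CF-8FD0-00AA00686F13}",
        "PersistentHandler"):
        "{EEC97550-47A9-11CF-B952-00AA0051FE20}",
    key(CLASSES, "CLSID", "{EEC97550-47A9-11CF-B952-00AA0051FE20}",
        "PersistentAddinsRegistered", IFILTER_IID):
        "{E0CA5340-4534-11CF-B952-00AA0051FE20}",
    key(CLASSES, "CLSID", "{E0CA5340-4534-11CF-B952-00AA0051FE20}",
        "InprocServer32"):
        "nlhtml.dll",
}

filter_dll = find_filter_dll("htmlfile", SAMPLE_REGISTRY.__getitem__)
```

With the sample data, filter_dll is nlhtml.dll. On a Windows machine, read_default could instead be backed by the standard winreg module, which is not shown here.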


© 1997 by Microsoft Corporation. All rights reserved.