A filter DLL understands one or more document formats and is capable of extracting text and properties out of those document types. A filter DLL implements the IFilter ActiveX interface. The CiDaemon process uses the IFilter interface to extract the text out of a document. To track down a problem with a filter DLL, an administrator needs to know where to look to determine the filter DLL for a particular document. Editing the registry is also a good way to avoid filtering documents with no useful content.
The list of document types for which filters are pre-installed is given below:
The HTML filter will not index any of the contents or properties of an HTML file if the HTML file contains the following meta tag:
<meta name="robots" content="noindex">
A Webmaster can add this meta tag to selectively avoid indexing certain HTML files.
If an HTML file contains the following meta tag, the content field specifies the language code:
<meta name="ms.locale" content="EN">
The file is filtered by the language resources for that particular language (if available).
The content field in the tag can also specify the locale by a decimal number, such as 1033, which is the locale ID for U.S. English.
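The two meta tags described above can be detected with a few lines of Python. This is only an illustrative sketch, not the HTML filter's actual code: the class and function names are hypothetical, and splitting the robots content on commas is an assumption, since the text only shows the single value noindex.

```python
from html.parser import HTMLParser


class MetaDirectiveParser(HTMLParser):
    """Collects the robots and ms.locale meta tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.noindex = False   # True when <meta name="robots" content="noindex"> is seen
        self.locale = None     # language code ("EN") or decimal locale ID (1033)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        meta_name = attr.get("name", "").lower()
        content = attr.get("content", "").strip()
        if meta_name == "robots":
            # Assumption: several comma-separated directives may share one tag.
            directives = [part.strip() for part in content.lower().split(",")]
            if "noindex" in directives:
                self.noindex = True
        elif meta_name == "ms.locale" and content:
            # A decimal value such as 1033 is treated as a locale ID.
            self.locale = int(content) if content.isdigit() else content


def read_meta_directives(html_text):
    """Return (noindex, locale) for the given HTML text."""
    parser = MetaDirectiveParser()
    parser.feed(html_text)
    return parser.noindex, parser.locale
```

A caller could skip indexing when the first element of the result is True, and otherwise use the second element to pick the language resources.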
Some meta tag properties are mapped onto the Microsoft® Office property sets, so that users can mark HTML pages with the same properties that Office documents carry. The mapped properties are:
Property example | Mapped to |
---|---|
<meta name="author" content="ruth"> | The author property in the summary information property set. |
<meta name="subject" content="word processing"> | The subject property in the summary information property set. |
<meta name="keywords" content="fonts, serif"> | The keyword property in the summary information property set. |
<meta name="ms.category" content="fiction"> | The category property in the document summary information property set. |
The HTML filter extracts text from the content field of a meta element. For example, if an HTML file has this line:
<META NAME="DESCRIPTION" CONTENT="Sample query form for Microsoft Index Server">
Then a user can query the information in the content field, namely Sample query form for Microsoft Index Server, by using the HTML meta property. The GUID for the meta property is D1B5D3F0-C0B3-11CF-9A92-00A0C908DBF1 and the property name is specified by the name field, or the HTTP-EQUIV field. In the above example, the property name is DESCRIPTION. Thus a friendly name, for example MetaDescription, for the meta property can be defined as
MetaDescription(DBTYPE_WSTR|DBTYPE_BYREF) = D1B5D3F0-C0B3-11CF-9A92-00A0C908DBF1 description
The GUID for meta property is a registry parameter located at
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\HtmlFilter\MetaTagClsid
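As a rough illustration of how meta elements become queryable properties, the following Python sketch collects each meta element as a (GUID, property name) pair mapped to its content text. The names MetaPropertyParser and extract_meta_properties are hypothetical. Lowercasing the property name is an assumption based on the friendly-name definition above, which spells the name as description.

```python
from html.parser import HTMLParser

# GUID of the HTML meta property, as given in the text above.
META_PROPERTY_GUID = "D1B5D3F0-C0B3-11CF-9A92-00A0C908DBF1"


class MetaPropertyParser(HTMLParser):
    """Collects the content text of every meta element, keyed by property."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # The property name comes from the NAME field or the HTTP-EQUIV field.
        prop_name = attr.get("name") or attr.get("http-equiv")
        if prop_name and "content" in attr:
            key = (META_PROPERTY_GUID, prop_name.lower())
            self.properties[key] = attr["content"]


def extract_meta_properties(html_text):
    """Return {(guid, property name): content text} for all meta elements."""
    parser = MetaPropertyParser()
    parser.feed(html_text)
    return parser.properties
```

For the DESCRIPTION example above, the resulting dictionary holds the sample text under the key (META_PROPERTY_GUID, "description").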
The HTML filter emits scripting code embedded in an HTML page as a script property with the GUID 31F400A0-FD07-11CF-B9BD-00AA003DB18E. The property name of the script is specified by the language field of the script tag, for example:
<script language="vbscript">
In this example, the property name is vbscript. If no language field is specified, then the language field of an earlier script tag in the HTML page is used. If no earlier script tag is specified, then the property name defaults to javascript. The GUID for the script property is a registry parameter located at
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\HtmlFilter\ScriptTagClsid
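The naming rule for script properties (use the language field, otherwise the language of an earlier script tag, otherwise javascript) can be sketched as follows. This is a hypothetical model of the rule as described, not the filter's implementation. The names ScriptNameParser and script_property_names are invented for the example, and lowercasing the language value is an assumption.

```python
from html.parser import HTMLParser


class ScriptNameParser(HTMLParser):
    """Derives the script property name for each script tag in a page."""

    def __init__(self):
        super().__init__()
        # Default used until some script tag names a language.
        self.current_language = "javascript"
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        language = dict(attrs).get("language")
        if language:
            # An explicit language field also applies to later tags without one.
            self.current_language = language.lower()
        self.names.append(self.current_language)


def script_property_names(html_text):
    """Return the property name for each script tag, in document order."""
    parser = ScriptNameParser()
    parser.feed(html_text)
    return parser.names
```

Feeding it a page with an unlabeled script tag, then a vbscript tag, then another unlabeled tag yields javascript for the first and vbscript for the other two.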
Document types and the associated filter DLL entries are specified in the registry under the \HKEY_LOCAL_MACHINE\Software\Classes tree. To find out the filter DLL associated with a particular document type, navigate through the registry entries in the \HKEY_LOCAL_MACHINE\Software\Classes tree.
When a registered binary file is encountered, the NULL filter is used. The NULL filter retrieves only the system properties. The contents of a binary file are not filtered. Examples of system properties are the FileName, last Write time, file Size, Attributes, and so on.
For more information about binary files, see Registering File Types as Binary Files.
In Index Server, a default filter filters both the system properties (such as file name) and the contents of a file. The default filter does not understand any document formats; when filtering the contents of a file, it treats the file as a sequence of characters. Index Server uses the default filter when the file-name extension of a file has no association in the registry and the value of the registry setting FilterFilesWithUnknownExtensions is 1.
Note The default filter filters plain text and files of unknown origin. It assumes all text to be in the default code page of the server.
If a file is corrupted, the filter may not be able to properly interpret the contents of that file. To learn how to get a list of files that could not be filtered, see Unfiltered Files. An event is also written to the event log. Sometimes a file cannot be filtered because of a defective third-party filter. After verifying the contents of a file, an administrator should report the problems to the filter vendor. Files protected by passwords are not filtered.
If a document cannot be filtered, it will be retried a certain maximum number of times. If the document still cannot be filtered, then it will be considered to be an unfiltered file. The registry key FilterRetries controls the maximum number of retries for a document.
A file with an extension that does not have an association in the registry is treated as an Unknown Extension. The behavior of Index Server depends upon the registry setting FilterFilesWithUnknownExtensions. If this value is set to 0, then the NULL filter is used to filter those files. Otherwise, the default filter is used to filter the contents.
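The choice between a registered filter, the NULL filter, and the default filter can be summarized as a small decision function. The sketch below is a simplified model with hypothetical names. In particular, the dictionary of registered filters and the set of binary extensions stand in for registry lookups, and the .htm and .exe entries in the usage note are assumed for illustration.

```python
def choose_filter(extension, registered_filters, binary_extensions,
                  filter_unknown_extensions):
    """Return the filter that would handle a file with the given extension.

    registered_filters: dict of extension -> filter DLL name (stand-in for
        the associations under the Classes tree).
    binary_extensions: set of extensions registered as binary files.
    filter_unknown_extensions: value of FilterFilesWithUnknownExtensions.
    """
    ext = extension.lower()
    if ext in binary_extensions:
        # Registered binary files get system properties only.
        return "NULL filter"
    if ext in registered_filters:
        return registered_filters[ext]
    # Unknown extension: the registry setting decides.
    if filter_unknown_extensions == 0:
        return "NULL filter"
    return "default filter"
```

For example, with registered_filters set to {".htm": "nlhtml.dll"} and binary_extensions set to {".exe"}, a .xyz file goes to the default filter when the setting is 1 and to the NULL filter when it is 0.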
By default, directories are not filtered and will not appear in query results. To filter directories, set the registry key FilterDirectories to 1. When directories are filtered, their system properties are filtered.
The CiDaemon process can automatically generate a summary, or characterization (also called an abstract), for each document. If the registry key GenerateCharacterization is set to 1, the characterization is generated automatically. The maximum number of characters in the generated characterization is controlled by the registry key MaxCharacterization.
If the characterization is set to be generated automatically, Index Server takes by default the first 320 characters of a document and copies that block of text for the summary. You can override this automatic selection by inserting a meta tag in each document with your own customized summary. Put all meta tags within the header of an HTML file, as shown in the following example.
<head> <META NAME="DESCRIPTION" CONTENT="This text will appear on the results page as the document's summary."> </head>
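The selection of summary text described above might be modeled as follows. This is a minimal sketch under stated assumptions. The function name and parameters are hypothetical, the parameters mirror the GenerateCharacterization and MaxCharacterization registry keys, and the meta description is returned unchanged because the text does not say whether the character limit also applies to it.

```python
def build_characterization(body_text, meta_description=None,
                           generate=1, max_characters=320):
    """Return the summary text for a document, or None if generation is off.

    generate: models the GenerateCharacterization registry key (1 = on).
    max_characters: models MaxCharacterization; 320 is the default block
        size mentioned in the text.
    """
    if generate != 1:
        return None
    if meta_description:
        # A DESCRIPTION meta tag overrides the automatic selection.
        # Assumption: the custom summary is used as written.
        return meta_description
    # Automatic selection: the first block of characters in the document.
    return body_text[:max_characters]
```

A document with no DESCRIPTION meta tag gets its first 320 characters as the summary, while a document that supplies one gets that text instead.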
To add new filter DLLs, please refer to the documentation provided with the filter DLLs. You can register and unregister DLLs with the registry utility (Regsvr32.exe).
To remove a filter DLL, the IFilter PersistentHandler entry associated with a document type and the filter DLL entry must be deleted. See Finding the Filter DLL for a Document. Once you have found the correct IFilter PersistentHandler entry, you can unregister it with the following syntax:
Regsvr32.exe /u
For an example, see Removing a Filter.
The following example shows how to find out the filter DLL for a document. This example is for HTML files.
1. Find the CLSID associated with the document type under the registry key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes. Let this be <Value1>.
\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
    htmlfile = Class for WWW HTML files
        CLSID = {25336920-03F9-11CF-8FD0-00AA00686F13}
2. Using <Value1> found in Step 1, find the PersistentHandler value for the \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value1> key. Let this be <Value2>.
\HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID
    {25336920-03F9-11CF-8FD0-00AA00686F13} = WWW HTML files
        PersistentHandler = {EEC97550-47A9-11CF-B952-00AA0051FE20}
3. Using <Value2> determined in Step 2, find the IFilter Persistent Handler GUID for the document type. The value under the key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value2>\PersistentAddinsRegistered\89BCB740-6119-101A-BCB7-00DD010655AF yields the IFilter Persistent Handler GUID for this document type. Let this be <Value3>. 89BCB740-6119-101A-BCB7-00DD010655AF is the IFilter interface GUID.
\Registry\Machine\Software\Classes\CLSID
    {EEC97550-47A9-11CF-B952-00AA0051FE20} = REG_SZ HTML File Persistent Handler
        PersistentAddinsRegistered
            {89BCB740-6119-101A-BCB7-00DD010655AF} = REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}
4. Using <Value3> determined in Step 3, find the filter DLL under the entry \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value3>\InprocServer32.
\Registry\Machine\Software\Classes\CLSID
    {E0CA5340-4534-11CF-B952-00AA0051FE20} = REG_SZ HTML Filter
        InprocServer32 = REG_SZ nlhtml.dll
In this example, the filter DLL for HTML documents is nlhtml.dll.
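The four lookups above can be chained in a short script. The sketch below is illustrative only. It walks a dictionary that stands in for the registry and is filled with the values from this example, and the helper names (key, find_filter_dll, SAMPLE_REGISTRY) are hypothetical. The lookup function is passed in as a parameter so that the same logic could be pointed at a real registry reader.

```python
# IFilter interface GUID, as given in Step 3.
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"
CLASSES = "SOFTWARE\\Classes"


def key(*parts):
    """Join registry path components with backslashes."""
    return "\\".join(parts)


# Stand-in for the registry: key path -> default value, from the example.
SAMPLE_REGISTRY = {
    key(CLASSES, "htmlfile", "CLSID"):
        "{25336920-03F9-11CF-8FD0-00AA00686F13}",
    key(CLASSES, "CLSID", "{25336920-03F9-11CF-8FD0-00AA00686F13}",
        "PersistentHandler"):
        "{EEC97550-47A9-11CF-B952-00AA0051FE20}",
    key(CLASSES, "CLSID", "{EEC97550-47A9-11CF-B952-00AA0051FE20}",
        "PersistentAddinsRegistered", IFILTER_IID):
        "{E0CA5340-4534-11CF-B952-00AA0051FE20}",
    key(CLASSES, "CLSID", "{E0CA5340-4534-11CF-B952-00AA0051FE20}",
        "InprocServer32"):
        "nlhtml.dll",
}


def find_filter_dll(document_class, read_default):
    """Follow Steps 1 to 4 and return the filter DLL for a document class.

    read_default: callable taking a key path under HKEY_LOCAL_MACHINE and
        returning that key's default value.
    """
    value1 = read_default(key(CLASSES, document_class, "CLSID"))        # Step 1
    value2 = read_default(key(CLASSES, "CLSID", value1,
                              "PersistentHandler"))                     # Step 2
    value3 = read_default(key(CLASSES, "CLSID", value2,
                              "PersistentAddinsRegistered", IFILTER_IID))  # Step 3
    return read_default(key(CLASSES, "CLSID", value3, "InprocServer32"))   # Step 4


print(find_filter_dll("htmlfile", SAMPLE_REGISTRY.__getitem__))
```

On a Windows machine, read_default could instead wrap Python's winreg.QueryValue(winreg.HKEY_LOCAL_MACHINE, path), which returns a key's default value. Whether every key in the chain exists on a given system is not guaranteed, so a real script would also need to handle missing keys.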
© 1997 by Microsoft Corporation. All rights reserved.
Questions:
Hello, I'm developing a template web page for an intranet site which I want to be indexed. When retrieving the characterization, is there any way in which I can tell the index service to only grab content from between 2 points?