Character Classes - Full-Text Retrieval (FTR) - Help

Character classes form the rule base that the FTR engine uses to divide a document into individual terms. They affect how terms are stored in the index files, how terms are highlighted, and how word lists generated from the index appear. They do not affect search results in any way. Regardless of which character classes are in effect, the search engine breaks a search term in the same way that it broke the text during indexing, so the term is found if it exists in the document.
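The reasoning above can be illustrated with a small Python sketch. It is not the FTR engine: make_breaker and contains are hypothetical helpers, and the rule "a hyphen may or may not join two words" is an assumption standing in for a real character class definition. The sketch only shows that when the same term-breaking rules are applied to both the document and the search term, the term is found under either rule set, even though the stored terms differ.

```python
import re

def make_breaker(joiners):
    """Return a function that breaks text into terms.

    `joiners` is a string of punctuation characters that are kept when they
    sit between two alphanumeric runs (a stand-in for a character class).
    """
    if joiners:
        inner = "[" + re.escape(joiners) + "]"
        pattern = re.compile(r"[A-Za-z0-9]+(?:" + inner + r"[A-Za-z0-9]+)*")
    else:
        pattern = re.compile(r"[A-Za-z0-9]+")
    return lambda text: [term.lower() for term in pattern.findall(text)]

def contains(doc_terms, query_terms):
    """True if query_terms occurs as a consecutive run inside doc_terms."""
    n = len(query_terms)
    return n > 0 and any(
        doc_terms[i:i + n] == query_terms
        for i in range(len(doc_terms) - n + 1)
    )

document = "A state-of-the-art engine"
query = "state-of-the-art"

for joiners in ("-", ""):
    breaker = make_breaker(joiners)
    stored = breaker(document)       # what the index would hold
    searched = breaker(query)        # the query is broken the same way
    print(repr(joiners), stored, contains(stored, searched))
```

With the hyphen treated as a joiner the index holds the single term state-of-the-art; without it, the index holds four separate terms. In both cases the query is broken the same way as the document, so the match succeeds.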

Character classes are defined in the stopword file. Any change to a stopword file for an existing collection requires that the entire collection be re-indexed. The addition of a character class definition does not affect the stopwords or how they are interpreted in the rest of the stopword file.

Modify character classes with extreme caution. Improperly formed character classes can affect search performance and collection file sizes. Test any modification to the classes in the delivered stopword files on a sample, nonproduction collection before implementing it on a system-wide collection.

There are six character classes. Each character can be added to only one class. Any characters that are not included in one of these classes fall into a default class. Any character in the default class causes a word break. The character classes are as follows:

Class Name    Description

AC            Accents that are permitted in alphabetic terms but ignored during indexing.

ADJ           Punctuation marks that are permitted to exist between two characters in the AL or DI class. Multiple consecutive ADJ characters are allowed as long as the group is surrounded by one or more AL or DI characters. ADJ characters at the beginning or the end of a term are not indexed as part of the term.

AJ            Punctuation marks that are permitted to exist between two characters in the AL class. Consecutive AJ characters are not allowed in a term and cause a word break. AJ characters at the beginning or the end of a term are not indexed as part of the term.

AL            Alphabetic characters that can form terms.

DI            Numeric characters (digits) that can form terms.

DJ            Punctuation marks that are permitted to exist between two characters in the DI class. Consecutive DJ characters are not allowed in a term and cause a word break. DJ characters at the beginning or the end of a term are not indexed as part of the term.
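
The class rules described above can be sketched as a small term breaker in Python. This is an illustrative model only, not the FTR implementation. The character assignments in build_classes (underscore as ADJ, apostrophe as AJ, period as DJ, the combining acute accent as AC) are assumptions chosen for the example and do not reflect the delivered stopword files. Two behaviors the description does not specify are also assumptions: AC characters are simply dropped wherever they appear, and a mix of different joiner classes in a row is treated as a word break.

```python
import string

AL, DI, AC, ADJ, AJ, DJ = "AL", "DI", "AC", "ADJ", "AJ", "DJ"

def build_classes():
    """Hypothetical class table; each character maps to exactly one class."""
    classes = {}
    for ch in string.ascii_letters:
        classes[ch] = AL
    for ch in string.digits:
        classes[ch] = DI
    classes["\u0301"] = AC   # combining acute accent (assumed)
    classes["_"] = ADJ       # assumed
    classes["'"] = AJ        # assumed
    classes["."] = DJ        # assumed
    return classes

def break_terms(text, classes):
    """Split text into terms using the six classes plus the default class."""
    terms = []
    term = []            # characters kept for the current term
    pending = ""         # joiner(s) waiting for a following AL/DI character
    pending_cls = None
    prev_core = None     # class (AL or DI) of the last term-forming character

    def flush():
        nonlocal term, pending, pending_cls, prev_core
        if term:
            terms.append("".join(term))   # trailing joiners are discarded
        term, pending, pending_cls, prev_core = [], "", None, None

    for ch in text:
        cls = classes.get(ch)             # None means the default class
        if cls in (AL, DI):
            if pending:
                allowed = (
                    pending_cls == ADJ
                    or (pending_cls == AJ and prev_core == AL and cls == AL)
                    or (pending_cls == DJ and prev_core == DI and cls == DI)
                )
                if allowed:
                    term.append(pending)
                    pending, pending_cls = "", None
                else:
                    flush()               # joiner not valid here: word break
            term.append(ch)
            prev_core = cls
        elif cls == AC:
            continue                      # accents are ignored during indexing
        elif cls in (ADJ, AJ, DJ):
            if not term:
                continue                  # leading joiners are not indexed
            if not pending:
                pending, pending_cls = ch, cls
            elif cls == ADJ and pending_cls == ADJ:
                pending += ch             # runs of ADJ characters are allowed
            else:
                flush()                   # consecutive AJ/DJ: word break
        else:
            flush()                       # default class: word break
    flush()
    return terms

classes = build_classes()
print(break_terms("don't stop", classes))        # ["don't", 'stop']
print(break_terms("version 3.14.", classes))     # ['version', '3.14']
print(break_terms("file__name2 1..2", classes))  # ['file__name2', '1', '2']
```

Under these assumed assignments, the period joins digits in 3.14 but not letters, so a.b breaks into two terms, and the doubled period in 1..2 causes a word break while the doubled underscore in file__name2 does not.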

See Also

Using Character Classes in Stopword Files
Unicode Values for the ASCII Character Set