filter.jsoup.undesirable_text-source (collection.cfg setting)

Description

This setting specifies the files that list 'undesirable text' for the content auditor to detect.

Several sources can be defined, each with its own key name. Because sources are keyed, a collection can override a default by reusing its key name.

filter.jsoup.undesirable_text-source.(key_name)=(file_path)
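The keyed format above can be illustrated with a minimal Python sketch. It is not Funnelback's own implementation: the function name `collect_sources` and the rule that later configuration text replaces earlier entries with the same key are assumptions made for illustration.

```python
# Hypothetical helper: gathers undesirable text sources from configuration text.
PREFIX = "filter.jsoup.undesirable_text-source."

def collect_sources(*config_texts):
    """Return {key_name: file_path}; later texts override earlier ones (assumed)."""
    sources = {}
    for text in config_texts:
        for raw_line in text.splitlines():
            name, sep, value = raw_line.strip().partition("=")
            name = name.strip()
            # Only lines of the form <PREFIX><key_name>=<file_path> are considered.
            if sep and name.startswith(PREFIX):
                sources[name[len(PREFIX):]] = value.strip()
    return sources

defaults = (
    "filter.jsoup.undesirable_text-source.default-misspellings="
    "$SEARCH_HOME/conf/common-misspellings.txt.default\n"
)
collection = (
    "filter.jsoup.undesirable_text-source.default-misspellings="
    "$SEARCH_HOME/conf/collection_name/custom_misspellings.txt\n"
    "filter.jsoup.undesirable_text-source.additional="
    "$SEARCH_HOME/conf/collection_name/more_undesirable_text.txt\n"
)

merged = collect_sources(defaults, collection)
for key_name, file_path in merged.items():
    print(key_name, "->", file_path)
```

In this sketch the collection's `default-misspellings` entry replaces the default path, while `additional` is added alongside it.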

The file at the given path is expected to list undesirable word sequences, one per line. In multi-word sequences, separate each word with a single space character. Use the text versions of HTML entities (e.g. &mdash; instead of —) where applicable.
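As a rough illustration of the file format, the following Python sketch tidies a candidate file before it is uploaded. The function name `parse_undesirable_text` is hypothetical, and the source does not say whether Funnelback tolerates blank lines or repeated spaces, so the sketch simply normalises both.

```python
# Hypothetical pre-check for an undesirable text file; not part of Funnelback.
def parse_undesirable_text(text):
    """Return one word sequence per non-blank line, words joined by single spaces."""
    sequences = []
    for line in text.splitlines():
        words = line.split()  # splits on any run of whitespace
        if words:  # skip blank lines
            sequences.append(" ".join(words))
    return sequences

sample = "etc.\ne.g.\n\naluminum\npurple   monkey\n"
sequences = parse_undesirable_text(sample)
print("\n".join(sequences))
```

Writing the returned sequences back out, one per line, gives a file in the layout described above.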

Undesirable text files can be created from the administration interface file manager by selecting undesirable-text.*.cfg from the create menu.

Default values

filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/common-misspellings.txt.default

This default setting provides a list of commonly misspelled words in English based on Wikipedia's list of common misspellings for machines.

Examples

The following overrides the default misspellings list with a custom file and adds a second source, 'more_undesirable_text.txt'.

filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/collection_name/custom_misspellings.txt
filter.jsoup.undesirable_text-source.additional=$SEARCH_HOME/conf/collection_name/more_undesirable_text.txt

more_undesirable_text.txt contains:

&mdash;
etc.
e.g.
aluminum
purple monkey

v15.16.0