filter.jsoup.undesirable_text-source.[key_name]

Specifies the sources of undesirable text strings to detect and present within the content auditor.

Key: filter.jsoup.undesirable_text-source.[key_name]
Type: String
Can be set in: collection.cfg

Description

This setting specifies the files that list the 'undesirable text' the content auditor should detect.

Several sources can be defined, each with its own key name. Because each source is keyed, a collection can override a default source by reusing its key name.

filter.jsoup.undesirable_text-source.(key_name)=(file_path)

The file at the given path is expected to contain a list of undesirable word sequences, one sequence per line. In multi-word sequences, separate each word with a single space character. Use the text version of a character rather than its HTML entity where applicable (e.g. — instead of &mdash;).
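The file format described above can be produced from a rough list of phrases with a small helper. The following is only a minimal sketch in Python; the function name normalise_sequences is hypothetical and is not part of Funnelback. It decodes HTML entities to their text versions, collapses runs of whitespace to a single space, drops blank and duplicate entries, and writes one sequence per line.

```python
import html


def normalise_sequences(raw_sequences):
    """Return file content with one undesirable sequence per line.

    Hypothetical helper for illustration only; not a Funnelback API.
    """
    lines = []
    for raw in raw_sequences:
        # Decode HTML entities to their text versions, e.g. "&amp;" -> "&".
        text = html.unescape(raw)
        # Collapse any run of whitespace so words are separated by one space.
        text = " ".join(text.split())
        # Skip blank entries and entries already seen.
        if text and text not in lines:
            lines.append(text)
    # Newlines separate the sequences.
    return "".join(line + "\n" for line in lines)
```

The returned string could then be saved as an undesirable text file, for example through the administration interface file manager.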

Undesirable text files can be created from the administration interface file manager by selecting undesirable-text.*.cfg from the create menu. To make use of this file, the file_path must be set to $SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.<name>.cfg.

The key_name can be any string as long as it is unique per collection.

Default values

filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/common-misspellings.txt.default

This default setting provides a list of commonly misspelled words in English based on Wikipedia's list of common misspellings for machines.

Examples

The following overrides the misspellings with a custom file, and also includes an additional set from 'undesirable-text.additional.cfg'.

filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.default-misspellings.cfg
filter.jsoup.undesirable_text-source.additional=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg
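To check which sources a collection defines, the settings can be read into a mapping of key name to file path. This is a minimal sketch in Python, not Funnelback's own configuration loader; the function name undesirable_text_sources is hypothetical, and letting a later line replace an earlier one with the same key name is a choice made for this sketch only.

```python
PREFIX = "filter.jsoup.undesirable_text-source."


def undesirable_text_sources(cfg_lines):
    """Map each key_name to its file_path from collection.cfg style lines.

    Hypothetical helper for illustration only; not a Funnelback API.
    """
    sources = {}
    for line in cfg_lines:
        line = line.strip()
        # Ignore unrelated settings and lines without a value.
        if not line.startswith(PREFIX) or "=" not in line:
            continue
        key, path = line.split("=", 1)
        key_name = key[len(PREFIX):].strip()
        # Assumption of this sketch: a repeated key_name keeps the last path.
        sources[key_name] = path.strip()
    return sources
```

Applied to the two lines in the example above, the mapping would contain the keys default-misspellings and additional, each pointing at its undesirable-text.*.cfg path.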

An undesirable text file such as undesirable-text.additional.cfg could contain:

—
etc.
e.g.
aluminum
purple monkey

v15.24.0