filter.jsoup.undesirable_text-source.[key_name]
Specifies the sources of undesirable text strings to detect and present within the content auditor.
Key: filter.jsoup.undesirable_text-source.[key_name]
Type: String
Can be set in: collection.cfg
Description
This setting controls where 'undesirable text' is listed for detection in the content auditor.
Several sources can be defined, each with its own key name (allowing collections to override the defaults), using the following format:
filter.jsoup.undesirable_text-source.(key_name)=(file_path)
The format of the file at the given path is expected to be a list of undesirable word sequences, with
newlines separating each sequence. Where multi-word sequences are used, each word should be separated
by a single space character. Text versions of HTML entities (e.g. \u2014 instead of &mdash;) should be used where applicable.
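The format rules above can be checked before a file is deployed. The following is a minimal sketch, not part of the product: the function name check_undesirable_text is hypothetical, and skipping blank lines is an assumption, since this page does not say how blank lines are treated.

```python
import re

def check_undesirable_text(lines):
    """Return (line_number, problem) tuples for lines that break the stated format."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            # Assumption: blank lines are ignored; the documentation does not say.
            continue
        if line != line.strip():
            problems.append((number, "leading or trailing whitespace"))
        # Words in a multi-word sequence should be separated by one space only.
        if re.search(r"\s{2,}|\t", line.strip()):
            problems.append((number, "words not separated by a single space"))
        # Entity syntax such as &mdash; or &#8212; should be written as text instead.
        if re.search(r"&#?\w+;", line):
            problems.append((number, "HTML entity found; use the text version"))
    return problems

# Example usage with in-memory lines instead of a real file.
sample = ["purple monkey\n", "purple  monkey\n", "&mdash;\n", "aluminum\n"]
for number, problem in check_undesirable_text(sample):
    print(number, problem)
```

The same function can be run over an actual file by passing it an open file object, since it only iterates over lines.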
Undesirable text files can be created from the administration interface file manager by selecting undesirable-text.*.cfg from the create menu. To make use of this file, the file_path must be set to $SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.<name>.cfg.
The key_name can be any string as long as it is unique per collection.
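Because each key_name must be unique per collection, it can be useful to scan collection.cfg for repeated names. This is a rough sketch under stated assumptions: the file is treated as one key=value pair per line with # marking comments, and the helper find_duplicate_key_names is hypothetical. How the product itself handles a repeated key is not described on this page.

```python
PREFIX = "filter.jsoup.undesirable_text-source."

def find_duplicate_key_names(cfg_lines):
    """Return key names that appear more than once under the source prefix."""
    seen = set()
    duplicates = []
    for raw in cfg_lines:
        line = raw.strip()
        # Assumption: blank lines, # comments and lines without "=" carry no setting.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, _value = line.partition("=")
        key = key.strip()
        if not key.startswith(PREFIX):
            continue
        name = key[len(PREFIX):]
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates

# Example usage: the key name "additional" is defined twice.
cfg = [
    PREFIX + "default-misspellings=/path/one.cfg",
    PREFIX + "additional=/path/two.cfg",
    PREFIX + "additional=/path/three.cfg",
]
print(find_duplicate_key_names(cfg))
```

The paths in the usage example are placeholders for illustration only.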
Default values
filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/common-misspellings.txt.default
This default setting provides a list of commonly misspelled words in English based on Wikipedia's list of common misspellings for machines.
Examples
The following overrides the misspellings with a custom file, and also includes an additional set from 'undesirable-text.additional.cfg'.
filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.default-misspellings.cfg
filter.jsoup.undesirable_text-source.additional=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg
undesirable-text.additional.cfg contains:
\u2014
etc.
e.g.
aluminum
purple monkey