noindex_expression

Optional regular expression to specify content that should not be indexed.

Key: noindex_expression
Type: String
Can be set in: collection.cfg

Description

This parameter defines an optional regular expression that specifies content that should not be indexed. If it is defined and non-empty, the web crawler uses the expression to insert noindex tags around matching content in the copy of the document source that it stores. This has the following effects:

The content that matches the expression will be ignored when deciding if two files are duplicates based on their extracted text during a web crawl. This can be used to exclude dynamic content on a page which may hinder duplicate detection.
The PADRE indexer will take note of the noindex directives (see "controlling indexable content in PADRE" for details).
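
The wrapping step described above can be pictured with a short Python sketch. It is an illustration only, not the crawler's implementation: the function name wrap_noindex is hypothetical, the marker strings are an assumption (check the PADRE documentation for the exact directives), Python's re module stands in for whatever regular expression engine the crawler uses, and the DOTALL flag is also an assumption.

```python
import re

# Assumed marker syntax; confirm the exact directives in the PADRE documentation.
NOINDEX_START = "<!--noindex-->"
NOINDEX_END = "<!--endnoindex-->"


def wrap_noindex(html, expression):
    """Hypothetical helper: wrap every match of `expression` in noindex markers."""
    if not expression:
        # An empty setting (the default) leaves the document untouched.
        return html
    # DOTALL lets `.` cross line breaks; whether the crawler does this is an assumption.
    pattern = re.compile(expression, re.DOTALL)
    return pattern.sub(lambda m: NOINDEX_START + m.group(0) + NOINDEX_END, html)


page = '<body><div class="BreadCrumb">Home &gt; Docs</div><p>Main text</p></body>'
wrapped = wrap_noindex(page, r'<div class=\"BreadCrumb(.*?)</div>')
print(wrapped)
```

In this sketch only the breadcrumb element ends up between the markers; the paragraph that follows it is left as indexable content.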

Note

Any links inside the matching content will still be extracted and followed during the crawl (assuming they pass the include/exclude rules).

The noindex_expression configuration option should only be used if inline filtering cannot be applied to a collection. Using the InjectNoIndexFilterProvider is recommended instead.

Default Value

noindex_expression=

Examples

Ignore some "breadcrumb" navigation elements in a page:

noindex_expression=<div class=\"BreadCrumb(.*?)</div>

Ignore an HTML footer:

noindex_expression=<table id=\"footer\"(.*?)</table>

Ignore a set of different <div> elements using a single regular expression:

noindex_expression=(<div id=\"nav.*?<\/div>|<div id=\"skip.*?<\/div>|<div id=\"header.*?<\/div>)

The expression above uses the | character as an OR (alternation) operator, for example to match (pattern 1 | pattern 2 | pattern 3).
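
To see which elements the alternation picks up, the expression can be tried against a small sample page. The sketch below uses Python's re module as a stand-in for the crawler's regular expression engine (an assumption); the sample markup is invented for illustration.

```python
import re

# The same expression as in the example above.
expression = r'(<div id=\"nav.*?<\/div>|<div id=\"skip.*?<\/div>|<div id=\"header.*?<\/div>)'

# Invented sample page: three elements to ignore and one to keep.
page = (
    '<div id="header">Site title</div>'
    '<div id="nav-main">Menu</div>'
    '<div id="content">Article body</div>'
    '<div id="skip-links">Skip to content</div>'
)

# Each alternative is tried at every position, so matches come back in document order.
matches = [m.group(0) for m in re.finditer(expression, page, re.DOTALL)]
for match in matches:
    print(match)
```

The header, nav and skip elements are matched, while the element with id="content" matches none of the three alternatives and is left alone.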

Caveats

You will probably want to use a non-greedy match (the .*? in the patterns above) to ensure that the regular expression doesn't match (and ignore) more than you need.
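
The difference between the two forms is easiest to see side by side. This is a minimal sketch using Python's re module as a stand-in for the crawler's engine (an assumption); the sample markup is invented.

```python
import re

# Invented sample: a breadcrumb, some real content, then another <div>.
page = '<div class="BreadCrumb">Home</div><p>Important text</p><div class="ad">Ad</div>'

# Non-greedy: .*? stops at the first closing </div> after the breadcrumb opens.
lazy = re.search(r'<div class=\"BreadCrumb(.*?)</div>', page).group(0)

# Greedy: .* runs on to the last closing </div> in the document.
greedy = re.search(r'<div class=\"BreadCrumb(.*)</div>', page).group(0)

print(lazy)
print(greedy)
```

Here the non-greedy form matches only the breadcrumb element, while the greedy form swallows the whole sample, including the paragraph that should have been indexed. Note that a non-greedy match stops at the first closing tag it finds, so an element containing nested <div> elements would be matched only up to the first inner </div>.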

Note: You should test that the expression you have written does not have a performance impact on the crawl. For example, if some of your content has badly formed HTML, the regular expression may match more than you need, or may cause the parser to time out so that the page is not downloaded at all. If the latter occurs you may see "Parser timed out" messages in the url_errors.log file in the collection's log directory.
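
One rough way to check an expression before a crawl is to run it over a few saved pages, including badly formed ones, and record how much it matches and how long it takes. The sketch below is an assumption-laden illustration: time_expression is a hypothetical helper, the sample documents are invented, and timings from Python's re module will not equal those of the crawler's own regular expression engine, so treat the numbers as relative indicators only.

```python
import re
import time


def time_expression(expression, documents):
    """Hypothetical helper: return {name: (matched_chars, seconds)} per document."""
    pattern = re.compile(expression, re.DOTALL)
    results = {}
    for name, html in documents.items():
        start = time.perf_counter()
        # Total number of characters the expression would mark as noindex.
        matched_chars = sum(len(m.group(0)) for m in pattern.finditer(html))
        elapsed = time.perf_counter() - start
        results[name] = (matched_chars, elapsed)
    return results


# Invented samples: one well-formed footer, one footer whose </table> is missing.
samples = {
    "well_formed": '<table id="footer"><tr><td>Contact</td></tr></table><p>Body</p>',
    "unclosed": '<table id="footer"><tr><td>Contact</td></tr>' + '<p>Body</p>' * 1000,
}

report = time_expression(r'<table id=\"footer\"(.*?)</table>', samples)
for name, (matched_chars, elapsed) in report.items():
    print(f"{name}: matched {matched_chars} chars in {elapsed:.6f}s")
```

A document where the expression matches far more characters than expected, matches nothing at all, or takes noticeably longer than its neighbours is a candidate for closer inspection before the expression is added to collection.cfg.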