crawler.eliminate_duplicates (collection.cfg setting)

Description

This parameter controls whether duplicate documents identified during a crawl are eliminated. The default behaviour is true, i.e. all duplicates are deleted when they are found.

Duplicate detection is performed by extracting the human-readable text from a file, ignoring any markup and tags. The intent is to detect files that look the same to a human, which means differences in embedded metadata and other non-visible content are ignored during this process.
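The general idea can be illustrated with a minimal sketch (this is not Funnelback's actual implementation, just an illustration of the technique described above): strip the markup, normalise whitespace, and fingerprint only the visible text, so two pages that render the same text compare as equal even if their tags or metadata differ.

```python
from html.parser import HTMLParser
import hashlib

class TextExtractor(HTMLParser):
    """Collects only the human-readable text, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def content_fingerprint(html: str) -> str:
    """Hash of the visible text only; markup and metadata do not contribute."""
    parser = TextExtractor()
    parser.feed(html)
    # Normalise whitespace so formatting-only differences are ignored
    text = " ".join("".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Different metadata and markup, identical human-readable text:
a = '<html><head><meta name="generator" content="v1"></head>' \
    '<body><p>Hello world</p></body></html>'
b = '<html><head><meta name="generator" content="v2"></head>' \
    '<body><div>Hello   world</div></body></html>'
assert content_fingerprint(a) == content_fingerprint(b)
```

Because the fingerprint is computed from extracted text rather than raw bytes, this only works where the text extractor understands the format, which matches the note below about binary files.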

The web crawler will detect most HTML duplicates during the crawl, but it will not detect duplicate binary files (e.g. PDF or Office files).

Default value

crawler.eliminate_duplicates=true

Example

Turn off in-crawl duplicate detection:

crawler.eliminate_duplicates=false

See also

v15.18.0