crawler.eliminate_duplicates

Whether to eliminate duplicate documents while crawling.

Key: crawler.eliminate_duplicates
Type: Boolean
Can be set in: collection.cfg

Description

This parameter controls whether duplicate documents identified during a crawl should be eliminated. The default behaviour is true, i.e. all duplicates are deleted as they are found.

Duplicate detection is performed by extracting the human-readable text from a file and ignoring any markup and tags. The intent is to detect files that look the same to a human reader, so differences in embedded metadata and other non-visible content are ignored during this process.
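
The sketch below illustrates the general idea only; it is not Funnelback's implementation, and the TextExtractor class and fingerprint function are illustrative assumptions. It extracts the visible text from a document, normalises whitespace and case, and hashes the result, so two documents that differ only in markup or embedded metadata produce the same fingerprint and would be treated as duplicates.

from hashlib import sha256
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the human-readable text, discarding tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def fingerprint(html):
    """Hash the normalised visible text so markup/metadata differences are ignored."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(" ".join(extractor.chunks).split()).lower()
    return sha256(text.encode("utf-8")).hexdigest()


# Two pages with identical visible text but different markup and metadata
# yield the same fingerprint, so they would be flagged as duplicates.
a = "<html><meta name='generator' content='v1'><body><b>Hello world</b></body></html>"
b = "<html><meta name='generator' content='v2'><body><p>Hello   world</p></body></html>"
print(fingerprint(a) == fingerprint(b))  # True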

The web crawler will detect most HTML duplicates during the crawl, but it will not detect duplicate binary files (e.g. PDF or Office documents).

Default Value

crawler.eliminate_duplicates=true

Examples

Turn off elimination of duplicates during the crawl:

crawler.eliminate_duplicates=false
