crawler.eliminate_duplicates (collection.cfg setting)
This parameter controls whether duplicate documents identified during a crawl are eliminated. The default value is true, i.e. duplicates are deleted as they are found.
Duplicate detection is performed by extracting the human-readable text from a file, ignoring any markup and tags. The intent is to detect files that look the same to a human reader. As a result, differences that do not affect the visible text, such as embedded metadata, are ignored.
The webcrawler will detect most HTML duplicates during the crawl, but it will not detect duplicate binary files (e.g. PDF or Office files).
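The text-based comparison described above can be illustrated with a short Python sketch. It is not the crawler's actual implementation: the names TextExtractor, text_fingerprint and find_duplicates are hypothetical, and the use of a SHA-256 hash over whitespace-normalised text is an assumption made purely for illustration.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping tags and <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def text_fingerprint(html):
    """Hash the human-readable text only (assumed scheme, for illustration)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout differences do not matter.
    text = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """pages maps URL -> HTML; returns (duplicate_url, first_seen_url) pairs."""
    seen = {}
    duplicates = []
    for url, html in pages.items():
        fingerprint = text_fingerprint(html)
        if fingerprint in seen:
            duplicates.append((url, seen[fingerprint]))
        else:
            seen[fingerprint] = url
    return duplicates
```

In this sketch, two pages with the same visible text but different tags or meta elements produce the same fingerprint, so the second one encountered is reported as a duplicate of the first.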
Turn off in-crawl duplicate detection:
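A minimal sketch of the corresponding collection.cfg entry, assuming the usual key=value syntax and that false is the accepted counterpart of the default true:

```
crawler.eliminate_duplicates=false
```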