crawler.classes.RevisitPolicy (collection.cfg setting)
This parameter controls what revisit policy the web crawler uses, where revisit means using a network call (HTTP HEAD and/or GET request) when processing a URL.
For example, a revisit policy might look at a URL in the URL store and decide that, because the page was unchanged the last 5 times it was downloaded, it can be assumed to be unchanged this time and does not need a revisit. The crawler then uses the copy from the previous crawl and avoids any HEAD or GET requests for that URL.
Note: The revisit policy is only used during incremental crawls.
By default, every document is revisited on every update.
Alternatively, change this setting to use a revisit policy that implements the following:
- Initially, everything is crawled.
- If a page has not changed after crawler.revisit.num_times_unchanged_threshold crawls, it will not be crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.
- The URL will then have to be crawled crawler.revisit.num_times_unchanged_threshold times again without any changes before it will be skipped again.
- A full crawl will force everything to be crawled, but the values recorded for revisits skipped and num_times_unchanged will be preserved.
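The rules above can be sketched as a small simulation. This is an illustration only, not the crawler's actual implementation: the UrlRecord class, its field names, and the revisit function are hypothetical, and the sketch assumes that a full crawl leaves both counters untouched and that a detected change resets the unchanged counter.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """Hypothetical per-URL state; the real URL store may differ."""
    times_unchanged: int = 0  # fetches in a row that found no change
    times_skipped: int = 0    # revisits skipped since the threshold was reached


def revisit(rec, changed, unchanged_threshold, skipped_threshold, full_crawl=False):
    """Return True if the URL is fetched on this crawl, False if it is skipped.

    `changed` says whether a fetch would find new content; it is ignored
    when the URL is skipped.
    """
    if full_crawl:
        # Assumption: a full crawl forces a fetch and preserves the counters.
        return True
    if rec.times_unchanged >= unchanged_threshold:
        if rec.times_skipped < skipped_threshold:
            rec.times_skipped += 1
            return False
        # Skip allowance used up: the URL must prove itself unchanged again.
        rec.times_unchanged = 0
        rec.times_skipped = 0
    # Normal revisit: update the unchanged counter from the fetch result.
    rec.times_unchanged = 0 if changed else rec.times_unchanged + 1
    return True


# A page that never changes, with an unchanged threshold of 3
# and a skipped threshold of 2 (example values only).
rec = UrlRecord()
history = [revisit(rec, False, 3, 2) for _ in range(10)]
print(history)
```

With these example values the page is fetched three times, skipped twice, and then the cycle repeats. Larger skip thresholds reduce network traffic at the cost of noticing changes later.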