
crawler.classes.RevisitPolicy (collection.cfg setting)

Description

This parameter sets the revisit policy used by the web crawler. A revisit is a network call (an HTTP HEAD and/or GET request) made when processing a URL.

A revisit policy can inspect a URL's record in the URL store and decide, for example, that because the content was unchanged the last 5 times it was downloaded, it can be assumed unchanged this time and no revisit is needed. The crawler then reuses the copy from the previous crawl and makes no HEAD or GET request for that URL.

Note: The revisit policy is only used during incremental crawls.

Default value

Revisit every document on every update.

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.AlwaysRevisitPolicy

Examples

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.SimpleRevisitPolicy

This switches to a revisit policy that behaves as follows:

  1. Initially, every URL is crawled.
  2. If a page has remained unchanged across crawler.revisit.num_times_unchanged_threshold crawls, it is not crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.
  3. After that, the URL must again be crawled crawler.revisit.num_times_unchanged_threshold times without any change before it is skipped again.
  4. A full crawl forces every URL to be crawled, but the values recorded for revisits skipped and num_times_unchanged are preserved.
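
The four steps above can be sketched as a small simulation. This is an illustrative model only, not Funnelback's implementation: the names UrlRecord, should_revisit, record_result and simulate are hypothetical, and the exact point at which the counters reset, as well as leaving them untouched during a full crawl, are assumptions drawn from the description.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """Hypothetical per-URL counters, as the URL store might keep them."""
    times_unchanged: int = 0  # consecutive revisits that found no change
    times_skipped: int = 0    # consecutive crawls in which the revisit was skipped


def should_revisit(rec, unchanged_threshold, skipped_threshold, full_crawl=False):
    """Return True if the URL should be fetched in this crawl."""
    if full_crawl:
        # Step 4: a full crawl fetches everything; counters are left untouched.
        return True
    if rec.times_unchanged < unchanged_threshold:
        # Steps 1 and 3: not yet stable for long enough, so keep revisiting.
        return True
    if rec.times_skipped < skipped_threshold:
        # Step 2: the page looks stable, so skip this revisit.
        rec.times_skipped += 1
        return False
    # Skip window used up: start counting unchanged revisits from zero again.
    rec.times_unchanged = 0
    rec.times_skipped = 0
    return True


def record_result(rec, changed):
    """Update the unchanged counter after an incremental revisit."""
    rec.times_unchanged = 0 if changed else rec.times_unchanged + 1


def simulate(changes, unchanged_threshold, skipped_threshold):
    """Run incremental crawls; changes[i] says whether the page changed at crawl i."""
    rec = UrlRecord()
    decisions = []
    for changed in changes:
        revisit = should_revisit(rec, unchanged_threshold, skipped_threshold)
        if revisit:
            record_result(rec, changed)
        decisions.append(revisit)
    return decisions


if __name__ == "__main__":
    # A page that never changes, with thresholds of 3 (unchanged) and 2 (skipped).
    print(simulate([False] * 10, 3, 2))
```

With these assumed thresholds, a page that never changes is fetched three times, skipped twice, then fetched three times again before the next pair of skips.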

v15.16.0