crawler.classes.RevisitPolicy

Specifies the Java class used for enforcing the revisit policy for URLs.

Key: crawler.classes.RevisitPolicy
Type: String
Can be set in: collection.cfg

Description

This parameter controls which revisit policy the web crawler uses. A revisit means making a network call (an HTTP HEAD and/or GET request) when processing a URL.

For example, a revisit policy might look at a URL in the URL store and decide that, because the content has not changed the last 5 times it was downloaded, it is unlikely to have changed this time, and so skip the revisit. The crawler then reuses the copy from the previous crawl and makes no HEAD or GET requests for that URL.

Note: The revisit policy is only used during incremental crawls.

Default Value

Revisit every document on every update.

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.AlwaysRevisitPolicy

Examples

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.SimpleRevisitPolicy

This switches to a revisit policy that behaves as follows:

Initially, everything is crawled.
If a page has not changed after crawler.revisit.num_times_unchanged_threshold crawls, it will not be crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.
The URL must then be crawled crawler.revisit.num_times_unchanged_threshold times again without any changes before it will be skipped again.
A full crawl forces everything to be crawled, but the values recorded for the number of revisits skipped and the number of times unchanged are preserved.
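
The behaviour listed above can be sketched as a small per-URL counter. This is an illustrative sketch only, not the actual com.funnelback.common.revisit.SimpleRevisitPolicy source: the class name SimpleRevisitSketch, its fields and its method names are hypothetical. It also assumes that a detected change resets the unchanged count to zero, and that a full crawl leaves both counters untouched.

```java
/**
 * Hypothetical sketch of the skip/revisit cycle described above.
 * Not the Funnelback implementation; all names are illustrative.
 */
final class SimpleRevisitSketch {
    // Stands in for crawler.revisit.num_times_unchanged_threshold
    private final int unchangedThreshold;
    // Stands in for crawler.revisit.num_times_revisit_skipped_threshold
    private final int skippedThreshold;

    private int timesUnchanged = 0; // consecutive fetches with no change
    private int timesSkipped = 0;   // revisits skipped in the current cycle

    SimpleRevisitSketch(int unchangedThreshold, int skippedThreshold) {
        this.unchangedThreshold = unchangedThreshold;
        this.skippedThreshold = skippedThreshold;
    }

    /** Decides whether this URL should be fetched in the current crawl. */
    boolean shouldRevisit(boolean fullCrawl) {
        if (fullCrawl) {
            // A full crawl fetches everything; counters are left as they are.
            // In this sketch the caller does not call recordFetch afterwards.
            return true;
        }
        if (timesUnchanged < unchangedThreshold) {
            return true; // not yet unchanged for long enough to skip
        }
        if (timesSkipped < skippedThreshold) {
            timesSkipped++;
            return false; // reuse the copy from the previous crawl
        }
        // All allowed skips are used up: start a fresh cycle.
        timesUnchanged = 0;
        timesSkipped = 0;
        return true;
    }

    /** Records the outcome of a fetch made during an incremental crawl. */
    void recordFetch(boolean changed) {
        // Assumption: any change restarts the unchanged count.
        timesUnchanged = changed ? 0 : timesUnchanged + 1;
    }
}
```

With an unchanged threshold of 2 and a skipped threshold of 3, a page that never changes would be fetched on two incremental crawls, skipped on the next three, then fetched twice again before the next run of skips.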
