
crawler.revisit.edit_distance_threshold

Edit distance threshold used when deciding whether a page has changed between two versions.

Key: crawler.revisit.edit_distance_threshold
Type: Integer
Can be set in: collection.cfg

Description

This parameter specifies the threshold used when deciding whether the content of a URL has changed compared to a previous version. The edit distance is the minimum number of operations (add, edit, delete) required to transform one string into the other.

If the edit distance is less than this threshold, the page is marked as "unchanged" and this information is fed into the crawler's revisit policy. Pages that rarely change may be revisited less often, with a copy of their content used instead.
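The comparison described above can be sketched in a few lines of Python. This is an illustration only: it assumes a character-level Levenshtein distance, which this page does not confirm is what the crawler actually computes, and the names edit_distance and is_unchanged are hypothetical helpers, not part of Funnelback.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest add/edit/delete operations turning a into b.

    Assumption: character-level operations; the crawler's real metric may differ.
    """
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # add ch_b
                previous[j - 1] + cost,  # edit ch_a into ch_b, or keep it
            ))
        previous = current
    return previous[-1]


def is_unchanged(old: str, new: str, threshold: int = 20) -> bool:
    """Hypothetical helper: True when the distance is strictly below the threshold."""
    return edit_distance(old, new) < threshold
```

With the default threshold of 20, two versions that differ by only a few characters would be treated as unchanged, while a distance equal to the threshold counts as a change, because the rule above is "less than".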

Default Value

crawler.revisit.edit_distance_threshold=20



v15.24.0