
crawler.use_sitemap_xml

Specifies whether to process sitemap.xml files during a web crawl.

Key: crawler.use_sitemap_xml
Type: Boolean
Can be set in: collection.cfg

Description

This parameter controls whether sitemap.xml files listed in robots.txt are used during a web crawl.

Default Value

crawler.use_sitemap_xml=false

The default value is false, i.e. sitemap.xml files are not processed during a crawl.

Examples

Specify that sitemap.xml files should be used:

crawler.use_sitemap_xml=true

With this setting, the Funnelback web crawler will check the robots.txt file of each web server that passes the crawl include patterns. If the robots.txt file contains any Sitemap: directives these will be processed, including any sitemap index files and compressed sitemap.xml files.
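For example, a robots.txt file on a hypothetical web server (the host name and file names below are illustrative only) might point the crawler at a sitemap index:

User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap_index.xml

The referenced file may itself be a sitemap index that points to further (possibly gzip-compressed) sitemap files, each of which lists page URLs in <url>/<loc> entries:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml.gz</loc>
  </sitemap>
</sitemapindex>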

Notes

  • The Funnelback web crawler does not currently take note of any <lastmod> elements in the sitemap.xml files. Standard incremental crawling can still be used to avoid downloading content that has not changed, based on its content length.
  • If the crawler.max_individual_frontier_size parameter is defined and non-empty, it is used as a limit on the total number of URLs extracted from the sitemap file(s) for any individual site (see the configuration sketch after this list).
  • All URLs extracted from sitemaps are processed in the same way as links extracted while crawling normal web pages, i.e. they are run through the loading policy and checked against the relevant robots.txt rules, include and exclude patterns, etc.
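As a rough sketch, a collection.cfg that enables sitemap processing, caps the number of URLs taken from any one site's sitemaps and scopes the crawl to a single hypothetical host might look like the following (the host, patterns and limit value are illustrative only, and the include_patterns/exclude_patterns lines are shown purely for context):

crawler.use_sitemap_xml=true
crawler.max_individual_frontier_size=10000
include_patterns=www.example.com
exclude_patterns=/private/

With such a configuration, URLs extracted from the sitemap that fall outside the include patterns, match an exclude pattern or are disallowed by robots.txt are rejected in the same way as any other extracted link.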

