include_patterns

Specifies the patterns that URLs must match in order to be crawled.

Key: include_patterns
Type: List<String>
Can be set in: collection.cfg

Description

This option is a comma-separated list of URL patterns that the crawler uses to decide whether it will process a page. If a page's URL matches at least one of these patterns, the crawler will process it. URLs that match exclude_patterns will not be crawled even if they also match an include pattern; the exception is start URLs.

See: include and exclude patterns for a description of how include and exclude patterns work, and details on using regular expressions if required.
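
The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual implementation: the function name should_crawl and its parameters are hypothetical, and it assumes that a plain pattern matches when it appears anywhere in the URL (regular expression patterns are not handled here).

```python
def should_crawl(url, include_patterns, exclude_patterns, start_urls=()):
    """Illustrative check of whether a URL would be crawled.

    Assumption: a plain pattern matches if it occurs anywhere in the URL.
    """
    included = any(p in url for p in include_patterns)
    excluded = any(p in url for p in exclude_patterns)
    # Start URLs are exempt from the exclude patterns.
    if url in start_urls:
        excluded = False
    return included and not excluded


includes = ["www.funnelback.com/support"]
excludes = ["/old/"]  # hypothetical exclude pattern, for illustration only

print(should_crawl("http://www.funnelback.com/support/faq.html", includes, excludes))
print(should_crawl("http://www.funnelback.com/support/old/a.html", includes, excludes))
```

Under these assumptions the first URL is accepted and the second is rejected, because the second also matches the exclude pattern and is not a start URL.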

Default Value

(none, set when the collection is created)

Examples

If you were crawling http://www.funnelback.com and wanted to download just the support directory (and nothing else), then you would use the following include pattern:

include_patterns=www.funnelback.com/support

If you wanted to crawl the entire http://www.funnelback.com site then you would use:

include_patterns=www.funnelback.com/

You can include a protocol (http or https) in the pattern, but it is not usually necessary.

If you wanted to crawl every web server in the Australian National University and University of Sydney domains, then you would use:

include_patterns=anu.edu.au,usyd.edu.au
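
As a rough way to check a pattern list before crawling, the sketch below splits a collection.cfg line into its individual patterns and tests a few URLs against them. The helper names parse_patterns and matches_any and the sample URLs are hypothetical, and the sketch again assumes plain substring matching, which is why a pattern without a protocol accepts both http and https URLs.

```python
def parse_patterns(cfg_line):
    """Split a 'key=value' config line into a list of trimmed patterns."""
    _key, _sep, value = cfg_line.partition("=")
    return [p.strip() for p in value.split(",") if p.strip()]


def matches_any(url, patterns):
    """Assumption: a pattern matches if it occurs anywhere in the URL."""
    return any(p in url for p in patterns)


patterns = parse_patterns("include_patterns=anu.edu.au,usyd.edu.au")

# Sample URLs are made up for illustration.
for url in [
    "http://www.anu.edu.au/study",
    "https://library.usyd.edu.au/",
    "http://www.example.com/",
]:
    print(url, matches_any(url, patterns))
```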

Note: Always specify at least one include pattern for the web crawler. Without one, it will follow links out onto the wider web, start downloading that content, and fill up the hard disk.
