Web collections

Introduction

A web collection is a collection of documents obtained from one or more web sites. Web collections contain HTML, PDF and Microsoft Office files that are gathered using a web crawler, which discovers content by following the links it finds.

To avoid crawling the entire Internet, the crawler uses a number of configuration options to determine which links it will follow and which web sites or domains the crawl should be limited to.

Web collection basics

The web crawler works by accessing a number of defined URLs in the seed list and extracting the links from these pages. A number of checks are performed on each link as it is extracted to determine if the link should be crawled. The link is compared against a set of include/exclude rules and a number of other settings (such as acceptable file types) to determine whether it is suitable for inclusion in the index. A link that is deemed suitable is added to a list of uncrawled URLs called the crawl frontier.

The web crawler then continues to run, taking URLs off the crawl frontier, extracting the links from each page and checking them against the include/exclude rules, until the crawl frontier is empty or an overall timeout is reached.
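As a rough sketch only, the seed list and the overall crawl limit are both set in collection.cfg. The URL and timeout value below are placeholders, and the exact option names and units should be checked against the documentation for your Funnelback version:

# Seed URL(s) that the crawler starts from (placeholder URL)
start_url=https://www.example.org/

# Stop the crawl once this overall limit is reached (value assumed to be in hours)
crawler.overall_crawl_timeout=24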

Include/exclude rules

The crawler processes each URL it encounters against the following collection.cfg options to determine whether the URL will be included in or excluded from further processing:

  1. crawler.protocols
  2. include_patterns
  3. crawler.accept_files
  4. exclude_patterns
  5. crawler.reject_files
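
For illustration, the protocol and file-type checks in the list above are controlled by options along these lines; the values shown are placeholders rather than recommended settings:

# Only follow links using these protocols
crawler.protocols=http,https

# File extensions to accept and reject, respectively
crawler.accept_files=html,htm,pdf,doc,docx
crawler.reject_files=gif,jpg,png,zip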

Example

Assuming you had the following options:

include_patterns=/red,/green,/blue
exclude_patterns=/green/olive

Then the following URLs will be included or excluded...

URL             Success?   Comments
/orange         FAIL       fails include
/green/emerald  PASS       passes include, passes exclude
/green/olive    FAIL       passes include, fails exclude

Regular expressions in include/exclude patterns

To express more advanced include or exclude patterns you can use regular expressions for the include_patterns and exclude_patterns configuration options.

Regular expressions follow Perl 5 syntax and start with regexp: followed by a compound regular expression in which each distinct include/exclude pattern is separated by the | character. Regex and simple include/exclude patterns cannot be mixed within a single configuration option.

An example of the more advanced regexp: form is:

exclude_patterns=regexp:search\?date=|^https:|\?OpenImageResource|/cgi-bin/|\.pdf$

which combines five alternative patterns into one overall pattern expression to match:

  1. URLs containing search?date= (for example, to exclude calendars).
  2. HTTPS URLs.
  3. Dynamic content generated by URLs containing ?OpenImageResource.
  4. Dynamic content from CGI scripts.
  5. PDF files.

Note: regex special characters that appear in patterns must be escaped (e.g. \? and \.):

include_patterns=regexp:\.anu\.edu\.au

Excluding URLs during a running crawl

The crawler supports the exclusion of URLs during a running crawl. The crawler.monitor_url_reject_list collection.cfg parameter allows an administrator to specify additional URL patterns to exclude while the crawler is running. These URL patterns apply from the next crawler checkpoint and should be converted to regular exclude patterns once the crawl completes.
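
As a hedged example, assuming the parameter accepts patterns in the same alternation form as the regexp: exclude patterns described above (the specific patterns below are placeholders), an administrator might set:

crawler.monitor_url_reject_list=/print/|\.zip$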

Creating a web collection

Web collections require, at a minimum, a set of seed URLs and include/exclude patterns to be defined when setting up the collection.

To create a new web collection:

  1. Select web from the create collection menu in the administration interface.
  2. Enter a unique collection id to identify the collection. (e.g. example-organisation-web)
  3. Enter a collection title which is a human readable identifier for the collection. (e.g. Example Organisation: public websites)
  4. Enter a set of start URLs.
  5. Enter include and exclude patterns (see the example configuration after these steps).
  6. Save and update the collection.
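
For illustration only, the resulting collection.cfg for the example-organisation-web collection above might contain settings along these lines (the URL and patterns are placeholders):

start_url=https://www.example.org/
include_patterns=example.org
exclude_patterns=/calendar,/login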

Web crawler configuration

The behaviour of the web crawler is controlled through collection.cfg options such as those described in the sections above.

Web crawler logs

The web crawler writes a number of log files detailing various aspects of the crawl. See: web crawler logs
