Web collections

Introduction

A web collection is a collection of documents obtained from one or more web sites. Web collections contain HTML, PDF and Microsoft Office files that are gathered using a web crawler which discovers content by following the links it finds.

In order to avoid crawling the entire Internet the crawler uses a number of configuration options to determine which links it will follow and what web sites or domains it should limit its crawl to.

Web collection basics

The web crawler works by accessing a number of defined URLs in the seed list and extracting the links from these pages. A number of checks are performed on each link as it is extracted to determine if the link should be crawled. The link is compared against a set of include/exclude rules and a number of other settings (such as acceptable file types) determining if the link is suitable for inclusion in the index. A link that is deemed suitable is added to a list of uncrawled URLs called the crawl frontier.

The web crawler will continue run taking URLs off the crawl frontier, extracting links in the pages and checking the links against the include/exclude rules until it runs out of links in the crawl frontier, or an overall timeout is reached.

Include/exclude rules

The crawler processes each URL it encounters against the various options in collection.cfg to determine if the URL will be included + or excluded − from further processing:

Example

Assuming you had the following options:

include_patterns=/red,/green,/blue
exclude_patterns=/green/olive

Then the following URLs will be included or excluded...

URL	Success?	Comments
/orange	FAIL	fails include
/green/emerald	PASS	passes include, passes exclude
/green/olive	FAIL	passes include, fails exclude

Regular expressions in include/exclude patterns

To express more advanced include or exclude patterns you can use regular expressions for the include_patterns and exclude_patterns configuration options.

Regular expressions follow Perl 5 syntax and start with regexp: followed by a compound regular expression in which each distinct include/exclude pattern is separated by the | character. Regex and simple include/exclude patterns cannot be mixed within a single configuration option.

An example of the more advanced regexp: form is:

exclude_patterns=regexp:search\?date=|^https:|\?OpenImageResource|/cgi-bin/|\.pdf$

which combines five alternative patterns into one overall pattern expression to match:

search?date= for example, to exclude calendars.
HTTPS urls
Dynamic content generated by URLs containing ?OpenImageResource.
Dynamic content from CGI scripts.
PDF files.

Note: regex special characters that appear in patterns must be escaped (e.g. \? and \.):

include_patterns=regexp:\.anu\.edu\.au

Excluding URLs during a running crawl

The crawler supports exclusion of URLs during a running crawl. The crawler.monitor_url_reject_list collection.cfg parameter allows an administrator to specify additional URLs patterns to exclude while the crawler is running. These URL patterns will apply from the next crawler checkpoint and should be converted to a regular exclude pattern once the crawl completes.

Creating a web collection

Web collections require at a minimum a set of seed URLs and include/exclude patterns to be defined when setting up the collection.

To create a new web collection:

Select 'Create collection' in the admin home page's top navigation bar.
Enter the required details, including a group id, a collection id (eg. example-organisation-web) to identify the collection.
Select "web" as the collection type
Select a license for the collection and configure the collection's ownership
Click the 'Create Collection' button, which creates the collection and navigates to the main configuration screen.
Enter a set of start URLs and save them.
Enter include and exclude patterns and save them.
Customise any other requried settings.
Return to the admin home page and update the collection.

Web crawler configuration

Web crawler logs

The web crawler writes a number of log files detailing various aspects of the crawl. See: web crawler logs