
Configuring the web crawler

Introduction

This page gives a guide to configuring the Funnelback web crawler. The web crawler gathers pages for indexing by following hypertext links and downloading documents. The main collection creation page for a web collection describes the standard parameters used to configure the web crawler. These include giving it a start point, specifying how long to crawl for, which domain to stay within and which areas to avoid.

For most purposes the default settings for the other crawler parameters will give good performance. This page is for administrators who have particular performance or crawling requirements.

A full list of all crawler-related parameters (plus default values) is given on the configuration page. All of the configuration parameters mentioned can be modified by editing the collection.cfg file for a collection in the Administration interface.

Speeding up web crawling

The default behaviour of the web crawler is to be as polite as possible. This is enforced by allowing only one crawler thread to access an individual web server at any one time, which prevents multiple crawler threads from overloading a web server. It is implemented by mapping individual servers to specific crawler threads.

If you have control over the web server(s) being crawled you may decide to relax this constraint, particularly if you know they can handle the load. This can be accomplished by using a site_profiles.cfg file, where you can specify how many parallel requests to use for particular web servers.

Two general parameters which can also be tuned for speed are:

crawler.num_crawlers=20
crawler.request_delay=250

Increasing the number of crawlers (threads) will increase throughput, as will decreasing the delay between requests. The latter is specified in milliseconds, with a default delay of one quarter of a second. We do not recommend decreasing this below 100ms.

Warning: These parameters should be tuned with care to avoid overloading web servers and/or saturating your network link. If crawling a single large site we recommend starting with a small number of threads (e.g. 4) and working up until acceptable performance is reached. Similarly, when decreasing the request delay, work down until the overall crawl time has been satisfactorily reduced.
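
For example, when tuning a crawl of a single large site you might begin with a conservative configuration along the following lines (the values shown are illustrative starting points only, not recommendations):

crawler.num_crawlers=4
crawler.request_delay=200

You can then raise the thread count and lower the delay in small steps, checking server load and overall crawl time after each change.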

Incremental crawling

An incremental crawl updates an existing set of downloaded pages instead of starting the crawl from scratch. The crawler achieves this by comparing the document length provided by the web server (in response to an HTTP HEAD request) with that obtained in the previous crawl. This can reduce network traffic and storage requirements and speed up collection update times.

The ratio of incremental to full updates can be controlled by the following parameter:

schedule.incremental_crawl_ratio

This is the number of scheduled incremental crawls performed between each full crawl (e.g. a value of '10' results in an update schedule where every ten incremental crawls are followed by a full crawl). This parameter is only referenced by the update system when no explicit update options are provided by the administrator.
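
For example, to schedule a full crawl after every ten incremental crawls:

schedule.incremental_crawl_ratio=10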

An additional configuration parameter used in incremental crawling is the crawler.secondary_store_root setting. The web crawler will check the secondary store specified by this parameter and avoid downloading content from the web that hasn't changed. When a web collection is created the Funnelback administration interface will insert the correct location for this parameter, and it will not normally need to be edited manually.

Revisit policies

Incremental crawls use a revisit policy to further refine their behaviour. The revisit policy is used by the crawler to decide whether or not to revisit a site during an incremental crawl, based on how frequently the site's content has been found to change.

Funnelback supports two types of revisit policies:

  • Always revisit policy: This is the default behaviour. Funnelback will always revisit a URL and check the HTTP headers when performing an incremental crawl.
  • Simple revisit policy: Funnelback tracks how frequently a page changes and will decide, based on some configuration settings, whether or not to skip the URL for the current crawl.

See: Web crawler revisit policies

Crawling critical sites/pages

If you have a requirement to keep your index of a particular web site (or sites) as up-to-date as possible, you could create a specific collection for this area. For example, if you have a news site which is regularly updated you could specify that the news collection be crawled at frequent intervals. Similarly, you might have a set of "core" critical pages which must be indexed when new content becomes available.

You could use some of the suggestions in this document on speeding up crawling and limiting crawl size to ensure that the update times and cycles for these critical collections meet your requirements.

You could then create a separate collection for the rest of your content which may not change as often or where the update requirements are not as stringent. This larger collection could be updated over a longer time period. By using a meta collection you can then combine these collections so that users can search all available information.

Alternatively, you could use an instant update. See "updating collections" for more details.

Adding additional file types

By default the crawler will store and index HTML, PDF, Microsoft Office, RTF and text documents. Funnelback can be configured to store and index additional file types. See: configure Tika to index additional supported file types

Specify preferred server names

For some collections you may decide you wish to control what server name the crawler uses when storing content. For example, a site may have been renamed from www.old.com to www.new.com, but because so many pages still link to the old name the crawler may store the content under the old name (unless HTTP or HTML redirects have been set up).

A simple text file can be used to specify which name to use e.g. www.new.com=www.old.com. This can also be used to control whether the crawler treats an entire site as a duplicate of another (based on the content of their home pages). Details on how to set this up are given in the documentation for the crawler.server_alias_file parameter.
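
As an illustrative sketch only (the file name and location below are examples rather than defaults; consult the crawler.server_alias_file documentation for the exact format and path), the configuration might look like:

crawler.server_alias_file=$SEARCH_HOME/conf/$COLLECTION_NAME/server_alias.cfg

with the referenced file containing one mapping per line, such as:

www.new.com=www.old.com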

Memory requirements

The process which runs the web crawler takes note of the gather.max_heap_size setting in the collection's collection.cfg file. This specifies the maximum heap size for the crawler process, in MB. For example, the default is:

gather.max_heap_size=640

This should be suitable for most crawls of fewer than 250,000 URLs. For crawls over this size you should expect to increase the heap size to at least 2000 MB, subject to the amount of RAM available and what other large jobs might be running on the machine.
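
For example, for a crawl substantially larger than this you might set (the value is illustrative; size it according to the available RAM):

gather.max_heap_size=2000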

Crawling dynamic web sites

In most cases Funnelback will crawl dynamically generated web sites by default. However, some sites (e.g. e-commerce sites, product catalogs etc.) may enforce the use of cookies and session IDs. These are normally used to track a human user as they browse through a site.

The Funnelback web crawler is configured to accept cookies by default, by having the following parameters set:

crawler.accept_cookies=true
crawler.packages.httplib=HTTPClient

This turns on cookie storage in memory (and allows cookies to be sent back to the server), by using the appropriate HTTP library. Note that even if a site uses cookies it should still return valid content if a client (e.g. the crawler) does not make use of them.

It is also possible to strip session IDs and other superfluous parameters from URLs during the crawl. This can help reduce the amount of duplicate or near-duplicate content brought back. This is configured using the following optional parameter (with an example expression):

crawler.remove_parameters=regexp:&style(sheet)?=mediaRelease|&x=\d+

The example above will strip off style and stylesheet parameters, as well as parameters like x=21037536 (e.g. session IDs). It uses regular expressions (Perl 5 syntax) and the regexp: flag is required at the start of the expression. Note that this parameter is normally empty by default.
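
For example, with the expression above a URL such as http://www.example.com/page.asp?id=1&stylesheet=mediaRelease&x=21037536 (a purely illustrative address) would be reduced to http://www.example.com/page.asp?id=1 during the crawl, since the stylesheet and x parameters both match the expression and are stripped.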

Finally, the last parameter which you may wish to modify when crawling dynamic web sites is:

crawler.max_files_per_area=10000

This parameter specifies the maximum number of files the crawler should download from a particular area on a web site. The crawler will stop downloading from an area once it reaches this limit, so you may need to increase it if a lot of your content is served from a single point e.g. site.com/index.asp?page_id=348927, to ensure all the content you require is downloaded.
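
For example, if a large proportion of your content is served from a single entry point you might raise the limit along these lines (the value is illustrative):

crawler.max_files_per_area=50000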

Crawling password protected websites

Crawling sites protected by HTTP Basic authentication or Windows Integrated authentication (NTLM) is covered in a separate document on crawling password protected sites.

Sending custom HTTP request header fields

In some circumstances you may want to send custom HTTP request header fields in the requests that the web crawler makes when contacting a web site. For example, you might want to send specific cookie information to allow the crawler to "log in" to a web site that uses cookies to store login information.

Two crawler configuration parameters allow you to do this; see the configuration page for their names and default values.

Form-based authentication

Some websites require a login using an HTML form. If you need to crawl this type of content you can specify how to interact with the forms using either crawler.form_interaction.in_crawl.groupId.url_pattern or crawler.form_interaction.pre_crawl.groupId.url.

Once the forms have been processed, the web crawler can use the resulting cookie to authenticate its requests to the site.
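
As a minimal sketch only, a pre-crawl form interaction group might point at a login page as follows, where 'login' is an arbitrary group identifier and the URL is illustrative; the additional keys needed to supply the form field values are not shown here (see the configuration page for the full set of crawler.form_interaction settings):

crawler.form_interaction.pre_crawl.login.url=https://www.example.com/login.html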

Note: Form-based authentication is different from HTTP basic authentication. Details on HTTP basic authentication are described in a separate document on crawling password protected websites.

Crawling with pre-defined cookies

In some situations you may need to crawl a site using a pre-defined cookie. Further information on this configuration option is available from the cookies.txt page.

Crawling HTTPS websites

This is covered in a separate document: Crawling HTTPS websites.

Crawling SharePoint websites

  • If your SharePoint site is password protected you will need to use Windows Integrated Authentication when crawling - see details on this in the document on crawling password protected sites.
  • You may need to configure "alternate access mappings" in SharePoint so that it uses a fully qualified hostname when serving content e.g. serving content using http://sharepoint.example.com/ rather than http://sharepoint/. Please see your SharePoint administration manual for details on how to configure these mappings.

Limiting crawl size

In some cases you may wish to limit the amount of data brought back by the crawler. The usual approach would be to specify a time limit for the crawl:

crawler.overall_crawl_timeout=24
crawler.overall_crawl_units=hr

The default timeout is set at 24 hours. If you have a requirement to crawl a site within a certain amount of time (as part of an overall update cycle) you can set this to the desired value. You should give the crawler enough time to download the most important content, which will normally be found early on in the crawl. You can also try speeding up the crawler to meet your time limit.

Another parameter which can be used to limit crawl size is crawler.max_files_stored, which sets the maximum number of files to store on disk (default is unlimited). Finally, you can specify the maximum link distance from the start point (default is also unlimited).

For example, if the maximum link distance is set to 1, only the links on the start URL(s) will be crawled. This could be used to restrict the crawl to a specific list of URLs generated by some other process, e.g. as a pre_gather command.
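
For example, assuming the full configuration key for the link distance limit is crawler.max_link_distance (check the configuration page for the exact name), a crawl restricted to the links found on the start URLs, with an illustrative cap on stored files, might use:

crawler.max_files_stored=100000
crawler.max_link_distance=1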

Warning: Turning the max_link_distance parameter on drops the crawler down to single-threaded operation.

Redirects

The crawler stores information about redirects in a file called redirects.txt in the collection's log directory. This records information on HTTP redirects, HTML meta-refresh directives, duplicates, canonical link directives etc.

This information is then processed by the indexer and used in ranking, e.g. ensuring that anchor text is associated with the correct redirect target.

Crawler couldn't access seed page

In some scenarios you may see the following message in a collection's main update log:

Crawler couldn't access seed page.

This means the web crawler couldn't access any of the specified start URLs. To see why this is the case you should check the individual crawler logs in the "offline" view, which should give details on why it was unable to process the URL(s).

Changing configuration parameters during a running crawl

The web crawler monitor options provide a number of settings that can be dynamically adjusted while a crawl is running.
