
crawler.form_interaction.pre_crawl.[groupId].url

Specifies the URL of the page containing the HTML web form in pre_crawl mode.

Key: crawler.form_interaction.pre_crawl.[groupId].url
Type: String
Can be set in: collection.cfg

Description

This parameter specifies the URL of a page which contains HTML web forms, not the action URL of the script which processes the form. This can be used to support form-based authentication using cookies, allowing the webcrawler to log in to a secure area in order to crawl it.

In pre_crawl mode, the webcrawler logs in once at the start of the crawl in order to obtain one or more authentication cookies. These cookies are then sent during the crawl in order to get access to authenticated content.

The values which should be passed to the form can be specified using either the crawler.form_interaction.pre_crawl.[groupId].cleartext.[urlParameterKey] or the crawler.form_interaction.pre_crawl.[groupId].encrypted.[urlParameterKey] keys.
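
As a minimal sketch of how these keys might fit together, the lines below assume a login form with two fields named username and password. Those field names, the crawl-user value and the encrypted placeholder are hypothetical and must be replaced with the names used by the actual form and a value produced by your Funnelback installation.

crawler.form_interaction.pre_crawl.1.url=https://www.example.com/login
crawler.form_interaction.pre_crawl.1.cleartext.username=crawl-user
crawler.form_interaction.pre_crawl.1.encrypted.password=<encrypted value>

All three keys share the groupId 1, so they would be treated as one form interaction.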

Things to note

  • You may need to add an exclude_pattern for any "logout" links so the crawler does not log itself out when it starts crawling the authenticated content.
  • You may need to manually specify a parameter if it is generated by JavaScript, as the crawler does not currently interpret JavaScript in forms.
  • You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to log in to some sites.
  • You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.
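
The first and third notes above could be sketched as follows. This assumes the collection's exclude patterns are held in a key named exclude_patterns and that the site's logout link contains /logout. Both are assumptions to check against your own collection, and the user agent string is only an illustrative browser value.

exclude_patterns=/logout
crawler.user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

If the collection already defines exclude patterns, the logout pattern would need to be added to the existing value rather than replacing it.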

Any cookie collected during the authentication process will be set in a header for every request the crawler makes during the crawl. However, the crawler.accept_cookies setting is still effective: if you disable it, only the authentication cookie will be set, and if you enable it, the crawler will collect cookies during the crawl in addition to the authentication cookie.

Note: Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to 'false'.
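
For example, a configuration along these lines would send only the cookies obtained from the login form, assuming the same example login page used elsewhere on this page:

crawler.form_interaction.pre_crawl.1.url=https://www.example.com/login
crawler.accept_cookies=false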

Default Value

None. No pre_crawl URLs are configured by default.

Examples

To specify a URL with forms in pre_crawl mode:

crawler.form_interaction.pre_crawl.1.url=https://www.example.com/login

The crawler groups the pre_crawl authentication configuration for a given URL by matching the groupId parameter. For example, to specify a form number for the URL https://www.example.com/login, use the same groupId in both keys (1 in the example below).

crawler.form_interaction.pre_crawl.1.url=https://www.example.com/login
crawler.form_interaction.pre_crawl.1.form_number=1

v15.24.0