
crawler.form_interaction.in_crawl.[groupId].url_pattern

Specifies the URL or URL pattern of the form action that an HTML web form submits to in in_crawl mode.

Key: crawler.form_interaction.in_crawl.[groupId].url_pattern
Type: String
Can be set in: collection.cfg

Description

This parameter specifies the URL or URL pattern of the form action (the processing end-point), rather than the URL of the page that contains the form.

In-crawl authentication

In in_crawl mode the webcrawler submits form details during a crawl whenever a form with a specific "action" is encountered.

The values that should be passed to the form can be specified using either the crawler.form_interaction.in_crawl.[groupId].cleartext.[urlParameterKey] or the crawler.form_interaction.in_crawl.[groupId].encrypted.[urlParameterKey] keys.

If, during the crawl, the crawler parses a form whose action resolves to the same absolute URL, it submits the specified values. This simulates the behaviour of a human who browses to password-protected content and is asked to authenticate using a form that submits its details to "login.jsp". It also handles the situation where there is a series of login/authentication URLs and redirects: as long as the crawler eventually downloads HTML containing the required form action, it will submit the required credentials.
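The idea of a form action resolving to an absolute URL can be illustrated with a short Python sketch. It is not the crawler's implementation: the function names are hypothetical, and it assumes an exact string comparison against a single configured URL, whereas the crawler's actual handling of URL patterns is not described here.

```python
from urllib.parse import urljoin


def resolve_form_action(page_url: str, action: str) -> str:
    """Resolve a form's action attribute against the URL of the page it was found on."""
    # urljoin returns page_url unchanged when the action is empty,
    # which mirrors a form that submits back to its own page.
    return urljoin(page_url, action)


def matches_configured_action(page_url: str, action: str, configured_url: str) -> bool:
    """Return True when the resolved action equals the configured URL (assumed exact match)."""
    return resolve_form_action(page_url, action) == configured_url


# Value taken from the url_pattern example on this page.
configured = "https://www.example.com/login.jsp"
page = "https://www.example.com/secure/report.html"

print(matches_configured_action(page, "/login.jsp", configured))  # True
print(matches_configured_action(page, "login.jsp", configured))   # False
```

The second call returns False because the relative action "login.jsp" resolves to https://www.example.com/secure/login.jsp, which is why the configured value should be the absolute URL of the end-point that processes the form.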

Assuming the specified credentials are correct and the login (and any subsequent redirects) succeeds, the authenticated content will be downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process will be cached so that subsequent requests to that site do not require the webcrawler to log in again.

Notes

  • If you are using in-crawl authentication then the first field in the configuration file must be the absolute URL for the entity processing the form submission.
  • A limitation of the current implementation is that only one in-crawl "form action" can be configured, which means that only the last action target found in the form_interaction.cfg file will be used. In other words, you should have only one non-comment line in your form_interaction.cfg file when using in-crawl mode.

Logging

All log messages relating to form interaction will be written at the start of the main crawl.log file in the offline or live "log" directory for the collection in question.

You can use the administration interface log viewer to view this file and debug issues with form interaction if required.

Debugging

To debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using a network inspection tool to look at the network requests and responses (including cookie information) that are transmitted when you manually log in to the site in the browser.

You can then compare this network trace with the output in the crawl.log file. Some sample output is shown below:

Requested URL: https://identity.example.com/opensso/UI/Login

POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234

POST Status: Moved Temporarily

In this example, comparing the POST parameters with those in the browser trace showed that the "goto" parameter was different. Investigation of the HTML source of the login form showed that this parameter was being generated by some JavaScript.

Since the crawler does not interpret JavaScript, this parameter would then need to be added explicitly using the crawler.form_interaction.in_crawl.[groupId].cleartext.[urlParameterKey] key.
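The comparison between the crawl.log output and a browser trace can also be scripted. The following Python sketch parses the "name=..., value=..." lines from the sample output above and reports which parameters are missing, extra or different. The function names are hypothetical, the browser values are invented for illustration, and the parsing assumes the log lines look exactly like the sample.

```python
import re

# Matches lines such as "name=goto, value=http://my.example.com/user/".
PARAM_LINE = re.compile(r"^name=(?P<name>[^,]+), value=(?P<value>.*)$")


def parse_logged_parameters(log_text: str) -> dict[str, str]:
    """Collect the POST parameters listed in a crawl.log excerpt."""
    params = {}
    for line in log_text.splitlines():
        match = PARAM_LINE.match(line.strip())
        if match:
            params[match.group("name")] = match.group("value")
    return params


def compare_parameters(crawler: dict[str, str], browser: dict[str, str]) -> dict[str, list[str]]:
    """Report parameters that are missing, extra or different in the crawler's request."""
    shared = crawler.keys() & browser.keys()
    return {
        "missing": sorted(browser.keys() - crawler.keys()),
        "extra": sorted(crawler.keys() - browser.keys()),
        "different": sorted(name for name in shared if crawler[name] != browser[name]),
    }


log_excerpt = """
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
"""

# Hypothetical values copied by hand from a browser network trace.
browser_trace = {
    "goto": "http://my.example.com/user/?from=login",
    "IDToken2": "username",
    "IDToken1": "1234",
}

report = compare_parameters(parse_logged_parameters(log_excerpt), browser_trace)
print(report)  # {'missing': [], 'extra': [], 'different': ['goto']}
```

A parameter reported as "different" or "missing" is a candidate for an explicit cleartext or encrypted parameter key in collection.cfg.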

Funnelback also provides a crawler debug API call which can display the requests the crawler would send and the responses it receives while crawling a single URL.

Troubleshooting notes

  • Try to log in to the form with your browser, but with JavaScript disabled. If that doesn't work, the form relies on JavaScript execution and the crawler won't be able to process it.
  • Make sure all parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to the authentication, such as submit button values.
  • Make sure the crawler doesn't send extra empty parameters. For example, if your form has two submit inputs, "Login" and "Cancel", the value for "Cancel" should not be sent when the form is submitted. A regular browser will not send the value because the Cancel button is not clicked during login (only the "Login" button is), but the crawler must be specifically told not to send this value by setting the parameter to an empty value in the collection.cfg file (see the instructions for crawler.form_interaction.in_crawl.[groupId].cleartext.[urlParameterKey]).
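To work out which parameters a regular browser would send, it can help to list the inputs of the login form and drop the submit buttons that are not clicked. The Python sketch below does this with the standard library HTML parser. It is a simplification for checking a form by hand, not a description of how the crawler parses forms: the class and function names and the sample form are hypothetical, and only input elements are considered (no select, textarea or checkbox handling).

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the named input elements of a page as (type, name, value) tuples."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: list[tuple[str, str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a type attribute default to "text".
            field_type = (attributes.get("type") or "text").lower()
            self.fields.append((field_type, name, attributes.get("value") or ""))


def browser_submission(form_html: str, clicked_submit: str) -> dict[str, str]:
    """Return the parameters a browser would send when one submit button is clicked."""
    collector = FormFieldCollector()
    collector.feed(form_html)
    submitted = {}
    for field_type, name, value in collector.fields:
        # Only the submit button that was actually clicked is included.
        if field_type == "submit" and name != clicked_submit:
            continue
        submitted[name] = value
    return submitted


# Hypothetical login form used only for illustration.
login_form = """
<form action="/login.jsp" method="post">
  <input type="text" name="username" value="">
  <input type="password" name="password" value="">
  <input type="hidden" name="token" value="abc123">
  <input type="submit" name="Login" value="Log in">
  <input type="submit" name="Cancel" value="Cancel">
</form>
"""

print(browser_submission(login_form, clicked_submit="Login"))
# {'username': '', 'password': '', 'token': 'abc123', 'Login': 'Log in'}
```

In this example the "token" and "Login" values are parameters that the backend may expect, while "Cancel" is one that should not be sent.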

Default Value

None. No in_crawl URLs are configured by default.

Examples

To specify a URL for form interaction in in_crawl mode:

crawler.form_interaction.in_crawl.1.url_pattern=https://www.example.com/login.jsp

The crawler groups the in_crawl authentication configuration for a given URL by matching the groupId. If you need to specify URL parameters for the URL https://www.example.com/login.jsp, the groupId in both keys must be the same; it is 1 in the example below.

crawler.form_interaction.in_crawl.1.url_pattern=https://www.example.com/login.jsp
crawler.form_interaction.in_crawl.1.cleartext.username=1

v15.24.0