crawler.form_interaction_in_crawl

Specifies whether the crawler should submit web form login details during the crawl rather than in a pre-crawl phase.

Key: crawler.form_interaction_in_crawl
Type: Boolean
Can be set in: collection.cfg

Description

This parameter controls whether the crawler will submit form login details during a crawl (in-crawl authentication) instead of using a pre-crawl authentication step.

The login details are specified in a form_interaction.cfg file.

Notes

  • This setting changes the semantics of the fields in the form_interaction.cfg file. Please read the documentation on the form_interaction.cfg file for details.
  • With this setting enabled, cookies are not explicitly renewed during the crawl. Instead, the crawler submits the required form details whenever it encounters an appropriate form during the crawl.

Logging

All log messages relating to form interaction will be written at the start of the main crawl.log file in the offline or live "log" directory for the collection in question.

You can use the administration interface log viewer to view this file and debug issues with form interaction if required.

Debugging

To debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using a network inspection tool to look at the network requests and responses (including cookie information) that are transmitted when you manually log in to the site in the browser.

You can then compare this network trace with the output in the crawl.log file. Some sample output is shown below:

Requested URL: https://identity.example.com/opensso/UI/Login

POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234

POST Status: Moved Temporarily

In this example, comparing the POST parameters with those in the browser trace showed that the "goto" parameter was different. Investigation of the HTML source of the login form showed that this parameter was being generated by JavaScript.

Since the crawler does not interpret JavaScript, this parameter would need to be added explicitly to the form_interaction.cfg file.
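
The comparison described above can also be scripted. The following is a minimal sketch in Python, assuming that the crawl.log lines follow the "name=..., value=..." layout of the sample output and that the parameters from the browser trace have been copied by hand into a dictionary. The function names parse_post_params and diff_params and the browser values are hypothetical and are not part of Funnelback.

```python
import re

# Matches lines such as "name=goto, value=http://my.example.com/user/"
# (layout assumed from the sample crawl.log output).
PARAM_LINE = re.compile(r"^name=(?P<name>[^,]+), value=(?P<value>.*)$")


def parse_post_params(log_text):
    """Collect POST parameters from a crawl.log excerpt into a dict."""
    params = {}
    for line in log_text.splitlines():
        match = PARAM_LINE.match(line.strip())
        if match:
            params[match.group("name")] = match.group("value")
    return params


def diff_params(browser, crawler):
    """Report parameters that are missing, extra, or different in the crawler request."""
    shared = set(browser) & set(crawler)
    return {
        "missing": sorted(set(browser) - set(crawler)),
        "extra": sorted(set(crawler) - set(browser)),
        "different": sorted(n for n in shared if browser[n] != crawler[n]),
    }


crawl_log_excerpt = """\
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
"""

# Hypothetical values copied from a browser network trace.
browser_params = {
    "goto": "http://my.example.com/user/home",
    "IDToken2": "username",
    "IDToken1": "1234",
}

report = diff_params(browser_params, parse_post_params(crawl_log_excerpt))
print(report)
```

Any name listed under "missing" or "different" is a candidate for an explicit entry in the form_interaction.cfg file, and any name under "extra" is a parameter the crawler sends that the browser does not.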

Funnelback also provides a crawler debug API call which can display the requests the crawler would send and the responses it receives while crawling a single URL.

Because passwords can be revealed in the requests, this endpoint requires access to the collection and the sec.administer.system permission.

For a server named funnelback-server and a collection with ID collection, the debug endpoint can be asked to display the log for http://example.com with the following URL:

https://funnelback-server:8443/admin-api/crawler/v1/debug/collections/collection/http-request?url=http%3A%2F%2Fexample.com&level=BODY

The returned content shows each request and response performed by the form interaction system (or by any other authentication mechanism).
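
Note that the target URL must be percent-encoded when it is passed in the url query parameter. The following is a minimal Python sketch of building such a debug URL, assuming the path layout and the level=BODY value shown in the example URL. The helper name debug_url is hypothetical, and sending the request (including authentication) is not shown.

```python
from urllib.parse import quote, urlencode


def debug_url(server, collection, target_url, level="BODY", port=8443):
    """Build the crawler debug endpoint URL for a single target URL."""
    # Path layout copied from the example URL in this document.
    path = "/admin-api/crawler/v1/debug/collections/{}/http-request".format(
        quote(collection, safe="")
    )
    # urlencode percent-encodes the target URL (":" -> %3A, "/" -> %2F).
    query = urlencode({"url": target_url, "level": level})
    return "https://{}:{}{}?{}".format(server, port, path, query)


url = debug_url("funnelback-server", "collection", "http://example.com")
print(url)
```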

Troubleshooting notes

  • Try to log in to the form with your browser, but with JavaScript disabled. If that doesn't work, the form relies on JavaScript execution and the crawler won't be able to process it.
  • Make sure all parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to authentication, such as submit button values.
  • Make sure the crawler doesn't send extra empty parameters. For example, if your form has two submit inputs, "Login" and "Cancel", the value for "Cancel" should not be sent when the form is submitted. A regular browser will not send the value because the "Cancel" button is not clicked during login (only the "Login" button is), but the crawler must be specifically told not to send this value by setting the parameter to an empty value in the form_interaction.cfg file (see the documentation on the form_interaction.cfg file).
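
To check which parameters a login form defines, it can help to list its input fields from the HTML source. The following is a minimal Python sketch using the standard library HTML parser. The sample form is hypothetical, and the sketch only covers input elements (not select, textarea or button elements, or fields added by JavaScript).

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect (name, type, value) for every named <input> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            field_type = (attributes.get("type") or "text").lower()
            self.fields.append((name, field_type, attributes.get("value") or ""))


# Hypothetical login form, used only for illustration.
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" name="Login" value="Login">
  <input type="submit" name="Cancel" value="Cancel">
</form>
"""

lister = FormFieldLister()
lister.feed(sample_html)
for name, field_type, value in lister.fields:
    print(name, field_type, repr(value))

# Submit inputs deserve a second look: a browser sends only the one clicked.
submit_fields = [n for n, t, _ in lister.fields if t == "submit"]
print("submit inputs:", submit_fields)
```

Hidden fields in the listing are parameters the backend may expect to receive, and any submit input other than the one used to log in is a candidate for being set to an empty value.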

Default Value

crawler.form_interaction_in_crawl=false

Examples

To turn on in-crawl interaction:

crawler.form_interaction_in_crawl=true

v15.22.0