Specify whether the crawler should submit web form login details during the crawl rather than in a pre-crawl phase.
Can be set in: collection.cfg
This parameter controls whether the crawler will submit form login details during a crawl (in-crawl authentication) instead of using a pre-crawl authentication step.
The login details are specified in a form_interaction.cfg file.
- This setting changes the semantics of the fields in the configuration file - please read the documentation on the form_interaction.cfg file for details.
- It also means that cookies will not be explicitly renewed during the crawl. Instead, the crawler will submit the required form details whenever an appropriate form is encountered.
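Conceptually, in-crawl authentication amounts to spotting a login form in a fetched page and replaying it with the configured credentials. The following is a minimal illustrative sketch of that idea, not Funnelback's actual implementation; the page markup, field names, and credential values are invented for the example:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormScanner(HTMLParser):
    """Collects the action and input fields of the first <form> encountered."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self.in_form:
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

# A hypothetical login page encountered mid-crawl.
page = """
<form action="/opensso/UI/Login" method="post">
  <input type="hidden" name="goto" value="http://my.example.com/user/"/>
  <input type="text" name="IDToken2"/>
  <input type="password" name="IDToken1"/>
  <input type="submit" name="Login" value="Log In"/>
</form>
"""

scanner = FormScanner()
scanner.feed(page)

# Merge in the credentials configured for this form; defaults taken from
# the page itself (e.g. the hidden "goto" field) are kept as-is.
credentials = {"IDToken2": "username", "IDToken1": "1234"}
post_fields = {**scanner.fields, **credentials}
print(scanner.action)
print(urlencode(post_fields))
```

Note how the hidden fields and submit-button values found in the page are carried through to the POST, mirroring what a browser would send.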
All log messages relating to form interaction will be written at the start of the main crawl.log file in the offline or live "log" directory for the collection in question.
You can use the administration interface log viewer to view this file and debug issues with form interaction if required.
In order to debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using a tool like:
to look at the network requests and responses (including cookie information) that get transmitted when you manually log in to the site in the browser.
You can then compare this network trace with the output in the crawl.log file. Some sample output is shown below:
```
Requested URL: https://identity.example.com/opensso/UI/Login
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
POST Status: Moved Temporarily
```
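Once you have both traces, a quick way to spot parameters the crawler failed to send is to compare parameter names. A small sketch of this comparison, where `browser_params` is a made-up example of what the browser's network tools might show:

```python
import re

# The crawl.log excerpt from above.
log_excerpt = """\
Requested URL: https://identity.example.com/opensso/UI/Login
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
POST Status: Moved Temporarily
"""

# Pull out the parameters the crawler submitted so they can be checked
# against the parameters seen in the browser's network trace.
crawler_params = dict(re.findall(r"name=(\S+), value=(\S*)", log_excerpt))

# Parameter names observed in the browser trace (example values).
browser_params = {"goto", "IDToken2", "IDToken1", "Login"}

missing = browser_params - set(crawler_params)
print(sorted(missing))
```

Any names printed here are parameters the browser sent but the crawler did not, which is a common cause of failed form logins.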
Funnelback also provides a crawler debug API call which can display the requests the crawler would send and the responses it receives while crawling a single URL.
Please note that because passwords can be revealed in the requests, this endpoint
requires access to the collection and the
For a server named funnelback-server and a collection with ID collection, we could access the debug endpoint and have it display the log for http://example.com with the following URL.
The returned content shows each request and response performed by the form interaction system (or by any other authentication mechanism).
- Make sure all parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to the authentication, such as submit button values.
- Make sure the crawler doesn't send extra empty parameters. For example, if your form has two submit inputs, "Login" and "Cancel", the value for "Cancel" should not be sent when the form is submitted. A regular browser will not send this value because the Cancel button is not clicked during login (only the "Login" button is), but the crawler must be explicitly told not to send it by setting the parameter to an empty value in the form_interaction.cfg file (see instructions above).
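The empty-value technique described above might look something like the following. This is a purely illustrative sketch; the layout and parameter names here are assumptions, so consult the form_interaction.cfg documentation for the real syntax:

```
# Illustrative only -- see the form_interaction.cfg documentation
# for the actual syntax. Credentials for the login form:
IDToken2=username
IDToken1=1234
# A submit button the browser would not send: set it to an empty
# value so the crawler omits it from the submitted form.
Cancel=
```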
The use of this file will:
- Override the content of the crawler.request_header parameter if this has been specified.
To turn on in-crawl interaction: