crawler.form_interaction_file (collection.cfg setting)
This parameter specifies the path to an optional file which describes how to interact with HTML web forms. This can be used to support form-based authentication using cookies, allowing the webcrawler to log in to a secure area in order to crawl it.
There are two modes supported in this feature:
- Pre-Crawl Authentication (Default)
- In-Crawl Authentication
In the first mode the webcrawler logs in once at the start of the crawl in order to obtain a set of authentication cookies. These can then be used during the crawl to access authenticated content.
In the second mode the webcrawler submits form details during a crawl whenever a form with a specific "action" is encountered.
To configure form interaction you should go to the file manager page for a collection and create a form_interaction.cfg file. This will be created in the collection's configuration directory and the webcrawler will process this file if it exists.
In pre-crawl authentication the form_interaction.cfg file might contain the following content:
# Process the 1st form on the given page, and input the given values
https://sample.com/login 1 parameters:[user=john&password=1234]
# Process the 3rd form on this page
https://sample.com/client 3 parameters:[ClientID=54321]
The list of forms is processed in order. The crawler will contact each URL in turn and submit each form at the beginning of the crawl. Lines beginning with a # are treated as comments and ignored. The format of each line is:
form_url form_number parameters
- The form URL is the URL of the page containing the form, not the action URL for the script which processes the form.
- The form number specifies which form on the page to process, counting in the order of occurrence in the page. For pages with only one form, the value 1 should be used.
- The parameters are those for which you need to give specific values.
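To make the three-field layout concrete, a line can be split with a short sketch like the following (illustrative only, not the crawler's actual parser):

```python
def parse_form_line(line):
    """Split a 'form_url form_number parameters' line into its three
    fields. The parameters field is taken as the remainder of the line,
    so the split is applied at most twice."""
    form_url, form_number, parameters = line.split(None, 2)
    return form_url, int(form_number), parameters

fields = parse_form_line("https://sample.com/client 3 ClientID=54321")
```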
The format for the parameters in the 3rd field is a string of URL-escaped name=value pairs separated by the & character, stored inside parameters:[...]. Additionally, you may supply a blank value for a key if you want that field to not be submitted. For instance, if you wanted to submit a username of "john smith" (with a URL-escaped space) while keeping the "Cancel" field out of the submission, you could do so like:
parameters:[user=john%20smith&Cancel=]
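The URL-escaped parameter string can be produced with a standard library helper. The field names below are made-up examples; note that the "Cancel" key is given a blank value so that field is not submitted:

```python
from urllib.parse import urlencode

# Hypothetical form fields: 'password' contains characters that need
# escaping, and 'Cancel' is left blank so it will not be submitted.
fields = {"user": "john", "password": "p&ss word", "Cancel": ""}
query = urlencode(fields)
print(query)  # user=john&password=p%26ss+word&Cancel=
```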
You do not need to specify all form parameters, only those for which you need to give specific input values. The webcrawler will parse the form and use any default values specified in the form for the other parameters.
Once the forms have been processed any cookies generated are then used by the webcrawler to authenticate its requests for content from the site.
Things to note
- You may need to add an exclude_pattern for any "logout" links so the crawler does not log itself out when it starts crawling the authenticated content.
- You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to login to some sites.
- You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.
Any cookies collected during the authentication process will be set in a header for every request the crawler makes during the crawl. However, the crawler.accept_cookies setting is still in effect: if you disable it only the authentication cookies will be sent, and if you enable it the crawler will collect cookies during the crawl in addition to the authentication cookies.
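The effect of crawler.accept_cookies described above can be modelled with a small sketch (this is a model of the documented behaviour, not the crawler's implementation; the merge order on name clashes is a simplification):

```python
def request_cookies(auth_cookies, session_cookies, accept_cookies):
    """Cookies attached to a crawl request: authentication cookies are
    always sent, while cookies picked up during the crawl are only
    included when accept_cookies is enabled."""
    merged = dict(session_cookies) if accept_cookies else {}
    merged.update(auth_cookies)  # simplification: auth cookies win clashes
    return merged
```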
Note: Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to 'false'.
In some situations you may decide that having the webcrawler 'pre-authenticate' by generating cookies at the start of a crawl may not be appropriate. If that is the case you may instead configure the crawler to try to login to a specific form action URL during the course of the crawl (which may happen multiple times for different sites in your domain which use the same centralised authentication mechanism).
In this in-crawl mode the form_interaction.cfg file will still be parsed at the start of the crawl, with the following differences in behaviour:
- The first field should now contain the absolute URL for the form action (processing end-point), instead of the URL of the form itself
- Any HTML encountered during the crawl which contains this form action will cause the crawler to submit the form details specified in that line
- When using in-crawl authentication the value of the form_number field is ignored; however, since this field is required to be present, a placeholder value such as 1 can be used.
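The matching rule above can be sketched as follows: a form encountered during the crawl triggers submission when its action attribute, resolved against the URL of the page it appeared on, equals the configured absolute action URL. This sketch uses only standard URL resolution and is not the crawler's code:

```python
from urllib.parse import urljoin

def matches_action(page_url, form_action, configured_action):
    """Resolve a form's (possibly relative) action against the page URL
    and compare it with the configured absolute action URL."""
    return urljoin(page_url, form_action) == configured_action

print(matches_action("https://sample.com/secure/page.html",
                     "/auth/login.jsp",
                     "https://sample.com/auth/login.jsp"))  # True
```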
So if the form_interaction.cfg file contained the following non-comment line:
https://sample.com/auth/login.jsp 1 parameters:[user=john&password=1234]
then if the crawler parsed a form that resulted in the same absolute action URL during the crawl it would submit the given values (in this case 'user' and 'password'). This simulates the behaviour of a human who browses to password protected content and is asked to authenticate using a form which submits the form details to "login.jsp". It also handles the situation where there may be a series of login/authentication URLs and redirects - as long as the crawler eventually downloads HTML containing the required form action then it will submit the required credentials.
Assuming the specified credentials are correct and the login (and subsequent redirects) succeed then the authenticated content will be downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process will be cached so that subsequent requests to that site do not require the webcrawler to login again.
- If you are using in-crawl authentication then the first field in the configuration file must be the absolute URL for the entity processing the form submission.
- A limitation of the current implementation is that only one in-crawl "form action" can be configured, which means that only the last action target found in the form_interaction.cfg file will be used, i.e. you should have only one non-comment line in your form_interaction.cfg file when using the in-crawl mode.
Empty (No file specified)
Default location for a collection (assuming form_interaction.cfg file created via file manager):
If configuration file is in another location:
All log messages relating to form interaction will be written at the start of the main crawl.log file in the offline or live "log" directory for the collection in question.
You can use the administration interface log viewer to view this file and debug issues with form interaction if required.
In order to debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using a network inspection tool, such as your browser's developer tools, to look at the network requests and responses (including cookie information) that get transmitted when you manually log in to the site in the browser.
You can then compare this network trace with the output in the crawl.log file. Some sample output is shown below:
Requested URL: https://identity.example.com/opensso/UI/Login
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
POST Status: Moved Temporarily
Alternatively, running a local debugging proxy, for example:
mitmproxy -p 8090
and then setting the following parameters in your collection.cfg file:
will cause the webcrawler to crawl through the proxy rather than directly connecting to the site(s) you are trying to crawl. Your proxy may then allow you to see the (un-encrypted) traffic to assist in debugging authentication issues.
- Make sure all parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to the authentication, such as submit button values.
- Make sure the crawler doesn't send extra empty parameters. For example, if your form has two submit inputs, "Login" and "Cancel", the value for "Cancel" should not be sent when the form is submitted. A regular browser will not send this value because the Cancel button is not clicked during login (only the "Login" button is), but the crawler must be specifically told not to send it by setting the parameter to an empty value in the form_interaction.cfg file (see instructions above).
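For example, a hypothetical ASPX-style login form with both submit buttons could be configured with a line like the following, where Cancel is given an empty value so it is omitted from the submission (the URL and field names here are made up):

```
https://sample.com/login.aspx 1 parameters:[UserName=john&Password=1234&Login=Login&Cancel=]
```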
Running the form interaction process separately
You can also try to run the form interaction processing separately from the crawl, on the command line. To do so, use the following command:
cd $SEARCH_HOME
java -cp "bin/*:lib/java/all/*" com.funnelback.crawler.forms.FormInteraction $SEARCH_HOME/conf/<collection>/form_interaction.cfg
cd %SEARCH_HOME%
java -cp "bin\*;lib\java\all\*" com.funnelback.crawler.forms.FormInteraction %SEARCH_HOME%\conf\<collection>\form_interaction.cfg
The use of this file will:
- Override the content of the crawler.request_header parameter if this has been specified.
- Switch crawling to the default HTTPClient library for its cookie support, overriding any explicit setting in the crawler.packages.httplib parameter.