
crawler.form_interaction_file (collection.cfg setting)

Description

This parameter specifies the path to an optional file which describes how to interact with HTML web forms. This can be used to support form-based authentication using cookies, allowing the webcrawler to log in to a secure area in order to crawl it.

There are two modes supported in this feature:

  1. Pre-Crawl Authentication (Default)
  2. In-Crawl Authentication

In the first mode the webcrawler logs in once at the start of the crawl in order to obtain a set of authentication cookies. These can then be used during the crawl to access authenticated content.

In the second mode the webcrawler submits form details during a crawl whenever a form with a specific "action" is encountered.

To configure form interaction you should go to the file manager page for a collection and create a form_interaction.cfg file. This will be created in the collection's configuration directory and the webcrawler will process this file if it exists.
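
When created via the file manager, the file will be located at:

$SEARCH_HOME/conf/<collection>/form_interaction.cfg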

Pre-crawl authentication

In pre-crawl authentication the form_interaction.cfg file might contain the following content:

# Process the 1st form on the given page, and input the given values
https://sample.com/login 1 parameters:[user=john&password=1234]
# Process the 3rd form on this page
https://sample.com/client 3 parameters:[ClientID=54321]

The list of forms is processed in order: the crawler contacts each URL in turn and submits the corresponding form at the beginning of the crawl. Lines beginning with a # are treated as comments and ignored. The format of each line is:

form_url form_number parameters
  1. The form URL is the URL of the page containing the form, not the action URL for the script which processes the form.
  2. The form number specifies which form on the page to process, counting in the order of occurrence in the page. For pages with only one form, the value 1 should be used.
  3. The parameters are those for which you need to give specific values.
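
For example, a hypothetical line for a login page at https://intranet.example.com/login, whose second form is the login form, might look like this (the URL and field names here are illustrative only):

https://intranet.example.com/login 2 parameters:[user=jsmith&password=secret]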

The format for the parameters in the 3rd field is a string of URL-escaped name=value pairs separated by the & character, stored inside parameters:[]

Additionally, you may supply a blank value for a key if you want that field to not be submitted.

For instance, if you wanted to encode:

Key        Value
name       john
password   p@ssword
cancel     (blank)

You could do so like:

parameters:[name=john&password=p%40ssword&cancel=]

You do not need to specify all form parameters, only those for which you need to give specific input values. The webcrawler will parse the form and use any default values specified in the form for the other parameters.

Once the forms have been processed any cookies generated are then used by the webcrawler to authenticate its requests for content from the site.

Things to note

  • You may need to add an exclude_pattern for any "logout" links so the crawler does not log itself out when it starts crawling the authenticated content (see the sketch after this list).
  • You may need to manually specify a parameter if it is generated by Javascript, as the crawler does not currently interpret Javascript in forms.
  • You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to login to some sites.
  • You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.
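
As a sketch of the first and third notes above, the relevant collection.cfg entries might look like the following. This assumes the standard exclude_patterns setting; the pattern and user agent string are placeholders to adapt to your site:

exclude_patterns=/logout
crawler.user_agent=Mozilla/5.0 (compatible; acme-crawler)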

Any cookie collected during the authentication process will be set in a header for every request the crawler makes during the crawl. However, the crawler.accept_cookies setting is still effective: if you disable it, only the authentication cookies will be set; if you enable it, the crawler will collect cookies during the crawl in addition to the authentication cookies.

Note: Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to 'false'.
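
For example, to rely only on the cookies gathered during the form interaction step:

crawler.accept_cookies=false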

In-crawl authentication

In some situations, having the webcrawler 'pre-authenticate' by generating cookies at the start of a crawl may not be appropriate. If that is the case you may instead configure the crawler to try to log in to a specific form action URL during the course of the crawl (which may happen multiple times for different sites in your domain that use the same centralised authentication mechanism).

If you add the following setting to your collection.cfg file:

crawler.form_interaction_in_crawl=true

then the form_interaction.cfg file will still be parsed at the start of the crawl, with the following differences in behaviour:

  1. The first field should now contain the absolute URL for the form action (processing end-point), instead of the URL of the page containing the form (as used in pre-crawl mode).
  2. Any HTML encountered during the crawl which contains this form action will cause the crawler to submit the form details specified in that line.
  3. When using in-crawl authentication the value of the form_number field is ignored; however, since this field is still required to be present, a simple placeholder value such as 1 can be used.

So if the form_interaction.cfg file contained the following non-comment line:

https://sample.com/auth/login.jsp 1 parameters:[user=john&password=1234]

then if the crawler parsed a form that resulted in the same absolute action URL during the crawl it would submit the given values (in this case 'user' and 'password'). This simulates the behaviour of a human who browses to password protected content and is asked to authenticate using a form which submits the form details to "login.jsp". It also handles the situation where there may be a series of login/authentication URLs and redirects - as long as the crawler eventually downloads HTML containing the required form action then it will submit the required credentials.

Assuming the specified credentials are correct and the login (and subsequent redirects) succeed then the authenticated content will be downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process will be cached so that subsequent requests to that site do not require the webcrawler to login again.

Notes

  • If you are using in-crawl authentication then the first field in the configuration file must be the absolute URL for the entity processing the form submission.
  • A limitation of the current implementation is that only one in-crawl "form action" can be configured, which means that only the last action target found in the form_interaction.cfg file will be used, i.e. you should have only one non-comment line in your form_interaction.cfg file if using the in-crawl mode.
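
Putting the pieces together, a minimal in-crawl setup therefore consists of one setting in collection.cfg and a single non-comment line in form_interaction.cfg (reusing the illustrative values from the example above):

# collection.cfg
crawler.form_interaction_in_crawl=true

# form_interaction.cfg
https://sample.com/auth/login.jsp 1 parameters:[user=john&password=1234]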

Default value

Empty (No file specified)

Examples

Default location for a collection (assuming form_interaction.cfg file created via file manager):

crawler.form_interaction_file=$SEARCH_HOME/conf/$COLLECTION_NAME/form_interaction.cfg

If configuration file is in another location:

crawler.form_interaction_file=/another/location/form.cfg

Logging

All log messages relating to form interaction will be written at the start of the main crawl.log file in the offline or live "log" directory for the collection in question.

You can use the administration interface log viewer to view this file and debug issues with form interaction if required.

Debugging

In order to debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using a tool such as your browser's built-in developer tools to look at the network requests and responses (including cookie information) that are transmitted when you manually log in to the site in the browser.

You can then compare this network trace with the output in the crawl.log file. Some sample output is shown below:

Requested URL: https://identity.example.com/opensso/UI/Login

POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234

POST Status: Moved Temporarily

In this example, comparing the POST parameters with those in the browser trace showed that the "goto" parameter was different. Investigation of the HTML source of the login form showed that this parameter was being generated by some Javascript.

Since the crawler will not interpret Javascript, we would then need to explicitly add this parameter to the form_interaction.cfg file.
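
For instance, a hypothetical form_interaction.cfg entry that supplies the Javascript-generated "goto" parameter explicitly, alongside the credential fields shown in the trace above, might look like this (it assumes the login form is served from the requested URL in the log, and the values are illustrative):

https://identity.example.com/opensso/UI/Login 1 parameters:[IDToken2=username&IDToken1=1234&goto=http%3A%2F%2Fmy.example.com%2Fuser%2F]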

You can also look at running the webcrawler through a proxy so that you can view the traffic that the proxy is relaying. For example, running mitmproxy on the machine the crawler is running on:

mitmproxy -p 8090

and then setting the following parameters in your collection.cfg file:

http_proxy=127.0.0.1
http_proxy_port=8090

will cause the webcrawler to crawl through the proxy rather than directly connecting to the site(s) you are trying to crawl. Your proxy may then allow you to see the (un-encrypted) traffic to assist in debugging authentication issues.

Troubleshooting notes

  • Try to log in to the form with your browser, but with Javascript disabled. If that doesn't work then the crawler won't be able to process the form either, as it relies on Javascript execution.
  • Make sure all parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to the authentication, such as submit button values.
  • Make sure the crawler doesn't send extra empty parameters. For example, if your form has two submit inputs, "Login" and "Cancel", the value for "Cancel" should not be sent when the form is submitted. A regular browser will not send this value because the Cancel button is not clicked during login (only the "Login" button is), but the crawler must be explicitly told not to send it by setting the parameter to an empty value in the form_interaction.cfg file (see the sketch after this list).
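
As a sketch of the last two points, a hypothetical form_interaction.cfg entry for an ASPX-style login form with "Login" and "Cancel" submit buttons would carry the Login value but leave Cancel blank (the URL and field names are illustrative only):

https://sample.com/secure/login.aspx 1 parameters:[username=jsmith&password=secret&Login=Login&Cancel=]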

Running the form interaction process separately

You can also try running the form interaction processing separately from the crawl, on the command line. To do so, use the following command for your platform:

on Linux

cd $SEARCH_HOME
java -cp "bin/*:lib/java/all/*" com.funnelback.crawler.forms.FormInteraction $SEARCH_HOME/conf/<collection>/form_interaction.cfg

on Windows

cd %SEARCH_HOME%
java -cp "bin\*;lib\java\all\*" com.funnelback.crawler.forms.FormInteraction %SEARCH_HOME%\conf\<collection>\form_interaction.cfg

Settings

The use of this file will:

  • Override the content of the crawler.request_header parameter if this has been specified.
  • Switch crawling to the default HTTPClient library for its cookie support, overriding any explicit setting in the crawler.packages.httplib parameter.
