Specifies a path to a file which configures interaction with form-based authentication.
Can be set in: collection.cfg
This parameter specifies the path to an optional file which describes how to interact with HTML web forms. This can be used to support form-based authentication using cookies, allowing the webcrawler to log in to a secure area in order to crawl it.
There are two modes supported in this feature:
- Pre-Crawl Authentication (Default)
- In-Crawl Authentication
In the first mode the webcrawler logs in once at the start of the crawl in order to get a set of authentication cookie(s). These can then be used during the crawl in order to get access to authenticated content.
In the second mode the webcrawler submits form details during a crawl whenever a form with a specific "action" is encountered.
To configure form interaction you should go to the file manager page for a collection and create a form_interaction.cfg file. This will be created in the collection's configuration directory and the webcrawler will process this file if it exists.
In pre-crawl authentication the form_interaction.cfg file might contain the following content:
# Process the 1st form on the given page, and input the given values
https://sample.com/login 1 user=john&password=1234
# Process the 3rd form on this page
https://sample.com/client 3 ClientID=54321
The forms are processed in order: the crawler contacts each URL in turn and submits each form at the beginning of the crawl. Lines beginning with a # are treated as comments and ignored. The format of each line is:
form_url form_number parameters
- The form URL is the URL of the page containing the form, not the action URL for the script which processes the form.
- The form number specifies which form on the page to process, counting in the order of occurrence in the page. For pages with only one form, the value 1 should be used.
- The parameters are those for which you need to give specific values.
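The three-field line format described above can be sketched in Python. This is only an illustration of the format, not the webcrawler's own parser; the names FormRule and parse_line are hypothetical, and the sketch assumes the fields are separated by whitespace and that the parameters field contains no unescaped spaces.

```python
from typing import NamedTuple, Optional


class FormRule(NamedTuple):
    form_url: str      # URL of the page containing the form
    form_number: int   # 1-based position of the form on that page
    parameters: str    # raw third field, left unparsed here


def parse_line(line: str) -> Optional[FormRule]:
    """Return a FormRule for a config line, or None for comments and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Split on whitespace into exactly three fields. A line with fewer
    # fields raises ValueError in this sketch.
    form_url, form_number, parameters = stripped.split(None, 2)
    return FormRule(form_url, int(form_number), parameters)


config = """\
# Process the 1st form on the given page, and input the given values
https://sample.com/login 1 user=john&password=1234
# Process the 3rd form on this page
https://sample.com/client 3 ClientID=54321
"""

rules = [rule for rule in map(parse_line, config.splitlines()) if rule is not None]
for rule in rules:
    print(rule.form_url, rule.form_number, rule.parameters)
```

Running a check like this over a configuration file before a crawl is one way to catch lines with a missing field or a non-numeric form number.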
The format for the parameters in the 3rd field is a string of URL-escaped name=value pairs separated by the & character, stored inside parameters:[...], for example parameters:[user=john&password=1234].
Additionally, you may supply a blank value for a key if you do not want that field to be submitted.
For instance, if you wanted to encode:
You could do so like:
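The URL-escaping step itself can be sketched with Python's standard library. The field names and values below are hypothetical, chosen only because they contain characters (a space, & and =) that must be escaped inside a name=value pair. Escaping a space as %20 rather than + is an assumption; the text does not say which form the webcrawler expects.

```python
from urllib.parse import quote, urlencode

# Hypothetical field names and values containing characters that would
# otherwise be confused with the name=value and & separators.
fields = {
    "user": "john smith",
    "password": "p&ss=1",
}

# quote_via=quote escapes a space as %20 instead of the default "+".
encoded = urlencode(fields, quote_via=quote)
print(encoded)  # user=john%20smith&password=p%26ss%3D1
```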
You may not need to specify all form parameters, only those for which you need to give specific input values. The webcrawler will parse the form and use any default values specified in the form for the other parameters.
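The way configured values combine with a form's defaults can be sketched as follows. This is a simplified model under stated assumptions, not the webcrawler's implementation: FormDefaults and build_submission are hypothetical names, only input elements are considered, and the function takes the bare name=value string rather than the full third field. A blank configured value removes the field, as described above.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl


class FormDefaults(HTMLParser):
    """Collect the default values of <input> fields, grouped per form."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict of name -> default value per <form>
        self._current = None  # fields of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_submission(html, form_number, parameters):
    """Merge configured parameters over the defaults of the chosen form."""
    parser = FormDefaults()
    parser.feed(html)
    fields = dict(parser.forms[form_number - 1])  # form_number counts from 1
    for name, value in parse_qsl(parameters, keep_blank_values=True):
        if value == "":
            fields.pop(name, None)  # blank value: do not submit this field
        else:
            fields[name] = value
    return fields


page = """
<form action="/search"><input name="q" value=""></form>
<form action="/auth/login.jsp">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
  <input type="password" name="password">
  <input type="checkbox" name="remember" value="yes">
</form>
"""

print(build_submission(page, 2, "user=john&password=1234&remember="))
# {'token': 'abc123', 'user': 'john', 'password': '1234'}
```

Here the hidden token field keeps its default because the configuration does not mention it, which is why only fields needing specific values have to be listed.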
Once the forms have been processed any cookies generated are then used by the webcrawler to authenticate its requests for content from the site.
Things to note
- You may need to add an exclude_pattern for any "logout" links so the crawler does not log itself out when it starts crawling the authenticated content.
- You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to log in to some sites.
- You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.
Any cookie collected during the authentication process will be set in a header for every request the crawler makes during the crawl. However, the crawler.accept_cookies setting is still effective: if you disable it, only the authentication cookie will be set, and if you enable it, the crawler will collect cookies during the crawl in addition to the authentication cookie.
Note: Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to 'false'.
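The interaction between authentication cookies and crawler.accept_cookies can be pictured with a toy model. This is not the crawler's cookie handling; cookies_for_request and the cookie names are hypothetical, and the sketch assumes that a cookie collected during the crawl replaces an authentication cookie of the same name.

```python
def cookies_for_request(auth_cookies, crawl_cookies, accept_cookies):
    """Return the cookies sent with a request under this simplified model."""
    cookies = dict(auth_cookies)  # authentication cookies are always sent
    if accept_cookies:
        # Cookies collected during the crawl are added and, on a name clash,
        # replace the authentication cookie of the same name.
        cookies.update(crawl_cookies)
    return cookies


auth = {"SESSIONID": "authenticated-session"}
seen_in_crawl = {"SESSIONID": "anonymous-session", "lang": "en"}

print(cookies_for_request(auth, seen_in_crawl, accept_cookies=True))
# {'SESSIONID': 'anonymous-session', 'lang': 'en'}
print(cookies_for_request(auth, seen_in_crawl, accept_cookies=False))
# {'SESSIONID': 'authenticated-session'}
```

Under this model the first call loses the authenticated session, which is the situation that setting crawler.accept_cookies to 'false' is meant to avoid.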
In some situations you may decide that having the webcrawler 'pre-authenticate' by generating cookies at the start of a crawl is not appropriate. If that is the case you may instead configure the crawler to try to log in to a specific form action URL during the course of the crawl (which may happen multiple times for different sites in your domain which use the same centralised authentication mechanism).
When in-crawl authentication is enabled, the form_interaction.cfg file will still be parsed at the start of the crawl, with the following differences in behaviour:
- The first field should now contain the absolute URL for the form action (processing end-point), instead of the URL of the form itself
- Any HTML encountered during the crawl which contains this form action will cause the crawler to submit the form details specified in that line
- When using in-crawl authentication the value of the form_number field is ignored. However, since this field is required to be present, a simple value such as 1 can be used.
So if the form_interaction.cfg file contained the following non-comment line:
https://sample.com/auth/login.jsp 1 parameters:[user=john&password=1234]
then, if the crawler parsed a form that resulted in the same absolute action URL during the crawl, it would submit the given values (in this case 'user' and 'password'). This simulates the behaviour of a human who browses to password-protected content and is asked to authenticate using a form which submits the form details to "login.jsp". It also handles the situation where there may be a series of login/authentication URLs and redirects: as long as the crawler eventually downloads HTML containing the required form action, it will submit the required credentials.
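The matching step can be sketched as resolving each form's action attribute against the URL of the page it was found on and comparing the result with the configured action URL. This is an illustration only; FormActions and page_triggers_login are hypothetical names, the page URLs are made up, and exact string comparison of the resolved URLs is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActions(HTMLParser):
    """Collect the absolute action URL of every <form> in a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            action = dict(attrs).get("action") or ""
            # An empty or relative action is resolved against the page URL.
            self.actions.append(urljoin(self.page_url, action))


def page_triggers_login(page_url, html, configured_action):
    """True if any form on the page submits to the configured action URL."""
    parser = FormActions(page_url)
    parser.feed(html)
    return configured_action in parser.actions


html = '<form method="post" action="login.jsp"><input name="user"></form>'
configured = "https://sample.com/auth/login.jsp"

print(page_triggers_login("https://sample.com/auth/start", html, configured))   # True
print(page_triggers_login("https://sample.com/other/start", html, configured))  # False
```

The same relative action resolves to different absolute URLs depending on where the page lives, which is why the configuration must hold the absolute action URL.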
Assuming the specified credentials are correct and the login (and subsequent redirects) succeed, the authenticated content will be downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process will be cached so that subsequent requests to that site do not require the webcrawler to log in again.
- If you are using in-crawl authentication then the first field in the configuration file must be the absolute URL for the entity processing the form submission.
- A limitation of the current implementation is that only one in-crawl "form action" can be configured, which means that only the last action target found in the form_interaction.cfg file will be used, i.e. you should only have one non-comment line in your form_interaction.cfg file if using the in-crawl mode.
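The "last line wins" limitation can be illustrated with a short Python sketch. effective_in_crawl_line is a hypothetical helper, not part of the product, and the first entry in the sample configuration is invented for the example.

```python
def effective_in_crawl_line(config_text):
    """Return the last non-comment, non-blank line, or None if there is none."""
    effective = None
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            effective = stripped  # later lines replace earlier ones
    return effective


config_text = """\
# Hypothetical earlier entry; it is overridden by the line below
https://sample.com/old/login.jsp 1 parameters:[user=old&password=0000]
https://sample.com/auth/login.jsp 1 parameters:[user=john&password=1234]
"""

print(effective_in_crawl_line(config_text))
# https://sample.com/auth/login.jsp 1 parameters:[user=john&password=1234]
```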
Default location of the file:
If the configuration file is in another location: