Crawling HTTPS websites
Some websites are set up to be accessed using HTTP over the Secure Sockets Layer (HTTPS) rather than plain HTTP. This means that traffic between the client and the web server is encrypted, allowing for the secure transfer of data. However, for Funnelback to successfully search sites like these, several configuration steps must be taken.
Crawler HTTPS configuration
A number of configuration parameters permit the crawler (Funnelback) to gather pages via HTTPS. The relevant parameters are:
Required parameter settings
- crawler.packages.httplib=HTTPClient or crawler.packages.httplib=JavaHttp: The HTTPClient library (the default) supports HTTPS. The JavaHttp library is an alternative if you require access to sites protected by NTLM authentication.
- crawler.protocols=https,...: Including https in this parameter is essential; otherwise all https URLs will be rejected by the exclusion rules.
Most sites can be crawled satisfactorily with just these two parameters set as above.
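For example, the two settings above can be placed in collection.cfg like this (a sketch; any other parameters already in the file are unaffected):

```
# Use the default HTTPClient library, which supports HTTPS
crawler.packages.httplib=HTTPClient

# Include https so that https URLs pass the exclusion rules
crawler.protocols=http,https
```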
In addition, the parameter crawler.sslTrustEveryone is set to "true" by default. This setting causes the crawler to accept invalid certificate chains (both client and server) and to skip host name verification. If you are crawling sites that have valid signed certificate chains, you may wish to set this to "false".
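For example, to enforce full certificate validation once every target site presents a valid chain (a collection.cfg sketch):

```
# Reject invalid certificate chains and enforce host name verification
crawler.sslTrustEveryone=false
```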
Note: The crawler.ssl* parameters are supported by the HTTPClient library only.
Troubleshooting HTTPClient SSL operations
Any problems with root certificate validation will be reported in the crawl.log, like this:
HTTPClientTimedRequest: Error: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated: https://<failed url>
HTTPClientTimedRequest: Error: javax.net.ssl.SSLException: Name in certificate ‘<hostname-1>’ does not match host name ‘<hostname-2>’: https://<failed hostname-2 url>.
The first can occur if there is something wrong with the server certificate chain, such as a missing or unknown certificate authority. The second often occurs when virtual server names are not included on server certificates.
Further details on run-time certificate validation can be obtained by appending -Djavax.net.debug=ssl:handshake to the java_options parameter, which will show details of the trust store used and any certificate chains presented. To avoid being swamped with output, tackle one failed certificate at a time.
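For example, the debug flag can be appended in collection.cfg (the memory option shown stands in for whatever java_options already contains):

```
java_options=-Xmx512m -Djavax.net.debug=ssl:handshake
```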
Having identified the problem, if the missing certificate chain is available it can be added to a trust store using Java's keytool. That trust store can then be used via the crawler.sslTrustStore parameter. Note, however, that it will replace the default Java trust store, so the default certificates will be unavailable. An alternative is to copy the default Java trust store and add the new certificate(s) to that copy (all using keytool), then use the updated copy.

The crawler.sslClientStore and crawler.sslClientStorePassword parameters are provided if client certificate validation is required by a server. Again, a JKS keystore can be built using Java's keytool. The crawler.sslClientStorePassword may be required for internal validation of the client certificate store (private keys) at crawler start-up.
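As a sketch of the copy-and-extend approach, assuming a typical JDK layout (the paths, alias, certificate file name, and the common cacerts default password "changeit" are assumptions to verify for your install):

```
# Copy the default Java trust store so its bundled certificates remain available
cp "$JAVA_HOME/lib/security/cacerts" my-truststore.jks

# Add the missing certificate chain to the copy
keytool -importcert -keystore my-truststore.jks -storepass changeit \
        -alias internal-ca -file internal-ca.pem -noprompt
```

The updated copy can then be referenced via the crawler.sslTrustStore parameter.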
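For example, a client certificate exported in PKCS#12 format can be converted to a JKS keystore and referenced in collection.cfg (the file names, store type, and password here are illustrative):

```
# Convert a PKCS#12 client certificate (with private key) to a JKS keystore
keytool -importkeystore -srckeystore client-cert.p12 -srcstoretype PKCS12 \
        -destkeystore client-store.jks -deststorepass mysecret
```

Then in collection.cfg:

```
crawler.sslClientStore=/path/to/client-store.jks
crawler.sslClientStorePassword=mysecret
```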
If you see the following type of error message in your crawler log files:
javax.net.ssl.SSLProtocolException: handshake alert: unrecognized_name https://example.com/
javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLProtocolException: handshake alert: unrecognized_name
then this may be caused by the web server not handling SSL/TLS extensions correctly, or by it using a type of encryption that is not supported by Java. In this case you can crawl these types of sites by adding -Djsse.enableSNIExtension=false to the java_options setting in your collection.cfg file. For more information on this setting, please see the JSSE Reference Guide.
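For example (the memory option shown stands in for whatever java_options already contains):

```
java_options=-Xmx512m -Djsse.enableSNIExtension=false
```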