Web crawler configuration options
Web crawler options
The web crawler has a comprehensive set of configuration options that can be used to adjust how it operates.
General options
Option | Description | Default |
---|---|---|
crawler.num_crawlers | Number of crawler threads which simultaneously crawl different hosts. | 20 |
crawler.request_delay | Milliseconds between HTTP requests (for a specific crawler thread). | 250 |
crawler.user_agent | The user agent string that the web crawler identifies itself as when making HTTP requests. | Mozilla/5.0 (compatible; Funnelback) |
crawler.server_alias_file | Path to optional file containing server alias mappings. See: server aliases | |
crawler.classes.RevisitPolicy | Java class used for enforcing the revisit policy for URLs | com.funnelback.common.revisit.AlwaysRevisitPolicy |
crawler.revisit.edit_distance_threshold | Threshold for edit distance between two versions of a page when deciding whether it has changed or not when using the SimpleRevisitPolicy. | 20 |
crawler.revisit.num_times_revisit_skipped_threshold | Threshold for number of times a page revisit has been skipped when deciding whether to revisit it when using the SimpleRevisitPolicy. | 2 |
crawler.revisit.num_times_unchanged_threshold | Threshold for number of times a page has been unchanged when deciding whether to revisit it when using the SimpleRevisitPolicy. | 5 |
data_report | Specifies if data reports should be generated for the crawl. | true |
vital_servers | Specifies a list of servers that must be present in the crawl for a successful update. | |
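These general options are set as key=value lines in the collection's configuration (collection.cfg). The snippet below is a minimal illustrative sketch only; the thread count, delay and server name are hypothetical examples, not recommended values.

```
# Illustrative general crawler settings (hypothetical values)
crawler.num_crawlers=10
crawler.request_delay=500
crawler.user_agent=Mozilla/5.0 (compatible; Funnelback)
data_report=true
# www.example.com is a placeholder vital server
vital_servers=www.example.com
```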
Options controlling what gets included
Option | Description | Default |
---|---|---|
include_patterns | URLs matching this are included in the crawl (unless they also match any exclude_patterns). | |
exclude_patterns | URLs matching this are excluded from the crawl. | /cgi-bin,/vti,/_vti,calendar,SQ_DESIGN_NAME=print,SQ_ACTION=logout,SQ_PAINT_LAYOUT_NAME=,%3E%3C/script%3E,google-analytics.com |
crawler.use_sitemap_xml | Specifies if sitemap.xml files should be processed during a web crawl. | false |
crawler.start_urls_file | Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start_url that the crawler is passed on startup (usually stored in the crawler.start_url configuration option). | collection.cfg.start.urls |
start_url | Crawler seed URL. Crawler follows links in this page, and then the links of those pages and so on. | _disabled__see_start_urls_file |
crawler.protocols | Crawl URLs via these protocols (comma separated list). | http,https |
crawler.reject_files | Do not crawl files with these extensions. | asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip,Z |
crawler.accept_files | Only crawl files with these extensions. Not normally used - default is to accept all valid content. | |
crawler.store_all_types | If true, override accept/reject rules and crawl and store all file types encountered. | false |
crawler.store_empty_content_urls | If true, store URLs even if, after filtering, they contain no content. | false |
crawler.non_html | Specifies non-HTML file formats to filter, based on the file extension (e.g. pdf, doc, xls) | doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm |
crawler.allowed_redirect_pattern | Specify a regex to allow crawler redirections that would otherwise be disallowed by the current include/exclude patterns. | |
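To illustrate how the inclusion options above combine, the sketch below limits a crawl to a single site while excluding a few URL patterns. The site name is a placeholder and the exclude values are drawn from the defaults listed above; this is an example, not a recommended configuration.

```
# Illustrative include/exclude configuration (example.com is hypothetical)
start_url=https://www.example.com/
include_patterns=www.example.com
exclude_patterns=/cgi-bin,/_vti,calendar,SQ_ACTION=logout
crawler.protocols=http,https
crawler.use_sitemap_xml=true
```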
Options controlling link extraction
Option | Description | Default |
---|---|---|
crawler.parser.mimeTypes | Extract links from these comma-separated or regexp: content-types. | text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml |
crawler.extract_links_from_javascript | Whether to extract links from JavaScript while crawling. | false |
crawler.follow_links_in_comments | Whether to follow links in HTML comments while crawling. | false |
crawler.link_extraction_group | The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL. | |
crawler.link_extraction_regular_expression | The expression used to extract links from each document. This must be a Perl-compatible regular expression. | |
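The two crawler.link_extraction_* options work as a pair: the regular expression matches candidate links and crawler.link_extraction_group selects which capture group holds the URL. A hedged sketch, assuming pages carry links in a custom data-href attribute (the attribute name and pattern are hypothetical):

```
# Illustrative custom link extraction (the data-href attribute is hypothetical)
# Capture group 1 contains the URL, so the group option is set to 1.
crawler.link_extraction_regular_expression=data-href="([^"]+)"
crawler.link_extraction_group=1
```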
Options controlling size limits and timeouts
Option | Description | Default |
---|---|---|
crawler.max_dir_depth | A URL with more than this many subdirectories will be ignored (too deep, probably a crawler trap) | 15 |
crawler.max_download_size | Maximum size of files the crawler will download (in MB). | 10 |
crawler.max_files_per_area | Maximum files per area, e.g. the number of files in one directory or generated by one dynamic generator such as index.asp?doc=123. This parameter used to be called crawler.max_dir_size. | 10000 |
crawler.max_files_per_server | Maximum files per server (default (empty) is unlimited) | |
crawler.max_files_stored | Maximum number of files to download (unlimited if empty or less than 1) | |
crawler.max_link_distance | How far to crawl from the start_url (default is unlimited). e.g. if crawler.max_link_distance = 1, only crawl the links on start_url. NB: Turning this on drops the crawler to single-threaded operation. | |
crawler.max_parse_size | Crawler will not parse documents beyond this many megabytes in size | 10 |
crawler.max_url_length | A URL with more characters than this will be ignored (too long, probably a crawler trap) | 256 |
crawler.max_url_repeating_elements | A URL with more than this many repeating elements (directories) will be ignored (probably a crawler trap or incorrectly configured web server) | 5 |
crawler.overall_crawl_timeout | Maximum crawl time after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter. | 24 |
crawler.overall_crawl_units | The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes. | hr |
crawler.request_timeout | Timeout for HTTP page GETs (milliseconds) | 15000 |
crawler.max_timeout_retries | Maximum number of times to retry after a network timeout (default is 0) | 0 |
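As an example of how the size and timeout options above might be combined, the snippet below raises the download limit and request timeout; the figures are illustrative assumptions rather than recommended limits.

```
# Illustrative size and timeout limits (hypothetical values)
crawler.max_download_size=25
crawler.max_files_per_server=50000
crawler.request_timeout=30000
crawler.max_timeout_retries=2
crawler.overall_crawl_timeout=12
crawler.overall_crawl_units=hr
```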
Authentication options
Option | Description | Default |
---|---|---|
crawler.allow_concurrent_in_crawl_form_interaction | Enable/disable concurrent processing of in-crawl form interaction. | true |
crawler.form_interaction.pre_crawl.groupId.url | Specify a URL of the page containing the HTML web form in pre_crawl authentication mode | |
crawler.form_interaction.in_crawl.groupId.url_pattern | Specify a URL or URL pattern of the page containing the HTML web form in in_crawl authentication mode | |
crawler.ntlm.domain | NTLM domain to be used for web crawler authentication. | |
crawler.ntlm.password | NTLM password to be used for web crawler authentication. | |
crawler.ntlm.username | NTLM username to be used for web crawler authentication. | |
ftp_passwd | Password to use when gathering content from an FTP server. | |
ftp_user | Username to use when gathering content from an FTP server. | |
http_passwd | Password used for accessing password protected content during a crawl. | |
http_user | Username used for accessing password protected content during a crawl. | |
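As a rough sketch of how the authentication options fit together, the snippet below shows NTLM credentials alongside a pre-crawl form login. The domain, credentials, URL and the groupId ("login") are hypothetical placeholders.

```
# Illustrative authentication settings (all values are placeholders)
crawler.ntlm.domain=EXAMPLEDOMAIN
crawler.ntlm.username=crawl-user
crawler.ntlm.password=examplePassword
# "login" is an arbitrary groupId chosen for this example
crawler.form_interaction.pre_crawl.login.url=https://www.example.com/login
```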
Web crawler monitor options
Option | Description | Default |
---|---|---|
crawler.monitor_authentication_cookie_renewal_interval | Optional time interval at which to renew crawl authentication cookies | |
crawler.monitor_checkpoint_interval | Time interval at which to checkpoint (seconds). | 1800 |
crawler.monitor_delay_type | Type of delay to use during crawl (dynamic or fixed). | dynamic |
crawler.monitor_halt | Checked during a crawl - if set to true then the crawler will shut down cleanly. | false |
crawler.monitor_preferred_servers_list | Optional list of servers to prefer during crawl. | |
crawler.monitor_time_interval | Time interval at which to output monitoring information (seconds). | 30 |
crawler.monitor_url_reject_list | Optional parameter listing URLs to reject during a running crawl. | |
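The monitor options are usually left at their defaults, but they can be tuned, for example to checkpoint and report more frequently. The values below are illustrative only.

```
# Illustrative crawl monitoring settings (hypothetical values)
crawler.monitor_checkpoint_interval=900
crawler.monitor_delay_type=dynamic
crawler.monitor_time_interval=60
```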
HTTP options
Option | Description | Default |
---|---|---|
http_proxy | The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'. | |
http_proxy_passwd | The proxy password to be used during crawling. | |
http_proxy_port | Port of HTTP proxy used during crawling. | |
http_proxy_user | The proxy user name to be used during crawling. | |
http_source_host | IP address or hostname used by crawler, on a machine with more than one available. | |
crawler.request_header | Optional additional header to be inserted in HTTP(S) requests made by the webcrawler. | |
crawler.request_header_url_prefix | Optional URL prefix to be applied when processing the crawler.request_header parameter. | |
crawler.store_headers | Write HTTP header information at the top of HTML files if true. Header information is used by the indexer. | true |
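For instance, routing the crawl through an authenticated proxy could look like the following sketch; the proxy host, port and credentials are hypothetical.

```
# Illustrative proxy configuration (host and credentials are placeholders)
http_proxy=proxy.example.com
http_proxy_port=3128
http_proxy_user=proxyuser
http_proxy_passwd=proxypass
```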
Logging options
Option | Description | Default |
---|---|---|
crawler.verbosity | Verbosity level (0-6) of crawler logs. Higher number results in more messages. | 4 |
crawler.header_logging | Option to control whether HTTP headers are written out to a separate log file (default is false). | false |
crawler.incremental_logging | Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling | false |
crawler.logfile | The crawler's log path and filename. | $SEARCH_HOME/data/$COLLECTION_NAME/offline/log/crawl.log |
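A quick illustration of raising crawler log detail while debugging a crawl; these values are examples, not defaults to adopt.

```
# Illustrative logging settings for debugging (hypothetical values)
crawler.verbosity=6
crawler.header_logging=true
crawler.incremental_logging=true
```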
Web crawler advanced options
Option | Description | Default |
---|---|---|
crawler | The name of the crawler binary. | com.funnelback.crawler.FunnelBack |
crawler_binaries | Location of the crawler files. | |
crawler.accept_cookies | Cookie policy, i.e. whether the crawler accepts cookies. Requires HTTPClient if true. | true |
crawler.cache.DNSCache_max_size | Maximum size of internal DNS cache. Upon reaching this size the cache will drop old elements. | 200000 |
crawler.cache.LRUCache_max_size | Maximum size of LRUCache. Upon reaching this size the cache will drop old elements. | 500000 |
crawler.cache.URLCache_max_size | Maximum size of URLCache. May be ignored by some cache implementations. | 50000000 |
crawler.check_alias_exists | Check if aliased URLs exist - if not, revert to the original URL | false |
crawler.checkpoint_to | Location of crawler checkpoint files. | $SEARCH_HOME/data/$COLLECTION_NAME/offline/checkpoint |
crawler.classes.Crawler | Java class used by the crawler - defines top-level behaviour, which protocols are supported, etc. | com.funnelback.crawler.NetCrawler |
crawler.classes.Frontier | Java class used for the frontier (a list of URLs not yet visited). | com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000 |
crawler.classes.Policy | Java class used for enforcing the include/exclude policy for URLs | com.funnelback.crawler.StandardPolicy |
crawler.classes.statistics | List of statistics classes to use during a crawl in order to generate figures for data reports | CrawlSizeStatistic,MIMETypeStatistic,BroadMIMETypeStatistic,FileSizeStatistic,FileSizeByDocumentTypeStatistic,SuffixTypeStatistic,ReferencedFileTypeStatistic,URLlengthStatistic,WebServerTypeStatistic,BroadWebServerTypeStatistic |
crawler.classes.URLStore | Java class used to store content on disk e.g. create a mirror of files crawled. | com.funnelback.common.store.WarcStore |
crawler.cookie_jar_file | File containing cookies to be pre-loaded when a web crawl begins. | $SEARCH_HOME/conf/$COLLECTION_NAME/cookies.txt |
crawler.eliminate_duplicates | Whether to eliminate duplicate documents while crawling (default is true) | true |
crawler.frontier_num_top_level_dirs | Optional setting to specify number of top level directories to store disk based frontier files in. | |
crawler.frontier_use_ip_mapping | Whether to map hosts to frontiers based on IP address. (default is false) | false |
crawler.frontier_hosts | List of hosts running crawlers when performing a distributed web crawl. | |
crawler.frontier_port | Port on which the DistributedFrontier will listen. | |
crawler.max_individual_frontier_size | Maximum size of an individual frontier (unlimited if not defined) | |
crawler.inline_filtering_enabled | Option to control whether text extraction from binary files is done inline during a web crawl | true |
crawler.lowercase_iis_urls | Whether to lowercase all URLs from IIS web servers (default is false) | false |
crawler.predirects_enabled | Enable crawler predirects (boolean). See: crawler predirects | |
crawler.remove_parameters | Optional list of parameters to remove from URLs. | |
crawler.robotAgent | Robot agent name matched against entries in a robots.txt file. Matching is case-insensitive over the length of the name. | Funnelback |
crawler.secondary_store_root | Location of secondary (previous) store - used in incremental crawling | $SEARCH_HOME/data/$COLLECTION_NAME/live/data |
crawler.sslClientStore | Path to a SSL Client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed by Java's keytool. | |
crawler.sslClientStorePassword | Password for the SSL Client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed by Java's keytool. | |
crawler.sslTrustEveryone | Trust ALL Root Certificates and ignore server hostname verification if true. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing. | true |
crawler.sslTrustStore | Path to a SSL Trusted Root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed by Java's keytool. | |
crawler.send-http-basic-credentials-without-challenge | This option controls whether or not Funnelback sends any HTTP credentials along with every request. | true |
schedule.incremental_crawl_ratio | The number of scheduled incremental crawls that are performed between each full crawl (e.g. a value of 10 results in ten incremental crawls being followed by a full crawl). | 10 |
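Finally, as an illustrative sketch of a common advanced adjustment, the snippet below schedules a full crawl after every five incremental crawls and points the crawler at an SSL client certificate store; the ratio, path and password are hypothetical examples.

```
# Illustrative advanced settings (ratio, path and password are placeholders)
schedule.incremental_crawl_ratio=5
crawler.sslClientStore=$SEARCH_HOME/conf/example-collection/client-keystore.jks
crawler.sslClientStorePassword=changeit
crawler.sslTrustEveryone=false
```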