Web crawler configuration options
Web crawler options
The web crawler has a comprehensive set of configuration options that can be used to adjust how it operates.
General options
Option | Description | Default |
---|---|---|
crawler.num_crawlers | Number of crawler threads which simultaneously crawl different hosts. | 20 |
crawler.request_delay | Milliseconds between HTTP requests (for a specific crawler thread). | 250 |
crawler.user_agent | The user agent string that the web crawler identifies itself as when making HTTP requests. | Mozilla/5.0 (compatible; Funnelback) |
crawler.server_alias_file | Path to optional file containing server alias mappings. See: server aliases | |
crawler.classes.RevisitPolicy | Java class used for enforcing the revisit policy for URLs | com.funnelback.common.revisit.AlwaysRevisitPolicy |
crawler.revisit.edit_distance_threshold | Threshold for edit distance between two versions of a page when deciding whether it has changed or not when using the SimpleRevisitPolicy. | 20 |
crawler.revisit.num_times_revisit_skipped_threshold | Threshold for number of times a page revisit has been skipped when deciding whether to revisit it when using the SimpleRevisitPolicy. | 2 |
crawler.revisit.num_times_unchanged_threshold | Threshold for number of times a page has been unchanged when deciding whether to revisit it when using the SimpleRevisitPolicy. | 5 |
data_report | Specifies if data reports should be generated for the crawl. | true |
vital_servers | Specifies a list of servers that must be present in the crawl for a successful update. | |
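These general options are set as key=value lines in the collection's configuration (collection.cfg). The snippet below is a minimal illustrative sketch only; the thread count, delay and server name are hypothetical examples, not recommended values.

```
# Illustrative general crawler settings (hypothetical values)
crawler.num_crawlers=10
crawler.request_delay=500
crawler.user_agent=Mozilla/5.0 (compatible; Funnelback)
data_report=true
# www.example.com is a placeholder vital server
vital_servers=www.example.com
```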
Options controlling what gets included
Option | Description | Default |
---|---|---|
include_patterns | URLs matching this are included in the crawl (unless they also match any exclude_patterns). | |
exclude_patterns | URLs matching this are excluded from the crawl. | /cgi-bin,/vti,/_vti,calendar,SQ_DESIGN_NAME=print,SQ_ACTION=logout,SQ_PAINT_LAYOUT_NAME=,%3E%3C/script%3E,google-analytics.com |
crawler.use_sitemap_xml | Specifies if sitemap.xml files should be processed during a web crawl. | false |
crawler.start_urls_file | Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start_url that the crawler is passed on startup (usually stored in the crawler.start_url configuration option). | collection.cfg.start.urls |
start_url | Crawler seed URL. Crawler follows links in this page, and then the links of those pages and so on. | _disabled__see_start_urls_file |
crawler.protocols | Crawl URLs via these protocols (comma separated list). | http,https |
crawler.reject_files | Do not crawl files with these extensions. | asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip,Z |
crawler.accept_files | Only crawl files with these extensions. Not normally used - default is to accept all valid content. | |
crawler.store_all_types | If true, override accept/reject rules and crawl and store all file types encountered. | false |
crawler.store_empty_content_urls | If true, store URLs even if, after filtering, they contain no content. | false |
crawler.non_html | Specifies non-HTML file formats to filter, based on the file extension (e.g. pdf, doc, xls) | doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm |
crawler.allowed_redirect_pattern | Specify a regex to allow crawler redirections that would otherwise be disallowed by the current include/exclude patterns. | |
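To illustrate how the inclusion options above combine, the sketch below limits a crawl to a single site while excluding a few URL patterns. The site name is a placeholder and the exclude values are drawn from the defaults listed above; this is an example, not a recommended configuration.

```
# Illustrative include/exclude configuration (example.com is hypothetical)
start_url=https://www.example.com/
include_patterns=www.example.com
exclude_patterns=/cgi-bin,/_vti,calendar,SQ_ACTION=logout
crawler.protocols=http,https
crawler.use_sitemap_xml=true
```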
Options controlling link extraction
Option | Description | Default |
---|---|---|
crawler.parser.mimeTypes | Extract links from these comma-separated or regexp: content-types. | text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml |
crawler.extract_links_from_javascript | Whether to extract links from JavaScript while crawling. | false |
crawler.follow_links_in_comments | Whether to follow links in HTML comments while crawling. | false |
crawler.link_extraction_group | The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL. | |
crawler.link_extraction_regular_expression | The expression used to extract links from each document. This must be a Perl-compatible regular expression. | |
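The two crawler.link_extraction_* options work as a pair: the regular expression matches candidate links and crawler.link_extraction_group selects which capture group holds the URL. A hedged sketch, assuming pages carry links in a custom data-href attribute (the attribute name and pattern are hypothetical):

```
# Illustrative custom link extraction (the data-href attribute is hypothetical)
# Capture group 1 contains the URL, so the group option is set to 1.
crawler.link_extraction_regular_expression=data-href="([^"]+)"
crawler.link_extraction_group=1
```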
Options controlling size limits and timeouts
Option | Description | Default |
---|---|---|
crawler.max_dir_depth | A URL with more than this many subdirectories will be ignored (too deep, probably a crawler trap) | 15 |
crawler.max_download_size | Maximum size of files the crawler will download (in MB). | 10 |
crawler.max_files_per_area | Maximum files per area, e.g. the number of files in one directory or generated by one dynamic generator such as index.asp?doc=123. This parameter used to be called crawler.max_dir_size. | 10000 |
crawler.max_files_per_server | Maximum files per server (default (empty) is unlimited) | |
crawler.max_files_stored | Maximum number of files to download (unlimited if empty or less than 1) | |
crawler.max_link_distance | How far to crawl from the start_url (default is unlimited). e.g. if crawler.max_link_distance = 1, only crawl the links on start_url. NB: Turning this on drops the crawler to single-threaded operation. | |
crawler.max_parse_size | Crawler will not parse documents beyond this many megabytes in size | 10 |
crawler.max_url_length | A URL with more characters than this will be ignored (too long, probably a crawler trap) | 256 |
crawler.max_url_repeating_elements | A URL with more than this many repeating elements (directories) will be ignored (probably a crawler trap or incorrectly configured web server) | 5 |
crawler.overall_crawl_timeout | Maximum crawl time after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter. | 24 |
crawler.overall_crawl_units | The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes. | hr |
crawler.request_timeout | Timeout for HTTP page GETs (milliseconds) | 15000 |
crawler.max_timeout_retries | Maximum number of times to retry after a network timeout (default is 0) | 0 |
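As an example of how the size and timeout options above might be combined, the snippet below raises the download limit and request timeout; the figures are illustrative assumptions rather than recommended limits.

```
# Illustrative size and timeout limits (hypothetical values)
crawler.max_download_size=25
crawler.max_files_per_server=50000
crawler.request_timeout=30000
crawler.max_timeout_retries=2
crawler.overall_crawl_timeout=12
crawler.overall_crawl_units=hr
```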
Authentication options
Option | Description | Default |
---|---|---|
crawler.allow_concurrent_in_crawl_form_interaction | Enable/disable concurrent processing of in-crawl form interaction. | true |
crawler.form_interaction.pre_crawl.groupId.url | Specify a URL of the page containing the HTML web form in pre_crawl authentication mode | |
crawler.form_interaction.in_crawl.groupId.url_pattern | Specify a URL or URL pattern of the page containing the HTML web form in in_crawl authentication mode | |
crawler.ntlm.domain | NTLM domain to be used for web crawler authentication. | |
crawler.ntlm.password | NTLM password to be used for web crawler authentication. | |
crawler.ntlm.username | NTLM username to be used for web crawler authentication. | |
ftp_passwd | Password to use when gathering content from an FTP server. | |
ftp_user | Username to use when gathering content from an FTP server. | |
http_passwd | Password used for accessing password protected content during a crawl. | |
http_user | Username used for accessing password protected content during a crawl. | |
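As a rough sketch of how the authentication options fit together, the snippet below shows NTLM credentials alongside a pre-crawl form login. The domain, credentials, URL and the groupId ("login") are hypothetical placeholders.

```
# Illustrative authentication settings (all values are placeholders)
crawler.ntlm.domain=EXAMPLEDOMAIN
crawler.ntlm.username=crawl-user
crawler.ntlm.password=examplePassword
# "login" is an arbitrary groupId chosen for this example
crawler.form_interaction.pre_crawl.login.url=https://www.example.com/login
```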
Web crawler monitor options
Option | Description | Default |
---|---|---|
crawler.monitor_authentication_cookie_renewal_interval | Optional time interval at which to renew crawl authentication cookies | |
crawler.monitor_checkpoint_interval | Time interval at which to checkpoint (seconds). | 1800 |
crawler.monitor_delay_type | Type of delay to use during crawl (dynamic or fixed). | dynamic |
crawler.monitor_halt | Checked during a crawl - if set to true then the crawler will shut down cleanly. | false |
crawler.monitor_preferred_servers_list | Optional list of servers to prefer during crawl. | |
crawler.monitor_time_interval | Time interval at which to output monitoring information (seconds). | 30 |
crawler.monitor_url_reject_list | Optional parameter listing URLs to reject during a running crawl. | |
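The monitor options are usually left at their defaults, but they can be tuned, for example to checkpoint and report more frequently. The values below are illustrative only.

```
# Illustrative crawl monitoring settings (hypothetical values)
crawler.monitor_checkpoint_interval=900
crawler.monitor_delay_type=dynamic
crawler.monitor_time_interval=60
```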
HTTP options
Option | Description | Default |
---|---|---|
http_proxy | The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'. | |
http_proxy_passwd | The proxy password to be used during crawling. | |
http_proxy_port | Port of HTTP proxy used during crawling. | |
http_proxy_user | The proxy user name to be used during crawling. | |
http_source_host | IP address or hostname used by crawler, on a machine with more than one available. | |
crawler.request_header | Optional additional header to be inserted in HTTP(S) requests made by the webcrawler. | |
crawler.request_header_url_prefix | Optional URL prefix to be applied when processing the crawler.request_header parameter. | |
crawler.store_headers | Write HTTP header information at the top of HTML files if true. Header information is used by the indexer. | true |
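For instance, routing the crawl through an authenticated proxy could look like the following sketch; the proxy host, port and credentials are hypothetical.

```
# Illustrative proxy configuration (host and credentials are placeholders)
http_proxy=proxy.example.com
http_proxy_port=3128
http_proxy_user=proxyuser
http_proxy_passwd=proxypass
```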
Logging options
Option | Description | Default |
---|---|---|
crawler.verbosity | Verbosity level (0-6) of crawler logs. Higher number results in more messages. | 4 |
crawler.header_logging | Option to control whether HTTP headers are written out to a separate log file (default is false). | false |
crawler.incremental_logging | Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling | false |
crawler.logfile | The crawler's log path and filename. | $SEARCH_HOME/data/$COLLECTION_NAME/offline/log/crawl.log |
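A quick illustration of raising crawler log detail while debugging a crawl; these values are examples, not defaults to adopt.

```
# Illustrative logging settings for debugging (hypothetical values)
crawler.verbosity=6
crawler.header_logging=true
crawler.incremental_logging=true
```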
Web crawler advanced options
Option | Description | Default |
---|---|---|
crawler | The name of the crawler binary. | com.funnelback.crawler.FunnelBack |
crawler_binaries | Location of the crawler files. | |
crawler.accept_cookies | Cookie policy, i.e. whether the crawler accepts cookies. Requires HTTPClient if true. | true |
crawler.cache.DNSCache_max_size | Maximum size of internal DNS cache. Upon reaching this size the cache will drop old elements. | 200000 |
crawler.cache.LRUCache_max_size | Maximum size of LRUCache. Upon reaching this size the cache will drop old elements. | 500000 |
crawler.cache.URLCache_max_size | Maximum size of URLCache. May be ignored by some cache implementations. | 50000000 |
crawler.check_alias_exists | Check if aliased URLs exist - if not, revert to the original URL | false |
crawler.checkpoint_to | Location of crawler checkpoint files. | $SEARCH_HOME/data/$COLLECTION_NAME/offline/checkpoint |
crawler.classes.Crawler | Java class used by the crawler - defines top-level behaviour, which protocols are supported, etc. | com.funnelback.crawler.NetCrawler |
crawler.classes.Frontier | Java class used for the frontier (a list of URLs not yet visited). | com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000 |
crawler.classes.Policy | Java class used for enforcing the include/exclude policy for URLs | com.funnelback.crawler.StandardPolicy |
crawler.classes.statistics | List of statistics classes to use during a crawl in order to generate figures for data reports | CrawlSizeStatistic,MIMETypeStatistic,BroadMIMETypeStatistic,FileSizeStatistic,FileSizeByDocumentTypeStatistic,SuffixTypeStatistic,ReferencedFileTypeStatistic,URLlengthStatistic,WebServerTypeStatistic,BroadWebServerTypeStatistic |
crawler.classes.URLStore | Java class used to store content on disk e.g. create a mirror of files crawled. | com.funnelback.common.store.WarcStore |
crawler.cookie_jar_file | File containing cookies to be pre-loaded when a web crawl begins. | $SEARCH_HOME/conf/$COLLECTION_NAME/cookies.txt |
crawler.eliminate_duplicates | Whether to eliminate duplicate documents while crawling (default is true) | true |
crawler.frontier_num_top_level_dirs | Optional setting to specify number of top level directories to store disk based frontier files in. | |
crawler.frontier_use_ip_mapping | Whether to map hosts to frontiers based on IP address. (default is false) | false |
crawler.frontier_hosts | List of hosts running crawlers when performing a distributed web crawl. | |
crawler.frontier_port | Port on which the DistributedFrontier will listen. | |
crawler.max_individual_frontier_size | Maximum size of an individual frontier (unlimited if not defined) | |
crawler.inline_filtering_enabled | Option to control whether text extraction from binary files is done inline during a web crawl | true |
crawler.lowercase_iis_urls | Whether to lowercase all URLs from IIS web servers (default is false) | false |
crawler.predirects_enabled | Enable crawler predirects (boolean). See: crawler predirects | |
crawler.remove_parameters | Optional list of parameters to remove from URLs. | |
crawler.robotAgent | Robot agent name matched against entries in a robots.txt file. Matching is case-insensitive over the length of the name. | Funnelback |
crawler.secondary_store_root | Location of secondary (previous) store - used in incremental crawling | $SEARCH_HOME/data/$COLLECTION_NAME/live/data |
crawler.sslClientStore | Path to a SSL Client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed by Java's keytool. | |
crawler.sslClientStorePassword | Password for the SSL Client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed by Java's keytool. | |
crawler.sslTrustEveryone | Trust ALL Root Certificates and ignore server hostname verification if true. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing. | true |
crawler.sslTrustStore | Path to a SSL Trusted Root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed by Java's keytool. | |
crawler.send-http-basic-credentials-without-challenge | This option controls whether or not Funnelback sends any HTTP credentials along with every request. | true |
schedule.incremental_crawl_ratio | The number of scheduled incremental crawls that are performed between each full crawl (e.g. a value of 10 results in ten incremental crawls being followed by a full crawl). | 10 |
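Finally, as an illustrative sketch of a common advanced adjustment, the snippet below schedules a full crawl after every five incremental crawls and points the crawler at an SSL client certificate store; the ratio, path and password are hypothetical examples.

```
# Illustrative advanced settings (ratio, path and password are placeholders)
schedule.incremental_crawl_ratio=5
crawler.sslClientStore=$SEARCH_HOME/conf/example-collection/client-keystore.jks
crawler.sslClientStorePassword=changeit
crawler.sslTrustEveryone=false
```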