Web crawler configuration options

Web crawler options

The web crawler has a comprehensive set of configuration options that can be used to adjust how the web crawler operates.

General options

| Option | Description | Default |
| --- | --- | --- |
| crawler.num_crawlers | Number of crawler threads which simultaneously crawl different hosts. | 20 |
| crawler.request_delay | Milliseconds between HTTP requests (for a specific crawler thread). | 250 |
| crawler.user_agent | The user agent string the web crawler identifies itself with when making HTTP requests. | Mozilla/5.0 (compatible; Funnelback) |
| crawler.server_alias_file | Path to an optional file containing server alias mappings. See: server aliases | |
| crawler.classes.RevisitPolicy | Java class used for enforcing the revisit policy for URLs. | com.funnelback.common.revisit.AlwaysRevisitPolicy |
| crawler.revisit.edit_distance_threshold | Threshold for the edit distance between two versions of a page when deciding whether it has changed (used by the SimpleRevisitPolicy). | 20 |
| crawler.revisit.num_times_revisit_skipped_threshold | Threshold for the number of times a page revisit has been skipped when deciding whether to revisit it (used by the SimpleRevisitPolicy). | 2 |
| crawler.revisit.num_times_unchanged_threshold | Threshold for the number of times a page has been unchanged when deciding whether to revisit it (used by the SimpleRevisitPolicy). | 5 |
| data_report | Specifies if data reports should be generated for the crawl. | true |
| vital_servers | Specifies a list of servers that must be present in the crawl for a successful update. | |
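
As a rough illustration of how crawler.num_crawlers and crawler.request_delay interact, the sketch below estimates an upper bound on the request rate. It assumes every thread waits exactly the configured delay and that responses arrive instantly, so a real crawl will be slower. The function name is hypothetical and is not part of the crawler.

```python
def max_request_rate(num_crawlers: int, request_delay_ms: int) -> float:
    """Upper bound on requests per second across all crawler threads.

    Assumes each thread waits exactly request_delay_ms between requests and
    that responses arrive instantly, so real crawls will be slower.
    """
    if num_crawlers < 1 or request_delay_ms <= 0:
        raise ValueError("need at least one thread and a positive delay")
    per_thread = 1000.0 / request_delay_ms  # requests per second per thread
    return num_crawlers * per_thread


# Default values listed above: 20 threads, 250 ms between requests.
default_rate = max_request_rate(20, 250)
print(default_rate)  # 80.0
```

Because each thread works on a different host, the same arithmetic suggests that a single host sees at most about four requests per second with the default delay.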

Options controlling what gets included

| Option | Description | Default |
| --- | --- | --- |
| include_patterns | URLs matching this are included in the crawl (unless they match any exclude_patterns). | |
| exclude_patterns | URLs matching this are excluded from the crawl. | /cgi-bin,/vti,/_vti,calendar,SQ_DESIGN_NAME=print,SQ_ACTION=logout,SQ_PAINT_LAYOUT_NAME=,%3E%3C/script%3E, |
| crawler.use_sitemap_xml | Specifies if sitemap.xml files should be processed during a web crawl. | false |
| crawler.start_urls_file | Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start_url that the crawler is passed on startup (usually stored in the crawler.start_url configuration option). | collection.cfg.start.urls |
| start_url | Crawler seed URL. The crawler follows the links in this page, then the links in those pages, and so on. | _disabled__see_start_urls_file |
| crawler.protocols | Crawl URLs via these protocols (comma-separated list). | http,https |
| crawler.reject_files | Do not crawl files with these extensions. | asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip,Z |
| crawler.accept_files | Only crawl files with these extensions. Not normally used; the default is to accept all valid content. | |
| crawler.store_all_types | If true, override accept/reject rules and crawl and store all file types encountered. | false |
| crawler.store_empty_content_urls | If true, store URLs even if, after filtering, they contain no content. | false |
| crawler.non_html | Specifies non-HTML file formats to filter, based on the file extension (e.g. pdf, doc, xls). | doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm |
| crawler.allowed_redirect_pattern | Specify a regex to allow crawler redirections that would otherwise be disallowed by the current include/exclude patterns. | |
| crawler.parser.mimeTypes | Extract links from documents with these content types, given as a comma-separated list or a regexp: pattern. | text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml |
| crawler.extract_links_from_javascript | Whether to extract links from JavaScript while crawling. | false |
| crawler.follow_links_in_comments | Whether to follow links in HTML comments while crawling. | false |
| crawler.link_extraction_group | The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL. | |
| crawler.link_extraction_regular_expression | The expression used to extract links from each document. This must be a Perl compatible regular expression. | |
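
The interplay between include_patterns and exclude_patterns, and between the link extraction expression and its group, can be sketched as follows. This is only an illustration under stated assumptions: patterns are treated as plain substrings of the URL, Python's re module stands in for a Perl compatible engine, and the function names, URLs and the data-href attribute are all hypothetical.

```python
import re
from typing import List


def split_patterns(value: str) -> List[str]:
    """Split a comma-separated option value, dropping empty entries.

    Dropping empties matters for values with a trailing comma: an empty
    pattern would otherwise match every URL.
    """
    return [p.strip() for p in value.split(",") if p.strip()]


def is_included(url: str, include_patterns: str, exclude_patterns: str) -> bool:
    """True if url matches an include pattern and no exclude pattern.

    Assumption: each pattern is a plain substring of the URL.
    """
    if not any(p in url for p in split_patterns(include_patterns)):
        return False
    return not any(p in url for p in split_patterns(exclude_patterns))


def extract_links(text: str, expression: str, group: int) -> List[str]:
    """Return the chosen capture group from every match of expression."""
    return [m.group(group) for m in re.finditer(expression, text)]


include, exclude = "example.com", "/cgi-bin,calendar,"
print(is_included("https://example.com/about", include, exclude))         # True
print(is_included("https://example.com/cgi-bin/form", include, exclude))  # False

page = '<div data-href="/a.html"></div><div data-href="/b.html"></div>'
print(extract_links(page, r'data-href="([^"]+)"', 1))  # ['/a.html', '/b.html']
```

The group number selects which parenthesised part of the expression is kept as the URL, which is why the two link extraction options are configured together.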

Options controlling size limits and timeouts

| Option | Description | Default |
| --- | --- | --- |
| crawler.max_dir_depth | A URL with more than this many sub directories will be ignored (too deep, probably a crawler trap). | 15 |
| crawler.max_download_size | Maximum size of a file the crawler will download (in MB). | 10 |
| crawler.max_files_per_area | Maximum number of files per area, such as the files in one directory or the pages generated by one dynamic generator (e.g. index.asp?doc=123). This parameter was previously called crawler.max_dir_size. | 10000 |
| crawler.max_files_per_server | Maximum number of files per server. Empty (the default) means unlimited. | |
| crawler.max_files_stored | Maximum number of files to download. Unset (the default) or any value less than 1 means unlimited. | |
| crawler.max_link_distance | How far to crawl from the start_url (default is unlimited), e.g. if crawler.max_link_distance = 1, only crawl the links on start_url. NB: Turning this on drops the crawler to single-threaded operation. | |
| crawler.max_parse_size | The crawler will not parse documents beyond this many megabytes in size. | 10 |
| crawler.max_url_length | A URL with more characters than this will be ignored (too long, probably a crawler trap). | 256 |
| crawler.max_url_repeating_elements | A URL with more than this many repeating elements (directories) will be ignored (probably a crawler trap or an incorrectly configured web server). | 5 |
| crawler.overall_crawl_timeout | Maximum crawl time, after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter. | 24 |
| crawler.overall_crawl_units | The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes. | |
| crawler.request_timeout | Timeout for HTTP page GETs (milliseconds). | 15000 |
| crawler.max_timeout_retries | Maximum number of times to retry after a network timeout. | 0 |
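
The three crawler trap limits above (URL length, directory depth and repeating elements) could be checked along the following lines. This is a sketch, not the crawler's actual logic: it assumes that directories are the path segments before the final one and that "repeating elements" means the number of times the most frequent directory name occurs. The function name and URLs are hypothetical.

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlsplit

# Default values listed above.
MAX_URL_LENGTH = 256
MAX_DIR_DEPTH = 15
MAX_URL_REPEATING_ELEMENTS = 5


def trap_reason(url: str) -> Optional[str]:
    """Return the name of the first limit the URL exceeds, or None."""
    if len(url) > MAX_URL_LENGTH:
        return "crawler.max_url_length"
    # Assumption: directories are the path segments before the final one.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    directories = segments[:-1]
    if len(directories) > MAX_DIR_DEPTH:
        return "crawler.max_dir_depth"
    # Assumption: count how often the most frequent directory name occurs.
    if directories and max(Counter(directories).values()) > MAX_URL_REPEATING_ELEMENTS:
        return "crawler.max_url_repeating_elements"
    return None


print(trap_reason("https://example.com/a/b/index.html"))             # None
print(trap_reason("https://example.com/" + "a/b/" * 6 + "page.html"))  # crawler.max_url_repeating_elements
print(trap_reason("https://example.com/?q=" + "x" * 300))            # crawler.max_url_length
```

Checking the cheapest condition (length) first keeps the sketch simple; the order in which the real crawler applies its limits is not stated in this document.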

Authentication options

| Option | Description | Default |
| --- | --- | --- |
| crawler.allow_concurrent_in_crawl_form_interaction | Enable/disable concurrent processing of in-crawl form interaction. | true |
| crawler.form_interaction.pre_crawl.groupId.url | Specify the URL of the page containing the HTML web form in pre_crawl authentication mode. | |
| crawler.form_interaction.in_crawl.groupId.url_pattern | Specify a URL or URL pattern of the page containing the HTML web form in in_crawl authentication mode. | |
| crawler.ntlm.domain | NTLM domain to be used for web crawler authentication. | |
| crawler.ntlm.password | NTLM password to be used for web crawler authentication. | |
| crawler.ntlm.username | NTLM username to be used for web crawler authentication. | |
| ftp_passwd | Password to use when gathering content from an FTP server. | |
| ftp_user | Username to use when gathering content from an FTP server. | |
| http_passwd | Password used for accessing password protected content during a crawl. | |
| http_user | Username used for accessing password protected content during a crawl. | |

Web crawler monitor options

| Option | Description | Default |
| --- | --- | --- |
| crawler.monitor_authentication_cookie_renewal_interval | Optional time interval at which to renew crawl authentication cookies. | |
| crawler.monitor_checkpoint_interval | Time interval at which to checkpoint (seconds). | 1800 |
| crawler.monitor_delay_type | Type of delay to use during crawl (dynamic or fixed). | dynamic |
| crawler.monitor_halt | Checked during a crawl. If set to true, the crawler will shut down cleanly. | false |
| crawler.monitor_preferred_servers_list | Optional list of servers to prefer during crawl. | |
| crawler.monitor_time_interval | Time interval at which to output monitoring information (seconds). | 30 |
| crawler.monitor_url_reject_list | Optional parameter listing URLs to reject during a running crawl. | |

HTTP options

| Option | Description | Default |
| --- | --- | --- |
| http_proxy | The hostname of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'. | |
| http_proxy_passwd | The proxy password to be used during crawling. | |
| http_proxy_port | Port of the HTTP proxy used during crawling. | |
| http_proxy_user | The proxy user name to be used during crawling. | |
| http_source_host | IP address or hostname used by the crawler, on a machine with more than one available. | |
| crawler.request_header | Optional additional header to be inserted in HTTP(S) requests made by the web crawler. | |
| crawler.request_header_url_prefix | Optional URL prefix to be applied when processing the crawler.request_header parameter. | |
| crawler.store_headers | If true, write HTTP header information at the top of HTML files. Header information is used by the indexer. | true |

Logging options

| Option | Description | Default |
| --- | --- | --- |
| crawler.verbosity | Verbosity level (0-6) of crawler logs. A higher number results in more messages. | 4 |
| crawler.header_logging | Option to control whether HTTP headers are written out to a separate log file. | false |
| crawler.incremental_logging | Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling. | false |
| crawler.logfile | The crawler's log path and filename. | $SEARCH_HOME/data/$COLLECTION_NAME/offline/log/crawl.log |
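
Several default paths, including the crawler.logfile default above, contain the placeholders $SEARCH_HOME and $COLLECTION_NAME. The sketch below shows how such a value resolves to a concrete path, assuming the placeholders stand for the installation directory and the collection name. The directory /opt/funnelback, the collection name example-web and the function name are hypothetical.

```python
from string import Template

DEFAULT_LOGFILE = "$SEARCH_HOME/data/$COLLECTION_NAME/offline/log/crawl.log"


def expand_path(template: str, search_home: str, collection_name: str) -> str:
    """Substitute the two placeholders used in default path values."""
    return Template(template).substitute(
        SEARCH_HOME=search_home,
        COLLECTION_NAME=collection_name,
    )


# Hypothetical installation directory and collection name.
log_path = expand_path(DEFAULT_LOGFILE, "/opt/funnelback", "example-web")
print(log_path)  # /opt/funnelback/data/example-web/offline/log/crawl.log
```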

Web crawler advanced options

| Option | Description | Default |
| --- | --- | --- |
| crawler | The name of the crawler. | |
| crawler_binaries | Location of the crawler files. | |
| crawler.accept_cookies | Cookie policy. Default is false i.e. do not accept cookies. Requires HTTPClient if true. | true |
| crawler.cache.DNSCache_max_size | Maximum size of the internal DNS cache. Upon reaching this size the cache will drop old elements. | 200000 |
| crawler.cache.LRUCache_max_size | Maximum size of the LRUCache. Upon reaching this size the cache will drop old elements. | 500000 |
| crawler.cache.URLCache_max_size | Maximum size of the URLCache. May be ignored by some cache implementations. | 50000000 |
| crawler.check_alias_exists | Check if aliased URLs exist. If not, revert to the original URL. | false |
| crawler.checkpoint_to | Location of crawler checkpoint files. | $SEARCH_HOME/data/$COLLECTION_NAME/offline/checkpoint |
| crawler.classes.Crawler | Java class used by the crawler. Defines top level behaviour and which protocols are supported. | |
| crawler.classes.Frontier | Java class used for the frontier (a list of URLs not yet visited). | |
| crawler.classes.Policy | Java class used for enforcing the include/exclude policy for URLs. | com.funnelback.crawler.StandardPolicy |
| crawler.classes.statistics | List of statistics classes to use during a crawl in order to generate figures for data reports. | CrawlSizeStatistic,MIMETypeStatistic,BroadMIMETypeStatistic,FileSizeStatistic,FileSizeByDocumentTypeStatistic,SuffixTypeStatistic,ReferencedFileTypeStatistic,URLlengthStatistic,WebServerTypeStatistic,BroadWebServerTypeStatistic |
| crawler.classes.URLStore | Java class used to store content on disk, e.g. to create a mirror of files. | |
| crawler.cookie_jar_file | File containing cookies to be pre-loaded when a web crawl begins. | $SEARCH_HOME/conf/$COLLECTION_NAME/cookies.txt |
| crawler.eliminate_duplicates | Whether to eliminate duplicate documents while crawling. | true |
| crawler.frontier_num_top_level_dirs | Optional setting to specify the number of top level directories in which to store disk based frontier files. | |
| crawler.frontier_use_ip_mapping | Whether to map hosts to frontiers based on IP address. | false |
| crawler.frontier_hosts | List of hosts running crawlers if performing a distributed web crawl. | |
| crawler.frontier_port | Port on which DistributedFrontier will listen. | |
| crawler.max_individual_frontier_size | Maximum size of an individual frontier (unlimited if not defined). | |
| crawler.inline_filtering_enabled | Option to control whether text extraction from binary files is done inline during a web crawl. | true |
| crawler.lowercase_iis_urls | Whether to lowercase all URLs from IIS web servers. | false |
| crawler.predirects_enabled | Enable crawler predirects (boolean). See: crawler predirects | |
| crawler.remove_parameters | Optional list of parameters to remove from URLs. | |
| crawler.robotAgent | Matching is case-insensitive over the length of the name in a robots.txt file. | Funnelback |
| crawler.secondary_store_root | Location of the secondary (previous) store, used in incremental crawling. | $SEARCH_HOME/data/$COLLECTION_NAME/live/data |
| crawler.sslClientStore | Path to an SSL client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed by Java's keytool. | |
| crawler.sslClientStorePassword | Password for the SSL client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed by Java's keytool. | |
| crawler.sslTrustEveryone | If true, trust ALL root certificates and ignore server hostname verification. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing. | true |
| crawler.sslTrustStore | Path to an SSL trusted root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed by Java's keytool. | |
| crawler.send-http-basic-credentials-without-challenge | Controls whether Funnelback sends HTTP basic credentials along with every request, without waiting for an authentication challenge. | true |
| schedule.incremental_crawl_ratio | The number of scheduled incremental crawls that are performed between each full crawl (e.g. a value of '10' results in an update schedule in which every ten incremental crawls are followed by a full crawl). | 10 |
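
The update cycle implied by schedule.incremental_crawl_ratio can be sketched as follows. The sketch assumes that scheduled updates are numbered from 1, starting immediately after a full crawl; the function name is hypothetical and the real scheduler may count differently.

```python
def update_type(update_number: int, incremental_crawl_ratio: int) -> str:
    """Classify a scheduled update as 'incremental' or 'full'.

    Assumption: updates are numbered from 1, starting just after a full
    crawl, so each cycle is `ratio` incremental crawls plus one full crawl.
    """
    if update_number < 1 or incremental_crawl_ratio < 0:
        raise ValueError("update_number must be >= 1 and ratio >= 0")
    cycle_length = incremental_crawl_ratio + 1
    return "full" if update_number % cycle_length == 0 else "incremental"


# With the default ratio of 10, updates 1 to 10 are incremental and the
# 11th is a full crawl, after which the cycle repeats.
schedule = [update_type(n, 10) for n in range(1, 12)]
print(schedule.count("incremental"), schedule[-1])  # 10 full
```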
