Web crawler logs
Introduction
The Funnelback web crawler writes its log files to the $SEARCH_HOME/data/<collection>/offline/log
directory. The main log files are described below.
Web crawler logs
crawl.log
The main web crawler log file, which details the overall progress and status of the crawl.
crawl.log.N.gz
Individual crawler thread logs, where N ranges from 0 to the number of crawlers minus 1 (default 19).
...
Rejected: http://www.funnelback.com/css/styles.css
Unacceptable: http://www.squiz.net
Cached: http://www.funnelback.com/our-products
<DOCHDR>
<BASE HREF="http://www.funnelback.com/our-products">
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Vary: Accept-Encoding
Date: Fri, 16 May 2014 01:40:35 GMT
Cache-Control: private
X-UA-Compatible: IE=edge
Content-Length: 67764
X-FRAME-OPTIONS: SAMEORIGIN
X-Funnelback-Stored-Length: 67679
X-Funnelback-Last-Modification-Seen: 2014:05:16T11:40:35
X-Funnelback-Num-Times-Unchanged: 0
X-Funnelback-Num-Times-Copied: 0
X-Funnelback-Num-Times-Revisit-Skipped: 0
</DOCHDR>
Frontier Delay: 138 ms for frontier which contained URL:
http://www.funnelback.com/our-products
Process: http://www.funnelback.com/our-products
Contacting http://www.funnelback.com/our-products [11:40:35:743]
GET connected to http://www.funnelback.com/our-products
GET Request Bytes: 274 Response Header Bytes: 221 URL:
http://www.funnelback.com/our-products
GET from http://www.funnelback.com/our-products [text/html; charset=utf-8] [62013]
Signalled frontier for host: www.funnelback.com
Parsing http://www.funnelback.com/our-products
Parsed http://www.funnelback.com/our-products
Content Bytes: 61918 URL: http://www.funnelback.com/our-products
Scanner: http://www.funnelback.com/our-products
Extracted_Text: ...
MD5/Hash: 59d188b6f6a8bde267b796e2c9f1660f 1327251562 http://www.funnelback.com/our-products
http://www.funnelback.com/our-products sig_cache_size: 32
...
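As a rough illustration of how the thread logs could be summarised, the sketch below counts the lines that start with the Rejected:, Unacceptable: and Cached: prefixes seen in the excerpt above. The function names count_url_decisions and count_in_log_dir are hypothetical helpers, not part of Funnelback, and the sketch assumes each such entry sits on its own line and that the files are UTF-8 text.

```python
import gzip
from collections import Counter
from pathlib import Path

# Prefixes taken from the excerpt above; a real log may contain others.
PREFIXES = ("Rejected:", "Unacceptable:", "Cached:")


def count_url_decisions(lines):
    """Count lines by leading prefix (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for prefix in PREFIXES:
            if line.startswith(prefix):
                counts[prefix.rstrip(":")] += 1
                break
    return counts


def count_in_log_dir(log_dir):
    """Sum the counts over every crawl.log.N.gz file found in log_dir."""
    total = Counter()
    for path in sorted(Path(log_dir).glob("crawl.log.*.gz")):
        # "rt" opens the compressed file as text rather than bytes.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            total.update(count_url_decisions(handle))
    return total
```

Calling count_in_log_dir with the collection's log directory would return one combined tally for all crawler threads.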
stored.log
Lists, in chronological order, all the URLs that the crawler stored successfully.
http://www.funnelback.com
http://www.funnelback.com/our-products
http://www.funnelback.com/our-products/enterprise-search
...
url_errors.log
Details all errors encountered while attempting to fetch URLs during the crawl, including HTTP error status codes, network exceptions and link extraction failures.
E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]
E http://www.funnelback.com/large-file.pdf [Exceeds max_download_size: 104405535] [2014:01:09:11:46:53]
E http://www.funnelback.com/secure-section [403 Forbidden] [2014:06:16:09:57:53]
E http://www.funnelback.com/gone [410 Gone] [2014:06:16:09:57:53]
E http://www.funnelback.com/blogs [Can't scan root page] [2014:03:31:21:20:53]
E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]
E http://www.funnelback.com/popular-page-2 [Net Error: Connection reset] [2014:06:03:19:46:44]
E xttp://www.funnelback.com/ [Link Extraction: java.net.MalformedURLException: unknown protocol: xttp] [2014:06:03:11:28:38]
...
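The entries above appear to follow the pattern "E <url> [<reason>] [<timestamp>]". The sketch below parses that pattern and tallies the reasons. The regular expression and the timestamp layout (year:month:day:hour:minute:second) are inferred from the sample lines only, and parse_error_line and count_reasons are hypothetical names.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the sample entries: E <url> [<reason>] [<timestamp>]
ERROR_LINE = re.compile(r"^E (?P<url>\S+) \[(?P<reason>.*)\] \[(?P<when>[\d:]+)\]$")


def parse_error_line(line):
    """Return (url, reason, datetime), or None if the line does not match."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    # Assumed layout: year:month:day:hour:minute:second
    when = datetime.strptime(match.group("when"), "%Y:%m:%d:%H:%M:%S")
    return match.group("url"), match.group("reason"), when


def count_reasons(lines):
    """Tally how often each error reason occurs."""
    parsed = (parse_error_line(line) for line in lines)
    return Counter(entry[1] for entry in parsed if entry is not None)
```

Grouping by reason in this way makes it easier to tell a handful of broken links apart from a systematic problem such as repeated timeouts on one host.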
redirects.txt
Lists all redirects encountered, including HTTP redirects, HTML meta refreshes, canonical URL references, server aliases and duplicates.
# H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, A = Aliased Server URL, D = Duplicate (based on MD5 of extracted text), C = Canonical Link Directive
A http://www2.funnelback.com/ -> http://www.funnelback.com/
H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss
D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news
C http://www.funnelback.com/news?id=1234 -> http://www.funnelback.com/news/2014/03/02/-title
M http://www.funnelback.com/searchbetter -> http://www.funnelback.com/files/white-paper/search-better.pdf
...
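Each entry above consists of a one-letter type code, the source URL, the separator " -> " and the target URL. A minimal parsing sketch under that assumption follows. The type descriptions are copied from the legend line, and parse_redirect_line is a hypothetical helper name.

```python
# Type codes as described in the legend line of redirects.txt.
REDIRECT_TYPES = {
    "H": "HTTP redirect",
    "M": "Meta refresh (HTML) redirect",
    "A": "Aliased server URL",
    "D": "Duplicate (based on MD5 of extracted text)",
    "C": "Canonical link directive",
}


def parse_redirect_line(line):
    """Return (code, source, target), or None for comments and other lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    code, _, rest = line.partition(" ")
    source, arrow, target = rest.partition(" -> ")
    # Skip anything without a known type code or without the arrow separator.
    if code not in REDIRECT_TYPES or not arrow:
        return None
    return code, source, target
```

Looking up the returned code in REDIRECT_TYPES gives a readable label, which is convenient when reporting why a URL was not stored under its original address.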
servers.log
Lists each individual server (sub-domain) encountered, along with how many URLs were left in its frontier at the end of the crawl and how many documents were stored.
# Server Frontier Stored
http://www.funnelback.com/ 3 456
http://docs.funnelback.com/ 0 1234
https://docs.funnelback.com/ 0 1
...
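The three whitespace-separated columns above (server, frontier count, stored count) lend themselves to simple scripting. The sketch below reads them into tuples and picks out servers whose frontier was not empty when the crawl ended. The column layout is assumed from the sample, and parse_servers_log and servers_with_pending_urls are hypothetical names.

```python
def parse_servers_log(lines):
    """Return a list of (server, frontier, stored) tuples.

    The header line, the "..." marker and any other line that does not
    have exactly three columns with two numeric counts is skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        rows.append((parts[0], int(parts[1]), int(parts[2])))
    return rows


def servers_with_pending_urls(rows):
    """Servers that still had URLs in the frontier at the end of the crawl."""
    return [server for server, frontier, _stored in rows if frontier > 0]
```

A non-empty result may indicate that the crawl stopped before every discovered URL on those servers had been fetched, which is often worth checking against crawl.log.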
domains.log
Same as servers.log, but on a domain-only basis, where the counts for all sub-domains are accumulated.
# Domain Frontier Stored
funnelback.com 3 1691