Web crawler logs

Introduction

The Funnelback web crawler writes its log files to the $SEARCH_HOME/data/<collection>/offline/log directory. Details on the main log files produced are given below.

Web crawler log files

crawl.log

The main web crawler log file, which details the overall progress and status of the crawl.

crawl.log.N.gz

Individual crawler thread logs, one per crawler thread, where N ranges from 0 to one less than the number of crawler threads (0 to 19 by default).

  ...
  Rejected: http://www.funnelback.com/css/styles.css
  Unacceptable: http://www.squiz.net  
  Cached: http://www.funnelback.com/our-products
<DOCHDR>

<BASE HREF="http://www.funnelback.com/our-products">
     HTTP/1.1 200 OK
     Content-Type: text/html; charset=utf-8
     Vary: Accept-Encoding
     Date: Fri, 16 May 2014 01:40:35 GMT
     Cache-Control: private
     X-UA-Compatible: IE=edge
     Content-Length: 67764
     X-FRAME-OPTIONS: SAMEORIGIN
     X-Funnelback-Stored-Length: 67679
     X-Funnelback-Last-Modification-Seen: 2014:05:16T11:40:35
     X-Funnelback-Num-Times-Unchanged: 0
     X-Funnelback-Num-Times-Copied: 0
     X-Funnelback-Num-Times-Revisit-Skipped: 0

</DOCHDR>
   Frontier Delay: 138 ms for frontier which contained URL: http://www.funnelback.com/our-products
   Process: http://www.funnelback.com/our-products
   Contacting http://www.funnelback.com/our-products [11:40:35:743]
   GET connected to http://www.funnelback.com/our-products
   GET Request Bytes: 274 Response Header Bytes: 221 URL: http://www.funnelback.com/our-products
   GET from http://www.funnelback.com/our-products [text/html; charset=utf-8] [62013]
   Signalled frontier for host: www.funnelback.com
   Parsing http://www.funnelback.com/our-products
   Parsed http://www.funnelback.com/our-products
   Content Bytes: 61918 URL: http://www.funnelback.com/our-products
   Scanner: http://www.funnelback.com/our-products
   Extracted_Text: ...
   MD5/Hash: 59d188b6f6a8bde267b796e2c9f1660f 1327251562  http://www.funnelback.com/our-products
   http://www.funnelback.com/our-products sig_cache_size: 32
   ...
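
When reviewing a thread log it can help to tally how many URLs were rejected, cached, and so on. The following is a minimal sketch, not part of Funnelback: it assumes the line prefixes seen in the excerpt above (`Rejected:`, `Unacceptable:`, `Cached:`, `Process:`) and simply counts them. The function names and the list of prefixes are illustrative assumptions; a real log may contain other line types.

```python
import gzip
from collections import Counter
from typing import Iterable

# Prefixes copied from the sample excerpt; this list is an assumption
# and is not exhaustive.
EVENT_PREFIXES = ("Rejected:", "Unacceptable:", "Cached:", "Process:")


def count_events(lines: Iterable[str]) -> Counter:
    """Count log lines that start with one of the known prefixes."""
    counts: Counter = Counter()
    for line in lines:
        stripped = line.strip()
        for prefix in EVENT_PREFIXES:
            if stripped.startswith(prefix):
                counts[prefix.rstrip(":")] += 1
                break
    return counts


def count_events_in_file(path: str) -> Counter:
    """Read a gzipped thread log (for example crawl.log.0.gz) and tally it."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_events(handle)
```

Calling `count_events_in_file` on each `crawl.log.N.gz` file and adding the resulting counters together would give totals for the whole crawl.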

stored.log

Lists, in chronological order, all URLs that the crawler stored successfully.

http://www.funnelback.com
http://www.funnelback.com/our-products
http://www.funnelback.com/our-products/enterprise-search
...

url_errors.log

Details all errors encountered while attempting to fetch URLs during the crawl, including HTTP error status codes, network exceptions, and link extraction failures.

  E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]
  E http://www.funnelback.com/large-file.pdf [Exceeds max_download_size: 104405535] [2014:01:09:11:46:53]
  E http://www.funnelback.com/secure-section [403 Forbidden] [2014:06:16:09:57:53]
  E http://www.funnelback.com/gone [410 Gone] [2014:06:16:09:57:53]
  E http://www.funnelback.com/blogs [Can't scan root page] [2014:03:31:21:20:53]
  E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]
  E http://www.funnelback.com/popular-page-2 [Net Error: Connection reset] [2014:06:03:19:46:44]
  E xttp://www.funnelback.com/ [Link Extraction: java.net.MalformedURLException: unknown protocol: xttp] [2014:06:03:11:28:38]
  ...
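
The entries above follow a regular shape: an `E` marker, the URL, a bracketed reason, and a bracketed timestamp. As a rough sketch under that assumption (the layout is inferred from the sample lines only, and the names `UrlError` and `parse_error_line` are hypothetical), the lines could be parsed like this:

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Layout inferred from the sample entries: E <url> [<reason>] [<timestamp>]
ERROR_LINE = re.compile(
    r"^\s*E\s+(?P<url>\S+)\s+\[(?P<reason>.*)\]\s+\[(?P<timestamp>[^\]]*)\]\s*$"
)

# Timestamps in the samples look like 2014:06:16:09:57:53.
TIMESTAMP_FORMAT = "%Y:%m:%d:%H:%M:%S"


class UrlError(NamedTuple):
    url: str
    reason: str
    timestamp: datetime


def parse_error_line(line: str) -> Optional[UrlError]:
    """Return a UrlError for a matching line, or None for anything else."""
    match = ERROR_LINE.match(line)
    if match is None:
        return None
    try:
        when = datetime.strptime(match.group("timestamp"), TIMESTAMP_FORMAT)
    except ValueError:
        return None  # timestamp not in the assumed format
    return UrlError(match.group("url"), match.group("reason"), when)
```

Lines that do not match the assumed layout are skipped rather than raising, so the parser can be run over a whole file without special handling.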

sorted.log

A combination of stored.log and url_errors.log, sorted in lexicographical order so that other programs can process it easily.

redirects.txt

All encountered redirects, including HTTP redirects, HTML meta refresh, canonical URL references and server aliases.

  # H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, A = Aliased Server URL, D = Duplicate (based on MD5 of extracted text), C = Canonical Link Directive
  A http://www2.funnelback.com/ -> http://www.funnelback.com/
  H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss
  D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news
  C http://www.funnelback.com/news?id=1234 -> http://www.funnelback.com/news/2014/03/02/-title
  M http://www.funnelback.com/searchbetter -> http://www.funnelback.com/files/white-paper/search-better.pdf
  ...
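
To illustrate how the type codes can be used, the sketch below groups redirect entries by type. It assumes each entry has the form `<code> <source> -> <target>` as in the sample, and that lines starting with `#` are comments; the names `Redirect`, `parse_redirect_line`, and `group_by_type` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional

# Codes and descriptions taken from the comment line in the sample file.
REDIRECT_TYPES = {
    "H": "HTTP Redirect",
    "M": "Meta Refresh (HTML) Redirect",
    "A": "Aliased Server URL",
    "D": "Duplicate (based on MD5 of extracted text)",
    "C": "Canonical Link Directive",
}


class Redirect(NamedTuple):
    code: str
    source: str
    target: str


def parse_redirect_line(line: str) -> Optional[Redirect]:
    """Parse one entry; return None for comments, blanks, or unknown lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    code, _, rest = stripped.partition(" ")
    source, arrow, target = rest.partition(" -> ")
    if code not in REDIRECT_TYPES or not arrow:
        return None
    return Redirect(code, source.strip(), target.strip())


def group_by_type(lines: Iterable[str]) -> Dict[str, List[Redirect]]:
    """Group parsed redirects by their one-letter type code."""
    groups: Dict[str, List[Redirect]] = defaultdict(list)
    for line in lines:
        entry = parse_redirect_line(line)
        if entry is not None:
            groups[entry.code].append(entry)
    return dict(groups)
```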

servers.log

Lists each server (sub-domain) encountered during the crawl, together with the number of documents left in its frontier at the end of the crawl and the number of documents stored.

# Server Frontier Stored
http://www.funnelback.com/ 3 456
http://docs.funnelback.com/ 0 1234
https://docs.funnelback.com/ 0 1
...

domains.log

Same as servers.log, but aggregated per domain: the counts for all sub-domains of a domain are summed.

# Domain Frontier Stored
funnelback.com 3 1691
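
The relationship between the two files can be checked by summing the servers.log rows per domain. The sketch below is an illustration only, not how the crawler itself computes domains.log: it assumes whitespace-separated columns as in the samples, and it derives the domain by keeping the last two labels of the host name, which is a simplification that would be wrong for suffixes such as `co.uk`. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple
from urllib.parse import urlsplit


def naive_domain(host: str) -> str:
    """Keep the last two labels of a host name (simplifying assumption)."""
    return ".".join(host.split(".")[-2:])


def aggregate_by_domain(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Sum the Frontier and Stored columns of servers.log rows per domain."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        parts = line.split()
        # Data rows have exactly three columns: server URL, frontier, stored.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        host = urlsplit(parts[0]).hostname
        if host is None:
            continue
        domain = naive_domain(host)
        totals[domain][0] += int(parts[1])
        totals[domain][1] += int(parts[2])
    return {domain: (f, s) for domain, (f, s) in totals.items()}
```

Applied to the three servers.log rows shown earlier, this gives a frontier count of 3 and a stored count of 456 + 1234 + 1 = 1691 for funnelback.com, matching the domains.log sample.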

v15.16.0