WarcCat

Introduction

WarcCat is a tool for inspecting and performing basic operations on warc files.

Usage

WarcCat -stem <warc stem> <Matcher> <Print options>

To display the WarcCat usage information, run the following from $SEARCH_HOME:

# Linux
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat
# Windows
%SEARCH_HOME%\wbin\java\bin\java -classpath %SEARCH_HOME%\lib\java\all\^* com.funnelback.warc.util.WarcCat
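
On Linux, the long java invocation can be wrapped in a small shell function so that later commands stay readable. This is only a sketch: the function name warccat is hypothetical and is not shipped with Funnelback, and it assumes SEARCH_HOME is set to the Funnelback install directory.

```shell
# Hypothetical convenience wrapper (not part of Funnelback).
# Assumes SEARCH_HOME points at the Funnelback install directory.
warccat() {
    # The classpath is quoted so the shell passes the * through to java unexpanded.
    "$SEARCH_HOME/linbin/java/bin/java" \
        -classpath "$SEARCH_HOME/lib/java/all/*" \
        com.funnelback.warc.util.WarcCat "$@"
}
```

With the function defined, running warccat with no arguments prints the usage, and warccat -stem /foo/bar displays the warc file /foo/bar.warc.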

Warc stem

-stem <warc stem>: The input stem of the warc file to be displayed e.g. /foo/bar for the warc file /foo/bar.warc.

Matcher

Specifies the records within the warc file that will be selected.

One of the following should be specified:

-matcher MatchAll: (Default) Match every record in the warc file.
-matcher Bounded -MF start=<N> -MF end=<N>: Match a range of records from the warc file. The start and end values specify the range, where <N>=1 is the first record in the warc file.
-matcher HeaderFieldRegex -MF headerFieldName=<NAME> -MF regex=<EXPR>: Match a set of values from a specified header using a regular expression.
-matcher MatchURI -MF uri=<URI>: Match a specific URI.
-matcher RegexURI -MF regex=<EXPR>: Match a set of URIs using a regular expression.
-matcher MatchStartOfURI -MF prefix=<PREFIX>: Match a set of URIs using a URI prefix.

Print options

Specifies how the selected records should be printed.

One of the following should be specified:

-printer All -PF newLineBreakBetween=<true|false>: (Default) Print the warc headers and uncompressed content for the matching records from the warc file.
-printer AllCompressed -PF newLineBreakBetween=<true|false>: Print the warc headers and compressed content for the matching records from the warc file.
-printer ContentUncompressed -PF newLineBreakBetween=<true|false>: Print the content (uncompressed) only for the matching records from the warc file.
-printer HeaderOnly -PF newLineBreakBetween=<true|false>: Print the warc headers only for the matching records from the warc file.
-printer SplitIntoFiles -PF prefix=<PREFIX> -PF recordsPerFile=<N> -PF overwrite=<true|false>: Split a warc file into n-document chunks saved as separate warc files. prefix is the file name stem of the output warc files. recordsPerFile sets the maximum number of documents to include in each split warc file. When overwrite is set to true, existing output warc files will be overwritten.

The default value for newLineBreakBetween is false.

An additional option controlling the overall warc file header can also be specified:

-printWarcInfo <true|false>: (Default = true) Print the warc file header. This should be printed at the start of any warc file. Set to false when appending records to an existing warc file.

Examples

View a warc file

Display everything in a warc file, $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl.warc:

$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/COLLECTION/live/data/funnelback-web-crawl

Create a warc file containing documents from other warc files

Create a warc file which consists of two documents from another warc file. First we will extract one document:

$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/COLLECTION/live/data/funnelback-web-crawl -matcher MatchURI -MF "uri=http://funnelback.com/" -printer AllCompressed -printWarcInfo true > /tmp/newWarcFile.warc

Breaking down that command: the matcher is set to MatchURI, which requires the URI to be supplied with -MF uri=<doc URI>. The printer is set to AllCompressed, which prints both the headers and the content, compressing the content to save space. Finally, -printWarcInfo is set to true, which prepends the warc file header to the output. To append the second document to the warc file, run:

$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/COLLECTION/live/data/funnelback-web-crawl -matcher MatchURI -MF "uri=http://docs.funnelback.com/" -printer AllCompressed -printWarcInfo false >> /tmp/newWarcFile.warc

This time -printWarcInfo is set to false, as the warc file already has a warc file header.
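
The same two-step pattern extends to any number of documents: write the warc file header with the first document only, then append the rest. The following is a minimal sketch of that loop, assuming a Linux install with SEARCH_HOME set. The function names warccat and build_warc are hypothetical helpers introduced here for illustration; only the WarcCat options come from this page.

```shell
# Hypothetical wrapper around the java invocation used on this page.
warccat() {
    "$SEARCH_HOME/linbin/java/bin/java" \
        -classpath "$SEARCH_HOME/lib/java/all/*" \
        com.funnelback.warc.util.WarcCat "$@"
}

# Hypothetical helper: build_warc <input stem> <output file> <uri>...
build_warc() {
    bw_stem="$1"
    bw_out="$2"
    shift 2
    bw_info=true          # the warc file header is written once, with the first document
    : > "$bw_out"         # start from an empty output file
    for bw_uri in "$@"; do
        warccat -stem "$bw_stem" -matcher MatchURI -MF "uri=$bw_uri" \
            -printer AllCompressed -printWarcInfo "$bw_info" >> "$bw_out" || return 1
        bw_info=false     # later documents are appended without the header
    done
}
```

For example, build_warc data/COLLECTION/live/data/funnelback-web-crawl /tmp/newWarcFile.warc http://funnelback.com/ http://docs.funnelback.com/ would issue the two commands shown above in order.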

Split a warc file into several smaller files

The SplitIntoFiles printer takes an input warc file (given by the -stem parameter) and splits it into multiple files, each containing at most n records (set by the recordsPerFile parameter).

$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl -matcher MatchAll -printer SplitIntoFiles -PF prefix=$SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl-split -PF recordsPerFile=100000 -PF overwrite=true

The bounded matcher can also be used to extract a range of records from a warc file.

# Extract the first 100000 records from funnelback-web-crawl.warc and write them to funnelback-web-crawl-1.warc
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl -matcher Bounded -MF start=1 -MF end=100000 -printer AllCompressed > $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl-1.warc
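
When extracting several consecutive ranges with the bounded matcher, the range boundaries can be computed rather than typed by hand. The sketch below only does the arithmetic. The function name bounded_ranges is hypothetical, and it assumes that record numbering starts at 1 and that the upper bound is inclusive, as the example above suggests.

```shell
# Hypothetical helper: bounded_ranges <total records> <records per chunk>
# Prints one "lower upper" pair per line, covering records 1..total.
# Assumes the upper bound of the Bounded matcher is inclusive.
bounded_ranges() {
    br_total="$1"
    br_size="$2"
    br_start=1
    while [ "$br_start" -le "$br_total" ]; do
        br_end=$((br_start + br_size - 1))
        if [ "$br_end" -gt "$br_total" ]; then
            br_end="$br_total"    # the last chunk may be shorter
        fi
        echo "$br_start $br_end"
        br_start=$((br_end + 1))
    done
}
```

For a warc file holding 250000 records, bounded_ranges 250000 100000 prints the pairs 1 100000, 100001 200000 and 200001 250000, one per line. Each pair can then be supplied to the bounded matcher using the -MF parameters listed in the Matcher section.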