Document filtering

Introduction

Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.

Filtering passes the raw document through multiple chained filters, each of which may modify the document. Modifications include converting the document from a binary format such as PDF to an indexable text format such as HTML, adding metadata, or altering the document's URL.

Filtering is run during the gather phase of an update. For push collections filtering can be run when a document is added.

Note: A full update is required after making any changes to filters as documents that are copied during an incremental update are not re-filtered. Full updates are started from the advanced update screen.

The filter chain

During the filter phase the document passes through a series of document filters, with the modified output of each filter being passed to the next. This series of filters is referred to as the filter chain.

There are a number of preset filters that are used to perform tasks such as extracting text from a binary document and cleaning the titles.

A typical filter process is shown below. A binary document is first converted using the Tika filter, which extracts the document text and outputs it as HTML. This HTML is then passed through the JSoup filter, which runs a separate chain of JSoup filters that allow targeted modification of the HTML content and structure. Finally, a custom filter performs a number of modifications to the content.

JSoup filters should be used for HTML documents when making modifications to the document structure, or performing operations that select and transform the document's DOM. Custom JSoup filters can be written to perform operations such as:

Injecting metadata
Cleaning titles
Scraping content (e.g. extracting breadcrumbs to metadata)

The filter chain is made up of chains and choices, separated using two types of delimiter. The delimiters control whether the content passes through a single filter from a set of filters (a choice, indicated by commas) or through each filter in turn (a chain, indicated by colons). JSoup filters use a separate filter chain.

The set of filters below would be processed as follows: The content would pass through either Filter3, Filter2 or Filter1 before passing through Filter4 and Filter5.

Filter1,Filter2,Filter3:Filter4:Filter5

There are some caveats when specifying filter chains:

Choice sets are checked in reverse order, i.e. filters that appear last in the list will be used first if they are capable of filtering a given document type.
When specifying a combination of choice and chain filters, the comma (,) has higher precedence than the colon (:). In other words, it is possible to have a chain of choice filters, but it is not possible to have a choice between several chains of filters.
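
The delimiter and ordering rules above can be sketched in a few lines of Java. This is only an illustration of the rules as described here, not Funnelback's own parser: the class name, the method names and the canHandle check are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Illustrative only: models the chain/choice rules, not Funnelback's implementation.
class FilterChainSketch {

    /** Splits a chain spec into steps (colon), each holding its choice set (comma). */
    static List<List<String>> parse(String spec) {
        List<List<String>> steps = new ArrayList<>();
        for (String step : spec.split(":")) {
            // The comma binds tighter than the colon, so every step is a flat choice set.
            steps.add(Arrays.asList(step.split(",")));
        }
        return steps;
    }

    /** Picks one filter for a step: choices are checked in reverse order. */
    static Optional<String> choose(List<String> choices, Predicate<String> canHandle) {
        List<String> reversed = new ArrayList<>(choices);
        Collections.reverse(reversed);
        return reversed.stream().filter(canHandle).findFirst();
    }

    public static void main(String[] args) {
        List<List<String>> steps = parse("Filter1,Filter2,Filter3:Filter4:Filter5");
        // Hypothetical capability check: pretend Filter3 cannot handle this document type.
        Predicate<String> canHandle = name -> !name.equals("Filter3");
        for (List<String> step : steps) {
            System.out.println(step + " -> " + choose(step, canHandle).orElse("(no filter applies)"));
        }
    }
}
```

In this sketch the first step falls back from Filter3 to Filter2 because the choice set is scanned from the end of the list, while Filter4 and Filter5 are single-entry choice sets and are therefore always attempted.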

There is also support for custom document filters written in Groovy. Custom filters receive the document's URL and content as input and must return the transformed document text, ready to pass on to the next filter. A custom filter is written in Groovy (and can use Java code), and can make almost any change to the content.
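
The input and output contract described above (URL and content in, transformed text out) can be pictured with the following Java sketch. It deliberately does not use the Funnelback filter API: a plain BinaryOperator stands in for a custom filter, and the site name, the source.format metadata name and the example URL are invented for illustration.

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Illustrative only: a stand-in for the custom filter contract, not the Funnelback API.
class CustomFilterSketch {

    // Each stand-in filter maps (url, content) to the transformed content.
    static final BinaryOperator<String> STRIP_TITLE_SUFFIX = (url, content) ->
            content.replace(" | Example Site</title>", "</title>");

    // Hypothetical metadata injection that depends on the document's URL.
    static final BinaryOperator<String> TAG_PDF_SOURCES = (url, content) ->
            url.endsWith(".pdf")
                    ? content.replace("<head>", "<head><meta name=\"source.format\" content=\"pdf\">")
                    : content;

    /** Feeds the output of each filter into the next one, as in a chain. */
    static String runChain(String url, String content, List<BinaryOperator<String>> filters) {
        String current = content;
        for (BinaryOperator<String> filter : filters) {
            current = filter.apply(url, current);
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Annual report | Example Site</title></head><body>...</body></html>";
        System.out.println(runChain("https://example.com/report.pdf", html,
                List.of(STRIP_TITLE_SUFFIX, TAG_PDF_SOURCES)));
    }
}
```

Treating the content as one string, as this sketch does, suits whole-document operations. Changes that depend on the HTML structure are usually better expressed as JSoup filters.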

Custom filters should be used when a JSoup filter is not appropriate. Custom filters offer more flexibility but are more expensive to run. Custom filters can be used for operations such as:

Manipulating complete documents as binary or string data
Splitting a document into multiple documents
Modifying the document type or URL
Removing documents
Transforming HTML or JSON documents
Implementing document conversion for binary documents
Processing/analysis of documents where structure is not relevant

Document filters

Document filters make up the main filter chain within Funnelback.

Built-in filters

Funnelback ships with the following built-in filters:

Class	Description
CSVToXML	Converts records in a CSV, TSV, SQL or Excel document to multiple XML documents.
DocumentFixerFilterProvider	Analyses the document title and attempts to replace it if the title is not considered a good title. HTML documents only.
ExternalFilterProvider	Uses external programs to convert documents.
ForceCSVMime	Sets the MIME type of all documents to `text/csv`.
ForceJSONMime	Sets the MIME type of all documents to `application/json`.
ForceXMLMime	Sets the MIME type of all documents to `text/xml`.
InjectNoIndexFilterProvider	Automatically inserts noindex tags based on CSS selectors.
JSONToXML	Converts JSON documents to XML.
JSoupProcessingFilterProvider	Converts HTML documents to and from a JSoup object and runs an extra chain of JSoup filters.
MetadataNormaliser	Used to normalise and replace metadata fields.
TikaFilterProvider	Converts binary files of specific file formats (Microsoft Office files, PDF files, etc.) to HTML using Apache Tika.
TextDetectionFilterProvider	Used for detecting whether a URL contains textual content. Used by the Content Auditor.
WorkflowFilter	Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.)
CombinerFilterProvider	Combines content with extra metadata files (`.pan.txt` or `.fun.txt`). Works with text and HTML content only.

Custom filters

Custom filters that operate on the document content can also be written in Groovy. However, for HTML documents most custom filtering needs are best served by writing a JSoup filter. Custom filters are appropriate when filtering is required on non-HTML documents, or when the document must be processed as a whole piece of unstructured content.

See: writing filters

HTML documents - JSoup filtering

JSoup filtering allows for a series of sub-filters to be written that can perform targeted modification of the HTML document structure and content.

The main JSoup filter, which is included in the filter chain, takes the HTML document and converts it into a JSoup object (a structured DOM). The JSoup filters can then work with this object using DOM traversal and CSS-style selectors, which select on things such as element name, class and ID.

A series of JSoup filters can then be chained together to perform a series of operations on the structured object - this includes modifying content, injecting/deleting elements and restructuring the HTML.

The structured object is serialised at the end of the JSoup filter chain returning the text of the whole data structure to the next filter in the main filter chain.
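
The parse once, modify, serialise once flow described above can be sketched as follows. To stay self-contained, the sketch uses the JDK's built-in XML DOM as a stand-in for the JSoup object, so it only accepts well-formed XHTML and has no CSS selectors. The class name, the sub-filter names and the author value are hypothetical, and none of this is the Funnelback or JSoup API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative only: the JDK XML DOM stands in for the JSoup document object.
class DomSubChainSketch {

    // Sub-filter: cleans the title by removing a site-name suffix.
    static final Consumer<Document> CLEAN_TITLE = doc -> {
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        if (title != null) {
            title.setTextContent(title.getTextContent().replace(" | Example Site", ""));
        }
    };

    // Sub-filter: injects a metadata element into the head.
    static final Consumer<Document> INJECT_AUTHOR = doc -> {
        Element meta = doc.createElement("meta");
        meta.setAttribute("name", "author");
        meta.setAttribute("content", "Web Team");
        doc.getElementsByTagName("head").item(0).appendChild(meta);
    };

    /** Parses once, lets every sub-filter modify the shared DOM, then serialises once. */
    static String run(String xhtml, List<Consumer<Document>> subFilters) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            for (Consumer<Document> subFilter : subFilters) {
                subFilter.accept(doc);
            }
            Transformer serialiser = TransformerFactory.newInstance().newTransformer();
            serialiser.setOutputProperty(OutputKeys.METHOD, "xml");
            serialiser.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serialiser.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Annual report | Example Site</title></head>"
                + "<body><p>Content</p></body></html>";
        System.out.println(run(page, List.of(CLEAN_TITLE, INJECT_AUTHOR)));
    }
}
```

Because every sub-filter works on the same in-memory object, adding another sub-filter adds no extra parsing or serialisation step.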

See: HTML document filters (Jsoup filters)

Shared Groovy filters

Groovy filters that are designed to be shared amongst collections can be installed into $SEARCH_HOME/lib/java/groovy with a sub-folder structure that mirrors that of the collection's @groovy folder.

See: naming filters

Filtering performance tips

For HTML document filtering, use JSoup filters where possible, as the JSoup object is only created once at the start of the JSoup filter chain.
Avoid using external filters, as a new server process is started for every document that is filtered using an external filter.
