Document filtering
Introduction
Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.
Filtering passes the raw document through multiple chained filters, each of which may modify the document in a different way. Modifications include converting the document from a binary format such as PDF to an indexable text format such as HTML, adding metadata, or altering the document's URL.
Filtering is run during the gather phase of an update. For push collections filtering can be run when a document is added.
Note: A full update is required after making any changes to filters as documents that are copied during an incremental update are not re-filtered. Full updates are started from the advanced update screen.
The filter chain
During the filter phase the document passes through a series of document filters with the modified output being passed through to the next filter. The series of filters is referred to as the filter chain.
There are a number of preset filters that are used to perform tasks such as extracting text from a binary document and cleaning the titles.
A typical filter process works as follows. A binary document is converted to text using the Tika filters, which extract the document text and output the document as HTML. This HTML is then passed through the JSoup filter, which runs a separate chain of JSoup filters that allow targeted modification of the HTML content and structure. Finally, a custom filter performs a number of modifications to the content.
JSoup filters should be used for HTML documents when making modifications to the document structure, or performing operations that select and transform the document's DOM. Custom JSoup filters can be written to perform operations such as:
- Injecting metadata
- Cleaning titles
- Scraping content (e.g. extracting breadcrumbs to metadata)
The filter chain is made up of chains and choices, which are separated using two types of delimiters. The delimiters control whether the content passes through a single filter from a set of filters (a choice, indicated by commas) or through each filter in turn (a chain, indicated by colons). JSoup filters use a separate filter chain.
The set of filters below would be processed as follows: the content would pass through either Filter3, Filter2 or Filter1 before passing through Filter4 and Filter5.
Filter1,Filter2,Filter3:Filter4:Filter5
There are some caveats when specifying filter chains:
- Choice sets are checked in reverse order, i.e. filters that appear last in the list are used first if they are capable of filtering a given document type.
- When specifying a combination of choice and chain filters, the comma (,) has higher precedence than the colon (:). In other words, it is possible to have a chain of choice filters, but it is not possible to have a choice between several chains of filters.
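The delimiter rules and caveats above can be illustrated with a small Java sketch. It is not Funnelback's own parser: the class name FilterChainSketch and the parse method are hypothetical, and the code only models the documented behaviour (colons split the chain into steps, commas split a step into alternatives, and alternatives are checked last-listed first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Funnelback.
final class FilterChainSketch {

    /**
     * Splits a filter chain string into chain steps. Colons separate steps and
     * commas separate the alternatives within a step, so a comma binds more
     * tightly than a colon. Each step's alternatives are returned in the order
     * they would be checked, which is the reverse of the order they are listed.
     */
    static List<List<String>> parse(String spec) {
        List<List<String>> steps = new ArrayList<>();
        for (String step : spec.split(":")) {
            List<String> choices = new ArrayList<>(Arrays.asList(step.split(",")));
            Collections.reverse(choices); // last listed is checked first
            steps.add(choices);
        }
        return steps;
    }

    public static void main(String[] args) {
        // prints [[Filter3, Filter2, Filter1], [Filter4], [Filter5]]
        System.out.println(parse("Filter1,Filter2,Filter3:Filter4:Filter5"));
    }
}
```

Because the string is split on colons first, every comma-separated group ends up inside a single step. This is why a chain of choices can be expressed but a choice between chains cannot.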
There is also support for custom document filters written in Groovy. Custom filters receive the document's URL and content as input and must return the transformed document text, ready to pass on to the next filter. A custom filter can make almost any change to the content, and is written in Groovy (and Java) code.
Custom filters should be used when a JSoup filter is not appropriate. Custom filters offer more flexibility but are more expensive to run. Custom filters can be used for operations such as:
- Manipulating complete documents as binary or string data
- Splitting a document into multiple documents
- Modifying the document type or URL
- Removing documents
- Transforming HTML or JSON documents
- Implementing document conversion for binary documents
- Processing/analysis of documents where structure is not relevant
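As a rough illustration of the custom filter contract described above (URL and content in, transformed text out), the following Java sketch models two whole-document operations from the list: removing a document and rewriting its text. The TextFilter interface, the /drafts/ URL rule and the replacement strings are all assumptions made for this example; they are not Funnelback's filter API, which is covered in the writing filters documentation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical model for illustration only; not Funnelback's filter API.
final class CustomFilterSketch {

    /** Assumed filter shape: (url, content) -> transformed content, or empty to remove the document. */
    interface TextFilter extends BiFunction<String, String, Optional<String>> {}

    /** Removes documents whose URL contains a (hypothetical) /drafts/ path segment. */
    static final TextFilter DROP_DRAFTS =
        (url, content) -> url.contains("/drafts/") ? Optional.empty() : Optional.of(content);

    /** Treats the document as one unstructured string and replaces a phrase throughout. */
    static final TextFilter FIX_BRANDING =
        (url, content) -> Optional.of(content.replace("ACME Corp", "Acme"));

    /** Passes the content through each filter in turn, stopping once a filter removes the document. */
    static Optional<String> run(List<TextFilter> chain, String url, String content) {
        Optional<String> current = Optional.of(content);
        for (TextFilter filter : chain) {
            if (!current.isPresent()) {
                break; // document was removed by an earlier filter
            }
            current = filter.apply(url, current.get());
        }
        return current;
    }

    public static void main(String[] args) {
        List<TextFilter> chain = Arrays.asList(DROP_DRAFTS, FIX_BRANDING);
        System.out.println(run(chain, "https://example.com/about", "Welcome to ACME Corp"));
        System.out.println(run(chain, "https://example.com/drafts/new", "ACME Corp draft"));
    }
}
```

Neither operation needs to know anything about the document's structure, which is the situation where a custom filter is a better fit than a JSoup filter.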
Document filters
Document filters make up the main filter chain within Funnelback.
Built-in filters
Funnelback ships with the following built-in filters:
Class | Description |
---|---|
CSVToXML | Converts records in a CSV, TSV, SQL or Excel document to multiple XML documents. |
DocumentFixerFilterProvider | Analyses the document title and attempts to replace it if the title is not considered a good title. HTML documents only. |
ExternalFilterProvider | Uses external programs to convert documents. |
ForceCSVMime | Sets the MIME type of all documents to text/csv. |
ForceJSONMime | Sets the MIME type of all documents to application/json. |
ForceXMLMime | Sets the MIME type of all documents to text/xml. |
InjectNoIndexFilterProvider | Automatically inserts noindex tags based on CSS selectors. |
JSONToXML | Converts JSON documents to XML. |
JSoupProcessingFilterProvider | Converts HTML documents to and from a JSoup object and runs an extra chain of JSoup filters. |
MetadataNormaliser | Used to normalise and replace metadata fields. |
TikaFilterProvider | Converts binary files of specific file formats (Microsoft Office files, PDF files, etc.) to HTML using Apache Tika. |
TextDetectionFilterProvider | Used for detecting whether a URL contains textual content. Used by the Content Auditor. |
WorkflowFilter | Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.) |
CombinerFilterProvider | Combines content with extra metadata files (.pan.txt or .fun.txt). Works with text and HTML content only. |
Custom filters
Custom filters that operate on the document content can also be written in Groovy. However, for HTML documents most custom filtering needs are best served by writing a JSoup filter. Custom filters are appropriate when filtering is required on non-HTML documents, or when the document must be processed as a whole piece of unstructured content.
See: writing filters
HTML documents - JSoup filtering
JSoup filtering allows for a series of sub-filters to be written that can perform targeted modification of the HTML document structure and content.
The main JSoup filter, which is included in the filter chain, takes the HTML document and converts it into a JSoup (structured DOM) object. The JSoup filters can then work with this object using DOM traversal and CSS-style selectors, which select on things such as element name, class and ID.
A series of JSoup filters can then be chained together to perform a series of operations on the structured object, including modifying content, injecting or deleting elements, and restructuring the HTML.
The structured object is serialised at the end of the JSoup filter chain, returning the text of the whole data structure to the next filter in the main filter chain.
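The parse once, modify in place, serialise once flow described above can be sketched generically in Java. This is only a model of the pattern: the ParseOnceSketch class and its run method are hypothetical, and a StringBuilder stands in for the parsed document so that the example has no dependencies. With the real jsoup library the parse and serialise steps would correspond to calls such as Jsoup.parse(html) and Document.outerHtml().

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical model for illustration only; not Funnelback's JSoup filter API.
final class ParseOnceSketch {

    /**
     * Parses the text once, lets every sub-filter modify the parsed object in
     * place, then serialises once so the next filter in the main chain
     * receives plain text again.
     */
    static <D> String run(String text,
                          Function<String, D> parse,
                          List<Consumer<D>> subFilters,
                          Function<D, String> serialise) {
        D document = parse.apply(text);         // single parse at the start
        for (Consumer<D> subFilter : subFilters) {
            subFilter.accept(document);         // in-place modification
        }
        return serialise.apply(document);       // single serialisation at the end
    }

    /** Runs two stand-in sub-filters over a StringBuilder that plays the role of the DOM. */
    static String demo() {
        List<Consumer<StringBuilder>> subFilters = Arrays.asList(
            doc -> doc.insert(0, "<!-- checked -->"),
            doc -> doc.append("<!-- end -->"));
        return run("<p>Hello</p>", StringBuilder::new, subFilters, StringBuilder::toString);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

However many sub-filters are added to the list, the parse and serialise steps each run once per document, which is the reason JSoup filters are the cheaper choice for HTML.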
See: HTML document filters (Jsoup filters)
Shared Groovy filters
Groovy filters that are designed to be shared amongst collections can be installed into $SEARCH_HOME/lib/java/groovy with a sub-folder structure that mirrors that of the collection's @groovy folder.
See: naming filters
Filtering performance tips
- For HTML document filtering, use JSoup filters where possible, as the JSoup object is only created once at the start of the JSoup filter chain.
- Avoid using external filters as a new server process is started for every document that is filtered using an external filter.