Writing custom filters

Introduction

Custom filters can be used to transform and analyse content prior to indexing. Custom filters are written in Groovy or Java and operate on the document in various forms:

Document content as a string
Document content as raw bytes
Document without access to content (for operations that don't access or modify the document content)
HTML document as a Jsoup object, allowing DOM manipulation via CSS style selectors

Detailed filter framework documentation is also available in Javadoc format.

A series of filter examples are presented below to illustrate implementation of various filter types.

Filter types

There are several different types of filter interfaces supported by Funnelback. These are:

IJSoupFilter: For filtering HTML documents. See Jsoup filter.
StringDocumentFilter: For filtering string (non-binary) content e.g. plain text, XML, HTML, JSON, etc. See Manipulating string documents for an example.
BytesDocumentFilter: For filtering binary content e.g. PDF. See Converting raw byte (binary) documents to Strings for an example.
Filter: Used for filters that do not edit or read the document content but should run on all documents. See Removing a document for an example.

Naming and locations

Collection-level Groovy filters should be stored in the collection's @groovy folder ($SEARCH_HOME/conf/COLLECTION_NAME/@groovy) under a folder structure that corresponds to the filter's package name.

If a filter is re-usable across collections it can also be saved to $SEARCH_HOME/lib/java/groovy/ under a folder structure that corresponds to the filter's package name.

Compiled .class files should also be in the same location.

All filters must follow the same naming and location scheme including Jsoup filters.

Example

If our filter looked like:

package com.example;

import com.funnelback.filter.api.filters.*;
import com.funnelback.common.filter.jsoup.*;

public class MyFilter implements Filter {

Because the package name is com.example and the class name of the filter is MyFilter, the filter must be saved at:

$SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/example/MyFilter.groovy
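
The mapping from package name to folder structure can be sketched as a small helper. This is only an illustration and not part of the filter framework: the class FilterLocationExample, the method filterPath and the /opt/funnelback path are hypothetical names chosen for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper (hypothetical, not part of the Funnelback API): derives
// the expected location of a Groovy filter from its fully qualified class name.
class FilterLocationExample {

    // groovyDir stands for the collection's @groovy folder, i.e.
    // $SEARCH_HOME/conf/COLLECTION_NAME/@groovy
    static Path filterPath(Path groovyDir, String fullyQualifiedName) {
        // Each package segment becomes a folder; the class name becomes the file name.
        String relative = fullyQualifiedName.replace('.', '/') + ".groovy";
        return groovyDir.resolve(relative);
    }

    public static void main(String[] args) {
        // Assumed search home and collection name, for demonstration only.
        Path groovyDir = Paths.get("/opt/funnelback/conf/example-collection/@groovy");
        System.out.println(filterPath(groovyDir, "com.example.MyFilter"));
    }
}
```

If the printed path does not match where the file actually lives, the package declaration and the folder structure are out of step.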

Configuring the filter to be used

Typically you can add your custom filter to the filter chain by using its fully qualified name, that is, its package name followed by the class name. For the above example, the fully qualified name is com.example.MyFilter. This can be added to the filter chain in collection.cfg by appending a comma and the fully qualified name to the filter.classes option.

Example

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider,com.example.MyFilter

See filter.classes for more information about configuring the filter chain.

Importing external dependencies for use with Groovy scripts

Java classes can be utilised from Groovy scripts via import statements.

Dependencies should be grabbed (declared with @Grab) before they are imported, to minimise the chance of the Groovy script breaking after an upgrade.

Example

Ensure that the com.twitter.twittertext.Extractor class can be imported from version 3.0.1 of the twitter-text library.

@Grab(group='com.twitter.twittertext', module='twitter-text', version='3.0.1')
@GrabExclude('org.jsoup:jsoup') // Don't want conflicting versions

import com.twitter.twittertext.Extractor

See: Dependency management with Grape

Ensuring a filter only runs on certain file types

For filters that implement StringDocumentFilter or BytesDocumentFilter, a pre-filter check is used to determine if the filter should run. This is commonly set to only run on a document of a specific type, though any custom logic can be implemented here.

Restricting a filter to a document type is commonly achieved by checking the document's MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents.

document.getDocumentType().isHTML() returns true if the document is an HTML document.
document.getDocumentType().isJSON() returns true if the document is a JSON document.
document.getDocumentType().isXML() returns true if the document is an XML document.

Example

The following pre-filter check ensures that the filter applies only to HTML documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
  // Only run this filter on HTML documents
  if (document.getDocumentType().isHTML()) {
    return PreFilterCheck.ATTEMPT_FILTER;
  }
  return PreFilterCheck.SKIP_FILTER;
}
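
For document types without a built-in helper, the same decision can be made from the MIME type itself. The sketch below is self-contained and uses stand-ins rather than the real framework classes: the nested PreFilterCheck enum only mirrors the two values used above, and the canFilter(String) signature is an assumption for illustration. How the MIME type string is obtained from the document is not shown here; consult the Javadoc for that.

```java
import java.util.Locale;

// Self-contained sketch: restricts a filter to PDF documents by MIME type.
class PreFilterCheckSketch {

    // Stand-in for the framework's PreFilterCheck (hypothetical local copy).
    enum PreFilterCheck { ATTEMPT_FILTER, SKIP_FILTER }

    // Decides from a raw MIME type string whether a PDF-only filter should run.
    static PreFilterCheck canFilter(String mimeType) {
        if (mimeType == null) {
            return PreFilterCheck.SKIP_FILTER;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.equals("application/pdf")
                ? PreFilterCheck.ATTEMPT_FILTER
                : PreFilterCheck.SKIP_FILTER;
    }

    public static void main(String[] args) {
        System.out.println(canFilter("application/pdf"));
        System.out.println(canFilter("text/html; charset=UTF-8"));
    }
}
```

Stripping parameters before comparing avoids skipping documents whose type carries a charset or similar suffix.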

Logging

When developing and debugging a filter it is usually convenient to write debug messages. To have these messages written to the collection's filter.log (or to the inline filter log, depending on your collection settings), use the Log4j2 logging framework. To do so:

Annotate your class with @groovy.util.logging.Log4j2
Use the log object that the annotation provides in your methods to output debug messages: log.info("Filtering content: [{}]", content)

Note: By default, only messages at the WARN level or above are output. That means that your messages will only appear if:

You use log.warn(), log.error() or log.fatal()
You reconfigure the logging system for your specific namespace, which is required if you want to use log.info(), log.debug() or log.trace().

If the package of the class of the filter is within the "com.funnelback" namespace, the default log level of messages is set to INFO.
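
The default behaviour described above can be modelled in a few lines. This is a simplified model for reasoning about which messages appear, not the logging framework itself: the class LogLevelSketch and its methods are hypothetical, and the model ignores any custom log4j2.xml configuration.

```java
// Simplified model of the default log thresholds described in the text.
class LogLevelSketch {

    // The standard Log4j2 levels, ordered from least to most severe.
    enum Level { TRACE, DEBUG, INFO, WARN, ERROR, FATAL }

    // Default threshold: INFO inside the com.funnelback namespace, WARN elsewhere.
    static Level defaultThreshold(String className) {
        boolean inFunnelbackNamespace =
                className.equals("com.funnelback") || className.startsWith("com.funnelback.");
        return inFunnelbackNamespace ? Level.INFO : Level.WARN;
    }

    // A message is output when its level is at or above the threshold.
    static boolean isLogged(String className, Level messageLevel) {
        return messageLevel.compareTo(defaultThreshold(className)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isLogged("com.example.MyFilter", Level.INFO));
        System.out.println(isLogged("com.example.MyFilter", Level.WARN));
    }
}
```

Under this model a log.info() call in com.example.MyFilter is dropped while log.warn() is written, which is why a filter that appears to log nothing is often just logging below the threshold.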

The logging system can be configured using SEARCH_HOME/conf/log4j2.xml.default as a starting point:

Either copy it to SEARCH_HOME/conf/log4j2.xml to have your configuration apply to all collections
Or copy it to SEARCH_HOME/conf/<collection>/log4j2.xml to apply it to this collection only.

Testing filters

Jsoup filters support a simplified test system. See: Testing Jsoup filters.

For general filters, tests can be added to the FilterTest inner class contained within the filter. Methods that are annotated with @Test will be run when executing the groovy filter on the command line.

See: Basic filter example for an example of implementing and running tests on a filter.

Running filter tests on the command line

A filter should pass all tests before it is added to the collection's filter chain. To run the tests for a Groovy filter run:

$SEARCH_HOME/linbin/java/bin/java -cp $SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/tools/groovy/bin/groovy groovy.ui.GroovyMain $SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/myfilters/ExampleGroovyFilter.groovy

The output should show the tests are passing.

Constructors

Generally you can use a no-argument constructor. Other constructors are available if you need access to the search home or collection name. The filter framework will automatically call one of the constructors listed below. For our MyFilter example from above the following constructors could be used:

No argument constructor:

A constructor that takes no arguments.

public class MyFilter implements Filter {
    public MyFilter() {
        // Your constructor code here.
    }

    // Filter methods omitted.
}

Constructor given search home and collection name.

This constructor is given the search home as a java.io.File and the collection name as a String. If both constructors are present, this one is called in preference to the no-argument constructor.

import java.io.File;

public class MyFilter implements Filter {
    public MyFilter(File searchHome, String collectionName) {
        // Your constructor code here.
    }

    // Filter methods omitted.
}
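
The preference between the two constructors can be pictured with plain Java reflection. The sketch below is an assumption about how such a selection could work, not the filter framework's actual implementation: ConstructorSelectionSketch, instantiate and the two nested example classes are hypothetical names.

```java
import java.io.File;

// Sketch of "prefer (File, String), fall back to no-argument" constructor selection.
class ConstructorSelectionSketch {

    // Example class exposing both constructor forms.
    static class TwoConstructors {
        final String description;

        public TwoConstructors() {
            description = "no-arg";
        }

        public TwoConstructors(File searchHome, String collectionName) {
            description = "searchHome=" + searchHome.getPath() + ", collection=" + collectionName;
        }
    }

    // Example class exposing only the no-argument constructor.
    static class NoArgOnly {
        public NoArgOnly() {
        }
    }

    static <T> T instantiate(Class<T> type, File searchHome, String collectionName) {
        try {
            try {
                // Preferred form: constructor taking search home and collection name.
                return type.getDeclaredConstructor(File.class, String.class)
                        .newInstance(searchHome, collectionName);
            } catch (NoSuchMethodException e) {
                // Fall back to the no-argument constructor.
                return type.getDeclaredConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        File searchHome = new File("search-home"); // assumed location, for demonstration only
        System.out.println(instantiate(TwoConstructors.class, searchHome, "example").description);
        System.out.println(instantiate(NoArgOnly.class, searchHome, "example").getClass().getSimpleName());
    }
}
```

A practical consequence is that code placed only in the no-argument constructor will not run when the two-argument constructor is also defined.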

Filter examples

Basic filter example: Simple filter that injects content into a document. Also demonstrates writing of tests for non-Jsoup filters.
Manipulating HTML documents: This might be useful for editing documents that are HTML or for extracting metadata from HTML.
Manipulating string (non binary) documents: A simple way of filtering a document as a string.
Filters which read collection configuration options: Demonstrates a filter which reads options from collection.cfg.
Filters which read a custom configuration file: Shows how to read from a custom configuration file for Jsoup and general document filters.
Adding metadata based on document content: Demonstrates adding metadata to a document based on the document content.
Adding metadata to all documents: Demonstrates adding metadata to a document regardless of the document content.
Accessing the filtered metadata: Shows how to iterate over the multimap containing metadata added via filters.
Modifying document URLs: Demonstrates modifying the document URL for any document.
Filtering a document into multiple documents: Demonstrates splitting a single input document into multiple documents.
Filtering an HTML document into multiple documents: Demonstrates splitting a single HTML document into multiple HTML documents.
Removing a document: Demonstrates removing a document using the filters, typically resulting in that document not being available in the search index.
Altering the document type: Demonstrates fixing the document type based on the content of the document.
Converting raw byte (binary) documents to Strings: This might be done when converting from a binary format such as PDF to a text format such as HTML or XML.
Manipulating raw byte (binary) documents: Demonstrates how to base64 encode a binary document.
Deprecated ScriptedFilterProvider: Provided for backwards compatibility with existing filters and typically should not be used.
Deprecated IFilterProvider: Provided for backwards compatibility with existing filters and typically should not be used.
