Writing custom filters

Introduction

Custom filters can be used to transform and analyse content prior to indexing. Custom filters are written in Groovy or Java and operate on the document in various forms:

  • Document content as a string
  • Document content as raw bytes
  • Document without access to content (for operations that don't access or modify the document content)
  • HTML document as a Jsoup object, allowing DOM manipulation via CSS style selectors

Detailed filter framework documentation is also available in Javadoc format.

A series of filter examples is presented below to illustrate how the various filter types are implemented.

Filter types

There are several different types of filter interfaces supported by Funnelback. These are:

Naming and locations

Collection-level Groovy filters should be stored in the collection's @groovy folder ($SEARCH_HOME/conf/COLLECTION_NAME/@groovy) under a folder structure that corresponds to the filter's package name.

If a filter is re-usable across collections it can also be saved to $SEARCH_HOME/lib/java/groovy/ under a folder structure that corresponds to the filter's package name.

Compiled .class files should be stored in the same location.

All filters, including Jsoup filters, must follow this naming and location scheme.

Example

Suppose the filter begins as follows:

package com.example;

import com.funnelback.filter.api.filters.*;
import com.funnelback.common.filter.jsoup.*;

public class MyFilter implements Filter {
    // Filter implementation goes here.
}

Because the package name is com.example and the class name of the filter is MyFilter, the filter must be stored at:

$SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/example/MyFilter.groovy
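
The mapping from a fully qualified class name to a file location can be sketched in a few lines of Java. This is only an illustration of the naming rule described above: the class FilterPathSketch, its method names and the placeholder values "/opt/funnelback" and "my-collection" are assumptions made for the example and are not part of Funnelback.

```java
// Sketch: derive the expected location of a Groovy filter from its
// fully qualified class name. All names here are illustrative only.
class FilterPathSketch {

    /** Converts e.g. "com.example.MyFilter" to "com/example/MyFilter.groovy". */
    static String relativePath(String fullyQualifiedName) {
        return fullyQualifiedName.replace('.', '/') + ".groovy";
    }

    /** Builds the collection-level location under the collection's @groovy folder. */
    static String collectionPath(String searchHome, String collection, String fullyQualifiedName) {
        return searchHome + "/conf/" + collection + "/@groovy/" + relativePath(fullyQualifiedName);
    }

    public static void main(String[] args) {
        // "/opt/funnelback" and "my-collection" are placeholder values.
        System.out.println(collectionPath("/opt/funnelback", "my-collection", "com.example.MyFilter"));
        // prints "/opt/funnelback/conf/my-collection/@groovy/com/example/MyFilter.groovy"
    }
}
```

Each dot in the package name becomes a folder separator, and the class name becomes the file name.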

Configuring the filter to be used

Typically you can add your custom filter to the filter chain by using its fully qualified name, that is, its package name followed by the class name. For the above example, the fully qualified name is com.example.MyFilter. To add it to the filter chain, append a comma and the fully qualified name to the filter.classes option in collection.cfg.

Example

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider,com.example.MyFilter

See filter.classes for more information about configuring the filter chain.

Importing external dependencies for use with Groovy scripts

Java classes can be utilised from Groovy scripts via import statements.

Dependencies should be grabbed before they are imported to minimise any chance of the Groovy script breaking after an upgrade.

Example

Ensure that the com.twitter.twittertext.Extractor class can be imported from version 3.0.1 of the twitter-text library.

@Grab(group='com.twitter.twittertext', module='twitter-text', version='3.0.1')
@GrabExclude('org.jsoup:jsoup') // Don't want conflicting versions

import com.twitter.twittertext.Extractor

See: Dependency management with Grape

Ensuring a filter only runs on certain file types

For filters that implement StringDocumentFilter or BytesDocumentFilter, a pre-filter check is used to determine if the filter should run. This is commonly set to only run on a document of a specific type, though any custom logic can be implemented here.

Restricting a filter to a document type is commonly achieved by checking the document's MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents.

  • document.getDocumentType().isHTML() returns true if the document is an HTML document.
  • document.getDocumentType().isJSON() returns true if the document is a JSON document.
  • document.getDocumentType().isXML() returns true if the document is an XML document.

Example

The following pre-filter check ensures that the filter applies only to HTML documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
  // Only run this filter on HTML documents
  if (document.getDocumentType().isHTML()) {
    return PreFilterCheck.ATTEMPT_FILTER;
  }
  return PreFilterCheck.SKIP_FILTER;
}
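
Because any custom logic can be used in the pre-filter check, the decision can combine the document type with other conditions. The following is a minimal, self-contained sketch of such logic, kept independent of the Funnelback API so that it can be run on its own. The class PreFilterLogicSketch, the Decision enum (a local stand-in for PreFilterCheck) and the "/news/" path rule are all assumptions made for illustration. In a real filter, the isHtml value would come from document.getDocumentType().isHTML(); how the document's URL is obtained should be confirmed in the Javadoc.

```java
import java.net.URI;

// Sketch of custom pre-filter logic, independent of the Funnelback API.
// Decision is a local stand-in for PreFilterCheck; it is not a Funnelback type.
class PreFilterLogicSketch {

    enum Decision { ATTEMPT_FILTER, SKIP_FILTER }

    /** Attempts the filter only for HTML documents whose URL path starts with /news/. */
    static Decision decide(boolean isHtml, URI uri) {
        String path = uri.getPath(); // may be null for opaque URIs
        if (isHtml && path != null && path.startsWith("/news/")) {
            return Decision.ATTEMPT_FILTER;
        }
        return Decision.SKIP_FILTER;
    }

    public static void main(String[] args) {
        // HTML document under /news/: the filter is attempted.
        System.out.println(decide(true, URI.create("https://example.com/news/item.html")));
        // HTML document elsewhere on the site: the filter is skipped.
        System.out.println(decide(true, URI.create("https://example.com/about.html")));
        // Non-HTML document under /news/: the filter is skipped.
        System.out.println(decide(false, URI.create("https://example.com/news/report.pdf")));
    }
}
```

Keeping the decision in a small method with plain inputs makes it easy to test separately from the filter itself.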

Logging

When developing and debugging a filter it's usually convenient to be able to write debug messages. To have the debug messages written to the collection filter.log (or to the inline filter log, depending on your collection settings) you should use the Log4j 2 logging framework. To do so:

  • Annotate your class with @groovy.util.logging.Log4j2
  • Use the log object that the annotation provides in your methods to output debug messages: log.info("Filtering content: [{}]", content)

Note: By default, only log messages at the WARN level or above are output. That means that your messages will only appear if:

  • You use log.warn(), log.error() or log.fatal(), or
  • You re-configure the logging system for your specific namespace so that messages from log.info(), log.debug() or log.trace() are output.

If the filter class is in a package within the "com.funnelback" namespace, the default log level is INFO.

The logging system can be configured using SEARCH_HOME/conf/log4j2.xml.default as a starting point:

  • Either copy it to SEARCH_HOME/conf/log4j2.xml to have your configuration apply to all collections
  • Or copy it to SEARCH_HOME/conf/<collection>/log4j2.xml to apply it to this collection only.

Testing filters

Jsoup filters support a simplified test system; see: testing Jsoup filters.

For general filters, tests can be added to the FilterTest inner class contained within the filter. Methods that are annotated with @Test will be run when executing the Groovy filter on the command line.

See: Basic filter example for an example of implementing and running tests on a filter.

Running filter tests on the command line

A filter should pass all of its tests before it is added to the collection's filter chain. To run the tests for a Groovy filter, run:

$SEARCH_HOME/linbin/java/bin/java -cp $SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/tools/groovy/bin/groovy groovy.ui.GroovyMain $SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/myfilters/ExampleGroovyFilter.groovy

The output should show that all tests pass.

Constructors

Generally you can use a no-argument constructor. However, other constructors are available if you need access to the search home or collection name. The filter framework will automatically call one of the constructors listed below. For the MyFilter example above, the following constructors could be used:

No argument constructor:

A constructor that takes no arguments.

public class MyFilter implements Filter {
    public MyFilter() {
       // Your constructor code here.
    }
}

Constructor given search home and collection name:

This constructor is given the search home as a java.io.File and the collection name as a String. If it is present, it will be called in preference to the no-argument constructor.

import java.io.File;

public class MyFilter implements Filter {
    public MyFilter(File searchHome, String collectionName) {
       // Your constructor code here.
    }
}
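
The constructor preference described above can be illustrated with plain Java reflection. This sketch is not Funnelback's implementation. It only shows one way a framework could prefer the (File, String) constructor and fall back to the no-argument one. The classes ConstructorPreferenceSketch, DemoFilter and SimpleFilter, and the placeholder path and collection name, are assumptions made for the example.

```java
import java.io.File;

// Sketch of constructor preference using reflection.
// Not Funnelback's implementation; all names here are illustrative.
class ConstructorPreferenceSketch {

    /** Stand-in filter that offers both constructor forms. */
    static class DemoFilter {
        final String createdWith;
        public DemoFilter() { this.createdWith = "no-arg"; }
        public DemoFilter(File searchHome, String collectionName) {
            this.createdWith = "searchHome+collection:" + collectionName;
        }
    }

    /** Stand-in filter that only offers the no-argument constructor. */
    static class SimpleFilter {
        final String createdWith = "no-arg";
        public SimpleFilter() { }
    }

    /** Prefers the (File, String) constructor; falls back to the no-argument one. */
    static <T> T instantiate(Class<T> type, File searchHome, String collectionName) {
        try {
            try {
                return type.getDeclaredConstructor(File.class, String.class)
                           .newInstance(searchHome, collectionName);
            } catch (NoSuchMethodException e) {
                // No (File, String) constructor declared: use the no-argument one.
                return type.getDeclaredConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        File home = new File("/opt/funnelback"); // placeholder path
        // prints "searchHome+collection:my-collection"
        System.out.println(instantiate(DemoFilter.class, home, "my-collection").createdWith);
        // prints "no-arg"
        System.out.println(instantiate(SimpleFilter.class, home, "my-collection").createdWith);
    }
}
```

A filter that needs neither value can declare only the no-argument constructor and will still be instantiated.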

Filter examples

v15.24.0