Writing custom filters
Introduction
Custom filters can be used to transform and analyse content prior to indexing. Custom filters are written in Groovy or Java and operate on the document in various forms:
- Document content as a string
- Document content as raw bytes
- Document without access to content (for operations that don't access or modify the document content)
- HTML document as a Jsoup object, allowing DOM manipulation via CSS style selectors
Detailed filter framework documentation is also available in Javadoc format.
A series of filter examples is presented below to illustrate the implementation of various filter types.
Filter types
There are several different types of filter interfaces supported by Funnelback. These are:
IJSoupFilter
: For filtering HTML documents. See Jsoup filter.
StringDocumentFilter
: For filtering string (non-binary) content, e.g. plain text, XML, HTML, JSON, etc. See Manipulating string documents for an example.
BytesDocumentFilter
: For filtering binary content, e.g. PDF. See Converting raw byte (binary) documents to Strings for an example.
Filter
: Used for filters that do not edit or read the document content but should run on all documents. See Removing a document for an example.
Naming and locations
Collection-level Groovy filters should be stored in the collection's @groovy folder ($SEARCH_HOME/conf/COLLECTION_NAME/@groovy) under a folder structure that corresponds to the filter's package name.
If a filter is reusable across collections, it can also be saved to $SEARCH_HOME/lib/java/groovy/ under a folder structure that corresponds to the filter's package name.
Compiled .class files should also be stored in the same location.
All filters must follow the same naming and location scheme including Jsoup filters.
Example
If our filter looked like:
package com.example;
import com.funnelback.filter.api.filters.*;
import com.funnelback.common.filter.jsoup.*;
public class MyFilter implements Filter {
As the package name is com.example and the class name of the filter is MyFilter, the location for that filter must be:
$SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/example/MyFilter.groovy
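The mapping from a fully qualified class name to a file location can be sketched in a few lines of Java. This is only an illustration: FilterLocation and groovySourcePath are hypothetical names invented for this example and are not part of the Funnelback API. The sketch simply replaces each package separator with a directory separator and appends the .groovy extension.

```java
/** Hypothetical helper for illustration only; not part of the Funnelback API. */
final class FilterLocation {

    /**
     * Derives the expected source file location of a Groovy filter from the
     * collection's @groovy folder and the filter's fully qualified class name.
     */
    static String groovySourcePath(String groovyDir, String fullyQualifiedName) {
        // Package separators ('.') become directory separators ('/').
        return groovyDir + "/" + fullyQualifiedName.replace('.', '/') + ".groovy";
    }

    public static void main(String[] args) {
        // The directory is kept as a literal placeholder, exactly as in the text above.
        System.out.println(groovySourcePath(
                "$SEARCH_HOME/conf/COLLECTION_NAME/@groovy", "com.example.MyFilter"));
    }
}
```

Run with the values from the example above, the sketch prints $SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/example/MyFilter.groovy.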
Configuring the filter to be used
Typically you can add your custom filter to the filter chain by using its fully qualified name, that is, its package name followed by the class name. For the above example, the fully qualified name is com.example.MyFilter. This can be added to the filter chain in collection.cfg by appending a comma and the fully qualified name to the filter.classes option.
Example
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider,com.example.MyFilter
See filter.classes for more information about configuring the filter chain.
Importing external dependencies for use with Groovy scripts
Java classes can be utilised from Groovy scripts via import statements.
Dependencies should be grabbed before they are imported to minimise any chance of the groovy script breaking after an upgrade.
Example
Ensure that the com.twitter.twittertext.Extractor class can be imported from version 3.0.1 of the twitter-text library.
@Grab(group='com.twitter.twittertext', module='twitter-text', version='3.0.1')
@GrabExclude('org.jsoup:jsoup') // Don't want conflicting versions
import com.twitter.twittertext.Extractor
See: Dependency management with Grape
Ensuring a filter only runs on certain file types
For filters that implement StringDocumentFilter or BytesDocumentFilter, a pre-filter check is used to determine if the filter should run. This is commonly set to only run on a document of a specific type, though any custom logic can be implemented here.
Restricting a filter to a document type is commonly achieved by checking the document's MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents:
- document.getDocumentType().isHTML() returns true if the document is a HTML document.
- document.getDocumentType().isJSON() returns true if the document is a JSON document.
- document.getDocumentType().isXML() returns true if the document is an XML document.
Example
The following pre-filter check ensures that the filter applies only to HTML documents.
public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    // Only run this filter on HTML documents
    if (document.getDocumentType().isHTML()) {
        return PreFilterCheck.ATTEMPT_FILTER;
    }
    return PreFilterCheck.SKIP_FILTER;
}
Logging
When developing and debugging a filter it's usually convenient to be able to write debug messages. To have the debug messages written to the collection's filter.log (or to the inline filter log, depending on your collection settings) you should use the Log4j logging framework. To do so:
- Annotate your class with @groovy.util.logging.Log4j2
- Use the log object that the annotation provides inside your methods to output debug messages: log.info("Filtering content: [{}]", content)
Note: The default logging configuration only outputs messages at the WARN level or above. That means that your messages will only appear if:
- You use log.warn(), log.error() or log.fatal()
- You re-configure the logging system for your specific namespace if you want to use log.info(), log.debug() or log.trace().
If the package of the class of the filter is within the "com.funnelback" namespace, the default log level of messages is set to INFO.
The logging system can be configured using SEARCH_HOME/conf/log4j2.xml.default as a starting point:
- Either copy it to SEARCH_HOME/conf/log4j2.xml to have your configuration apply to all collections
- Or copy it to SEARCH_HOME/conf/<collection>/log4j2.xml to apply it to this collection only.
Testing filters
For testing Jsoup filters, which support a simplified test system, see: testing Jsoup filters.
For general filters, tests can be added to the FilterTest inner class contained within the filter. Methods that are annotated with @Test will be run when executing the Groovy filter on the command line.
See: Basic filter example for an example of implementing and running tests on a filter.
Running filter tests on the command line
A filter should pass all tests before it is added to the collection's filter chain. To run the tests for a Groovy filter, run:
$SEARCH_HOME/linbin/java/bin/java -cp $SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/tools/groovy/bin/groovy groovy.ui.GroovyMain $SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/myfilters/ExampleGroovyFilter.groovy
The output should show the tests are passing.
Constructors
Generally you can use a no-argument constructor; however, other constructors are available if you need access to the search home or collection name. The filter framework will automatically call one of the constructors listed below. For our MyFilter example from above, the following constructors could be used:
No argument constructor:
A constructor that takes no arguments.
public class MyFilter implements Filter {
    public MyFilter() {
        // Your constructor code here.
    }
    // Remaining filter methods go here.
}
Constructor given search home and collection name:
This constructor is given the search home as a java.io.File and the collection name as a String. This constructor will be called in preference to the no-argument constructor.
import java.io.File;
public class MyFilter implements Filter {
    public MyFilter(File searchHome, String collectionName) {
        // Your constructor code here.
    }
    // Remaining filter methods go here.
}
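The preference order described above can be illustrated with a small, self-contained reflection sketch. It is an assumption about how such a lookup could work, not Funnelback's actual implementation: ConstructorPreferenceSketch, instantiate and the two stand-in classes are hypothetical names used only for this example, and the search home path is a placeholder.

```java
import java.io.File;

/** Hypothetical sketch; none of these names belong to the Funnelback API. */
final class ConstructorPreferenceSketch {

    /** Stand-in for a filter that only declares a no-argument constructor. */
    static final class NoArgOnly {
        final String collection;

        NoArgOnly() {
            this.collection = null;
        }
    }

    /** Stand-in for a filter that declares both constructors. */
    static final class BothConstructors {
        final String collection;

        BothConstructors() {
            this.collection = "unset";
        }

        BothConstructors(File searchHome, String collectionName) {
            this.collection = collectionName;
        }
    }

    /**
     * Prefers the (File, String) constructor and falls back to the
     * no-argument constructor when the former is not declared.
     */
    static <T> T instantiate(Class<T> type, File searchHome, String collectionName) {
        try {
            try {
                return type.getDeclaredConstructor(File.class, String.class)
                        .newInstance(searchHome, collectionName);
            } catch (NoSuchMethodException e) {
                // No (File, String) constructor declared: use the no-argument one.
                return type.getDeclaredConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        File home = new File("/opt/funnelback"); // placeholder path, for illustration only
        System.out.println(instantiate(BothConstructors.class, home, "example").collection);
        System.out.println(instantiate(NoArgOnly.class, home, "example").collection);
    }
}
```

In this sketch the class that declares both constructors is built through the two-argument one, so its collection field holds the supplied name, while the class with only a no-argument constructor is still instantiated successfully.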
Filter examples
- Basic filter example: Simple filter that injects content into a document. Also demonstrates writing of tests for non-Jsoup filters.
- Manipulating HTML documents: This might be useful for editing documents that are HTML or for extracting metadata from HTML.
- Manipulating string (non binary) documents: A simple way of filtering a document as a string.
- Filters which read collection configuration options: Demonstrates a filter which reads options from collection.cfg.
- Filters which read a custom configuration file: Shows how to read from a custom configuration file for Jsoup and general document filters.
- Adding metadata based on document content: Demonstrates adding metadata values to a document based on the document content.
- Adding metadata to all documents: Demonstrates adding metadata to a document regardless of the document content.
- Accessing the filtered metadata: Shows how to iterate over the multimap containing metadata added via filters.
- Modifying document URLs: Demonstrates modifying the document URL for any document.
- Filtering a document into multiple documents: Demonstrates splitting a single input document into multiple documents.
- Filtering a HTML document into multiple documents: Demonstrates splitting a single HTML document into multiple HTML documents.
- Removing a document: Demonstrates removing a document using the filters, typically resulting in that document not being available in the search index.
- Altering the document type: Demonstrates fixing the document type based on the content of the document.
- Converting raw byte (binary) documents to Strings: This might be done when converting from a binary format such as PDF to a text format such as HTML or XML.
- Manipulating raw byte (binary) documents: Demonstrates how to base64 encode a binary document.
- Deprecated ScriptedFilterProvider: Provided for backwards compatibility with existing filters and typically should not be used.
- Deprecated IFilterProvider: Provided for backwards compatibility with existing filters and typically should not be used.