Filter example: basic document filter

Description

Filters can be used to add metadata to a document regardless of its document type.

In this example, a StringDocumentFilter is used to prepend the string "Example: " to the document content.

Example

Here is a very simple example Groovy filter script to use as a starting point. Copy this filter into $SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/myfilters/ExampleGroovyFilter.groovy

// Filename:  ExampleGroovyFilter.groovy

// This package declaration should match the .groovy file location
// on disk as well as the name used for the filter.classes collection.cfg setting.
package com.myfilters;

import java.net.URI;
import org.junit.Assert;
import org.junit.Test;
import com.funnelback.filter.api.*;
import com.funnelback.filter.api.documents.*;
import com.funnelback.filter.api.filters.*;
import com.funnelback.filter.api.mock.*;

// This annotation provides a logger under the "log" name
@groovy.util.logging.Log4j2
public class ExampleGroovyFilter implements StringDocumentFilter {

  /*
   * The result of this method determines whether the filter is run. In this
   * example the filter is only run if the document type, derived from the
   * Content-Type returned by the web server, is HTML. If the document is not
   * HTML, the filter is skipped.
   */
  @Override
  public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    if (document.getDocumentType().isHTML()) {
      return PreFilterCheck.ATTEMPT_FILTER;
    }
    return PreFilterCheck.SKIP_FILTER;
  }

  /*
   * This contains the logic of the filtering. The first line uses the logger to
   * log the URL of the document we are filtering. After that we prefix the
   * document's content with 'Example: ' and create a new document with that
   * content. We then return the new filtered document.
   */
  @Override
  public FilterResult filterAsStringDocument(StringDocument document, FilterContext context) {
    // Log what document we are filtering
    log.info("Filtering document: " + document.getURI());

    // Prepend Example to the document content
    String newContent = "Example: " + document.getContentAsString();

    // Create a clone of the existing document with the new content
    StringDocument newDocument = document
      .cloneWithStringContent(document.getDocumentType(), newContent);

    // Return the new document we created with the new content
    return FilterResult.of(newDocument);
  }

  /*
   * This inner class contains tests for the filter.
   * 
   * Methods in this class annotated with @Test will be run by main.
   */
  public static class FilterTest {

    @Test
    public void exampleTest() throws Exception {
      // This creates the dummy input document. The input document has the URI set
      // to 'http://foo.com/', the document type is set to HTML and the content
      // of the document is set to 'hello'
      StringDocument inputDoc = MockDocuments.mockEmptyStringDoc()
        .cloneWithURI(new URI("http://foo.com/"))
        .cloneWithStringContent(DocumentType.MIME_HTML_TEXT, "hello");

      // This creates an instance of the filter and runs it with the input document
      // we created earlier. Ignore the MockFilterContext for now.
      FilterResult filterResult = new ExampleGroovyFilter()
        .filter(inputDoc, MockFilterContext.getEmptyContext());

      // Get the resulting document.
      // As filters can return zero, one or more documents we must get
      // the resulting filtered document from the list of filtered documents.
      // Here we assume the list will contain one document.
      StringDocument filteredDocument = (StringDocument) filterResult
        .getFilteredDocuments().get(0);

      // Finally we check that the filter has modified the content of the document
      // using a JUnit assert statement.
      Assert.assertEquals(
        "'Example:' should be prepended to the document content.",
        "Example: hello",
        filteredDocument.getContentAsString());
    }
  }

  // Running the main method will execute the test methods.
  public static void main(String[] args) throws Exception {
    FilterTestRunner.runTests(FilterTest.class);
  }

}

Running filter tests on the command line

Before adding your custom filter to your collection's filter chain, you should always check that the tests are passing. To run the tests for the Groovy filter, run:

$SEARCH_HOME/linbin/java/bin/java -cp $SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/tools/groovy/bin/groovy groovy.ui.GroovyMain $SEARCH_HOME/conf/<collection>/@groovy/com/myfilters/ExampleGroovyFilter.groovy

The output should show that the single test passes.

Adding the filter to the filter chain

To use a Groovy filter script, you must first modify your collection's filter.classes setting to include the script in the list of filters to use for the collection. The default setting is currently:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider

To allow the new filter to have the last word (i.e. to run after the document fixer), change this to:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.myfilters.ExampleGroovyFilter

Modifying the filter

The example above is a solid starting point. For most filters, two sections will need to change. The first is canFilter(), which controls whether the filter runs on a document, based on the document type, URI or document metadata. Document content should never be inspected in this method, to avoid expensive copy operations. The second is filterAsStringDocument(), which holds the logic for filtering the document. This example modifies the document content, but a filter can also modify the URI, metadata, document type and (in some filters) charset. Filters can even split a document into multiple documents or remove documents. See the filter examples for a list of examples demonstrating the different features the filter framework provides.
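
As one illustration of a URI-based decision for canFilter(), the sketch below keeps the check in a small helper that needs only java.net.URI, so it can be tested without any mock documents. The class name UriPreCheck, the method isUnderPath and the /news/ prefix are hypothetical and are not part of the Funnelback filter API. It is written in plain Java syntax, which Groovy also accepts.

```java
import java.net.URI;

// Hypothetical helper (not part of the Funnelback API): decides from the
// URI alone whether a document is a candidate for filtering.
final class UriPreCheck {

  private UriPreCheck() {
    // Static helper only; not meant to be instantiated.
  }

  // Returns true when the URI's path starts with the given prefix.
  // URIs without a path (for example mailto: addresses) never match.
  static boolean isUnderPath(URI uri, String pathPrefix) {
    if (uri == null || pathPrefix == null) {
      return false;
    }
    String path = uri.getPath();
    return path != null && path.startsWith(pathPrefix);
  }
}
```

Inside canFilter(), this could be combined with the existing document type test, assuming the document passed to canFilter() exposes getURI() as the StringDocument in the example does. The method would return PreFilterCheck.ATTEMPT_FILTER only when document.getDocumentType().isHTML() and UriPreCheck.isUnderPath(document.getURI(), "/news/") are both true, and PreFilterCheck.SKIP_FILTER otherwise. The helper never touches document content, which is consistent with the rule about not inspecting content in canFilter().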

It is always best to write tests which exercise all methods in your filter. For simplicity, the tests are included within the filter itself; this is not required and may not suit your development environment.