Filter example: scripted filter

Introduction

⚠ Scripted filters of this kind are deprecated. They are provided for backwards compatibility only.

This page gives an example of using the ScriptFilterProvider interface to implement a custom filter.

Example

Here is a very simple example Groovy filtering script to use as a starting point.

// Filename:  ExampleGroovyFilter.groovy

// This package declaration should match the .groovy file location
// on disk as well as the name used for the filter.classes parameter.
package au.com.example;
 
@groovy.transform.InheritConstructors
public class ExampleGroovyFilter extends com.funnelback.common.filter.ScriptFilterProvider {
  // We get a documentType (file extension) to decide if
  // we should filter this document.
  // This example filters all document types.
  public Boolean isDocumentFilterable(String documentType) {
    return true;
  }
 
  // The filter method performs the actual filtering work,
  // and returns the new document content.
  // This example just prefixes "Example" to all documents
  public String filter(String input, String documentType) {
    return "Example " + input;
  }
 
  // There is also another method that receives the document URL
  // as an additional parameter.
  // Note that the other method MUST be implemented as well.
  // In this case you can just call this one with no URL:
  // filter(input, documentType, null)
  public String filter(String input, String documentType, String url) {
    return "Example " + input;
  }
 
  // Utility method to unit-test the filter on an individual file
  // when developing, from the Groovy Console for example
  public static void main(String[] args) {
    // File to filter with test content
    def testFile = new File("C:\\Temp\\test-content")
 
    // Create new filter object, for a fake collection.
    def f = new ExampleGroovyFilter("collection-name", false)
 
    // Filter content using main filter() function
    def contentFiltered = f.filter(testFile.text, testFile.absolutePath)
 
    // OPTIONAL: Filter content with a document URL
    // (stored under a separate name, since a variable cannot be declared twice)
    def contentFilteredWithUrl = f.filter(testFile.text, testFile.absolutePath, "http://fake.url/file")
 
    // print filtered content
    print contentFiltered
  }
}

Building filters

The example above is a solid starting point. For most filters, only the implementations of the isDocumentFilterable() and filter() methods should need to change; the rest is likely to be boilerplate common to all filters.
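As an illustration of changing only isDocumentFilterable(), the sketch below restricts filtering to HTML documents. It is a standalone class so that it runs without Funnelback; in a real filter it would extend ScriptFilterProvider as in the example above. The class name HtmlOnlyFilterSketch, the list of types, and the assumption that documentType may arrive with a leading dot or in upper case are illustrative only.

```groovy
// Standalone sketch: a real filter would extend
// com.funnelback.common.filter.ScriptFilterProvider instead.
class HtmlOnlyFilterSketch {
  // Hypothetical list of document types (file extensions) to filter
  static final Set<String> FILTERABLE_TYPES = ['html', 'htm'] as Set

  Boolean isDocumentFilterable(String documentType) {
    if (documentType == null) {
      return false
    }
    // Assumption: tolerate upper case and a leading dot, e.g. ".HTML"
    def normalised = documentType.toLowerCase().replaceFirst(/^\./, '')
    return FILTERABLE_TYPES.contains(normalised)
  }

  // Same behaviour as the example above: prefix "Example" to the content
  String filter(String input, String documentType) {
    return "Example " + input
  }
}

def htmlOnly = new HtmlOnlyFilterSketch()
println htmlOnly.isDocumentFilterable('html')   // prints: true
println htmlOnly.isDocumentFilterable('pdf')    // prints: false
```

Keeping the type check in isDocumentFilterable() means filter() does not need to re-check the document type itself.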

Pre-defined variables

As a result of extending the ScriptFilterProvider class, your filter will have access to the following instance variables:

A fileBeingFiltered string, which may contain the path of the file being filtered. It may be null if the filter is being run inline, and it will be null if you test your script from the Groovy Console instead of running a collection filter phase.
A config object, which allows access to collection.cfg and global.cfg settings with a call like config.value("service_name").

Note also that the script class will have access to all the existing Funnelback Java Common classes as well as any JAR files which are added to the SEARCH_HOME/lib/java/all/ directory.
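The sketch below shows one way a filter might use the two instance variables described above while tolerating a null fileBeingFiltered. So that it runs outside Funnelback, it uses hypothetical stand-ins (StubConfig and StubFilterBase) in place of the real config object and ScriptFilterProvider; only the config.value("service_name") call shape comes from this page. It also assumes that a missing setting yields null, which may not match the real config object.

```groovy
// Hypothetical stand-in for the real config object
class StubConfig {
  Map<String, String> settings = [:]
  String value(String key) { settings[key] }
}

// Hypothetical stand-in for ScriptFilterProvider, exposing the same two variables
class StubFilterBase {
  StubConfig config
  String fileBeingFiltered
}

class ServiceNameCommentSketch extends StubFilterBase {
  String filter(String input, String documentType) {
    // Assumption: fall back to a default when the setting is missing
    def serviceName = config.value('service_name') ?: 'unknown service'
    // fileBeingFiltered may be null (inline filtering, Groovy Console)
    def source = fileBeingFiltered ?: 'inline content'
    return "<!-- ${serviceName}: ${source} -->\n" + input
  }
}

def stubConfig = new StubConfig(settings: [service_name: 'Example search'])
def commentFilter = new ServiceNameCommentSketch(config: stubConfig)
println commentFilter.filter('<p>Hello</p>', 'html')
```

In a real filter the two stub classes would be dropped and the class would extend ScriptFilterProvider directly.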

Caveats

No URL provided

In some cases the method public String filter(String input, String documentType, String url) will be called with the url parameter set to null. This can happen with gather components for which a URL is not available at the time of gathering (e.g. on non-web collections).
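A filter that relies on the URL therefore needs a fallback for the null case. The following is a minimal standalone sketch of one approach, in which the content is returned unchanged when no URL is available. The class name UrlAwareFilterSketch and the HTML comment it adds are hypothetical; a real filter would extend ScriptFilterProvider.

```groovy
// Standalone sketch: a real filter would extend
// com.funnelback.common.filter.ScriptFilterProvider instead.
class UrlAwareFilterSketch {
  // Two-argument variant delegates with no URL
  String filter(String input, String documentType) {
    return filter(input, documentType, null)
  }

  String filter(String input, String documentType, String url) {
    if (url == null) {
      // No URL available: leave the content untouched
      return input
    }
    // Illustrative use of the URL: record it in an HTML comment
    return "<!-- source: ${url} -->\n" + input
  }
}

def urlFilter = new UrlAwareFilterSketch()
println urlFilter.filter('<p>Hi</p>', 'html', 'http://fake.url/file')
println urlFilter.filter('<p>Hi</p>', 'html', null)   // prints: <p>Hi</p>
```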

Examples

Extract Metadata Groovy filter

Below is an example filter which takes the content of the first H1 tag in the page and the base href URL, and creates an external metadata file mapping the content of the first H1 into the x metadata class.

package com.funnelback.common.filter;
 
import com.funnelback.common.Environment;
 
public class ExtractMetadataExample extends com.funnelback.common.filter.ScriptFilterProvider {
 
  PrintWriter output;
 
  public ExtractMetadataExample(String collectionName, boolean inlineFiltering) {
    super(collectionName, inlineFiltering);
 
    File outputFile = new File(Environment.getValidSearchHome(), "conf" + File.separator + collectionName + File.separator + "external_metadata.cfg");
    output = new PrintWriter(new FileWriter(outputFile));
  }
 
  // We filter all documents
  public Boolean isDocumentFilterable(String documentType) {
    return true;
  }
 
  // Take first h1 tag content and put it into an external metadata file
  public String filter(String input, String documentType) {
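    // Look for h1 tags case insensitive and ignoring newlines.
    // The leading .*? is non-greedy so that the FIRST h1 is captured
    // (a greedy .* would capture the last one).
    def h1Matcher = input =~ /(?is).*?<h1>(.*?)<\/h1>.*/;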
 
    if (h1Matcher.matches()) {
      // Look for the document's URL
 
      // (!) in inline filtering the DOCHDR is not available. In that
      // case you need to implement a different method, instead of
      // filter(input, documentType) you can implement filter(input, documentType, url)
      def urlMatcher = input =~ /(?is).*<base href="(.*?)">.*/;
 
      if (urlMatcher.matches()) {
        String line = urlMatcher[0][1] + " x:\"" + h1Matcher[0][1] + "\"";
        output.println(line);
      }
    }
 
    // No content changes - return the input
    return input;
  }
 
  public void cleanup() {
    if (output != null) {
      output.flush();
      output.close();
    }
    super.cleanup();
  }
 
  // A main method to allow very basic testing
  public static void main(String[] args) {
    def f = new ExtractMetadataExample("dummy-collection", false);
    println(f.filter("<base href=\"example_url\"> \n"
      + "<h1>example metadata</H1>", ""));
  }
}

You can run this script on the command line using:

$SEARCH_HOME/tools/groovy/bin/groovy -cp "$SEARCH_HOME/lib/java/all/*" ExtractMetadataExample.groovy
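If no Funnelback installation is at hand, the extraction logic can also be tried on its own. The sketch below is a standalone approximation of the filter's two regular expressions, not part of the Funnelback API: the helper name externalMetadataLine and the sample URL are hypothetical. It uses a non-greedy leading .*? so that the first H1 is the one captured.

```groovy
// Builds a line of the form: <url> x:"<h1 text>"
// Returns null when either the base href or the h1 is missing.
String externalMetadataLine(String html) {
  def h1Matcher = html =~ /(?is).*?<h1>(.*?)<\/h1>.*/
  def urlMatcher = html =~ /(?is).*?<base href="(.*?)">.*/
  if (!h1Matcher.matches() || !urlMatcher.matches()) {
    return null
  }
  return urlMatcher.group(1) + ' x:"' + h1Matcher.group(1) + '"'
}

def sample = '<base href="http://example.com/page">\n<h1>First</h1>\n<h1>Second</h1>'
println externalMetadataLine(sample)
// prints: http://example.com/page x:"First"
```

This can be run with any plain Groovy installation, since it does not need the Funnelback classes on the classpath.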

Another method you may find useful is:

import com.funnelback.common.utils.HTMLUtils;
     
String modifiedHTML = HTMLUtils.insertMetadata(String html, String metadataName, String metadataValue);

which allows you to insert a given metadata name/value pair into some HTML.

Note: From version 14 upwards, the package path has changed. The import statement now reads: import com.funnelback.common.html.HTMLUtils

JSoup Groovy filter

Below is an example filter which takes the content of the first H1 tag in the page, and replaces the title element with it using JSoup rather than manipulating the text directly.

JSoup is a library for parsing and manipulating HTML and supports CSS/jQuery style selectors for finding elements in the parsed HTML document, which may be substantially simpler than trying to work with regular expressions.

package com.funnelback.common.filter;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
@groovy.transform.InheritConstructors
public class JsoupExample extends com.funnelback.common.filter.ScriptFilterProvider {
 
  // We filter all documents
  public Boolean isDocumentFilterable(String documentType) {
    return true;
  }
 
  // Take first h1 tag content and put it into the title
  public String filter(String input, String documentType) {
    Document doc = Jsoup.parse(input);
 
    Elements h1s = doc.select("h1");
    Element h1 = h1s.first();
 
    // Replace the title with the content of the first h1
    doc.select("title").first().text(h1.text());
 
    return doc.outerHtml();
  }
 
  // A main method to allow very basic testing
  public static void main(String[] args) {
    def f = new JsoupExample("dummy-collection", false);
    println(f.filter("\n \n  <title>bad title</title>\n \n \n  foo  and  \n  <h1>good title</h1> end\n \n", "html"));
  }
}