HTML document filtering (Jsoup filters)

Introduction

Jsoup filters are special document filters that can be used to transform and manipulate HTML documents based on their DOM structure. Jsoup filters can be chained together to perform a series of modifications to the HTML document.

The JSoupProcessingFilterProvider filter in the main filter chain converts the HTML document into a Jsoup object representing the document and passes it to the start of the Jsoup filter chain. At the end of the Jsoup filter chain this object is serialised and passed on to the next filter in the main filter chain.

JSoup filters should be used for making modifications to the HTML document structure, or performing operations that select and transform the document's DOM or content. Custom JSoup filters can be written to perform operations such as:

  • injecting metadata,
  • cleaning titles,
  • scraping content (e.g. extracting breadcrumbs to metadata).

The Jsoup filter chain

Jsoup filters are run on HTML documents when the JSoupProcessingFilterProvider filter is included as part of the main filter chain. This will happen by default unless the JSoupProcessingFilterProvider filter has been removed from the collection's filter.classes.

[Figure: the filter chain (the-filter-chain-01.png)]

The Jsoup filter chain is a comma-separated list of Jsoup filters which are applied only to HTML documents.

The set of filters below would be processed as follows: the content would pass through JSoupFilter1 first, then be passed on to JSoupFilter2, and finally to JSoupFilter3.

JSoupFilter1,JSoupFilter2,JSoupFilter3
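The ordering described above can be sketched with plain Java functions. This is only an illustration of "each step receives the previous step's result"; it does not use the Funnelback or Jsoup classes, and the class and method names (ChainOrderSketch, applyInOrder, demo) are hypothetical.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Stand-in sketch of a filter chain: steps are applied strictly in list order. */
final class ChainOrderSketch {

    /** Runs every step in order, feeding each step the previous step's result. */
    static String applyInOrder(List<UnaryOperator<String>> steps, String input) {
        String current = input;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    /** Hypothetical steps standing in for JSoupFilter1..3; each appends its own name. */
    static String demo() {
        List<UnaryOperator<String>> chain = List.of(
            s -> s + " > JSoupFilter1",
            s -> s + " > JSoupFilter2",
            s -> s + " > JSoupFilter3");
        return applyInOrder(chain, "doc");
    }

    public static void main(String[] args) {
        // prints: doc > JSoupFilter1 > JSoupFilter2 > JSoupFilter3
        System.out.println(demo());
    }
}
```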

Configuring the Jsoup filter chain

The Jsoup filter chain is defined using the filter.jsoup.classes key in collection.cfg.
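As a sketch, and assuming the usual key=value syntax of collection.cfg, the placeholder chain shown earlier would be configured with a single line like the following. The filter names are the placeholders from above, not real classes.

```
filter.jsoup.classes=JSoupFilter1,JSoupFilter2,JSoupFilter3
```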

Built-in Jsoup filters

Funnelback ships with a number of built-in, pre-configured Jsoup filters, which are used to produce the metadata required to build the content auditor and accessibility auditor reports:

Filter name | Description
MetadataScraper | Scrapes content and injects it as metadata.
ContentGeneratorUrlDetection | Detects additional URLs for the given content based on its generator (e.g. CMS-specific edit links).
FleschKincaidGradeLevel | Estimates how readable the document is, and records the estimate with the document.
UndesirableText | Detects occurrences of configured undesirable text and records them for content auditor to report upon.

Writing Jsoup filters

Jsoup filters are written following the same rules as general filters. Jsoup filters can access the document properties and collection configuration in a similar manner.

Jsoup filter example

The following Jsoup filter scrapes some content using Jsoup selectors (similar to CSS selectors) and injects these as metadata fields which can then be mapped using standard metadata mapping rules.

This file would be named MetadataScraper.groovy, saved into the collection's @groovy/com/funnelback/example folder, and added to the Jsoup filter chain by adding com.funnelback.example.MetadataScraper to filter.jsoup.classes.

package com.funnelback.example

import com.funnelback.common.filter.jsoup.*

/**
 * Scrapes tags, author and thumbnails from the content and injects these as custom metadata
 */

@groovy.util.logging.Log4j2
public class MetadataScraper implements IJSoupFilter {

  @Override
  void processDocument(FilterContext context) {
    def doc = context.getDocument()
    def url = doc.baseUri()

    try {
      // Extract links contained within divs with a class of tags
      // and add these to a custom.tags metadata field
      doc.select("div.tags a").each() { tagLink ->
        def tag = tagLink.text().trim().toLowerCase()
        context.additionalMetadata.put("custom.tags", tag)
        log.debug("Added tag '{}' for '{}'", tag, url)
      }

      // Extract thumbnails contained within divs with an ID of thumbnail
      // and write these to a custom.thumbnails metadata field
      doc.select("div#thumbnail img").each() { img ->
        def imgUrl = img.attr("src")
        context.additionalMetadata.put("custom.thumbnails", imgUrl)
        log.debug("Added thumbnail '{}' for '{}'", imgUrl, url)
      }

      // Extract author contained in the anchor tag with a class of author
      // and write this to a custom.author metadata field
      doc.select("a.author").each() { authorLink ->
        def author = authorLink.text().trim().toLowerCase()
        context.additionalMetadata.put("custom.author", author)
        log.debug("Added author '{}' for '{}'", author, url)
      }
    } catch (e) {
      log.error("Error scraping metadata from '{}'", url, e)
    }
  }
}

Testing JSoup filters

Funnelback supports a simple process for writing tests for JSoup filters. This allows an author of a JSoup filter to define expected input and output for a filter and confirm that the filter is functioning as expected.

Testing a JSoup filter requires the following simple steps:

  1. Define the input
  2. Define the expected output
  3. Run the test

If more complex tests are required the main method can be implemented within the filter as for other Groovy filter scripts.

Define the test(s)

For each test:

  1. Provide a file containing the (input) HTML that will be processed by the JSoup filter. The file should be named with the following format: <filter class name>-<test name>.test (e.g. myFilter-test1.test) and saved in the same location as the JSoup filter.
  2. Provide a file containing the (output) HTML returned by the JSoup filter. The file should be named with the following format: <filter class name>-<test name>.expected (e.g. myFilter-test1.expected) and saved in the same location as the JSoup filter.

Multiple tests can be provided by producing multiple test/expected files (e.g. myFilter-test1.test & myFilter-test1.expected, myFilter-complex.test & myFilter-complex.expected).
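The naming convention above pairs each .test file with an .expected file of the same base name. A minimal sketch of that pairing rule, using a hypothetical helper (TestFileNames.expectedFor) that is not part of Funnelback:

```java
import java.util.Optional;

/** Sketch of the <filter class name>-<test name>.test / .expected naming rule. */
final class TestFileNames {

    /**
     * Returns the .expected file name that pairs with a .test file, or empty
     * if the file name does not follow the convention for the given filter.
     */
    static Optional<String> expectedFor(String filterName, String testFileName) {
        String prefix = filterName + "-";
        String suffix = ".test";
        boolean hasTestName = testFileName.length() > prefix.length() + suffix.length();
        if (!testFileName.startsWith(prefix) || !testFileName.endsWith(suffix) || !hasTestName) {
            return Optional.empty();
        }
        String base = testFileName.substring(0, testFileName.length() - suffix.length());
        return Optional.of(base + ".expected");
    }

    public static void main(String[] args) {
        System.out.println(expectedFor("myFilter", "myFilter-complex.test"));
        System.out.println(expectedFor("myFilter", "otherFilter-test1.test"));
    }
}
```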

Running the test(s)

Run a simple command and inspect the output.

To run the tests for the file myFilter.groovy located in the collection's @groovy/com/funnelback/common/filter/jsoup folder run the following command:

/opt/funnelback/tools/groovy/bin/groovy -cp "/opt/funnelback/lib/java/all/*" /opt/funnelback/conf/collection/@groovy/com/funnelback/common/filter/jsoup/myFilter.groovy

Example

[search@localhost jsoup]$ /opt/funnelback/tools/groovy/bin/groovy -cp "/opt/funnelback/lib/java/all/*" /opt/funnelback/conf/collection/@groovy/com/funnelback/common/filter/jsoup/myFilter.groovy
Found the following test cases in /opt/funnelback/conf/collection/@groovy/com/funnelback/common/filter/jsoup :
 -   myFilter-test1.test
 -   myFilter-simple.test
 -   myFilter-complex.test
PASS - myFilter-test1.actual matches myFilter-test1.expected
FAIL - myFilter-simple.actual does not match myFilter-simple.expected
PASS - myFilter-complex.actual matches myFilter-complex.expected
Summary:
    PASSED: 2
    FAILED: 1

If any of the tests failed, the actual and expected outputs can be compared (e.g. using the Unix diff command), and any stack traces can be inspected if the filter crashed during processing rather than just producing incorrect output.

Note: differences in line endings between the actual and expected output are ignored as filters are often developed in a different environment to where they are run in production.
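To illustrate what ignoring line endings means in practice, the sketch below normalises both texts before comparing them. It is an assumption about how such a comparison could be done, not the test harness's actual code; the class and method names are hypothetical.

```java
/** Sketch of a comparison that ignores line-ending differences. */
final class LineEndingCompare {

    /** Converts Windows (CR LF) and bare CR line endings to LF. */
    static String normalise(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** True when the two texts are identical apart from their line endings. */
    static boolean matches(String actual, String expected) {
        return normalise(actual).equals(normalise(expected));
    }

    public static void main(String[] args) {
        String actual = "<p>one</p>\r\n<p>two</p>\r\n";   // e.g. produced on Windows
        String expected = "<p>one</p>\n<p>two</p>\n";     // e.g. written on Linux
        System.out.println(matches(actual, expected));    // prints: true
    }
}
```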

com.funnelback.common.filter.jsoup.IJSoupFilter interface

package com.funnelback.common.filter.jsoup;

/**
 * Interface for filtering steps applied by JSoupProcessingFilterProvider.
 * 
 * The chain of steps to apply can be configured with the 'filter.jsoup.classes'
 * collection.cfg parameter.
 * 
 * One instance of this class will be created per thread performing filtering
 * (to avoid unintended concurrency problems).
 */
public interface IJSoupFilter extends Runnable {

  /**
   * Perform any setup the filter may need before process is first called. Where 
   * possible any time consuming work should be done here to avoid re-doing it 
   * for every document in the collection.
   * 
   * The default implementation does nothing, which is appropriate for many filters.
   */
  default public void setup(SetupContext setup) {
    // Do nothing
  }

  /**
   * Implement this method to perform any desired processing on the filterContext.getDocument()
   * object before it is stored in the collection.
   * 
   * Modifications can be made in place, or by setting additionalMetadata values in filterContext.getAdditionalMetadata().
   * 
   * Since this method is called for every document, try to do any time consuming setup in the setup() method.
   * 
   * Note that this method will never be called in parallel, since a new instance of your filter
   * will be created for each thread. If you need to share state across all filter instances
   * you can do so by creating static variables/methods on your filter class and calling/using them
   * however, be aware then that you must then manage any concurrency safely yourself.
   */
  public void processDocument(FilterContext filterContext);

  /**
   * We provide a default run method, which is run when a groovy script is executed.
   * 
   * This method implements a very basic test harness that:
   * - Looks in the same directory as the groovy script file being run (and bails out if it's not under @groovy)
   * - Looks for files named scriptName-someTestName.test and scriptName-someTestName.expected in that directory
   * - Runs the filter in question on each scriptName-someTestName.test to produce scriptName-someTestName.actual
   * - Compares the expected and actual output, and produces basic information about the passes/failures.
   */
  default public void run() {
    // snip
  }

}

Note that a single processDocument method must be provided. It is called once for each document and is passed a context object, as described below.

Any changes the IJSoupFilter implementation wishes to make should be made either in place on the given document object or by altering content within the context. Changes made to either of these by a filter in the chain will be visible to subsequent filters.

Because changes are made in place, you may wish to call clone() on any nodes you want to modify without the changes being reflected in the final document.

com.funnelback.common.filter.jsoup.SetupContext

package com.funnelback.common.filter.jsoup;

import java.io.File;
import java.util.Set;
import java.util.stream.Collectors;

import com.funnelback.common.config.Config;

/**
 * Represents the 'setup' of a filter, which includes information
 * about the Funnelback installation and collection for which
 * documents will be filtered.
 */
public class SetupContext {

  /** Config for Funnelback and the collection */
  // Keep private so we can swap it for a rewritten config transparently later
  private final Config config;

  /**
   * Create a new SetupContext from the given Config.
   *
   * We intentionally keep this private and expose only relevant methods
   * with a view towards being able to change this when config is redesigned
   * without affecting any implemented filters.
   */
  public SetupContext(Config config) {
    this.config = config;
  }

  /** The home of the Funnelback installation currently being used to perform filtering. */
  public File getSearchHome() {
    return config.getSearchHomeDir();
  }

  /** The name of the funnelback collection for which filtering is being performed */
  public String getCollectionName() {
    return config.getCollectionName();
  }

  /**
   * Provides access to config settings from the current funnelback installation and collection.
   * 
   * Need to set a default value from a Groovy implementation?
   * Consider http://www.groovy-lang.org/operators#_elvis_operator
   * 
   * @param key The name of the collection.cfg or global.cfg setting being requested
   * @return The value as set for the current install and collection context
   */
  public String getConfigSetting(String key) {
    return config.value(key);
  }

  /**
   * Provides a list of config setting keys which have some prefix.
   * 
   * This is useful in cases where you need to process some set of config
   * values (e.g. one setting per data source).
   */
  public Set<String> getConfigKeysWithPrefix(String prefix) {
    return config.getConfigData().keySet().stream()
        .filter((a) -> a.startsWith(prefix)).collect(Collectors.toSet());
  }

}
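The two lookup methods above can be sketched against a plain Map standing in for the real Config object. The setting names under example.scraper.* are invented for illustration, and the sketch assumes a missing setting comes back as null, which is the case the Elvis operator hint in the Javadoc addresses.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Stand-in sketch of prefix lookup and default handling over config settings. */
final class ConfigLookupSketch {

    // Hypothetical settings standing in for collection.cfg; the example.* keys are invented.
    static final Map<String, String> SETTINGS = Map.of(
        "filter.jsoup.classes", "com.funnelback.example.MetadataScraper",
        "example.scraper.selector.tags", "div.tags a",
        "example.scraper.selector.author", "a.author");

    /** Mirrors getConfigKeysWithPrefix: every key that starts with the prefix. */
    static Set<String> keysWithPrefix(Map<String, String> settings, String prefix) {
        return settings.keySet().stream()
            .filter(key -> key.startsWith(prefix))
            .collect(Collectors.toSet());
    }

    /** Java equivalent of Groovy's value ?: fallback for a setting that is not set (null). */
    static String valueOrDefault(Map<String, String> settings, String key, String fallback) {
        String value = settings.get(key);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        System.out.println(keysWithPrefix(SETTINGS, "example.scraper.selector."));
        System.out.println(valueOrDefault(SETTINGS, "example.scraper.missing", "fallback"));
    }
}
```

In a Groovy filter the same default would typically be written with the Elvis operator inside setup(), so the lookup runs once per filter instance rather than once per document.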

com.funnelback.common.filter.jsoup.FilterContext

package com.funnelback.common.filter.jsoup;

import java.util.HashMap;
import java.util.Map;

import org.jsoup.nodes.Document;

import lombok.Data;

import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;

@Data
public class FilterContext {
  /**
   * A representation of the setup details for this filter.
   * 
   * For example, where Funnelback is installed and what collection is being used,
   * as well as access to configuration settings.
   */
  private final SetupContext setup;

  /**
   * The document being filtered.
   * 
   * Modifications may be made in place if needed, however it may be easier in many
   * cases to add metadata via the additionalMetadata Multimap below.
   */
  private final Document document;

  /**
   * Metadata to be added to the document being filtered as a result of filtering. One metadata entry
   * with the same key as the map entry will be added for each entry in the value set.
   * 
   * This is a Multimap which supports multiple values for each key. See 
   * http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Multimap
   * for details of the calls available, though you can probably just treat it like a normal map
   * if you don't need to replace existing values.
   * 
   * The current implementation will add the metadata by inserting HTML meta tags into the document
   * but in the future we may change this to store metadata separately to the content itself, so avoid
   * relying on a specific storage location.
   */
  private final Multimap<String, String> additionalMetadata = HashMultimap.create();

  /**
   * A map of custom data which may be used to communicate between filters in the chain.
   * Any filter may add, edit or remove entries for the document during filtering.
   * 
   * Note that the map and all content will be discarded once the document has been filtered.
   */
  private final Map<String, Object> customData = new HashMap<String, Object>();

}
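The behaviour of additionalMetadata (several values per key, with put adding rather than replacing) can be sketched using only the standard library. This stand-in is not Guava's Multimap: it keeps insertion order, which HashMultimap does not promise. The meta tag format it renders is also an assumption, since the Javadoc above only says that HTML meta tags are inserted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Stand-in sketch of a set-valued multimap of metadata and its rendering. */
final class AdditionalMetadataSketch {

    private final Map<String, Set<String>> values = new LinkedHashMap<>();

    /** Adds a value under a key; returns false if that exact key/value pair already exists. */
    boolean put(String key, String value) {
        return values.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(value);
    }

    /** Renders one meta tag per key/value pair (assumed format, no HTML escaping). */
    List<String> toMetaTags() {
        List<String> tags = new ArrayList<>();
        values.forEach((key, set) -> set.forEach(value ->
            tags.add("<meta name=\"" + key + "\" content=\"" + value + "\">")));
        return tags;
    }

    /** Adds two tags, a duplicate tag and an author, then renders them. */
    static List<String> demo() {
        AdditionalMetadataSketch metadata = new AdditionalMetadataSketch();
        metadata.put("custom.tags", "news");
        metadata.put("custom.tags", "sport");   // second value for the same key is kept
        metadata.put("custom.tags", "news");    // exact duplicate is ignored
        metadata.put("custom.author", "jane");
        return metadata.toMetaTags();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);    // prints three meta tags
    }
}
```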

v15.16.0