Built-in filters: Scripted workflow filter (WorkflowFilter)

Introduction

The scripted workflow filter lets you define conditions and actions that are executed during content filtering.

Configuring the scripted workflow filter

Enabling

Edit the filter.classes parameter in your collection.cfg file and append com.funnelback.common.filter.WorkflowFilter to the end of its value, separated from the preceding filter by a colon.

Example

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.funnelback.common.filter.WorkflowFilter

Create a workflow.cfg file using the file manager. This file will contain the conditions and actions you wish to define.

Configuring scripted workflow rules

The workflow.cfg file contains Groovy code made up of if statements, each of which performs an action when its condition is met.

Syntax

The syntax for each workflow command is as follows:

if (<CONDITION>) {
    ACTION
}

Statements can be nested:

if (<CONDITION1>) {
    if (<CONDITION2>) {
        ACTION
    }
}

Conditions can be combined using the .and() and .or() methods:

if ((<CONDITION1>).and(<CONDITION2>)) {
    ACTION1
}
if ((<CONDITION3>).or(<CONDITION4>)) {
    ACTION2
}
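
The .and() and .or() calls are ordinary Groovy methods on Boolean values, so a combined condition can be tried out in any Groovy console. The following is a minimal sketch only: the sample URL is hypothetical, and the urlContains closure is a stand-in written for this illustration, not the filter's own implementation, which may behave differently.

```groovy
// Hypothetical sample URL, for illustration only.
def url = "http://example.com/publications/report.pdf"

// Stand-in for the filter's urlContains(); an assumption, not the real code.
def urlContains = { String regex -> (url =~ regex).find() }

def isPublication = urlContains("publications")
def isPdf = urlContains("\\.pdf")

// Boolean.and() and Boolean.or() are provided by Groovy itself.
assert (isPublication).and(isPdf)
assert (isPublication).or(urlContains("news"))
assert !((isPdf).and(urlContains("news")))
```

Because .and() and .or() are method calls, both conditions are always evaluated before they are combined. Nesting if statements, as shown earlier, only evaluates the inner condition when the outer one is true.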

Variables can be defined using the def keyword.

def pubs = urlContains("publications");

if (pubs == true) {
    ACTION
}

Conditions

Function	Description
urlContains(regex)	Returns true if the URL contains the given regular expression, false otherwise.
urlDoesNotContain(regex)	Returns true if the URL does not contain the given regular expression, false otherwise.
urlStartsWith(regex)	Returns true if URL starts with the given regular expression, false otherwise.
urlDoesNotStartWith(regex)	Returns true if URL does not start with the given regular expression, false otherwise.
urlEndsWith(regex)	Returns true if URL ends with the given regular expression, false otherwise.
urlDoesNotEndWith(regex)	Returns true if URL does not end with the given regular expression, false otherwise.
contentContains(regex)	Returns true if content contains the given regular expression, false otherwise.
contentDoesNotContain(regex)	Returns true if content does not contain the given regular expression, false otherwise.
contentStartsWith(regex)	Returns true if content starts with the given regular expression, false otherwise.
contentDoesNotStartWith(regex)	Returns true if content does not start with the given regular expression, false otherwise.
contentEndsWith(regex)	Returns true if content ends with the given regular expression, false otherwise.
contentDoesNotEndWith(regex)	Returns true if content does not end with the given regular expression, false otherwise.
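
Since every condition takes a regular expression rather than a literal string, characters such as the dot need escaping when they should match literally. The sketch below uses java.util.regex directly to show the difference; it assumes the filter applies standard Java regular expression rules, and the sample URL is hypothetical.

```groovy
import java.util.regex.Pattern

// Hypothetical sample URL, for illustration only.
def url = "http://example.com/docs/annual-report.PDF"

// An unescaped dot matches any character, so "xpdf" also matches.
assert Pattern.compile(".pdf").matcher("report-xpdf").find()

// "\\." in a Groovy string yields the regex \. (a literal dot).
assert !Pattern.compile("\\.pdf").matcher("report-xpdf").find()

// (?i) switches on case-insensitive matching for the pattern.
assert Pattern.compile("(?i)\\.pdf").matcher(url).find()
assert !Pattern.compile("\\.pdf").matcher(url).find()
```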

Actions

Function	Description
replaceContent(regex, replacement)	Modifies the document content by looking for all matches for the given regular expression and replacing them with the given replacement text.
getMatchingContent(regex)	Returns the first section of the document content that matches the given regular expression.
insertMetaTag(name, content)	Inserts a meta tag with the given name and content values into the document.
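
As a sketch of how the conditions and actions fit together, the following hypothetical workflow.cfg rule tags press releases with a metadata field. The URL pattern, the meta tag name and its value are assumptions chosen for illustration; only the function names come from the tables above. The fragment relies on the functions supplied by the filter, so it runs only inside workflow.cfg.

```groovy
// Hypothetical rule: the path, tag name and value are illustrative.
def isPressRelease = urlContains("/press-releases/");

if ((isPressRelease).and(contentDoesNotContain("(?i)draft"))) {
    // Adds a meta tag named "doc_type" with the content "press release".
    insertMetaTag("doc_type", "press release");
}
```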

Debugging

Error messages are written to the filter.log file, or to the crawler/gather log files if inline filtering is in use.

Examples

This section gives some examples of the script language that might be put in the workflow.cfg file.

if ((contentContains("(?i)ovum")).or(contentContains("Gartner"))) {
    if (urlContains("analyst-reviews")) {
        insertMetaTag("robots", "noindex");
    }
}

In the example above the content must contain either Ovum or Gartner, and the URL must contain analyst-reviews. The (?i) prefix makes the match on ovum case-insensitive; the match on Gartner remains case-sensitive. If these conditions are met, a robots noindex meta tag is inserted into the content, meaning that the document will not be indexed.

// Example of extraction of content for re-insertion
if ((urlContains("funnelback")).and(urlDoesNotStartWith("test")).and(contentContains("\\w+")).and(urlEndsWith(".pdf"))) {
    def matched = getMatchingContent("original(.*?)text");
    replaceContent("original(.*?)text", "replaced text: middle was [" + matched + "]");
}

In this second example we are extracting content for re-insertion. The def keyword is used to define a variable in the scripting language we use (Groovy).

// Example of title replacement
if ((urlContains("amazon")).or(urlDoesNotStartWith("test"))) {
    replaceContent("<title>(.*?)</title>", "<title>New Title</title>");
}

Here we are inserting a new title into the content using the replaceContent action, which takes a regular expression to match and the replacement text.

// Example of extracting content and inserting into metadata
if (urlEndsWith(".pdf")) {
    def matched = getMatchingContent("middle(.*?)content");

    if (matched != "") {
       insertMetaTag("my_meta_data", matched);
    }
}

In this last example we extract some matching content and insert it as metadata. It will be inserted into the "<head>...</head>" section of the document if it has one, or after the opening <html> tag otherwise.