
Custom collections

Introduction

A custom collection allows the administrator to configure Funnelback to gather content from data sources not directly supported within Funnelback by implementing a custom gathering script. Such scripts can be implemented in the Groovy programming language, with support from a number of Funnelback specific libraries.

When creating a new custom collection, a template for the gathering logic may be used. Funnelback includes a number of templates specifically for social media APIs, and these options are documented as part of the Social media collections page.

The gathering logic itself is implemented in a config file named custom_gather.groovy (described below), which is then executed during the gather phase of the collection's update cycle. The gathering process is expected to add content to the collection's offline store, with the update then proceeding through subsequent indexing and swap-views phases as well as running any workflow commands.

custom_gather.groovy

The custom_gather.groovy config file allows a custom collection's gathering logic to be implemented. The following example provides a basic template for new custom_gather.groovy files.

Please note that the libraries used in this example are subject to change between versions of Funnelback, so some effort may be required to upgrade custom_gather.groovy scripts between Funnelback versions. The libraries available from within this script are the ones in $SEARCH_HOME/lib/java/all/.

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;

// Create a configuration object to read collection.cfg
Config config =  new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore();

store.open();
try {
    // Loop here to fetch and store each desired record

    def record = new RawBytesRecord(
            "<html><p>Hello, world</p></html>".getBytes("UTF-8"), // Convert the content to utf-8 bytes 
            "http://example.com/"); // set the URI to store the document as.
    
    def metadata = ArrayListMultimap.create();
    // Set the correct Content-Type of the record.
    metadata.put("Content-Type", "text/html; charset=UTF-8");
    
    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}

custom_gather.groovy should expect to be called with two command line arguments: first, the location of the Funnelback installation (e.g. /opt/funnelback), and second, the name of the collection for which gathering should occur.

As shown in the example, a store object representing the offline data directory can be created and new records added to it. This would normally be implemented within a custom loop which gathers new records from the appropriate source repository.

Reading configuration settings

The config object created in the example above represents both the collection's configuration and the global Funnelback configuration. The config.value(settingName, defaultValue) call will return the value of the settingName from collection.cfg or global.cfg, allowing the gathering process to be configured for the specific collection.
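
For example, a gathering script might read a repository URL and a record limit from the collection configuration. The setting names below (example.api_url and example.max_records) are hypothetical and shown for illustration only; they would need to be set in the collection's collection.cfg:

// Hypothetical setting names, shown for illustration only
def apiUrl = config.value("example.api_url", "https://api.example.com/records");
def maxRecords = Integer.parseInt(config.value("example.max_records", "100"));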

Halting gathering

The config.isUpdateStopped() call will report whether the user has requested that Funnelback stop the currently running update. It is good practice to monitor this value regularly during gathering and to gracefully stop the gathering process if the user requests it. If this value is not monitored and handled appropriately, the custom_gather.groovy script will continue uninterrupted and the update will instead be halted after the script completes.
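
A minimal sketch of such a check, placed inside the gathering loop of the template above:

// Inside the gathering loop: stop cleanly if the user has requested the update be stopped
if (config.isUpdateStopped()) {
    break; // exit the loop; the finally block will still close and flush the store
}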

Preserving old content

In some cases it may be useful to have a custom collection which always begins with the content from the previous successful update (rather than always gathering everything from scratch). The easiest way to achieve this is to copy all the content from the live view to the offline view at the beginning of the script. Example code for doing so is included below.

def offlineData = new File(args[0], "data" + File.separator + args[1] + File.separator + "offline" + File.separator + "data");
def liveData = new File(args[0], "data" + File.separator + args[1] + File.separator + "live" + File.separator + "data");
org.apache.commons.io.FileUtils.copyDirectory(liveData, offlineData);

Implementing reusable logic

Groovy classes to be used across an installation of Funnelback can be placed in the $SEARCH_HOME/lib/java/groovy directory. Any classes within that directory will be available to be imported into the custom_gather.groovy script and used from within it.

Recommended practice for custom_gather.groovy scripts is to keep the script itself as small/simple as possible, creating separate reusable classes as described above to perform the main gathering tasks.
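
As a sketch, a reusable fetcher class could be placed at $SEARCH_HOME/lib/java/groovy/com/example/gather/ExampleFetcher.groovy. The package, class and method names here are hypothetical and shown for illustration only:

package com.example.gather;

class ExampleFetcher {
    // Return [URI, HTML content] pairs from the source repository (stubbed here for illustration)
    List<List<String>> fetchAll() {
        return [["http://example.com/", "<html><p>Hello, world</p></html>"]];
    }
}

custom_gather.groovy could then import the class and remain a thin wrapper around it:

import com.example.gather.ExampleFetcher;

def metadata = ArrayListMultimap.create();
metadata.put("Content-Type", "text/html; charset=UTF-8");

new ExampleFetcher().fetchAll().each { url, content ->
    store.add(new RawBytesRecord(content.getBytes("UTF-8"), url), metadata);
}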

Storing XML content

It is possible to store XML content by using org.w3c.dom.Document objects. Here is an example:

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;

// Create a configuration object to read collection.cfg
Config config =  new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore()
                .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects

store.open();
try {
    // Loop here to fetch and store each desired record

    org.w3c.dom.Document xml = XMLUtils.fromString("<doc><url>http://example.com</url><title>Example</title></doc>");
    def record = new XmlRecord(xml, "http://example.com/");

    // No need to set Content-Type as it will be set by the store.
    def metadata = ArrayListMultimap.create();
    
    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}

Troubleshooting

Cache copies don't work

For cache copies to work, collection.cfg should have:

store.record.type=RawBytesRecord

You must also be using the RawBytesStoreFactory. If you are using XmlStoreFactory, you can generally replace:

def store = new XmlStoreFactory(config).newStore();

with

def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // You may want to disable filtering.
                .newStore()
                .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects
