
Custom collections

Introduction

A custom collection allows the administrator to configure Funnelback to gather content from data sources not directly supported within Funnelback by implementing a custom gathering script. Such scripts can be implemented in the Groovy programming language, with support from a number of Funnelback specific libraries.

When creating a new custom collection, a template for the gathering logic may be used. Funnelback includes a number of templates specifically for social media APIs, and these options are documented as part of the Social media collections page.

The gathering logic itself is implemented in a config file named custom_gather.groovy (described below), which is then executed during the gather phase of the collection's update cycle. The gathering process is expected to add content to the collection's offline store, with the update then proceeding through subsequent indexing and swap-views phases as well as running any workflow commands.

custom_gather.groovy

The custom_gather.groovy config file allows a custom collection's gathering logic to be implemented. The following example provides a basic template for new custom_gather.groovy files.

Please note that the libraries used in this example are subject to change between versions of Funnelback, so some effort may be required to upgrade custom_gather.groovy scripts between Funnelback versions. The libraries available from within this script are the ones in $SEARCH_HOME/lib/java/all/.

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;

// Create a configuration object to read collection.cfg
Config config =  new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore();

store.open();
try {
    // Loop here to fetch and store each desired record

    def record = new RawBytesRecord(
            "<html><p>Hello, world</p></html>".getBytes("UTF-8"), // Convert the content to utf-8 bytes 
            "http://example.com/"); // set the URI to store the document as.
    
    def metadata = ArrayListMultimap.create();
    // Set the correct Content-Type of the record.
    metadata.put("Content-Type", "text/html; charset=UTF-8");
    
    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}

custom_gather.groovy should expect to be called with two command line arguments: first, the location of the Funnelback installation (e.g. /opt/funnelback), and second, the name of the collection for which gathering should occur.

As shown in the example, a store object representing the offline data directory can be created and new records added to it. This would normally be implemented within a custom loop which gathers new records from the appropriate source repository.

Reading configuration settings

The config object created in the example above represents both the collection's configuration and the global Funnelback configuration. The config.value(settingName, defaultValue) call will return the value of the settingName from collection.cfg or global.cfg, allowing the gathering process to be configured for the specific collection.
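
For example, a gathering script might read a repository URL and a record limit from the collection configuration. The setting names below (example.api_url and example.max_records) are hypothetical and shown for illustration only; they would need to be set in the collection's collection.cfg:

// Hypothetical setting names, shown for illustration only
def apiUrl = config.value("example.api_url", "https://api.example.com/records");
def maxRecords = Integer.parseInt(config.value("example.max_records", "100"));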

Halting gathering

The config.isUpdateStopped() call will report whether the user has requested that Funnelback stop the currently running update. It is good practice to monitor this value regularly during gathering and to gracefully stop the gathering process if the user requests it. If this value is not monitored and handled appropriately, the custom_gather.groovy script will continue uninterrupted and the update will instead be halted after the script completes.
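
A minimal sketch of such a check, placed inside the gathering loop of the template above:

// Inside the gathering loop: stop cleanly if the user has requested the update be stopped
if (config.isUpdateStopped()) {
    break; // exit the loop; the finally block will still close and flush the store
}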

Preserving old content

In some cases it may be useful to have a custom collection which always begins with the content from the previous successful update (rather than always gathering everything from scratch). The easiest way to achieve this is to copy all the content from the live view to the offline view at the beginning of the script. Example code for doing so is included below.

def offlineData = new File(args[0], "data" + File.separator + args[1] + File.separator + "offline" + File.separator + "data");
def liveData = new File(args[0], "data" + File.separator + args[1] + File.separator + "live" + File.separator + "data");
org.apache.commons.io.FileUtils.copyDirectory(liveData, offlineData);

Implementing reusable logic

Groovy classes to be used across an installation of Funnelback can be placed in the $SEARCH_HOME/lib/java/groovy directory. Any classes within that directory will be available to be imported into the custom_gather.groovy script and used from within it.

Recommended practice for custom_gather.groovy scripts is to keep the script itself as small/simple as possible, creating separate reusable classes as described above to perform the main gathering tasks.
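
As a sketch, a reusable fetcher class could be placed at $SEARCH_HOME/lib/java/groovy/com/example/gather/ExampleFetcher.groovy. The package, class and method names here are hypothetical and shown for illustration only:

package com.example.gather;

class ExampleFetcher {
    // Return [URI, HTML content] pairs from the source repository (stubbed here for illustration)
    List<List<String>> fetchAll() {
        return [["http://example.com/", "<html><p>Hello, world</p></html>"]];
    }
}

custom_gather.groovy could then import the class and remain a thin wrapper around it:

import com.example.gather.ExampleFetcher;

def metadata = ArrayListMultimap.create();
metadata.put("Content-Type", "text/html; charset=UTF-8");

new ExampleFetcher().fetchAll().each { url, content ->
    store.add(new RawBytesRecord(content.getBytes("UTF-8"), url), metadata);
}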

Storing XML content

It is possible to store XML content by using org.w3c.dom.Document objects. Here is an example:

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;

// Create a configuration object to read collection.cfg
Config config =  new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore()
                .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects

store.open();
try {
    // Loop here to fetch and store each desired record

    org.w3c.dom.Document xml = XMLUtils.fromString("<doc><url>http://example.com</url><title>Example</title></doc>");
    def record = new XmlRecord(xml, "http://example.com/");

    // No need to set Content-Type as it will be set by the store.
    def metadata = ArrayListMultimap.create();
    
    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}

Troubleshooting

Cache copies don't work

For cache copies to work, collection.cfg should have:

store.record.type=RawBytesRecord

You must also be using the RawBytesStoreFactory. If you are using XmlStoreFactory, you can generally replace:

def store = new XmlStoreFactory(config).newStore();

with

def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // You may want to disable filtering.
                .newStore()
                .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects
