Custom collections

Introduction

A custom collection allows the administrator to configure Funnelback to gather content from data sources that Funnelback does not support directly, by implementing a custom gathering script. Such scripts are written in the Groovy programming language, with support from a number of Funnelback-specific libraries.

When creating a new custom collection, a template for the gathering logic may be used. Funnelback includes a number of templates specifically for social media APIs, and these options are documented as part of the Social media collections page.

The gathering logic itself is implemented in a config file named custom_gather.groovy (described below), which is then executed during the gather phase of the collection's update cycle. The gathering process is expected to add content to the collection's offline store, with the update then proceeding through subsequent indexing and swap-views phases as well as running any workflow commands.

custom_gather.groovy

The custom_gather.groovy config file allows a custom collection's gathering logic to be implemented. The following example provides a basic template for new custom_gather.groovy files.

Please note that the libraries used in this example are subject to change between versions of Funnelback, so some effort may be required to upgrade custom_gather.groovy scripts between Funnelback versions. The libraries available from within these scripts are those in $SEARCH_HOME/lib/java/all/.

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import java.net.URL;

// Create a configuration object to read collection.cfg
Config config =  new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new XmlStoreFactory(config).newStore();

// Open the store before adding any records
store.open()
try {
    // Loop here to fetch and store each desired record
    def xmlContent = XMLUtils.fromString("<doc><url>http://example.com</url><title>Example</title></doc>");
    store.add(new XmlRecord(xmlContent, "http://example.com"));
} finally {
    // close() is required for the store to be flushed, even if gathering fails part-way
    store.close()
}

You will need to set docurl in xml.cfg for this example to work.

custom_gather.groovy is called with two command-line arguments: first, the location of the Funnelback installation (e.g. /opt/funnelback), and second, the name of the collection for which gathering should occur.
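
The argument handling described above can be sketched as follows. This is a minimal illustration rather than part of the Funnelback template: parseGatherArgs and its error message are hypothetical names introduced here, the collection name is an example value, and the sketch uses no Funnelback libraries.

```groovy
// Hypothetical helper (not a Funnelback API): validates the two
// command-line arguments passed to custom_gather.groovy.
def parseGatherArgs(String[] scriptArgs) {
    if (scriptArgs == null || scriptArgs.length < 2) {
        throw new IllegalArgumentException(
            "Expected two arguments: <search home> <collection name>")
    }
    // First argument: Funnelback installation directory (e.g. /opt/funnelback)
    // Second argument: name of the collection being gathered
    return [searchHome: new File(scriptArgs[0]), collection: scriptArgs[1]]
}

// Example call with illustrative values; a real script would pass `args`.
def parsed = parseGatherArgs(["/opt/funnelback", "example-collection"] as String[])
println "Gathering '${parsed.collection}' under ${parsed.searchHome.path}"
```

Failing early with a clear message makes it easier to spot a script that has been run by hand with the wrong arguments.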

As shown in the example, a store object representing the offline data directory can be created and new records added to it. This would normally be done inside a custom loop that gathers each record from the appropriate source repository.

Reading configuration settings

The config object created in the example above represents both the collection's configuration and the global Funnelback configuration. The config.value(settingName, defaultValue) call will return the value of the settingName from collection.cfg or global.cfg, allowing the gathering process to be configured for the specific collection.
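
As an illustration of how settings might drive a gathering script, the sketch below reads two values with defaults. The setting names (custom.source_url, custom.max_records) are hypothetical, and StubConfig is a stand-in written for this example so that the snippet runs without Funnelback; in a real script the config object from the template would be used instead. The sketch also assumes that values come back as strings.

```groovy
// Stand-in for the Funnelback config object, for illustration only.
// It mimics just the value(settingName, defaultValue) call described above.
class StubConfig {
    Map<String, String> settings = [:]

    String value(String settingName, String defaultValue) {
        settings.containsKey(settingName) ? settings[settingName] : defaultValue
    }
}

def config = new StubConfig(settings: ["custom.source_url": "http://example.com/feed"])

// Hypothetical setting names; choose names appropriate to your collection.
def sourceUrl = config.value("custom.source_url", "http://example.com")

// Assumption: values are strings, so convert explicitly where a number is needed.
def maxRecords = config.value("custom.max_records", "100").toInteger()

println "Fetching up to ${maxRecords} records from ${sourceUrl}"
```

Supplying a default for every setting keeps the script usable on collections where the setting has not been configured.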

Halting gathering

The config.isUpdateStopped() call will report whether the user has requested that Funnelback stop the currently running update. It is good practice to monitor this value regularly during gathering and to gracefully stop the gathering process if the user requests it. If this value is not monitored and handled appropriately the custom_gather.groovy script will continue uninterrupted and the update will, instead, be halted after the script completes.
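
The monitoring described above might look like the following sketch. StubConfig is a stand-in written for this example (it simply reports a stop request after a fixed number of checks) so that the snippet runs without Funnelback, and the URLs are placeholders; in a real script the check would be made against the actual config object.

```groovy
// Stand-in for the Funnelback config object, for illustration only:
// it reports a stop request once three checks have passed.
class StubConfig {
    int checksBeforeStop = 3

    boolean isUpdateStopped() { checksBeforeStop-- <= 0 }
}

def config = new StubConfig()
def pendingUrls = (1..10).collect { "http://example.com/page${it}" }
def gathered = []

for (url in pendingUrls) {
    // Check before each record so that a stop request takes effect promptly.
    if (config.isUpdateStopped()) {
        println "Update stop requested; ending gather after ${gathered.size()} records"
        break
    }
    gathered << url   // a real script would fetch and store the record here
}
```

After leaving the loop early, a real script should still close its store so that records gathered so far are flushed.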

Preserving old content

In some cases it may be useful to have a custom collection which always begins with the content from the previous successful update (rather than always gathering everything from scratch). The easiest way to achieve this is to copy all the content from the live view to the offline view at the beginning of the script. Example code for doing so is included below.

def offlineData = new File(args[0], "data" + File.separator + args[1] + File.separator + "offline" + File.separator + "data");
def liveData = new File(args[0], "data" + File.separator + args[1] + File.separator + "live" + File.separator + "data");
org.apache.commons.io.FileUtils.copyDirectory(liveData, offlineData);
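
A slightly more defensive variant is sketched below. It assumes the live data directory may be missing in some situations (for example before any update has completed), which is an assumption made here rather than documented behaviour. copyLiveToOffline is a hypothetical helper name, and the sketch swaps commons-io for plain JDK and Groovy file calls so that it runs on its own.

```groovy
import java.nio.file.Files
import java.nio.file.StandardCopyOption

// Hypothetical helper: copies <searchHome>/data/<collection>/live/data into
// <searchHome>/data/<collection>/offline/data.
// Returns false (and copies nothing) if the live directory does not exist.
boolean copyLiveToOffline(File searchHome, String collection) {
    File collectionData = new File(new File(searchHome, "data"), collection)
    File live = new File(new File(collectionData, "live"), "data")
    File offline = new File(new File(collectionData, "offline"), "data")

    if (!live.isDirectory()) {
        return false   // nothing to preserve
    }

    offline.mkdirs()
    live.eachFileRecurse { File source ->
        // Rebuild the same relative path underneath the offline directory.
        def relative = live.toPath().relativize(source.toPath())
        File target = offline.toPath().resolve(relative).toFile()
        if (source.isDirectory()) {
            target.mkdirs()
        } else {
            target.parentFile.mkdirs()
            Files.copy(source.toPath(), target.toPath(), StandardCopyOption.REPLACE_EXISTING)
        }
    }
    return true
}

// In custom_gather.groovy this could be called as:
//   copyLiveToOffline(new File(args[0]), args[1])
```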

Storing non-XML content

If you wish to store content without an XML record (e.g. to store raw HTML, text or other record types), the XML store type can be replaced with the RawBytes store type as in the following example code.

def store = new RawBytesStoreFactory(config).newStoreForView(com.funnelback.common.StoreView.offline);

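// A list literal coerced to a byte array; the Java-style new byte[]{...} initializer is not accepted by older Groovy versions
def bytes = [1, 2, 3] as byte[];
store.open()
store.add(new RawBytesRecord(bytes, "http://example.com"));
// close() required for the store to be flushed
store.close()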

Implementing reusable logic

Groovy classes to be used across an installation of Funnelback can be placed in the $SEARCH_HOME/lib/java/groovy directory. Any classes within that directory will be available to be imported into the custom_gather.groovy script and used from within it.

Recommended practice for custom_gather.groovy scripts is to keep the script itself as small and simple as possible, creating separate reusable classes as described above to perform the main gathering tasks.
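
The split between a reusable class and a thin script could look like the following sketch. RecordSource and its fetchAll method are hypothetical names used only for illustration. The class is shown in the same block as the script so that the example runs on its own; in practice it would live in its own file under $SEARCH_HOME/lib/java/groovy and be imported by the script.

```groovy
// Hypothetical reusable class. In a real installation this would be a
// separate file under $SEARCH_HOME/lib/java/groovy.
class RecordSource {
    // Returns the records to gather as a map of URL -> XML content.
    Map<String, String> fetchAll() {
        ["http://example.com/1": "<doc><title>One</title></doc>",
         "http://example.com/2": "<doc><title>Two</title></doc>"]
    }
}

// The gather script then stays small: it only wires the pieces together.
def records = new RecordSource().fetchAll()
records.each { url, xml ->
    // A real script would add each record to the store here.
    println "Would store ${url} (${xml.length()} characters)"
}
```

Keeping the fetching logic in a class of its own also makes it possible to exercise that logic outside of a collection update.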
