Workflow

Introduction

Funnelback allows you to define one or more commands that are executed before and after each step in the update process, letting you modify the engine's workflow:

[Diagram: Funnelback update steps (Fb-update-steps.png)]

If a command returns a non-zero exit code, the update process will stop immediately.

Modifying Workflow

The key points in the workflow where you can insert callouts to external scripts and programs are listed below (a configuration sketch follows the list):

pre_gather_command

Run this command before crawling or copying data

post_gather_command

Run this command after data has been gathered

pre_index_command

Run this command before indexing (and after text extraction)

post_index_command

Run this command after indexing

post_update_command

Run this command at the very end of the collection update
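
For example, two of these hooks might be wired up in collection.cfg as follows (the script paths here are hypothetical):

pre_gather_command=/opt/funnelback/custom/prepare-feed.sh
post_index_command=/opt/funnelback/custom/verify-index.sh

If either script exits with a non-zero code, the update stops at that point, as noted above.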

Post update

The post_update_command (run at the very end of an update) is treated differently:

  • The command is run in the background; and
  • The command's exit value is ignored.
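
Because the exit value is ignored, a post_update_command must handle its own error reporting. A minimal sketch, assuming a hypothetical report-building script and log file:

#!/bin/sh
# Hypothetical post-update script. Funnelback ignores its exit value,
# so failures are logged here instead of failing the update.
if ! /opt/funnelback/custom/rebuild-reports.sh; then
    echo "$(date): report rebuild failed" >> /opt/funnelback/log/post_update.log
fi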

Further workflow modifications

The five pre/post commands mentioned above should meet the needs of most Funnelback administrators. However, there are several more update 'phases' to which pre/post commands can be applied if absolutely necessary. These include the 'convert' (text filtering) phase and several phases that are specific to instant updates (see updating collections): instant-gather, instant-convert, instant-index, delete-prefix and delete-list. The following configuration parameters may therefore also be used:

  • pre_convert_command
  • post_convert_command
  • pre_instant-gather_command
  • post_instant-gather_command
  • pre_instant-convert_command
  • post_instant-convert_command
  • pre_instant-index_command
  • post_instant-index_command
  • pre_delete-prefix_command
  • post_delete-prefix_command
  • pre_delete-list_command
  • post_delete-list_command
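
These parameters are configured in the same way as the main five; for instance (with a hypothetical script path):

pre_instant-index_command=/opt/funnelback/custom/prepare-instant.sh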

Example: generate data

In this example, you have XML data residing in an archive called /tmp/data.zip. Since this data is periodically updated, you want to extract the latest version before indexing:

pre_index_command=sh ~/bin/extract-data.sh

The extract-data shell script could be:

#!/bin/sh
# Abort immediately if any command fails; the non-zero exit
# code will stop the Funnelback update at this point.
set -e

cd /opt/funnelback/data/my-collection/offline/data
# clean up any previous extraction
rm -rf unzipped
mkdir unzipped
# extract the files from the archive
cd unzipped && unzip /tmp/data.zip
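
Because set -e makes the script exit with a non-zero code if any step fails, a missing or corrupt /tmp/data.zip will abort the update rather than letting indexing proceed on stale data.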

Example: processing log files

If the update is successful, the swap process archives the query logs in C:\Funnelback\data\<collection>\archive. Should you want to perform some processing of these log files, you could start a process at the very end of each update:

post_update_command=c:\temp\analyse-logs.exe --collection shakespeare
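
Because post_update_command runs in the background and its exit value is ignored (see above), a failure in analyse-logs.exe will not affect the already-completed update.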

Substituting values

In order to avoid repeating values, it is possible to pass the value of another collection.cfg parameter to a workflow command script. This is supported with the standard shell ${variable_name} syntax, as in the following example.

pre_index_command=/opt/funnelback/custom/process_crawled_data.sh ${collection_root}

Assuming collection_root is defined in the config file as /opt/funnelback/data/example, the command would expand as follows when the pre-index step runs:

/opt/funnelback/custom/process_crawled_data.sh /opt/funnelback/data/example
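
A minimal sketch of such a script, assuming it only needs the expanded path as its first argument (the clean-up step shown is purely illustrative):

#!/bin/sh
# $1 receives the expanded value of ${collection_root},
# e.g. /opt/funnelback/data/example
COLLECTION_ROOT="$1"
# illustrative step: remove zero-byte files from the gathered data
find "$COLLECTION_ROOT/offline/data" -type f -size 0 -delete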

Also, please note that the special value SEARCH_HOME is automatically available and substitutes the installation location of Funnelback, even though it is not normally defined in a collection.cfg file.
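
For example, this keeps a command independent of where Funnelback is installed (the script path is hypothetical):

pre_gather_command=${SEARCH_HOME}/custom/prepare-feed.sh

If Funnelback is installed under /opt/funnelback, this expands to /opt/funnelback/custom/prepare-feed.sh.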

You can also avoid having to specify the Groovy binary and class paths when using Groovy scripts, as in the following example.

post_index_command=$GROOVY_COMMAND my_command.groovy

The default value for $GROOVY_COMMAND is defined in executables.cfg.
