Supporting multiple query processors

Overview

Supporting multiple query processor is usually motivated by two requirements:

When the query volume is too high for a single machine,
To ensure continuity of service in case of a query processor failure.

Architecture

To address these two problems you can configure Funnelback to work in a live-live query processor configuration:

One administration server is used to configure the collections, crawl and index the data, configure the search result pages and generate query reports,
One or more query processing machines are used to serve search results.

All the query processor machines will be active and able to serve queries at the same time. The incoming search requests need to be distributed across the query processor servers using an external system, usually a load balancer. In the same fashion taking a query processor server offline for maintenance will be done at the load balancer level: Funnelback does not provide a facility to balance the search requests among all the query processor machines.

Workflow

This scenario is supported by configuring additional workflow steps to:

Deploy newly created collections to the query processors
Copy the index and data files from the administration server to the query processors whenever a collection is updated
Copy configuration files from the administration server to the query processors whenever they are modified
Archive the query and click logs on the query processors and transfer them to the administration server to build the query reports.

Support facilities

The additional workflow steps make use of two facilities to transfer files and trigger actions remotely on the query processors: WebDAV and custom web services. Those facilities are available over standard protocols and can be interacted with for other purposes if needed, either using Funnelback-provided command line tools or programmatically.

The query processor servers must not block requests on the administration port (usually port 8443) and WebDAV port (usually port 8076) from the administration server. These are required to facilitate the file transfer and mediator web service requests.

WebDAV

File transfers between two Funnelback servers relies on the WebDAV protocol. WebDAV is an extension of HTTP adding methods to write to a remote server (Upload a file, create a folder, etc.), allowing existing tools (WebDAV clients) and libraries for your preferred language to interact with it.

Funnelback runs a WebDAV server whose root is the Funnelback installation folder ($SEARCH_HOME). It runs over HTTPS and on the default port 8076.

The port can be changed by editing $SEARCH_HOME/conf/global.cfg and setting webdav-service.port=1234. Note that changing the port will require a modification of the various workflow components to account for the change.
The WebDAV service can be disabled by setting daemon.services=... accordingly (See the example in global.cfg.default)
The WebDAV service listen on all IP addresses by default. To specify on which addresses it should listen webdav-service.bind-address=1.2.3.4 can be set. It's recommended to have the WebDAV service listen only to a private LAN interface that is shared between all the query processors.

Any configuration change requires a restart of the Funnelback Daemon service to take effect.

The WebDAV service requires authentication and uses the same user database as the administration interface. Only administrator users can access the WebDAV service.

WebDAV being an extension of HTTP, you can easily access it with your web browser: Simply point it to https://funnelback-server.com:8076/webdav/ and enter administrator credentials.

Web services

Funnelback offers a set of web services that can trigger local or remote actions, such as transferring a complete collection to a remote host, or remotely swapping the views of a collection.

Those web services can be invoked using the $SEARCH_HOME/bin/mediator.pl command line or REST (over HTTPS). The REST web services are deployed on Jetty, under the administration interface context: /search/admin/mediator/.

To access a list of available web services, either:

Use mediator.pl --list-tasks,
Invoke the REST interface by pointing your browser to https://funnelback-server.com:8443/search/admin/mediator/. (Replace 8443 with the administration port used during installation).

For example, to push the intranet collection to a remote machine you could use:

 mediator.pl PushCollection collection=intranet host=qp01.domain.com

Similarly, to remotely swap the views on a query processor server:

mediator.pl SwapViews@qp01.domain.com collection=intranet

The specific commands that are used to support multiple query processors are: PullLogs, SwapViews, PushConfigFile, PushView and PushCollection.

For more information, please consult the mediator.pl documentation page.

Configuration

List of query processors

The list of query processor machines is configured either in each collection collection.cfg, or globally in $SEARCH_HOME/conf/global.cfg using the query-processors=... setting. This setting should contain the comma-separated list of fully-qualified host names for the query processors, and should be set on the administration server:

query-processors=qp01.domain.com,qp02.domain.com

The administration server and the query processors must all share the same:

Server secret, see server secret (global.cfg) for details on how to check and change the server_secret.
Encryption key for passwords, stored in $SEARCH_HOME/conf/keyset or loaded via the encryption.keyset-handle-provider-class (global.cfg) setting.

Note: after changing the server_secret the REDIS password will need to be regenerated and the services restarted.

The administration interface should also use the same port between the administration server and the query processors, in global.cfg(.default):

urls.admin_port=8443
webdav-service.port=8076

It is recommended to disable WebDAV on the administration server as this will prevent any of the query processors pushing to the administration server. In global.cfg on the administration server. This is implemented by removing the WebDavService option from the daemon.services configuration key in global.cfg otherwise if the key is not present in global.cfg then add the following:

daemon.services=FilterService

The administration interface should be set to read only mode for the query processors, in global.cfg(.default) on each query processor:

admin.read-only-mode=true

Initial publication

Collections created on the administration server must be initialized on the remote query processors. This can done automatically by configuring a script to run whenever a collection is created. Edit global.cfg and add the following:

post_collection_create_command=$SEARCH_HOME/bin/push_collection.pl $COLLECTION_NAME

Note: That will only be effective for collections created after this configuration has been put in place. Existing collections will need to be manually initialized by running the following command under $SEARCH_HOME/bin/:

$SEARCH_HOME/linbin/ActivePerl/bin/perl mediator.pl PushCollection collection=my_collection host=qp01.domain.com

...and repeat for each query processor. Note that this operation might take a while depending of the size of your index, but it will eventually complete.

Publication of configuration files

In order to publish modified configuration files to the remote query processors you need to set workflow.publish_hook=$SEARCH_HOME/bin/publish_hook.pl. It can be added on per-collection basis to the collection collection.cfg, or for all the collections in $SEARCH_HOME/conf/collection.cfg.

This setting will cause the hook script to be called when a file is published from a preview profile to a live profile using the publish button in the file manager. The hook script will iterate on the query processors list and push the updated configuration file to each one. This is required on the administration server.

Non profiles files cannot be published by default. To enable publication $SEARCH_HOME/conf/file-manager.ini needs to be edited. Under the section [file-manager::home] add a new line publish-to = REMOTE:

[file-manager::home]
name = Config
path = $home
rules = main-rules
deletable = false
# Enable this in multi-server setups
publish-to = REMOTE

REMOTE is a special publication target that will allow non-profile configuration files to be published. This will enable a publish button in the file manager for each non-profile file.

Note: Funnelback doesn't track the publication status of non-profile files. The publish button will always be displayed, regardless of whether the file has changed or not since the last publication.

Collection update workflow

New indexes and data must be transferred to the query processors when a collection is updated. To do so edit collection.cfg and set a post_archive:

post_archive_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/publish_index.pl $COLLECTION_NAME

This script will iterate over the query processors and will:

Transfer the new live folder to the remote machine, into the offline folder,
Swap the views on the remote machine,
Archive the queries and click logs for the previous remote live folder (now offline).

Query reports and analytics

Analytics update are run on the administration server. The clicks and queries logs must be pulled from the remote query processors before the reports are updated. To do so you need to add a pre_reporting_command in collection.cfg:

pre_reporting_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/pull_logs.pl $COLLECTION_NAME

This script will transfer the logs from the query processors into the archive folder of the collection on the administration server. Only log files that are not already present will be transferred.

Meta collections

All the above steps needs to be configured for meta collection, except for the collection update workflow (indexes pushing). Because meta collections are not updated like other collections, setting a post_archive_command will have no effect. Moreover, there's no index to transfer for a meta collections. The only thing that needs to be transferred to the query processor is the list of component collections. To do so you can define workflow.publish_hook.meta in the meta collection configuration (or into the server-wide collection.cfg to apply this to all meta collections):

workflow.publish_hook.meta=$SEARCH_HOME/bin/publish_index.pl $COLLECTION_NAME

This script will be called whenever the meta collection is edited, such as when an administrator changes the list of component collections. It will push the live/idx folder that contains the list of components collections in the index.sdinfo file.

When auto-completion or spelling suggestions are enabled on the meta collection the suggestion indexes are rebuilt whenever a component collection is updated. To make sure that the new suggestions indexes are published to the query processors you also need to update the post_meta_dependencies_command on each component collection to publish the index of the meta collection:

post_meta_dependencies_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/publish_index.pl meta-collection-name

Note: If the component collections interact with their parent meta collection in post_update_command (or other workflow commands) then you must update the workflow accordingly to push the meta collection after it has been modified.

The pre_reporting_command also needs to be setup in the meta collection configuration file (as mentioned in the previous section) if you want to build Analytics reports for the meta collection.

Note: For the pull_logs.pl script to get all relevant logs from remote Meta collections, manual log archiving will need to be set up on those machines. This is since meta collection logs are normally not rotated or archived as there is no update event.

Adding a query processor node

To add a query processor node:

Edit global.cfg and modify the query-processors setting to add the hostname of the new node,
Manually run the initial publication of the collections you want to deploy (see previous "Initial publication" section), targeting the new node hostname.

Troubleshooting

The following logs might be helpful in diagnosing issues:

$SEARCH_HOME/log/create.log : Log for collection creation. Should show that the collections are initialized to the remote nodes when they are created.
$SEARCH_HOME/log/publish.log : Log for configuration files publication from the file manager. Show remote publication whenever a configuration file is published.
$SEARCH_HOME/log/mediator-cli.log : Log for the mediator tasks, used by the index publication script.
$SEARCH_HOME/log/mediator-endpoint-http.log : Log for the mediator web services, used to setup and swap views.

If connection refused errors are logged check that the server_secret is set correctly on all machines, and that WebDAV and administration interface HTTPS ports are allowed by firewalls between the administration the query processor machines.

Going further

Various alternatives are provided if you need to perform actions or transfer files between two Funnelback servers that not in the scope of this scenario:

Using mediator.pl
Using $SEARCH_HOME/bin/unofficial/transfer-webdav.groovy. This script provides a simple command-line interface to WebDAV file transfer and can be used as a basis to write a custom Groovy script that suits your needs.

In some cases, you may want to share the same Redis dataset between the query processors and the administration server. To do so, read configuring Redis to work with multiple servers.