Supporting multiple query processors
Supporting multiple query processors is usually motivated by two requirements:
- When the query volume is too high for a single machine,
- To ensure continuity of service in case of a query processor failure.
To address these two problems you can configure Funnelback to work in a live-live query processor configuration:
- One administration server is used to configure the collections, crawl and index the data, configure the search result pages and generate query reports,
- One or more query processing machines are used to serve search results.
All the query processor machines will be active and able to serve queries at the same time. The incoming search requests need to be distributed across the query processor servers using an external system, usually a load balancer. In the same fashion taking a query processor server offline for maintenance will be done at the load balancer level: Funnelback does not provide a facility to balance the search requests among all the query processor machines.
This scenario is supported by configuring additional workflow steps to:
- Deploy newly created collections to the query processors,
- Copy the index and data files from the admin server to the query processors whenever a collection is updated,
- Copy configuration files from the admin server to the query processors whenever they are modified,
- Archive the query and click logs on the query processors and transfer them to the Admin server, to build the query reports.
The additional workflow steps make use of two facilities to transfer files and trigger actions remotely on the query processors: WebDAV and custom web services. Those facilities are available over standard protocols and can be interacted with for other purposes if needed, either using Funnelback-provided command line tools or programmatically.
File transfers between two Funnelback servers rely on the WebDAV protocol. WebDAV is an extension of HTTP that adds methods to write to a remote server (upload a file, create a folder, etc.), allowing existing tools (WebDAV clients) and libraries for your preferred language to interact with it.
Funnelback runs a WebDAV server whose root is the Funnelback installation folder ($SEARCH_HOME). It runs over HTTPS, on port 8076 by default.
- The port can be changed by setting webdav-service.port=1234. Note that changing the port will require a modification of the various workflow components to account for the change.
- The WebDAV service can be disabled by setting daemon.services=... accordingly (See the example in
- The WebDAV service listens on all IP addresses by default. To specify which addresses it should listen on, webdav-service.bind-address=188.8.131.52 can be set. It's recommended to have the WebDAV service listen only on a private LAN interface that is shared between all the query processors.
Any configuration change requires a restart of the Funnelback Daemon service to take effect.
The WebDAV service requires authentication and uses the same user database as the Admin UI. Only administrator users can access the WebDAV service.
WebDAV being an extension of HTTP, you can easily access it with your web browser: Simply point it to https://funnelback-server.com:8076/webdav/ and enter administrator credentials.
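Because the service speaks standard WebDAV, any HTTP library can also talk to it. The following is a minimal Python sketch, not a Funnelback-provided tool: it builds a PROPFIND request (the standard WebDAV method for listing a folder) against the URL above. It assumes the service accepts HTTP Basic authentication, which this page does not confirm; the user name and password are placeholders.

```python
import base64
import urllib.request


def build_propfind(url, user, password, depth="1"):
    """Build (but do not send) a WebDAV PROPFIND request for `url`.

    Assumption: the server accepts HTTP Basic authentication.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        method="PROPFIND",
        headers={
            "Authorization": f"Basic {token}",
            # Depth: 1 lists the folder and its immediate children only.
            "Depth": depth,
        },
    )


if __name__ == "__main__":
    request = build_propfind(
        "https://funnelback-server.com:8076/webdav/", "admin", "secret"
    )
    # Sending the request returns a 207 Multi-Status XML listing on success:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
    print(request.get_method(), request.full_url)
```

The request is only built here so that the sketch can be reviewed without touching a server; uncomment the `urlopen` lines to actually send it.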
Funnelback offers a set of web services that can trigger local or remote actions, such as transferring a complete collection to a remote host, or remotely swapping the views of a collection.
Those web services can be invoked using the $SEARCH_HOME/bin/mediator.pl command line or REST (over HTTPS). The REST web services are deployed on Jetty, under the Admin UI context:
To access a list of available web services, either:
- Invoke the REST interface by pointing your browser to https://funnelback-server.com:8443/search/admin/mediator/ (Replace 8443 with the admin port used during installation).
For example, to push the intranet collection to a remote machine you could use:
mediator.pl PushCollection collection=intranet host=qp01.domain.com
Similarly, to remotely swap the views on a query processor server:
mediator.pl SwapViews@qp01.domain.com collection=intranet
The specific commands that are used to support multiple query processors are:
For more information, please consult the mediator.pl documentation page.
List of query processors
The list of query processor machines is configured either in each collection's collection.cfg, or globally in $SEARCH_HOME/conf/global.cfg, using the query-processors=... setting. This setting should contain the comma-separated list of fully-qualified host names for the query processors, and should be set on the admin server:
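To illustrate the expected shape of the value, here is a small Python sketch that reads the setting from configuration text and splits it into host names. It is not part of Funnelback. The second host name, the `#` comment handling, and the rule that a later occurrence overrides an earlier one are assumptions made for the example.

```python
def parse_query_processors(cfg_text):
    """Return the host names listed in the query-processors setting."""
    hosts = []
    for line in cfg_text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "query-processors":
            # Assumption: a later occurrence replaces an earlier one.
            hosts = [h.strip() for h in value.split(",") if h.strip()]
    return hosts


sample = """
# global.cfg on the admin server (illustrative values)
query-processors=qp01.domain.com,qp02.domain.com
"""
print(parse_query_processors(sample))
```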
The admin server and the query processors must all share the same server secret. See server secret (global.cfg) for details on how to check and change the server_secret.
The Admin UI should also use the same port between the admin server and the query processors, in
It is recommended to disable WebDAV on the administration server, as this will prevent any of the query processors from pushing to the admin server. In global.cfg on the admin server:
The Admin UI should be set to read-only mode for the query processors, in global.cfg(.default) on each query processor:
Initial publication
Collections created on the administration server must be initialized on the remote query processors. This can be done automatically by configuring a script to run whenever a collection is created. Edit global.cfg and add the following:
Note: That will only be effective for collections created after this configuration has been put in place. Existing collections will need to be manually initialized by running the following command under
$SEARCH_HOME/linbin/ActivePerl/bin/perl mediator.pl PushCollection collection=my_collection host=qp01.domain.com
...and repeat for each query processor. Note that this operation might take a while depending on the size of your index, but it will eventually complete.
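When several existing collections have to be pushed to several query processors, the repetition can be scripted. The Python sketch below only builds and prints the command lines, following the PushCollection form shown above. The helper name, the second host name, and the default `perl` executable are assumptions; pass the full ActivePerl path if that is what your installation uses.

```python
import shlex


def push_collection_commands(collections, hosts, perl="perl"):
    """Build one mediator.pl PushCollection command per (collection, host)."""
    return [
        [perl, "mediator.pl", "PushCollection",
         f"collection={collection}", f"host={host}"]
        for collection in collections
        for host in hosts
    ]


if __name__ == "__main__":
    commands = push_collection_commands(
        ["my_collection"], ["qp01.domain.com", "qp02.domain.com"]
    )
    for command in commands:
        # Printed for review only. To execute, pass each list to
        # subprocess.run(command, check=True) from the folder that
        # contains mediator.pl.
        print(shlex.join(command))
```

Printing first makes it easy to check the host and collection names before starting transfers that may take a long time.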
Publication of configuration files
In order to publish modified configuration files to the remote query processors, you need to set workflow.publish_hook=$SEARCH_HOME/bin/publish_hook.pl. It can be added on a per-collection basis to the collection's collection.cfg, or for all the collections in
This setting will cause the hook script to be called when a file is published from a preview profile to a live profile using the Publish button in the file manager. The hook script will iterate over the list of query processors and push the updated configuration file to each one.
Non-profile files cannot be published by default. To enable publication, $SEARCH_HOME/conf/file-manager.ini needs to be edited. Under the section [file-manager::home], add a new line publish-to = REMOTE:
[file-manager::home]
name = Config
path = $home
rules = main-rules
deletable = false
# Enable this in multi-server setups
publish-to = REMOTE
REMOTE is a special publication target that will allow non-profile configuration files to be published. This will enable a Publish button in the file manager for each non-profile file.
Note: Funnelback doesn't track the publication status of non-profile files. The Publish button will always be displayed, regardless of whether the file has changed or not since the last publication.
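To confirm that the edit to file-manager.ini took effect, the setting can be checked with a short script. This Python sketch is an illustration only and assumes the file can be read by a standard INI parser (no duplicate keys, every line a `key = value` pair, a comment, or a section header), which may not hold for every installation.

```python
import configparser


def remote_publication_enabled(ini_text, section="file-manager::home"):
    """Return True if `section` contains publish-to = REMOTE."""
    # interpolation=None disables '%' expansion, so values are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return False
    return parser.get(section, "publish-to", fallback="").strip() == "REMOTE"


sample = """
[file-manager::home]
name = Config
path = $home
rules = main-rules
deletable = false
# Enable this in multi-server setups
publish-to = REMOTE
"""
print(remote_publication_enabled(sample))
```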
Collection update workflow
New indexes and data must be transferred to the query processors when a collection is updated. To do so, edit collection.cfg and set a post_archive_command:
post_archive_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/publish_index.pl $COLLECTION_NAME
This script will iterate over the query processors and will:
- Transfer the new live folder to the remote machine, into the offline folder,
- Swap the views on the remote machine,
- Archive the queries and click logs for the previous remote live folder (now offline).
Query reports and Analytics
Analytics updates are run on the Admin server. In order for the Admin server to present accurate reports, the click and query logs must be pulled from the remote query processors before the reports are updated. To do so, you need to add a pre_reporting_command:
pre_reporting_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/pull_logs.pl $COLLECTION_NAME
This script will transfer the logs from the query processors into the archive folder of the collection on the Admin server. Only log files that are not already present will be transferred.
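The "only files that are not already present" rule amounts to a set difference between the remote and local file listings. The Python sketch below illustrates that idea only; it is not how pull_logs.pl is implemented, the log file names are invented, and comparing by file name alone is an assumption.

```python
def logs_to_transfer(remote_logs, local_logs):
    """Return the remote log names missing from the local archive, sorted."""
    return sorted(set(remote_logs) - set(local_logs))


# Hypothetical file names, for illustration only.
remote = ["queries.log.20240101", "clicks.log.20240101", "queries.log.20240102"]
local = ["queries.log.20240101"]
print(logs_to_transfer(remote, local))
```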
All the above steps need to be configured for meta collections, except for the collection update workflow (index pushing). Because meta collections are not updated like other collections, setting a post_archive_command will have no effect. Moreover, there is no index to transfer for a meta collection. The only thing that needs to be transferred to the query processors is the list of component collections. To do so, you can define workflow.publish_hook.meta in the meta collection configuration:
This script will be called whenever the meta collection is edited, such as when an administrator changes the list of component collections. It will push the live/idx folder that contains the list of component collections in the
When auto-completion or spelling suggestions are enabled on the meta collection, the suggestion indexes are rebuilt whenever a component collection is updated. To make sure that the new suggestion indexes are published to the query processors, you also need to update the post_meta_dependencies_command on each component collection to publish the index of the meta collection:
post_meta_dependencies_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/publish_index.pl meta-collection-name
Note: If the component collections interact with their parent meta collection in post_update_command (or other workflow commands), then you must update the workflow accordingly to push the meta collection after it has been modified.
pre_reporting_command also needs to be set up in the meta collection configuration file (as mentioned in the previous section) if you want to build Analytics reports for the meta collection.
Note: For the pull_logs.pl script to get all relevant logs from remote meta collections, manual log archiving will need to be set up on those machines. This is because meta collection logs are normally not rotated or archived, as there is no update event.
Adding a query processor node
To add a query processor node:
- Edit global.cfg and modify the query-processors setting to add the hostname of the new node,
- Manually run the initial publication of the collections you want to deploy (see previous "Initial publication" section), targeting the new node hostname.
The following logs might be helpful in diagnosing issues:
$SEARCH_HOME/log/create.log: Log for collection creation. Should show that the collections are initialized to the remote nodes when they are created.
$SEARCH_HOME/log/publish.log: Log for configuration files publication from the file manager. Should show remote publication whenever a configuration file is published.
$SEARCH_HOME/log/mediator-cli.log: Log for the mediator tasks, used by the index publication script.
$SEARCH_HOME/log/mediator-endpoint-http.log: Log for the mediator web services, used to set up and swap views.
If "connection refused" errors are logged, check that the server_secret is set correctly on all machines, and that the WebDAV and Admin UI HTTPS ports are allowed by firewalls between the admin server and the query processor machines.
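A quick way to rule out firewall problems is to test TCP reachability of both ports from the admin server. The Python sketch below assumes the default WebDAV port 8076 and an Admin UI port of 8443; adjust both to your installation. It only checks that a connection can be opened, so it says nothing about the server_secret.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_query_processor(host, ports=(8076, 8443)):
    """Map each port to its reachability for one query processor."""
    return {port: port_is_open(host, port) for port in ports}


# Example use, run from the admin server (host name is illustrative):
# for port, reachable in check_query_processor("qp01.domain.com").items():
#     print(port, "reachable" if reachable else "NOT reachable")
```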
Various alternatives are provided if you need to perform actions or transfer files between two Funnelback servers that are not in the scope of this scenario:
- Using mediator.pl
- Using $SEARCH_HOME/bin/unofficial/transfer-webdav.groovy. This script provides a simple command-line interface to WebDAV file transfer and can be used as a basis to write a custom Groovy script that suits your needs.
In some cases, you may want to share the same Redis dataset between the query processors and the admin server. To do so, read configuring Redis to work with multiple servers.