Push collections

Push collections differ from other collections, such as Web or Database collections, in that they do not gather (crawl) content. Instead, a third-party service must push content into them using the Push RESTful API.

Overview

Push collections are often used with XML content, although other content types such as HTML are supported. Content is pushed in as individual documents (as opposed to a single XML file containing multiple document nodes), with one API request per document.

Content can be added, retrieved, updated and removed via the API. Additional metadata can be associated with content if necessary (via HTTP headers or multi-part HTTP requests). Each document has a unique identifier, a Key, which needs to be a valid URL. This key is used to retrieve, update or delete the document, and is used as the document's URL when displaying search results.

Content is made available for search after the pending changes on a collection are committed. A commit can be triggered manually via the API, and will also happen automatically after a configured timeout or after a number of changes have been made.

Push collections use the same configuration systems as other collections (collection.cfg, metamap.cfg, xml.cfg, etc.) but some index-related changes will require a complete re-index of the collection to take effect, via the Vacuum API call.

RESTful API

The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger UI, go to the admin home page and, under the System drop-down menu, click View API UI. This opens the Swagger UI, where you can then click push-api.

Or you can go directly to:

https://host_name:admin_port/search/admin/api-ui/#!/push-api

Getting Started

You will first need to create a Push collection from the admin home page. Once the collection is created you will not need to start the collection as it will be automatically started on the next API request.

Let's add a document with the URL http://myfirstdocument/ to your Push collection. You will need to perform a PUT request to add the document to your Push collection, using the following URL and supplying the document in the body of the request.

PUT https://host_name:admin_port/push-api/v1/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F
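As a rough illustration, the request above could be built in Java as follows. This is only a sketch: the class name, the host and collection values, and the text/plain content type are assumptions made for the example, and authentication and the actual sending of the request are left out. The key is percent-encoded because it travels as a query parameter.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Funnelback.
final class PushPutExample {

    /** Builds the documents URL, percent-encoding the key as a query parameter value. */
    static URI documentUri(String baseUrl, String collection, String key) {
        String encodedKey = URLEncoder.encode(key, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/push-api/v1/collections/" + collection
                + "/documents?key=" + encodedKey);
    }

    /** Builds (but does not send) a PUT request carrying the document as the body. */
    static HttpRequest putDocument(String baseUrl, String collection, String key, String content) {
        return HttpRequest.newBuilder(documentUri(baseUrl, collection, key))
                // Assumed content type for a plain-text document.
                .header("Content-Type", "text/plain; charset=UTF-8")
                .PUT(HttpRequest.BodyPublishers.ofString(content, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host, port and collection name.
        HttpRequest request = putDocument("https://search.example.com:8443", "my-push-collection",
                "http://myfirstdocument/", "The quick brown fox jumps over the lazy dog");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The request would then be sent with java.net.http.HttpClient, with credentials added as described under Authentication.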

The easiest way to try this out is to use the Swagger UI, under the section:

PUT /push-api/v2/collections/{collection}/documents

Set the collection to the name of the collection you created, then set the key to:

http://myfirstdocument/

and set the content to:

The quick brown fox jumps over the lazy dog

Don't change the values of any other fields, then click try it out!

By default Push will make documents searchable as soon as possible. Perform a search for fox over your collection and you should see a single result with the URL http://myfirstdocument/.

If you wish to update the document, you can PUT the document in with the same key and different content. PUT requests always replace existing content at the given key. Refer to the Swagger UI documentation for more information.

Push also supports deleting documents. To delete the previously added document we would use the following DELETE call:

DELETE https://host_name:admin_port/push-api/v1/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F

You can try this from the Swagger UI, under the section:

DELETE /push-api/v2/collections/{collection}/documents

Set the collection to the name of the collection you created, then set the key to:

http://myfirstdocument/

Then click try it out!

If you perform a search for fox over your collection the document will no longer be returned.

Authentication

The Push API accepts both basic authentication and token-based authentication. Although basic authentication is simpler to use, it is less efficient because authentication is re-performed on the server for every request. Use token-based authentication where possible.

To use token-based authentication you will need to call the POST /admin-api/account/v1/login API. You can find this API in the API UI, on the Admin API tab, in the user-account-management section. The token is returned in the x-security-token HTTP response header and should be supplied with each subsequent request in the x-security-token HTTP header. Tokens expire, typically after two weeks, although they may expire sooner. It is best to account for this by re-fetching a token any time the server returns a 401.
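The re-fetch-on-401 behaviour could be sketched as below. The class and its two functional parameters are hypothetical. The login supplier stands in for a call to POST /admin-api/account/v1/login that returns the value of the x-security-token response header (how credentials are passed to that call is not shown here). The call function stands in for any Push API request that sends the token and returns the HTTP status code.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical wrapper illustrating token reuse with a single retry on 401.
final class TokenRetryExample {

    private final Supplier<String> login; // assumed: performs the login call and returns the token
    private String token;                 // cached token, reused across requests

    TokenRetryExample(Supplier<String> login) {
        this.login = login;
    }

    /** Runs a call with the cached token; on a 401, fetches a new token once and retries. */
    int execute(Function<String, Integer> call) {
        if (token == null) {
            token = login.get();
        }
        int status = call.apply(token);
        if (status == 401) {
            token = login.get();
            status = call.apply(token);
        }
        return status;
    }

    public static void main(String[] args) {
        // Simulated server: only the token "fresh" is accepted.
        String[] tokens = {"expired", "fresh"};
        int[] next = {0};
        TokenRetryExample session = new TokenRetryExample(() -> tokens[next[0]++]);
        int status = session.execute(t -> t.equals("fresh") ? 200 : 401);
        System.out.println("Final status: " + status);
    }
}
```

Retrying only once keeps a genuinely invalid username or password from causing an endless login loop.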

Workflow

When changes are made to a push collection, such as PUT-ing or DELETE-ing a document or PUT-ing a redirect, they are first stored in a staging area. These changes are not made live until they are committed.

Commits can be triggered through the API, or automatically after a timeout or after a number of changes have been made to the push collection.

Internally push collections will need to merge generations created from commits. This is required to stop performance from degrading. Push will automatically trigger merges as required. Merging on large collections can be costly and may impact query performance if run on a machine with insufficient CPUs or memory.

Index configuration changes require a Vacuum

Re-indexing a push collection is required when changing index configuration settings, such as indexer_options, GScopes, metadata mappings via metamap.cfg or xml.cfg, etc.

To re-index a collection, use the Vacuum API call and set vacuum-type to RE_INDEX. This can be a long operation on large collections as the complete content will be re-indexed.

Keys

Push expects all keys to be valid URIs, and expects that all keys have a scheme name such as http, ftp, local, etc. Push will also canonicalise keys - you can check the returned JSON for the key used by Push. In general, Push will:

  • add a '/' to the end of a URI that does not have a path, e.g. http://funnelback.com would become http://funnelback.com/
  • remove fragments from the URI. e.g. http://funnelback.com/#search would become http://funnelback.com/
  • flatten out the paths e.g. /s/../ would become /
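To anticipate the key Push is likely to store, a client could approximate the three rules above with java.net.URI. This is a minimal sketch for http-style keys under a hypothetical class name, not Push's actual canonicalisation code, so the key returned in the JSON response remains the authority.

```java
import java.net.URI;

// Hypothetical client-side approximation of the rules listed above.
final class PushKeyExample {

    static String canonicalise(String key) {
        // normalize() flattens "." and ".." path segments.
        URI uri = URI.create(key).normalize();
        if (uri.getScheme() == null || uri.isOpaque()) {
            throw new IllegalArgumentException(
                    "Key must be a hierarchical URI with a scheme: " + key);
        }
        // A URI with no path gets "/" as its path.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        StringBuilder out = new StringBuilder(uri.getScheme()).append(':');
        if (uri.getRawAuthority() != null) {
            out.append("//").append(uri.getRawAuthority());
        }
        out.append(path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // The fragment is deliberately not appended.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalise("http://funnelback.com"));
        System.out.println(canonicalise("http://funnelback.com/#search"));
        System.out.println(canonicalise("http://funnelback.com/s/../"));
    }
}
```

For the three examples in the list above, this sketch returns http://funnelback.com/ in each case.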

Metadata

Extra metadata may be added to a document when it is PUT into Push. For the end-point shown above, Push uses HTTP headers that start with X-Funnelback-Push-Metadata-. The rest of the header name is used as the name of the metadata, and the header value is added to the document's metadata. For example, setting the following HTTP header:

 X-Funnelback-Push-Metadata-author: William Shakespeare

would set the 'author' metadata to 'William Shakespeare'. You will need to use metamap.cfg to access the metadata from the indexer and query processor. Because HTTP header names are case-insensitive in the HTTP specification but not in the Java Servlet specification, metadata keys may be converted to lower-case in some environments. If case is important, use the multi-part end-point instead: /v2/collections/{collection}/documents/content-and-metadata.

The multi-part end-point should also be used when the metadata contains something HTTP headers cannot carry, such as:

  • Non-ASCII characters
  • Metadata value containing line breaks
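A client might decide between header-based metadata and the multi-part end-point along the following lines. The class and method names are hypothetical, and the printable-ASCII check is a conservative reading of the two limitations listed above rather than a rule taken from Push itself.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper that attaches metadata as X-Funnelback-Push-Metadata-* headers.
final class PushMetadataHeaderExample {

    static final String PREFIX = "X-Funnelback-Push-Metadata-";

    /** True when the value is printable ASCII only, so it has no line breaks or non-ASCII characters. */
    static boolean fitsInHeader(String value) {
        return value.chars().allMatch(c -> c >= 0x20 && c <= 0x7E);
    }

    /** Builds (but does not send) a PUT request with one header per metadata entry. */
    static HttpRequest putWithMetadata(URI documentUri, String content, Map<String, String> metadata) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(documentUri)
                .PUT(HttpRequest.BodyPublishers.ofString(content, StandardCharsets.UTF_8));
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (!fitsInHeader(entry.getValue())) {
                // Such values should go through the multi-part end-point instead.
                throw new IllegalArgumentException(
                        "Metadata '" + entry.getKey() + "' cannot be sent as an HTTP header");
            }
            builder.header(PREFIX + entry.getKey(), entry.getValue());
        }
        return builder.build();
    }
}
```

Calling putWithMetadata with Map.of("author", "William Shakespeare") produces the header shown earlier in this section.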

A GET request for a document will return the metadata that was set using the HTTP headers, in the metadata part of the returned JSON. The metadata part of the returned JSON will not contain metadata that the indexer has extracted from the document or that was added via external metadata.

Time stamp Metadata

Push (since version 15.6) will set the metadata X-Funnelback-Push-Received-Time to the time the document was PUT into Push. The date is a 19 character UTC time in the form: yyyyMMddHHmmss.SSSZ.
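If you need to read this value in client code, it could be parsed as sketched below. The sketch assumes the trailing Z in the pattern is a literal letter Z marking UTC, which is consistent with the stated 19-character length but is not confirmed here. The class name is hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical parser for the X-Funnelback-Push-Received-Time value.
final class PushReceivedTimeExample {

    // Assumption: 'Z' is a literal character, giving 14 + 1 + 3 + 1 = 19 characters.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss.SSS'Z'");

    /** Parses a received-time value and interprets it as UTC. */
    static Instant parse(String value) {
        return LocalDateTime.parse(value, FORMAT).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // Made-up sample value in the documented form.
        System.out.println(parse("20150102030405.678Z"));
    }
}
```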

Anchor text & Click logs

Anchor text and click logs are a good source of evidence that Funnelback can use to improve search results. To take advantage of these sources we need to add

-anniemode=3

to the 'Query processor options': go to 'Edit Collection Settings', then under the 'Interface' tab append the above option to 'Query processor options'. By default this option is on; however, it will need to be set explicitly when the Push collection is part of a meta collection.

Click logs need to be processed before they can influence ranking. By default, click logs are processed every hour. You may also trigger processing of click logs from the RESTful API:

POST /push-api/v1/collections/{collection}/click-logs

See the Swagger UI for more details.

Resource requirements

Push's memory requirements are related to the number and size of the Keys in the largest Push collection on the Funnelback server. In general, the sum of the lengths of all Keys in the Push collection is the minimum amount of memory, in bytes, required for Push to be able to run that collection. Push runs under Jetty, so it competes for memory with other parts of Funnelback such as the Modern UI. You should take this into consideration when setting up Funnelback, especially when planning to work with collections of more than 1 million URLs.
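As a worked example of this rule of thumb, 1 million keys averaging 100 characters each would need at least 100,000,000 bytes (roughly 100 MB) for the keys alone. The hypothetical helper below computes the same lower bound from a list of keys. It counts UTF-8 bytes, which is an assumption, although for plain ASCII URLs this equals the character count.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical estimator for the key-related memory lower bound described above.
final class PushKeyMemoryExample {

    /** Sums the encoded length of every key, in bytes. */
    static long minimumBytes(Iterable<String> keys) {
        long total = 0;
        for (String key : keys) {
            total += key.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("http://example.com/a", "http://example.com/b");
        System.out.println(minimumBytes(keys) + " bytes");
    }
}
```

This is only the floor implied by the keys. Index memory, discussed next, comes on top of it.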

On top of Push's memory requirements you will need to allocate memory for search indexes. Push's indexes are not as efficient as other collections' indexes and can require up to twice as much memory. Setting a lower value for push.scheduler.killed-percentage-for-reindex can reduce the memory overhead; however, it will increase the CPU requirements.

Push is able to take advantage of multiple cores. For example, it may be serving multiple API requests while committing and also merging. Although Push can run with a single core, it will work better with more. If searches are being performed on the same machine on which Push is indexing and merging, you should provide at least 2 cores, preferably 4 or more depending on load. You can set the push.worker-thread-count option in global.cfg to change the number of threads Push will use for merges and commits.

Backing up

Push collections cannot be backed up just by copying the files within the data folder of the Push collection. If you do this you will likely end up with a corrupt backup. To back up the collection, first make a snapshot of the push collection. A snapshot can be made via the RESTful API (see the Swagger UI for more details).

PUT /push-api/v1/collections/{collection}/snapshot/{name}

The snapshot will usually be created under:

$SEARCH_HOME/data/{collection}/snapshot/{name}

or, on Windows:

%SEARCH_HOME%\data\{collection}\snapshot\{name}

The snapshot is created with hard links, so creating a snapshot is fast; however, the snapshot must not be edited until it has been copied.

Restoring a backup

If your push collection has failed or your disk has failed, you can easily restore a push collection from a backup by following these steps:

  • If the collection does not exist yet, then recreate the collection with exactly the same name.
  • Confirm that the push collection's state is STOPPED by calling the RESTful API with
GET /push-api/v1/collections/{collection}/state
  • If the collection is not stopped, you must change the state of the collection to STOPPED by calling the RESTful API with
POST /push-api/v1/collections/{collection}/state/stop
  • Now that the collection has stopped, copy all files within your backup over the top of the existing live files, for example:
cp -rv $SEARCH_HOME/data/{collection}/snapshot/{name}/* $SEARCH_HOME/data/{collection}/live/
  • The push collection can now be used and searched.
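Where cp is not available (for example on Windows), the copy step above could be approximated in Java as follows. This is a sketch under a hypothetical class name. It assumes the collection is already STOPPED and that the two paths point at the snapshot and live directories described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical equivalent of "cp -r snapshot/* live/" for restoring a backup.
final class PushRestoreExample {

    /** Recursively copies every file under snapshot into live, replacing existing files. */
    static void copyOver(Path snapshot, Path live) throws IOException {
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(snapshot)) {
            // Files.walk visits each directory before its contents.
            sources = walk.collect(Collectors.toList());
        }
        for (Path source : sources) {
            Path target = live.resolve(snapshot.relativize(source).toString());
            if (Files.isDirectory(source)) {
                Files.createDirectories(target);
            } else {
                Files.createDirectories(target.getParent());
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PushRestoreExample <snapshot dir> <live dir>
        copyOver(Path.of(args[0]), Path.of(args[1]));
    }
}
```

Like the cp command, this leaves files in live that are not present in the snapshot untouched.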

Multiple query processors support

Push collections can be replicated to multiple query processors. To set this up, see Push multiple query processors.

Limits

Push collections do not impose any specific limit on the size of each individual document; however, some subsystems do, and practical limits apply in a number of areas. In practice, the size of documents added to a push collection via the Push API will be limited by the maximum document size accepted by the Jetty web server. The default value is currently set at 50MB and can be altered in the $SEARCH_HOME/web/conf/contexts-https/funnelback-push-api.xml file. Please note that any changes to this file will be overwritten when upgrading between Funnelback versions.

URLs of up to 2000 characters are supported. URLs longer than 2000 characters are likely to cause problems in client or proxy systems and should be avoided.

Push does not support the docurl or document xml.cfg directives.

Security

A user will require access to the collection as well as sec.continuous.rest to be able to use the Push API.

Included Java API client

Funnelback comes with a Java API client. To use it you will need the following JARs, available under lib/java/all/ in the Funnelback installation:

  • funnelback-push-api-client.jar
  • funnelback-api-client-core.jar

Here is an example of using the Push API client:

import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import com.funnelback.api.core.*;
import com.funnelback.push.client.PushContentV2Client;
import com.funnelback.push.client.responses.v2.AddedDocumentsResponseV2;


public class ApiExample {

    public void exampleAddAndDelete() throws IOException, APIException {
        URL url = new URL("https://<server domain>:<admin port>/push-api/");
        String userName = "user";
        String password = "complexPassword";
        String collectionId = "push-collection";
        
        Rest rest = new DefaultRest(userName, password, url);
        //Uncomment if you do not have a valid SSL certificate
        //new APIUtils().trustAllCerts(rest);
        
        PushContentV2Client client = new PushContentV2Client(rest);
        
        //public APIResponse<AddedDocumentsResponseV2> add(String collectionId, String key, byte[] content, 
        //Map<String, Collection<String>> metadata, String contentType, String filterChain) throws IOException, APIException{
        
        String key = "http://example.com/";
        
        //Content must be converted to bytes, ensure you use the correct charset, if possible always convert to
        //UTF-8
        byte[] content = "<html><p>Hello</p></html>".getBytes(StandardCharsets.UTF_8);
        
        //Set some metadata for the document
        Map<String, Collection<String>> metadata = new HashMap<>();
        metadata.put("authors", Arrays.asList("Barry", "Bob"));
        
        String contentType = "text/html; charset=UTF-8";
        
        //Filters can be run on the given document, this value should be set in the same
        //way as filter.classes in collection.cfg.
        String filterChain = "";
        
        //Add the document.
        APIResponse<AddedDocumentsResponseV2> response = client.add(collectionId, key, content, metadata, contentType, filterChain);
        System.out.println(response.getResponseBody().getMessage());
        System.out.println("Stored keys: " + response.getResponseBody().getstoredKeys());
        
        //Commit the changes, set true to wait for the commit to complete.
        client.commit(collectionId, true);
        
        //Now delete the document.
        client.delete(collectionId, key);
    }
}

The API client will use token-based authentication, and will re-fetch the token, using the given username and password, when it receives a 401. If the username and password change you will need to re-create the API client objects.

Troubleshooting

Log files

Push will log all errors for all collections into the push log file located at:

$SEARCH_HOME/web/log/push.log
%SEARCH_HOME%\web\log\push.log

Push will log the output of index pipeline steps, for example padre-iw, under:

$SEARCH_HOME/data/collection_name/live/log/generation_number-/
%SEARCH_HOME%\data\collection_name\live\log\generation_number-\

where generation_number is the generation number of the generation being created.

Merge failures

Push has a limit on the number of generations, and so constantly merges multiple generations into a single new generation, allowing new generations to be committed. If Push refuses to commit any more generations, it is possible that merging has failed. This can occur because incorrect options or badly formatted files have been supplied to the indexer or other binaries in the index pipeline. When this happens you will be able to see merge errors in the push log file. Because the contents of a Push collection are constantly changing, a merge error may only happen sporadically. To make it possible to debug these issues, you may set the following option in collection.cfg:

push.create-snapshot-on-merge-failure=true

This will create a snapshot of the push collection if a merge fails. The snapshot will appear in the snapshot directory, like other snapshots, and will be named:

FunnelbackInternal-merge-failed-generation_number

The snapshot can be restored to live, on another machine, for debugging.

File issues on Windows

On Windows, if a file is open it cannot be replaced. This can be an issue when an external process opens files under a Push collection's live view, likely causing commit or merge failures. Examples of programs which might cause this problem are anti-virus software and the Windows search service. Funnelback administrators should ensure that no external program is reading files created by Push under the live directory of a Push collection.

If problems arise and you are unable to determine which program is opening files, Push includes support for running Handle when an issue occurs, which can report on which process has a file open. To enable this you will need to download Handle.exe and execute Handle as the user Funnelback is running as. Typically this can be done using PsExec with something similar to the following command line:

PsExec.exe -i -s Handle.exe

By doing this you will be able to read and accept Handle's user license, a step which would otherwise cause the process to become stuck waiting for confirmation. It is recommended to run the above command twice to confirm that you are not asked to accept the license on the second attempt.

Once the license is accepted, this additional debugging information can be enabled by setting Handle.exe's location in executables.cfg:

handle=C:\foo\bar\Handle.exe

With this setting configured, push will execute Handle when it cannot replace a file, and include the information Handle produces within any exceptions which are logged or returned from API calls, which should help identify which program is opening a Push collection's live files.
