Gscopes

Introduction

In some applications, it is useful to narrow down a search to particular sub-parts of a collection rather than searching over the entire collection. 'General scopes' (or 'gscopes') are the mechanism used by Funnelback for performing this task. The Faceted Navigation system makes use of the gscopes system in its operation.

Put simply, each document in a collection can be assigned a gscope 'number', such as '1' or '2'. Searches can then be restricted to the documents that have a particular gscope number, such as 1.

For example, imagine a company website that had two major sections, a company news section and a careers section. By setting all the documents in the news section to have gscope '1', and all the documents in the careers section to have gscope '2', you could enable (along with suitable UI customisation) search over only the news section, or only the careers section.

The gscopes system is designed to be flexible in order to support a variety of use cases. Documents can be given multiple gscope numbers. For example, one document could be given the gscopes '1', '5', '6' and '43'. Additionally, a search can be restricted to arbitrary boolean combinations of gscope numbers. For example, you can instruct the search engine to restrict results to those documents that have gscope '4', OR have both gscopes '12' AND '23', as long as they do NOT have gscope '7'.

Configuring gscopes

To use the gscopes system, you must set up a gscopes definition file.

The gscopes definition file can be created from the administration interface: select the collection you wish to use gscopes on, select the 'Administer' tab, click 'Browse Collection Configuration Files', and then use the drop-down box on the configuration files screen to create a file called gscopes.cfg or query-gscopes.cfg.

Gscopes can be set either by URL patterns, using gscopes.cfg, or by query expressions, using query-gscopes.cfg.

Gscopes are applied automatically during the indexing process. You may also specify gscopes options by setting gscopes.options in collection.cfg.

Gscopes are implemented by allocating a certain additional amount of space within the index for each document. Each document is given one bit for each possible gscope number. This means that you must decide what your maximum gscope number is going to be beforehand. The default number of gscopes available is 64. If this is not sufficient, then the -GSB indexer option must be set. The -GSB option sets the number of bytes (not bits) that will be allocated for gscope information. The default setting of 64 gscope numbers is therefore equivalent to setting the indexer option -GSB 8. In other words, for the indexer option -GSB n, there will be 8 * n gscope numbers available.
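
The arithmetic above can be checked with a short Python sketch. It is illustrative only: the function name is hypothetical, and it assumes you simply want the smallest whole number of bytes that provides at least the required count of gscope numbers.

```python
def gsb_bytes_needed(gscope_count: int) -> int:
    """Smallest -GSB value (in bytes) that provides at least gscope_count gscope numbers."""
    if gscope_count < 1:
        raise ValueError("gscope_count must be at least 1")
    # One bit per gscope number, eight bits per byte, rounded up.
    return (gscope_count + 7) // 8


print(gsb_bytes_needed(64))   # 8, matching the default
print(gsb_bytes_needed(100))  # 13, which provides 8 * 13 = 104 gscope numbers
```

Because space is allocated in whole bytes, asking for 100 gscope numbers actually makes 104 available.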

Command line usage

URL pattern gscopes can be applied manually by running the following commands:

Linux:

 /opt/funnelback/bin/padre-gs /opt/funnelback/data/web/offline/idx/index /opt/funnelback/conf/web/gscopes.cfg

Windows:

 c:\funnelback\bin\padre-gs.exe c:\funnelback\data\db\offline\idx\index c:\funnelback\conf\db\gscopes.cfg

Push collections

Changed gscopes are not automatically applied to all generations in a Push collection. They are applied only to newly committed generations and to merged generations. To re-apply gscopes to all generations, you will need to trigger a Vacuum.

Gscopes and Meta Collections

There are a number of points to note when configuring gscopes in the context of meta collections:

  1. Each component collection in the meta collection must have the same number of gscope bits configured.
  2. The meaning of the gscope numbers should be consistent across the components for the results to make sense.

One way of achieving this is to have a single gscopes.cfg file in the meta collection's configuration directory and then have the post_index_commands in the component collection.cfg files refer to this file.

Searching with gscopes

To narrow down a search to a particular gscope, the appropriate query processor option must be set. This can either be done via the collection configuration (which will affect every search), or with a CGI parameter directly at search time (which will only affect one search).

To specify the query processor options in the collection.cfg use:

-gscope1=<gscope expression>

where <gscope expression> is either:

  • a single gscope number e.g. 2
  • a reverse Polish gscope expression (see below) e.g. 1,2|

To use the CGI parameter add the following to your request URL:

&gscope1=<gscope expression>

where <gscope expression> is defined in the same way as above.
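
When the request URL is built programmatically, the gscope expression should be percent-encoded, because characters such as '+' have a special meaning in a query string (an unencoded '+' is normally decoded as a space). The following Python sketch shows one way to do this. The host name, the path and the 'collection' and 'query' parameters are assumptions made for illustration; only the gscope1 parameter comes from the description above.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint, for illustration only.
BASE_URL = "https://search.example.com/s/search.html"


def search_url(collection: str, query: str, gscope_expr: str) -> str:
    """Build a search URL that restricts results with a gscope expression."""
    params = {
        "collection": collection,  # assumed parameter name
        "query": query,            # assumed parameter name
        "gscope1": gscope_expr,
    }
    # urlencode percent-encodes ',', '+', '|' and '!' in the expression.
    return BASE_URL + "?" + urlencode(params)


print(search_url("web", "annual report", "1,2+"))
```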

Gscope expressions

The gscope expressions used are reverse Polish expressions. This means that all operands to a logical operation (such as AND, OR, NOT) precede the operator itself. This notation avoids ambiguity and removes the need for brackets around complex logical expressions. However, it can look quite odd to those unfamiliar with it. In Funnelback, '+' represents the AND operation, '|' represents the OR operation and '!' represents the NOT operation. The best way to understand reverse Polish expressions is with some examples:

Expression     Description
3              Matches documents that have gscope number 3 set.
1,2+           Matches documents that have BOTH gscopes 1 and 2 set.
56,4|          Matches documents that have gscope 56 OR 4 set.
3!             Matches documents that do not have gscope 3 set.
1,2,3,4|||     Matches documents that have ANY of the gscopes 1, 2, 3, 4.
1,2,3,4+++     Matches documents that have ALL of the gscopes 1, 2, 3, 4.
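
The 'ANY of' and 'ALL of' patterns follow a simple rule: n gscope numbers are followed by n - 1 copies of the operator. A minimal Python sketch of that rule is shown here; the function names are hypothetical and are not part of Funnelback.

```python
def combine(gscopes, operator):
    """Join gscope numbers with a repeated binary operator in reverse Polish form."""
    if operator not in ("+", "|"):
        raise ValueError("operator must be '+' (AND) or '|' (OR)")
    numbers = [str(int(g)) for g in gscopes]
    if not numbers:
        raise ValueError("at least one gscope number is required")
    # n operands need n - 1 binary operators to reduce to a single result.
    return ",".join(numbers) + operator * (len(numbers) - 1)


def any_of(gscopes):
    """Expression matching documents that have ANY of the given gscopes."""
    return combine(gscopes, "|")


def all_of(gscopes):
    """Expression matching documents that have ALL of the given gscopes."""
    return combine(gscopes, "+")


print(any_of([1, 2, 3, 4]))  # 1,2,3,4|||
print(all_of([1, 2, 3, 4]))  # 1,2,3,4+++
print(any_of([56, 4]))       # 56,4|
```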

For more complex expressions than this, it is important to understand that the expression works as a stack. Reading from left to right, operands (gscope numbers) are pushed onto the stack, while operators (e.g. !, +, |) take off one or two numbers from the stack (one for !, two for + or |) to operate on. To help explain this, here are some further examples:

Expression     Description
3,4!+          Matches documents that have gscope 3, but not 4.
1,2,3,4|++     Matches documents that have gscopes 1 and 2, and one or both of 3 and 4.
12,23+4|7!+    Matches documents that have gscope '4', OR have both gscopes '12' AND '23', as long as they do NOT have gscope '7'.
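
To make the stack behaviour concrete, the following Python sketch evaluates a gscope expression against the set of gscope numbers assigned to a single document. It is an illustration of the reverse Polish evaluation described above, not Funnelback's own implementation: the function name is hypothetical, and for simplicity the sketch pushes a true/false value (whether the document has that gscope) rather than the number itself.

```python
import re

# A token is either a run of digits (a gscope number) or a single operator.
TOKEN = re.compile(r"\d+|[+|!]")


def matches(expression: str, doc_gscopes: set) -> bool:
    """Return True if a document with doc_gscopes satisfies the expression."""
    if not re.fullmatch(r"[\d,+|!]+", expression):
        raise ValueError("unexpected character in expression")
    stack = []
    for token in TOKEN.findall(expression):
        if token == "!":
            # NOT takes one value off the stack.
            if not stack:
                raise ValueError("'!' needs one operand")
            stack.append(not stack.pop())
        elif token in ("+", "|"):
            # AND and OR take two values off the stack.
            if len(stack) < 2:
                raise ValueError(f"'{token}' needs two operands")
            right = stack.pop()
            left = stack.pop()
            stack.append((left and right) if token == "+" else (left or right))
        else:
            # Operand: does the document have this gscope number?
            stack.append(int(token) in doc_gscopes)
    if len(stack) != 1:
        raise ValueError("expression does not reduce to a single result")
    return stack[0]


print(matches("12,23+4|7!+", {4}))       # True: has 4 and not 7
print(matches("12,23+4|7!+", {12, 23}))  # True: has both 12 and 23, not 7
print(matches("12,23+4|7!+", {4, 7}))    # False: has 7
```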
