PaDRE (Parallel Document Retrieval Engine)

Introduction

PADRE is the Funnelback "engine." It indexes the documents in a collection and queries those indexes. PADRE is in fact several programs; the main ones are:

padre-cw

is the index checker program, a utility for validating indexes.

padre-di

is the program for showing stored document information, including metadata.

padre-fl

is the program for adjusting flag bits. See Document Flags.

padre-iw

is the indexing program. It reads the text files gathered during an update and creates the search indexes.

padre-qs

is the program used for query auto completion.

padre-sr

is the program for displaying the contents of the results file.

padre-sw

is the search program. It parses the CGI parameters and executes the user's query, returning XML results.

Controlling indexable content

By default, PADRE will index all of the content within each document given to it. Finer control is provided through several mechanisms to exclude certain content from being indexed. These are:

Control the content that is gathered

For more information, see the relevant section of the documentation for your particular collection type (e.g. Web crawler exclude patterns).

Exclusion of sections within pages

Sections of a page that are surrounded by special HTML comments are ignored by the indexer. These tags can be inserted automatically based on a regular expression, which is particularly useful for excluding common navigation elements, headers and footers; see Noindex expression. Alternatively, you can insert them into your documents yourself. Note that these tags only apply within the body of the HTML document. For example:

... This section is indexed ...

<!--noindex-->

... This section is not indexed ...

<!--endnoindex-->

... This section is indexed ...

Note that the 'noindex' tags used in previous versions of Funnelback (beginnoindex, start_indexing and stop_indexing) still operate correctly.
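
To preview which parts of a page remain indexable, the effect of the noindex comments can be approximated outside of Funnelback. The following is a minimal Python sketch, not PADRE's own implementation: the function name strip_noindex and the sample page are hypothetical, and the sketch only recognises the exact <!--noindex--> and <!--endnoindex--> comments shown above, not the older tag names.

```python
import re

# Matches one <!--noindex--> ... <!--endnoindex--> region, including newlines.
# The non-greedy ".*?" stops at the first closing comment, so separate
# regions on the same page are removed independently.
NOINDEX_REGION = re.compile(r"<!--noindex-->.*?<!--endnoindex-->", re.DOTALL)


def strip_noindex(html: str) -> str:
    """Return the HTML with every noindex region removed (hypothetical helper)."""
    return NOINDEX_REGION.sub("", html)


# Hypothetical sample page: the navigation block is excluded from indexing.
page = (
    "<p>Indexed intro</p>"
    "<!--noindex--><nav>Site menu</nav><!--endnoindex-->"
    "<p>Indexed body</p>"
)

print(strip_noindex(page))
```

Running a template through a check like this can help confirm that navigation, headers and footers fall inside the excluded regions before an update is run.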

Exclusion of whole pages

Inserting a special meta tag into a page instructs the indexer not to index it. The meta tag is:

<meta name="robots" content="noindex">
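
When auditing a site, it can be useful to check which pages carry this meta tag. The sketch below uses Python's standard html.parser module to look for it; the class and function names are hypothetical, and the matching rules (case-insensitive, comma-separated directives) are assumptions about typical robots meta tags rather than a description of how PADRE parses them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots meta tag containing 'noindex' was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # Assumption: directives are comma-separated, e.g. "noindex, nofollow".
        content = attr.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the page asks not to be indexed (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


print(has_noindex('<head><meta name="robots" content="noindex"></head>'))
```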

Index age

You can tell how old an index is by running a search and checking the collectionUpdated element in the XML results (search.xml):

<collectionUpdated>Tue Jul 31 11:24:47 2012</collectionUpdated>

The same information is also available in the index_time file in:

$SEARCH_HOME/data/<collection>/live/idx/
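
For monitoring purposes, the collectionUpdated value can be turned into an index age. The following Python sketch assumes the timestamp always uses the format shown in the example above and that English day and month names are in effect; the enclosing results element in the sample is hypothetical, since only the collectionUpdated element itself is taken from this page. The timestamp carries no time zone, so the result is only meaningful when compared against the local time of the server that built the index.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Format of the example value "Tue Jul 31 11:24:47 2012" (assumed to be stable).
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def index_updated(xml_text: str) -> datetime:
    """Extract the collectionUpdated time from search XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag == "collectionUpdated":
        node = root
    else:
        node = root.find(".//collectionUpdated")
    if node is None or not node.text:
        raise ValueError("no collectionUpdated element found")
    return datetime.strptime(node.text.strip(), TIMESTAMP_FORMAT)


# Hypothetical wrapper element around the value documented above.
sample = (
    "<results>"
    "<collectionUpdated>Tue Jul 31 11:24:47 2012</collectionUpdated>"
    "</results>"
)

updated = index_updated(sample)
age = datetime.now() - updated  # naive local time, see the note above
print(f"Index built {updated.isoformat()}, about {age.days} days ago")
```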
