Command Line Administration

The recommended method of administering a Funnelback installation is through the web-based administration interface, which provides an easy-to-use frontend for administrators. However, it is also possible to administer Funnelback from the command line, which is useful when other systems need to be integrated with Funnelback.

Locations

We assume in the following instructions that the $SEARCH_HOME environment variable is defined. This should point to your installation directory. By default, this is /opt/funnelback/ on Linux and C:\funnelback\ on Windows.
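
For example, on a Linux installation in the default location, the variable can be set in the shell before running any of the commands below (adjust the path if Funnelback was installed elsewhere):

   export SEARCH_HOME=/opt/funnelback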

$SEARCH_HOME/bin/

bin contains all administration scripts.

$SEARCH_HOME/conf/

conf contains various global configuration files, as well as collection specific configuration files, under $SEARCH_HOME/conf/<collection name>/.

$SEARCH_HOME/log/

log contains global log files, such as the create.log file, which records creation of collections and the delete.log file, which records deletion of collections.

$SEARCH_HOME/web/

web contains files relating to the admin console and public search interface (such as the cgi files). Web server configuration files are stored in $SEARCH_HOME/web/conf/.

$SEARCH_HOME/data/

data contains collection specific data, such as gathered documents, indexes and log files. The data area has the following structure:

  • Each collection will have a subdirectory under data, containing live, offline, log and archive directories.
  • The archive directory contains compressed query and click log files for the collection.
  • The log directory contains logs that don't fit into any other category, such as update logs and reporting logs.
  • The live and offline directories contain gathered and filtered documents (in "data"), indexes (in "idx") and collection specific logs (in "log").
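
Put together, the layout for a hypothetical collection named example looks like this:

   $SEARCH_HOME/data/example/
      archive/           compressed query and click logs
      log/               update and reporting logs
      live/
         data/           gathered and filtered documents
         idx/            indexes
         log/            collection specific logs
      offline/
         data/
         idx/
         log/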

Creating a collection

All collection configuration files are created from a collection template at $SEARCH_HOME/conf/collection.cfg.default.

All configuration information for a collection is stored in a directory at $SEARCH_HOME/conf/<collection name>/. This includes the main collection.cfg file.

To create a collection from the command line, administrators can create the collection configuration directory, copy the collection template to collection.cfg in this directory, edit the collection configuration and run create-collection.pl over the collection configuration.
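
A minimal sketch of these manual steps, assuming a hypothetical collection named example:

   mkdir $SEARCH_HOME/conf/example
   cp $SEARCH_HOME/conf/collection.cfg.default $SEARCH_HOME/conf/example/collection.cfg
   # edit $SEARCH_HOME/conf/example/collection.cfg as required, then:
   $SEARCH_HOME/bin/create-collection.pl $SEARCH_HOME/conf/example/collection.cfg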

A separate convenience script, new-collection.pl, is available and will create the configuration directory and collection configuration file automatically. An optional start URL or location can be passed to this script, as well as a type, allowing the creation of web, local, filecopy, database collections, etc.
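
For example, a web collection could be created in one step (the collection name and start URL here are illustrative):

   $SEARCH_HOME/bin/new-collection.pl example web http://www.example.com/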

The created collection configuration should still be manually checked and edited to change default configuration options. The following options are especially important to check:

  • collection
  • collection_root
  • exclude_patterns
  • include_patterns
  • service_name
  • start_url

Creating a meta collection

A meta collection is one which has no data or indexes of its own but instead points to a set of underlying collections. To create a meta collection, administrators can use the new-collection.pl script, specifying a "meta" collection type.
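
For example, to create a hypothetical meta collection named everything:

   $SEARCH_HOME/bin/new-collection.pl everything meta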

The administrator must then create a meta.cfg file in the appropriate location: $SEARCH_HOME/conf/<collection name>/meta.cfg. This file is used to list the sub-collections which make up the meta collection.

The format is to list the internal names of the sub-collections, one per line. For example, the file might look like:

   funnelback_website
   shakespeare

You also need to create an index.sdinfo file which lists the full path to the index stems for the subsidiary collections. This file should be placed in $SEARCH_HOME/data/<collection name>/live/idx/ and $SEARCH_HOME/data/<collection name>/offline/idx/, and will look something like:

   $SEARCH_HOME/data/funnelback_website/live/idx/index
   $SEARCH_HOME/data/shakespeare/live/idx/index
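
The same stem list is used by both views, so one approach is to write the file once and copy it into place (again assuming the hypothetical meta collection everything):

   cp index.sdinfo $SEARCH_HOME/data/everything/live/idx/
   cp index.sdinfo $SEARCH_HOME/data/everything/offline/idx/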

Once this is done, the meta collection will be as up to date as its component sub-collections. This means that you do not need to run the update script for a meta collection.

Updating a collection

To update a collection, use the update.pl script, redirecting the output status messages to an appropriately named update log, e.g. update-<collection name>.log:

   update.pl $SEARCH_HOME/conf/example/collection.cfg > $SEARCH_HOME/log/update-example.log 2>&1

Note that an update may take a significant amount of time, depending upon the update timeout, number of documents found and other factors.

During the update, messages will be logged to the appropriate logs in $SEARCH_HOME/data/<collection name>/offline/log/ and $SEARCH_HOME/data/<collection name>/log/.

Lock files

To prevent multiple simultaneous updates of the same collection, update.pl will create a lock file at the start of an update. This lock file will be placed at $SEARCH_HOME/data/<collection name>/log/<collection name>.lock. A collection update will not occur unless update.pl can create and gain exclusive access to this lock file. The lock file is removed at the end of a successful update or if an error occurs during the update.
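
If an update process is killed outright (for example, by a reboot mid-update), the lock file may be left behind and block subsequent updates. Once you have confirmed that no update is actually running, it can be inspected and removed by hand; the paths below assume the hypothetical collection example:

   ls -l $SEARCH_HOME/data/example/log/example.lock
   rm $SEARCH_HOME/data/example/log/example.lock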

State files

The various update scripts will also write to a state file at $SEARCH_HOME/data/<collection name>/log/<collection name>.state. This state file will contain text indicating the state of the relevant collection:

  • normal
  • deleting
  • updating
  • gathering
  • crawling
  • stopping_crawl
  • halting_crawl

An additional collection.state file is written to the $SEARCH_HOME/conf/<collection name>/ directory for web collections. This file contains the following parameter:

  • incremental_gathers_remaining

which stores the number of incremental gathers that will be done before a full gather is triggered. The value is decremented each time an incremental crawl is done and will be reset to the value of schedule.incremental_crawl_ratio when it reaches zero.
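
Both files are plain text, so the current state of a collection can be checked directly from the shell (hypothetical collection example):

   cat $SEARCH_HOME/data/example/log/example.state
   cat $SEARCH_HOME/conf/example/collection.state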

Deleting a collection

Administrators may fully delete a collection using the delete-collection.pl script. This script will delete all data and configuration associated with the collection:

  • gathered documents
  • indexes
  • configuration files
  • scheduled updates
  • logs

User configuration files are also edited to remove references to the deleted collection.
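
For example, to delete the hypothetical collection example:

   $SEARCH_HOME/bin/delete-collection.pl $SEARCH_HOME/conf/example/collection.cfg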

Command line scripts reference

Detailed internal documentation for many of these scripts can be viewed with the standard Perl perldoc command.
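
For example:

   perldoc $SEARCH_HOME/bin/update.pl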

  • **new-collection.pl** creates a collection, including its collection.cfg file.
   new-collection.pl <collection name> <collection type> [start url]
  • **create-collection.pl** creates a collection from an already existing collection.cfg file.
   create-collection.pl <collection config>
  • **delete-collection.pl** deletes a collection, including its gathered documents, indexes, configuration, scheduled updates and logs. It also removes references to the now non-existent collection from user configuration files.
   delete-collection.pl <collection config>
  • **update.pl** is a wrapper around the entire update process, and calls the appropriate update sub-scripts.
   update.pl <collection config> [update type: -incremental, -reindex, …]
  • **crawl.pl** gathers documents for web collections.
   crawl.pl <collection config> [update type: -check, -incremental, -instant-update]
  • **filecopy.pl** gathers documents for filecopy collections.
   filecopy.pl <collection config> [other options]
  • **dbgather.pl** gathers documents for database collections.
   dbgather.pl <collection config> [--full] [other options]
  • **index.pl** calls Padre to index a collection's documents.
   index.pl <collection config> [-reindex] [-instant-update]
  • **make_report.pl** processes a collection's data files, producing reports on their contents.
   make_report.pl <--collection "collection config"> [--log] …
   outliers-log-processing.pl [--collection "collection name"]
  • **swap-views.pl** swaps the live and offline views of a collection after a successful update, placing the newly gathered and indexed data in live for querying, and safely storing the older gathered and indexed data in offline.
   swap-views.pl <collection config> [-force]
  • **archive-log.pl** archives a collection's queries.log and clicks.log files to the collection's archive directory.
   archive-log.pl <collection config> [view]
  • **reports-load-queries-log.pl** reads a collection's log files and stores a binary database for reporting purposes. The admin UI report frontend reads this database when displaying reports.
   reports-load-queries-log.pl <--collection "collection internal name"> [-v] [-v] [-v] [-v]
  • **reports-send-email.pl** sends email query reports to users who have requested them for the specified collection (or for all collections if none is specified).
   reports-send-email.pl [--collection "collection name"]
  • **modify_perl_hashbang_line.pl** updates the interpreter (#!) line of the installed Perl scripts.
   modify_perl_hashbang_line.pl
  • **mediator.pl** triggers local or remote administrative tasks.
   mediator.pl --help
  • **change_password.sh** changes an administration user's password.
   $SEARCH_HOME/web/bin/change_password.sh <user> <password>
