Collection security

Introduction

Funnelback has a number of mechanisms to secure access to collections and the data they contain. This is known as "collection-level security", as opposed to the more fine-grained "document-level security", which enforces restrictions on individual documents.

If your documents can generally be classified as internal or external, and you do not require more fine-grained control than that, then collection-level security is appropriate for your needs.

Internal documents are those for use within your organisation only (not for public consumption). To index internal documents while still offering a public index, it is necessary to build two indexes:

Internal index: covering all available documents, including internal documents.

External index: covering only documents available externally (not internal documents).

To build these two indexes (or more complex configurations), you will need techniques for access restriction and controlling search scope.

Access restriction

Domain name and IP address access restrictions can be placed on an internal collection by setting the access_restriction option via the administration interface. For example,

access_restriction=150.203.,203.108.55.
access_alternate=collectionname2

This means that only users whose IP addresses match 150.203.* or 203.108.55.* are allowed to search. All other users are forwarded to an alternate collection called collectionname2 on the local server.

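Restrictions can also be based on hostname. For instance, a collection can be restricted so that only users with *.csiro.au or *.anu.edu.au machines can search. If no alternate collection is configured, users who are not allowed to search are given an informative "access denied" message.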
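
As a sketch, a hostname-based restriction of this kind could be written as follows. The value format is an assumption modelled on the IP address example above. Because access_alternate is not set, rejected users would see the "access denied" message rather than being forwarded.

```
access_restriction=csiro.au,anu.edu.au
```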

To configure a collection to be unrestricted, the access_restriction option must be set to no_restriction. Without this setting, the collection will be unsearchable by all domains and IP addresses.

To block access to a collection entirely, the access_restriction option must be set to no_access. This has the effect of disabling the search for all users.

It is possible to set up many restricted collections, each based either on IP or hostname, each forwarding to the same default external index. Chaining is also possible: Very internal index -> Internal index -> External index.
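
The chain described above could be sketched as three collection configurations. The collection names (very-internal, internal, external) and the address prefixes are hypothetical, and the # comment lines only label which collection each fragment belongs to.

```
# collection.cfg for the hypothetical "very-internal" collection
access_restriction=150.203.12.
access_alternate=internal

# collection.cfg for the hypothetical "internal" collection
access_restriction=150.203.
access_alternate=external

# collection.cfg for the hypothetical "external" collection
access_restriction=no_restriction
```

With this arrangement, a user outside 150.203.12.* who queries the very-internal collection is passed to the internal collection, and from there to the external collection if their address does not match 150.203.* either.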

Controlling search scope

The internal and external indexes have different scopes (they index different document sets). This section describes three ways of controlling scope: rule-based crawl, proxy crawl and password crawl.

A rule-based crawl simply involves setting appropriate include_patterns and exclude_patterns for each collection. This approach is only recommended where the appearance of new internal-only areas is monitored, since the external index's rules should be reviewed each time such an area appears.
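
A minimal sketch of a rule-based pair of collections, assuming a hypothetical site www.example.edu whose internal-only material lives under /intranet/ and /staff/. The host, the paths and the comma-separated value format are illustrative assumptions, so check them against your own site layout and configuration.

```
# external collection: crawl the site but skip internal-only areas (hypothetical paths)
include_patterns=www.example.edu
exclude_patterns=/intranet/,/staff/

# internal collection: crawl everything on the site
include_patterns=www.example.edu
```

If a new internal-only area such as /hr/ appeared, it would have to be added to the external collection's exclude_patterns by hand, which is why this approach depends on monitoring.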

A proxy crawl is more foolproof than a rule-based approach, but relies on the availability of an external proxy. Consider a university as an example. A rule-based external index of the university may be difficult to maintain, since there are a large number of volatile servers and new internal-only documents may be published without warning.

Instead, the university could set up a proxy with a non-university address. The external index is then configured to crawl through this proxy (using the http_proxy options in the collection.cfg file), allowing an "external crawl" to be run from an internal machine.
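
A sketch of what the proxy settings for the external collection might look like. The proxy host and port are hypothetical, and http_proxy_port is assumed here to be one of the http_proxy options mentioned above.

```
# external collection, crawled from an internal machine through an outside proxy
# (host and port are placeholders)
http_proxy=proxy.example.net
http_proxy_port=8080
```

Because requests arrive from the proxy's non-university address, the crawler should only see what an outside visitor would see.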

A password crawl is different from the others. It is used in cases where the internal documents are protected by password rather than IP/hostname restriction. In that case, the external index can be a plain crawl of available server(s) and the internal index can have http_user and http_passwd options to get the internal documents.
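
For the password crawl, the internal collection's configuration might contain something like the following. The credentials are placeholders. The external collection would simply omit these two options.

```
# internal collection: credentials for the password-protected documents (placeholders)
http_user=crawler
http_passwd=changeme
```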
