
Filecopy collections

Introduction


A filecopy collection indexes documents from a file share or a local disk. It is built from a copy of the documents in a local or remote filesystem directory. To index text-only content (no binary formats such as .DOC or .PDF), a local collection can be used instead.

An update copies new or changed files from the source folder into the collection's offline data directory, and the update then proceeds as normal: binary documents are converted to text, the text content is indexed, and the offline view is swapped with the live view.

A filecopy collection is defined by the properties described in the sections below.

Supported directories

Funnelback supports indexing the following types of directory:

Local directories

These are located on the search server and are addressed as local paths.

Windows file shares

These are file shares served using the SMB or CIFS protocols, as is standard in most Windows environments. They can be addressed as UNC paths.

How the data source is specified depends on where the data is located. For example, a filecopy collection might have:

  • For a local disk: filecopy.source=/var/documents/shared/
  • For a Windows file share: filecopy.source=\\fileserver\documents\ or filecopy.source=smb://fileserver/documents/
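
As a sketch of how a share that requires credentials might be set up, the fragment below combines the source with the filecopy.domain, filecopy.user and filecopy.passwd options listed under Configuration options. The server, share, domain, account and password values are placeholders rather than values from this documentation, and the `#` comment lines are annotations for the reader that assume the configuration file accepts comments.

```
# collection.cfg (illustrative values only)
# Source addressed as an smb:// URL, as in the example above
filecopy.source=smb://fileserver/documents/
# Hypothetical service account used to read the share
filecopy.domain=EXAMPLE
filecopy.user=svc-search
filecopy.passwd=change-me
```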

Note that on Linux operating systems, the default firewall rules may need to be altered to allow for SMB / CIFS name resolution.

Red Hat Linux provides instructions for mounting NFS file shares and also comes with SMB/CIFS support.

File shares mounted on a Windows machine can be indexed in a similar way, and will provide SMB/CIFS support. Note that drive letter mappings are made on a per-user basis, so remote file shares must be specified as UNC paths (e.g. \\fileserver\directory). Also note that local collections cannot use UNC paths or URLs as their data root.

Document level security

Document level security is supported on Windows to ensure that users can only access the files they are authorized to see.

Serving fileshare results

Fileshare results are served by the user interface layer, which contacts the file share to retrieve the requested file and downloads it to the search user's browser. As part of this operation it performs all required access checks to ensure that a user only sees documents they are authorized to see.

Document filtering

Apache Tika is used to convert binary document formats to text. Additional custom filtering can be applied through custom filters.

Additional file types (if supported by Tika) can be filtered by adding them to filecopy.filetypes and filter.tika.types.
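
For instance, to have OpenDocument text files copied and converted, the extension could be appended to both settings. This is a sketch only: the comma-separated value format is an assumption, the extension lists shown are illustrative rather than the product defaults, and odt is simply an example of a format that Tika can read.

```
# collection.cfg (illustrative lists, not the defaults)
# Extensions the filecopy collection will gather
filecopy.filetypes=doc,docx,pdf,odt
# Extensions passed to Tika for conversion to text
filter.tika.types=doc,docx,pdf,odt
```

Since setting these options presumably replaces any existing value, keep the extensions already in use and add the new one to each list.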

See also: Configure Funnelback to index additional file types

Configuration options

The following options are available for filecopy collections. They can be set in collection.cfg.

| Option | Description |
|---|---|
| filecopy.cache | Enable/disable using the live view as a cache directory from which pre-filtered text content can be copied. |
| filecopy.discard_filtering_errors | Whether or not to index the file names of files that failed to filter. |
| filecopy.domain | Filecopy sources that require a username to access files will use this setting as the domain for the user. |
| filecopy.exclude_pattern | Filecopy collections will exclude files which match this regular expression. |
| filecopy.filetypes | The list of file types (i.e. file extensions) that will be included by a filecopy collection. |
| filecopy.include_pattern | If specified, filecopy collections will only include files which match this regular expression. |
| filecopy.max_files_stored | If set, this limits the number of documents a filecopy collection will gather when updating. |
| filecopy.num_fetchers | Number of fetcher threads for interacting with the file share in a filecopy collection. |
| filecopy.num_workers | Number of worker threads for filtering and storing files in a filecopy collection. |
| filecopy.passwd | Filecopy sources that require a password to access files will use this setting as the password. |
| filecopy.request_delay | Optional parameter specifying how long to delay between copy requests, in milliseconds. |
| filecopy.security_model | Sets the plugin used to collect security information on files (early binding document level security). |
| filecopy.source | The file system path or URL that describes the source of data files. |
| filecopy.source_list | If specified, this option is set to a file which contains a list of other files to copy, rather than using filecopy.source. NOTE: Specifying this option will cause filecopy.source to be ignored. |
| filecopy.store_class | Specifies which storage class is used by a filecopy collection (e.g. WARC, Mirror). |
| filecopy.user | Filecopy sources that require a username to access files will use this setting as the username. |
| filecopy.walker_class | Main class used by the filecopier to walk a file tree. |
| filter.classes | Specifies which Java classes should be used for filtering documents. |
| filter.tika.types | Specifies which file types to filter using the TikaFilterProvider. |
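
To show how several of these options fit together, here is a hedged sketch of a collection.cfg fragment for a filecopy collection. Every value is hypothetical and chosen only for illustration. In particular, the exact regular expression matching rules (full or partial match) are not stated here, so the patterns are written to work either way, and the `#` comment lines assume the configuration file accepts comments.

```
# collection.cfg (hypothetical values)
filecopy.source=/var/documents/shared/
# Only gather files under a policies folder ...
filecopy.include_pattern=.*/policies/.*
# ... and skip anything in an archive subfolder
filecopy.exclude_pattern=.*/archive/.*
# Stop gathering after this many documents
filecopy.max_files_stored=10000
# Threads that read from the file share
filecopy.num_fetchers=4
# Threads that filter and store files
filecopy.num_workers=4
# Wait 100 milliseconds between copy requests
filecopy.request_delay=100
```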

v15.24.0