Filecopy collections
Introduction
A filecopy collection is used for indexing documents from a file share or a local disk. It is built from a copy of the documents in a local or remote filesystem directory/folder. If you only need to index text content (no binary formats such as .DOC or .PDF), you can use a local collection instead.
An update will copy new or changed files from the source folder into the collection's offline data directory from where the update will proceed as normal. Binary documents are converted into text, text content is indexed, and the offline view is swapped with the live view.
A filecopy collection is defined by the following properties:
- A source directory to copy files from, possibly with an associated domain name, user name and password.
- Include and exclude patterns that determine which files are copied.
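As a rough illustration, these two properties could be expressed in collection.cfg along the following lines. The option names are taken from the configuration options table on this page. The path and both regular expressions are hypothetical values, and exactly which part of a file's path the patterns are matched against is an assumption to verify against the option reference. Lines starting with # are annotations for the reader and should be removed if your configuration file does not accept comments.

```ini
# Hypothetical source directory on the search server
filecopy.source=/var/documents/shared/
# Assumed example: only copy files found under a "reports" folder
filecopy.include_pattern=.*/reports/.*
# Assumed example: skip temporary and backup files
filecopy.exclude_pattern=.*\.(tmp|bak)$
```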
Supported directories
Funnelback supports indexing several types of directory:
Local directories
These are located on the search server and are addressed as local paths.
Windows file shares
These are file shares served over the SMB or CIFS protocols, as is standard in most Windows environments. They can be addressed as UNC paths or as smb:// URLs. How the data source is specified depends on where the data is located. For example, a filecopy collection might have:
- For a local disk:
filecopy.source=/var/documents/shared/
- For a Windows file share:
filecopy.source=\\fileserver\documents\
or
filecopy.source=smb://fileserver/documents/
Note that on Linux operating systems, the default firewall rules may need to be altered to allow for SMB / CIFS name resolution.
Red Hat Linux provides instructions for mounting NFS file shares and also comes with SMB/CIFS support.
File shares mounted on a Windows machine can be indexed in a similar way, and will provide SMB/CIFS support. Please note that drive letter mappings are made on a per-user basis, so paths to remote file shares must be specified as UNC paths (e.g. \\fileserver\directory). Also note that local collections cannot operate with UNC paths or URLs as their data root.
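For a share that requires authentication, the source can be combined with the domain, user name and password options listed in the configuration options table. The following is a minimal sketch only: the domain, account name and password are placeholders invented for illustration, and the # lines are reader annotations rather than required syntax.

```ini
# Share path as in the example above
filecopy.source=\\fileserver\documents\
# Hypothetical domain and service account; substitute your own
filecopy.domain=EXAMPLE
filecopy.user=svc-search
filecopy.passwd=change-me
```

Using a dedicated account with read-only access to the share keeps the stored credentials limited in scope.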
Document level security
Document level security is supported on Windows to ensure that users can only access the files they are authorized to see.
Serving fileshare results
Fileshare results are served by the user interface layer: it contacts the fileshare to retrieve the requested file and downloads it to the search user's browser. As part of this operation it performs all required access checks to ensure that a user only sees documents they are authorized to see.
Document filtering
Apache Tika is used to convert binary document formats to text. Further processing can be applied through custom filters.
Additional file types (if supported by Tika) can be filtered by adding them to both filecopy.filetypes and filter.tika.types.
See also: Configure Funnelback to index additional file types
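A hedged sketch of what adding a file type might look like is shown below. It assumes both options take a comma-separated list of extensions, which should be confirmed against the option reference. The extensions listed are illustrative, not the product defaults, and "odt" stands in for whichever Tika-supported type is being added. When editing a real collection, keep the existing values and append the new extension to them.

```ini
# Assumed format: comma-separated extensions; "odt" is the hypothetical addition
filecopy.filetypes=doc,pdf,odt
filter.tika.types=doc,pdf,odt
```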
Configuration options
The following options are available for filecopy collections. They can be set in collection.cfg.
Option | Description |
---|---|
filecopy.cache | Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from. |
filecopy.discard_filtering_errors | Whether to index the file names of files that failed to filter. |
filecopy.domain | Filecopy sources that require a username to access files will use this setting as a domain for the user. |
filecopy.exclude_pattern | Filecopy collections will exclude files which match this regular expression. |
filecopy.filetypes | The list of filetypes (i.e. file extensions) that will be included by a filecopy collection. |
filecopy.include_pattern | If specified, filecopy collections will only include files which match this regular expression. |
filecopy.max_files_stored | If set, this limits the number of documents a filecopy collection will gather when updating. |
filecopy.num_fetchers | Number of fetcher threads for interacting with the fileshare in a filecopy collection. |
filecopy.num_workers | Number of worker threads for filtering and storing files in a filecopy collection. |
filecopy.passwd | Filecopy sources that require a password to access files will use this setting as a password. |
filecopy.request_delay | Optional parameter to specify how long to delay between copy requests in milliseconds. |
filecopy.security_model | Sets the plugin used to collect security information on files (early binding document level security). |
filecopy.source | This is the file system path or URL that describes the source of data files. |
filecopy.source_list | If specified, this option is set to a file which contains a list of other files to copy, rather than using the filecopy.source. NOTE: Specifying this option will cause the filecopy.source to be ignored. |
filecopy.store_class | Specifies which storage class to be used by a filecopy collection (e.g. WARC, Mirror). |
filecopy.user | Filecopy sources that require a username to access files will use this setting as a username. |
filecopy.walker_class | Main class used by the filecopier to walk a file tree. |
filter.classes | Specifies which Java classes should be used for filtering documents. |
filter.tika.types | Specifies which file types to filter using the TikaFilterProvider. |
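
To show how the throughput-related options from the table fit together, here is a small sketch of a collection.cfg fragment. All of the values are illustrative assumptions chosen for the example, not recommended settings or defaults. The # lines are reader annotations.

```ini
# Illustrative values only, not recommended defaults
filecopy.num_fetchers=2
filecopy.num_workers=4
# Delay between copy requests, in milliseconds
filecopy.request_delay=100
# Stop gathering once this many documents have been stored
filecopy.max_files_stored=50000
```

Raising the thread counts or lowering the delay should speed up an update at the cost of more load on the file server, so changes are best made in small steps while watching the share's responsiveness.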