Collection Options

collection.cfg is the main configuration file for a Funnelback collection.

The collection.cfg file is created when a collection is created, and can be edited via the edit collection configuration link on the administration home page.

Location

These options can be read from the following locations on disk:

  • $SEARCH_HOME/conf/[collection]/collection.cfg: Collection-level options in this file have the highest precedence, overriding values in all other files (see the example after this list).
  • $SEARCH_HOME/conf/collection.cfg: Options in this file provide defaults for all profiles in all collections.
  • $SEARCH_HOME/conf/collection.cfg.default: This file is read-only and provides the product default values.
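
For example (the collection name example-collection and the values shown are purely illustrative), if changeover_percent is set in both of the following files:

    $SEARCH_HOME/conf/collection.cfg:
        changeover_percent=50

    $SEARCH_HOME/conf/example-collection/collection.cfg:
        changeover_percent=80

then example-collection uses 80, because the collection-level file takes precedence over the global defaults.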

Format

The file contains one name=value pair per line. The values $SEARCH_HOME and $COLLECTION_NAME are automatically expanded to the Funnelback installation path and the name of the current collection respectively.
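
A minimal sketch of the format (the values shown are illustrative only, not defaults):

    collection=example-collection
    service_name=Example Search
    collection_root=$SEARCH_HOME/data/$COLLECTION_NAME

Here $SEARCH_HOME and $COLLECTION_NAME in the last line would be expanded at run time to the installation path and to example-collection respectively.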

Configuration options

The following table describes the options that can be used in the configuration file. Each entry lists the option name, followed by a short description.

accessibility-auditor.check: Turns modern accessibility checks on or off.
accessibility-auditor.min-time-between-recording-history-in-seconds: Specifies how much time must have passed since the last time Accessibility Auditor data was recorded before new data will be recorded.
admin.undeletable: This option controls whether a collection can be deleted from the administration interface.
admin_email: Specifies an email address that will be emailed after each collection update.
analytics.data_miner.range_in_days: Length of time range (in days) the analytics data miner will go back from the current date when mining query and click log records.
analytics.max_heap_size: Set Java heap size used for analytics.
analytics.outlier.day.minimum_average_count: Control the minimum number of occurrences of a query required before a day pattern can be detected.
analytics.outlier.day.threshold: Control the day pattern detection threshold.
analytics.outlier.exclude_collection: Disable query spike detection (trend alerts) for a collection.
analytics.outlier.exclude_profiles: Disable query spike detection for a profile.
analytics.outlier.hour.minimum_average_count: Control the minimum number of occurrences of a query required before an hour pattern can be detected.
analytics.outlier.hour.threshold: Control the hour pattern detection threshold.
analytics.reports.checkpoint_rate: Controls the rate at which the query reports system checkpoints data to disk.
analytics.reports.disable_incremental_reporting: Disable incremental reports database updates. If set, all existing query and click logs will be processed for each reports update.
analytics.reports.max_facts_per_dimension_combination: Specifies the amount of data that is stored by query reports.
analytics.scheduled_database_update: Control whether reports for the collection are updated on a scheduled basis.
annie.index_opts: Specify options for the "annie-a" annotation indexing program.
build_autoc_options: Specifies additional configuration options that can be supplied to the auto completion builder.
changeover_percent: Specifies the minimum ratio of documents that must be gathered for an update to succeed.
click_data.archive_dirs: The directories that contain archives of click logs to be included in producing indexes.
click_data.num_archived_logs_to_use: The number of archived click logs to use from each archive directory.
click_data.use_click_data_in_index: A boolean value indicating whether or not click information should be included in the index.
click_data.week_limit: Optional restriction of click data to a set number of weeks into the past.
collection: The internal name of a collection.
collection-update.step.stepTechnicalName.run: Determines if an update step should be run or not.
collection_root: Specifies the location of a collection's data folder.
collection_type: Specifies the type of the collection.
crawler: Specifies the name of the crawler binary.
crawler.accept_cookies: This option enables or disables the crawler's use of cookies.
crawler.accept_files: Restricts the file extensions the web crawler should crawl.
crawler.allow_concurrent_in_crawl_form_interaction: Enable/disable concurrent processing of in-crawl form interaction.
crawler.allowed_redirect_pattern: Specify a regex to allow crawler redirections that would otherwise be disallowed by the current include/exclude patterns.
crawler.cache.DNSCache_max_size: Maximum size of the internal DNS cache. Upon reaching this size the cache will drop old elements.
crawler.cache.LRUCache_max_size: Maximum size of the LRUCache. Upon reaching this size the cache will drop old elements.
crawler.cache.URLCache_max_size: Specifies the maximum size of the URLCache.
crawler.check_alias_exists: Check if aliased URLs exist; if not, revert to the original URL.
crawler.checkpoint_to: Specifies the location of crawler checkpoint files.
crawler.classes.Crawler: This option defines the Java class to be used as the main crawling process.
crawler.classes.Frontier: Specifies the Java class used for the frontier (a list of URLs not yet visited).
crawler.classes.Policy: Specifies the Java class used for enforcing the include/exclude policy for URLs.
crawler.classes.RevisitPolicy: Specifies the Java class used for enforcing the revisit policy for URLs.
crawler.classes.URLStore: Specifies the Java class used to store content on disk, e.g. to create a mirror of the files crawled.
crawler.classes.statistics: List of statistics classes to use during a crawl in order to generate figures for data reports.
crawler.cookie_jar_file: Specifies a file containing cookies to be pre-loaded when a web crawl begins.
crawler.eliminate_duplicates: Whether to eliminate duplicate documents while crawling.
crawler.extract_links_from_javascript: Whether to extract links from JavaScript while crawling.
crawler.follow_links_in_comments: Whether to follow links in HTML comments while crawling.
crawler.form_interaction_file: Specifies a path to a file which configures interaction with form-based authentication.
crawler.form_interaction_in_crawl: Specify whether the crawler should submit web form login details during the crawl rather than in a pre-crawl phase.
crawler.frontier_hosts: List of hosts running crawlers when performing a distributed web crawl.
crawler.frontier_num_top_level_dirs: Specifies the number of top level directories to store disk based frontier files in.
crawler.frontier_port: Port on which the DistributedFrontier will listen.
crawler.frontier_use_ip_mapping: Whether to map hosts to frontiers based on IP address.
crawler.header_logging: Option to control whether HTTP headers are written out to a separate log file.
crawler.incremental_logging: Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling.
crawler.inline_filtering_enabled: Option to control whether text extraction from binary files is done "inline" during a web crawl.
crawler.link_extraction_group: The group in the crawler.link_extraction_regular_expression option which should be extracted as the link/URL.
crawler.link_extraction_regular_expression: Specifies the regular expression used to extract links from each document.
crawler.logfile: Specifies the crawler's log path and filename.
crawler.lowercase_iis_urls: Whether to lowercase all URLs from IIS web servers.
crawler.max_dir_depth: Specifies the maximum number of subdirectories a URL may have before it will be ignored.
crawler.max_download_size: Specifies the maximum size of files the crawler will download (in MB).
crawler.max_files_per_area: Specifies a limit on the number of files from a single directory or dynamically generated URLs that will be crawled.
crawler.max_files_per_server: Specifies the maximum number of files that will be crawled per server.
crawler.max_files_stored: Specifies the maximum number of files to download.
crawler.max_individual_frontier_size: Specifies the maximum size of an individual frontier.
crawler.max_link_distance: Specifies the maximum distance a URL can be from a start URL for it to be downloaded.
crawler.max_parse_size: The crawler will not parse documents beyond this many megabytes in size.
crawler.max_timeout_retries: Maximum number of times to retry after a network timeout.
crawler.max_url_length: Specifies the maximum length a URL can be in order for it to be crawled.
crawler.max_url_repeating_elements: A URL with more than this many repeating elements (directories) will be ignored.
crawler.monitor_authentication_cookie_renewal_interval: Specifies the time interval at which to renew crawl authentication cookies.
crawler.monitor_checkpoint_interval: Time interval at which to checkpoint (seconds).
crawler.monitor_delay_type: Type of delay to use during the crawl (dynamic or fixed).
crawler.monitor_halt: Specifies if a crawl should stop running.
crawler.monitor_preferred_servers_list: Specifies an optional list of servers to prefer during crawling.
crawler.monitor_time_interval: Specifies a time interval at which to output monitoring information (seconds).
crawler.monitor_url_reject_list: Optional parameter listing URLs to reject during a running crawl.
crawler.non_html: Which non-HTML file formats to crawl (e.g. pdf, doc, xls, etc.).
crawler.ntlm.domain: NTLM domain to be used for web crawler authentication.
crawler.ntlm.password: NTLM password to be used for web crawler authentication.
crawler.ntlm.username: NTLM username to be used for web crawler authentication.
crawler.num_crawlers: Number of crawler threads which simultaneously crawl different hosts.
crawler.overall_crawl_timeout: Specifies the maximum time the crawler is allowed to run. When exceeded, the crawl will stop and the update will continue.
crawler.overall_crawl_units: Specifies the units for the crawl timeout.
crawler.parser.mimeTypes: Extract links from these comma-separated or regexp: content-types.
crawler.predirects_enabled: Enable crawler predirects.
crawler.protocols: Crawl URLs via these protocols.
crawler.reject_files: Do not crawl files with these extensions.
crawler.remove_parameters: Optional list of parameters to remove from URLs.
crawler.request_delay: Milliseconds between HTTP requests per crawler thread.
crawler.request_header: Optional additional header to be inserted in HTTP(S) requests made by the web crawler.
crawler.request_header_url_prefix: Optional URL prefix to be applied when processing the crawler.request_header parameter.
crawler.request_timeout: Timeout for HTTP page GETs (milliseconds).
crawler.revisit.edit_distance_threshold: Threshold for edit distance between two versions of a page when deciding whether it has changed or not.
crawler.revisit.num_times_revisit_skipped_threshold: Threshold for the number of times a page revisit has been skipped when deciding whether to revisit it.
crawler.revisit.num_times_unchanged_threshold: Threshold for the number of times a page has been unchanged when deciding whether to revisit it.
crawler.robotAgent: Matching is case-insensitive over the length of the name in a robots.txt file.
crawler.secondary_store_root: Location of the secondary (previous) store; used in incremental crawling.
crawler.send-http-basic-credentials-without-challenge: Specifies whether HTTP basic credentials should be sent without the web server sending a challenge.
crawler.server_alias_file: Path to an optional file containing server alias mappings, e.g. www.daff.gov.au=www.affa.gov.au.
crawler.sslClientStore: Specifies a path to an SSL client certificate store.
crawler.sslClientStorePassword: Password for the SSL client certificate store.
crawler.sslTrustEveryone: Trust ALL root certificates and ignore server hostname verification.
crawler.sslTrustStore: Specifies the path to an SSL trusted root store.
crawler.start_urls_file: Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl.
crawler.store_all_types: If true, override accept/reject rules and crawl and store all file types encountered.
crawler.store_empty_content_urls: Specifies if URLs that contain no content after filtering should be stored.
crawler.store_headers: Whether HTTP header information should be written at the top of HTML files.
crawler.use_sitemap_xml: Specifies whether to process sitemap.xml files during a web crawl.
crawler.user_agent: The browser ID that the crawler uses when making HTTP requests.
crawler.verbosity: Verbosity level (0-6) of crawler logs. A higher number results in more messages.
crawler_binaries: Specifies the location of the crawler files.
custom.base_template: The template used when a custom collection was created.
data_report: A switch that can be used to enable or disable the data report stage during a collection update.
data_root: The directory under which the documents to index reside.
db.bundle_storage_enabled: Allows storage of data extracted from a database in a compressed form.
db.custom_action_java_class: Allows a custom Java class to modify data extracted from a database before indexing.
db.full_sql_query: The SQL query to perform on a database to fetch all records for searching.
db.incremental_sql_query: The SQL query to perform to fetch new or changed records from a database.
db.incremental_update_type: Allows the selection of different modes for keeping database collections up to date.
db.jdbc_class: The name of the Java JDBC driver to connect to a database.
db.jdbc_url: The URL specifying database connection parameters such as the server and database name.
db.password: The password for connecting to the database.
db.primary_id_column: The primary id (unique identifier) column for each database record.
db.single_item_sql: An SQL command for extracting an individual record from the database.
db.update_table_name: The name of a table in the database which provides a record of all additions, updates and deletes.
db.use_column_labels: Flag to control whether column labels are used in JDBC calls in the database gatherer.
db.username: The username for connecting to the database.
db.xml_root_element: The top level element for records extracted from the database.
directory.context_factory: Sets the Java class to use for creating directory connections.
directory.domain: Sets the domain to use for authentication in a directory collection.
directory.exclude_rules: Sets the rules for excluding content from a directory collection.
directory.page_size: Sets the number of documents to fetch from the directory in each request.
directory.password: Sets the password to use for authentication in a directory collection.
directory.provider_url: Sets the URL for accessing the directory in a directory collection.
directory.search_base: Sets the base from which content will be gathered in a directory collection.
directory.search_filter: Sets the filter for selecting content to gather in a directory collection.
directory.username: Sets the username to use for authentication in a directory collection.
exclude_patterns: The crawler will ignore a URL if it matches any of these exclude patterns.
facebook.access-token: Specify an optional access token.
facebook.app-id: Specifies the Facebook application ID.
facebook.app-secret: Specifies the Facebook application secret.
facebook.debug: Enable debug mode to preview fetched Facebook records.
facebook.event-fields: Specify a list of Facebook event fields as specified in the Facebook event API documentation.
facebook.page-fields: Specify a list of Facebook page fields as specified in the Facebook page API documentation.
facebook.page-ids: Specifies a list of IDs of the Facebook pages/accounts to crawl.
facebook.post-fields: Specify a list of Facebook post fields as specified in the Facebook post API documentation.
faceted_navigation.black_list: Exclude specific values for facets.
faceted_navigation.black_list.facet: Exclude specific values for a specific facet.
faceted_navigation.date.sort_mode: (deprecated) Specify how to sort date based facets.
faceted_navigation.white_list: Include only a list of specific values for facets.
faceted_navigation.white_list.facet: Include only a list of specific values for a specific facet.
filecopy.cache: Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from.
filecopy.discard_filtering_errors: Whether to index the file names of files that failed to be filtered.
filecopy.domain: Filecopy sources that require a username to access files will use this setting as a domain for the user.
filecopy.exclude_pattern: Filecopy collections will exclude files which match this regular expression.
filecopy.filetypes: The list of filetypes (i.e. file extensions) that will be included by a filecopy collection.
filecopy.include_pattern: If specified, filecopy collections will only include files which match this regular expression.
filecopy.max_files_stored: If set, this limits the number of documents a filecopy collection will gather when updating.
filecopy.num_fetchers: Number of fetcher threads for interacting with the fileshare in a filecopy collection.
filecopy.num_workers: Number of worker threads for filtering and storing files in a filecopy collection.
filecopy.passwd: Filecopy sources that require a password to access files will use this setting as a password.
filecopy.request_delay: Specifies how long to delay between copy requests in milliseconds.
filecopy.security_model: Sets the plugin to use to collect security information on files.
filecopy.source: This is the file system path or URL that describes the source of data files.
filecopy.source_list: If specified, this option is set to a file which contains a list of other files to copy, rather than using filecopy.source.
filecopy.store_class: Specifies the storage class to be used by a filecopy collection (e.g. WARC, Mirror).
filecopy.user: Filecopy sources that require a username to access files will use this setting as a username.
filecopy.walker_class: Main class used by the filecopier to walk a file tree.
filter.classes: Specifies which Java classes should be used for filtering documents.
filter.csv-to-xml.custom-header: Defines a custom header to use for the CSV.
filter.csv-to-xml.format: Sets the CSV format to use when filtering a CSV document.
filter.csv-to-xml.has-header: Controls if the CSV file has a header or not.
filter.csv-to-xml.url-template: The template to use for the URLs of the documents created in the CSVToXML Filter.
filter.document_fixer.timeout_ms: Controls the maximum amount of time the document fixer may spend on a document.
filter.ignore.mimeTypes: Specifies a list of MIME types for the filter to ignore.
filter.jsoup.classes: Specify which Java/Groovy classes will be used for filtering, and operate on JSoup objects rather than byte streams.
filter.jsoup.undesirable_text-source.key_name: Specify sources of undesirable text strings to detect and present within Content Auditor.
filter.text-cleanup.ranges-to-replace: Specify Unicode blocks for replacement during filtering (to avoid 'corrupt' character display).
filter.tika.types: Specifies which file types to filter using the TikaFilterProvider.
flickr.api-key: Flickr API key.
flickr.api-secret: Flickr API secret.
flickr.auth-secret: Flickr authentication secret.
flickr.auth-token: Flickr authentication token.
flickr.debug: Enable debug mode to preview fetched Flickr records.
flickr.groups.private: List of Flickr group IDs to crawl within a "private" view.
flickr.groups.public: List of Flickr group IDs to crawl within a "public" view.
flickr.user-ids: Comma-delimited list of Flickr user account IDs to crawl.
ftp_passwd: Password to use when gathering content from an FTP server.
ftp_user: Username to use when gathering content from an FTP server.
gather: Specifies if gathering is enabled or not.
gather.max_heap_size: Set Java heap size used for gathering documents.
gather.slowdown.days: Days on which gathering should be slowed down.
gather.slowdown.hours.from: Start hour for the slowdown period.
gather.slowdown.hours.to: End hour for the slowdown period.
gather.slowdown.request_delay: Request delay to use during the slowdown period.
gather.slowdown.threads: Number of threads to use during the slowdown period.
groovy.extra_class_path: Specify extra class paths to be used by Groovy when using $GROOVY_COMMAND.
group.customer_id: The customer group under which the collection will appear; useful for multi-tenant systems.
group.project_id: The project group under which the collection will appear in the selection drop-down menu on the main administration page.
gscopes.options: Specify options for the "padre-gs" gscopes program.
gscopes.other_gscope: Specifies the gscope to set when no other gscopes are set.
http_passwd: Password used for accessing password-protected content during a crawl.
http_proxy: The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling.
http_proxy_passwd: The proxy password to be used during crawling.
http_proxy_port: Port of the HTTP proxy used during crawling.
http_proxy_user: The proxy user name to be used during crawling.
http_source_host: IP address or hostname used by the crawler, on a machine with more than one available.
http_user: Username used for accessing password-protected content during a crawl.
include_patterns: Specifies the pattern that URLs must match in order to be crawled.
index: A switch that can be used to enable or disable the indexing stage during a collection update.
index.target: For datasources, indicate which index the data is sent to.
indexer: The name of the indexer program to be used for this collection.
indexer_options: Indexer command line options, each separated by whitespace; individual options therefore cannot contain embedded whitespace characters.
indexing.additional-metamap-source.key_name: Declares additional sources of metadata mappings to be used when indexing HTML documents.
indexing.collapse_fields: Define which fields to consider for result collapsing.
indexing.use_manifest: Specifies if a manifest file should be used for indexing.
java_libraries: The path where the Java libraries are located when running most gatherers.
java_options: Command line options to pass to the Java virtual machine.
knowledge-graph.max_heap_size: Set Java heap size used for the Knowledge Graph update process.
logging.hostname_in_filename: Control whether hostnames are used in log filenames.
logging.ignored_x_forwarded_for_ranges: Defines all IP ranges in the X-Forwarded-For header to be ignored by Funnelback when choosing the IP address to log.
mail.on_failure_only: Specifies whether to always send collection update emails or only when an update fails.
matrix_password: Password for logging into Matrix and the Squiz Suite Manager.
matrix_username: Username for logging into Matrix and the Squiz Suite Manager.
mcf.authority-url: URL for contacting a ManifoldCF authority.
mcf.domain: Default domain for users in the ManifoldCF authority.
noindex_expression: Optional regular expression to specify content that should not be indexed.
post_archive_command: Command to run after archiving query and click logs.
post_delete-list_command: Command to run after deleting documents during an instant delete update.
post_delete-prefix_command: Command to run after deleting documents during an instant delete update.
post_gather_command: Command to run after the gathering phase during a collection update.
post_index_command: Command to run after the index phase during a collection update.
post_instant-gather_command: Command to run after the gather phase during an instant update.
post_instant-index_command: Command to run after the index phase during an instant update.
post_meta_dependencies_command: Command to run after a component collection updates its meta parents during a collection update.
post_recommender_command: Command to run after the recommender phase during a collection update.
post_reporting_command: Command to run after query analytics runs.
post_swap_command: Command to run after live and offline views are swapped during a collection update.
post_update_command: Command to run after an update has successfully completed.
pre_archive_command: Command to run before archiving query and click logs.
pre_delete-list_command: Command to run before deleting documents during an instant delete update.
pre_delete-prefix_command: Command to run before deleting documents during an instant delete update.
pre_gather_command: Command to run before the gathering phase during a collection update.
pre_index_command: Command to run before the index phase during a collection update.
pre_instant-gather_command: Command to run before the gather phase during an instant update.
pre_instant-index_command: Command to run before the index phase during an instant update.
pre_meta_dependencies_command: Command to run before a component collection updates its meta parents during a collection update.
pre_recommender_command: Command to run before the recommender phase during a collection update.
pre_report_command: Command to run before query or click logs are to be used during an update.
pre_reporting_command: Command to run before query analytics runs.
pre_swap_command: Command to run before live and offline views are swapped during a collection update.
progress_report_interval: Interval (in seconds) at which the gatherer will update the progress message for the Admin UI.
push.auto-start: Specifies whether the Push collection will start with the web server.
push.commit-type: The type of commit that Push should use by default.
push.commit.index.parallel.max-index-thread-count: The maximum number of threads that can be used during a commit for indexing.
push.commit.index.parallel.min-documents-for-parallel-indexing: The minimum number of documents required in a single commit for parallel indexing to be used during that commit.
push.commit.index.parallel.min-documents-per-thread: The minimum number of documents each thread must have when using parallel indexing in a commit.
push.init-mode: The initial mode in which Push should start.
push.max-generations: The maximum number of generations Push can use.
push.merge.index.parallel.max-index-thread-count: The maximum number of threads that can be used during a merge for indexing.
push.merge.index.parallel.min-documents-for-parallel-indexing: The minimum number of documents required in a single merge for parallel indexing to be used during that merge.
push.merge.index.parallel.min-documents-per-thread: The minimum number of documents each thread must have when using parallel indexing in a merge.
push.replication.compression-algorithm: The compression algorithm to use when transferring compressible files to Push slaves.
push.replication.ignore.data: When set, query processors will ignore the 'data' section in snapshots, which is used for serving cached copies.
push.replication.ignore.delete-lists: When set, query processors will ignore the delete lists.
push.replication.ignore.index-redirects: When set, query processors will ignore the index redirects file in snapshots.
push.replication.master.host-name: The hostname of the master for a query processor Push collection.
push.replication.master.push-api.port: The master's push-api port for a query processor Push collection.
push.run: Controls if a Push collection is allowed to run or not.
push.scheduler.auto-click-logs-processing-timeout-seconds: Number of seconds before a Push collection will automatically trigger processing of click logs.
push.scheduler.auto-commit-timeout-seconds: Number of seconds a Push collection should wait before a commit is automatically triggered.
push.scheduler.changes-before-auto-commit: Number of changes to a Push collection before a commit is automatically triggered.
push.scheduler.delay-between-content-auditor-runs: Minimum time in milliseconds between each execution of the Content Auditor summary generation task.
push.scheduler.delay-between-meta-dependencies-runs: Minimum time in milliseconds between each execution of updating the Push collection's meta parents.
push.scheduler.generation.re-index.killed-percent: The percentage of killed documents in a single generation for it to be considered for re-indexing.
push.scheduler.generation.re-index.min-doc-count: The minimum number of documents in a single generation for it to be considered for re-indexing.
push.scheduler.killed-percentage-for-reindex: Percentage of killed documents before Push re-indexes.
push.store.always-flush: Used to stop a Push collection from performing caching on PUT or DELETE calls.
query_processor: The name of the query processor executable to use.
query_processor_options: Query processor command line options.
recommender: Specifies if the recommendations system is enabled.
retry_policy.max_tries: Maximum number of times to retry an operation that has failed.
rss.copyright: Sets the copyright element in the RSS feed.
rss.ttl: Sets the ttl element in the RSS feed.
schedule.incremental_crawl_ratio: The number of scheduled incremental crawls that are performed between each full crawl.
search_user: The email address to use for administrative purposes.
security.earlybinding.locks-keys-matcher.ldlibrarypath: Full path to the security plugin library.
security.earlybinding.locks-keys-matcher.name: Name of the security plugin library that matches user keys with document locks at query time.
security.earlybinding.user-to-key-mapper: Selected security plugin for translating usernames into lists of document keys.
security.earlybinding.user-to-key-mapper.cache-seconds: Number of seconds for which a user's list of keys may be cached.
security.earlybinding.user-to-key-mapper.groovy-class: Name of a custom Groovy class to use to translate usernames into lists of document keys.
service.thumbnail.max-age: Specify how long thumbnails may be cached for.
service_name: Name of the collection to display to users.
slack.channel-names-to-exclude: List of Slack channel names to exclude from search.
slack.hostname: The hostname of the Slack instance.
slack.target-collection: Specify the Push collection into which messages from a Slack collection should be stored.
slack.target-push-api: The Push API endpoint to which Slack messages should be added.
slack.user-names-to-exclude: Slack user names to exclude from search.
spelling.suggestion_lexicon_weight: Specify weighting to be given to suggestions from the lexicon relative to other sources.
spelling.suggestion_sources: Specify sources of information for generating spelling suggestions.
spelling.suggestion_threshold: Threshold which controls how suggestions are made.
spelling_enabled: Whether to enable spell checking in the search interface.
squizapi.target_url: URL of the Squiz Suite Manager for a Matrix collection.
start_url: A list of URLs from which the crawler will start crawling.
store.push.collection: Name of a Push collection to push content into when using a PushStore or Push2Store.
store.push.host: Hostname of the machine to push documents to if using a PushStore or Push2Store.
store.push.password: The password to use when authenticating against Push if using a PushStore or Push2Store.
store.push.port: Port that Push is configured to listen on (if using a PushStore).
store.push.url: The URL at which the Push API is located (if using a Push2Store).
store.push.user: The user name to use when authenticating against Push if using a PushStore or Push2Store.
store.raw-bytes.class: Fully qualified classname of a raw bytes Store class to use.
store.record.type: This parameter defines the type of store that Funnelback uses to store its records.
store.temp.class: Fully qualified classname of a class to use for temporary storage.
store.xml.class: Fully qualified classname of an XML storage class to use.
trim.collect_containers: Whether to collect the container of each TRIM record or not.
trim.database: The 2-digit identifier of the TRIM database to index.
trim.default_live_links: Whether search result links should point to a copy of the TRIM document or launch the TRIM client.
trim.domain: Windows domain for the TRIMPush crawl user.
trim.extracted_file_types: A list of file extensions that will be extracted from TRIM databases.
trim.filter_timeout: Timeout to apply when filtering binary documents.
trim.free_space_check_exclude: Volume letters to exclude from the free disk space check.
trim.free_space_threshold: Minimum amount of free disk space below which a TRIMPush crawl will stop.
trim.gather_direction: Whether to go forward or backward when gathering TRIM records.
trim.gather_end_date: The date at which to stop the gather process.
trim.gather_mode: Date field to use when selecting records (registered date or modified date).
trim.gather_start_date: The date from which newly registered or modified documents will be gathered.
trim.license_number: TRIM license number as found in the TRIM client system information panel.
trim.max_filter_errors: The maximum number of filtering errors to tolerate before stopping the crawl.
trim.max_size: The maximum size of record attachments to process.
trim.max_store_errors: The maximum number of storage errors to tolerate before stopping the crawl.
trim.passwd: Password for the TRIMPush crawl user.
trim.properties_blacklist: List of properties to ignore when extracting TRIM records.
trim.push.collection: Specifies the Push collection to store the extracted TRIM records in.
trim.request_delay: Milliseconds between TRIM requests (for a particular thread).
trim.stats_dump_interval: Interval (in seconds) at which statistics will be written to the monitor.log file.
trim.store_class: Class to use to store TRIM records.
trim.threads: Number of simultaneous TRIM database connections to use.
trim.timespan: Interval to split the gather date range into.
trim.timespan.unit: Number of time spans to split the gather date range into.
trim.user: Username for the TRIMPush crawl user.
trim.userfields_blacklist: List of user fields to ignore when extracting TRIM records.
trim.verbose: Defines how verbose the TRIM crawl is.
trim.version: Configure the version of TRIM to be crawled.
trim.web_server_work_path: Location of the temporary folder used by TRIM to extract binary files.
trim.workgroup_port: The port on the TRIM workgroup server to connect to when gathering content from TRIM.
trim.workgroup_server: The name of the TRIM workgroup server to connect to when gathering content from TRIM.
twitter.debug: Enable debug mode to preview fetched Twitter records.
twitter.oauth.access-token: Twitter OAuth access token.
twitter.oauth.consumer-key: Twitter OAuth consumer key.
twitter.oauth.consumer-secret: Twitter OAuth consumer secret.
twitter.oauth.token-secret: Twitter OAuth token secret.
twitter.users: Comma-delimited list of Twitter user names to crawl.
ui.integration_url: URL to use to reach the search service when wrapped inside another system (e.g. a CMS).
ui.modern.content-auditor.count_urls: Define how deep into URLs Content Auditor users can navigate using facets.
ui.modern.content-auditor.date-modified.ok-age-years: Define how many years old a document may be before it is considered problematic.
ui.modern.content-auditor.duplicate_num_ranks: Define how many results should be considered in detecting duplicates for Content Auditor.
ui.modern.content-auditor.reading-grade.lower-ok-limit: Define the reading grade below which documents are considered problematic.
ui.modern.content-auditor.reading-grade.upper-ok-limit: Define the reading grade above which documents are considered problematic.
ui.modern.curator.custom_field: Configure custom fields for Curator messages.
ui.modern.extra_searches: Configure extra searches to be aggregated with the main result data when using the Modern UI.
ui.modern.form.rss.content_type: Sets the content type of the RSS template.
ui.modern.padre_response_size_limit_bytes: Sets the maximum size of padre-sw responses to process.
ui_cache_disabled: Prevents the cache controller from accessing any cached documents.
ui_cache_link: Base URL used by PADRE to link to the cached copy of a search result. Can be an absolute URL.
update-pipeline-groovy-pre-post-commands.max_heap_size: Set Java heap size used for Groovy scripts in pre/post update commands.
update-pipeline.max_heap_size: Set Java heap size used for update pipelines.
update.restrict_to_host: Specify that collection updates should be restricted to only run on a specific host.
userid_to_log: Controls how logging of IP addresses is performed.
vital_servers: Changeover only happens if vital servers exist in the new crawl.
warc.compression: Control how content is compressed in a WARC file.
workflow.publish_hook: Name of the publish hook Perl script.
workflow.publish_hook.meta: Name of the publish hook Perl script that will be called each time a meta collection is modified.
youtube.api-key: YouTube API key retrieved from the Google API console.
youtube.channel-ids: YouTube channel IDs to crawl.
youtube.debug: Enable debug mode to preview fetched YouTube records.
youtube.liked-videos: Enables fetching of YouTube videos liked by a channel ID.
youtube.playlist-ids: YouTube playlist IDs to crawl.
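
As an illustrative sketch only, a simple web collection might combine several of the options above in its collection.cfg. The collection name, URLs and values below are examples rather than defaults or recommendations, and the exact value syntax for each option (for example, how list values are separated) is described on that option's own documentation page:

    collection=example-web
    collection_type=web
    service_name=Example Website Search
    start_url=https://www.example.com/
    include_patterns=example.com
    exclude_patterns=/calendar/,/login
    crawler.num_crawlers=10
    crawler.request_delay=250
    crawler.max_download_size=10
    changeover_percent=50
    admin_email=search-admin@example.com
    mail.on_failure_only=true
    vital_servers=www.example.com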

