Collection Cfg (collection.cfg)

Introduction

Name

collection.cfg

Collection Location

~/conf/<collection>/

Collection Defaults

~/conf/

Description

Main configuration file for a collection. The collection.cfg configuration file is created when a collection is created and may be updated whenever the collection is updated.

Format

The format of the file is a simple name=value pair per line. The variables $SEARCH_HOME and $COLLECTION_NAME are automatically expanded to the Funnelback installation path and the name of the current collection respectively.
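For example, a collection's configuration might contain entries such as the following. This is an illustrative sketch only: the collection name, service name and URL are made up, not recommendations.

    collection=intranet-web
    service_name=Intranet Search
    start_url=http://intranet.example.com/
    collection_root=$SEARCH_HOME/data/$COLLECTION_NAME

All four options are described in the listings below.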

Configuration options

The following sections describe the options that can be used in the configuration file, listed alphabetically as option: description pairs. Note that some are specific to the collection's type, while others apply to every collection.

Standard Funnelback default values for each configuration option are defined in $SEARCH_HOME/conf/collection.cfg.default, and server-wide default values may be configured by adding them to the file at $SEARCH_HOME/conf/collection.cfg.
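For example, a crawl request delay could be set server-wide and then overridden for an individual collection, where the collection's own setting takes precedence. The values below are illustrative only.

Server-wide default, placed in $SEARCH_HOME/conf/collection.cfg:

    crawler.request_delay=1000

Collection-specific override, placed in the collection's own collection.cfg:

    crawler.request_delay=250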

A

access_alternate: Switch the user to an alternate collection if access_restriction applies.
access_restriction: Restricts access by listing allowable hostnames or IP addresses. Only users with a matching hostname or IP address can search.
access_restriction.ignored_ip_ranges: Defines all IP ranges in the X-Forwarded-For header to be ignored by Funnelback when applying access restrictions.
access_restriction.prefer_x_forwarded_for: Determines if access restrictions should be applied to the last IP address in the X-Forwarded-For header.
accessibility-auditor.check: Turns modern accessibility checks on or off.
admin.undeletable: If set to "true" this collection cannot be deleted from the Administration interface.
admin_email: Email address of the administrator to whom an email is sent after each collection update.
analytics.data_miner.range_in_days: Length of the time range (in days) the analytics data miner will go back from the current date when mining query and click log records.
analytics.max_heap_size: Set the Java heap size used for analytics.
analytics.outlier.day.minimum_average_count: Control the minimum number of occurrences of a query required before a day pattern can be detected.
analytics.outlier.day.threshold: Control the day pattern detection threshold.
analytics.outlier.exclude_collection: Disable query spike detection for a collection.
analytics.outlier.exclude_profiles: Disable query spike detection for a profile.
analytics.outlier.hour.minimum_average_count: Control the minimum number of occurrences of a query required before an hour pattern can be detected.
analytics.outlier.hour.threshold: Control the hour pattern detection threshold.
analytics.reports.max_day_resolution_daterange: Length of the time range (in days) to allow in a custom date range in the query reports UI.
analytics.reports.max_facts_per_dimension_combination: Advanced setting: controls the amount of data that is stored by query reports.
analytics.reports.checkpoint_rate: Advanced setting: controls the rate at which the query reports system checkpoints data to disk.
analytics.reports.disable_incremental_reporting: Disable incremental reports database updates. If set, all existing query and click logs will be processed for each reports update.
analytics.scheduled_database_update: Control whether reports for the collection are updated on a scheduled basis.
annie.index_opts: Specify options for the "annie-a" annotation indexing program.
auto-completion: Enable or disable auto completion.
auto-completion.alpha: Adjust the balance between relevancy and length for auto completion suggestions.
auto-completion.delay: Delay to wait (ms) before triggering auto completion.
auto-completion.format: Set the display format of the suggestions in the search results page.
auto-completion.length: Minimum length of the query term before triggering auto completion.
auto-completion.program: Program to use for auto completion.
auto-completion.search.enabled: Turn on search-based auto completion.
auto-completion.search.program: Program to use for search-based auto completion.
auto-completion.show: Maximum number of auto completion suggestions to show.
auto-completion.sort: Sets the auto completion suggestions sort order.
auto-completion.source: Sets the source of the data for auto completion suggestions.
auto-completion.source.extra: Sets extra sources of data for auto completion suggestions.
auto-completion.standard.enabled: Enables the standard auto completion feature.
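As a sketch of how a handful of the options above fit together, a collection enabling standard auto completion might set something like the following. The numeric values are examples only, not recommended defaults:

    auto-completion.standard.enabled=true
    auto-completion.length=3
    auto-completion.delay=200
    auto-completion.show=10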

B

build_autoc_options: Specifies additional configuration options that can be supplied to the auto completion builder.

C

changeover_percent: The new crawl only goes live if the ratio of new vs. old documents gathered is greater than this amount (e.g. 50%).
click_data.archive_dirs: The directories that contain archives of click logs to be included in producing indexes.
click_data.num_archived_logs_to_use: The number of archived click logs to use from each archive directory.
click_data.use_click_data_in_index: A boolean value indicating whether or not click information should be included in the index.
click_data.week_limit: Optional restriction of click data to a set number of weeks into the past.
click_tracking: Enable or disable click tracking.
collection: The internal name of a collection.
collection-update.step.StepTechnicalName.run: Determines if an update step should be run or not.
group.project_id: Set the group under which the collection will appear in the selection drop-down menu on the main Administration page.
collection_root: Location of a collection's crawl, index, query logs etc.
collection_type: Type of collection.
crawler.accept_cookies: Cookie policy. Default is false, i.e. do not accept cookies. Requires HTTPClient if true.
crawler.accept_files: Only crawl files with these extensions. Not normally used; the default is to accept all valid content.
crawler.allowed_redirect_pattern: Specify a regex to allow crawler redirections that would otherwise be disallowed by the current include/exclude patterns.
crawler.cache.DNSCache_max_size: Maximum size of the internal DNS cache. Upon reaching this size the cache will drop old elements.
crawler.cache.LRUCache_max_size: Maximum size of the LRUCache. Upon reaching this size the cache will drop old elements.
crawler.cache.URLCache_max_size: Maximum size of the URLCache. May be ignored by some cache implementations.
crawler.check_alias_exists: Check if aliased URLs exist; if not, revert to the original URL.
crawler.checkpoint_to: Location of crawler checkpoint files.
crawler.classes.Crawler: Java class used by the crawler; defines top level behaviour, which protocols are supported etc.
crawler.classes.Frontier: Java class used for the frontier (a list of URLs not yet visited).
crawler.classes.Policy: Java class used for enforcing the include/exclude policy for URLs.
crawler.classes.RevisitPolicy: Java class used for enforcing the revisit policy for URLs.
crawler.classes.statistics: List of statistics classes to use during a crawl in order to generate figures for data reports.
crawler.classes.URLStore: Java class used to store content on disk, e.g. create a mirror of files crawled.
crawler.eliminate_duplicates: Whether to eliminate duplicate documents while crawling (default is true).
crawler.extract_links_from_javascript: Whether to extract links from Javascript while crawling (default is true).
crawler.follow_links_in_comments: Whether to follow links in HTML comments while crawling (default is false).
crawler.frontier_num_top_level_dirs: Optional setting to specify the number of top level directories in which to store disk-based frontier files.
crawler.frontier_use_ip_mapping: Whether to map hosts to frontiers based on IP address (default is false).
crawler.frontier_hosts: List of hosts running crawlers if performing a distributed web crawl.
crawler.frontier_port: Port on which the DistributedFrontier will listen.
crawler.form_interaction_file: Path to an optional file which configures interaction with form-based authentication.
crawler.form_interaction_in_crawl: Specify whether the crawler should submit web form login details during the crawl rather than in a pre-crawl phase.
crawler.header_logging: Option to control whether HTTP headers are written out to a separate log file (default is false).
crawler.incremental_logging: Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling.
crawler.inline_filtering_enabled: Option to control whether text extraction from binary files is done "inline" during a web crawl.
crawler.link_extraction_group: The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL.
crawler.link_extraction_regular_expression: The expression used to extract links from each document. This must be a Perl compatible regular expression.
crawler.logfile: The crawler's log path and filename.
crawler.lowercase_iis_urls: Whether to lowercase all URLs from IIS web servers (default is false).
crawler.max_dir_depth: A URL with more than this many sub-directories will be ignored (too deep, probably a crawler trap).
crawler.max_download_size: Maximum size of files the crawler will download (in MB).
crawler.max_files_per_area: Maximum files per "area", e.g. the number of files in one directory or generated by one dynamic generator such as index.asp?doc=123. This parameter used to be called crawler.max_dir_size.
crawler.max_files_per_server: Maximum files per server (default is unlimited).
crawler.max_files_stored: Maximum number of files to download (default, and less than 1, is unlimited).
crawler.max_individual_frontier_size: Maximum size of an individual frontier (unlimited if not defined).
crawler.max_link_distance: How far to crawl from the start_url (default is unlimited), e.g. if crawler.max_link_distance = 1, only crawl the links on start_url. NB: Turning this on drops the crawler to single-threaded operation.
crawler.max_parse_size: The crawler will not parse documents beyond this many megabytes in size.
crawler.max_timeout_retries: Maximum number of times to retry after a network timeout (default is 0).
crawler.max_url_length: A URL with more characters than this will be ignored (too long, probably a crawler trap).
crawler.max_url_repeating_elements: A URL with more than this many repeating elements (directories) will be ignored (probably a crawler trap or an incorrectly configured web server).
crawler.monitor_authentication_cookie_renewal_interval: Optional time interval at which to renew crawl authentication cookies.
crawler.monitor_checkpoint_interval: Time interval at which to checkpoint (seconds).
crawler.monitor_delay_type: Type of delay to use during the crawl (dynamic or fixed).
crawler.monitor_halt: Checked during a crawl; if set to "true" the crawler will cleanly shut down.
crawler.monitor_preferred_servers_list: Optional list of servers to prefer during the crawl.
crawler.monitor_time_interval: Time interval at which to output monitoring information (seconds).
crawler.monitor_url_reject_list: Optional parameter listing URLs to reject during a running crawl.
crawler.non_html: Which non-HTML file formats to crawl (e.g. pdf, doc, xls etc.).
crawler.num_crawlers: Number of crawler threads which simultaneously crawl different hosts.
crawler.overall_crawl_timeout: Maximum crawl time, after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter.
crawler.overall_crawl_units: The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes.
crawler.packages.httplib: Java library for HTTP/HTTPS support.
crawler.parser.mimeTypes: Extract links from these comma-separated or regexp: content-types.
crawler.predirects_enabled: Enable crawler predirects (boolean).
crawler.protocols: Crawl URLs via these protocols (comma separated list).
crawler.reject_files: Do not crawl files with these extensions.
crawler.remove_parameters: Optional list of parameters to remove from URLs.
crawler.request_delay: Milliseconds between HTTP requests (for a particular thread).
crawler.request_header: Optional additional header to be inserted in HTTP(S) requests made by the web crawler.
crawler.request_header_url_prefix: Optional URL prefix to be applied when processing the crawler.request_header parameter.
crawler.request_timeout: Timeout for HTTP page GETs (milliseconds).
crawler.revisit.edit_distance_threshold: Threshold for the edit distance between two versions of a page when deciding whether it has changed or not.
crawler.revisit.num_times_revisit_skipped_threshold: Threshold for the number of times a page revisit has been skipped when deciding whether to revisit it.
crawler.revisit.num_times_unchanged_threshold: Threshold for the number of times a page has been unchanged when deciding whether to revisit it.
crawler.robotAgent: Matching is case-insensitive over the length of the name in a robots.txt file.
crawler.secondary_store_root: Location of the secondary (previous) store, used in incremental crawling.
crawler.server_alias_file: Path to an optional file containing server alias mappings, e.g. www.daff.gov.au=www.affa.gov.au.
crawler.sslClientStore: Path to an SSL client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed by Java's keytool.
crawler.sslClientStorePassword: Password for the SSL client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed by Java's keytool.
crawler.sslTrustEveryone: Trust ALL root certificates and ignore server hostname verification if true. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing.
crawler.sslTrustStore: Path to an SSL trusted root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed by Java's keytool.
crawler.start_urls_file: Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start_url that the crawler is passed on startup (usually stored in the crawler.start_url configuration option).
crawler.store_all_types: If true, override accept/reject rules and crawl and store all file types encountered.
crawler.store_empty_content_urls: If true, store URLs even if, after filtering, they contain no content.
crawler.store_headers: Write HTTP header information at the top of HTML files if true. Header information is used by the indexer.
crawler.user_agent: The browser ID that the crawler uses when making HTTP requests. We use a browser signature so that web servers will return framed content etc. to us.
crawler.use_sitemap_xml: Optional parameter specifying whether to process sitemap.xml files during a web crawl.
crawler.verbosity: Verbosity level (0-6) of crawler logs. A higher number results in more messages.
crawler: The name of the crawler binary.
crawler_binaries: Location of the crawler files.
custom.base_template: The template used when the collection was created.
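As an illustration of how several crawler options combine, a web collection might throttle and bound its crawl along these lines. The figures are placeholders, not recommendations:

    crawler.num_crawlers=10
    crawler.request_delay=250
    crawler.max_download_size=10
    crawler.overall_crawl_timeout=24
    crawler.overall_crawl_units=hr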

D

data_report: A switch that can be used to enable or disable the data report stage during a collection update.
data_root: The directory under which the documents to index reside.
datasource: Indicates if the collection is a datasource.
db.bundle_storage_enabled: Allows storage of data extracted from a database in a compressed form.
db.custom_action_java_class: Allows a custom Java class to modify data extracted from a database before indexing.
db.full_sql_query: The SQL query to perform on a database to fetch all records for searching.
db.incremental_sql_query: The SQL query to perform to fetch new or changed records from a database.
db.incremental_update_type: Allows the selection of different modes for keeping database collections up to date.
db.jdbc_class: The name of the Java JDBC driver to connect to a database.
db.jdbc_url: The URL specifying database connection parameters such as the server and database name.
db.password: The password for connecting to the database.
db.primary_id_column: The primary id (unique identifier) column for each database record.
db.xml_root_element: The top level element for records extracted from the database.
db.single_item_sql: An SQL command for extracting an individual record from the database.
db.update_table_name: The name of a table in the database which provides a record of all additions, updates and deletes.
db.username: The username for connecting to the database.
db.use_column_labels: Flag to control whether column labels are used in JDBC calls in the database gatherer.
directory.context_factory: Sets the Java class to use for creating directory connections.
directory.domain: Sets the domain to use for authentication in a directory collection.
directory.exclude_rules: Sets the rules for excluding content from a directory collection.
directory.page_size: Sets the number of documents to fetch from the directory in each request.
directory.password: Sets the password to use for authentication in a directory collection.
directory.provider_url: Sets the URL for accessing the directory in a directory collection.
directory.search_base: Sets the base from which content will be gathered in a directory collection.
directory.search_filter: Sets the filter for selecting content to gather in a directory collection.
directory.username: Sets the username to use for authentication in a directory collection.
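A database collection ties several db.* options together. The sketch below assumes a hypothetical MySQL products database; the driver class, URL, credentials, query and column names are placeholders:

    db.jdbc_class=com.mysql.jdbc.Driver
    db.jdbc_url=jdbc:mysql://dbserver.example.com/products
    db.username=funnelback_read
    db.password=secret
    db.full_sql_query=SELECT * FROM products
    db.primary_id_column=product_id
    db.xml_root_element=product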

E

exclude_patterns: The crawler will ignore a URL if it matches any of these exclude patterns.

F

faceted_navigation.date.sort_mode: Specify how to sort date based facets.
faceted_navigation.white_list: Include only a list of specific values for a facet (Modern UI only).
faceted_navigation.black_list: Exclude specific values for a facet (Modern UI only).
filecopy.cache: Enable/disable using the live view as a cache directory from which pre-filtered text content can be copied.
filecopy.domain: Filecopy sources that require a username to access files will use this setting as the domain for the user.
filecopy.discard_filtering_errors: Whether or not to index the file names of files that failed to filter.
filecopy.exclude_pattern: Filecopy collections will exclude files which match this regular expression.
filecopy.filetypes: The list of filetypes (i.e. file extensions) that will be included by a filecopy collection.
filecopy.include_pattern: If specified, filecopy collections will only include files which match this regular expression.
filecopy.max_files_stored: If set, this limits the number of documents a filecopy collection will gather when updating.
filecopy.num_workers: Number of worker threads for filtering and storing files in a filecopy collection.
filecopy.num_fetchers: Number of fetcher threads for interacting with the fileshare in a filecopy collection.
filecopy.walker_class: Main class used by the filecopier to walk a file tree.
filecopy.passwd: Filecopy sources that require a password to access files will use this setting as the password.
filecopy.request_delay: Optional parameter to specify how long to delay between copy requests, in milliseconds.
filecopy.source: The file system path or URL that describes the source of data files.
filecopy.security_model: Sets the plugin to use to collect security information on files (early binding document level security).
filecopy.source_list: If specified, this option is set to a file which contains a list of other files to copy, rather than using filecopy.source. NOTE: Specifying this option will cause filecopy.source to be ignored.
filecopy.store_class: Specifies which storage class to be used by a filecopy collection (e.g. WARC, Mirror).
filecopy.user: Filecopy sources that require a username to access files will use this setting as the username.
filter.classes: Optionally specify which Java classes should be used for filtering documents.
filter.csv-to-xml.custom-header: Defines a custom header to use for the CSV.
filter.csv-to-xml.format: Sets the CSV format to use when filtering a CSV document.
filter.csv-to-xml.has-header: Controls if the CSV file has a header or not.
filter.csv-to-xml.url-template: The template to use for the URLs of the documents created in the CSVToXML filter.
filter.document_fixer.timeout_ms: Controls the maximum amount of time the document fixer may spend on a document.
filter.ignore.mimeTypes: Optional list of MIME types for the filter to ignore.
filter.jsoup.classes: Specify which Java/Groovy classes will be used for filtering, and operate on JSoup objects rather than byte streams.
filter.jsoup.undesirable_text-source.*: Specify sources of undesirable text strings to detect and present within Content Auditor.
filter.num_worker_threads: Specify the number of parallel threads to use in document filtering (text extraction).
filter.text-cleanup.ranges-to-replace: Specify Unicode blocks for replacement during filtering (to avoid 'corrupt' character display).
filter.tika.types: Specify which file types to filter using the TikaFilterProvider.
ftp_passwd: Password to use when gathering content from an FTP server.
ftp_user: Username to use when gathering content from an FTP server.
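A filecopy collection is typically driven by a source location plus credentials and include/exclude rules. The example below is a sketch against a hypothetical Windows file share; the share path, domain, account details and pattern are all placeholders:

    filecopy.source=smb://fileserver.example.com/share/documents
    filecopy.domain=EXAMPLE
    filecopy.user=svc-funnelback
    filecopy.passwd=secret
    filecopy.exclude_pattern=.*/archive/.*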

G

gather: The mechanism used to gather documents for indexing. "crawl" indicates web retrieval whereas "filecopy" indicates a local or remote file copy.
gather.max_heap_size: Set the Java heap size used for gathering documents.
gather.slowdown.days: Days on which gathering should be slowed down.
gather.slowdown.hours.from: Start hour for the slowdown period.
gather.slowdown.hours.to: End hour for the slowdown period.
gather.slowdown.threads: Number of threads to use during the slowdown period.
gather.slowdown.request_delay: Request delay to use during the slowdown period.
groovy.extra_class_path: Specify extra class paths to be used by Groovy when using $GROOVY_COMMAND.
group.customer_id: The customer group under which the collection will appear. Useful for multi-tenant systems.
group.project_id: The project group under which the collection will appear in the selection drop-down menu on the main Administration page.
gscopes.options: Specify options for the "padre-gs" gscopes program.
gscopes.other_bit_number: Specifies the gscope bit to set when no other bits are set.
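The gather.slowdown.* options work as a group: on the nominated days, between the "from" and "to" hours, gathering runs with fewer threads and a longer request delay. A sketch with placeholder values and an assumed day-list format:

    gather.slowdown.days=Monday,Tuesday,Wednesday,Thursday,Friday
    gather.slowdown.hours.from=9
    gather.slowdown.hours.to=17
    gather.slowdown.threads=1
    gather.slowdown.request_delay=2000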

H

http_passwd: Password used for accessing password-protected content during a crawl.
http_proxy: The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'.
http_proxy_passwd: The proxy password to be used during crawling.
http_proxy_port: Port of the HTTP proxy used during crawling.
http_proxy_user: The proxy user name to be used during crawling.
http_source_host: IP address or hostname used by the crawler, on a machine with more than one available.
http_user: Username used for accessing password-protected content during a crawl.
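Crawling through an HTTP proxy generally involves the four proxy options together; the hostname, port and credentials below are placeholders:

    http_proxy=proxy.example.com
    http_proxy_port=3128
    http_proxy_user=crawluser
    http_proxy_passwd=secret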

I

include_patterns: URLs matching this are included in the crawl (unless exclude_patterns applies), e.g. usyd.edu.au, anu.edu.au, www.anutech.com.au/ELC/.
index: A switch that can be used to enable or disable the indexing stage during a collection update.
index.target: For datasources, indicate which index the data is sent to.
indexer: The name of the indexer program to be used for this collection.
indexer_options: Indexer command line options, each separated by whitespace and thus cannot contain embedded whitespace characters.
indexing.additional-metamap-source.*: Declare additional sources of metadata mappings to be used when indexing HTML documents.
indexing.collapse_fields: Define which fields to consider for result collapsing.
indexing.use_manifest: Flag to turn on use of a manifest file for indexing.
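include_patterns (above) and exclude_patterns (section E) are commonly set together; following the comma-separated style shown in the include_patterns example, a sketch with placeholder sites and paths might look like:

    include_patterns=intranet.example.com,www.example.org/docs/
    exclude_patterns=/calendar/,logout,print=yes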

J

java_libraries: The path where the Java libraries are located.
java_options: Command line options to pass to the Java virtual machine when the crawler is launched.

L

logging.hostname_in_filename: Control whether hostnames are used in log filenames.
logging.ignored_x_forwarded_for_ranges: Defines all IP ranges in the X-Forwarded-For header to be ignored by Funnelback when choosing the IP address to log.

M

mail.on_failure_only: Whether to always send collection update emails or only when an update fails.
matrix_password: Password for logging into Matrix and the Squiz Suite Manager.
matrix_username: Username for logging into Matrix and the Squiz Suite Manager.
mcf.authority-url: URL for contacting a ManifoldCF authority.
mcf.domain: Default domain for users in the ManifoldCF authority.

N

noindex_expression: Optional regular expression to specify content that should not be indexed.

P

post_gather_command: Optional command to execute after the gathering phase finishes.
post_index_command: Command to execute after indexing finishes.
post_update_command: Command to execute once an update has finished (the update email will already have been sent).
pre_gather_command: Command to execute before gathering starts.
pre_index_command: Command to execute before indexing commences.
pre_reporting_command: Command to execute before reports updating commences.
progress_report_interval: Interval (in seconds) at which the gatherer will update the progress message for the Admin UI.
push.auto-start: Set if the Push collection will start with the web server.
push.commit-type: The type of commit that Push should use.
push.commit.index.parallel.max-index-thread-count: The maximum number of threads that can be used during a commit for indexing.
push.commit.index.parallel.min-documents-for-parallel-indexing: The minimum number of documents required in a single commit for parallel indexing to be used during that commit.
push.commit.index.parallel.min-documents-per-thread: The minimum number of documents each thread must have when using parallel indexing in a commit.
push.init-mode: The initial mode in which Push should start.
push.max-generations: The maximum number of generations Push can use.
push.merge.index.parallel.max-index-thread-count: The maximum number of threads that can be used during a merge for indexing.
push.merge.index.parallel.min-documents-for-parallel-indexing: The minimum number of documents required in a single merge for parallel indexing to be used during that merge.
push.merge.index.parallel.min-documents-per-thread: The minimum number of documents each thread must have when using parallel indexing in a merge.
push.replication.compression-algorithm: The compression algorithm to use when transferring compressible files to Push slaves.
push.replication.ignore.data: When set, query processors will ignore the data, which is used for cached copies.
push.replication.ignore.delete-lists: When set, query processors will ignore the delete lists.
push.replication.master.host-name: A query processor Push collection's master's hostname.
push.replication.master.push-api.port: The master's push-api port for a query processor Push collection.
push.scheduler.auto-click-logs-processing-timeout-seconds: Number of seconds before a Push collection will automatically trigger processing of click logs.
push.scheduler.auto-commit-timeout-seconds: Number of seconds a Push collection should wait before a commit is automatically triggered.
push.scheduler.changes-before-auto-commit: Number of changes to a Push collection before a commit is automatically triggered.
push.scheduler.delay-between-content-auditor-runs: Minimum time in milliseconds between each execution of the Content Auditor summary generation task.
push.scheduler.delay-between-meta-dependencies-runs: Minimum time in milliseconds between each execution of updating the Push collection's meta parents.
push.scheduler.generation.re-index.killed-percent: The percentage of killed documents in a single generation for it to be considered for re-indexing.
push.scheduler.generation.re-index.min-doc-count: The minimum number of documents in a single generation for it to be considered for re-indexing.
push.scheduler.killed-percentage-for-reindex: Percentage of killed documents before Push re-indexes.
push.store.always-flush: Used to stop a Push collection from performing caching on PUT or DELETE calls.
push.worker-thread-count: The number of worker threads Push should use.
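For a Push collection, the scheduler options above control when changes are committed automatically. A sketch with placeholder figures, not tuned values:

    push.scheduler.auto-commit-timeout-seconds=60
    push.scheduler.changes-before-auto-commit=1000
    push.worker-thread-count=4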

Q

query_processor: The name of the query processor executable to use.
query_processor_options: Query processor command line options.

R

recommender: Enables/disables the recommendations system.
retry_policy.max_tries: Maximum number of times to retry an operation that has failed.
rss.copyright: Sets the copyright element in the RSS feed.
rss.ttl: Sets the ttl element in the RSS feed.

S

schedule.incremental_crawl_ratio: The number of scheduled incremental crawls that are performed between each full crawl (e.g. a value of '10' results in an update schedule of ten incremental crawls followed by a full crawl).
search_user: Name of the user who runs collection updates.
security.earlybinding.user-to-key-mapper: Selected security plugin for translating usernames into lists of document keys.
security.earlybinding.user-to-key-mapper.cache-seconds: Number of seconds for which a user's list of keys may be cached.
security.earlybinding.user-to-key-mapper.groovy-class: Name of a custom Groovy class to use to translate usernames into lists of document keys.
security.earlybinding.locks-keys-matcher.name: Name of the security plugin library that matches user keys with document locks at query time.
security.earlybinding.locks-keys-matcher.ldlibrarypath: Full path to the security plugin library.
service_name: Name of the collection as displayed to users, e.g. Intellectual Property Portal. Please note: this is not the same as the Administration Interface concept of services.
service.thumbnail.max-age: Specify how long thumbnails may be cached for.
spelling.suggestion_lexicon_weight: Specify the weighting to be given to suggestions from the lexicon (list of words from indexed documents) relative to other sources (e.g. annotations).
spelling.suggestion_sources: Specify sources of information for generating spelling suggestions.
spelling.suggestion_threshold: Threshold which controls how suggestions are made.
spelling_enabled: Whether to enable spell checking in the search interface (true or false).
start_url: Crawler seed URL. The crawler follows links in this page, then the links of those pages and so on.
store.push.collection: Name of a Push collection to push content into (if using a PushStore or Push2Store).
store.push.host: Hostname of the machine where the specified Push collection exists (if using a PushStore).
store.push.password: The password to use when authenticating against Push (if using a PushStore or Push2Store).
store.push.port: Port that Push collections listen on (if using a PushStore).
store.push.url: The URL at which the Push API is located (if using a Push2Store).
store.push.user: The user name to use when authenticating against Push (if using a PushStore or Push2Store).
store.raw-bytes.class: Fully qualified classname of a raw bytes class to use.
store.record.type: This parameter defines the type of store that Funnelback uses to store its records.
store.temp.class: Fully qualified classname of a class to use for temporary storage.
store.xml.class: Fully qualified classname of an XML storage class to use.
squizapi.target_url: URL of the Squiz Suite Manager for a Matrix collection.
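When a gathering collection stores its content into a Push collection via a PushStore, the store.push.* options identify the target and credentials. All values below are placeholders:

    store.push.collection=content-push
    store.push.host=localhost
    store.push.user=push-user
    store.push.password=secret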

T

text_miner_enabled: Control whether text mining is enabled or not.
trim.collect_containers: Whether to collect the container of each TRIM record or not (significantly slows down the crawl).
trim.database: The 2-digit identifier of the TRIM database to index.
trim.default_live_links: Whether search results links should point to a copy of the TRIM document, or launch the TRIM client.
trim.domain: Windows domain for the TRIMPush crawl user.
trim.extracted_file_types: A list of file extensions that will be extracted from TRIM databases.
trim.filter_timeout: Timeout to apply when filtering binary documents.
trim.free_space_check_exclude: Volume letters to exclude from the free disk space check.
trim.free_space_threshold: Minimum amount of free space on disk under which a TRIMPush crawl will stop.
trim.gather_direction: Whether to go forward or backward when gathering records.
trim.gather_mode: Date field to use when selecting records (registered date or modified date).
trim.gather_start_date: The date from which newly registered or modified documents will be gathered.
trim.gather_end_date: The date at which to stop the gather process.
trim.license_number: TRIM license number as found in the TRIM client system information panel.
trim.max_filter_errors: The maximum number of filtering errors to tolerate before stopping the crawl.
trim.max_size: The maximum size of record attachments to process.
trim.max_store_errors: The maximum number of storage errors to tolerate before stopping the crawl.
trim.passwd: Password for the TRIMPush crawl user.
trim.properties_blacklist: List of properties to ignore when extracting TRIM records.
trim.push.collection: Push collection in which to store the extracted TRIM records.
trim.request_delay: Milliseconds between TRIM requests (for a particular thread).
trim.stats_dump_interval: Interval (in seconds) at which statistics will be written to the monitor.log file.
trim.store_class: Class to use to store TRIM records.
trim.timespan: Interval to split the gather date range into.
trim.timespan.unit: Number of time spans to split the gather date range into.
trim.threads: Number of simultaneous TRIM database connections to use.
trim.user: Username for the TRIMPush crawl user.
trim.userfields_blacklist: List of user fields to ignore when extracting TRIM records.
trim.verbose: Define how verbose the TRIM crawl is.
trim.version: Configure the version of TRIM to be crawled.
trim.web_server_work_path: Location of the temporary folder used by TRIM to extract binary files.
trim.workgroup_port: The port on the TRIM workgroup server to connect to when gathering content from TRIM.
trim.workgroup_server: The name of the TRIM workgroup server to connect to when gathering content from TRIM.
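A TRIMPush collection needs, at minimum, the workgroup server details, the database identifier and the crawl user. The sketch below uses placeholder server, port, database id and credentials:

    trim.workgroup_server=records.example.com
    trim.workgroup_port=1137
    trim.database=45
    trim.domain=EXAMPLE
    trim.user=svc-trim
    trim.passwd=secret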

U

ui.integration_url: URL to use to reach the search service when wrapped inside another system (e.g. a CMS).
ui.modern.accessibility-auditor.daat_limit: Define how many matching results are scanned for creating Accessibility Auditor reports.
ui.modern.authentication: Enable Windows authentication on the Modern UI.
ui.modern.cache.form.content_type: Specify a custom content type header for the cache controller file (Modern UI only).
ui.modern.click_link: References the URL used to log result clicks (Modern UI only).
ui.modern.content-auditor.collapsing-signature: Define how duplicates are detected within Content Auditor.
ui.modern.content-auditor.count_urls: Define how deep into URLs Content Auditor users can navigate using facets.
ui.modern.content-auditor.daat_limit: Define how many matching results are scanned for creating Content Auditor reports.
ui.modern.content-auditor.date-modified.ok-age-years: Define how many years old a document may be before it is considered problematic.
ui.modern.content-auditor.display-metadata.*: Define metadata and labels for use when displaying result metadata within Content Auditor.
ui.modern.content-auditor.duplicate_num_ranks: Define how many results should be considered when detecting duplicates for Content Auditor.
ui.modern.content-auditor.facet-metadata.*: Define metadata and labels for use in reporting and drilling down within Content Auditor.
ui.modern.content-auditor.num_ranks: Define how many results are displayed in Content Auditor's search results tab.
ui.modern.content-auditor.max-metadata-facet-categories: Define the maximum number of categories to display in Content Auditor's facets.
ui.modern.content-auditor.overview-category-count: Define how many category values should be displayed on the Content Auditor overview.
ui.modern.content-auditor.reading-grade.lower-ok-limit: Define the reading grade below which documents are considered problematic.
ui.modern.content-auditor.reading-grade.upper-ok-limit: Define the reading grade above which documents are considered problematic.
ui.modern.cors.allow_origin: Sets the value for the CORS allow origin header for the Modern UI.
ui.modern.curator.custom_fields: Configure custom fields for Curator messages.
ui.modern.curator.query-parameter-pattern: Controls which URL parameters basic Curator triggers will trigger against.
ui.modern.extra_searches: Configure extra searches to be aggregated with the main result data when using the Modern UI.
ui.modern.form.content_type: Specify a custom content type header for a form file (Modern UI only).
ui.modern.form.headers.count: Specify the count of custom headers for a form file (Modern UI only).
ui.modern.form.headers: Specify custom headers for a form file (Modern UI only).
ui.modern.freemarker.display_errors: Whether or not to display form file error messages in the browser (Modern UI only).
ui.modern.freemarker.error_format: Format of form file error messages displayed in the browser (Modern UI only).
ui.modern.geolocation.enabled: Enable/disable location detection from users' IP addresses using MaxMind GeoLite (Modern UI only).
ui.modern.geolocation.set_origin: Whether the origin point for the search is automatically set if not specified by the user's request (Modern UI only).
ui.modern.i18n: Disable localisation support on the Modern UI.
ui.modern.form.rss.content_type: Sets the content type of the RSS template.
ui.modern.search_link: Base URL used by search.html to link to itself, e.g. the next page of search results. Allows search.html (or a pass-through script) to have a different name.
ui.modern.serve.filecopy_link: References the URL used to serve filecopy documents (Modern UI only).
ui.modern.serve.trim_link_prefix: References the prefix to use for the URL used to serve TRIM documents and references (Modern UI only).
ui.modern.session: Enable or disable search session and history.
ui.modern.session.timeout: Configures the session timeout.
ui.modern.session.search_history.size: Configures the size of the search and click history.
ui.modern.session.search_history.suggest: Enable or disable search history suggestions in auto completion.
ui.modern.session.search_history.suggest.display_template: Template to use to display search history suggestions in auto completion.
ui.modern.session.search_history.suggest.category: Category containing the search history suggestions in auto completion.
ui.modern.session.set_userid_cookie: Assign unique IDs to users in an HTTP cookie.
ui.modern.metadata-alias.*: Creates aliases for metadata class names.
ui_cache_disabled: Prevent the cache controller from accessing any cached documents.
ui_cache_link: Base URL used by PADRE to link to the cached copy of a search result. Can be an absolute URL.
update-pipeline.max_heap_size: Set the Java heap size used for update pipelines.
update-pipeline-groovy-pre-post-commands.max_heap_size: Set the Java heap size used for Groovy scripts in pre/post update commands.
update.restrict_to_host: Specify that collection updates should be restricted to only run on a specific host.
userid_to_log: Controls how logging of IP addresses is performed.
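As an example of the session options working together, search session and history might be enabled along these lines; the history size is a placeholder, not a recommended value:

    ui.modern.session=true
    ui.modern.session.search_history.size=20
    ui.modern.session.search_history.suggest=true
    ui.modern.session.set_userid_cookie=true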

V

vital_servers: Changeover only happens if vital_servers exist in the new crawl.

W

warc.compression: Control how content is compressed in a WARC file.
workflow.publish_hook: Name of the publish hook Perl script.
workflow.publish_hook.meta: Name of the publish hook Perl script that will be called each time a meta collection is modified.
