Indexer options (collection.cfg setting)

Description

This option specifies additional configuration options to pass to the indexer when indexing collections. The indexer for PADRE (the PArallel Document Retrieval Engine) can be finely controlled through a large set of options, which can be specified in this collection configuration parameter. The available options are listed below.

Caveats

  • Indexing will not occur if the indexer is given an invalid option.
  • Indexer options can affect Funnelback's performance, so change them with caution.
  • Options in group A are generally useful only when running PADRE from a command line and usually should not be included in indexer_options.
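
Because an invalid option stops indexing, it can help to check an option string before saving it. The following is a minimal sketch of such a check, not part of Funnelback: the function name and KNOWN_PREFIXES are hypothetical, and the prefix list is only a small hand-picked subset of the options documented on this page. It also assumes that options are separated by whitespace.

```python
# Hypothetical pre-flight check. KNOWN_PREFIXES is a hand-picked subset of the
# options documented on this page, not the indexer's real option parser.
KNOWN_PREFIXES = ("-mdsfml", "-maxdocs", "-nometa", "-noax", "-SORTSIG", "-big")


def unknown_options(indexer_options):
    """Return the tokens that do not start with any known option prefix."""
    return [
        token
        for token in indexer_options.split()
        if not token.startswith(KNOWN_PREFIXES)
    ]


if __name__ == "__main__":
    # An empty list means every token matched a known prefix.
    print(unknown_options("-mdsfml10000 -maxdocs1000"))
    print(unknown_options("-mdsfml10000 -bogus"))
```

A prefix match is deliberately loose (for example, "-big" also accepts "-bigweb"), so passing this check does not guarantee that the indexer will accept the option.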

Indexer options

A. Getting information about PADRE and its operation.

-V
Print PADRE version number and exit.
-ixform
Print the index format version created by this indexer and exit.
-help
Print this list and exit.
-debug
Generate debugging output.
-show_each_word_indexed
For debugging. Show each word occurrence (with field) as it is indexed.
-show_each_word_to_file
For debugging. Print each word occurrence (with field) to <filestem>.words_in_docs
-hashlog
Create a .hashlog file with incremental hashing stats.
-quiet
Use terse logging.
-ankdebug
Generate debugging output relating to anchortext.
-termdeb<term>
Print debugging messages relating to the indexing of <term>.

B. Controlling what is indexed.

-nometa
Don't index any metadata except t, d and k (titles, dates and links).
-nomdsfconcat
Don't concatenate strings in the mdsf file. Record first only. (Others are still indexed.)
-diwimuu
Don't index words in made-up URLs (those constructed from filepath).
-dias
Don't index link anchor source as part of source documents (<a> only).
-ibd
Index all documents even if they appear to be binary.
-ixcom
Index words in HTML and XML comments.
-select<num1>,<num2>
Index every <num1>th file/bundle, starting from the <num2>th (counting from zero).
-select-doc-in-bundle=<interval>,<offset>
Index every <interval>th document within a bundle, starting from <offset> (which starts at zero). Only works with warc store.
-tarpat<regex>
Filenames in a tarfile being indexed must match regex. Default is match-everything.
-csv=<fsep><skipfirst>[<quote>]
Deprecated. Use the CSV to XML filter instead. Files which are not clearly something else are assumed to be CSV format. fsep is ascii field separator, typically comma. (tab is represented by t.) skipfirst is either y or n, telling padre whether the first line in a CSV file should be skipped. quote is the character used to quote strings in fields which may contain separators. (You probably have to escape it on the command line.) If not specified, no quote character is defined. To include a quote character within a quoted section, the quote may be doubled.
-csv_fields=<comma_separated_descriptor_list>
Deprecated. Use the CSV to XML filter instead. This is a comma-separated list of descriptors describing how to index each column of the CSV file. To index terms in a column as document text use '-'. To index terms in a column as metadata use the format: <metadata class name><content type>. To skip terms in a column use 'X'. For example, 't1,-,X' would set the first column to title, the second column would be indexed as document content and the third column would be skipped. The content type defined in this argument should be the same as the content type in metamap.cfg.
-check_url_exclusion=<on|off>
URLs matching url_exclusion pattern will not be searchable. (Default on.)
-url_exclusion_pattern=<regex>
Exclusion pattern to use if URLs are vetted. (Default 'file://$SEARCH_HOME/')
-filepath_exclusion_pattern=<regex>
Exclusion pattern to use if files are to be excluded from indexing on the basis of the filepath. If applicable, this is more efficient than excluding by URL because the URL can't be finally determined until the content has been scanned. (Default: not set)
-index_subversion_dirs
Normally the .svn directories created by the subversion version control system are not indexed. Override this default.
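
To illustrate the sampling performed by -select and -select-doc-in-bundle in this section, here is a minimal sketch. It assumes that "every Nth starting from the Mth (from zero)" means taking positions M, M+N, M+2N and so on; the function name is hypothetical and the code is not taken from PADRE.

```python
def selected_positions(total, interval, offset):
    """Return the zero-based positions that would be indexed.

    Assumes the reading "every <interval>th item starting from <offset>",
    with items counted from zero.
    """
    return list(range(offset, total, interval))


if __name__ == "__main__":
    # Under this assumption, -select3,1 over 10 bundles picks these positions.
    print(selected_positions(total=10, interval=3, offset=1))
```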

C. Controlling how things are indexed.

-noax
Don't conflate accents.
-unimap=<mapname>
Specify a Unicode mapping to be applied when indexing and when query processing. Supported values: tosimplified and totraditional. (Chinese only.)
-deutsch=<i>
How much extra processing is done for umlaut and sz: 0 - none. München is indexed as München and Munchen; 1 - München is indexed as München, Muenchen and Munchen (Dflt); 2 - As for 1 but also Muenchen is indexed as München, Muenchen and Munchen (As a side-effect to allow for compounds, SORT_SIGNIF is increased to 40).
-nz=<i>
How much extra processing is done for Māori: 0 - none. Māori and Mäori are indexed as Māori or Mäori resp. and Maori (Dflt); 1 - Māori is indexed as Māori, Maaori and Maori; 2 - Mäori is indexed as Mäori, Māori, Maaori and Maori
-no_cjkt_grams
Suppress the indexing of bigrams/unigrams in CJKT text. It is assumed that said text has been pre-segmented into words, and that normal word-based indexing is needed.
-QL_depth=<i>
Activate quicklinks on default pages of up to depth i. Use internal QL defaults. (Dflt 0 = Off)
-QL_config=<f>
Activate quicklinks. Read quicklinks configuration options from file f.
-docscan_depth=<i>
When trying to determine document type and charset, the indexer will look up to i characters into the document. (Dflt 20480)
-forcexml
Use the XML parser on all documents.
-case
Store case information in postings. Currently unsupported. Note that setting this reduces the approximate max number of unique terms from ~950M to ~240M.
-SORTSIG<num>
How many [UTF-8] characters in a word are significant. Default 20
-dilw
Don't index words or use words in summaries that are longer than what is set by -SORTSIG.
-nocanon
Don't canonicalise URLs when storing URLs or matching anchortext.
-ignore_link_rel_canonical
Ignore canonical URL declarations in HTML link elements.
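
The -deutsch levels 0 and 1 described in this section can be sketched as follows. This is an illustration of the documented München example only, not PADRE's implementation: the function names are hypothetical, and level 2 (the reverse mapping from Muenchen) and the handling of sz are not modelled.

```python
import unicodedata

# Umlaut expansions used at level 1 (keys are the letters ä, ö, ü, Ä, Ö, Ü).
UMLAUT_EXPANSIONS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}


def strip_accents(word):
    """Remove combining marks, so that an umlauted u becomes a plain u."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def deutsch_variants(word, level):
    """Return the set of forms a word would be indexed as (levels 0 and 1)."""
    word = unicodedata.normalize("NFC", word)
    variants = {word, strip_accents(word)}
    if level >= 1:
        variants.add("".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in word))
    return variants


if __name__ == "__main__":
    print(sorted(deutsch_variants("M\u00fcnchen", 0)))
    print(sorted(deutsch_variants("M\u00fcnchen", 1)))
```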

D. Controlling metadata indexing.

-XMF<file>
<file> specifies a file defining XML field mappings.
-MMF<file>
<file> specifies a file defining meta tag mappings.
-ifb
Index a special word '$++' at the start and end of each metadata field (on by default).
-noifb
Do not index a special word '$++' at the start and end of each metadata field.
-facet_item_sepchars=<string>
Which chars are used to separate metadata facet items. [Dflt '|']
-map[<f>]
Map anchor text in source file to metafield f. If <f> is absent, outgoing anchortext is unfielded content. (dflt <f> absent)
-EM<file>
<file> is a file of external metadata.
-NIM
Ignore explicitly specified internal metadata.
-collfield=<f>
Index the name of a collection as metadata in each doc and assign to field f.
-collection_name=
Set the name of the collection being indexed.
-noank_record
Don't extract, record or index anchortext. .anchors.gz file not processed. No link counts possible.
-noank_index
Extract and record but don't index anchortext. .anchors.gz file can be post-processed by annie-a
-noank
Temporary synonym for -noank_index. Deprecated.
-dpdf
Produce but don't process the anchors distilled file.
-nep_action=<0|1|2>
Action to take for nepotistic links: 0 - treat the same as other links; 1 - ignore links of types greater than nep_limit; 2 - limit the number of repetitions of links of types greater than nep_limit. (dflt)
-nep_limit=<0|1|2|3>
Ignore nepotistic links of types greater than the limit: 0 - unaffiliated links from outside the target domain; 1 - links from a different host; 2 - links from the same or a closely affiliated host; 3 - dynamically generated links from such a host.
-nep_cachebits=<i>
Don't let the low-value link cache grow above 2^i
-noaltanx
Don't index image alt as anchortext when an image is an anchor.
-nosrcanx
Don't index image src as anchortext when an image is an anchor.
-BL<f>
<f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
-AD<f>
<f> is a file of SECD (single entity controlled domain) affiliations. e.g. griffith.edu.au --> gu.edu.au. Links to an affiliated SECD are classified as within-domain.
-RP<f>
<f> is a file of CGI parameters which should be removed from source and target URLs: padre generates a regular expression from the lines in <f>; if <f> is "conf_file" the regex will be taken from crawler.remove_parameters in the Funnelback config file.
-A<pat>
<pat> is an acceptable link target pattern: URLs not matching pat will not be stored in anchors.gz file; if pat is "conf_file" pat will be taken from include_patterns in Funnelback config file.
-F<file>
<file> is an additional anchor text file.
-FN<file>
Like -F but source URLs need not be looked up.
-RD<dir>
<dir> is a directory in which to look for redirects and duplicates files (produced by Funnelback etc. & PADRE).
-igmaf
Ignore main anchors file.
-mule<n>
Discard links to URL targets longer than <n> chars. Default is no limit.
-rmat
Record targets of failed anchor lookups via stdout.
-create_phrase_metadata_terms=<b>
Enables the creation of phrase terms like "$++ foo bar $++" in the dictionary for metadata. These phrase terms can be used to speed up queries like a:"$++ foo bar $++". Phrases will only be created if indexing of field boundaries is enabled, which it is by default. Disabling may reduce indexing time and index size.
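
The interaction between -nep_action and -nep_limit in this section can be sketched as a small decision function. This is an illustrative reading of the descriptions above, not PADRE code: the function name and the returned labels are hypothetical, and link types are assumed to be the integers 0 to 3 listed under -nep_limit.

```python
def nepotistic_link_treatment(link_type, nep_action, nep_limit):
    """Describe how a link of the given type would be treated.

    link_type and nep_limit use the 0-3 scale listed under -nep_limit;
    nep_action is 0, 1 or 2 as listed under -nep_action.
    """
    if nep_action == 0 or link_type <= nep_limit:
        return "normal"  # treated the same as other links
    if nep_action == 1:
        return "ignored"
    return "repetitions limited"  # nep_action == 2


if __name__ == "__main__":
    # With -nep_action=1 -nep_limit=2, only type 3 links exceed the limit.
    for link_type in range(4):
        print(link_type, nepotistic_link_treatment(link_type, 1, 2))
```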

F. Controlling which index files are generated.

-nomdsf
Suppress generation of the .mdsf file.
-nolex
Suppress generation of the .lex file.
-noqicf
Suppress computation of QIC features and .qicf file.
-nohostf
Suppress computation of host features and .ghosts file.
-cleanup
Remove superfluous files from the index directory after index has completed.

G. Setting size limits.

-GSB<n>
How many gscope bytes to allow for. Default/Min: 8/2.
-big<N>
Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e. -big5).
-small
Divide word table sizes by 4 from base of 256K (i.e. use 64K).
-chamb<num>
Set decompression chamber size to <num> MB. Default 32
-RSDTF<num>
Set maximum characters in description & title fields in .results to <num>. Default 256.
-RSTAG<num>
Set number of bytes to reserve for tags in .results to num. Default 0.
-RSTXT<num>
Set maximum characters in summarisable text per doc in .results to <num>. Default 50000.
-W<num>
Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM). Default for a 64bit system is 2800
-MWIPD<num>
Maximum words indexed per document (excluding anchors). By default all words are indexed.
-maxdocs<num>
Maximum no. of documents to index. Others are ignored.
-mdsfml<n>
Set the number of bytes used for MetaData Summary Field Maximum Lengths. Fields larger than this number will be truncated. Default is 2048.
-99%
Limit on how full the word hash table can get.
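
The word table arithmetic behind -big<N> and -small in this section can be checked with a short calculation. This sketch only reproduces the figures stated above (a 256K base, multiplied by 2^N or divided by 4); the function name is hypothetical and the unit of the table size is not specified on this page.

```python
BASE_TABLE_SIZE = 256 * 1024  # the 256K base stated for -big and -small


def word_table_size(big_n=5, small=False):
    """Return the word table size for -big<N> (default -big5) or -small."""
    if small:
        return BASE_TABLE_SIZE // 4
    return BASE_TABLE_SIZE * 2 ** big_n


if __name__ == "__main__":
    print(word_table_size() // (1024 * 1024), "M with the default -big5")
    print(word_table_size(big_n=8) // (1024 * 1024), "M with -big8")
    print(word_table_size(small=True) // 1024, "K with -small")
```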

H. Special indexing modes.

-duplicate_urls=flag|ignore
(Default is flag.) Documents whose URL checksum is identical to that of another document are normally flagged and suppressed from results.
-urlchecksums=case_sensitive|case_insensitive
(Default is case_insensitive).
-paidads
If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
-doc_feature_regex=<Regex>
Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX. The presence or absence of this feature can be used in the ranking function, controlled by cool.29 and cool.30.
-iolap
Overlap reading of bundles with processing them.
-utf8input
Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.)
-isoinput
Assume all input files whose charset is not specified are ISO_8859-1 encoded.
-force_iso
Forcibly assume all input files are ISO_8859-1 encoded.
-URLP<str>
When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections.)
-lmd
HTTP LastModified date takes priority over metadata dates.
-lmd_never
Completely ignore HTTP LastModified dates.
-future_dates_ok
Option is ignored (future dates are always ok).
-DT<str>
Interpret <str> as start of new doc within bundle. (Not a regular expression). Note that there is a separate mechanism for XML.
-annie[<exec>]
After normal indexing is complete, attempt to build an annotation index (annie) and a spelling suggestion file. Default executables are annie-a and build-spelling-index in the directory from which padre-iw was run.
-speller[<exec>]
Allows the explicit specification of a spelling_index builder to run after annie-a.
-spelleroff
Turns off spelling-index building even if annie-a runs.
-spelling_threshold<i>
Annotations with fewer than i occurrences will not be considered as spelling suggestions. (dflt 1)
-bigweb
Space-saving option for bundled large crawl indexes. Roughly equivalent to: -nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2 -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet. A shorter average word length is assumed. You can add, for example, -Axxx.com to cut anchor processing time. (Don't forget to make dupredrex.txt in the index directory.)

I. Miscellaneous options.

-O<name>
<name> is the name of this organisation.
-T<path>
Specify a large temporary filespace for use by the indexer.
-redis_host=<str>
Hostname/IP of a Redis server where progress status should be written
-redis_port=<i>
Port of the Redis server. Default is 6379

S. Security options.

-security_level=<i>
Any non-zero value requires every document to have at least one lock. If set to 1, documents without locks will be excluded; if set to greater than 1, indexing will stop.
-security_mindocs=<i>
There must be at least this number of documents with at least one lock.
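
The behaviour described for -security_level can be sketched as follows. This is an interpretation of the description above rather than PADRE's implementation: the function name is hypothetical, each document is represented only by its lock count, and "indexing will stop" is modelled as raising an exception.

```python
def filter_by_security_level(lock_counts, security_level):
    """Return the lock counts of the documents that would be indexed.

    lock_counts holds the number of locks on each document, in order.
    """
    if security_level == 0:
        return list(lock_counts)  # no lock requirement
    kept = []
    for locks in lock_counts:
        if locks > 0:
            kept.append(locks)
        elif security_level > 1:
            raise RuntimeError("document without locks: indexing stops")
        # security_level == 1: the unlocked document is excluded
    return kept


if __name__ == "__main__":
    print(filter_by_security_level([1, 0, 2], security_level=1))
```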

See also url_exclusion options in Section B above.

Default value

indexer_options=

Examples

Increase the maximum length of metadata summary fields to 10000 bytes and index no more than 1000 documents:

indexer_options=-mdsfml10000 -maxdocs1000
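
When scripting around a collection, it can be useful to read this setting back as a list of individual options. The sketch below assumes that collection.cfg consists of simple key=value lines and that options are separated by whitespace; the function name is hypothetical and this is not a Funnelback API.

```python
def read_indexer_options(cfg_text):
    """Return the indexer options from collection.cfg text as a list.

    Splits each line on the first '=' only, so options that themselves
    contain '=' (such as -nep_action=1) are kept intact.
    """
    for line in cfg_text.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() == "indexer_options":
            return value.split()
    return []  # setting absent


if __name__ == "__main__":
    print(read_indexer_options("indexer_options=-mdsfml10000 -maxdocs1000"))
```

If the setting appeared more than once, this sketch would return the first occurrence; how Funnelback itself resolves repeated keys is not covered on this page.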

v15.12.0