indexer_options
Indexer command-line options. Options are separated by whitespace, so an individual option cannot contain embedded whitespace characters.
Key: indexer_options
Type: String
Can be set in: collection.cfg
Description
This option specifies additional options to pass to the indexer when indexing collections. The PADRE (PArallel Document Retrieval Engine) indexer can be finely controlled through a large set of options, which can be specified in this collection configuration parameter. The available options are listed below under "Indexer options".
Default Value
indexer_options=
Examples
Increase the size of metadata fields to 10000 characters and index no more than 1000 documents
indexer_options=-mdsfml10000 -maxdocs1000
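Because options are separated by whitespace, the value above breaks down into two options. The following Python sketch only illustrates that rule. `split_indexer_options` is a hypothetical helper, not part of Funnelback, and PADRE's own parsing may differ in detail.

```python
def split_indexer_options(value: str) -> list[str]:
    """Split an indexer_options value into individual options (illustrative only)."""
    # str.split() with no argument splits on any run of whitespace.
    return value.split()


# The example value from this page yields two options.
print(split_indexer_options("-mdsfml10000 -maxdocs1000"))
# ['-mdsfml10000', '-maxdocs1000']

# A value with an embedded space is seen as two separate tokens,
# which is why an option cannot contain whitespace.
print(split_indexer_options("-OExample Org"))
# ['-OExample', 'Org']
```

In the second call the stray token `Org` is not a valid option, and per the caveats below an invalid option prevents indexing.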
⚠ Caveats
- Indexing will not occur if the indexer is given an invalid option.
- Indexer options can affect Funnelback's performance, so change them with caution.
- Options in group A are generally useful only when running PADRE from a command line and usually should not be included in indexer_options.
Indexer options
A. Getting information about PADRE and its operation.
-V
- Print PADRE version number and exit.
-ixform
- Print the index format version created by this indexer and exit.
-help
- Print this list and exit.
-debug
- Generate debugging output.
-show_each_word_indexed
- For debugging. Show each word occurrence (with field) as it is indexed.
-show_each_word_to_file
- For debugging. Print each word occurrence (with field) to <filestem>.words_in_docs
-hashlog
- Create a .hashlog file with incremental hashing stats.
-quiet
- Use terse logging.
-ankdebug
- Generate debugging output relating to anchortext.
-termdeb<term>
- Print debugging messages relating to the indexing of <term>.
B. Controlling what is indexed.
-nometa
- Don't index any metadata except t, d and k (titles, dates and links).
-nomdsfconcat
- Don't concatenate strings in the mdsf file. Record first only. (Others are still indexed.)
-diwimuu
- Don't index words in made-up URLs (those constructed from filepath).
-dias
- Don't index link anchor source as part of source documents (<a> only).
-ibd
- Index all documents even if they appear to be binary.
-ixcom
- Index words in HTML and XML comments.
-select<num1>,<num2>
- Index every num1th file/bundle starting from num2th (from zero).
-select-doc-in-bundle=<interval>,<offset>
- Index every <interval>th document within a bundle, starting from <offset> (which starts at zero). Only works with warc store.
-tarpat<regex>
- Filenames in a tarfile being indexed must match regex. Default is match-everything.
-csv=<fsep><skipfirst>[<quote>]
- Deprecated. Use the CSV to XML filter instead. Files which are not clearly something else are assumed to be in CSV format. fsep is the ASCII field separator, typically a comma (tab is represented by t). skipfirst is either y or n, telling PADRE whether the first line in a CSV file should be skipped. quote is the character used to quote strings in fields which may contain separators (you will probably have to escape it on the command line). If not specified, no quote character is defined. To include a quote character within a quoted section, the quote may be doubled.
-csv_fields=<comma_separated_descriptor_list>
- Deprecated. Use the CSV to XML filter instead. This is a comma-separated list of descriptors describing how to index each column of the CSV file. To index terms in a column as document text use '-'. To index terms in a column as metadata use the format: <metadata class name><content type>. To skip terms in a column use 'X'. For example, 't1,-,X' would set the first column to title, index the second column as document content and skip the third column. The content type defined in this argument should be the same as the content type in the metadata mappings.
-check_url_exclusion=<on|off>
- URLs matching url_exclusion pattern will not be searchable. (Default on.)
-url_exclusion_pattern=<regex>
- Exclusion pattern to use if URLs are vetted. (Default 'file://$SEARCH_HOME/')
-filepath_exclusion_pattern=<regex>
- Exclusion pattern to use if files are to be excluded from indexing on the basis of the filepath. If applicable, this is more efficient than excluding by URL because the URL can't be finally determined until the content has been scanned. (Default: not set)
-index_subversion_dirs
- Normally the .svn directories created by the subversion version control system are not indexed. Override this default.
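The sampling described for -select<num1>,<num2> in this section can be pictured with a short Python sketch. It assumes the option picks zero-based positions num2, num2 + num1, num2 + 2*num1 and so on. That is one reading of the description, not confirmed PADRE behaviour, and `selected_positions` is a hypothetical name.

```python
def selected_positions(total: int, num1: int, num2: int) -> list[int]:
    """Zero-based file/bundle positions picked under the assumed reading of -select."""
    # Start at position num2 and step by num1 until the input is exhausted.
    return list(range(num2, total, num1))


# With 10 files, -select3,1 would pick every 3rd file starting from position 1.
print(selected_positions(10, 3, 1))
# [1, 4, 7]
```

Under the same assumption, -select-doc-in-bundle=<interval>,<offset> would apply this selection to documents inside a bundle rather than to files or bundles.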
C. Controlling how things are indexed.
-noax
- Don't conflate accents.
-unimap=<mapname>
- Specify a Unicode mapping to be applied when indexing and when query processing. Supported values: tosimplified and totraditional. (Chinese only.)
-deutsch=<i>
- How much extra processing is done for umlaut and sz: 0 - none. München is indexed as München and Munchen; 1 - München is indexed as München, Muenchen and Munchen (Dflt); 2 - As for 1 but also Muenchen is indexed as München, Muenchen and Munchen. (As a side-effect to allow for compounds, SORT_SIGNIF is increased to 40.)
-nz=<i>
- How much extra processing is done for Māori: 0 - none. Māori and Mäori are indexed as Māori or Mäori resp. and Maori (Dflt); 1 - Māori is indexed as Māori, Maaori and Maori; 2 - Mäori is indexed as Mäori, Māori, Maaori and Maori
-no_cjkt_grams
- Suppress the indexing of bigrams/unigrams in CJKT text. It is assumed that said text has been pre-segmented into words, and that normal word-based indexing is needed.
-QL_depth=<i>
- Activate quicklinks on default pages of up to depth i. Use internal QL defaults. (Dflt 0 = Off)
-QL_config=<f>
- Activate quicklinks. Read quicklinks configuration options from file f.
-docscan_depth=<i>
- When trying to determine the document type and charset, the indexer will look up to i characters into the document. (Dflt 20480)
-forcexml
- Use the XML parser on all documents.
-case
- Store case information in postings. Currently unsupported. Note that setting this reduces the approximate max number of unique terms from ~950M to ~240M.
-SORTSIG<num>
- How many [UTF-8] characters in a word are significant. Default 20
-dilw
- Don't index words or use words in summaries that are longer than what is set by -SORTSIG.
-nocanon
- Don't canonicalise URLs when storing URLs or matching anchortext.
-ignore_link_rel_canonical
- Ignore canonical URL declarations in HTML link elements.
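The -deutsch levels 0 and 1 described in this section can be illustrated with the München example. This Python sketch only mirrors the forms listed there and is not PADRE's implementation. The mapping tables and the `deutsch_variants` name are assumptions. Level 2 and the handling of sz are left out because the description gives no worked example for them.

```python
# Assumed umlaut mappings, chosen to reproduce the München example.
UMLAUT_EXPANDED = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
UMLAUT_STRIPPED = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"}


def deutsch_variants(word: str, level: int) -> list[str]:
    """Forms under which a word is indexed for -deutsch=0 or -deutsch=1 (sketch)."""
    expanded = "".join(UMLAUT_EXPANDED.get(ch, ch) for ch in word)
    stripped = "".join(UMLAUT_STRIPPED.get(ch, ch) for ch in word)
    forms = [word]
    if level >= 1:
        forms.append(expanded)  # e.g. Muenchen, only from level 1
    forms.append(stripped)      # e.g. Munchen, at both levels
    # Remove repeats while keeping order; a word without umlauts yields one form.
    return list(dict.fromkeys(forms))


print(deutsch_variants("München", 0))
# ['München', 'Munchen']
print(deutsch_variants("München", 1))
# ['München', 'Muenchen', 'Munchen']
```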
D. Controlling metadata indexing.
-XMF<file>
- <file> specifies a file defining XML field mappings.
-MMF<file>
- <file> specifies a file defining meta tag mappings.
-ifb
- Index a special word '$++' at the start and end of each metadata field (on by default).
-noifb
- Do not index a special word '$++' at the start and end of each metadata field.
-facet_item_sepchars=<string>
- Which chars are used to separate metadata facet items. [Dflt '|']
-map[<f>]
- Map anchor text in source file to metafield f. If <f> is absent, outgoing anchortext is unfielded content. (dflt <f> absent)
-EM<file>
- <file> is a file of external metadata.
-NIM
- Ignore explicitly specified internal metadata.
-collfield=<f>
- Index the name of a collection as metadata in each doc and assign to field f.
-collection_name=
- Set the name of the collection being indexed.
E. Controlling link and anchortext handling.
-noank_record
- Don't extract, record or index anchortext. .anchors.gz file not processed. No link counts possible.
-noank_index
- Extract and record but don't index anchortext. .anchors.gz file can be post-processed by annie-a.
-noank
- Temporary synonym for -noank_index. Deprecated.
-dpdf
- Produce but don't process the anchors distilled file.
-nep_action=<0|1|2>
- Action to take for nepotistic links: 0 - treat the same as other links; 1 - ignore links of types greater than nep_limit; 2 - limit the number of repetitions of links of types greater than nep_limit. (dflt)
-nep_limit=<0|1|2|3>
- Ignore nepotistic links of types greater than the limit: 0 - unaffiliated links from outside the target domain; 1 - links from a different host; 2 - links from the same or a closely affiliated host; 3 - dynamically generated links from such a host.
-nep_cachebits=<i>
- Don't let the low-value link cache grow above 2^i
-noaltanx
- Don't index image alt as anchortext when an image is an anchor.
-nosrcanx
- Don't index image src as anchortext when an image is an anchor.
-BL<f>
- <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
-AD<f>
- <f> is a file of SECD (single entity controlled domain) affiliations. e.g. griffith.edu.au --> gu.edu.au. Links to an affiliated SECD are classified as within-domain.
-RP<f>
- <f> is a file of CGI parameters which should be removed from source and target URLs: PADRE generates a regular expression from the lines in <f>; if <f> is "conf_file" the regex will be taken from crawler.remove_parameters in the Funnelback config file.
-A<pat>
- <pat> is an acceptable link target pattern: URLs not matching pat will not be stored in anchors.gz file; if pat is "conf_file" pat will be taken from include_patterns in Funnelback config file.
-F<file>
- <file> is an additional anchor text file.
-FN<file>
- Like -F but source URLs need not be looked up.
-RD<dir>
- <dir> is a directory in which to look for redirects and duplicates files (produced by Funnelback etc. & PADRE).
-igmaf
- Ignore main anchors file.
-mule<n>
- Discard links to URL targets longer than <n> chars. Default is no limit.
-rmat
- Record targets of failed anchor lookups via stdout.
-create_phrase_metadata_terms=<b>
- Enables the creation of phrase terms like "$++ foo bar $++" in the dictionary for metadata. These phrase terms can be used to speed up queries like a:"$++ foo bar $++". Phrases will only be created if indexing of field boundaries is enabled, which it is by default. Disabling may reduce indexing time and index size.
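The phrase terms described for -create_phrase_metadata_terms wrap a metadata value in the '$++' field-boundary word. The Python sketch below builds such a term and the matching query string from the example in that description. The helper names are hypothetical, and any normalisation PADRE applies to terms (case, accents) is not modelled.

```python
FIELD_BOUNDARY = "$++"  # special word indexed at the start and end of each metadata field


def phrase_metadata_term(value: str) -> str:
    """Wrap the words of a metadata value in field-boundary markers."""
    words = value.split()
    return " ".join([FIELD_BOUNDARY, *words, FIELD_BOUNDARY])


def phrase_query(field: str, value: str) -> str:
    """Build a query of the form <field>:"$++ ... $++" for an exact field match."""
    return f'{field}:"{phrase_metadata_term(value)}"'


print(phrase_query("a", "foo bar"))
# a:"$++ foo bar $++"
```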
F. Controlling which index files are generated.
-nomdsf
- Suppress generation of the .mdsf file.
-nolex
- Suppress generation of the .lex file.
-noqicf
- Suppress computation of QIC features and .qicf file.
-nohostf
- Suppress computation of host features and .ghosts file.
-cleanup
- Remove superfluous files from the index directory after index has completed.
G. Setting size limits.
-GSB<n>
- How many gscope bytes to allow for. Default/Min: 8/2.
-big<N>
- Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e. -big5).
-small
- Divide word table sizes by 4 from base of 256K (i.e. use 64K).
-chamb<num>
- Set decompression chamber size to <num> MB. Default 32
-RSDTF<num>
- Set maximum characters in description & title fields in .results to <num>. Default 256.
-RSTAG<num>
- Set number of bytes to reserve for tags in .results to num. Default 0.
-RSTXT<num>
- Set maximum characters in summarisable text per doc in .results to <num>. Default 50000.
-W<num>
- Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM). Default for a 64bit system is 2800
-MWIPD<num>
- Maximum words indexed per document (excluding anchors). By default all words are indexed.
-maxdocs<num>
- Maximum no. of documents to index. Others are ignored.
-mdsfml<n>
- Set the number of bytes used for MetaData Summary Field Maximum Lengths. Fields larger than this number will be truncated. Default is 2048.
-99%
- Limit on how full the word hash table can get.
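The word table sizes given for -big<N> and -small follow from simple arithmetic on the 256K base. The Python sketch below reproduces the figures quoted in this section. The function names are hypothetical, and the unit of the table size (entries or bytes) is not stated in the source, so the numbers are plain counts.

```python
BASE_WORD_TABLE_SIZE = 256 * 1024  # the 256K base named in the -big description


def big_table_size(n: int) -> int:
    """Word table size for -big<N>: the base multiplied by 2^N."""
    return BASE_WORD_TABLE_SIZE * 2 ** n


def small_table_size() -> int:
    """Word table size for -small: the base divided by 4."""
    return BASE_WORD_TABLE_SIZE // 4


print(big_table_size(5) // (1024 * 1024))  # 8  (the 8M default, i.e. -big5)
print(small_table_size() // 1024)          # 64 (64K)
print(big_table_size(8) // (1024 * 1024))  # 64 (64M, the -big8 used by -bigweb)
```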
H. Special indexing modes.
-duplicate_urls=flag|ignore
- (Default is flag.) Documents whose URL checksum is identical to that of another document are normally flagged and suppressed from results.
-urlchecksums=case_sensitive|case_insensitive
- (Default is case_insensitive).
-paidads
- If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
-doc_feature_regex=<Regex>
- Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX. The presence or absence of this feature can be used in the ranking function, controlled by cool.29 and cool.30.
-iolap
- Overlap reading of bundles with processing them.
-utf8input
- Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.)
-isoinput
- Assume all input files whose charset is not specified are ISO_8859-1 encoded.
-force_iso
- Forcibly assume all input files are ISO_8859-1 encoded.
-URLP<str>
- When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections.)
-lmd
- HTTP LastModified date takes priority over metadata dates.
-lmd_never
- Completely ignore HTTP LastModified dates.
-future_dates_ok
- Option is ignored (future dates are always ok).
-DT<str>
- Interpret <str> as start of new doc within bundle. (Not a regular expression). Note that there is a separate mechanism for XML.
-annie[<exec>]
- After normal indexing is complete, attempt to build an annotation index (annie) and a spelling suggestion file. Default executables are annie-a and build-spelling-index, taken from the location from which padre-iw was run.
-speller[<exec>]
- Allows the explicit specification of a spelling_index builder to run after annie-a.
-spelleroff
- Turns off spelling-index building even if annie-a runs.
-spelling_threshold<i>
- Annotations with fewer than i occurrences will not be considered as spelling suggestions. (dflt 1)
-bigweb
- Space saving option for bundled large crawl indexes. Roughly equivalent to:
-nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2 -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet
Results in: a shorter average word length is assumed. You can add e.g. -Axxx.com to cut anchor processing time. (Don't forget to make dupredrex.txt in the index directory.)
I. Miscellaneous options.
-O<name>
- <name> is the name of this organisation.
-T<path>
- Specify a large temporary filespace for use by the indexer.
-redis_host=<str>
- Hostname/IP of a Redis server where progress status should be written
-redis_port=<i>
- Port of the Redis server. Default is 6379
S. Security options.
-security_level=<i>
- Any non-zero value requires every document to have at least one lock. If set to 1, documents without locks will be excluded; if set to greater than 1, indexing will stop.
-security_mindocs=<i>
- There must be at least this number of documents with at least one lock.
See also url_exclusion options in Section B above.
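The effect of -security_level described in Section S can be summarised in a short Python sketch. It models documents as a mapping from URL to a list of locks, which is an assumption made for illustration. `apply_security_level` and `IndexingStopped` are hypothetical names, and values other than 0, 1 and greater than 1 are not described in the source.

```python
class IndexingStopped(Exception):
    """Raised in this sketch when a lock-less document halts indexing."""


def apply_security_level(docs: dict[str, list[str]], level: int) -> list[str]:
    """Return the URLs that would be indexed for a given -security_level."""
    indexed = []
    for url, locks in docs.items():
        if level == 0 or locks:
            indexed.append(url)         # no requirement, or the document has a lock
        elif level == 1:
            continue                    # level 1: lock-less documents are excluded
        else:
            raise IndexingStopped(url)  # level > 1: indexing stops
    return indexed


docs = {"http://example.com/a": ["lock1"], "http://example.com/b": []}
print(apply_security_level(docs, 0))  # both URLs
print(apply_security_level(docs, 1))  # only http://example.com/a
```

With level 2 the same input raises `IndexingStopped`, matching the "indexing will stop" case. The -security_mindocs check is not modelled here.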