Indexer Options (collection.cfg)

Description

This option specifies additional options to pass to the indexer when indexing collections. The PArallel Document Retrieval Engine (PADRE) indexer can be finely controlled through a large set of options, which are specified in this collection configuration parameter. The available options are listed below.

Caveats

  • Indexing will not occur if the indexer is given an invalid option.
  • Indexer options can affect Funnelback's performance, so change them with caution.
  • Options in group A are generally useful only when running PADRE from a command line and usually should not be included in indexer_options.

A. Getting information about PADRE and its operation

Option      Explanation
-V          Print PADRE version number and exit.
-ixform     Print the index format version created by this indexer and exit.
-help       Print this list and exit.
-debug      Generate debugging output.
-show       Show code bits generated (for debugging).
-quiet      Use terse logging.
-ankdebug   Generate debugging output relating to anchortext.
-termdeb    Print debugging messages relating to the indexing of the given term.

B. Controlling what is indexed

Option                        Explanation
-nometa                       Don't index any metadata except t, d and k (titles, dates and links).
-dias                         Don't index link anchor source as part of source documents ( only).
-ibd                          Index all documents even if they appear to be binary.
-ixcom                        Index words in HTML and XML comments.
-select<num1>,<num2>          Index every num1th document/bundle, starting from the num2th (counting from zero).
-check_url_exclusion=         URLs matching the url_exclusion pattern will not be searchable. (Default: on.)
-url_exclusion_pattern=       Exclusion pattern to use if URLs are vetted. (Default: 'file://$SEARCH_HOME/'.)
-filepath_exclusion_pattern=  Exclusion pattern to use if files are to be excluded from indexing on the basis of the filepath. If applicable, this is more efficient than excluding by URL, because the URL can't be finally determined until the content has been scanned. (Default: off.)

C. Controlling how things are indexed

Option        Explanation
-csv=[]       Index content as CSV. Setting this value to ,y\" would treat commas as separators of fields wrapped in quotes, with the first row being skipped.
-csv_fields=  Comma-separated list indicating which CSV fields are to be ignored, and which are to be indexed. Setting this value to a0,-X,t1 would assume a three-column input file, mapping the first and last fields to metadata, whilst ignoring the second field. See also metamap.cfg.
-noax         Don't conflate accents.
-QL_depth=    Activate quicklinks on default pages up to the given depth, using internal QL defaults. (Default: 0 = off.)
-QL_config=   Activate quicklinks, reading quicklinks configuration options from the given file.
-forcexml     Use the XML parser on all documents.
-vbyte        Use variable-byte compression of the inverted file. (Default.)
-SORTSIG      How many [UTF-8] characters in a word are significant. (Default: 20.)
-dilw         Don't index words, or use words in summaries, that are longer than the limit set by -SORTSIG.
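
As an illustration only, two of the options above might be combined as follows. This assumes that a numeric value is appended directly to the option name, as in the -bigweb expansion in section H. The sketch treats the first 16 characters of each word as significant and skips words longer than that:

```
indexer_options=-SORTSIG16 -dilw
```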

D. Controlling metadata indexing

Option                 Explanation
-nomdsfconcat          Don't concatenate stored metadata strings (store only the first). Note: subsequent strings are still indexed.
-XMF                   Specifies a file defining XML field mappings.
-MMF                   Specifies a file defining meta tag mappings.
-ifb                   Index a special word '$++' (index field boundary) at the start and end of each metadata field (used in facets).
-facet_item_sepchars=  Specify which characters are used to separate metadata facet items. The default value is the pipe character (i.e. |).
-mapA                  Map anchor text in source file to A:
-EM                    Specifies a file of external metadata.
-NIM                   Ignore explicitly specified internal metadata.
-collfield=            Index the name of a collection as metadata in each document and assign it to the given field.

E. Controlling anchortext and link processing

Option                Explanation
-noank_record         Don't extract, record or index anchortext.
-noank_index          Extract and record but don't index anchortext.
-dpdf                 Produce but don't process the anchors distilled file.
-nep_action={0,1,2}   Controls handling of nepotistic links.
                      0 - Handle as normal links.
                      1 - Ignore links of types greater than nep_limit.
                      2 - Limit the number of repetitions of links of types greater than nep_limit (default).
-nep_limit={0,1,2,3}  Controls the types of links which are considered to be nepotistic.
                      0 - Unaffiliated links from outside the target domain.
                      1 - Links from a different host.
                      2 - Links from the same or a closely affiliated host.
                      3 - Dynamically generated links from such a host.
-nep_cachebits={i}    Limits the size of the 'low value' link cache to 2^i.
-noaltanx             Don't index image alt as anchortext when an image is an anchor.
-nosrcanx             Don't index image src as anchortext when an image is an anchor.
-BL                   Specifies a file of source URL patterns from which links should be ignored or treated with suspicion (blacklist).
-AD                   Specifies a file of SECD (single entity controlled domain) affiliations, e.g. griffith.edu.au --> gu.edu.au. Links to an affiliated SECD are classified as within-domain.
-RP                   Specifies a file of CGI parameters which should be removed from source and target URLs. The special value conf_file can be used (i.e. "-RPconf_file") to tell the indexer to use the value of crawler.remove_parameters from collection.cfg instead of specifying an external file.
-A                    Specifies an acceptable link target pattern.
-F*                   Specifies an additional anchor text file.
-FN                   Like -F, but source URLs need not be looked up.
-RD*                  Specifies a directory in which to look for redirects and duplicates files (produced by Funnelback etc. and PADRE).
-igmaf                Ignore the main anchors file.
-mule                 Discard links to URL targets longer than the given number of characters. Default is no limit.
-rmat                 Record targets of failed anchor lookups via stdout.
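
A hedged example of tightening nepotistic link handling, assuming the brace notation above means that one of the listed values is written after the equals sign. With these settings, links from the same or a closely affiliated host (type 2) and dynamically generated links from such a host (type 3) would be ignored, and the CGI parameters to remove would be taken from crawler.remove_parameters in collection.cfg:

```
indexer_options=-nep_action=1 -nep_limit=1 -RPconf_file
```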

F. Controlling which index files are generated

Option    Explanation
-nomdsf   Suppress generation of the .mdsf file.
-nolex    Suppress generation of the .lex file.
-exlens   Create a .dlx file with explicit lengths for each document field.
-cleanup  Remove superfluous files from the index directory after indexing has completed.
-nosigs   Suppress the calculation of document text signatures and the production of the .textsig file.

G. Setting size limits

Option    Explanation
-GSB      How many bytes to allocate for gscope flags. The default is 8 bytes (i.e. 64 flags) and the minimum is 2 bytes (i.e. 16 flags). Note: all collections that are part of a meta collection must have the same GSB value or the indexes will become incompatible.
-big      Multiply word table sizes by 2^N from a base of 256K, where N is the number appended to the option. Default table size is 8M (i.e. -big5).
-small    Divide word table sizes by 4 from a base of 256K (i.e. use 64K).
-chamb    Set the decompression chamber size, in MB.
-RSDTF    Set the maximum number of characters in description and title fields in .results. Default 256.
-RSTXT    Set the maximum number of characters of summarisable text per document in .results. Default 10000.
-W        Set the index-writing window size, in MB. (Larger windows mean faster indexing at the expense of using more RAM.)
-MWIPD    Maximum words indexed per document (excluding anchors).
-maxdocs  Maximum number of documents to index. Others are ignored.
-mdsfml   Set the maximum length for strings in the .mdsf file. Default 512; maximum 2^31.
-99%      Limit on how full the word hash table can get.
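
As a sketch of how the size options combine, assuming numeric values are appended directly to the option name as in the -bigweb expansion in section H: -big6 would give a word table of 256K x 2^6 = 16M, -W2000 a 2000 MB index-writing window, and -MWIPD2000 a limit of 2000 indexed words per document. Suitable values depend on the collection and the RAM available.

```
indexer_options=-big6 -W2000 -MWIPD2000
```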

H. Special indexing modes

Option               Explanation
-future_dates_ok     Allow indexing of documents with dates in the future (instead of assuming the current date in these cases).
-paidads             If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
-doc_feature_regex=  Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX. The presence or absence of this feature can be used in the ranking function, controlled by cool29 and cool30.
-nz=                 Adjusts special processing for Māori, specifically handling of macrons. Valid values are 0 (default - no special processing), 1 (some processing) and 2 (all processing).
-iolap               Overlap reading of bundles with processing them.
-utf8input           Assume all input files whose charset is not specified are UTF-8 encoded.
-isoinput            Assume all input files whose charset is not specified are ISO_8859-1 encoded. (This is the default.)
-force_iso           Forcibly assume all input files are ISO_8859-1 encoded.
-URLP                When storing document URLs, prepend the given prefix. (This is only used if the document does not indicate its own URL with a BASE HREF element.)
-lmd                 HTTP LastModified date takes priority over metadata dates.
-lmd_never           Completely ignore HTTP LastModified dates.
-DT                  Interpret the given string as the start of a new document within a bundle. (Not a regular expression.) Note that there is a separate mechanism for XML.
-annie[]             After normal indexing is complete, build an ANNIE (annotation) index using the specified executable. Default executable is annie-i
-bigweb              Space-saving option for bundled large crawl indexes. Roughly equivalent to: -nomdsf -big7 -MWIPD2000 -W2000 -SORTSIG16 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet. A broader definition is taken of link nepotism. You can add e.g. -Axxx.com to cut anchor processing time. (Don't forget to make dupredrex.txt in the index directory.)
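
For example, a collection whose documents are mostly UTF-8 but do not declare a charset, and whose metadata dates are less reliable than the HTTP LastModified header, might use the following. This is a sketch only; whether these options suit a given collection is an assumption to verify against its content:

```
indexer_options=-utf8input -lmd
```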

I. Miscellaneous options

Option  Explanation
-O      Specifies the name of this organisation.
-T      Specifies a large temporary filespace for use by the indexer.

Default value

Empty. That is, no additional options are passed to the indexer.

Examples

To allocate 80 (10 x 8) gscopes and index no more than 1000 documents:

indexer_options=-GSB10 -maxdocs1000
