Padre Usage
This page lists all the PADRE binaries and their corresponding usage messages.
1. FineTune
Purpose: Tuning padre-sw ranking parameters based on a C-TEST file.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/FineTune <collection>[.<profile>] ... [-perl_bin=path_to_perl_bin] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-alpha=<f>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>][<mode> ...] [-conf] [-qp=padre_subpath] [-index_dir.<collection>=<index directory>] [-lock_file=<file to lock>]
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/FineTune lse -daat -annieonly
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/FineTune agosp.doha -timeout=7.5
(Use -daat0 to tune term-at-a-time.)
(Timeouts and query limits: Apply separately to each mode. Defaults
are 5 hours and 1 million queries. After a timeout, or when the
query limit has been exceeded, the best tuning found so far for
that mode will be recorded in the .best file for the mode.)
(-perl_bin=/path/to/perl used to set the path to perl binary to use)
(-alpha sets the balance between success rate and wmum1 in tuning. [Dflt 0.75]
Value must lie between 0 (ignore success rate) and 1 (ignore wmum1))
(-rvalues - sets no. of values to explore for optype=2 (real)) [Dflt 11]
(-adjust - sets no. of steps to remove when adjusting exploration range
for optype=2 (real) dimensions.) [Dflt 5]
(-conf extracts the mode to tune from collection.cfg. (N/A for multituning.))
(-help gives more detailed instructions and exits.)
(-index_dir can be used to set the index directory for a particular collection.
The directory must contain an index prefixed by 'index'.)
(-lock_file can be used to lock a file for the entire duration of tuning. If
the lock cannot be acquired, tuning will not start.)
(-redirect_stdout can be used to redirect stdout to a given file.)
(-redirect_stderr can be used to redirect stderr to a given file.)
(-write_finish_time_to writes the tuning finish time in ISO-8601 to the given
file.)
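The -alpha balance described above can be pictured as a weighted blend of the two measures. This is only a sketch: the usage message states the two endpoints (0 ignores success rate, 1 ignores wmum1) and nothing more, so the linear form and the name combined_score are assumptions, not FineTune's actual objective.

```python
def combined_score(success_rate: float, wmum1: float, alpha: float = 0.75) -> float:
    """Blend two tuning measures. Linear interpolation is an assumption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    # alpha = 0 leaves only wmum1; alpha = 1 leaves only the success rate.
    return alpha * success_rate + (1.0 - alpha) * wmum1
```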
2. QiTune
Purpose: Tuning padre-sw query-independent ranking parameters based on a C-TEST file.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/QiTune stem C-TESTfile
Given a PADRE index and a file of useful URLs (extracted from the
C-TEST file) compute a set of query-independent cool settings
suitable for passing to padre-do which (hopefully) optimise the
difference in average scores between the useful docs and the general collection.
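The quantity QiTune tries to optimise can be sketched as follows. The data layout (a dict from document number to static score) and the name score_gap are hypothetical, and "general collection" is assumed here to mean every document in the index.

```python
from statistics import mean


def score_gap(static_scores, useful_docs):
    """Average static score of the useful docs minus the collection average.

    static_scores: dict mapping document number -> static score (assumed layout).
    useful_docs:   document numbers of the useful URLs from the C-TEST file.
    """
    useful = [static_scores[d] for d in useful_docs]
    return mean(useful) - mean(static_scores.values())
```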
3. SpellTune
Purpose: Tuning PADRE spelling suggestion system based on a test file e.g. mycoll.spelltest.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/SpellTune <collection>[.<profile>] ... [-tune_bsi] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>][-qp=padre_subpath]
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/SpellTune lse -annieonly
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/SpellTune agosp.doha -timeout=7.5
(Timeouts and query limit defaults are 5 hours
and 1 million queries. After a timeout, or when the
query limit has been exceeded, the best tuning found so far
will be recorded in the .bestspell file for the mode.)
(-tune_bsi - tunes the build_spelling_index params. Slow.)
(-rvalues - sets no. of values to explore for optype=2 (real)) [Dflt 11]
(-adjust - sets no. of steps to remove when adjusting exploration range
for optype=2 (real) dimensions.) [Dflt 5]
(-help gives detailed instructions.)
4. annie-a
Annie version 1.13 (11 Mar 2010)
Purpose: Builds an annotation index for a collection, specified by <stem>, from a list of files in anchors.gz format.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/annie-a <stem> [<stem_or_file> ...] [-phrasefile=<filename>] [-deb] [-hashbits <10..30>] [-maxlines <n>] [-wts <wt0> <wt1> <wt2> <wt3> <wt4>] [-stripstops] [-STOP=<filename>] [-canon] [-rejecturls] [-rejectnumeric] [-quicken] [-maxwds <i>] [-maxlen <i>] [-build_annou=on/off] [-build_lcache=on/off] [-nep_limit=0|1|2|3]
<stem> must reference either a meta collection or a primary index.
<stem_or_file> may be either a stem as above or the name of a file in anchors.gz format.
In the case <stem> or <stem_or_file> is a meta collection, /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/annie-a will look for the anchors.gz files from each of the component collections and use them for creating the annotation index for the collection specified by <stem>. If any anchors.gz file changes for a component collection, /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/annie-a will need to be run again for the meta collection.
-quicken improves query performance by using <coll id, doc id> pairs. The coll id is dependent on the sdinfo file; if the sdinfo file is changed, /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/annie-a will need to be run again for the meta collection with this option. It is recommended that the most recent collection is placed at the top of the sdinfo file.
5. annie-quicken
Purpose: Convert URL references in an annotation index into (component, docno) to speed query processing.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/annie-quicken anno_stem index_stem
6. build_autoc
Purpose: To build a query completion file (.autoc) from a list of input files.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/build_autoc stem input_file ... [-collection name -profile name] [-partials] [-label_organics] [-debug]
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/build_autoc index blah.csv
where blah.csv will be sorted and indexed into index.autoc. Input_file(s)
must end in .csv, .suggest, or .cfg
-profile <name> - generate scoped .autoc file for the specified
profile. A previous run of build_autoc must have
been called with -index.
-collection <name> - generate scoped .autoc file for the specified
collection. Both -profile and -collection need
to be specified when generating scoped suggestions
-partials - this version allows multi-word organic
suggestions to be triggered either from the full
suggestions or from trailing word sequences. E.g.
'big fat cat' triggered from 'fat cat' and 'cat' as
well as the full string. This option turns that on.
-label_organics - present a category label for all the organic completions
-sample <val> - Sample postings of suggestion terms, to handle large
collections, <val> ranges 0 - 300; speeds up processing
with the effect of sampling the suggestions.
(1/val postings are used).
Note: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/build_autoc can now build a single .autoc file from multiple input
files of the same or different types. Files with very simple
format can be combined with hand-crafted files containing
complex actions. Completion weights from a .suggest file
are automatically determined, while they can be manually specified
in a CSV file. Completion weights from Best Bets
default to 100.
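The -partials behaviour described above (a multi-word suggestion triggered from its trailing word sequences) can be illustrated with a short sketch. The function name partial_triggers is hypothetical and whitespace splitting is an assumption; build_autoc's own tokenisation may differ.

```python
def partial_triggers(suggestion: str) -> list[str]:
    """Return the full suggestion followed by each trailing word sequence."""
    words = suggestion.split()
    return [" ".join(words[i:]) for i in range(len(words))]
```

With the example from the usage text, 'big fat cat' yields 'big fat cat', 'fat cat' and 'cat'.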
SUGGEST FORMAT
--------------
.suggest files built by build_spelling_index can be supplied as
input. Reasons for doing this include taking advantage of an index
optimised for completion purposes; and integrating automated spelling
suggestions with hand-crafted entries.
CFG FORMAT
----------
Input files with .cfg suffix are expected to be in Best Bets
format. Only exact-match lines (beginning with '+') are considered
and the target of each such line is recorded as a suggestion.
CSV FORMAT
----------
Each line of a .csv file must contain eight fields (7 commas),
corresponding to: key, weight, display, display_type, category,
category_type, action, and action_type. Fields except key and
weight may be empty.
Two meta characters are recognized within a field: backslash and
double quote. These are handled as follows in the two cases:
(A) Unquoted text: A single backslash is not passed through, while
the character following it is passed through without applying
any tests. This means that a double backslash in input leads to
a single backslash in output and that commas or double quotes
preceded by a backslash do not have their normal meaning.
(B) Quoted text: The double quotes beginning and ending a quoted
section are not passed through. Within a quoted section a double
quote may be passed through by either doubling it ("") or by
preceding it with a backslash (\").
By these means it is possible to pass through HTML and/or JSON
containing quotes and/or commas.
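The field-splitting rules (A) and (B) above can be sketched as a small parser. This is an illustration, not build_autoc's code: the name split_autoc_csv_line is hypothetical, and the handling of a backslash inside quotes that is not followed by a double quote (passed through unchanged here) is an assumption, since the text does not cover that case.

```python
def split_autoc_csv_line(line):
    """Split one CSV line into fields using the backslash and quote rules."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        nxt = line[i + 1] if i + 1 < len(line) else ""
        if in_quotes:
            # (B) Inside quotes: "" and \" both yield one literal double quote.
            if ch == '"' and nxt == '"':
                current.append('"')
                i += 2
                continue
            if ch == "\\" and nxt == '"':
                current.append('"')
                i += 2
                continue
            if ch == '"':
                in_quotes = False  # closing quote is not passed through
            else:
                current.append(ch)
        elif ch == "\\":
            # (A) Unquoted: drop the backslash, pass the next character untested.
            current.append(nxt)
            i += 2
            continue
        elif ch == '"':
            in_quotes = True  # opening quote is not passed through
        elif ch == ",":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields
```

A well-formed line should split into exactly eight fields; a caller could check len(fields) == 8 before using the result.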
7. build_match_only_index
Purpose: create a match only index, which build_autoc can use to build
profiled query suggestions.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/build_match_only_index stem
8. build_spelling_index
Purpose: To build a spelling suggestion file (.suggest/.suggest2) for a collection.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/build_spelling_index index_stem num_thresh [<metadata_class_names> [[<lexiweight>] [[<blacklist_file>] [<whitelist_file>]]]
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/build_spelling_index index 2 [@,t,c]
where the listed comma separated metadata class names,
'@,t,c', are the ones to be scanned for
suggestions. '@' means use the .anno file. '%' means use
unfielded words from index.lex. '+' means use phrases from
index.phrases (if present). If no fields are listed, "@,+,t,%"
is assumed.
num_thresh - minimum weight of suggestions recorded in suggest index
lexiweight - controls the weight of lexicon suggestions relative
to annotations. wt = lexiweight * sqrt(df) (dflt lexiweight = 1.00)
blacklist_file - manual list of suggestions which should NOT be included
in the index. (one per line)
whitelist_file - manual list of suggestions to include in the index.
(one per line.)
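The lexicon weighting formula above, wt = lexiweight * sqrt(df), can be written out directly. The function names are hypothetical, and treating num_thresh as an inclusive minimum is an assumption based on the phrase "minimum weight".

```python
import math


def lexicon_weight(df: int, lexiweight: float = 1.0) -> float:
    """Weight of a lexicon suggestion with document frequency df."""
    return lexiweight * math.sqrt(df)


def is_recorded(weight: float, num_thresh: float) -> bool:
    """Assumed rule: a suggestion is kept when its weight reaches num_thresh."""
    return weight >= num_thresh
```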
9. csv2ctest
Purpose: To convert a tuning file in CSV format into a C-TEST file for use with e.g. FineTune.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/csv2ctest: infile.csv [-utils=recip|-utils=equal] [-queryweights]
Output in C-TEST format will be in infile.ctest
The input file is assumed to be a syntactically correct comma separated
value (CSV) file in which cells are separated by commas. Double quotes
around all or part of a cell allow inclusion of commas. The quotes are
stripped off before processing. The input file may contain comment
lines starting with a hash.
The first column in infile.csv is always assumed to contain a query.
If no options are given, then the remaining columns contain desired
answer URLs for that query, in descending order of utility. Utility
scores start at 4 and then gradually decline to 1: 4, 3, 2, 1, 1, 1 ...
This behaviour may be modified as follows:
-utils=equal - All of the answers are given equal utility values.
-utils=sqrt - Utility values drop off as 1/sqrt(rank).
-utils=recip - Utility values drop off faster -- as 1/rank.
-queryweights - if this is given, the second column is expected to
contain the numerical weight associated with the query
and the remaining columns contain the answer URLs
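The utility schemes above can be compared with a short sketch. Ranks are 1-based. The function name utility is hypothetical, the constant used for -utils=equal (1.0) is an assumption, and csv2ctest may scale the sqrt and recip values differently.

```python
import math


def utility(rank: int, scheme: str = "default") -> float:
    """Utility assigned to the answer at a given 1-based rank."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    if scheme == "default":
        return float(max(5 - rank, 1))  # 4, 3, 2, 1, 1, 1 ...
    if scheme == "equal":
        return 1.0  # assumed constant; the source only says "equal"
    if scheme == "sqrt":
        return 1.0 / math.sqrt(rank)
    if scheme == "recip":
        return 1.0 / rank
    raise ValueError("unknown scheme: " + scheme)
```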
10. dump_annotation_file
Purpose: To display the contents of an annotation index in geek-readable form.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/dump_annotation_file <annotationfile>
11. dump_autoc
Purpose: To display the contents of a query completion file in geek-readable format.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/dump_autoc <stem|collection|autoc_file>
12. dump_suggestion_file
Purpose: To display the contents of a spelling suggestion file in geek-readable format.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/dump_suggestion_file <index_stem>
- dumps contents of <index_stem>.suggest
13. get_docnum_from_url
Purpose: Map a URL to that document's number within an index stem.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/get_docnum_from_url <index_stem> <url>
Prints the docnum for a given URL to standard out.
Prints "notfound" if the URL is not found.
<index_stem> - the common prefix (including path) of the index files
<url> - the URL to look up the document number for
14. get_url_from_component_document_pair
Purpose: Within an index, output the URL of the document identified by component number and document number.
Usage: get_url_from_component_document_pair <index_stem> <component_number> <document_number>
Warning: Doesn't handle nested .sdinfo files. (Hierarchical meta collections.)
15. get_url_from_docnum
Purpose: Print the URL of a document in a primary index, given its document number. Inverse of get_docnum_from_url.
Usage: get_url_from_docnum <index_stem> <doc_num>
16. harvest_anchortext
Purpose: Extract a subset of entries in a list of anchors.gz files which match a specified pattern.
Usage 1: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/harvest_anchortext -targ|-text|-any|-source pattern anchor_text_file ...
Usage 2: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/harvest_anchortext -noneps <affiliates_file> anchor_text_file ...
Extracts a subset of lines in the anchor_text_files. The composition
of the subset depends upon the match_type argument as follows:
-any - Any line (source or target) which matches pattern
-targ - Any target line whose URL target matches pattern
-text - Any target line whose anchortext matches pattern
-source - Any source line which matches pattern
+ In usage 2, links within the same SECD (single-entity-controlled-domain)
are suppressed, as are links between affiliated pairs of hosts listed in
the affiliates_file. If there is no affiliates_file, use '-'.
+ Whenever a target line matches, the corresponding source line is also output.
+ Whenever a source line matches, its corresponding target lines are also output.
+ NOTE: nepotistic links are included unless -noneps is used.
17. hierarchical_navpaths
Purpose: Extract hierarchical navigation paths from a list of anchors files.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/hierarchical_navpaths <stem> [-verbose] [<anchor_text_file> ...]
Reads <stem>.anchors.gz, plus any additional anchortext files to
identify hierarchical navigation paths (HNPs). These are output
to <stem>.hnp.anchors.gz in standard anchors.gz format:
<target_url> --- [H]<concatenated anchors from path>
+ NOTES:
1. inter-host links are ignored.
2. -verbose prints the actual HN paths to stdout.
3. All targets in the .hnp.anchors file have http://hnp as source.
Warning: Not ready for use. Development of this utility is incomplete.
18. host_host_link_counts
Purpose: Analyse a list of anchors.gz files and report on frequencies of inter-host links.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/host_host_link_counts [-targ|-source <pattern>] [-report] <stem> [anchor_text_file ...]
Reads the anchor_text_files and outputs a table of host-host links,
in descending order of link count. By default, all lines are
processed, but a pattern can optionally be applied to either
targets or sources.
If -report is given, short and full HTML reports will be generated.
-targ - Process only links whose target host matches pattern
-source - Process only links whose source host matches pattern
Nowadays, the first option not starting with a - is an index stem.
A file <stem>.hosts is created with a table of host-related feature
scores which can be used in ranking. The order of entries must
correspond to the hostnum order assigned by padre-iw.
+ NOTE: within-host links are excluded.
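The counting step described above (inter-host links tallied and listed in descending order, with within-host links excluded) can be sketched as follows. The input here is a plain list of (source URL, target URL) pairs, which is an assumption made for illustration; the real tool reads anchors.gz files, whose format is not reproduced here.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_link_counts(links):
    """Count links per (source host, target host), most frequent first."""
    counts = Counter()
    for source_url, target_url in links:
        src = urlsplit(source_url).hostname
        dst = urlsplit(target_url).hostname
        # Skip unparseable URLs and within-host links.
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts.most_common()
```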
19. padre-arg-sw
Purpose: To help with conversion of padre-sw argument lists from old to key=value format
20. padre-cc
Purpose: To build an index.collapsig file to permit use of collapsed rankings.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-cc <index_stem> [-collapse_control=<string>] [-debug=on]
Utility for building a .collapsig file of collapsing
signatures. If no control_string is given, a one-column
file is built using the signatures from the .textsig file.
The collapse_control string must consist of sets of sequences of
metadata class names. Each set should be surrounded by square
brackets, and sets should be separated by commas. Metadata class
names are the elements of the sets and must be separated by
commas.
The characters $ and # may be used as special metadata class
names and represent document summarisable text and
document URL respectively.
In future, it is planned to allow special metadata class
names to be followed by a regular expression,
indicating that only the part of the metadata string which matches
the regex should be used in calculating the signature.
Example current control string: '[$],[t,a]'. In this case the .collapsig
will have two signatures per document: Column 0 is the normal document
signature and column 1 is a signature derived from the concatenation of
metadata fields t and a, in that order.
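The structure of a collapse_control string can be made concrete with a small parser. This sketch only checks the bracket-and-comma layout described above; the name parse_collapse_control is hypothetical and padre-cc's own validation may be stricter or looser.

```python
import re


def parse_collapse_control(control: str) -> list[list[str]]:
    """Split a control string such as '[$],[t,a]' into lists of class names."""
    control = control.replace(" ", "")
    sets = re.findall(r"\[([^\[\]]*)\]", control)
    # Rebuilding the string must give back the input, otherwise it is malformed.
    if not sets or ",".join("[" + s + "]" for s in sets) != control:
        raise ValueError("malformed collapse_control string")
    parsed = [s.split(",") for s in sets]
    if any(name == "" for names in parsed for name in names):
        raise ValueError("empty metadata class name")
    return parsed
```

Each inner list corresponds to one signature column, in order.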
21. padre-ct
Purpose: Report on the titles in a PADRE index. Eventually to improve them.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-ct <index_stem>
Warning: Development of this utility is not yet complete.
22. padre-cw
Purpose: To check the correctness of an index, compare two indexes, or display postings for a term within an index.
Usage0: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-cw -v - print PADRE version
Usage1: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-cw stem1 stem2 [-io] - Compare two indexes.
-io means ignore differences in offsets into .idx
Usage2: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-cw stem1 -show term - show postings for term.
Also shows the terms before and after it (if applicable).
Usage3: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-cw stem1 -check [-stemsuff] [-show_all] - Check index files for stem1 (default)
use -stemsuff <suffix> to supply an additional suffix for the .idx and dct files
use -show_all to print every term's summary information.
23. padre-di
Purpose: To display the metadata for documents in an index. (main purpose)
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-di <index_stem> [-check]|[-trecids]|[-metao [<docno>]][-meta [<pattern>] ] | [-metad [<pattern>] ]
-check - check whether the document table appears to be internally consistent
-trecid - make a mapping between trec DOCNO stored in title field and URL
-meta [<pattern>] - print title and metadata information for each document
whose "URL" contains pattern (case-insensitive)
If no pattern is given, all docs are shown, in collection order.
-metad - as for -meta but show document numbers.
-metao - as for -meta but show all documents, in collection order starting
from docno (default zero).
-doc_per_meta - prints in JSON the number of documents each metadata class appears in.
default - read in URLs and look them up, using sorted table
24. padre-do
Purpose: Print a permutation of the document numbers in an index corresponding to descending static score.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-do <stem> <docorderfile> [-deb] [-cool_param ...]
Output is a list of docnums in descending order of cool score,
printed to docorderfile. cool_param values are expected to lie in 0 - 1.
Default values are the same as for padre-sw though. Of course,
query-dependent cool values cool0, 7, 12, 15, 16, 17, 18, 19 are ignored
because there is no query.
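The output of padre-do is a permutation of document numbers in descending order of static score. A minimal sketch of that ordering, assuming the scores are already available as a list indexed by document number (how padre-do computes them from the cool parameters is not shown, and tie-breaking by document number is an assumption):

```python
def doc_order(static_scores):
    """Document numbers sorted by descending static score.

    Ties keep ascending document-number order because sorted() is stable.
    """
    return sorted(range(len(static_scores)),
                  key=lambda docnum: static_scores[docnum],
                  reverse=True)
```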
25. padre-fl
Purpose: Display or operate on the document flags in an index.
Usage1: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry|-quicken]
Usage2: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill
Usage3: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill
Usage4: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR
Usage5: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-fl <index_stem> -kill-docnum-list <file_of_docnums>
Usage6: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-fl -v
Note: Specify '-' as the file of url patterns to supply a single URL to standard input.
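Usage4 combines a hexadecimal bit pattern with a document's flags using OR, AND or XOR. The sketch below shows the three bitwise operations on an integer flag word. It assumes the pattern is applied as a plain mask; the name apply_bits and the integer representation of the flags are hypothetical.

```python
def apply_bits(flags: int, hexbits: str, op: str) -> int:
    """Combine a flag word with a hex mask using OR, AND or XOR."""
    mask = int(hexbits, 16)
    if op == "OR":
        return flags | mask   # set the bits in the mask
    if op == "AND":
        return flags & mask   # keep only the bits in the mask
    if op == "XOR":
        return flags ^ mask   # toggle the bits in the mask
    raise ValueError("op must be OR, AND or XOR")
```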
26. padre-gs
Purpose: Display or manipulate document gscopes in an index.
Usage0: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-gs -v|-V|-help # print version info or detailed help
on types of instructions and on program operation.
Usage1: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-gs index_stem -clear # clear all gscopes
Usage2: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-gs index_stem -show # show all gscopes
Usage3: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-gs index_stem file_of_instructions [-separate] [other_gscope]
[-regex|-url|-docnum] [-verbose] [-quiet] [-dont_backup]
Where:
* index_stem may also be the name of a collection
* file_of_instructions may be '-' to accept instructions from stdin
* -separate indicates that gscope changes should be made to a
copy of the .dt file first, and then copied over the original file
when changes are complete. In this mode the number of gscope bits
can NOT be expanded; you must ensure that enough are available.
* other_gscope specifies a gscope to be set on documents which
end up with no gscopes set.
* By default instruction patterns are expected to be regexes
but this may be made explicit with -regex or altered with -url
or -docnum. Use /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-gs -help to obtain more information about
instruction formats and pattern types.
* gscope names may consist of alphanumeric ascii characters up to a length
of 64 characters.
* -dont_backup prevents backing up of the .dt file
* -quiet suppresses the before and after summary of gscopes
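The gscope naming rule above (alphanumeric ASCII characters, up to 64 of them) can be checked before an instruction file is handed to padre-gs. A minimal sketch; the function name is hypothetical and rejecting the empty string is an assumption.

```python
import re

# ASCII letters and digits only, between 1 and 64 characters.
GSCOPE_NAME = re.compile(r"[A-Za-z0-9]{1,64}")


def is_valid_gscope_name(name: str) -> bool:
    """True when name satisfies the documented gscope naming rule."""
    return GSCOPE_NAME.fullmatch(name) is not None
```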
27. padre-i4u
Purpose: Display aggregated information about a URL from a PADRE index.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-i4u -v | /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-i4u stem=<stem_or_collname> [fields=<alnum_string>] [debug=<int>] [iters=<int>] [format=json/old] [coll=collection_name] url=<url> ...
Note: The functionality is implemented by a dynamic library which is usually called directly.
coll= option should be set to the name of the collection corresponding to the stem= option
28. padre-iw
FUNNELBACK_PADRE_15.24.0.15-IFUL MDPLFS (Web/Ent) $Revision: 42926 $ [64 bit]
Today is: 20220602 (according to the OS)
Purpose: Index a collection of documents
Usage1: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-iw -V|-help|-ixform (print version or help info.)
Usage2: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-iw [-f|-tar|-reo<pf>] <dir>|<file>|<url> <filestem> [<option>|<tfdir> ...]
Usage3: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-iw -secondary_update <dir>|<file> <filestem>
<pf> is a text file containing a permutation of the document numbers in the original index.
<dir> is a hierarchical directory of optionally gzipped files.
<file> contains a list of names of optionally gzipped files.
-f says that <file> is a single datafile to be indexed. For historic reasons
-tar means the same as -f.
Files to be indexed may be tar or WARC files (optionally gzipped).
Note that individual files in a tarfile are expected to be uncompressed
text. Gzipped files, unfiltered PDFs etc. are not supported yet.
-reo says that <file> is the stem of a previous index to be
reordered and reindexed. <pf> is a text file containing
a permutation of the document numbers in the original index. Eventually,
it may be possible to compute the permutation internally. For now, it
must be specified via <pf>.
<filestem> prefixes the names of output files.
<tfdir> is the directory in which temporary files will be written.
-secondary_update creates a secondary index using the data directory specified, and using the options used in creating the primary index.
Available options:
A. Getting information about PADRE and its operation.
-V - Print PADRE version number and exit.
-ixform - Print the index format version created by this indexer and exit.
-help - Print this list and exit.
-debug - Generate debugging output.
-show_each_word_indexed - For debugging. Show each word occurrence (with field) as it is indexed.
-show_each_word_to_file - For debugging. Print each word occurrence (with field) to <filestem>.words_in_docs
-hashlog - Create a .hashlog file with incremental hashing stats.
-quiet - Use terse logging.
-ankdebug - Generate debugging output relating to anchortext.
-termdeb<term> - print debugging messages relating to the indexing of <term>.
B. Controlling what is indexed.
-nometa - Don't index any metadata except t, d and k (titles, dates and links).
-nomdsfconcat - Don't concatenate strings in the mdsf file. Record first only. (Others are still indexed.)
-diwimuu - Don't index words in made-up URLs (those constructed from filepath).
-dias - Don't index link anchor source as part of source documents (<a> only).
-ibd - Index all documents even if they appear to be binary.
-ixcom - Index words in HTML and XML comments.
-select<num1>,<num2> - Index every num1th file/bundle starting from
num2th(from zero).
-select-doc-in-bundle=<interval>,<offset> - Index every <interval> document
within a bundle starting from <offset> (which starts at zero).
Only works with warc store.
-tarpat<regex> - Filenames in a tarfile being indexed must match regex. Default is match-everything.
-csv=<fsep><skipfirst>[<quote>] - Deprecated. Use the CSV to XML filter instead. Files which are
not clearly something else
are assumed to be CSV format.
fsep is ascii field separator, typically comma.
(tab is represented by t.)
skipfirst is either y or n, telling padre whether the first
line in a CSV file should be skipped.
quote is the character used to quote strings in fields
which may contain separators. (You probably have to
escape it on the command line.) If not specified,
no quote character is defined. To include a quote
character within a quoted section, the quote may be doubled.
-csv_fields=<comma_separated_descriptor_list> - Deprecated. Use the CSV to XML filter instead.
This is a list of comma
separated descriptors describing how to index each column of
the csv file.
To index terms in a column as document text use '-'.
To index terms in a column as metadata use the format:
<metadata class name><content type>
To skip terms in a column use 'X'.
For example: 't1,-,X', would set the first column to title,
the second column would be indexed as document content and the third
column would be skipped.
Content-type defined in this argument should be the same as the content type in
the metadata mappings
-check_url_exclusion=<on|off> - URLs matching url_exclusion pattern will not be searchable. (Default on.)
-url_exclusion_pattern=<regex> - exclusion pattern to use if URLs are vetted. (Default 'file://$SEARCH_HOME/')
-filepath_exclusion_pattern=<regex> - exclusion pattern to use if files are to be excluded from indexing
on the basis of the filepath. If applicable, this is more efficient than excluding by URL
because the URL can't be finally determined until the content has been scanned. (Default: not set)
-index_subversion_dirs - Normally the .svn directories created by
the subversion version control system are not indexed. Override this default.
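Two of the options in section B lend themselves to a short illustration: the -select sampling rule and the -csv_fields descriptor list. Both functions below are sketches with hypothetical names. In parse_csv_fields, treating the last character of a metadata descriptor as the content type is an assumption drawn from the 't1' example.

```python
def select_indices(total: int, num1: int, num2: int) -> list[int]:
    """Zero-based positions kept by -select<num1>,<num2> out of total items."""
    return list(range(num2, total, num1))


def parse_csv_fields(descriptors: str):
    """Turn a descriptor list such as 't1,-,X' into (kind, class, type) tuples."""
    parsed = []
    for item in descriptors.split(","):
        if item == "-":
            parsed.append(("text", None, None))      # index as document text
        elif item == "X":
            parsed.append(("skip", None, None))      # skip this column
        elif len(item) >= 2:
            # Assumed split: class name first, content type as the last character.
            parsed.append(("metadata", item[:-1], item[-1]))
        else:
            raise ValueError("bad descriptor: " + repr(item))
    return parsed
```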
C. Controlling how things are indexed.
-noax - Don't conflate accents.
-unimap=<mapname> - specify a Unicode mapping to be applied when indexing
and when query processing. Supported values:
tosimplified, and totraditional. (Chinese only.)
-deutsch=<i> - How much extra processing is done for umlaut and sz.
0 - none. München is indexed as München and Munchen
1 - München is indexed as München, Muenchen and Munchen (Dflt)
2 - As for 1 but also Muenchen is indexed as München, Muenchen and Munchen
(As a side-effect to allow for compounds, SORT_SIGNIF is increased to 40.)
-nz=<i> - How much extra processing is done for Māori.
0 - none. Māori and Mäori are indexed as Māori or Mäori resp. and Maori (Dflt)
1 - Māori is indexed as Māori, Maaori and Maori
Mäori is indexed as Mäori, Māori, Maaori and Maori
-no_cjkt_grams - Suppress the indexing of bigrams/unigrams in CJKT text. It is assumed that
said text has been pre-segmented into words, and that normal word-based indexing is needed.
-QL_depth=<i> - Activate quicklinks on default pages of up to depth i. Use internal QL defaults. (Dflt 0 = Off)
-QL_config=<f> - Activate quicklinks. Read quicklinks configuration options from
file f.
-docscan_depth=<i> - When trying to determine doc type and charset
indexer will look up to i characters into the document. (Dflt 20480)
-forcexml - Use the XML parser on all documents.
-case - Store case information in postings. Currently unsupported. Note that setting
this reduces the approximate max number of unique terms from ~950M to ~240M.
-SORTSIG<num> - How many [UTF-8] characters in a word are significant. Default 20
-dilw - Don't index words or use words in summaries that are longer
than what is set by -SORTSIG.
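The -deutsch levels 0 and 1 above can be illustrated by generating the index forms for one word. This is a sketch only: the function names are hypothetical, level 2 (the reverse mapping from 'ue' back to the umlaut) and sz handling are not implemented, and the accent stripping shown is a generic Unicode approach rather than PADRE's own conflation table.

```python
import unicodedata

# Umlaut expansions for the level 1 variant, written as escapes.
UMLAUT_EXPANSIONS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}


def strip_accents(word: str) -> str:
    """Remove combining marks, e.g. u-umlaut becomes u."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def deutsch_variants(word: str, level: int = 1) -> list[str]:
    """Forms under which a word would be indexed for levels 0 and 1."""
    word = unicodedata.normalize("NFC", word)
    variants = [word]
    if level >= 1:
        expanded = "".join(UMLAUT_EXPANSIONS.get(c, c) for c in word)
        if expanded not in variants:
            variants.append(expanded)
    plain = strip_accents(word)
    if plain not in variants:
        variants.append(plain)
    return variants
```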
D. Controlling metadata indexing.
-xml-config=<file> - <file> specifies a file defining XML indexing configurations in json format.
-MM=<file> - <file> specifies a file defining metadata mappings for both HTML and XML documents.
-XMF<file> - (Deprecated) <file> specifies a file defining XML field mappings.
-MMF<file> - (Deprecated) <file> specifies a file defining meta tag mappings.
-ifb - Index a special word '$++' at the start and end of each metadata field (on by default).
-noifb - Do not index a special word '$++' at the start and end of each metadata field.
-facet_item_sepchars=<string> - Which chars are used to separate metadata facet items. [Dflt '|']
-map[<f>] - Map anchor text in source file to metafield f. If <f> is absent,
outgoing anchortext is unfielded content. (dflt <f> absent)
-EM<file> - <file> is a file of external metadata.
-NIM - Ignore explicitly specified internal metadata.
-collfield=<f> - Index the name of a collection as metadata in each doc and assign to field f.
-collection_name= - Set the name of the collection being indexed.
-metadata_topk_capacity=<I> - Sets the maximum number of metadata names or XML paths padre
will keep track of for counting the most frequent metadata or
xpath that could be mapped.
-metadata_topk_k=<I> - Sets the number of the most frequent metadata names or XML paths padre
should report on after indexing.
E. Controlling link and anchortext handling.
-noank_record - Don't extract, record or index anchortext.
- .anchors.gz file not processed. No link counts possible.
-noank_index - Extract and record but don't index anchortext.
- .anchors.gz file can be post-processed by annie-a
-noank - Temporary synonym for -noank_index. Deprecated.
-dpdf - Produce but don't process the anchors distilled file.
-nep_action=<0|1|2> - Action to take for nepotistic links.
0 - treat the same as other links.
1 - ignore links of types greater than nep_limit.
2 - limit the number of repetitions of links of types
greater than nep_limit. (dflt)
-nep_limit=<0|1|2|3> - Ignore nepotistic links of types greater than the limit.
0 - unaffiliated links from outside the target domain.
1 - links from a different host.
2 - links from the same or a closely affiliated host.
3 - dynamically generated links from such a host.
-nep_cachebits=<i> - Don't let the low-value link cache grow above 2^i
-noaltanx - Don't index image alt as anchortext when an image is an anchor.
-nosrcanx - Don't index image src as anchortext when an image is an anchor.
-BL<f> - <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
-AD<f> - <f> is a file of SECD (single entity controlled domain) affiliations.
e.g. griffith.edu.au --> gu.edu.au
Links to an affiliated SECD are classified as within-domain.
-RP<f> - <f> is a file of CGI parameters which should be removed from source and target URLs.
- padre generates a regular expression from the lines in <f>.
- if <f> is "conf_file" the regex will be taken from crawler.remove_parameters
in the FunnelBack config file.
-A<pat> - <pat> is an acceptable link target pattern.
- URLs not matching pat will not be stored in anchors.gz file.
- if pat is "conf_file" pat will be taken from include_patterns
in FunnelBack config file.
-F<file> - *<file> is an additional anchor text file.
-FN<file> - Like -F but source URLs need not be looked up.
-RD<dir> - *<dir> is a directory in which to look
- for redirects and duplicates files.
- (produced by FunnelBack etc. & PADRE).
-igmaf - Ignore main anchors file.
-mule<n> - Discard links to URL targets longer than <n> chars. Default is no limit.
-rmat - Record targets of failed anchor lookups via stdout.
-create_phrase_metadata_terms=<b> - Enables the creation of phrase terms like "$++ foo bar $++" in the dictionary
for metadata. These phrase terms can be used to speed up queries like a:"$++ foo bar $++".
Phrases will only be created if indexing of field boundaries is enabled, which it is by default.
Disabling may reduce indexing time and index size.
F. Controlling which index files are generated.
-nomdsf - Suppress generation of the .mdsf file.
-nolex - Suppress generation of the .lex file.
-noqicf - Suppress computation of QIC features and .qicf file.
-nohostf - Suppress computation of host features and .ghosts file.
-cleanup - Remove superfluous files from the index directory after index has completed.
G. Setting size limits.
-GSB<n> - How many gscope bytes to allow for. Default/Min: 8/2.
-big<N> - Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (ie. -big5).
-small - Divide word table sizes by 4 from base of 256K (i.e. use 64K).
-chamb<num> - Set decompression chamber size to <num> MB. Default 32
-RSDTF<num> - Set maximum characters in description & title fields in .results to <num>. Default 256.
-RSTAG<num> - Set number of bytes to reserve for tags in .results to num. Default 0.
-RSTXT<num> - Set maximum characters in summarisable text per doc in .results to <num>. Default 50000.
-W<num> - Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM). Default for a 64bit system is 2800
-MWIPD<num> - Maximum words indexed per document (excluding anchors). By default all words are indexed.
-maxdocs<num> - Maximum no. of documents to index. Others are ignored.
-mdsfml<n> - Set the number of bytes used for MetaData Summary Field Maximum Lengths. Fields larger than this number will be truncated. Default is 2048.
-lock_string_mod_mode=[legacy|raw] - Sets how padre should modify the lock string before it is stored. 'legacy' mode removes some characters, replaces unquoted commas with new lines and removes consecutive new lines. 'raw' mode stores the lock string as is, up to the first null.
-99% - Limit on how full the word hash table can get.
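The table-size arithmetic behind -big<N> and -small can be checked with a short sketch. It only restates the figures quoted above (a 256K base, multiplied by 2^N or divided by 4); the function names are hypothetical and nothing here reflects how padre-iw allocates its tables internally.

```python
# Base word table size quoted in the usage text (256K).
BASE_WORD_TABLE_SIZE = 256 * 1024


def big_table_size(n: int) -> int:
    """Size implied by -big<N>: the 256K base multiplied by 2^N."""
    if n < 0:
        raise ValueError("N must be non-negative")
    return BASE_WORD_TABLE_SIZE * 2 ** n


def small_table_size() -> int:
    """Size implied by -small: the 256K base divided by 4."""
    return BASE_WORD_TABLE_SIZE // 4


if __name__ == "__main__":
    # -big5 is the stated default and should come out at 8M.
    print(big_table_size(5) == 8 * 1024 * 1024)
    # -small should come out at 64K.
    print(small_table_size() == 64 * 1024)
```

Under the same arithmetic, the -big8 setting used by -bigweb corresponds to a 64M table.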
H. Special indexing modes.
-duplicate_urls=flag|ignore (Default is flag.)
- Documents whose URL checksum is identical to that of another document
are normally flagged and suppressed from results.
-urlchecksums=case_sensitive|case_insensitive (Default is case_insensitive).
-paidads - If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
-doc_feature_regex=<Regex> - Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX.
The presence or absence of this feature can be used in the ranking function, controlled by cool29 and cool30.
-iolap - Overlap reading of bundles with processing them.
-utf8input - Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.)
-isoinput - Assume all input files whose charset is not specified are ISO_8859-1 encoded.
-force_iso - Forcibly assume all input files are ISO_8859-1 encoded.
-URLP<str> - When storing documents URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections)
-lmd - HTTP LastModified date takes priority over metadata dates.
-lmd_never - Completely ignore HTTP LastModified dates.
-ignore_link_rel_canonical - Ignore canonical URL declarations in HTML link elements.
-future_dates_ok - Option is ignored (future dates are always ok).
-DT<str> - Interpret <str> as start of new doc within bundle. (Not a regular expression).
(note that there is a separate mechanism for XML).
-annie[<exec>] - After normal indexing is complete, attempt to build an annotation index (annie)
and a spelling suggestion file.
Default executables are annie-a and build-spelling-index from whence padre-iw was run.
-speller[<exec>] - Allows the explicit specification of a spelling_index builder to run after annie-a.
-spelleroff - Turns off spelling-index building even if annie-a runs.
-spelling_threshold<i> - Annotations with fewer than i occurrences will not be considered
as spelling suggestions. (dflt 1)
-bigweb - Space saving option for bundled large crawl indexes. Roughly equivalent to:
-nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2
-nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet
* A shorter average wordlength is assumed.
* You can add e.g. -Axxx.com to cut anchor processing time.
* (Don't forget to make dupredrex.txt in index directory.)
I. Miscellaneous options.
-O<name> - <name> is the name of this organisation.
-T<path> - Specify a large temporary filespace for use by the indexer.
-redis_host=<str> - Hostname/IP of a Redis server where progress status should be written
-redis_port=<i> - Port of the Redis server. Default is 6379
S. Security options.
-security_level=<i> - Any non-zero value requires every document to have at least one lock. If set to 1 documents without locks will be excluded, if set to greater than 1 indexing will stop.
-security_mindocs=<i> - Must be at least this number of docs with at least one lock.
*** See also url_exclusion options in Section B above.
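The effect of -security_level described above can be pictured with the following sketch. It is a model of the documented behaviour only, not padre-iw code: the function name, the (url, locks) document representation and the use of an exception to stand for "indexing will stop" are all assumptions made for illustration.

```python
def apply_security_level(docs, security_level):
    """Filter (url, locks) pairs according to the documented -security_level rules.

    0            - no lock requirement, every document is kept.
    1            - documents without locks are excluded.
    2 or greater - a document without locks stops indexing (modelled as an exception).
    """
    if security_level == 0:
        return list(docs)
    kept = []
    for url, locks in docs:
        if locks:
            kept.append((url, locks))
        elif security_level == 1:
            continue  # excluded from the index
        else:
            raise RuntimeError("document without locks: " + url)
    return kept


if __name__ == "__main__":
    # Hypothetical documents: one with a lock, one without.
    docs = [("http://example.com/a", ["staff"]), ("http://example.com/b", [])]
    print(apply_security_level(docs, 1))
```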
29. padre-mi
Purpose: To merge a list of PADRE indexes into a single such index.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-mi outstem instem instem ... [-overwrite] [-cleargscopes]
-overwrite overrides protection against destroying existing outstem
-cleargscopes clears all set gscopes from the resulting index
Make a merged index (outstem) from the list of at least two input indexes.
This version assumes that input indexes have exactly the same format,
i.e. that the index format strings are the same and that they have
identical numbers of gscope bits, numerical metadata fields and so on.
Future versions may check this compatibility, but currently exact
compatibility is assumed. All manner of pestilence may descend upon you if
you use /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-mi on incompatible indexes. You have been warned :-)
30. padre-qi
Purpose: To set up a query-independent-evidence file for use in query processing.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-qi index_stem file_of_url_patterns dflt_score [profile_name] [-verbose]
- if a profile name is given, qiefile will be stem.qie_profile
Each URL in the index is matched against the patterns, in the
order in which they are listed in the pattern file. Once a match
is found, matching ceases for that URL. This behaviour can be
exploited to apply a general pattern (later in the file) if
no more specific pattern (earlier in the file) matches.
To achieve exact matching use ^ (matches start of URL) and
$ (matches end of URL).
Lines in the patterns file consist of:
<qie score> <url-pattern>
qie-score - a floating point number (assumed normalised to the range 0-1),
specifying the qie score to be applied.
url-pattern - a perl5 regular expression to be matched against name
strings in the .urls file (usually URLs).
Example:
0.25 ^(https://)?[^/]*nsw.gov.au/
1.0 ^(https://)?[^/]*wa.gov.au/
0.25 ^(https://)?[^/]*sa.gov.au/
0.25 ^(https://)?[^/]*nt.gov.au/
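The first-match-wins behaviour described above can be sketched in a few lines. This is an illustration only: it uses Python's re module rather than perl5 regular expressions (the example patterns behave the same in both), the function names are hypothetical, and the final fallback pattern is an invented line added to show a general rule placed after more specific ones.

```python
import re


def parse_pattern_lines(lines):
    """Parse '<qie score> <url-pattern>' lines into (score, compiled regex) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        score_text, pattern = line.split(None, 1)
        rules.append((float(score_text), re.compile(pattern)))
    return rules


def qie_score(url, rules, default_score):
    """Return the score of the first matching pattern, else the default score."""
    for score, pattern in rules:
        if pattern.search(url):
            return score  # matching ceases at the first hit
    return default_score


EXAMPLE_LINES = [
    "0.25 ^(https://)?[^/]*nsw.gov.au/",
    "1.0 ^(https://)?[^/]*wa.gov.au/",
    r"0.1 \.gov\.au/",  # hypothetical general fallback, not from the usage text
]

if __name__ == "__main__":
    rules = parse_pattern_lines(EXAMPLE_LINES)
    for url in ("https://www.wa.gov.au/health", "https://www.vic.gov.au/"):
        print(url, qie_score(url, rules, 0.5))
```

Because the general pattern is listed last, it only applies to URLs that none of the more specific patterns matched.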
31. padre-qs
Purpose: To generate query suggestions given an index and a partially typed query.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-qs -v | /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-qs stem=<stem>|collection=<collname> partial_query=<partial_query> [alpha=<f>] [show=<d>] [fmt=xml|json|json++] [callback=foo] [sort=0|1|2|3] [profile=<profile>] [debug=0|1|2|3]
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-qs stem=/opt/funnelback/data/abc/live/idx/index partial_query=kevi alpha=0.5 show=10
Note: The functionality is implemented by a dynamic library which is usually called directly.
- sort=0 (by weight), 1 (by length), 2 (in alphabetic order), 3 (by weighted combo of weight and length).
- fmt=json => simple JSON array of suggestion strings;
=json++ => full JSON object with all fields shown.
- callback=foo => In JSON or JSON++ output, wraps the response
with the supplied callback (for JSONP).
- show=<d> => how many suggestions to show.
- alpha=<f> => if sort=3, score = alpha * weight + (1 - alpha) * length_score.
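The sort=3 formula above can be tried out with a small sketch. The formula is taken from the usage text; everything else is assumed for illustration, including the function names, the suggestion tuples and their weight and length_score values (the usage text does not say how length_score is derived).

```python
def combined_score(weight, length_score, alpha):
    """score = alpha * weight + (1 - alpha) * length_score, as stated for sort=3."""
    return alpha * weight + (1.0 - alpha) * length_score


def rank_suggestions(suggestions, alpha):
    """Order (text, weight, length_score) tuples by descending combined score."""
    return sorted(
        suggestions,
        key=lambda s: combined_score(s[1], s[2], alpha),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical suggestions for the partial query 'kevi'.
    suggestions = [("kevin", 0.9, 0.2), ("kevlar", 0.3, 0.9)]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, [s[0] for s in rank_suggestions(suggestions, alpha)])
```

With alpha=1 only the weight matters and with alpha=0 only the length score matters, so the two hypothetical suggestions swap places between those extremes.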
32. padre-query-parser
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-query-parser -query=[Query to canon]
Returns to standard out a mostly canonicalised query.
33. padre-rf
Purpose: Generate a relevance-feedback query given a list of relevant documents in a collection. Powers the Explore feature.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-rf -v | /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-rf -idx_stem=<index_stem> [-exp=<7..50>] [-comp=<comp_num> -dox=<docnum_list> | -url=<url>] ...
Details of available options:
R. -collection=<X> - The name of a collection, either meta
or primary.
R. -script=<S> - Name of the CGI script to which
padre-rf.cgi should redirect. (dflt "(null)")
R. -idx_stem=<Y> - The index stem for this collection,
either meta or primary. [Not CGI]
R. -exp=<I> - Maximum complexity of generated query
(no. of words). (range 7 - 50) (dflt 10)
R. -deb_rf=<I> - Activate debugging output. Higher
values give more verbose output. (range 0 - 10) (dflt 0)
R. -comp=<I> - Component number within a meta
collection. (range 0 - unlimited) (dflt 0)
R. -dox=<D> - Comma separated list of document
numbers within current component.
R. -url=<E> - URL of document to be included in
generation of RF query.
34. padre-show
Purpose: Display the contents of a padre index file in readable format.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-show <padre_index_file>
-- if poss. displays contents of index file in text form.
e.g. /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-show index.urls
35. padre-sk
Purpose: Create a skip block index from a regular padre index.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-sk <stem> <skip>
Output will be in <stem>.idx_skip and <stem>.dct_skip
<stem> String: the index stem to use
<skip> Integer: the minimum number of postings between each skip block
36. padre-sr
Purpose: Display all or part of the content of the .results file. (Title, URL, Description metadata, and candidate sentences for summary generation.)
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-sr stem|results_file [-titleonly] [-unco] [-ifff|-embedded|-text|-html|-textsigs]
[starting_doc|starting_url] [urlpat=<regex>] [num_docs_to_show]
padre-sr sequentially reads the .results file and outputs all or part of the file to stdout in a choice of formats:
. html (default)
. embedded (incomplete html suitable for embedding in another html document)
. text
. textsigs (generate stem.textsigs file suitable for neardup detection.)
If -titleonly is given only the document titles are output. (not applic. to textsigs)
Use -unco to specify that the input doc. is in old uncompressed format.
If a starting document number or URL is given, output commences
only when that point in the file is reached. Output continues
to the end of the file unless num_docs_to_show is given.
If urlpat= is given, only documents whose URL matches the pattern are
considered for display. Case-sensitive unless specified otherwise
in the pattern. Don't include 'http://' in the pattern.
37. padre-sw
Purpose: Process queries using a PADRE index.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-sw <filestem> [option ...]
<filestem> - the common prefix of all the index files, or possibly
the name of a Funnelback collection.
Available options:
A. Getting information about PADRE and its operation.
-V - Print version number and exit.
-ixform - Print index format version expected by this
query processor and exit.
-help - Print this list.
Notation:
---------
<B> - Boolean. Will be interpreted as TRUE unless arg is 'off',
'false' or '0' (case insensitive).
<I> - Integer. eg. 7 or 100000. Whole number in specified range.
<F> - Number. e.g. 1 or 0.537 or 99.5. Some inputs of this type
are constrained to lie within [0.0 - 1.0].
<C> - Character. e.g. a or A or : A single character.
<S> - String. eg. abc or "a b c". Quotes needed around the
key and value if spaces or punctuation included - for
example: "-optionname=a b c".
<K> - Key/value pair. These options take a key and a value,
for example, -optionname.KEY=VALUE
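The <B> rule in the notation above (TRUE unless the argument is 'off', 'false' or '0', case insensitive) can be restated as a one-line check. This is a sketch of the documented rule only; the function name is hypothetical and padre-sw's own parsing may handle whitespace or other edge cases differently.

```python
# Values the notation lists as FALSE; everything else is interpreted as TRUE.
FALSE_VALUES = {"off", "false", "0"}


def parse_padre_bool(arg: str) -> bool:
    """Interpret a <B> option value following the rule in the Notation section."""
    return arg.lower() not in FALSE_VALUES


if __name__ == "__main__":
    for value in ("on", "OFF", "False", "0", "1", "true"):
        print(value, parse_padre_bool(value))
```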
I. Contextual navigation options:
-categorise_clusters=<B> - Whether contextual navigation suggestions are grouped by type.
-cnto=<F> - Set contextual navigation time-out to s seconds (s floating point). Processing
may be omitted entirely if elapsed time for a query already exceeds s seconds.
(dflt 1.0). (range 0.000000 - unlimited)
-contextual_navigation=<B> - Whether or not to activate the contextual navigation system.
-contextual_navigation_fields=<S> - String s lists the metadata fields, separated by commas surrounded by square
brackets, to scan for contextual navigation suggestions. (dflt '[c,t]'). Note
that scanning of document text can be suppressed by including a minus, for
example '[-,c,t]'.
-max_phrase_length=<I> - Maximum length (in words) of contextual navigation suggestions. (range 3 - 7)
-max_phrases=<I> - After this number of candidate phrases have been checked, contextual navigation
processing will stop. (range 0 - unlimited)
-max_results_to_examine=<I> - Maximum number of search results to scan for contextual navigation suggestions. (range 0 - 200)
-site_max_clusters=<I> - Maximum number of site clusters to present in contextual navigation. (range 0 - unlimited)
-topic_max_clusters=<I> - Maximum number of topic clusters to present in contextual navigation. (range 0 - unlimited)
-type_max_clusters=<I> - Maximum number of type clusters to present in contextual navigation. (range 0 - unlimited)
J. Geospatial options:
-geospatial_ranges=<B> - Calculate geospatial distance from origin and bounding box ranges when
geospatial data is configured and available.
-maxdist=<F> - Exclude results not within <f> km of origin. (range 0.000000 - unlimited)
-origin=<S> - <lat,long> Set origin to lat, long (floating point degrees).
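To make the interaction of -origin and -maxdist concrete, here is a sketch that parses a 'lat,long' origin and tests whether a point lies within a given number of kilometres. The usage text does not say how padre-sw computes distance; the haversine great-circle formula, the Earth radius constant and all function names below are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, not a padre constant


def parse_origin(value):
    """Parse an -origin style 'lat,long' string into floating point degrees."""
    lat_text, lon_text = value.split(",")
    return float(lat_text), float(lon_text)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_maxdist(origin, point, maxdist_km):
    """True if point is no further than maxdist_km from origin."""
    return haversine_km(origin[0], origin[1], point[0], point[1]) <= maxdist_km


if __name__ == "__main__":
    origin = parse_origin("-33.8688,151.2093")  # Sydney
    melbourne = (-37.8136, 144.9631)
    print(round(haversine_km(*origin, *melbourne)))
    print(within_maxdist(origin, melbourne, 500.0))
```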
K. Informational options:
-canq=<B> - Write reordered queries to log. (dflt off)
-countIndexedTerms=<S> - Metadata fields to have their indexed terms counted in the result set (DAAT
only). Unlike rmcf multiple term occurrences in a single document are counted
e.g. if metadata 'author' has 'Bob Ada|Bob|Bob' in two documents the resulting
counts would be 'Ada': 2, 'Bob': 6. As this counts indexed terms long terms may
be truncated depending on the indexer options used. To count fields 'a' and
'c', set this to '[a,c]'. [Not CGI]
-countUniqueByGroup=<S> - Counts the number of unique metadata values grouped by another metadata. Syntax:
-countUniqueByGroup=[classToCount]:[groupBy],[classToCount]:[groupBy]. Example:
-countUniqueByGroup=[author]:[project] would show us the number of authors
contributing to each project. classToCount is a regex and will be expanded to
all matching metadata classes e.g. [autho.*]:[project] might expand to
-countUniqueByGroup=[author]:[project],[authors]:[project]. [Not CGI]
-count_dates=<S> - Report facet counts for dates such as 'today', 'last week', 'this year'. Note
that date categories may overlap. Only value currently supported is 'd'.
-count_urls=<I> - Display counts of results grouped by the URL path (Up to depth i). If <I> is 0,
then the default value is used. Dflt 5. If <I> is not present count urls is
turned off. [Not CGI]
-docsPerColl=<B> - Show the number of documents each collection contributed to the result set.
-rmcf=<S> - Metadata fields to have their words counted in result sets (fields representing
facets). If metadata 'author' has 'Bob Ada|Bob|Bob' in two documents the counts
would be 'Bob Ada': 2 'Bob': 2. To count fields 'a' and 'c', set this to
'[a,c]'.
-rmrf=<S> - Numerical and geospatial fields listed will have their ranges calculated in
result sets. To see the ranges of field 'height' and the bounding box
geospatial field 'X' set this to '[height,X]'.
-showtimes=<B> - Print elapsed times for each stage of query processing.
-sum=<S> - The sum of a numeric metadata in result set. Syntax: -sum=[sumOn],[sumOn].
Example: -sum=[size] would sum all values of numeric metadata 'size' in the
result set. Note sumOn may be a regex which expands sumOn to all matching
metadata classes e.g. -sum=[size.*] might be expanded to -sum=[sizeInKb],[sizeLoc]. [Not CGI]
-sumByGroup=<S> - The sum of a numeric metadata by a group. Syntax:
-sumByGroup=[sumOn]:[groupBy],[sumOn]:[groupBy]. Example:
-sumByGroup=[size]:[project] would sum all values of numeric metadata 'size'
grouped by 'project' giving output project 'Foo' has size '128', project 'Bar'
has size '12'. Note sumOn may be a regex which expands sumOn to all matching
metadata classes e.g. -sumByGroup=[size.*]:[project] might be expanded to
-sumByGroup=[sizeInKb]:[project],[sizeLoc]:[project]. [Not CGI]
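The difference between -rmcf and -countIndexedTerms is easiest to see by reproducing the 'Bob Ada|Bob|Bob' example from their descriptions. The sketch below is a model of the documented counts, not padre-sw's implementation: the function names are hypothetical, '|' is taken as the facet item separator, and splitting items on whitespace stands in for whatever tokenisation the indexer really applies.

```python
from collections import Counter


def rmcf_counts(values, sep="|"):
    """Count each distinct facet item once per document, as in the -rmcf example."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split(sep)))
    return counts


def indexed_term_counts(values, sep="|"):
    """Count every term occurrence, as in the -countIndexedTerms example."""
    counts = Counter()
    for value in values:
        for item in value.split(sep):
            counts.update(item.split())
    return counts


if __name__ == "__main__":
    # Metadata 'author' with the same value in two documents.
    author_values = ["Bob Ada|Bob|Bob", "Bob Ada|Bob|Bob"]
    print(dict(rmcf_counts(author_values)))
    print(dict(indexed_term_counts(author_values)))
```

The first call yields 'Bob Ada': 2 and 'Bob': 2, the second 'Ada': 2 and 'Bob': 6, matching the figures quoted in the option descriptions.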
L. Logging options:
-ip_to_log=<S> - What form of ip to include in log files: (nothing|ip|ip_hash|remote_user).
-log=<B> - Write query log entries (dflt on). [Not CGI]
-qlog_file=<S> - If writing query log entries, write them to <FILE>. [Not CGI]
-username=<S> - A string identifying the current user to be used in padre's query log.
M. Miscellaneous options:
-countgbits=<S> - s is either "all" or a comma-separated list of gscope bitnumbers for which
counts are needed. (Bits numbered from zero.)
-exit_on_bad_component=<B> - Fail when a component has an incompatible index relative to the first (rather
than skip).
-flock=<B> - Use flock when locking the query logfile. If set to no, lockf is used instead.
Default on Solaris is 'no', all other systems 'yes'.
-mat=<I> - Set matchset size to n million (dflt 24). Only need to increase on very large
collections. (range 0 - 2147) [Not CGI]
-ndt=<B> - Don't do tests on docs, e.g. phantom, zombie, *scope, binary, expired. [Not CGI]
-unbuf=<B> - Don't buffer the standard output stream. In some specific cases, setting this
to 'no' can improve performance.
-view=<S> - The collection view to perform the query against when in CGI mode. Normally
'live' (default), 'offline' or 'snapshot###'.
N. Presentation options:
-EORDER=<I> - Specify presentation order of query biased summary excerpts. 0: natural order
in doc. 1: sorted by score. (dflt 0) (range 0 - 1)
-MBL=<I> - Set buffer length per displayed metadata field to n bytes (dflt 250 bytes).
Warning: setting very large values will increase query processor memory demands
and may cause problems. (range 1 - unlimited)
-SBL=<I> - Set summary buffer length to n bytes. (dflt 250 bytes) (range 1 - unlimited)
-SF=<S> - Metadata fields to include in summaries. (if applicable). To include fields 'a'
and 'd' set this to '[a,d]'. This option also supports regex: to include all
metadata classes set this to '[.*]'; to include fields prefixed with 'Fun' and
metadata class 'a' set this to '[Fun.*,a]'.
-SHLM=<I> - Select highlighting method within snippets in XML. 0 - No highlighting ; 1 -
HTML strong tags ; 2 - Show highlighting regexp. and unhighlighted summary
[dflt]; 5 - Use HTML strong tags but remove accents from summary before
highlighting, provided query was not accented. (range 0 - 7)
-SM=<S> - Summary mode (off;snip;debug;meta;qb;def;auto;both) - both means qb and meta.
-SQE=<I> - Set max no. of query biased summary excerpts to n (dflt 3). (range 1 - 10000)
-all_summary_text=<B> - Whether the text used for generating summaries should be included in the result.
-bb=<B> - If set, the query processor may insert "best bets" (formerly known as
"featured pages") suggestions from best_bets.cfg.
-countUniqueByGroupSensitive=<B> - Treat group names and metadata items case sensitively (default no). [Not CGI]
-ctest_mode=<I> - Controls behaviour of padre-sw when -ctest is used. 0: no internal evaluation;
1 - internal evaluation only. Output is brief plain text report of measures;
2 - internal evaluation only. Output in plain text with QBQ output followed by
measures; 3 - internal evaluation plus normal CTOUT output in XML (with
measures presented as comments) (range 0 - 3)
-explain=<B> - Explain rankings by showing score components. (Note that -explain=on turns off
result set diversification).
-explore=<I> - Show 'explore' links against results. The value specifies how many terms to
include in the expanded query. (range 7 - 50)
-gscoperesult=<S> - Specifies the bit number that results will be set to in -res gscope or -res
docnums modes (dflt 1).
-mdsfhl=<B> - Are query terms only highlighted in MDSF metadata summaries
-num_ranks=<I> - Limit number of results to n (min = 0, dflt = 10). (range 0 - unlimited)
-num_tiers=<I> - Limit number of result list tiers to n (min = 0 (no limit), max = 50, dflt no
limit) (range 0 - 50)
-qieval=<F> - Set the value presented for query independent evidence when using the qiecfg
result format. (dflt 0.5). (range 0.000000 - 1.000000)
-qwhl=<S> - Determines which parts of a search result are highlighted. S - snippet, M -
metadata, U - URL, T - title. E.g. -qwhl=MUT
-res=<S> - Set result format. Possible values are: trec, web, xml, urls, qiez, qieo,
gscope, docnums, ctest, qiecfg or flcfg.
-results_in_facet_categories=<I> - Include the specified number of pre-computed search results under the rmc count
element for metadata facet categories. (range 0 - 100)
-rmc_maxperfield=<I> - Set maximum number of RMC items to display per field at n (dflt 100). (range 0 - unlimited)
-rmc_sensitive=<B> - Treat facet categories (RMC items) case sensitively (default no). [Not CGI]
-show_qsyntax_tree=<B> - Include an SVG representation of the query-as-processed in output.
-start_rank=<I> - Present results starting from n (dflt 1). (range 1 - unlimited)
-sumByGroupSensitive=<B> - Treat group names case sensitively (default no). [Not CGI]
-tierbars=<B> - Display tierbars in result list output (XML and HTML). When turned on (for all
-res modes) and -sort is used, results will be first sorted by tier then by the
sorting mode, otherwise if -sortall is used then all results will be sorted
regardless of tier.
-translucent_DLS_fields=<S> - Metadata fields which are translucent. Translucent fields are visible on
documents which the user can not see. To include fields 'a' and 'd' set this to
'[a,d]'. If collapsing is enabled and the collapsing signature contains only
fields defined here then collapsing will be permitted on documents the user can
not see. [Not CGI]
O. Query interpretation options:
-STOP=<S> - Use the stoplist specified in <file> (one word per line) [Not CGI]
-binary=<I> - Determines whether or not binary documents are returned in the results. 0 -
show all documents; 1 - show only binary documents; 2 - show only non-binary
documents. (range 0 - 3)
-clive=<S> - Dynamic metacollections. Specifies a component name within a .sdinfo file(s)
to make active. Can be set multiple times to enable multiple collections.
-daat_termination_type=<I> - Selects how DAAT early exit is determined. 0 - try for d results with every
metafield and every component; 1 - try for d results over every component but
not necessarily every metafield; 2 - stop as soon as d results are obtained.
(d is the parameter to -daat.) (range 0 - 2)
-daat_timeout=<F> - Impose a soft timeout (in seconds) on the time taken by the DAAT machinery for
one query. (range 0.000000 - 3600.000000) [Not CGI]
-dont_estimate_full_matches=<B> - In DAAT mode don't guess the number of full matches when the DAAT depth did not
let us process an entire postings list.
-events=<B> - Must be set if event search is to be used
-fmo=<B> - Present full matches only.
-lang=<S> - If a 2-character language code is specified by this means, then stemmers etc
specific to that language will be used, IF AVAILABLE. It is also permissible
to use a 5-character code like en_GB, but padre behaviour will be the same as
for en. Specifying lang also makes title and metadata sorting of results
locale-specific, however support for this on Windows platforms is limited and
problematic.
-loose=<I> - Phrase looseness in words (min = 0, dflt = 0). (range 0 - unlimited)
-max_qbatch=<I> - Terminate batch query processing after the specified number of queries have
been processed. (range 1 - unlimited)
-max_terms=<I> - Truncate queries after the specified number of terms. If the query is
reordered, truncation will occur after reordering. (range 1 - unlimited)
-min_truncated_len=<I> - The text part of a query term with a right truncation operator must have at
least this length. E.g. if min_truncated_len were 4 funnel* would be accepted
but fun* would be processed as fun. (range 0 - 20) [Not CGI]
-noexpired=<B> - Exclude expired docs from results. (Nullified by -zom) [Not CGI]
-nulqok=<B> - An empty query submitted via CGI will be processed as a null query. The system
query must be empty as well. (dflt is to ignore the request). [Not CGI]
-phrase_prox_word_limit=<I> - Phrase or proximity terms with more than this number of words will be shortened
by deleting words from the right. E.g. If this limit were 4 then `to be or not
to be` would be processed as `to be or not` (range 1 - unlimited) [Not CGI]
-prox=<I> - Proximity limit in words (min = 0, dflt = 15). (range 0 - unlimited)
-qsup=<S> - When blending queries, determines sources of supplementary queries to be tried,
with corresponding weights assigned to each source (ranging from 0 to 1). No
spaces. 'off' may be specified to disable supplementary queries. E.g.
-qsup=SPEL/0.9+USUK/0.4+SYNS/0.1+LANG/0.1. Available sources are: SPEL
(spelling suggestions); USUK (table of spelling differences between US and UK
English); SYNS (synonyms as defined by the blending.cfg file); LANG
(experimental German decompounding).
-query_reorder=<B> - Reorder terms in query so that the most discriminating (least common) appear
first. Often coupled with -max_terms=
-ras=<I> - Remove any stopwords from the query. Possible values: 0 - remove none; 1 -
remove dynamically depending on the query; 2 - remove all stopwords (dflt 1). (range 0 - 2)
-service_volume=<S> - Either 'high' or 'low'. A convenience setting to increase or reduce allowable
query complexity and timeouts according to service volumes -- large or small
indexes, high or low query volumes. [Not CGI]
-stem=<I> - Controls stemming of queries. 0 - do not stem (dflt); 1 - do not stem (replaces
obsolete option); 2 - Stem all query words (light - English/French
plural/singular only); 3 - Stem all query words(heavier). (range 0 - 3)
-stem_lconly=<B> - When stemming, stem only lowercase query words (to avoid stemming proper names
and acronyms).
-strip_invalid_utf8=<B> - Normally, invalid UTF-8 characters are removed during indexing. If this hasn't
happened, this option allows them to be removed from result packets.
-synonyms=<B> - If set, the query processor will expand queries using thesaurus in synonyms.cfg.
-truncation_allowed=<I> - Enables the use of the * operator, binary valued, it is only valid in use with
an option that disables DAAT mode such as, -service_volume='lo' or -daat=0.
When applied all contexts are available such as: *:funnelback, funnel*, *back,
and *:*elba*. (range 0 - 3) [Not CGI]
-wildcard_thresh=<I> - If the postings list for a term is longer than the specified value (in MB) it
will be treated as a wildcard. (range 0 - unlimited)
-zom=<B> - Include docs in results even if noindex or killed.
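The worked examples given for -phrase_prox_word_limit and -min_truncated_len can be reproduced with the sketch below. It models only what those two descriptions state (words deleted from the right, and a too-short truncated term processed without its operator); the function names are hypothetical and real query parsing in padre-sw involves far more than whitespace splitting.

```python
def limit_phrase_words(phrase, word_limit):
    """Shorten a phrase to at most word_limit words by deleting words from the right."""
    words = phrase.split()
    return " ".join(words[:word_limit])


def apply_min_truncated_len(term, min_len):
    """Drop a right truncation operator if the text part is shorter than min_len."""
    if term.endswith("*"):
        text = term[:-1]
        if len(text) < min_len:
            return text  # processed as a plain term
    return term


if __name__ == "__main__":
    print(limit_phrase_words("to be or not to be", 4))
    print(apply_min_truncated_len("funnel*", 4))
    print(apply_min_truncated_len("fun*", 4))
```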
P. Query source options:
-ctest=<S> - Read a batch of queries from testfile (in C_TEST format). Sets output format to
RM_CTEST, but that may be overridden. (See es.csiro.au/C-TEST/ for information
about C-TEST.) [Not CGI]
-s=<S> - System-generated query inserted behind the scenes by a form or front-end.
Q. Quicklinks options:
-QL=<I> - Activate QuickLinks facility for default pages down to the specified level. 0 -
off; 1 - server root pages; 2 - next level down. (range 0 - 5)
-QL_rank=<I> - If QuickLinks capability is active, show quick links for search results down to
the specified rank. (range 1 - unlimited)
-QL_rank_is_relative=<B> - If true, the value of QL_rank will be interpreted relative to the start_rank.
E.g. if QL_rank=2, the first two results on each page may show QuickLinks.
R. Ranking options:
-SameSiteSuppressionExponent=<F> - Same site suppression penalty exponent (dflt 0.5, recommended range 0.2 - 0.7). (range 0.000000 - unlimited)
-SameSiteSuppressionOffset=<I> - Number of additional documents from a site beyond the first that are allowed
their full score before applying a same site suppression penalty (dflt 0) (range 0 - 1000)
-absscores=<B> - Report content scores as % of max possible Okapi score (Intended for use with
-vsimple=on).
-anniemode=<I> - Control the use of annotation indexes. 0 - do not use annotation indexes ; 1 -
Process queries using annotation indexes only; 2 - Process queries using
annotation indexes, falling back to normal indexes if insufficient results.
(Most query op.s stripped.) 3 - Process queries using both annotation and
normal indexes (Most operators stripped from queries.). Default 0. (range 0 - 3)
-b=<F> - Set Okapi b to f (dflt 0.75) (range 0.000000 - unlimited)
-cgscope1=<S> - Documents matching this gscope expression (reverse Polish) can be upweighted
with -cool68. Those not matching can be upweighted with -cool.70.
-cgscope2=<S> - Documents matching this gscope expression (reverse Polish) can be upweighted
with -cool69. Those not matching can be upweighted with -cool.71.
-cool=<B> - Whether to use topic distillation scoring (cool and cooler). Dflt on.
-cool.<K=V> - cool.N=V Set a value for the Nth tune parameter.
Possible values for N are:
0 | content: content weight
1 | onlink: onsite link weight
2 | offlink: offsite link weight
3 | urllen: URL length weight
4 | qie: external evidence (qie) weight
5 | date_proximity: proximity to current date weight
6 | urltype: URL attractiveness (Homepages favoured. Copyright pages and URLS with lots of punctuation deprecated.)
7 | annie: annotation weight (annie)
8 | domain_weight: weight associated with this domain
9 | geoprox: geographical proximity to origin
10 | nonbin: non-binariness (1 for html, xml, txt, 0 otherwise)
11 | no_ads: freedom from ads
12 | imp_phrase: implicit phrase match score
13 | consistency: consistency of evidence. (Extra reward for docs with non-zero scores on both content and annie.)
14 | log_annie: logarithm of annotation weight (log(annie))
15 | anlog_annie: absolute-normalised logarithm of annotation weight.
16 | annie_rank: annotation rank = (k - rank)/ k. where k = 2 x highest rank requested - if rank > k, rank = k
17 | BM25F: field-weighted Okapi score
18 | an_okapi: absolute-normalised Okapi score.
19 | BM25F_rank: field-weighted Okapi rank.
20 | mainhosts: bias in favour of principal servers (web search only).
21 | comp_wt: component collection weighting. (meta collections only).
22 | document_number: document number in the crawl. An early position in the crawl may correlate with importance
23 | host_incoming_link_score
24 | host_click_score
25 | host_linking_hosts_score
26 | host_linked_hosts_score
27 | host_rank_in_crawl_order_score
28 | host_domain_shallowness_score
29 | doc_matches_regex: document matches administrator supplied regex
30 | doc_does_not_match_regex: document does not match administrator supplied regex
31 | titleWords: number of words in title
32 | contentWords: number of indexed words in document
33 | compressionFactor: compressibility of document text
34 | entropy: entropy of document
35 | stopwordFraction: fraction of stopwords in the document
36 | stopwordCover: fraction of stopword list present in the document
37 | averageTermLen: average term length
38 | distinctWords: number of distinct words in the document
39 | maxFreq: frequency of most frequently occurring term
40 | titleWords_neg: Neg number of words in title
41 | contentWords_neg: Neg number of indexed words in document
42 | compressionFactor_neg: Neg compressibility of document text
43 | entropy_neg: Neg entropy of document
44 | stopwordFraction_neg: Neg fraction of stopwords in the document
45 | stopwordCover_neg: Neg fraction of stopword list present in the document
46 | averageTermLen_neg: Neg average term length
47 | distinctWords_neg: Neg number of distinct words in the document
48 | maxFreq_neg: Neg frequency of most frequently occurring term
49 | titleWords_abs: Abs number of words in title
50 | contentWords_abs: Abs number of indexed words in document
51 | compressionFactor_abs: Abs compressibility of document text
52 | entropy_abs: Abs entropy of document
53 | stopwordFraction_abs: Abs fraction of stopwords in the document
54 | stopwordCover_abs: Abs fraction of stopword list present in the document
55 | averageTermLen_abs: Abs average term length
56 | distinctWords_abs: Abs number of distinct words in the document
57 | maxFreq_abs: Abs frequency of most frequently occurring term
58 | titleWords_abs_neg: Neg abs number of words in title
59 | contentWords_abs_neg: Neg abs number of indexed words in document
60 | compressionFactor_abs_neg: Neg abs compressibility of document text
61 | entropy_abs_neg: Neg abs entropy of document
62 | stopwordFraction_abs_neg: Neg abs fraction of stopwords in the document
63 | stopwordCover_abs_neg: Neg abs fraction of stopword list present in the document
64 | averageTermLen_abs_neg: Neg abs average term length
65 | distinctWords_abs_neg: Neg abs number of distinct words in the document
66 | maxFreq_abs_neg: Neg abs frequency of most frequently occurring term
67 | lexical_span_score
68 | doc_matches_cgscope1: Documents which match gscope defined by -cgscope1 (if defined)
69 | doc_matches_cgscope2: Documents which match gscope defined by -cgscope2 (if defined)
70 | doc_does_not_match_cgscope1: Documents which do not match gscope defined by -cgscope1 (if defined)
71 | doc_does_not_match_cgscope2: Documents which do not match gscope defined by -cgscope2 (if defined)
72 | raw_annie: Untransformed annie score linearly scaled to 0..1
-daat=<I> - Specifies the maximum number of full matches for Document-At-A-Time processing.
If set to 0, Term-At-A-Time is used instead (dflt 5000). (range 0 - 10000000)
-diversity_rank_limit=<I> - Diversification won't alter ranking beyond rank n (default 200, min 10). (range 10 - unlimited)
-facet_url_prefix=<S> - Present only results whose URL is prefixed by the given URL. Note that the
scheme and hostname part are case insensitive, for URI with scheme smb:// the
entire prefix is case insensitive. The behaviour of this option may change in
the future to suit facets, this should not be used outside of faceted
navigation. [Not CGI]
-gscope1=<S> - Present only results whose gscope bits match reverse Polish expression e (Bits
numbered from zero). If set to 'off', disable any previous expression.
-k1=<F> - Set Okapi K1 to <f>. (dflt 2.0) (range 0.000000 - unlimited)
-kmod=<I> - Select special scoring function i for special fields. 0 = normal, 1 = AF1
(dflt 1). (range 0 - 1)
-lscope=<S> - Present only results whose URL matches a sort-of left-anchored pattern.
-lscorrect=<B> - Whether to correct link scores across meta collection components (default yes).
-main_homepage_factor=<F> - Penalise score of the homepage of a single-entity-controlled domain to prevent
over-representation in result sets. E.g. www.anu.edu.au/ in an index of ANU. (range 0.000000 - 1.000000)
-meta_suppression_field=<S> - If same_meta_suppression is activated, the specified metadata field will be the
field to which it applies. Only one metadata field can be treated in this way.
-near_dup_factor=<F> - The query processor will penalise a result which is a near-duplicate of a
previous result by multiplying by the factor specified. The penalty stiffens
with more repetition. (range 0.000000 - 1.000000)
-promote_urls=<S> - Insert the specified URLs at or near the top of the results list for a query.
Value is a space separated list of URLs. URLs must correspond to those
recorded by padre-iw. (dflt Inactive)
-quanta=<I> - Set the number of possible score quantisation levels for each cool variable.
In general, a high number should give more accurate ranking but may slow query
processing. (range 10 - 100000)
-rank_limit=<I> - Limit highest rank requestable to n (dflt 1,000,000,000). (range 10 - unlimited)
-ranking_profile=<I> - Choose a profile of settings for the ranking function. 0 - current default; 1
- Standard BM25; 2 - Traditional (pre-12.0) Funnelback. Setting a profile does
not override explicit settings. (range 0 - 100) [Not CGI]
-recency_decay_vals=<S> - <z,w,m,y,d,c,m> - Define how recency scores decay with time. z, w, m, y, d, c, m
are floats in the range 0 - 1, which specify the recency score assigned to
documents, 0 days, 1 wk, 1 mth, 1 yr, 1 dec, 1 cen, 1 mill. old. (dflt
1.0,0.75,0.5,0.25,0.025,0.0025) Recency scores between key values linearly
interpolated. Past the millennium, recency scores are 1/daysold.
-reference_date=<S> - If specified, recency is based on this date rather than that of most recent
doc. Format is <yyyymmdd>, or 'today'.
-remove_urls=<S> - Prevent the specified URLs from appearing in the results for a query. Value is
a space separated list of URLs. URLs must correspond to those recorded by
padre-iw. (dflt Inactive)
-sco=<S> - <n>[<classes>] Set doc scoring mode to n, using the classes specified. Most
common values: 0 - score using doc text only ; 1 - no scoring. Produce an
unordered set of results ; 2 - score using anchortext and URLs as well,
upweight titles (or whatever fields are configured with -specf). For example, to
automatically look in fields 'u' and 'v' for the query terms, set -sco=2[u,v]
-scope=<S> - Present only results whose URL satisfies the include/exclude scopes included in
list (comma separated). e.g. -scope=anu.edu.au,-anu.edu.au/archives
-sort=<S> - Sort top results by <string>. Possible values: 'date', 'adate' (ascending
date), 'title', 'dtitle' (descending title), 'size' (file size), 'dsize'
(descending filesize), 'url', 'durl' (descending url), 'coll' (collection name,
then score), 'dcoll' (descending collection name, then score), 'meta<f>' (by
metadata field f, then score), 'dmeta<f>' (descending metadata field f, then
score), 'shuffle' (random to avoid bias), 'collapse_count' (to order by the
number of collapsed documents, with the largest collapsed set first),
'acollapse_count' (with the largest collapsed set last), 'prox' (for geo
search: Sort top results by proximity to origin), 'dprox' (for geo search:
Sort top results by descending proximity to origin). 'score_ignoring_tiers'
(descending score, ignoring any tiers. Only useful with sortall.) (dflt is
case-insensitive for title and meta). '-sort=' turns off sorting.
-sort_sensitive=<B> - Use case-sensitive sorting when sorting results by title or metadata strings.
-sortall=<B> - Include partial matches in the resorting performed by -sort.
-specf=<S> - Fields listed in string s, as a list of comma separated fields surrounded by
square brackets, will be scored specially and added to query when using the
-sco=2 mode (dflt '[k,K]').
-sss_defeat_pattern=<S> - URLs matching the specified pattern (currently a simple string match) will not
be subject to samesite suppression.
-static_cool_exponent=<F> - Control the extent to which static scores are attenuated with length of query.
0 => no attenuation; 1 => max attenuation. Attenuation by len ** -f. (range 0.000000 - 1.000000)
-unknown_daysold=<I> - A doc with unknown date is assumed to be d days old (for recency calcs) (dflt
366). (range 0 - unlimited)
-use_Paik=<B> - Use the tf.idf scheme proposed by Jiaul Paik at SIGIR 2013 rather than the more
conventional BM25 variant.
-use_secds=<B> - When working with domain-importance features in ranking, use SECDs if value is
on, and raw domain names otherwise.
-vsimple=<S> - Very simple ranking. If set to 'on', equivalent to -sco=0 -cool=off -SSS=0
-kmod=0.
-weight_only_fields=<S> - Documents will not be retrieved in DAAT mode if they only match unfielded query
terms in one or more of the implicit fields listed here. For example,
specifying '[K,k]' will stop the query 'Monica Lewinski' matching a document
solely because of click data or referring anchortext.
-wmeta.<K=V> - wmeta.C=F Set upweighting factors for metadata class scoring. C - metadata
class; F - weight to set. (dflt 0.5 for 'k' and 'K', 1 for everything else).
-xscope=<S> - Present only results whose URL exactly matches the provided URL (after
canonicalisation).
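The piecewise-linear decay described for -recency_decay_vals can be sketched in Python as follows. This is an illustration only, not PADRE's implementation. The day counts used for the key ages (7, 30, 365, 3650, 36500, 365000) and the function name recency_score are assumptions. The caller must supply seven scores, because the documented default lists only six values.

```python
from bisect import bisect_left

# Assumed ages in days for the seven key points: 0 days, 1 week, 1 month,
# 1 year, 1 decade, 1 century, 1 millennium. The exact day counts PADRE
# uses are not stated in the usage text.
KEY_AGES = [0, 7, 30, 365, 3650, 36500, 365000]


def recency_score(days_old, key_scores):
    """Linearly interpolate a recency score between the key ages."""
    if len(key_scores) != len(KEY_AGES):
        raise ValueError("expected one score per key age")
    if days_old <= 0:
        return key_scores[0]
    if days_old > KEY_AGES[-1]:
        # Past the millennium the usage text gives 1/daysold.
        return 1.0 / days_old
    hi = bisect_left(KEY_AGES, days_old)  # first key age >= days_old
    lo = hi - 1
    frac = (days_old - KEY_AGES[lo]) / (KEY_AGES[hi] - KEY_AGES[lo])
    return key_scores[lo] + frac * (key_scores[hi] - key_scores[lo])
```

With hypothetical scores 1.0,0.75,0.5,0.25,0.025,0.0025,0.00025, a document 3.5 days old falls halfway between the first two key values.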
S. Ranking - Result diversification options:
-SSS=<I> - Same site suppression depth: 0 - no suppression (dflt for non-web
collections.); 2 - hosts and their top level dir's (dflt for web and meta
collections); 10 - special meaning for big Web applications. (range 0 - 10)
-neardup=<F> - Near duplicates in ranking are multiplied by f. Setting f to 1 turns off
near-dup detection. (range 0.000000 - 1.000000)
-repetitiousness_factor=<F> - Penalise a repetitious result by multiplying by the factor specified.
(Repetitiousness may involve same-site, same component or repeated metadata.)
The penalty stiffens with more repetition. Setting to 1 turns this off. (range 0.000000 - 1.000000)
-same_collection_suppression=<F> - While searching a meta-collection, penalise the second result from the same
primary collection as a previous result by multiplying by the factor specified.
The penalty stiffens with more repetition. Setting to 0 turns this off. (range 0.000000 - 1.000000)
-same_meta_suppression=<F> - Penalise the second result with the same value for a specified metafield as a
previous result by multiplying by the factor specified. The penalty stiffens
with more repetition. Setting to 0 turns this off (range 0.000000 - 1.000000)
-title_dup_factor=<F> - The query processor will penalise a result which has exactly the same title as
a previous result by multiplying by the factor specified. The penalty stiffens
with more repetition. Setting to 1 turns this off. (range 0.000000 - 1.000000)
T. Result collapsing options:
-collapsing=<B> - Activate collapsing. Collapsing will be based on document content ('$') unless
a collapsing_sig value is specified. Note that use of this option will disable
result set diversification.
-collapsing_SF=<S> - Metadata fields to include in display for collapsed documents (assuming
collapsing_num_ranks is non-zero). (dflt no fields). To view metadata fields
'id' and 'a' set this to '[id,a]'.
-collapsing_label=<S> - Label to indicate why items have been collapsed. (dflt "which are very
similar")
-collapsing_num_ranks=<I> - Specify how many collapsed results are to be shown under the uncollapsed ones.
(dflt 0) (range 0 - 1000)
-collapsing_scoped=<B> - Scope to only documents which have been collapsed on. Default is off.
-collapsing_sig=<S> - The collapsing_control segment to use when collapsing. E.g. "[a,p]", collapse
on author+publisher. The value must correspond to one segment of the
indexing.collapse_fields string. (A segment is a comma separated list of fields
surrounded by square brackets) (dflt '[$]' (Collapsing on document content.))
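The idea of collapsing on a signature such as "[a,p]" can be illustrated with a small Python sketch. It is not how PADRE implements collapsing. The result dictionaries, their keys, and the function name collapse are hypothetical. The first result for each signature is kept and later ones are grouped under it.

```python
def collapse(results, sig_fields):
    """Group ranked results by the values of the signature fields.

    Returns one entry per distinct signature, in rank order of the first
    result seen, with later matches listed under "collapsed".
    """
    groups = {}  # dicts preserve insertion order, so rank order is kept
    for doc in results:
        key = tuple(doc.get(field) for field in sig_fields)
        if key in groups:
            groups[key]["collapsed"].append(doc)
        else:
            groups[key] = {"doc": doc, "collapsed": []}
    return list(groups.values())
```

For example, collapsing on ["a", "p"] groups results that share both author and publisher metadata.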
U. Security options:
-dls_internal_test=<I> - This allows testing of the padre side of the custom document level security
mechanism. There is no call out to an external function. The value is
interpreted as a combination of bits: 1 bit - dls_internal_test is active/not
active; 2 bit - selects whether MINRESULTS mode is used or not. During internal
testing, every odd numbered document in the original ranking is arbitrarily
treated as inaccessible. (range 0 - unlimited)
-ipreject=<S> - <queryLimit>,<windowSeconds>,<upperQueryLimit> - Use an IP rejector to limit
requests from a single machine. Allow <queryLimit> queries per
<windowSeconds>, don't record more than <upperQueryLimit> queries. [Not CGI]
-ldLibraryPath=<S> - Full path to security plugin library [Not CGI]
-locking_model=<S> - Name of locking model, either "trim" or "sharepoint". [Not CGI]
-no_security=<B> - Disable DLS, available as a command line option. [Not CGI]
-secPlugin=<S> - Name of security plugin library [Not CGI]
-secPluginScript=<S> - Name of security plugin script [Not CGI]
-translucent_DLS=<B> - Enables translucent DLS (DAAT only). [Not CGI]
-userkeys=<S> - Conduct this search with security keys specified by s. The format is
'<collectionName>;key<delim>' where delim is either ',' or a newline. Spaces are
removed. For example: 'c1;k1
c2;k1,c2;k2' [Not CGI]
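The -userkeys format described above can be parsed with a few lines of Python. This is a sketch of the documented format only. The function name parse_userkeys and the returned dictionary layout are assumptions, not part of PADRE.

```python
import re


def parse_userkeys(value):
    """Parse '<collectionName>;key' entries separated by ',' or newline."""
    keys = {}
    # Spaces are removed before splitting, as the usage text describes.
    for entry in re.split(r"[,\n]", value.replace(" ", "")):
        if not entry:
            continue
        collection, sep, key = entry.partition(";")
        if not sep:
            raise ValueError("missing ';' in entry: %r" % entry)
        keys.setdefault(collection, []).append(key)
    return keys
```

Applied to the example value from the usage text, this yields key k1 for collection c1 and keys k1 and k2 for collection c2.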
V. Spelling options:
-spelling=<B> - Activate spelling suggestion mechanism.
-spelling_alpha=<F> - Set the weighting between 'closeness to the query' and support in the
collection for a candidate suggestion. Big alpha, high weight on closeness to
the query. (range 0.000000 - 1.000000)
-spelling_blend_thresh=<F> - Confidence threshold for automatically blending results for a query suggestion
with those from the user's original query. (range 0.000000 - 1.000000)
-spelling_difflen_thresh=<I> - Don't make suggestions more than i characters longer or shorter than query. (range 0 - 1000)
-spelling_dym_thresh=<F> - Confidence threshold for making a 'Did you mean' suggestion. (range 0.000000 - 1.000000)
-spelling_edist_constant=<F> - Don't make suggestions whose edit distance from the query exceeds f +
query_length * spelling_edist_proportion (range 0.000000 - 1000.000000)
-spelling_edist_proportion=<F> - Don't make suggestions whose edit distance from the query exceeds
spelling_edist_constant + query_length * f (0<=f<=1) (range 0.000000 - 1.000000)
-spelling_fullmatch_trigger_const=<F> - Don't look for suggestions if there are at least f * log10(num docs) full
matches. (range 0.000000 - unlimited)
-spelling_include_context=<B> - Include the non-corrected part of the query in the suggestion link.
-spelling_min_querylen=<I> - Suggestions not made for queries shorter than this. (range 1 - 1000)
-spelling_wt_thresh=<F> - Don't make suggestions whose weight is less than this. Weight is complex to
explain, sorry. (range 0.000000 - 100.000000)
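The edit distance cut-off defined by -spelling_edist_constant and -spelling_edist_proportion can be sketched in Python. The usage text does not say which edit distance PADRE uses or how query length is measured. Plain Levenshtein distance and length in characters are assumptions here, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def within_edist_limit(query, suggestion, constant, proportion):
    """True if the suggestion does not exceed constant + len(query) * proportion."""
    limit = constant + len(query) * proportion
    return edit_distance(query, suggestion) <= limit
```

With illustrative values constant=1.0 and proportion=0.1, a nine-character query allows suggestions up to 1.9 edits away, so a single inserted letter passes.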
W. TREC specific options:
-trec_runid=<S> - For TREC participation: Each result in TREC format will include this runid.
-trec_topic=<I> - For TREC participation: The first query in a batch will get this topic number.
Each new query will increase the number by one. (range 0 - unlimited)
-trecids=<B> - For TREC participation: Each result in TREC format will use the TREC docno
rather than a URL
38. padre-topk
Missing required argument -input=<input file>
List the top-k most frequent items
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/padre-topk -input=<input file> [-capacity=<integer>] [-k=<integer>]
Where:
* -input: is the file of items where each item is delimited by new line
* -capacity: the maximum number of items that will be held in memory at once.
* -k: the number of items that will be shown at the end.
Efficiently (compared to some other algorithms) estimates the counts of the top-k most frequent items
e.g. for a,b,c,a,b,a the top-2 would be a with a count of 3 followed by b with a count of 2
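The usage text does not name the algorithm padre-topk uses. One well-known way to estimate top-k counts with a bounded number of in-memory items is the Space-Saving algorithm, sketched below in Python as an assumption about the general approach. The function name and signature are hypothetical.

```python
def space_saving_topk(items, capacity, k):
    """Estimate the k most frequent items while tracking at most `capacity` counters."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        elif len(counts) < capacity:
            counts[item] = 1
        else:
            # Replace the least frequent tracked item; the newcomer inherits
            # its count plus one, so counts may be overestimates.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

When the capacity is at least the number of distinct items, the counts are exact. With a smaller capacity they are upper bounds.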
39. pan-look
Purpose: Efficient location of all lines in a sorted text file which match a prefix.
Usage: pan-look <prefix> <file name>
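Prefix lookup in sorted data is typically done with a binary search followed by a short forward scan. The Python sketch below shows that idea on an in-memory list of lines. It assumes the lines are sorted by plain string comparison and is not a description of how pan-look reads the file. The function name look is hypothetical.

```python
from bisect import bisect_left


def look(sorted_lines, prefix):
    """Return all lines starting with prefix from a sorted list of lines."""
    # Every line with the prefix sorts at or after the prefix itself,
    # and all such lines are contiguous.
    start = bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches
```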
40. phrasefinder
Purpose: Extract frequently occurring word tuples ('phrases') from a collection.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/phrasefinder <stem> [-unco] [-hash_limit=<i>] [-num_to_show=<i>] [-max_tuple=<i>] [-debug]
/data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/phrasefinder reads the .results file of <stem> and locates candidate word tuples
('phrases') up to a configurable maximum length. Phrases are sequences
of up to max_tuple words which are unbroken by anything other than space.
Candidates are stored in a hashtable and counted. A limit of
hash_limit candidates is stored. Once this is reached, the program
exits. (Useful for testing or for limiting execution time and virtual
memory requirements.) When processing finishes, the top num_to_show
candidates are sorted in descending order of frequency and output in
<stem>.phrases, in the same format as the .lex file, with word breaks
represented by hyphens.
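The counting scheme described above can be sketched in Python. This is an illustration rather than the phrasefinder implementation. It assumes that tuples start at two words, that any character other than a word character or space breaks a phrase, and that reaching hash_limit stops processing and reports what has been counted so far. The function name and default values are hypothetical.

```python
import re
from collections import Counter


def find_phrases(lines, max_tuple=3, hash_limit=100000, num_to_show=10):
    """Count word tuples of 2..max_tuple words, joined by hyphens."""
    counts = Counter()
    for line in lines:
        # Split into runs of words that are unbroken by anything but space.
        for run in re.split(r"[^\w ]+", line.lower()):
            words = run.split()
            for n in range(2, max_tuple + 1):
                for i in range(len(words) - n + 1):
                    phrase = "-".join(words[i:i + n])
                    if phrase not in counts and len(counts) >= hash_limit:
                        # Candidate table is full: stop early.
                        return counts.most_common(num_to_show)
                    counts[phrase] += 1
    return counts.most_common(num_to_show)
```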
41. run-with-flock
Purpose: Runs a command with an exclusive file lock held.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/run-with-flock file_lock_path lock_acquired_path cmd arg0 arg1 ... argn
This will open (creating it if necessary) 'file_lock_path' and then take an
exclusive file lock on that path. After that it will create the file
'lock_acquired_path'. If all of that succeeds, the given command will be
executed while the lock is held.
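The sequence above can be approximated in Python on a Unix-like system. The sketch assumes a blocking flock-style lock. The usage text does not say which locking call run-with-flock uses or whether it waits for the lock. The function name is hypothetical.

```python
import fcntl
import subprocess


def run_with_flock(file_lock_path, lock_acquired_path, cmd):
    """Run cmd (a list of arguments) while holding an exclusive lock."""
    # Open the lock file, creating it if needed; append mode avoids truncation.
    with open(file_lock_path, "a") as lock_file:
        # Assumption: block until the exclusive lock is available.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        # Create the marker file that signals the lock has been acquired.
        with open(lock_acquired_path, "w"):
            pass
        # The lock stays held until lock_file is closed at the end of the block.
        return subprocess.call(cmd)
```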
42. show_annotations_for_doc
Purpose: Given an annotation index, summarise the annotations applied to a given URL.
Usage: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/show_annotations_for_doc <index_stem>|<collection> <URL>|<URL64>|<DOCNO> [-csv|-html|-xml]
- Show the annotations applying to the specified URL and their weights.
- Don't forget to quote or escape shell metacharacters in a URL!
- Default output format is XML.
43. url_tagger
Purpose: Apply the tags in a tag mapping file to a PADRE index.
Usage 1: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/url_tagger stem (-clear|<url-tags-file>)
Usage 2: /data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/url_tagger -v
/data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/url_tagger -clear clears all tags from all documents in the index
/data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/url_tagger <url-tags-file> takes a url-tags file and applies it to
the relevant entries in <stem>.results.
/data/jenkins/workspace/funnelback-padre-linux-64-15.24.x/install/url_tagger -v shows version information.
Lines in the url-tags file are in the form: <url> <comma-separated-taglist>
It is assumed that <url> contains no spaces and tags contain no commas.
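The url-tags line format can be read with a short Python sketch. It assumes the URL and the tag list are separated by whitespace and that blank lines are ignored. The function name and the returned mapping are hypothetical and do not reflect url_tagger internals.

```python
def parse_url_tags(lines):
    """Map each URL to its list of tags from '<url> <tag1,tag2,...>' lines."""
    tags_by_url = {}
    for line in lines:
        parts = line.split(None, 1)  # URL first, then the rest of the line
        if not parts:
            continue  # skip blank lines
        url = parts[0]
        taglist = parts[1].strip() if len(parts) > 1 else ""
        tags_by_url[url] = [tag for tag in taglist.split(",") if tag]
    return tags_by_url
```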