crawler.inline_filtering_enabled

Option to control whether text extraction from binary files is done "inline" during a web crawl.

Key: crawler.inline_filtering_enabled
Type: Boolean
Can be set in: collection.cfg

Description

This parameter controls whether content is filtered inline, that is, during the gathering phase of the crawl.

A standard filtering operation is the extraction of text from binary document formats (e.g. PDF files and MS Office formats).

If enabled, the extractor uses the Tika filtering program by default to filter Office and PDF files.

Default Value

crawler.inline_filtering_enabled=true

Inline filtering is enabled by default, so content is filtered as it is gathered.

Examples

Turn off inline filtering:

crawler.inline_filtering_enabled=false
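
To check which value is in effect for a collection, a script can read the setting and fall back to the default when the key is absent. The following is a minimal Python sketch, not part of the product. It assumes collection.cfg holds simple key=value lines with '#' comments, and the helper names parse_cfg and inline_filtering_enabled are hypothetical.

```python
KEY = "crawler.inline_filtering_enabled"


def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and '#' comments.

    Assumption: collection.cfg uses this plain key=value layout.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def inline_filtering_enabled(text):
    """Return the effective setting, using the documented default (true) if unset."""
    value = parse_cfg(text).get(KEY, "true")
    return value.lower() == "true"


if __name__ == "__main__":
    sample = "crawler.inline_filtering_enabled=false\n"
    print(inline_filtering_enabled(sample))
```

In practice the text passed in would be the contents of the collection's collection.cfg file.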
