crawler.non_html

Which non-HTML file formats to crawl (e.g. pdf, doc, xls).

Key: crawler.non_html
Type: List<String>
Can be set in: collection.cfg

Description

This option is a comma-separated list of file extensions for the crawler to download. It covers non-HTML files, i.e. binary file types such as .pdf and .doc. These files are not parsed: the crawler does not attempt to extract hyperlinks from them.

If crawler.inline_filtering_enabled is set to "true", these files will be filtered. To exclude a specific file type from filtering, add its MIME type to the filter.ignore.mimeTypes setting.
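
As a rough sketch, a collection.cfg that downloads PDF files but excludes them from filtering might look like the following. The MIME type application/pdf is the standard one for PDF; the exact value format accepted by filter.ignore.mimeTypes is an assumption here and should be checked against that setting's own documentation.

crawler.non_html=pdf
crawler.inline_filtering_enabled=true
filter.ignore.mimeTypes=application/pdf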

Default Value

crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm

Examples

Only download PDF files.

crawler.non_html=pdf
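
Only download PDF and Word files (an illustrative variation that uses extensions taken from the default value above).

crawler.non_html=pdf,doc,docx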
