
Built-in filters: TikaFilterProvider

Introduction

This filter converts specific binary file formats to text using Apache Tika.

Supported formats

Funnelback includes default support for the following binary file formats:

  • Portable Document Format (.pdf)
  • Microsoft Word (.doc, .docx)
  • Microsoft Excel (.xls, .xlsx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Rich text files (.rtf)

Apache Tika supports a large and varied set of file formats. Funnelback can be configured to process additional Tika-supported formats beyond the defaults listed above.

Caveats

For a document to be indexed successfully, the filter must produce a textual representation of it.

The following documents will not be indexable by Funnelback:

  • Password protected or encrypted documents (such as protected Microsoft Office and PDF documents)
  • Scanned PDF documents (indexing a scanned document requires an OCR process to be run over it first)

Configure Tika to index additional supported file types

Before going any further, check the list of supported document types. Make sure you look at the list for the correct version of Tika: you can find the version by locating the Tika jar files in the $SEARCH_HOME/lib/java/all folder.

For formats supported by Tika see: Funnelback - Tika versions.
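The version check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the Tika jars follow the common naming pattern name-version.jar (for example tika-core-1.24.jar), which may not match every installation, and the function name tika_versions is hypothetical.

```python
import os
import re
from pathlib import Path

# Assumption: Tika jars are named like "tika-core-1.24.jar". Adjust the
# pattern if the jars in your installation are named differently.
TIKA_JAR = re.compile(r"^(tika-[a-z-]+?)-(\d+(?:\.\d+)*)\.jar$")


def tika_versions(jar_names):
    """Map each Tika artifact name to the version parsed from its jar file name."""
    versions = {}
    for name in jar_names:
        match = TIKA_JAR.match(name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


if __name__ == "__main__":
    # Only inspect the folder when SEARCH_HOME is set in the environment.
    search_home = os.environ.get("SEARCH_HOME")
    if search_home:
        jar_dir = Path(search_home) / "lib" / "java" / "all"
        print(tika_versions(p.name for p in jar_dir.glob("tika-*.jar")))
```

Jar files that do not match the assumed pattern are skipped, so an empty result means the naming assumption should be checked by hand.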

Add the file extension of each additional file type to the filter.tika.types configuration option.

Example

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
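When the list is edited by a script rather than by hand, a small helper can avoid duplicate entries. The sketch below assumes the value is a comma-separated list of lowercase extensions without leading dots, as in the example above; add_tika_type is a hypothetical helper and not part of Funnelback.

```python
def add_tika_type(current_value, extension):
    """Return the filter.tika.types value with `extension` appended if it is missing."""
    # Split the existing comma-separated value, dropping empty entries.
    types = [t.strip() for t in current_value.split(",") if t.strip()]
    # Assumption: extensions are stored lowercase and without a leading dot.
    ext = extension.strip().lstrip(".").lower()
    if ext and ext not in types:
        types.append(ext)
    return ",".join(types)


if __name__ == "__main__":
    print("filter.tika.types=" + add_tika_type("doc,docx,pdf", ".epub"))
```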

Once the extra type has been added, some additional options may need to be set, depending on the type of collection being indexed.

Collection type   Collection configuration option   Description
web               crawler.reject_files              Ensure the file extension is not listed here
web               crawler.accept_files              If used, ensure the file extension is listed here
web               crawler.non_html                  Ensure the file extension is listed here
filecopy          filecopy.filetypes                Ensure the file extension is listed here
trimpush          trim.extracted_file_types         Ensure the file extension is listed here
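For a web collection, these checks can be sketched as a small script. It assumes the collection configuration is a set of key=value lines and that each of these options holds a comma-separated list of extensions; that format is an assumption and should be confirmed against the documentation for each option. The function names are hypothetical.

```python
def parse_config(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


def as_list(config, key):
    """Read an option as a list of comma-separated entries (assumed format)."""
    return [v.strip() for v in config.get(key, "").split(",") if v.strip()]


def web_collection_problems(config, ext):
    """Return a list of reasons why `ext` might not be indexed in a web collection."""
    problems = []
    if ext not in as_list(config, "filter.tika.types"):
        problems.append(f"{ext} is missing from filter.tika.types")
    if ext in as_list(config, "crawler.reject_files"):
        problems.append(f"{ext} is listed in crawler.reject_files")
    accept = as_list(config, "crawler.accept_files")
    if accept and ext not in accept:  # only relevant when the option is used
        problems.append(f"{ext} is missing from crawler.accept_files")
    if ext not in as_list(config, "crawler.non_html"):
        problems.append(f"{ext} is missing from crawler.non_html")
    return problems


if __name__ == "__main__":
    sample = "filter.tika.types=pdf,epub\ncrawler.reject_files=zip\ncrawler.non_html=pdf\n"
    for problem in web_collection_problems(parse_config(sample), "epub"):
        print(problem)
```

The same pattern applies to filecopy and trimpush collections by checking filecopy.filetypes or trim.extracted_file_types instead.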

v15.16.0