Classifying textual vs. non-textual documents

Funnelback can be configured to guess, during a crawl, whether each document contains textual content. The result can then be displayed in the content auditor.

Caveats

Plain text documents don't allow HTML-style metadata to be attached. These documents will show up as (No Value) in the 'Textual vs Non-Textual' facet.

To set up Textual vs. Non-textual detection on a collection, a few separate parts of Funnelback need to be configured:

Filter chain

Add TextDetectionFilterProvider to the filter chain by editing the collection configuration.

It must come late enough in the chain that text conversion has already happened, i.e. after the group in which TikaFilterProvider sits. Unless you have a specific reason not to, put it at the end of the chain, preceded by a colon (:).
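
As an illustration, the resulting line in collection.cfg might look like the sketch below. It assumes the chain is held in the filter.classes setting and that the existing value matches the one shown. Both are assumptions, so keep whatever filters your collection already lists and only append the new entry.

```
# Assumed existing chain (yours may differ), with the new filter appended after a colon
filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:TextDetectionFilterProvider
```

In this sketch TextDetectionFilterProvider is the last step, so it runs after the group containing TikaFilterProvider has converted the document to text.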

Metadata mappings

Add the following field:

Class name: T
Class type: text
Search behaviour: display only
Metadata source (HTML type): X-Funnelback-Textual

Check that (capital) 'T' isn't already used. If it is, choose a different letter and use that letter wherever T appears in the remaining steps.

collection.cfg

Add the following setting:

ui.modern.content-auditor.facet-metadata.T=Textual

Faceted navigation

Click Customise -> customise faceted navigation.

Click 'Add Facet'.
On the new facet:
- Fill in the name of the new facet, e.g. Textual vs Non-Textual
Click 'Add Category'.
On the new category:
- Set the dropdown to 'Metadata field fill'
- Fill in the text box with T
Hit save.

Update necessary

Adding TextDetectionFilterProvider to the filter chain requires a full update.
The other steps only need a re-index.

After updating, there should be a new 'Textual vs Non-Textual' widget in the content auditor overview page.
