
Classifying textual vs. non-textual documents

Funnelback can be configured to guess, during a crawl, whether each document contains textual content, and to display the result later in the content auditor.

Caveats

  • Plain text documents don't allow HTML-style metadata to be attached. These documents will show up as (No Value) in the 'Textual vs Non-Textual' facet.

To set up Textual vs. Non-textual detection on a collection, a few separate parts of Funnelback need to be configured:

Filter chain

Add TextDetectionFilterProvider to the filter chain (Administer -> Edit Collection Settings -> Workflow -> Filter classes).

It should be far enough back in the chain for text conversion to already have happened, i.e. after the group in which TikaFilterProvider sits. Unless you have a specific reason not to, put it at the end of the chain, preceded by a colon (:).
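As a rough illustration, the resulting filter classes setting might look like the line below. The key name filter.classes and the filters listed before TextDetectionFilterProvider are assumptions based on a typical default chain and will differ between collections; only the placement of the final entry, after a colon, is the point here.

```
filter.classes=TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:TextDetectionFilterProvider
```

In this sketch, the final filter runs only after the group containing TikaFilterProvider has finished, so it sees the converted text.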

Metadata mappings

Add the following field:

  • Class name: T
  • Class type: text
  • Search behaviour: display only
  • Metadata source (HTML type): X-Funnelback-Textual

Check that (capital) 'T' isn't already used. If it is, choose a different letter and use that letter in place of T in the steps below.

collection.cfg

Add the following setting:

ui.modern.content-auditor.facet-metadata.T=Textual

Faceted navigation

Click Customise -> customise faceted navigation

  • Click 'Add Facet'
  • On the new facet:
    • Fill in the name of the new facet, e.g. Textual vs Non-Textual
  • Click 'Add Category'
  • On the new category:
    • Set the dropdown to 'Metadata field fill'
    • Fill in the text box with T
  • Hit save

Update necessary

  • Adding TextDetectionFilterProvider to the filter chain requires a full update.
  • The other steps only need a re-index.

After updating, there should be a new 'Textual vs Non-Textual' widget in the content auditor overview page.

v15.18.0