Text Mining

Introduction

Text Mining in Funnelback extracts entities and their definitions from textual data. Named entities include person names, organisations, products and geographic locations, as well as acronyms. When a user queries one of these entities, its definition is displayed on the search results page, with a link back to the source document.

Configuration

  • To enable Text Mining, go to the Administration Home Page -> Administer Tab -> Edit Collection Settings -> "Interface" tab.
  • Enabling Text Mining adds the TextMiner filter to the filter chain configured in the filter.classes setting.
  • Entities and their definitions will be extracted from HTML and filtered files (PDF, Office formats) and inserted into a database for display at query time.
  • The suggestions are only displayed in the Modern UI.
  • You may need to increase the gather.max_heap_size setting to give the Text Mining process enough memory. A value of 1400 MB should be enough for most collections.
  • To prevent particular suggestions from being displayed, add the corresponding query trigger words to a text-miner-blacklist.cfg file.

User Interface

[Screenshot: Text-miner-john-doe.png]

The screenshot above shows a named entity (a person's name) and its associated definition. The entity is a hyperlink to the document from which the definition was taken. The FreeMarker tags which cause this to be displayed are:

    <#-- "??" is the current FreeMarker missing-value test; "?exists" is deprecated -->
    <#if response.entityDefinition??>
        <div class="textminer"><@fb.TextMiner></@fb.TextMiner></div>
    </#if>

This states that if the response contains an EntityDefinition object, it is displayed in a div with the class "textminer", and the "TextMiner" tags output the entity, the link and the definition.

Logging

Text Mining log messages are written to the file crawler.inline_filter.log. Messages showing which entities are being stored contain the label "Entity:", for example:

Entity: [Rss] JSON: {"nounPhrase":"Rss","sourceURL":"http://sample.com/","definition":"is a format for delivering regularly changing content via the web..."}
Entity: [Ors] JSON: {"nounPhrase":"Ors","sourceURL":"http://sample.com/","definition":"Order Routing System..."}

Here the entities "Rss" and "Ors" are being inserted into the database. Searching for these entities will cause their definitions to be displayed.
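When checking which entities a crawl has stored, it can help to pull these lines out of the log programmatically. The following is a minimal Python sketch, not part of Funnelback. It assumes each entry fits on a single line and follows exactly the "Entity: [name] JSON: {...}" layout shown above. The function name parse_entity_lines and the SAMPLE_LOG list are hypothetical, introduced only for illustration.

```python
import json
import re

# Matches lines of the form: Entity: [Rss] JSON: {...}
# Assumption: one entry per line, laid out as in the sample above.
ENTITY_LINE = re.compile(
    r"Entity: \[(?P<name>[^\]]*)\] JSON: (?P<payload>\{.*\})\s*$"
)


def parse_entity_lines(lines):
    """Return the decoded JSON record for every well-formed Entity line."""
    entities = []
    for line in lines:
        match = ENTITY_LINE.search(line)
        if match is None:
            continue  # not an Entity message
        try:
            record = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            continue  # skip truncated or malformed payloads
        entities.append(record)
    return entities


# Hypothetical sample input, copied from the log excerpt above.
SAMPLE_LOG = [
    'Entity: [Rss] JSON: {"nounPhrase":"Rss","sourceURL":"http://sample.com/",'
    '"definition":"is a format for delivering regularly changing content via the web..."}',
    'Entity: [Ors] JSON: {"nounPhrase":"Ors","sourceURL":"http://sample.com/",'
    '"definition":"Order Routing System..."}',
    "some unrelated crawler message",
]

if __name__ == "__main__":
    for record in parse_entity_lines(SAMPLE_LOG):
        print(record["nounPhrase"], "->", record["sourceURL"])
```

To run it against a real log, pass an open file handle for crawler.inline_filter.log instead of SAMPLE_LOG. Lines that are not Entity messages, or whose JSON payload fails to decode, are skipped.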
