Built-in filters: Inject no index filter (InjectNoIndexFilterProvider)

Introduction

This filter hides content from the Funnelback indexer by wrapping specific HTML elements in Funnelback noindex expressions (<!--noindex--> and <!--endnoindex-->). Documents are selected for processing by their URL, and the elements to wrap are designated using JSoup CSS-like selectors.

To configure this filter, define settings in your collection.cfg file using the prefix filter.noindex.X, where X is a number. Each entry goes on its own line and contains one URL pattern (a standard Java regular expression) followed by the CSS selectors to apply to matching documents. The URL pattern and the selectors are separated by a space; multiple selectors are separated by commas.
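The entry format described above can be sketched as follows. This is only an illustration in Python, not the filter's actual implementation: the function name parse_noindex_entry is hypothetical, and splitting on the first space is an assumption based on the description of the format.

```python
def parse_noindex_entry(value):
    """Split one filter.noindex.X value into (url_pattern, selectors)."""
    # Assumption: the first space ends the URL pattern and
    # everything after it is the comma-separated selector list.
    url_pattern, _, selector_part = value.strip().partition(" ")
    selectors = [s.strip() for s in selector_part.split(",") if s.strip()]
    return url_pattern, selectors


# Value taken from the filter.noindex.2 example on this page.
pattern, selectors = parse_noindex_entry(r"server\.com div.navigation,#footer")
print(pattern)
print(selectors)
```

Under that assumption, the URL pattern must not contain a literal space, which is consistent with the URL-encoding caveat on this page.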

Caveats

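  • Ensure that the selected regions are not nested inside one another. noindex expressions do not nest, so the end of an inner region re-enables indexing and the remainder of the outer region is not skipped as expected. See the final example below for further explanation.
  • Spaces in URL regular expressions must be URL encoded (%20), because a space separates the URL pattern from the selectors.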
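The second caveat can be illustrated with a short Python sketch. It assumes the entry is split on its first space, which is an inference from the entry format and not confirmed behaviour; the pattern and the helper names split_entry and encode_spaces are hypothetical.

```python
def split_entry(value):
    # Assumption: the first space separates the URL pattern from the selectors.
    pattern, _, selectors = value.partition(" ")
    return pattern, selectors


def encode_spaces(url_pattern):
    # Replace literal spaces with their URL-encoded form.
    return url_pattern.replace(" ", "%20")


raw_pattern = r"https://server\.com/.*/my folder/.*"  # hypothetical pattern
selectors = "input[type=text]"

# Unencoded: the pattern is cut short at the space inside the URL.
broken = split_entry(raw_pattern + " " + selectors)
# Encoded: the only space left is the separator.
fixed = split_entry(encode_spaces(raw_pattern) + " " + selectors)

print(broken)
print(fixed)
```

With the unencoded pattern, the part of the URL after the space ends up in the selector list, so neither the pattern nor the selectors are what was intended.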

Enabling

To enable the filter, add InjectNoIndexFilterProvider to the filter chain (the filter.classes setting in collection.cfg).

Example

The following lines have been added to collection.cfg:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:InjectNoIndexFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
filter.noindex.1=.* header,footer
filter.noindex.2=server\.com div.navigation,#footer
filter.noindex.3=page\?type=resource div.hidden
filter.noindex.4=https://server\.com/.*/long%20name/.* input[type=text]

This configuration will:

  • wrap the <header> and <footer> elements of every document in noindex expressions.
  • wrap every <div class="navigation"> and the element with id="footer" in noindex expressions, for all documents with server.com in their URL.
  • wrap every <div class="hidden"> in noindex expressions, for all documents named page whose URL has the value resource for the type parameter.
  • wrap every input of type text in noindex expressions, for all documents on https://server.com/ whose URL contains the folder with spaces in its name (URL encoded as %20).
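How the first three entries combine for a given URL can be sketched in Python. This is an approximation under stated assumptions: Python's re module stands in for Java regular expressions, the pattern is assumed to match anywhere in the URL (as the server\.com entry suggests), and the names RULES and selectors_for are hypothetical.

```python
import re

# The first three entries from the example configuration.
RULES = {
    "filter.noindex.1": r".* header,footer",
    "filter.noindex.2": r"server\.com div.navigation,#footer",
    "filter.noindex.3": r"page\?type=resource div.hidden",
}


def selectors_for(url):
    """Collect the selectors of every entry whose URL pattern matches."""
    selected = []
    for value in RULES.values():
        pattern, _, selector_part = value.partition(" ")
        # Assumption: a partial match anywhere in the URL is enough.
        if re.search(pattern, url):
            selected.extend(selector_part.split(","))
    return selected


print(selectors_for("http://server.com/home"))
print(selectors_for("http://server.com/path/to/page?type=resource"))
```

Note that a document can match several entries at once, so the selectors from all matching entries need to be checked together for nested regions.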

Example input

http://server.com/home

...
<div class="navigation">
  <p>Navigation lives here.</p>
</div>
...
<footer id="footer">
  <p>Footer lives here.</p>
</footer>
...

http://server.com/path/to/page?type=resource

...
<div class="hidden">
  <p>Secret hidden text lives here.</p>
</div>
...
<span class="hidden special">
  <p>Secret special hidden text lives here too.</p>
</span>
...

https://server.com/path/to/folder/long%20name/page

...
<input type="text" name="example" id="example" />
...

http://server.com/navexample

...
<footer id="footer">
  <div class="navigation">
    <p>Navigation lives here.</p>
  </div>

  <p>Footer lives here.</p>
</footer>
...

Example output

http://server.com/home

...
<!--noindex-->
<div class="navigation">
  <p>Navigation lives here.</p>
</div>
<!--endnoindex-->
...
<!--noindex-->
<footer id="footer">
  <p>Footer lives here.</p>
</footer>
<!--endnoindex-->
...

http://server.com/path/to/page?type=resource

...
<!--noindex-->
<div class="hidden">
  <p>Secret hidden text lives here.</p>
</div>
<!--endnoindex-->
...
<span class="hidden special">
  <p>Secret special hidden text lives here too.</p>
</span>
...

https://server.com/path/to/folder/long%20name/page

...
<!--noindex-->
<input type="text" name="example" id="example" />
<!--endnoindex-->
...

http://server.com/navexample

Note: this example illustrates why nested noindex regions do not work correctly. The noindex and endnoindex markers operate like switches rather than matched pairs: indexing ceases when a <!--noindex--> is encountered and recommences at the next <!--endnoindex-->, regardless of nesting. In this example everything after the first <!--endnoindex--> is indexed, including "Footer lives here.", even though it sits inside the footer that was meant to be hidden.

...
<!--noindex-->
<footer id="footer">
<!--noindex-->
  <div class="navigation">
    <p>Navigation lives here.</p>
  </div>
<!--endnoindex-->

  <p>Footer lives here.</p>
</footer>
<!--endnoindex-->
...
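The switch behaviour described in the note can be simulated with a small Python sketch. It is a simplified model for illustration only, not the Funnelback indexer; the function name indexed_text is hypothetical.

```python
import re

NOINDEX = "<!--noindex-->"
ENDNOINDEX = "<!--endnoindex-->"


def indexed_text(html):
    """Return the text still indexed when the markers act as on/off switches."""
    indexing = True
    visible = []
    # Split on the two markers, keeping them in the token stream.
    for token in re.split(r"(<!--noindex-->|<!--endnoindex-->)", html):
        if token == NOINDEX:
            indexing = False
        elif token == ENDNOINDEX:
            indexing = True
        elif indexing:
            visible.append(token)
    # Drop the remaining tags and collapse whitespace for readability.
    text = re.sub(r"<[^>]+>", " ", "".join(visible))
    return " ".join(text.split())


nested = """
<!--noindex-->
<footer id="footer">
<!--noindex-->
  <div class="navigation">
    <p>Navigation lives here.</p>
  </div>
<!--endnoindex-->

  <p>Footer lives here.</p>
</footer>
<!--endnoindex-->
"""

print(indexed_text(nested))
```

In this model the footer text survives because the inner <!--endnoindex--> switches indexing back on before the footer has ended.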

v15.16.0