Built-in filters: Inject no index filter (InjectNoIndexFilterProvider)
Introduction
This filter hides content from the Funnelback indexer by wrapping specific HTML elements in Funnelback noindex expressions. The documents to process are selected by their URL, and the elements to wrap are designated using JSoup CSS-like selectors.
To configure this filter, define new settings in your collection.cfg file using the prefix filter.noindex.X. Each entry takes one line and contains one URL pattern (a standard Java regular expression) followed by the corresponding CSS selectors. The URL pattern and the selectors are separated by a space.
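The structure of an entry can be sketched in Python as follows. This is only an illustration of the format described above, not the filter's actual parsing code. The function name parse_noindex_entry is hypothetical, and two behaviours are assumptions: that the value is split on the first space, and that commas separate the individual selectors.

```python
def parse_noindex_entry(value):
    """Split one filter.noindex.X value into (url_regex, [selectors]).

    Hypothetical helper: assumes the first space separates the URL
    pattern from the selectors, and that commas separate selectors.
    """
    url_regex, _, selector_part = value.strip().partition(" ")
    selectors = [s.strip() for s in selector_part.split(",") if s.strip()]
    return url_regex, selectors


url_regex, selectors = parse_noindex_entry(r"server\.com div.navigation,#footer")
print(url_regex)   # the URL pattern part of the entry
print(selectors)   # the selectors part of the entry, as a list
```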
Caveats
- Ensure that the selected regions are not nested, because nested hidden regions will not be skipped as expected. See the final example below for further explanation.
- URL regular expressions containing spaces must have each space URL encoded (%20).
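A short Python sketch of why the space must be encoded. It assumes, based on the format described above, that the first space in an entry ends the URL pattern. The helper name encode_spaces is hypothetical and is not part of Funnelback.

```python
def encode_spaces(url_regex):
    """Replace literal spaces in a URL pattern with %20 (hypothetical helper)."""
    return url_regex.replace(" ", "%20")


raw_pattern = r"https://server\.com/.*/folder with spaces/.*"
selectors = "input[type=text]"

# Without encoding, the pattern is cut at its first space (assumed behaviour).
broken_pattern = (raw_pattern + " " + selectors).partition(" ")[0]

# With encoding, the whole pattern survives the split.
encoded_pattern = encode_spaces(raw_pattern)
intact_pattern = (encoded_pattern + " " + selectors).partition(" ")[0]

print(broken_pattern)
print(intact_pattern)
```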
Enabling
To enable the filter, add InjectNoIndexFilterProvider to the filter chain.
Example
The following lines have been added to collection.cfg:
filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:InjectNoIndexFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
filter.noindex.1=.* header,footer
filter.noindex.2=server\.com div.navigation,#footer
filter.noindex.3=page\?type=resource div.hidden
filter.noindex.4=https://server\.com/.*/folder%20with%20spaces/.* input[type=text]
This configuration will:
- cause all documents' <header> and <footer> tags to be wrapped in noindex expressions.
- cause all documents with server.com in their URL to have every <div class="navigation"> and the element with id="footer" wrapped in noindex expressions.
- cause all documents named page with a URL containing the value resource for the type parameter to have their <div class="hidden"> wrapped in noindex expressions.
- cause all documents on https://server.com/ that have a folder with spaces in their URL to have their inputs of type text wrapped in noindex expressions.
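The rule selection described above can be approximated in Python. This sketch assumes that a URL pattern applies when it matches anywhere in the URL, which is inferred from the server\.com rule and is not confirmed here. It uses Python's re module as a stand-in for Java regular expressions, which behaves the same way for these simple patterns. The function name selectors_for and the sample URLs in the usage lines are illustrative only.

```python
import re

# The four filter.noindex.X entries from the example configuration.
RULES = {
    1: (r".*", "header,footer"),
    2: (r"server\.com", "div.navigation,#footer"),
    3: (r"page\?type=resource", "div.hidden"),
    4: (r"https://server\.com/.*/folder%20with%20spaces/.*", "input[type=text]"),
}


def selectors_for(url):
    """Return the selectors of every rule whose pattern is found in the URL."""
    return [
        selector
        for _, (pattern, selector) in sorted(RULES.items())
        if re.search(pattern, url)  # assumption: "contains" style matching
    ]


for url in (
    "http://server.com/home",
    "http://server.com/path/to/page?type=resource",
    "https://server.com/docs/folder%20with%20spaces/form.html",
):
    print(url, selectors_for(url))
```

Under these assumptions more than one rule can apply to the same document, so overlapping selectors are worth checking for the nesting caveat.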
Example input
http://server.com/home
...
<div class="navigation">
<p>Navigation lives here.</p>
</div>
...
<footer id="footer">
<p>Footer lives here.</p>
</footer>
...
http://server.com/path/to/page?type=resource
...
<div class="hidden">
<p>Secret hidden text lives here.</p>
</div>
...
<span class="hidden special">
<p>Secret special hidden text lives here too.</p>
</span>
...
https://server.com/path/to/folder/long%20name/page
...
<input type="text" name="example" id="example" />
...
http://server.com/navexample
...
<footer id="footer">
<div class="navigation">
<p>Navigation lives here.</p>
</div>
<p>Footer lives here.</p>
</footer>
...
Example output
http://server.com/home
...
<!--noindex-->
<div class="navigation">
<p>Navigation lives here.</p>
</div>
<!--endnoindex-->
...
<!--noindex-->
<footer id="footer">
<p>Footer lives here.</p>
</footer>
<!--endnoindex-->
...
http://server.com/path/to/page?type=resource
...
<!--noindex-->
<div class="hidden">
<p>Secret hidden text lives here.</p>
</div>
<!--endnoindex-->
...
<span class="hidden special">
<p>Secret special hidden text lives here too.</p>
</span>
...
https://server.com/path/to/folder/long%20name/page
...
<!--noindex-->
<input type="text" name="example" id="example" />
<!--endnoindex-->
...
http://server.com/navexample
Note: this example illustrates why nested filter.noindex rules do not work correctly. In this example everything after the first <!--endnoindex--> will be indexed. Recall that noindex and endnoindex operate like switches: when a <!--noindex--> is encountered indexing ceases, and when an <!--endnoindex--> is encountered indexing recommences.
...
<!--noindex-->
<footer id="footer">
<!--noindex-->
<div class="navigation">
<p>Navigation lives here.</p>
</div>
<!--endnoindex-->
<p>Footer lives here.</p>
</footer>
<!--endnoindex-->
...
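The switch behaviour described in the note can be simulated with a few lines of Python. This is a simplified model of the behaviour as stated above, not the indexer's implementation. The function name indexed_text is hypothetical, and tag stripping is done with a crude regular expression purely for readability.

```python
import re


def indexed_text(html):
    """Return the text that stays visible when the markers act as on/off switches."""
    indexing = True
    kept = []
    for token in re.split(r"(<!--noindex-->|<!--endnoindex-->)", html):
        if token == "<!--noindex-->":
            indexing = False      # switch off, regardless of nesting depth
        elif token == "<!--endnoindex-->":
            indexing = True       # switch on, even if an outer region is still open
        elif indexing:
            kept.append(token)
    text = re.sub(r"<[^>]+>", " ", "".join(kept))  # crude tag removal
    return " ".join(text.split())


nested = """<!--noindex-->
<footer id="footer">
<!--noindex-->
<div class="navigation">
<p>Navigation lives here.</p>
</div>
<!--endnoindex-->
<p>Footer lives here.</p>
</footer>
<!--endnoindex-->"""

print(indexed_text(nested))  # prints "Footer lives here."
```

In this model the footer text becomes visible again after the inner region closes. This is why a rule set should select only the outermost element when regions would otherwise nest.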