Built-in filters: Metadata normaliser filter (MetadataNormaliser)
Introduction
The metadata normaliser filter can be used to clean and normalise metadata values. Normalisation is particularly useful for faceted navigation, allowing similar categories to be merged into a single category.
The filter processes HTML meta tags (<meta name="key" content="value">) and tests each value against a list of regular expressions. When a value matches an expression, it is replaced.
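To make the input side concrete, the following is a minimal Python sketch of collecting the name/content pairs that a filter of this kind would inspect. It is not the filter's actual implementation; the names MetaCollector and collect_meta are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Only tags carrying both a name and a content attribute are relevant.
        if name is not None and content is not None:
            self.pairs.append((name, content))


def collect_meta(html):
    """Return every (name, content) pair found in the given HTML text."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs
```

Each collected content value is a candidate for normalisation; tags without a name or content attribute (such as a charset declaration) are ignored in this sketch.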
Enabling
To enable the filter, add MetadataNormaliser to the filter chain, where <default_filter_chain> is the default value:
filter.classes=<default_filter_chain>:MetadataNormaliser
Configuring the metadata normaliser filter
The metadata keys to normalise must be listed in collection.cfg, using the following key:
filter.md_normaliser.keys=...
For example, to perform metadata normalisation on <meta name="Author" ... > and <meta name="Publisher" ... >, this value would be set to:
filter.md_normaliser.keys=author,publisher
Keys are case-insensitive. Any key name can be used, but the recommended practice is to use the value of the meta tag's "name" attribute.
A corresponding mapping file must be defined for each key in
$SEARCH_HOME/conf/<collection>/md_normaliser.<key>.mapping
Example filename:
$SEARCH_HOME/conf/<collection>/md_normaliser.author.mapping
The first line in the mapping file is the <key> expression, e.g. author. The key is case-insensitive and is treated as a regular expression (so expressions like DC.Creator|Author are valid).
- Each following line must be <regex>=<replacement>
- Capture groups can be used (e.g. (.*)@domain.com=$1)
- Lines starting with # are considered comments
Regular expressions are tried in order, and processing of a value stops at the first expression that matches.
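The file layout described above can be sketched as a small Python parser. This is an illustration only, not the filter's own code: parse_mapping is a hypothetical name, and several details are assumptions that the documentation does not confirm, namely that blank lines are skipped, that the key expression is the first non-comment line, that each rule is split at its first = character, and that the key expression must match the whole meta name.

```python
import re


def parse_mapping(text):
    """Parse mapping-file text into (key_pattern, rules).

    Assumptions: blank lines are skipped, the key expression is the first
    non-comment line, and each rule line is split at its first '='.
    """
    key_pattern = None
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if key_pattern is None:
            # The key expression is a case-insensitive regular expression.
            key_pattern = re.compile(line.strip(), re.IGNORECASE)
            continue
        regex, _, replacement = line.partition("=")
        rules.append((regex, replacement))  # file order is preserved
    return key_pattern, rules
```

For instance, parsing a file whose first line is DC.Creator|Author yields a key pattern for which key_pattern.fullmatch("author") succeeds, and the rules list keeps the file's order so that they can later be tried first to last.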
Example
To normalise non-preferred values of Shakespeare and John Smith that may exist in Author and Creator metadata fields:
collection.cfg:
filter.md_normaliser.keys=author
md_normaliser.author.mapping:
Author|Creator
.*shakespeare.*=Shakespeare
[wW]\.?[sS]\.?=Shakespeare
[Ss]\.?[wW]\.?=Shakespeare
jsmith=John Smith
jack smith=John Smith
j\. smith=John Smith
johnny smith=John Smith
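The effect of this mapping can be approximated with a short Python sketch of first-match-wins replacement. It rests on assumptions the documentation does not state: that an expression must match the entire value, and that matching is case-sensitive. The function name normalise and the RULES list are hypothetical, and the $1 syntax is translated to Python's own group-reference form.

```python
import re

# Rules from the example mapping file, in file order.
RULES = [
    (r".*shakespeare.*", "Shakespeare"),
    (r"[wW]\.?[sS]\.?", "Shakespeare"),
    (r"[Ss]\.?[wW]\.?", "Shakespeare"),
    (r"jsmith", "John Smith"),
    (r"jack smith", "John Smith"),
    (r"j\. smith", "John Smith"),
    (r"johnny smith", "John Smith"),
]


def normalise(value, rules=RULES):
    """Return the replacement for the first rule that matches the whole value."""
    for regex, replacement in rules:
        match = re.fullmatch(regex, value)  # assumption: whole-value match
        if match:
            # Translate $1-style group references to Python's \g<1> form.
            template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
            return match.expand(template)
    return value  # no rule matched: leave the value unchanged


print(normalise("W.S."))      # second rule matches
print(normalise("j. smith"))  # sixth rule matches
print(normalise("Jane Doe"))  # no rule matches
```

Under these assumptions, a value such as William Shakespeare (capital S) would be left unchanged by the first rule, so whether the real filter matches case-insensitively or against part of the value is worth checking before relying on a mapping.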