Metadata scraper filter
Introduction
The metadata scraper filter is used to extract content out of HTML documents and inject it as metadata for the document.
Enabling
To enable the metadata scraper filter, add MetadataScraper to the filter.jsoup.classes list, where <default_jsoup_filters> is the default value.
filter.jsoup.classes=<default_jsoup_filters>,MetadataScraper
Configuration
The filter is configured via a separate file, metadata_scraper.json, which must reside in the collection configuration folder ($SEARCH_HOME/conf/$COLLECTION/metadata_scraper.json).
This file is in JSON format and contains a list of rules to apply to documents, depending on whether their URL matches a regular expression:
[{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "div.author-name",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "description": "Get author from DIV"
}, {
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "attributeName": "data-sku"
}]
Each rule is defined with the following attributes:
urlRegex
Regular expression to specify which documents the rule applies to. The URL of the document will be matched against this regular expression and the rule will be applied only if there's a match.
Note: Because this is a regular expression, special characters like . must be escaped with \. In addition, backslashes must themselves be escaped with \ in JSON, resulting in a double backslash (\\.). Without this escaping, . would mean "any character" in the regular expression syntax.
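The two levels of escaping can be checked outside the filter. The following is a minimal Python sketch, not the filter's own code. It assumes the pattern is searched anywhere in the URL (the exact matching semantics used by the filter are not stated here), it uses Python's re module as a stand-in for the filter's regular expression engine, and the sample URLs and the matches helper are made up for illustration.

```python
import json
import re

# Rule text as it would appear in metadata_scraper.json (the raw string keeps
# the two backslashes exactly as typed in the file).
rule_text = r'{"urlRegex": "http://example\\.org/"}'

# JSON decoding consumes one level of escaping: \\. in the file becomes \.
escaped = json.loads(rule_text)["urlRegex"]

# The same pattern without any escaping, for comparison.
unescaped = "http://example.org/"

# Hypothetical URL where the dot position holds another character.
lookalike = "http://exampleXorg/page.html"


def matches(pattern, url):
    # Assumption: a partial match (search) is enough for the rule to apply.
    return re.search(pattern, url) is not None


print(escaped)                                   # pattern after JSON decoding
print(matches(escaped, "http://example.org/a"))  # literal dot is required
print(matches(escaped, lookalike))               # escaped dot rejects "X"
print(matches(unescaped, lookalike))             # unescaped dot accepts "X"
```

The last call shows the risk described in the note: without escaping, the rule would also apply to URLs where any other character stands in place of the dot.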
metadataName
This is the name of the resulting metadata that will be injected in the document. For example, if this is set to author, the following will be injected in the document:
<meta name="author" content="...">
If the rule yields multiple values, they will be injected separately:
<meta name="author" content="Shakespeare">
<meta name="author" content="Yeats">
elementSelector
This is a CSS selector that selects the HTML element from which the metadata content is extracted. For example, with the following HTML fragment:
<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>
And the following selector: div.author-name, the inner <div> would be selected for extraction.
applyIfNoMatch
This is a boolean which indicates whether the rule should be applied when the selector matches (false, the default) or when it doesn't match (true).
This is useful to inject metadata into documents that don't match a specific selector. For example:
{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productCategory",
  "elementSelector": "p.category",
  "applyIfNoMatch": true,
  "processMode": "constant",
  "value": "Default category"
}
With this rule, if a document doesn't contain a <p> tag with the category class, a productCategory metadata will be injected with the content Default category.
When applyIfNoMatch is set to true, the rule will only run when the elementSelector does not match. In the example above, if the document did contain a <p> tag with the category class, the productCategory metadata would not be set by this rule. For such a use case, it is recommended to pair this rule with another one that extracts the category.
Note: Setting "processMode": "constant" is also important here. Without it, the default processMode of regex will be applied, which won't match any content.
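The relationship between applyIfNoMatch and the selector result can be summarised in a small Python sketch. The function name rule_fires is hypothetical and only models the behaviour described above; it is not part of the filter.

```python
def rule_fires(selector_matched: bool, apply_if_no_match: bool = False) -> bool:
    """Return True when a rule should run, per the description above.

    apply_if_no_match=False (the default): run when the selector matches.
    apply_if_no_match=True: run only when the selector does not match.
    """
    return selector_matched != apply_if_no_match


# Hypothetical pair of rules for the productCategory example:
# one extracts p.category when present, the other supplies the default.
for has_category in (True, False):
    extract_runs = rule_fires(has_category, apply_if_no_match=False)
    default_runs = rule_fires(has_category, apply_if_no_match=True)
    print(has_category, extract_runs, default_runs)
```

In this model exactly one of the two rules fires for any document, which is why pairing an extracting rule with a default rule covers both cases.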
extractionType
This indicates how the content should be extracted from the selected element. Possible values are:
text: The textual content of the matching element will be extracted. If the element contains HTML, all the tags are stripped.
html: The raw HTML content of the matching element will be extracted.
attr: The value of an attribute of the matching element will be extracted. In this mode, attributeName must be provided.
For example, with the following HTML fragment:
<div class="product" data-sku="1234">
  <h1>Product title</h1>
  <p>Product description</p>
</div>
And the selector div.product:
text would result in the content Product title Product description being extracted.
html would result in the content <h1>Product title</h1> <p>Product description</p> being extracted.
attr, with attributeName: "data-sku", would result in 1234 being extracted.
attributeName
This specifies the name of the attribute to extract the value from when extractionType is set to attr. See the example in the previous section for details.
processMode and value
This indicates how the extracted content is processed. Possible values are:
regex: Apply a regular expression over the extracted content. The regular expression must contain a capture group (using ()) and each match will be injected as a separate metadata.
constant: Return a hard-coded string.
This setting works in conjunction with value, which indicates either the regular expression to apply or the hard-coded value to use.
This setting is optional. If it's not set, the complete extracted content is retained as-is.
For example, with the HTML fragment:
<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>
And the rule:
{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "div.author-name",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "processMode": "regex",
  "value": "(\\S+)"
}
This would result in two metadata entries being injected, author=William and author=Shakespeare, because the regular expression (\\S+) yields two matches.
If processMode were set to constant and value to Yeats, a single metadata author=Yeats would have been injected.
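The effect of the two modes on the extracted text can be approximated in Python. This is a sketch under assumptions: the helper name process is hypothetical, Python's re module stands in for the filter's regular expression engine, and each match contributes the content of its first capture group.

```python
import re


def process(extracted, process_mode, value):
    """Model of processMode: return the list of metadata values to inject."""
    if process_mode == "regex":
        # One value per match, taken from the first capture group.
        return [m.group(1) for m in re.finditer(value, extracted)]
    if process_mode == "constant":
        # The extracted content is ignored; the hard-coded value is used.
        return [value]
    raise ValueError(f"unknown processMode: {process_mode}")


# After JSON decoding, the rule value "(\\S+)" is the pattern (\S+).
print(process("William Shakespeare", "regex", r"(\S+)"))
print(process("William Shakespeare", "constant", "Yeats"))
```

The first call returns one value per run of non-whitespace characters, mirroring the two author entries described above; the second returns the single constant value.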
description
This attribute is used to add a comment to the rule. It is optional and is not used when applying the rule.
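Since the rules live in a hand-edited JSON file, a quick sanity check before deploying can catch common mistakes. The following Python sketch is not part of the product: the function check_rules is hypothetical, and its checks only encode constraints stated in this document (attr extraction requires attributeName, urlRegex must be a valid regular expression, and a regex processMode needs a capture group in value). Python's regular expression syntax may differ in places from the one the filter uses.

```python
import json
import re


def check_rules(text):
    """Return a list of problems found in the text of metadata_scraper.json."""
    problems = []
    for i, rule in enumerate(json.loads(text)):
        # Use the optional description as a label when it is present.
        label = rule.get("description") or f"rule {i}"
        try:
            re.compile(rule.get("urlRegex", ""))
        except re.error as exc:
            problems.append(f"{label}: invalid urlRegex ({exc})")
        if rule.get("extractionType") == "attr" and "attributeName" not in rule:
            problems.append(f"{label}: attr extraction needs attributeName")
        if rule.get("processMode") == "regex":
            try:
                if re.compile(rule.get("value", "")).groups < 1:
                    problems.append(f"{label}: value needs a capture group")
            except re.error as exc:
                problems.append(f"{label}: invalid value ({exc})")
    return problems


# Sample rule with attr extraction but no attributeName.
sample = r'''[{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "extractionType": "attr"
}]'''
for problem in check_rules(sample):
    print(problem)
```

For the sample above, the only reported problem is the missing attributeName.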