Metadata scraper filter

Introduction

The metadata scraper filter is used to extract content out of HTML documents and inject it as metadata for the document.

Enabling

To enable the metadata scraper filter, add MetadataScraper to the filter.jsoup.classes list. In the following line, <default_jsoup_filters> stands for the default value of that setting:

filter.jsoup.classes=<default_jsoup_filters>,MetadataScraper

Configuration

The filter is configured via a separate file, metadata_scraper.json, which must reside in the collection configuration folder ($SEARCH_HOME/conf/$COLLECTION/metadata_scraper.json). This file is in JSON format and contains a list of rules. Each rule is applied to a document only if the document's URL matches the rule's regular expression:

[{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "div.author-name",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "description": "Get author from DIV"
}, {
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "attributeName": "data-sku"
}]
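Because a malformed rules file is easy to produce by hand, a quick sanity check before deployment can help. The following is a minimal sketch, not part of the filter itself: check_rules is a hypothetical helper, and it compiles urlRegex with Python's regex engine, which is only an approximation of whatever engine the filter uses. The only constraint it enforces beyond valid JSON is the one stated for extractionType below (attr requires attributeName).

```python
import json
import re


def check_rules(config_text):
    """Return a list of problems found in a metadata_scraper.json document."""
    rules = json.loads(config_text)
    if not isinstance(rules, list):
        return ["top-level value must be a list of rules"]

    problems = []
    for index, rule in enumerate(rules):
        # Documented constraint: attr extraction needs an attribute name.
        if rule.get("extractionType") == "attr" and not rule.get("attributeName"):
            problems.append(f"rule {index}: extractionType attr requires attributeName")
        # Assumption: Python's re is close enough to catch obvious regex typos.
        try:
            re.compile(rule.get("urlRegex", ""))
        except re.error as exc:
            problems.append(f"rule {index}: urlRegex does not compile: {exc}")
    return problems


good = r'[{"urlRegex": "http://example\\.org/", "metadataName": "author", "elementSelector": "div.author-name", "extractionType": "text"}]'
bad = r'[{"urlRegex": "http://example\\.org/(", "metadataName": "productSku", "elementSelector": "p.sku", "extractionType": "attr"}]'

print(check_rules(good))
print(check_rules(bad))
```

The first call reports no problems; the second reports both the missing attributeName and the unbalanced parenthesis in urlRegex.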

Each rule is defined with the following attributes:

urlRegex

Regular expression to specify which documents the rule applies to. The URL of the document will be matched against this regular expression and the rule will be applied only if there's a match.

Note: Because this is a regular expression, special characters such as . must be escaped with a backslash (\.). In addition, backslashes must themselves be escaped in JSON, resulting in a double backslash (\\.). Without this escaping, . would mean "any character" in regular expression syntax.
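The effect of the two levels of escaping can be seen in a short Python sketch. It assumes that a partial match anywhere in the URL is sufficient (the source does not say whether the whole URL must match), and url_matches is a hypothetical helper used only for illustration.

```python
import json
import re

# The rule as it would appear in metadata_scraper.json (double backslash).
rule = json.loads(r'{"urlRegex": "http://example\\.org/"}')

# After JSON decoding the regex contains a single backslash: http://example\.org/
escaped = re.compile(rule["urlRegex"])

# The same pattern without escaping: "." now matches any character.
unescaped = re.compile("http://example.org/")


def url_matches(regex, url):
    # Assumption: a match anywhere in the URL counts.
    return regex.search(url) is not None


print(url_matches(escaped, "http://example.org/about"))    # literal dot matches
print(url_matches(escaped, "http://exampleXorg/about"))    # rejected by the escaped dot
print(url_matches(unescaped, "http://exampleXorg/about"))  # accepted by the unescaped dot
```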

metadataName

This is the name of the resulting metadata that will be injected into the document. For example, if this is set to author, the following will be injected into the document:

<meta name="author" content="...">

If the rule yields multiple values, they will be injected separately:

<meta name="author" content="Shakespeare">
<meta name="author" content="Yeats">

elementSelector

This is a CSS selector identifying the HTML element from which the metadata content is extracted. For example, with the following HTML fragment:

<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>

With the selector div.author-name, the inner <div> would be selected for extraction.

applyIfNoMatch

This is a boolean indicating whether the rule is applied when the selector matches (false, the default) or when it does not match (true).

This is useful for injecting metadata into documents that don't match a specific selector. For example:

{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productCategory",
  "elementSelector": "p.category",
  "applyIfNoMatch": true,
  "processMode": "constant",
  "value": "Default category"
}

With this rule, if a document doesn't contain a <p> tag with the category class, a productCategory metadata will be injected with the content Default category.

When applyIfNoMatch is set to true, the rule only runs when the elementSelector does not match. In the example above, if the document did contain a <p> tag with the category class, this rule would not set the productCategory metadata at all. For such a use case, it is recommended to pair this rule with another one that extracts the category.

Note: Setting "processMode": "constant" is also important here. Without it, the default processMode of regex will be applied and this won't match any content.
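The pairing recommended above can be modelled with a small Python sketch. It is a simplified illustration, not the filter's implementation: rule_applies and product_category are hypothetical helpers, the first rule in the list is an assumed companion rule, and selector matching is reduced to "the text of p.category, or None when the element is absent".

```python
import re


def rule_applies(selector_matched, apply_if_no_match=False):
    # false (default): apply when the selector matches.
    # true: apply only when the selector does not match.
    return selector_matched != apply_if_no_match


rules = [
    # Assumed companion rule: extracts the category when p.category exists.
    {"metadataName": "productCategory", "elementSelector": "p.category",
     "applyIfNoMatch": False, "extractionType": "text",
     "processMode": "regex", "value": "(.+)"},
    # Fallback rule from the example above.
    {"metadataName": "productCategory", "elementSelector": "p.category",
     "applyIfNoMatch": True,
     "processMode": "constant", "value": "Default category"},
]


def product_category(category_text):
    """category_text: text of p.category, or None if the element is absent."""
    matched = category_text is not None
    values = []
    for rule in rules:
        if not rule_applies(matched, rule["applyIfNoMatch"]):
            continue
        if rule["processMode"] == "constant":
            values.append(rule["value"])
        else:
            values.extend(re.findall(rule["value"], category_text))
    return values


print(product_category("Garden tools"))  # only the extracting rule fires
print(product_category(None))            # only the fallback rule fires
```

In this model exactly one of the two rules fires for any document, so every matching URL ends up with a productCategory value.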

extractionType

This indicates how the content should be extracted from the selected element. Possible values are:

text: The textual content of the matching element will be extracted. If the element contains HTML, all the tags are stripped.
html: The raw HTML content of the matching element will be extracted.
attr: The value of an attribute of the matching element will be extracted. In this mode, attributeName must be provided.

For example, with the following HTML fragment:

<div class="product" data-sku="1234">
  <h1>Product title</h1>
  <p>Product description</p>
</div>

And the selector div.product:

text would extract the content Product title Product description
html would extract the content <h1>Product title</h1> <p>Product description</p>
attr, with attributeName set to data-sku, would extract 1234
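The three extraction types can be approximated in Python to make the difference concrete. This sketch uses the standard library's ElementTree rather than the HTML parser the filter relies on, so it only works on well-formed fragments; extract and squash are hypothetical helpers, and collapsing whitespace is an assumption made so that the results line up with the values listed above.

```python
import xml.etree.ElementTree as ET

fragment = """<div class="product" data-sku="1234">
  <h1>Product title</h1>
  <p>Product description</p>
</div>"""


def squash(text):
    # Collapse runs of whitespace (assumed normalization).
    return " ".join(text.split())


def extract(element, extraction_type, attribute_name=None):
    if extraction_type == "text":
        # All text nodes, tags stripped.
        return squash("".join(element.itertext()))
    if extraction_type == "html":
        # Inner markup: leading text plus each serialized child.
        inner = (element.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in element
        )
        return squash(inner)
    if extraction_type == "attr":
        return element.get(attribute_name)
    raise ValueError(f"unknown extractionType: {extraction_type}")


product = ET.fromstring(fragment)  # stands in for the selector div.product

print(extract(product, "text"))
print(extract(product, "html"))
print(extract(product, "attr", "data-sku"))
```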

attributeName

This specifies the name of the attribute to extract the value from when extractionType is set to attr. See the example in the extractionType section for details.

processMode and value

This indicates how the extracted content is processed. Possible values are:

regex: Apply a regular expression to the extracted content. The regular expression must contain a capture group (using ()), and each match will be injected as a separate metadata value.
constant: Return a hard-coded string.

This setting works in conjunction with value which indicates either the regular expression to apply, or the hard coded value to use.

This setting is optional. If it's not set, the complete extracted content is retained as-is.

For example with the HTML fragment:

<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>

And the rule:

{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "div.author-name",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "processMode": "regex",
  "value": "(\\S+)"
}

This would result in two metadata values being injected, author=William and author=Shakespeare, because the regular expression (\S+), written as (\\S+) in JSON, yields two matches.

If processMode were set to constant and value to Yeats, a single metadata author=Yeats would have been injected.
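The two modes can be reproduced with a few lines of Python. This is a sketch under stated assumptions: process is a hypothetical helper, Python's regex engine stands in for the filter's, and each match contributes the content of its first capture group.

```python
import json
import re


def process(extracted, process_mode, value):
    if process_mode == "constant":
        # The extracted content is ignored; the hard-coded value is returned.
        return [value]
    if process_mode == "regex":
        # One metadata value per match, taken from the first capture group.
        return [match.group(1) for match in re.finditer(value, extracted)]
    raise ValueError(f"unknown processMode: {process_mode}")


# "(\\S+)" in the JSON file decodes to the regular expression (\S+).
regex_rule = json.loads(r'{"processMode": "regex", "value": "(\\S+)"}')
constant_rule = {"processMode": "constant", "value": "Yeats"}

extracted = "William Shakespeare"
print(process(extracted, regex_rule["processMode"], regex_rule["value"]))
print(process(extracted, constant_rule["processMode"], constant_rule["value"]))
```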

description

This attribute is used to add a comment to the rule. It is optional and is not used when applying the rule.