External metadata

Introduction

External metadata is metadata which can be applied to pages in a web collection, without actually modifying the pages in any way. For example, it is possible to make all pages in a particular website match the query genre:comedy by adding a single line to the external metadata file.

www.example.org/ genre:comedy

Targets of the metadata are identified by their URL as supplied by the web server when the page was originally crawled. Note: The external metadata mechanism was designed for use only with collections containing URL data (i.e. web collections).

Metadata can be supplied for any of the allowable metadata classes, but take care when reusing reserved classes that have special behaviour (e.g. d is used for dates).

Activating external metadata

To activate the external metadata for a collection, create the file external_metadata.cfg in the collection's conf/ subdirectory. The file can be created through the Funnelback administration interface using the file-manager.

When indexing commences, external_metadata.cfg is checked for validity and data structures are set up to enable efficient lookup. If an error is detected, an error message is printed, scanning of the file stops and the documents are indexed without external metadata. (To see the message, open Step-Index.log using the "Browse Log Files" option in the administration interface.)

External metadata file format

The external metadata file must be a text file delimited into lines by linefeed (\n, hex 0x0A) characters. Each line consists of a URL-prefix followed by a list of metadata elements which apply to all URLs which start with that prefix (unless overridden by a more specific URL-prefix).

URL prefixes must include a full hostname. It is permissible to commence the prefix with "http://". If no protocol is specified then "http://" is assumed.

Each metadata element consists of a metadata field specifier, followed by a colon, followed by either a single word or a string of text in double quotes. Metadata elements are separated by whitespace. Punctuation should only be present within quoted strings, e.g. genre:"Historical Drama" director:Costner. Metadata supplied in the t class, e.g. t:"example title", will be used as the document title.
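
The line format described above can be illustrated with a small parser. This is only a sketch of the documented syntax, not the indexer's own code; the function name parse_line and the regular expression are assumptions made for illustration, and the sketch assumes a non-empty line.

```python
import re

# One metadata element: class name, colon, then a quoted string or a bare word.
ELEMENT = re.compile(r'(\S+?):(?:"([^"]*)"|(\S+))')

def parse_line(line):
    """Split one line into (url_prefix, [(metadata_class, value), ...])."""
    parts = line.split(None, 1)          # prefix, then the rest of the line
    prefix = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    if "://" not in prefix:
        prefix = "http://" + prefix      # no protocol given, so http:// is assumed
    elements = []
    for match in ELEMENT.finditer(rest):
        quoted, bare = match.group(2), match.group(3)
        elements.append((match.group(1), quoted if quoted is not None else bare))
    return prefix, elements

print(parse_line('www.example.org/historical-drama/ genre:"Historical Drama" director:Costner'))
```

Only the metadata portion of the line is scanned for elements, so the colon in an explicit "http://" prefix is not mistaken for a field separator.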

Here is an example of an external metadata file:

www.example.org publisher:"Movies Inc."
www.example.org/comedy/ genre:Comedy
www.example.org/historical-drama/ genre:"Historical Drama"
www.example.org/comedy/romance/ genre:Romance year:2012

These records have the effect that:

  1. Any page within the www.example.org site, e.g. www.example.org/movies/ or www.example.org/about.htm, will be indexed with the metadata "publisher" = "Movies Inc."
  2. Any page within www.example.org/comedy/, e.g. www.example.org/comedy/movies.htm or www.example.org/comedy/x/y/z.pdf, will be indexed with "genre" = "Comedy" and "publisher" = "Movies Inc."
  3. Any page within www.example.org/comedy/romance/, e.g. www.example.org/comedy/romance/movies.htm, will be indexed with the metadata "genre" = "Romance", "year" = "2012" and "publisher" = "Movies Inc.". The "Comedy" genre will not be inherited from the second line because it was overridden in the fourth line.

Note: If multiple lines in the metadata file start with an identical prefix, only the first will be effective.
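
The inheritance and override behaviour described above can be sketched as a prefix lookup. This is a minimal illustration that assumes plain string-prefix matching and per-class overriding. The helper names (normalise, build_table, lookup) are hypothetical and do not reflect how the indexer is actually implemented.

```python
def normalise(url):
    """Assume http:// when no protocol is given, as the format rules describe."""
    return url if "://" in url else "http://" + url

def build_table(records):
    """records: iterable of (prefix, {class: value}).
    When a prefix is repeated, only its first record is kept."""
    table = {}
    for prefix, metadata in records:
        table.setdefault(normalise(prefix), metadata)
    return table

def lookup(table, url):
    """Merge metadata from every matching prefix; longer prefixes override shorter ones."""
    url = normalise(url)
    merged = {}
    for prefix in sorted(table, key=len):   # shortest first, most specific last
        if url.startswith(prefix):
            merged.update(table[prefix])
    return merged

table = build_table([
    ("www.example.org", {"publisher": "Movies Inc."}),
    ("www.example.org/comedy/", {"genre": "Comedy"}),
    ("www.example.org/historical-drama/", {"genre": "Historical Drama"}),
    ("www.example.org/comedy/romance/", {"genre": "Romance", "year": "2012"}),
])
print(lookup(table, "www.example.org/comedy/romance/movies.htm"))
```

For the romance URL, the merged result carries the publisher from the first record and the genre and year from the fourth, with the "Comedy" genre replaced by the more specific prefix.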

Default page handling

URL default pages (e.g. index) are stripped from both the prefix given in the external metadata file and the URL being checked from the collection. When a default page has been stripped from a URL prefix, exact matching is applied to that prefix. Hence example.org/index would match any default page on example.org, but would not match other pages or subdirectories on example.org. A default page is any page called index, welcome, home, default or main followed by a dot and a three- or four-character extension (e.g. htm or html).
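
The default page definition above can be sketched as a regular expression. This is an illustrative reading of the rule, not the indexer's implementation. It assumes case-sensitive names, alphanumeric extensions and URLs without query strings; strip_default_page and prefix_matches are hypothetical names.

```python
import re

# index, welcome, home, default or main + dot + 3 or 4 character extension,
# appearing as the final path segment of the URL.
DEFAULT_PAGE = re.compile(r"(?<=/)(?:index|welcome|home|default|main)\.[A-Za-z0-9]{3,4}$")

def strip_default_page(url):
    """Remove a trailing default page name, leaving the directory URL."""
    return DEFAULT_PAGE.sub("", url)

def prefix_matches(prefix, url):
    """Exact match when the prefix named a default page, prefix match otherwise."""
    stripped_prefix = strip_default_page(prefix)
    stripped_url = strip_default_page(url)
    if stripped_prefix != prefix:
        return stripped_url == stripped_prefix
    return stripped_url.startswith(prefix)

print(strip_default_page("http://www.example.org/comedy/index.html"))
print(prefix_matches("http://example.org/index.html", "http://example.org/home.htm"))
print(prefix_matches("http://example.org/index.html", "http://example.org/about.htm"))
```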

Metadata mapping & types

Metadata classes from external metadata don't need to be defined in metamap.cfg or xml.cfg; they are created automatically. However, these automatically created classes are of type 0 (non-indexed content). If the type needs to be changed (to have the content indexed, or for numerical metadata), the metadata class needs to be mapped:

# external_metadata.cfg
http://www.example.org/ Genre:"Comedy"

# metamap.cfg
Genre,1,Genre

The indexer actually ignores the last column ...,Genre when the metadata comes from external metadata. Only the first column (metadata name) and second column (type) are taken into account. Because of this, and because the mapping above may conflict with HTML tag mapping (e.g. <meta name="Genre" content="Drama" />), it's recommended to use a dummy prefix in the metadata name, to indicate that the metadata comes from an external source:

Genre,1,EXTERNAL_METADATA_Genre
Year,3,EXTERNAL_METADATA_Year

External metadata date format

Dates must be specified as 8-digit integers in the format YYYYMMDD. The document's internal date is set using the d metadata class. Any other date field is treated as a string (unless configured otherwise in metamap.cfg or xml.cfg):

http://www.example.org/movies/2004/ d:20040101 maxDate:20041231

This results in the documents under http://www.example.org/movies/2004/ having an internal date of 20040101, and an additional metadata field "maxDate" = "20041231".
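
A value can be checked against the YYYYMMDD format before it is written to the file. The following is a minimal sketch; is_valid_date is a hypothetical helper, and whether the indexer also rejects impossible calendar dates (such as 20040230) is an assumption.

```python
from datetime import datetime

def is_valid_date(value):
    """Return True for an 8-digit YYYYMMDD string that names a real calendar date."""
    # The length check comes first because strptime also accepts one-digit months and days.
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True

for candidate in ("20040101", "20041231", "20040230", "2004-01-01"):
    print(candidate, is_valid_date(candidate))
```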

Profile-based external metadata

It is possible to create external_metadata.cfg files per profile, rather than in the collection conf directory. However, these files are not processed by the indexer. To have them processed, they need to be concatenated and written into conf/external_metadata.cfg prior to indexing.

This operation can be automated using the validation tool described below.

Validation and concatenation tool

mediator.pl provides a task to validate an external metadata file, as well as concatenating per-profile files into a single collection one. This task can process per-profile external metadata files, and concatenate only valid lines into the collection's conf/external_metadata.cfg file, discarding malformed lines.

The task name is ValidateExternalMetadata. It takes the following parameters:

collection

ID of the collection containing the external metadata file to validate or concatenate

mode

Either check to only check an existing file for errors, or concatenate to check and concatenate per-profile files

errorThreshold

Proportion of errors (expressed as a floating-point number between 0 and 1) to tolerate before exiting with an error code. For example, using errorThreshold=0.7 will tolerate up to 70% of errors in external metadata files before aborting. The default value is 0.

The common use case is to configure this task in the update workflow using a pre_index_command:

 pre_index_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/mediator.pl ValidateExternalMetadata collection=$COLLECTION_NAME mode=concatenate

If there are more errors than the configured threshold, the task will fail with a non-zero exit code, causing the update to fail before indexing.
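
The threshold check can be pictured as a simple ratio comparison. This sketch only illustrates the rule as described; how mediator.pl actually counts lines, and whether the ratio is computed per file or across all files, is not stated here and is assumed. The function name exceeds_threshold is hypothetical.

```python
def exceeds_threshold(error_lines, total_lines, error_threshold=0.0):
    """Return True when the share of malformed lines is above the tolerated threshold."""
    if total_lines == 0:
        return False                      # nothing to validate, nothing to fail on
    return error_lines / total_lines > error_threshold

print(exceeds_threshold(1, 100))          # default threshold of 0: any error fails
print(exceeds_threshold(70, 100, 0.7))    # exactly 70% is still tolerated
print(exceeds_threshold(71, 100, 0.7))    # more than 70% fails
```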

A log of the validation is produced under the collection log folder:

 $SEARCH_HOME/data/<collection>/log/external_metadata.cfg-validation.log

v15.12.0