Built-in filters: CSV to XML filter

Introduction

The CSVToXML filter converts CSV documents to multiple XML documents, where each record in the CSV results in one XML document.

Enabling

To enable the filter, add CSVToXML to the filter chain. Documents are only filtered if they have the MIME type text/csv or text/tab-separated-values. If your documents have a different type, you may need to write a custom filter to alter the document type.

If all documents being gathered are CSV documents, you can place the ForceCSVMime filter before this filter (e.g. ForceCSVMime:CSVToXML). It sets the MIME type of every document to CSV so that all documents are filtered as CSV.

Examples

To add the CSV to XML conversion to the existing filter chain:

filter.classes=<default_filter_chain>:CSVToXML

where <default_filter_chain> is the default value for filter.classes.

To force Funnelback to treat all documents as CSV and convert all entries to XML:

filter.classes=<default_filter_chain>:ForceCSVMime:CSVToXML

Downloading CSV on web collections

To allow the web crawler to download CSV documents, you may need to add csv to the crawler.non_html option.

Configuring the filter

Format

The filter generally assumes RFC 4180 formatting, except that blank lines are skipped. If your CSV is in another format, you can set filter.csv-to-xml.format.

You can instruct Funnelback to read the headers and use them as element names in the resulting XML by enabling filter.csv-to-xml.has-header in collection.cfg.

It is also possible to set a custom header by defining the element names in collection.cfg using filter.csv-to-xml.custom-header.

If a custom header is intended to overwrite an existing header in the file, filter.csv-to-xml.has-header should also be set to true.
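The interplay between the two header options can be sketched in Python. This is a minimal illustration of the behaviour described above, not the filter's actual implementation; the function resolve_header and its parameters are hypothetical names that stand in for filter.csv-to-xml.has-header and filter.csv-to-xml.custom-header.

```python
import csv
import io


def resolve_header(csv_text, has_header, custom_header=None):
    """Return (field_names, data_rows) for the given CSV text.

    Hypothetical helper: has_header and custom_header stand in for the
    has-header and custom-header settings; this is an assumption-based
    sketch, not Funnelback code.
    """
    # csv.reader yields an empty list for a blank line, so those are skipped.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header and rows:
        # The first row is a header, so it is never treated as a record.
        file_header, data_rows = rows[0], rows[1:]
    else:
        file_header, data_rows = None, rows
    # A custom header, when given, takes precedence over the file's header.
    field_names = custom_header if custom_header else file_header
    return field_names, data_rows
```

In this sketch, supplying a custom header without has_header leaves the file's own header row in data_rows, where it would be converted like any other record. That is the situation the has-header setting avoids when a custom header replaces an existing one.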

Tip: Element names are case sensitive, so take note of the case used in the header when mapping metadata classes.

URL template for resulting XML documents

You can change the template for the URLs used in the resulting XML documents by setting filter.csv-to-xml.url-template.

CSV to XML conversion example

The CSVToXML filter uses the field names from the CSV file when generating the XML for indexing. Any non-word characters found in the field names are converted to an underscore when generating the XML field names. The XML fields preserve the case of the CSV field names.

For example, the CSV file:

"First Name","Last Name","Role","Home Page"
"John","Smith","Plumber","http://directory/smith_john.html"
"Joe","Bloggs","Consultant","http://directory/bloggs_joe.html"
"Fred","Nerk","Teacher","http://directory/nerk_fred.html"

is converted to three XML documents of the following form (first document shown):

<csvFields>
  <First_Name>John</First_Name>
  <Last_Name>Smith</Last_Name>
  <Role>Plumber</Role>
  <Home_Page>http://directory/smith_john.html</Home_Page>
</csvFields>
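The conversion rules above can be approximated with a short Python sketch. It is illustrative only and is not the filter's implementation: the function names are hypothetical, the first row is assumed to be a header, and each non-word character is assumed to be replaced by its own underscore. The output is not indented the way the example document is.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# The sample CSV file from the example above.
SAMPLE = '''"First Name","Last Name","Role","Home Page"
"John","Smith","Plumber","http://directory/smith_john.html"
"Joe","Bloggs","Consultant","http://directory/bloggs_joe.html"
"Fred","Nerk","Teacher","http://directory/nerk_fred.html"
'''


def field_to_element_name(field_name):
    # Replace each non-word character with an underscore; case is preserved.
    return re.sub(r"\W", "_", field_name)


def csv_to_xml_documents(csv_text):
    """Return one XML string per CSV record (first row assumed to be a header)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]  # skip blank lines
    header = [field_to_element_name(name) for name in rows[0]]
    documents = []
    for row in rows[1:]:
        root = ET.Element("csvFields")
        for name, value in zip(header, row):
            ET.SubElement(root, name).text = value
        documents.append(ET.tostring(root, encoding="unicode"))
    return documents
```

Calling csv_to_xml_documents(SAMPLE) returns a list of three strings, one per record, each with a csvFields root element.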

The fields can be mapped to metadata using the normal rules for XML field mapping.

For example, the following XML configuration could be used to index the CSV fields:

The document URL is optional: documents are assigned a URL automatically when the file is split.

Metadata class configuration:

Metadata class name	Metadata class type	Source fields	Source type
firstName	text	//First_Name	XML
lastName	text	//Last_Name	XML
role	text	//Role	XML
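To check which source paths match a converted record, and to see why the case of element names matters, the lookup can be imitated in Python. This sketch assumes the standard library's ElementTree as a stand-in for the indexer's path matching, and extract_metadata is a hypothetical helper. ElementTree requires relative paths, so //First_Name is evaluated as .//First_Name.

```python
import xml.etree.ElementTree as ET

# Source paths copied from the metadata class table above.
METADATA_SOURCES = {
    "firstName": "//First_Name",
    "lastName": "//Last_Name",
    "role": "//Role",
}

# The first converted document from the conversion example.
DOCUMENT = """<csvFields>
  <First_Name>John</First_Name>
  <Last_Name>Smith</Last_Name>
  <Role>Plumber</Role>
  <Home_Page>http://directory/smith_john.html</Home_Page>
</csvFields>"""


def extract_metadata(xml_text, sources):
    """Map each metadata class name to the text of its source element, or None."""
    root = ET.fromstring(xml_text)
    metadata = {}
    for class_name, path in sources.items():
        # Prefix with "." because ElementTree only accepts relative paths.
        element = root.find("." + path)
        metadata[class_name] = element.text if element is not None else None
    return metadata
```

With the table's paths every class resolves to a value. A path that differs only in case, such as //first_name, matches nothing in this sketch and yields None.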

The following additional XML special configuration can optionally be set if one of the fields contains a URL that should be the target URL when the result for a row is clicked.

See: Document URL