XML Documents

Introduction

Funnelback can index XML documents. Some additional configuration files apply when indexing XML files.

Example: Electronic books

Suppose you have a number of XML files representing electronic books, similar to:

<book>
  <info>
    <title> The Adventures Of Sherlock Holmes </title>
    <author> Arthur Conan Doyle </author>
  </info>

  <contents>
    <chapter>A Scandal in Bohemia</chapter>
    <chapter>The Red-headed League</chapter>
...
    <chapter>The Adventure of the Copper Beeches</chapter>
  </contents>
</book>

Because the data consists of plain XML files, it needs no text conversion (unlike PDFs, for example), so you could use a local collection.

Metadata

To map this XML structure to metadata classes for the author (a), title (t) and chapters (x), create the xml.cfg file containing:

a,1,,//author
t,1,,//title
x,1,,/book/contents/chapter

When this data is indexed, the text from the matching elements is assigned to the specified metadata classes.
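To make the mapping concrete, the following Python sketch shows which text each XPath in xml.cfg would select from an abbreviated copy of the sample book. It only illustrates the XPath matching and is not how Funnelback itself processes xml.cfg. The function names are hypothetical, only the first field (metadata class) and last field (XPath) of each line are interpreted, and the XPaths are rewritten into the relative form that Python's ElementTree accepts.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample book (the elided chapters are left out).
BOOK_XML = """
<book>
  <info>
    <title> The Adventures Of Sherlock Holmes </title>
    <author> Arthur Conan Doyle </author>
  </info>
  <contents>
    <chapter>A Scandal in Bohemia</chapter>
    <chapter>The Red-headed League</chapter>
    <chapter>The Adventure of the Copper Beeches</chapter>
  </contents>
</book>
"""

XML_CFG = """
a,1,,//author
t,1,,//title
x,1,,/book/contents/chapter
"""


def parse_xml_cfg(text):
    """Return (metadata_class, xpath) pairs.

    Assumption: the first field is the metadata class and the fourth is the
    XPath; the two middle fields are not interpreted in this sketch.
    """
    mappings = []
    for line in text.strip().splitlines():
        fields = line.split(",", 3)
        mappings.append((fields[0], fields[3]))
    return mappings


def to_elementtree_path(xpath, root_tag):
    """Rewrite the two XPath forms used above into ElementTree's relative syntax."""
    if xpath.startswith("//"):
        return "." + xpath  # //author -> .//author
    prefix = "/" + root_tag + "/"
    if xpath.startswith(prefix):
        return "./" + xpath[len(prefix):]  # /book/contents/chapter -> ./contents/chapter
    raise ValueError("unsupported XPath in this sketch: " + xpath)


def extract_metadata(xml_text, cfg_text):
    """Collect the stripped text of every matching element, keyed by metadata class."""
    root = ET.fromstring(xml_text)
    metadata = {}
    for metadata_class, xpath in parse_xml_cfg(cfg_text):
        path = to_elementtree_path(xpath, root.tag)
        metadata[metadata_class] = [
            element.text.strip() for element in root.findall(path) if element.text
        ]
    return metadata


metadata = extract_metadata(BOOK_XML, XML_CFG)
for metadata_class, values in metadata.items():
    print(metadata_class, values)
```

Each chapter element matches the third XPath separately, so class x collects one value per chapter, while classes a and t each collect a single value.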

Presentation

Because this is a local collection, two configuration steps help present the XML:

  1. Create the template.xsl stylesheet to convert the XML into HTML.
  2. Change the collection's search forms to use the cache_url instead of the live_url.

For more details and caveats, see the XML and XSL section of the Cache Controller documentation.

Crawling XML Files

To crawl XML files, ensure that the crawler.parser.mimeTypes parameter includes text/xml among the MIME types the web crawler will accept.
