Funnelback can index XML documents, and there are some additional configuration files that apply when indexing XML files.
- You can map metadata classes to elements in the XML structure via xml.cfg.
- You can display cached copies of the document via XSLT processing.
Example: electronic books
Let's say you had a number of XML files representing electronic books, similar to:
```xml
<book>
  <info>
    <title>The Adventures Of Sherlock Holmes</title>
    <author>Arthur Conan Doyle</author>
  </info>
  <contents>
    <chapter>A Scandal in Bohemia</chapter>
    <chapter>The Red-headed League</chapter>
    ...
    <chapter>The Adventure of the Copper Beeches</chapter>
  </contents>
</book>
```
Because the data consists of plain XML files, it doesn't need any text conversion (unlike, say, PDFs), so you could use a local collection.
A corresponding xml.cfg might map metadata classes to elements like this:

```
a,1,,//author
t,1,,//title
x,1,,/book/contents/chapter
```
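Conceptually, the indexer walks each document and collects the text of the elements matched by each mapping. The following sketch illustrates that idea using Python's standard library as a stand-in for Funnelback's indexer; the metadata class names come from the mappings above, but the path handling here is a simplified assumption (ElementTree uses `.//` where xml.cfg writes `//`), not Funnelback's actual implementation.

```python
# Illustrative only: mimic the xml.cfg mappings by extracting element
# text into named metadata classes with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <info>
    <title>The Adventures Of Sherlock Holmes</title>
    <author>Arthur Conan Doyle</author>
  </info>
  <contents>
    <chapter>A Scandal in Bohemia</chapter>
    <chapter>The Red-headed League</chapter>
    <chapter>The Adventure of the Copper Beeches</chapter>
  </contents>
</book>
"""

# Metadata class -> ElementTree path (".//" stands in for xml.cfg's "//").
MAPPINGS = {
    "a": ".//author",
    "t": ".//title",
    "x": "./contents/chapter",
}

def extract_metadata(xml_text: str) -> dict:
    """Return {metadata_class: [element text, ...]} for one document."""
    root = ET.fromstring(xml_text)
    return {
        cls: [el.text.strip() for el in root.findall(path) if el.text]
        for cls, path in MAPPINGS.items()
    }

meta = extract_metadata(BOOK_XML)
print(meta["a"])      # ['Arthur Conan Doyle']
print(meta["t"])      # ['The Adventures Of Sherlock Holmes']
print(len(meta["x"])) # 3
```

Each chapter contributes a separate value to class `x`, which is why repeated elements can all be matched by a single mapping line.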
When this data is indexed, the text from these elements will be indexed and assigned to the specified metadata classes.
Because this is a local collection, there are a couple of configuration options that will help present the XML:
- Create the template.xsl stylesheet to convert the XML into HTML.
- Change the collection's search forms to use the cache_url instead of the live_url.
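To illustrate the first step, a minimal template.xsl for the book example might look like the following. This is a sketch, not a stylesheet shipped with Funnelback; the element names simply match the sample XML above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Render a <book> document as a simple HTML page. -->
  <xsl:template match="/book">
    <html>
      <head><title><xsl:value-of select="info/title"/></title></head>
      <body>
        <h1><xsl:value-of select="info/title"/></h1>
        <p>by <xsl:value-of select="info/author"/></p>
        <ol>
          <xsl:for-each select="contents/chapter">
            <li><xsl:value-of select="."/></li>
          </xsl:for-each>
        </ol>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

With this in place, requests for the cached copy of a document can be transformed into readable HTML rather than raw XML.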
For more details and caveats, see the XML and XSL section of the Cache Controller documentation.
Crawling XML files
To crawl XML files, ensure that the crawler.parser.mimeTypes parameter includes text/xml among the MIME types the web crawler will accept.
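For example, the setting might look like the following in the collection's configuration; the exact list of MIME types here is illustrative, so keep whatever types your collection already accepts and add text/xml to them.

```
crawler.parser.mimeTypes=text/html,text/plain,text/xml
```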