Filter example: change document type

Description

Filters can change the document type of a document. This should be done whenever a filter changes the type of a document, e.g. when converting from PDF to HTML. The document type can also be changed to correct a wrongly typed document, so that filters further down the chain run correctly. Although this example implements StringDocumentFilter, both ByteDocumentFilter and Filter can also be used to change the document type.

Example

This example inspects the document content and changes the document type to XML if the content looks like an XML document. It implements StringDocumentFilter, which requires two methods. The first is canFilter(), which here always returns ATTEMPT_FILTER because the document content must be inspected before we can decide whether the document can be skipped. The second is filterAsStringDocument(), which contains the logic for the filter.

This example also includes simple test methods, which can be executed by running the main method (see testing Groovy filters).

package com.myfilters;

import org.junit.Assert;
import org.junit.Test;
import com.funnelback.filter.api.*;
import com.funnelback.filter.api.documents.*;
import com.funnelback.filter.api.filters.*;
import com.funnelback.filter.api.mock.*;

@groovy.util.logging.Log4j2
public class FixingDocumentType implements StringDocumentFilter {

    @Override
    public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
        //Always attempt the filter.
        return PreFilterCheck.ATTEMPT_FILTER;
    }

    @Override
    public FilterResult filterAsStringDocument(StringDocument document, FilterContext context) {
        //Assume documents which start with an XML declaration,
        //e.g. <?xml version="1.0" encoding="UTF-8"?>, are XML documents.
        if(document.getContentAsString().trim().startsWith("<?xml ")) {
            //Change the document type to XML.
            StringDocument filteredDocument = document.cloneWithStringContent(DocumentType.MIME_XML_TEXT, 
                                                                                document.getContentAsString());
            
            return FilterResult.of(filteredDocument);
        }
        
        log.debug(document.getURI() + " does not appear to be an XML document.");
        
        //Return skipped so that a choice filter can try fixing the document with a different filter.
        return FilterResult.skipped();
    }
    
    /*
     * Below are filter test methods. 
     */
    public static class FilterTest {
        @Test
        public void fixXMLDocumentTypeTest() {
            //Create an input document where the content looks like XML but the document type
            //is unknown.
            StringDocument inputDoc = MockDocuments.mockEmptyStringDoc()
                                                    .cloneWithStringContent(DocumentType.MIME_UNKNOWN, 
                                                        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                                                        +"<foo>bar</foo>");
            FilterResult filterResult = new FixingDocumentType().filter(inputDoc, MockFilterContext.getEmptyContext());
            
            Assert.assertFalse("Filter should not have been skipped", filterResult.isSkipped());
            
            StringDocument filteredDoc = (StringDocument) filterResult.getFilteredDocuments().get(0);
            
            Assert.assertTrue("Document type should have been changed to xml", filteredDoc.getDocumentType().isXML());
        }
        
        @Test
        public void skipsNonXMLDocumentTest() {
            //Create a document which does not look like XML
            StringDocument inputDoc = MockDocuments.mockEmptyStringDoc()
                                                    .cloneWithStringContent(DocumentType.MIME_UNKNOWN, 
                                                        "This doesn't look like XML!");
            FilterResult filterResult = new FixingDocumentType().filter(inputDoc, MockFilterContext.getEmptyContext());
            
            Assert.assertTrue("Filter should have been skipped, as the document does not look like XML", 
                                filterResult.isSkipped());
        }
    }

    //Running the main method will execute the test methods.
    public static void main(String[] args) throws Exception {
        FilterTestRunner.runTests(FilterTest.class);
    }
}
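
The content check inside filterAsStringDocument() can also be tried on its own, without any Funnelback classes. The following is a minimal sketch in plain Java; the class name XmlSniffer and the method looksLikeXml() are hypothetical names used for illustration only and are not part of the filter API. It mirrors the trim() and startsWith("<?xml ") test from the filter above, with a null check added as an assumption.

```java
// Hypothetical helper for illustration; not part of the Funnelback filter API.
final class XmlSniffer {

    // Returns true when the content starts with an XML declaration,
    // ignoring leading whitespace. This mirrors the check in the filter above.
    static boolean looksLikeXml(String content) {
        if (content == null) {
            // Assumption: treat missing content as "not XML".
            return false;
        }
        return content.trim().startsWith("<?xml ");
    }

    public static void main(String[] args) {
        // XML declaration at the very start: prints true
        System.out.println(looksLikeXml("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<foo>bar</foo>"));
        // Leading whitespace is trimmed before the check: prints true
        System.out.println(looksLikeXml("  \n<?xml version=\"1.0\"?><foo/>"));
        // Plain text: prints false
        System.out.println(looksLikeXml("This doesn't look like XML!"));
        // Well-formed XML without a declaration is not detected: prints false
        System.out.println(looksLikeXml("<foo>bar</foo>"));
    }
}
```

The last case shows a limitation of this heuristic: XML content that omits the declaration is not recognised, so the filter would return skipped for it and leave the document type unchanged.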

v15.16.0