Funnelback Feeds

Introduction

Funnelback normally operates by information 'pull': it initiates data gathering operations itself (e.g. by starting a web crawl). The feeds mechanism provides an interface by which external systems can explicitly 'push' information into Funnelback collections without relying on a user to navigate through the administration interface.

This feature is especially useful to administrators whose new data arrives sporadically but is important enough that it must get into search indexes quickly (e.g. web-based marketplaces). It is also useful for developers of applications built on top of Funnelback, or wrappers around it.

Feeds Tasks

For simplicity, this section will refer to ‘resources’ whenever it is discussing web pages, files on a file share, database records, etc., and it will refer to collections of resources as ‘sites’ (even though these might be web sites, directories, databases, etc.).

The feeds feature will allow outside programs to perform the following tasks:

  • Add new content directly to existing indexes – This means feeding the content of a resource into Funnelback so that the resource appears in search results as belonging to the given URL. Note that this does not guarantee permanent storage of the data in the index; the data will only be present until the next full update.
  • Trigger the update of new or existing content within the index – This means indicating where Funnelback should go to get the content of the resource. Once again, this applies only until the next full update, when the content will be refreshed.
  • Remove existing content from indexes – This means deleting specific resources from the index at this time only. The resources may reappear later if they are re-gathered.
  • Add new sites to be permanently included within the gathering of existing collections – This lasts for the lifetime of the collection.
  • Remove sites permanently from the gathering of existing collections – This lasts for the lifetime of the collection.

Interface

The Feeds interface is a web-based interface that allows outside programs to post XML-based instructions describing actions that Funnelback should take to update its indexes. Each feed instruction set should be sent to your Funnelback server as a POST request with the query string feed= followed by the XML document (the actual data, not a filename) that contains the feed instructions. All feed instructions should be sent to:

https://<your funnelback server>:<your admin UI port>/search/admin/handle-feed.cgi

e.g.

https://search.company.com:8443/search/admin/handle-feed.cgi

Please note that because this interface is usually protected by a login and an encrypted channel, any application that uses feeds must also be able to log in and use the HTTPS protocol.
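As a rough illustration of the request described above, the following Python sketch builds (but does not send) a POST request to handle-feed.cgi. The function name build_feed_request, the credentials, and the choice of HTTP Basic authentication are assumptions made for this example; check how your own server's admin login works before relying on it. The sketch also assumes the feed parameter is sent form-encoded in the request body.

```python
import base64
import urllib.parse
import urllib.request


def build_feed_request(server, port, feed_xml, username, password):
    """Build (but do not send) a POST request carrying a feed document."""
    url = f"https://{server}:{port}/search/admin/handle-feed.cgi"
    # The XML document itself (not a filename) is the value of 'feed'.
    # Assumption: the parameter is form-encoded in the request body.
    body = urllib.parse.urlencode({"feed": feed_xml}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    # Assumption: the admin interface accepts HTTP Basic authentication.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


feed_xml = (
    "<funnelbackfeed><feed>"
    "<header><collection>mywebsite</collection></header>"
    '<group action="deleteResource">'
    '<resource url="http://mywebsite.com/anoldpage.html" />'
    "</group>"
    "</feed></funnelbackfeed>"
)
request = build_feed_request("search.company.com", 8443, feed_xml, "admin", "secret")
# To actually send the feed: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body before anything is pushed to a live server.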

Feed format

The feeds instruction format can be best described by looking at an example. The following example may look complicated, but it will be broken down into its elements further down.

 <funnelbackfeed>
  <feed>
    <header>
      <collection>mywebsite</collection>
    </header>

    <group action="addResource">
      <resource url="http://mywebsite.com/apage.html" />
      <resource url="http://mywebsite.com/anotherpage.html" type="inline">
        &lt;html&gt;
          &lt;head/&gt;
          &lt;body&gt;
            Some text
          &lt;/body&gt;
        &lt;/html&gt;
      </resource>
    </group>

    <group action="deleteResource">
      <resource url="http://mywebsite.com/anoldpage.html" />
      <resource url=... />
    </group>

    <group action="crawlSite">
      <site start_url="http://www.domain.com/seed.html" include_patterns="domain.com,secure.com" exclude_patterns="cgi-bin,calendar" />
      <site ... />
    </group>

    <group action="addSite">
      <site start_url="www.domain.com" include_patterns="domain.com,secure.com" exclude_patterns="" />
      <site ... />
    </group>

    <group action="removeSite">
      <site start_url="home.oldsite.com" include_patterns="oldsite.com" exclude_patterns="" />
      <site ... />
    </group>

    <group>
      ...
    </group>
  </feed>

  <feed>
    <header>
      <collection>another-name-here</collection>
    </header>

    <group>
      ...
    </group>
  </feed>

  <feed>
    ...
  </feed>

 </funnelbackfeed>

The first thing to note is that the feed document has a root element named 'funnelbackfeed', which contains one or more 'feed' elements:

  <feed>
    <header>
      <collection>mywebsite</collection>
    </header>
    <group action="...">
      ...
    </group>
    <group action="...">
      ...
    </group>
  </feed>

The 'collection' element contained within the 'header' element for each feed tells Funnelback which collection this feed should operate on. Other than the 'header', each feed contains one or more action groups.
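The nesting described above (root element, feed, header with collection, then action groups) can also be generated programmatically. The sketch below uses Python's standard xml.etree.ElementTree module; the helper names new_feed_document, add_feed and add_group are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET


def new_feed_document():
    """Create the empty 'funnelbackfeed' root element."""
    return ET.Element("funnelbackfeed")


def add_feed(root, collection):
    """Append a 'feed' for one collection and return it."""
    feed = ET.SubElement(root, "feed")
    header = ET.SubElement(feed, "header")
    # The 'collection' element names the collection the feed operates on.
    ET.SubElement(header, "collection").text = collection
    return feed


def add_group(feed, action):
    """Append an empty action group to a feed and return it."""
    return ET.SubElement(feed, "group", {"action": action})


root = new_feed_document()
feed = add_feed(root, "mywebsite")
group = add_group(feed, "deleteResource")
ET.SubElement(group, "resource", {"url": "http://mywebsite.com/anoldpage.html"})

# Serialize to a string suitable for sending as the feed document.
document = ET.tostring(root, encoding="unicode")
```

Calling add_feed more than once on the same root yields a document that targets several collections, as in the full example.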

   <group action="addResource">
     <resource url="http://mywebsite.com/apage.html" />
     <resource url="http://mywebsite.com/anotherpage.html" type="inline">
       &lt;html&gt;
         &lt;head/&gt;
         &lt;body&gt;
           Some text
         &lt;/body&gt;
       &lt;/html&gt;
     </resource>
   </group>

An 'addResource' action group contains a list of resources to add to the collection in this feed. There are two ways that resources can be specified here:

  • With a URL only - this will cause Funnelback to go out and fetch the resource from the given URL.
  • As an 'inline' resource - in this case, the application generating the feed instruction must supply the resource content as XML-escaped HTML, as shown in the above example. Note that specifying the URL is still mandatory for an inline resource, since Funnelback uses the URL as an identifier.

   <group action="deleteResource">
     <resource url="http://mywebsite.com/anoldpage.html" />
   </group>

A 'deleteResource' action group causes resources with the given URLs to be immediately removed from the search indexes.
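The 'addResource' and 'deleteResource' groups described above might be produced along the following lines. This is a minimal Python sketch, and resource_group is a hypothetical helper name. It relies on xml.etree.ElementTree escaping '<', '>' and '&' in element text on serialization, which yields the XML-escaped HTML that inline resources require.

```python
import xml.etree.ElementTree as ET


def resource_group(action, resources):
    """Build a 'group' element from (url, inline_html_or_None) pairs."""
    group = ET.Element("group", {"action": action})
    for url, inline_html in resources:
        # The URL is always required: Funnelback uses it as the identifier.
        resource = ET.SubElement(group, "resource", {"url": url})
        if inline_html is not None:
            resource.set("type", "inline")
            # Text assigned here is escaped when the element is serialized.
            resource.text = inline_html
    return group


add = resource_group("addResource", [
    ("http://mywebsite.com/apage.html", None),
    ("http://mywebsite.com/anotherpage.html",
     "<html><head/><body>Some text</body></html>"),
])
delete = resource_group("deleteResource", [
    ("http://mywebsite.com/anoldpage.html", None),
])

add_xml = ET.tostring(add, encoding="unicode")
delete_xml = ET.tostring(delete, encoding="unicode")
```

Letting the XML library do the escaping avoids hand-written replacements of '<' and '>' that are easy to get wrong, for example by forgetting to escape '&' first.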

   <group action="crawlSite">
     <site start_url="http://www.domain.com/seed.html" include_patterns="domain.com,secure.com" exclude_patterns="cgi-bin,calendar" />
   </group>

A 'crawlSite' action group causes Funnelback to immediately go out and crawl the specified website and add its contents to the collection in question. Each 'site' element in this group must contain a 'start_url' attribute. Including an 'include_patterns' attribute as well is highly recommended, and an 'exclude_patterns' attribute is optional.
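The attribute rules for 'site' elements can be captured in a small builder. In the Python sketch below, site_group is a hypothetical helper that rejects a missing start_url and joins pattern lists with commas, as in the examples. Omitting an attribute when no patterns are given is a choice made for this sketch; the examples in this document write empty attributes such as exclude_patterns="" instead.

```python
import xml.etree.ElementTree as ET


def site_group(action, start_url, include_patterns=(), exclude_patterns=()):
    """Build a site-based action group containing a single 'site' element."""
    if not start_url:
        raise ValueError("every 'site' element needs a start_url")
    attributes = {"start_url": start_url}
    # Patterns are comma-separated within a single attribute value.
    if include_patterns:
        attributes["include_patterns"] = ",".join(include_patterns)
    if exclude_patterns:
        attributes["exclude_patterns"] = ",".join(exclude_patterns)
    group = ET.Element("group", {"action": action})
    ET.SubElement(group, "site", attributes)
    return group


crawl = site_group(
    "crawlSite",
    "http://www.domain.com/seed.html",
    include_patterns=["domain.com", "secure.com"],
    exclude_patterns=["cgi-bin", "calendar"],
)
crawl_xml = ET.tostring(crawl, encoding="unicode")
```

Because the action name is a parameter, the same helper can produce the other site-based groups shown in the full example.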

   <group action="addSite">
     <site start_url="www.domain.com" include_patterns="domain.com,secure.com" exclude_patterns="" />
   </group>
   <group action="removeSite">
     <site start_url="home.oldsite.com" include_patterns="oldsite.com" exclude_patterns="" />
   </group>

The 'addSite' and 'removeSite' action groups allow you to control which sites constitute a collection's source data. Sites added or removed using this mechanism only affect the collection's content from the next crawl onwards. Note that this is entirely different from the 'crawlSite' group type, which affects the content of the collection only up until the next crawl.

Feeds tasks and collection types

Not all of the feed tasks will make sense for every collection type. The following is a table describing which tasks will be available for each collection type:

Task / Collection type      Web   Filecopy   Database   Local   Meta
Add new content             Y     Y          N          N       N
Add new content (inline)    Y     Y          Y          N       N
Remove existing content     Y     Y          Y          N       N
Update site                 Y     Y          N          N       N
Add Site                    Y     Y          N          N       N
Remove Site                 Y     Y          N          N       N
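An application that generates feeds may want to check this table before sending anything. The Python sketch below encodes the table as a lookup keyed by the task names used above; SUPPORTED and is_supported are hypothetical names, and the mapping from task names to specific group actions is deliberately left out because the table does not state it.

```python
# Collection types for which each task is marked 'Y' in the table.
SUPPORTED = {
    "Add new content": {"Web", "Filecopy"},
    "Add new content (inline)": {"Web", "Filecopy", "Database"},
    "Remove existing content": {"Web", "Filecopy", "Database"},
    "Update site": {"Web", "Filecopy"},
    "Add Site": {"Web", "Filecopy"},
    "Remove Site": {"Web", "Filecopy"},
}


def is_supported(task, collection_type):
    """Return True if the table marks the task as available for the type."""
    return collection_type in SUPPORTED.get(task, set())


inline_on_database = is_supported("Add new content (inline)", "Database")
fetch_on_database = is_supported("Add new content", "Database")
```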
