Using click data to improve rankings

Introduction

Result quality can be improved in some situations by utilising click data. Click data are records of which results users have clicked on in response to particular queries. The idea is that if users are selecting a particular result from a list of results, then this result is more likely to be an important resource than other resources.

Funnelback keeps a record of all click data against each collection, and this can be incorporated into the Funnelback ranking algorithms to improve result quality.

Including click data

Setting up Funnelback to take into account click data for your collections is a simple procedure that requires editing a small number of collection.cfg options. The main option is:

click_data.use_click_data_in_index

Set this option to "true" to include click data in the index, i.e. click_data.use_click_data_in_index=true.

Note: Click data is included in the indexing phase of updating a collection so it will only take effect after your collection has been updated. See updating collections for more details on updating your collections.

Once the main option has been set, the following options control which click data is used:

  • click_data.archive_dirs This option is a comma- or space-separated list of directories to look in for click data files. It should normally be set to the default archive directory and the default live logging directory: $SEARCH_HOME/data/<collection>/archive and $SEARCH_HOME/data/<collection>/live/log (for example, /opt/funnelback/data/mycollection/archive and /opt/funnelback/data/mycollection/live/log on Linux systems, or c:\funnelback\data\mycollection\archive and c:\funnelback\data\mycollection\live\log on Microsoft Windows systems).

  • click_data.num_archived_logs_to_use This option is a number indicating how many logs to use from each archive directory listed. For example, setting it to 5 means that click data from the last 5 logs in each archive directory is taken into consideration when calculating query results (typically each log covers the time between two collection updates). The option can also be set to all to indicate that every available click data log should be used.

  • click_data.week_limit This option, if set, limits the click data used to clicks that occurred in the previous n weeks, where n is the value of the option. It is useful for websites that change regularly, because it ensures that the click data used does not represent clicks on documents that have since been changed or moved.
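
As a rough illustration, the options above might be combined in collection.cfg as shown here. This is a sketch only: it assumes a Linux installation under /opt/funnelback and a collection named mycollection, and the values 5 and 12 are illustrative rather than recommended settings.

click_data.use_click_data_in_index=true
click_data.archive_dirs=/opt/funnelback/data/mycollection/archive,/opt/funnelback/data/mycollection/live/log
click_data.num_archived_logs_to_use=5
click_data.week_limit=12

With settings along these lines, the next update of the collection would be expected to draw on the last 5 logs in each listed directory and to ignore clicks older than 12 weeks.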

Click data and meta collections

Collections that are typically searched as part of a meta collection should alter their click_data.archive_dirs option to include the archive and live log directories of the meta collection that they are part of. For example, a collection named sub-collection that is part of a meta collection named super-collection might set its option as follows:

on Linux systems

click_data.archive_dirs=/opt/funnelback/data/super-collection/archive,/opt/funnelback/data/super-collection/live/log,/opt/funnelback/data/sub-collection/archive,/opt/funnelback/data/sub-collection/live/log

on Windows systems

click_data.archive_dirs=c:\funnelback\data\super-collection\archive,c:\funnelback\data\super-collection\live\log,c:\funnelback\data\sub-collection\archive,c:\funnelback\data\sub-collection\live\log

The reason for this is explained below.

When using a meta collection, it is normal to have a single meta collection being used to search across multiple non-meta collections (which contain the real information). In this case the click data is typically logged against the meta collection rather than the sub-collection from which the results come. That is, logs will appear in the $SEARCH_HOME/data/<meta-collection>/archive or $SEARCH_HOME/data/<meta-collection>/live/log directories.

To take this information into account when producing search results, the sub-collections must be able to find it when they build their indexes. For this reason, collections that are part of a meta collection should include the meta collection's archive (and live log) directories in their list of click data archive directories.

Weighting click data

Like all sources of new information, click data can have a varying degree of impact on the quality of your search results. It is important to weight the information appropriately for your collection in order to obtain the best results. This should be done as part of the normal tuning of Funnelback to suit your unique collection.

Click data can be weighted with the wmeta.K option, included either as part of a search URL or as part of the query_processor_options configuration parameter in the collection.cfg file. For example, to set the click data weight to 0.7 (the default is 0.5) you might include it in a search URL:

http://company.com/search/search.cgi?collection=mycollection&query=stuff&wmeta.K=0.7

or perhaps set a configuration option in collection.cfg:

query_processor_options= -wmeta.K=0.7

v15.16.0