This script processes a collection's data files, producing data reports on their contents.

<--collection "collection config"> [--log]
                                   [--datadir "data directory"]
                                   [--output "output directory"]
                                   [--hosts "host list file"]


  • The collection configuration file must be specified, and must be a filesystem path to an existing, readable and valid collection configuration file.
  • "--log" may also be specified, and indicates that the script should write to a log file.
  • "--plain" may also be specified, and indicates that the script should output plain HTML instead of Funnelback look and feel HTML.
  • "--datadir "data directory"" may also be specified, and gives the directory to report on.
  • "--output "output directory"" may also be specified, and gives the directory to write output to.
  • "--hosts "host list file"" may also be specified, and gives the location of a file on disk that organises sites / hosts into groups and subgroups.
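
The options listed above can be modelled with a small argument parser. The following is only a sketch of the option set in Python, not the script's own implementation; the function name build_parser and the example configuration path are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Produce data reports on a collection's data files."
    )
    # The collection configuration file is the only mandatory option.
    parser.add_argument("--collection", required=True, metavar="CONFIG")
    # Boolean switches: write a log file, emit plain HTML.
    parser.add_argument("--log", action="store_true")
    parser.add_argument("--plain", action="store_true")
    # Optional overrides for the data directory, output directory and host list.
    parser.add_argument("--datadir", metavar="DIR")
    parser.add_argument("--output", metavar="DIR")
    parser.add_argument("--hosts", metavar="FILE")
    return parser


# Hypothetical invocation: only --collection and --log are given.
args = build_parser().parse_args(["--collection", "/path/to/collection.cfg", "--log"])
print(args.collection, args.log, args.plain, args.datadir)
```

Options that are not supplied come back as None (or False for the switches), which makes it easy to fall back to the defaults described in the rest of this page.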

Function

The script runs over a data directory, recording statistics on the directory's contents, and outputs reports to HTML files.

The directory that the script runs over is specified by the collection configuration's data_root setting, or by the "--datadir" option. The collection data_root setting should point to data gathered by an update. The script will place output in $SEARCH_HOME/admin/data_report/<collection> by default, or in the directory specified by "--output".
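
The directory selection described above can be sketched as a small helper. This is an illustration under assumptions: the function name resolve_dirs is hypothetical, and it assumes that "--datadir" takes precedence over data_root when both are available, which the text implies but does not state outright.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def resolve_dirs(
    search_home: str,
    collection: str,
    data_root: str,
    datadir: Optional[str] = None,
    output: Optional[str] = None,
) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (data directory, output directory) for a report run."""
    # Assumption: an explicit --datadir overrides the configured data_root.
    data_dir = PurePosixPath(datadir) if datadir else PurePosixPath(data_root)
    # Default output location: $SEARCH_HOME/admin/data_report/<collection>.
    default_output = PurePosixPath(search_home) / "admin" / "data_report" / collection
    output_dir = PurePosixPath(output) if output else default_output
    return data_dir, output_dir


# Hypothetical values for SEARCH_HOME, the collection name and data_root.
data_dir, output_dir = resolve_dirs(
    "/opt/funnelback", "example", "/opt/funnelback/data/example/offline/data"
)
print(data_dir)
print(output_dir)
```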

If "--log" is specified, the script will write a log called crawl_data_report.log to the log directory beside the specified data directory. For example, if the data directory is /opt/funnelback/data/<collection>/offline/data/, the log file will be /opt/funnelback/data/<collection>/offline/log/crawl_data_report.log, and if the data directory is /tmp/my_own_gathered_stuff/, the log file will be /tmp/log/crawl_data_report.log.
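
The "log directory beside the data directory" rule can be expressed as a path computation. The helper below is a minimal sketch; the name log_file_for is hypothetical, and it simply takes the parent of the data directory and appends log/crawl_data_report.log.

```python
from pathlib import PurePosixPath

LOG_NAME = "crawl_data_report.log"


def log_file_for(datadir: str) -> PurePosixPath:
    """Return the log file path in the 'log' directory beside datadir."""
    # The sibling 'log' directory shares its parent with the data directory.
    return PurePosixPath(datadir).parent / "log" / LOG_NAME


print(log_file_for("/tmp/my_own_gathered_stuff/"))
```

For the data directory /tmp/my_own_gathered_stuff/ this yields /tmp/log/crawl_data_report.log, matching the second example in the paragraph above.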

The reports produced will be plain HTML if "--plain" is specified. When this script is run by the update process, the files will include various substitutable strings, including @ADMIN_HOME@, @ADMIN_BASE@ and @REPORT_BASE@. This is so that the administration UI can read these files from disk and substitute in links to its homepage, CSS files, images and so on.
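
The substitution step performed by the administration UI might look something like the following. This is a sketch only: the function name substitute_placeholders, the sample HTML and the replacement values are all hypothetical, and the real UI may handle these strings differently.

```python
import re
from typing import Mapping

# Matches tokens such as @ADMIN_HOME@ or @REPORT_BASE@.
PLACEHOLDER = re.compile(r"@([A-Z_]+)@")


def substitute_placeholders(html: str, values: Mapping[str, str]) -> str:
    """Replace @NAME@ tokens with their values; unknown tokens are left as-is."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), html)


# Hypothetical report fragment and replacement values.
template = '<a href="@ADMIN_HOME@">Home</a> <link href="@ADMIN_BASE@/report.css">'
values = {"ADMIN_HOME": "/admin/", "ADMIN_BASE": "/admin/static"}
print(substitute_placeholders(template, values))
```

A single regular-expression pass is used so that a substituted value is never itself rescanned for further tokens.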

A "hosts list" may be specified. If none is specified, a default of $SEARCH_HOME/conf/<collection>/sites-by-portfolio.csv is assumed. The hosts list does not have to exist and has negligible impact on the reports. If present, the list should be of the format:


For example:

,businesses,funnelback
,businesses,funnelback
,governmental,australia
,businesses,microsoft
,government,australia
,government,australia

Should the host list exist, various aggregate statistics will be produced. For example, statistics will be reported not just for individual sites, but also for groups of sites.
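
To illustrate how a host list could drive aggregate statistics, the sketch below rolls per-host document counts up into group and subgroup totals. It assumes each line of the host list has the form host,group,subgroup, which is inferred from the option description and the file's .csv extension rather than confirmed here. The host names, counts and function names are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, Mapping, Tuple


def load_host_groups(text: str) -> Dict[str, Tuple[str, str]]:
    """Parse 'host,group,subgroup' lines into a host -> (group, subgroup) map."""
    groups: Dict[str, Tuple[str, str]] = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and rows without a host or without all three fields.
        if len(row) < 3 or not row[0].strip():
            continue
        host, group, subgroup = (field.strip() for field in row[:3])
        groups[host] = (group, subgroup)
    return groups


def aggregate(doc_counts: Mapping[str, int], groups: Mapping[str, Tuple[str, str]]):
    """Sum per-host counts into per-group and per-(group, subgroup) totals."""
    by_group: Counter = Counter()
    by_subgroup: Counter = Counter()
    for host, count in doc_counts.items():
        if host not in groups:
            continue  # Hosts missing from the list only appear individually.
        group, subgroup = groups[host]
        by_group[group] += count
        by_subgroup[(group, subgroup)] += count
    return by_group, by_subgroup


# Hypothetical host list and per-host document counts.
hosts_csv = (
    "www.example.com,businesses,funnelback\n"
    "docs.example.com,businesses,funnelback\n"
    "www.example.gov,government,australia\n"
)
doc_counts = {"www.example.com": 10, "docs.example.com": 5,
              "www.example.gov": 7, "unlisted.example.org": 3}
by_group, by_subgroup = aggregate(doc_counts, load_host_groups(hosts_csv))
print(dict(by_group))
print(dict(by_subgroup))
```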
