
Site profiles

Introduction

Defines web crawler interactions for sites within a web collection.

To open the Site Profiles configuration editor, go to the administration home page and, under the Administer tab, select Browse Collection Configuration Files to open the file manager. Then select site_profiles.cfg from the drop-down list to open the editor.

Site profiles customise how the web crawler interacts with a particular set of web sites. This is useful when a crawl covers servers with a variety of hardware and software configurations.

Fields

The meaning of the fields is as follows:

Field                       Value
Server or domain            Exact server name or domain name, in a valid URL or IP format. Partial URL patterns, wildcards and regular expressions are not supported.
Request delay               The number of milliseconds to wait before making another request to a single host.
Maximum parallel requests   Overrides the default number of parallel requests specified for the crawl.
Revisit policy              The revisit policy the web crawler uses when making network calls to the server or domain of the site profile.
Username                    Optional username for crawling password-protected websites on specific servers.
Password                    Optional password for crawling password-protected websites on specific servers.
Maximum Files Stored        Optional maximum number of files the web crawler should download during the crawl.
Comments                    Optional comments.
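
The field rules above can be sketched as a small validation routine. This is only an illustration: the `SiteProfile` class, the `validate` function and the host-name pattern are hypothetical names and assumptions made for this example, not part of Funnelback, and the numeric limits are an interpretation of the field descriptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Interpretation of "valid URL or IP format" with no wildcards, paths or
# regular expressions: letters, digits, dots and hyphens only (an assumption).
_HOST_RE = re.compile(r"^[A-Za-z0-9.-]+$")


@dataclass
class SiteProfile:
    """Hypothetical model of one site profile entry (not a Funnelback class)."""
    server_or_domain: str
    request_delay_ms: int
    max_parallel_requests: int
    revisit_policy: str
    username: Optional[str] = None
    password: Optional[str] = None
    max_files_stored: Optional[int] = None
    comments: str = ""


def validate(profile: SiteProfile) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not _HOST_RE.match(profile.server_or_domain):
        problems.append("server or domain must be a plain host name or IP address")
    if profile.request_delay_ms < 0:
        problems.append("request delay must be zero or more milliseconds")
    if profile.max_parallel_requests < 1:
        problems.append("maximum parallel requests must be at least 1")
    if profile.max_files_stored is not None and profile.max_files_stored < 1:
        problems.append("maximum files stored must be at least 1 when set")
    return problems


profile = SiteProfile("docs.funnelback.com", 100, 4, "AlwaysRevisitPolicy",
                      "johnsmith", "pass1234", 1000)
print(validate(profile))  # []
```

The revisit policy is deliberately left unchecked, because this page does not list every accepted policy name.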

Examples

  1. If your collection includes a website that uses HTTP basic authentication, you can create a site profile entry to specify rules for that domain.
Field                       Value
Server or domain            docs.funnelback.com
Request delay               100
Maximum parallel requests   4
Revisit policy              AlwaysRevisitPolicy
Username                    johnsmith
Password                    pass1234
Maximum Files Stored        1000
Comments
  2. If a site within a collection responds slowly, you can specify a request delay in a site profile entry for that site.
Field                       Value
Server or domain            www.example.com
Request delay               200
Maximum parallel requests   1
Revisit policy              SimpleRevisitPolicy
Username
Password
Maximum Files Stored
Comments
  3. You can override the default Maximum parallel requests limit by specifying a new value in a site profile entry for a site within your collection.
Field                       Value
Server or domain            server.example.com
Request delay               500
Maximum parallel requests   4
Revisit policy              AlwaysRevisitPolicy
Username                    user_one
Password                    password_one
Maximum Files Stored
Comments
  4. You can enter an optional value for Maximum Files Stored to specify the maximum number of files the web crawler should download during the crawl.
Field                       Value
Server or domain            funnelback.com
Request delay               200
Maximum parallel requests   2
Revisit policy              SimpleRevisitPolicy
Username
Password
Maximum Files Stored        200
Comments
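
As a rough illustration of how entries like these override crawl-wide settings for specific hosts, the following sketch resolves the effective settings for a URL. The `DEFAULTS` values, the `effective_settings` function and the matching rule (exact host first, then each parent domain) are assumptions made for this example; the crawler's actual matching logic is not described on this page.

```python
from urllib.parse import urlsplit

# Placeholder crawl-wide defaults, invented for this sketch.
DEFAULTS = {"request_delay_ms": 250, "max_parallel_requests": 1}

# Two of the example entries above, keyed by server or domain.
PROFILES = {
    "www.example.com": {"request_delay_ms": 200, "max_parallel_requests": 1},
    "funnelback.com": {
        "request_delay_ms": 200,
        "max_parallel_requests": 2,
        "max_files_stored": 200,
    },
}


def effective_settings(url, profiles=PROFILES, defaults=DEFAULTS):
    """Merge the matching profile, if any, over the crawl-wide defaults."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Assumed rule: try the exact host, then progressively shorter parent domains.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in profiles:
            return {**defaults, **profiles[candidate]}
    return dict(defaults)


print(effective_settings("https://docs.funnelback.com/guide"))
print(effective_settings("https://unlisted.example.org/"))
```

Hosts with no matching entry simply receive a copy of the defaults, so a profile only needs to exist for servers that require special handling.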

Notes

  • The values in the list are used to override the default values that have been specified for the crawl as a whole. For example, the default number of parallel requests to each server is usually one, to try to be as polite as possible.
  • You will need to set the Frontier to be a MultipleRequestsFrontier for the Maximum parallel requests setting to be taken into account.
  • The optional username and password can be used to specify user account details for crawling password protected content on specific servers.
  • If you wish to specify the max_files_stored value, it must be the 7th field on the line, so you may need to include empty values for the optional username and password fields.
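
The last note can be illustrated with a short sketch that writes one entry as a single line, keeping empty placeholders for unset fields so that max_files_stored stays in the 7th position. The comma delimiter, the field order (taken from the Fields table) and the `to_line` helper are assumptions made for illustration; check an existing site_profiles.cfg for the exact syntax before relying on them.

```python
# Assumed field order, following the Fields table on this page.
FIELD_ORDER = [
    "server_or_domain",
    "request_delay",
    "max_parallel_requests",
    "revisit_policy",
    "username",
    "password",
    "max_files_stored",
]


def to_line(profile, delimiter=","):
    """Serialise a profile dict, emitting empty strings for missing fields."""
    values = []
    for name in FIELD_ORDER:
        value = profile.get(name)
        values.append("" if value is None else str(value))
    return delimiter.join(values)


entry = {
    "server_or_domain": "funnelback.com",
    "request_delay": 200,
    "max_parallel_requests": 2,
    "revisit_policy": "SimpleRevisitPolicy",
    "max_files_stored": 200,
}
print(to_line(entry))  # funnelback.com,200,2,SimpleRevisitPolicy,,,200
```

The two consecutive empty fields stand in for the unset username and password, which keeps 200 in the 7th position.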

RESTful API

The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger API user interface, go to the administration home page, then under the System drop-down menu click View API UI. This opens the Swagger UI, where you can click Site Profiles.

Or you can go directly to:

https://host_name:admin_port/search/admin/api-ui/#/Site32Profiles

v15.24.0