Site profiles

Introduction

Defines web crawler interactions for sites within a web collection.

To open the Site Profiles configuration editor, go to the administration home page and, under the Administer tab, select Browse Collection Configuration Files to open the file manager. Then select site_profiles.cfg from the drop-down list to open the editor.

Site profiles customise how the web crawler interacts with a particular set of web sites. This is useful when the crawled sites run on a variety of hardware and software configurations, for example when some servers respond slowly or require authentication.

Fields

The meaning of the fields is as follows:

Field                      Value
Server or domain           The exact server name or domain name, in a valid URL or IP format. Partial URL patterns, wildcards and regular expressions are not supported.
Request delay              The number of milliseconds to wait before making another request to the same host.
Maximum parallel requests  Overrides the default number of parallel requests specified for the crawl.
Revisit policy             The revisit policy the web crawler uses when making network calls to the server or domain of the site profile.
Username                   Optional. The user name of the account used to crawl password-protected websites on this server.
Password                   Optional. The password of the account used to crawl password-protected websites on this server.
Maximum Files Stored       Optional. The maximum number of files the web crawler should download during the crawl.
Comments                   Optional comments.
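
The fields above can be pictured as a small record with a few sanity checks. The following is a minimal Python sketch, not Funnelback code: the class name `SiteProfile`, the helper `is_valid_server`, and the host name pattern are all assumptions made for illustration. It only mirrors the rules stated in the table (no wildcards or partial URL patterns, optional username, password and maximum files stored).

```python
import ipaddress
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical host name pattern: dot-separated labels made of letters,
# digits and hyphens, with no leading or trailing hyphen in a label.
_HOST_RE = re.compile(
    r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))*"
)


def is_valid_server(value: str) -> bool:
    """Accept an IP address or a plain host/domain name; reject wildcards and paths."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        pass
    return _HOST_RE.fullmatch(value) is not None


@dataclass
class SiteProfile:
    """Illustrative stand-in for one site profile entry (names are assumptions)."""

    server: str
    request_delay_ms: int
    max_parallel_requests: int
    revisit_policy: str
    username: str = ""                      # optional
    password: str = ""                      # optional
    max_files_stored: Optional[int] = None  # optional
    comments: str = ""                      # optional

    def __post_init__(self) -> None:
        if not is_valid_server(self.server):
            raise ValueError(f"not an exact server or domain name: {self.server!r}")
        if self.request_delay_ms < 0:
            raise ValueError("request delay must not be negative")
        if self.max_parallel_requests < 1:
            raise ValueError("maximum parallel requests must be at least 1")
```

With this sketch, `SiteProfile("docs.funnelback.com", 100, 4, "AlwaysRevisitPolicy")` is accepted, while a server value such as `*.example.com` or `example.com/docs` raises a `ValueError`.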

Examples

  1. If your collection consists of a website with HTTP basic authentication, you can create a site profile entry to specify rules for that particular domain.

     Field                      Value
     Server or domain           docs.funnelback.com
     Request delay              100
     Maximum parallel requests  4
     Revisit policy             AlwaysRevisitPolicy
     Username                   johnsmith
     Password                   pass1234
     Maximum Files Stored       1000
     Comments                   (empty)

  2. If a site within a collection has a slow response time, you can specify a request delay in a site profile entry for that particular site.

     Field                      Value
     Server or domain           www.example.com
     Request delay              200
     Maximum parallel requests  1
     Revisit policy             SimpleRevisitPolicy
     Username                   (empty)
     Password                   (empty)
     Maximum Files Stored       (empty)
     Comments                   (empty)

  3. You can override the default Maximum parallel requests limit by specifying a new value in a site profile entry for a site within your collection.

     Field                      Value
     Server or domain           server.example.com
     Request delay              500
     Maximum parallel requests  4
     Revisit policy             AlwaysRevisitPolicy
     Username                   user_one
     Password                   password_one
     Maximum Files Stored       (empty)
     Comments                   (empty)

  4. You can enter an optional value for Maximum Files Stored to specify the maximum number of files the web crawler should download during the crawl.

     Field                      Value
     Server or domain           funnelback.com
     Request delay              200
     Maximum parallel requests  2
     Revisit policy             SimpleRevisitPolicy
     Username                   (empty)
     Password                   (empty)
     Maximum Files Stored       200
     Comments                   (empty)
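
Each example above overrides crawl-wide defaults for one server or domain. The Python sketch below illustrates that idea only; it is not how the crawler is implemented. The `DEFAULTS` values, the function names, and in particular the rule that a domain entry also covers its subdomains are assumptions made for illustration.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

# Hypothetical crawl-wide defaults. Only "one parallel request" is a typical
# default; the delay value here is a placeholder.
DEFAULTS: Dict[str, int] = {"request_delay_ms": 250, "max_parallel_requests": 1}

# Per-site overrides, taken from examples 2 to 4 above.
PROFILES: Dict[str, Dict[str, int]] = {
    "www.example.com": {"request_delay_ms": 200, "max_parallel_requests": 1},
    "server.example.com": {"request_delay_ms": 500, "max_parallel_requests": 4},
    "funnelback.com": {
        "request_delay_ms": 200,
        "max_parallel_requests": 2,
        "max_files_stored": 200,
    },
}


def find_profile(host: str) -> Optional[Dict[str, int]]:
    """Return the profile for a host: exact match first, then a parent domain."""
    host = host.lower()
    if host in PROFILES:
        return PROFILES[host]
    # Assumption: an entry for a domain also applies to its subdomains.
    for name, profile in PROFILES.items():
        if host.endswith("." + name):
            return profile
    return None


def effective_settings(url: str) -> Dict[str, int]:
    """Start from the crawl defaults and apply any matching site profile on top."""
    host = urlsplit(url).hostname or ""
    settings = dict(DEFAULTS)
    profile = find_profile(host)
    if profile is not None:
        settings.update(profile)
    return settings
```

Under these assumptions, a URL on server.example.com resolves to four parallel requests and a 500 ms delay, while a URL on a host with no matching entry keeps the crawl defaults.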

Notes

  • The values in a site profile override the default values specified for the crawl as a whole. For example, the default number of parallel requests to each server is usually one, to be as polite as possible.
  • The Frontier must be set to MultipleRequestsFrontier for the Maximum parallel requests setting to take effect.
  • The optional username and password can be used to specify user account details for crawling password-protected content on specific servers.
  • If you specify the max_files_stored value, it must be the 7th field on the line, so you may need to include empty values for the optional username and password fields.
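
The last note is easiest to see by building a line with placeholders. The sketch below is only an illustration: the comma delimiter and the field order (taken from the Fields table) are assumptions, and `format_profile_line` is a hypothetical helper, so check an existing site_profiles.cfg for the actual syntax before relying on it.

```python
from typing import Optional


def format_profile_line(
    server: str,
    request_delay_ms: int,
    max_parallel_requests: int,
    revisit_policy: str,
    username: str = "",
    password: str = "",
    max_files_stored: Optional[int] = None,
    delimiter: str = ",",  # assumed delimiter, not confirmed by this page
) -> str:
    """Join the fields in table order, keeping empty username/password slots."""
    fields = [
        server,
        str(request_delay_ms),
        str(max_parallel_requests),
        revisit_policy,
        username,  # 5th field, may be empty
        password,  # 6th field, may be empty
    ]
    if max_files_stored is not None:
        # Always lands in the 7th position because fields 5 and 6 are kept.
        fields.append(str(max_files_stored))
    return delimiter.join(fields)


line = format_profile_line(
    "funnelback.com", 200, 2, "SimpleRevisitPolicy", max_files_stored=200
)
print(line)
```

Because the empty username and password are still written out, the value 200 ends up as the 7th field rather than sliding into the username position.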

RESTful API

The API is documented using Swagger, which can also be used to interact with the API. To open the Swagger user interface, go to the administration home page and, under the System drop-down menu, click View API UI. When the Swagger UI opens, click Site Profiles.

Or you can go directly to:

https://host_name:admin_port/search/admin/api-ui/#/Site32Profiles

v15.18.0