
crawler.max_files_per_area

Limits the number of files that will be crawled from a single directory or from a single generator of dynamic URLs.

Key: crawler.max_files_per_area
Type: Integer
Can be set in: collection.cfg

Description

This option limits the number of files that the crawler will download from a single area. Here an "area" is either a static directory or a generator, e.g. index.asp?doc=123.

Note: This parameter was previously called crawler.max_dir_size. It was renamed to make clear that generators are also covered by the limit.

If the crawler encounters an area on a site with the URL:

http://www.example.com/white_papers/

and downloads multiple files from within this directory, it will stop downloading further content from the directory once the specified limit is reached.

A similar approach is used for generators. For example, if the crawler encounters a generator like:

http://www.example.com/index.asp?doc=1

and has downloaded multiple URLs generated by this index.asp script, it will download nothing further from this generator once the limit is reached.

Lotus Notes generator scripts (.nsf) look like directories, e.g.

http://www.example.com/publish.nsf/content/doc123/

In this example, if "publish.nsf" generates more URLs than the limit allows, the crawler will not request further content from it, even though the URL appears to contain other directories or areas underneath it.
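
The three cases above (static directory, generator script, Lotus Notes database) can be illustrated with a small Python sketch. This is not the crawler's actual implementation: the names area_key and AreaLimiter are hypothetical, and the rules for recognising a generator (a query string, or a path segment ending in .nsf) are simplifying assumptions made for illustration only.

```python
from collections import Counter
from urllib.parse import urlsplit


def area_key(url: str) -> str:
    """Return an illustrative 'area' identifier for a URL (assumed rules)."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}"
    segments = parts.path.split("/")

    # Lotus Notes: everything beneath the first .nsf segment is one area,
    # even though the rest of the path looks like nested directories.
    for i, segment in enumerate(segments):
        if segment.lower().endswith(".nsf"):
            return base + "/".join(segments[: i + 1])

    # Generator: a script called with a query string, e.g. index.asp?doc=1.
    if parts.query:
        return base + parts.path

    # Static content: the area is the containing directory.
    return base + parts.path.rsplit("/", 1)[0] + "/"


class AreaLimiter:
    """Counts downloads per area and refuses any beyond the limit."""

    def __init__(self, max_files_per_area: int = 10000):
        self.max_files_per_area = max_files_per_area
        self.counts = Counter()

    def should_download(self, url: str) -> bool:
        key = area_key(url)
        if self.counts[key] >= self.max_files_per_area:
            return False  # limit reached for this area
        self.counts[key] += 1
        return True


if __name__ == "__main__":
    limiter = AreaLimiter(max_files_per_area=2)
    for n in (1, 2, 3):
        url = f"http://www.example.com/index.asp?doc={n}"
        print(url, limiter.should_download(url))
```

With a limit of 2, the third index.asp URL is refused, while URLs in other areas such as http://www.example.com/white_papers/ are still accepted because each area is counted separately.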

Note: If you are crawling a dynamically generated site where a single generator produces a large amount of content, and you are getting back less content than you expect, you may need to increase this parameter above its default value.
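
One rough way to check whether a generator has hit the limit is to count the crawled URLs per generator script and see whether any count has reached the configured value. The Python sketch below assumes you already have a plain list of crawled URLs from some source; the function names are hypothetical, and treating any URL with a query string as generated is a simplification rather than the crawler's own rule.

```python
from collections import Counter
from urllib.parse import urlsplit


def generator_counts(urls):
    """Count URLs per generator script (scheme + host + path)."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        if parts.query:  # assumption: a query string marks a generated URL
            counts[f"{parts.scheme}://{parts.netloc}{parts.path}"] += 1
    return counts


def generators_at_limit(urls, limit=10000):
    """Return generators whose URL count has reached the limit."""
    counts = generator_counts(urls)
    return sorted(g for g, n in counts.items() if n >= limit)


if __name__ == "__main__":
    crawled = [f"http://www.example.com/index.asp?doc={i}" for i in range(3)]
    crawled.append("http://www.example.com/about.html")
    print(generators_at_limit(crawled, limit=3))
```

A generator whose count sits exactly at the limit has probably been cut off, which suggests that raising crawler.max_files_per_area may recover the missing content.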

Default Value

crawler.max_files_per_area=10000


v15.24.0