crawler.link_extraction_regular_expression (collection.cfg setting)

Description

This option defines the regular expression that will be used to extract URLs from HTML links like the following:

<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>

Default value

crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*(\'|\")?\\s*(.*?)(>|\"|\'|(\s\w+\=))

If no value is defined, then the above default is used.

Examples

crawler.link_extraction_group=5
crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*(\'|\")?\\s*(.*?)(>|\"|\'|(\s\w+\=))

Extracted groups:

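  1. (href|src): the attribute name, so that link, a and img HTML tags are all handled
  2. (\s)*: optional whitespace before the equals sign
  3. (\s)*: optional whitespace after the equals sign
  4. (\'|\")?: optional quote that begins the URL
  5. (.*?): the URL (non-greedy pattern); this is the group selected by crawler.link_extraction_group=5
  6. (>|\"|\'|(\s\w+\=)): the end of the URL, which is a closing angle bracket, a quote, or the start of the next attribute (the nested (\s\w+\=) is group 7)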
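
The group numbering can be checked outside the crawler with a short script. The following is a minimal sketch in Python, not Funnelback code: it assumes the doubled backslash in `\\s*` is an escaping artifact that stands for `\s*`, and the names LINK_PATTERN, URL_GROUP and extract_urls, as well as the unquoted `<a href=page.html class=x>` tag, are hypothetical and used only for illustration.

```python
import re

# Default pattern from this page. Assumption: `\\s*` in the setting is an
# escaping artifact and means `\s*` (optional whitespace after the quote).
LINK_PATTERN = re.compile(
    r"""\s(href|src)(\s)*=(\s)*(\'|\")?\s*(.*?)(>|\"|\'|(\s\w+\=))"""
)

# Mirrors crawler.link_extraction_group=5 (hypothetical constant name).
URL_GROUP = 5


def extract_urls(html):
    """Return the URL_GROUP capture of every match in html, in document order."""
    return [m.group(URL_GROUP) for m in LINK_PATTERN.finditer(html)]


# The three sample links from the Description section.
SAMPLE_HTML = """<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>"""

urls = extract_urls(SAMPLE_HTML)
print(urls)

# Hypothetical unquoted attribute: the URL ends where the next attribute
# starts, so groups 6 and 7 both capture " class=".
unquoted_groups = LINK_PATTERN.search("<a href=page.html class=x>").groups()
print(unquoted_groups)
```

In this sketch the three sample links yield their href or src values from group 5. For the unquoted tag, groups 2, 3 and 4 are empty because no whitespace or quote is present, and the URL is terminated by the start of the next attribute.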

v15.16.0