PHP Classes

File: README.txt

Role: Documentation
Content type: text/plain
Description: Usage Examples
Class: Robots_txt
Test if a URL may be crawled by looking at robots.txt
Author: Andy Pieters
Date: 14 years ago
Size: 1,604 bytes


		The robots exclusion standard is considered proper netiquette, so any kind of script that exhibits
		crawling-like behavior is expected to abide by it.

		The intended use of this class is to feed it a URL before you visit it. The class will
		automatically attempt to read the site's robots.txt file and will return a boolean value to
		indicate whether you are allowed to visit this URL.
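
		As an illustration of where that robots.txt file lives, here is a minimal sketch that derives
		its location from a URL. The function name robotsTxtUrlFor() is hypothetical and not part of
		Robots_txt; how the class locates the file internally is not documented here.

```php
<?php
// Hypothetical helper, not part of Robots_txt: build the robots.txt
// location for the host that serves a given absolute URL.
function robotsTxtUrlFor(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null; // not an absolute URL, so there is no host to ask
    }
    // Keep an explicit port, since robots.txt is per scheme, host and port.
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . '/robots.txt';
}

echo robotsTxtUrlFor('http://example.com/private/page.html?id=3'), "\n";
// prints http://example.com/robots.txt
```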

		Crawl-delay and Request-rate values are capped at 60 seconds.
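
		The cap can be pictured with the sketch below. It is an assumption-laden illustration, not the
		class's code: both function names are hypothetical, and Request-rate is assumed to use the
		common "requests/seconds" form (for example 1/5 for one request every five seconds).

```php
<?php
const MAX_DELAY_SECONDS = 60.0; // the cap stated in this README

// Hypothetical helper: clamp a Crawl-delay value to the range 0..60 seconds.
function cappedCrawlDelay(float $seconds): float
{
    return max(0.0, min($seconds, MAX_DELAY_SECONDS));
}

// Hypothetical helper: convert "requests/seconds" into a capped delay
// per request. Returns null when the value is not in that form.
function cappedRequestRateDelay(string $rate): ?float
{
    if (!preg_match('#^\s*(\d+)\s*/\s*(\d+)\s*$#', $rate, $m) || (int) $m[1] === 0) {
        return null;
    }
    return cappedCrawlDelay((int) $m[2] / (int) $m[1]);
}
```

		Under these assumptions, a Crawl-delay of 300 and a Request-rate of 1/600 both come out as a
		60 second wait.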

		The class will block until the detected crawl-delay (or request-rate) allows visiting the URL.

		For instance, if Crawl-delay is set to 3, a second call to Robots_txt::urlAllowed() made
		immediately after the first will block for about 3 seconds. An internal clock keeps the time of
		the last visit, so if the delay has already expired, the method will not block.
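
		The waiting rule described above can be sketched as follows. This is not the Robots_txt source;
		the class VisitClock and its methods are hypothetical names used only to show how a
		last-visit clock leads to waiting for just the unexpired part of the delay.

```php
<?php
// Hypothetical sketch, not the Robots_txt source: a per-process clock
// that waits only for the part of the delay that has not yet elapsed.
final class VisitClock
{
    private static ?float $lastVisit = null;

    // Seconds still to wait at time $now, given $delay seconds between visits.
    public static function remaining(float $delay, float $now): float
    {
        if (self::$lastVisit === null) {
            return 0.0; // nothing visited yet, so the first call never waits
        }
        return max(0.0, self::$lastVisit + $delay - $now);
    }

    // Record the time of a visit.
    public static function markVisited(float $now): void
    {
        self::$lastVisit = $now;
    }

    // Block until the delay has passed, then record the new visit time.
    public static function waitTurn(float $delay): void
    {
        $wait = self::remaining($delay, microtime(true));
        if ($wait > 0) {
            usleep((int) round($wait * 1000000));
        }
        self::markVisited(microtime(true));
    }
}
```

		With a delay of 3 and a last visit at t=100, remaining() gives 2 seconds at t=101 and 0 at
		t=105, which matches the "already expired" case.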

		Example usage

		foreach($arrUrlsToVisit as $strUrlToVisit) {

			if(Robots_txt::urlAllowed($strUrlToVisit,$strUserAgent)) {

				# visit url, do processing...
			}
		}

		The simple example above will ensure you abide by the wishes of the site owners.

		Note: an unofficial, non-standard extension exists that limits the times at which crawlers
			  are allowed to visit a site. I choose to ignore this extension because I feel it
			  is unreasonable.

		Note: You are only *required* to specify your user agent the first time you call the urlAllowed
			  method; only that first value is ever used, and values passed on later calls are ignored.
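
		The "first value wins" behaviour in the note above can be pictured with a small stand-in. The
		class AgentMemo is hypothetical, and raising an exception when no user agent was ever given is
		an assumption of this sketch; the real class may store and validate its user agent differently.

```php
<?php
// Hypothetical stand-in for the behaviour described in the note above.
final class AgentMemo
{
    private static ?string $userAgent = null;

    // Remember the first user agent passed in and return it on every call.
    public static function resolve(?string $userAgent = null): string
    {
        if (self::$userAgent === null) {
            if ($userAgent === null) {
                // Assumption: a missing user agent on the first call is an error.
                throw new InvalidArgumentException('User agent required on first call');
            }
            self::$userAgent = $userAgent;
        }
        return self::$userAgent;
    }
}

echo AgentMemo::resolve('MyCrawler/1.0'), "\n"; // prints MyCrawler/1.0
echo AgentMemo::resolve('OtherBot/2.0'), "\n";  // prints MyCrawler/1.0 again
echo AgentMemo::resolve(), "\n";                // prints MyCrawler/1.0 again
```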