BACKGROUND
I've made a testing bot with PHPCrawl (PHP, Apache, Windows). I am using it to visit a few hundred pages on my site, executing each page to trigger possible errors. If a syntax error exists in any visited file, my error log will contain information about it.
THE PROBLEM
If I have a thousand database posts that are all viewed through the same file, like post.php?id=1 up to post.php?id=n, it could be enough to test one, or maybe ten, different ids per file. If one post works, it is likely that all posts work (in my case, they do).
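One workaround I have considered, assuming the ids are plain numbers: a URL-filter regex that only lets single-digit ids through, so at most nine post.php pages get visited. This relies on PHPCrawl's addURLFilterRule(), which makes the crawler ignore every URL matching the given regex:

$crawler->addURLFilterRule("#post\.php\?id=[0-9]{2,}#"); // skip ids with two or more digits

But that means one hand-written regex per script, which does not scale to a site with many of them.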
I HAVE FILTERS, BUT OF ANOTHER KIND
I have filters that make the bot avoid URLs containing words like "delete" and "remove", which saves me from a bot that deletes all my data. But those filters are defined by how the URL is formatted, not by a count or a limit.
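For reference, those format-based filters look something like this (addURLFilterRule() makes the crawler ignore any URL matching the regex):

// Keep the bot away from anything destructive
$crawler->addURLFilterRule("#(delete|remove)#i");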
I don't understand how to ignore just some of the URLs. I have overridden handleDocumentInfo(), but that is called after the link has already been processed. There "must be" a method that is called when a URL is found, before it is visited, where the URL can be kept or ignored. Or am I wrong?
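The closest thing I have found in the documentation is handleHeaderInfo(), which is called after the response header has been received but before the content is downloaded; a negative return value makes the crawler skip the document's body. It is not a true "URL found" hook, but a sketch like this might approximate the limit I want (I am assuming the requested URL is exposed as $header->url on PHPCrawlerResponseHeader):

class MyCrawler extends PHPCrawler
{
    private $postCount = 0;

    // Called once the header is in, before the body is fetched
    function handleHeaderInfo(PHPCrawlerResponseHeader $header)
    {
        // Assumption: the requested URL is available as $header->url
        if (strpos($header->url, "post.php") !== false)
        {
            $this->postCount++;
            if ($this->postCount > 10)
            {
                return -1; // negative value: abort receiving this document
            }
        }
        return 1;
    }
}

The drawback is that the request itself is still sent (the header has to come from somewhere), so this would save bandwidth and parsing time, not the requests themselves.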
EXAMPLE OF HOW I WANT TO CONTROL IT
I would like to write code similar to this example:
// $counter would persist between calls, e.g. as a property of the crawler
if (strpos($foundURL, "post.php") !== false) // found URLs are absolute, so don't anchor at position 0
{
    $counter++;
    if ($counter >= 10)
    {
        return false; // ignore this URL: ten post.php pages are enough
    }
    return true; // keep this URL
}
Any good ideas? Thanks! (I have asked this question before and tried to make it clearer this time. Sorry for any bad English.)