Friday, September 2, 2016

Possible to control how many times phpcrawl visits the same file (but different id)?

BACKGROUND

I've made a testing bot with PHPCrawl (PHP, Apache, Windows). I use it to visit a few hundred pages on my site, executing each page to trigger possible errors. If a syntax error exists in any visited file, my error log will contain information about it.
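
For context, a minimal version of such a bot might look like the sketch below. It is only a sketch based on PHPCrawl 0.8's documented API (the PHPCrawler class, setURL(), go(), and the overridable handleDocumentInfo()); the include path and host are placeholders:

require_once("libs/PHPCrawler.class.php"); // path is an assumption

class TestBot extends PHPCrawler
{
    // Called once per received document, after the request was made.
    // Visiting the page is what triggers any server-side errors;
    // here we just log what was hit and its HTTP status.
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . " -> " . $DocInfo->http_status_code . "\n";
    }
}

$crawler = new TestBot();
$crawler->setURL("http://www.example.com/"); // placeholder host
$crawler->go();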

THE PROBLEM

If I have a thousand database posts that are all viewed through the same file, the crawler finds URLs like post.php?id=1 up to post.php?id=n.

It could be enough to test one, or maybe ten, different ids per file. If one post works, it's likely that all posts work (in my case, they do).
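
One way to treat post.php?id=1 and post.php?id=500 as "the same file" is to reduce each found URL to its script path and count visits per path. A small plain-PHP helper sketch (the function name and the limit of 10 are my own choices, not PHPCrawl API):

// Hypothetical helper: collapse post.php?id=1, post.php?id=2, ...
// into the single key "/post.php" so they share one counter.
$visitCounts = array();

function shouldVisit($foundURL, &$visitCounts, $maxPerScript = 10)
{
    // parse_url() drops the query string, leaving just the script path.
    $path = parse_url($foundURL, PHP_URL_PATH);

    if (!isset($visitCounts[$path])) {
        $visitCounts[$path] = 0;
    }

    // Allow at most $maxPerScript visits per script, regardless of ?id=.
    return ++$visitCounts[$path] <= $maxPerScript;
}

var_dump(shouldVisit("http://example.com/post.php?id=1", $visitCounts)); // true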

I HAVE FILTERS, BUT OF ANOTHER KIND

I have filters that make the bot avoid URLs containing words like "delete" and "remove", which saves me from a bot that deletes all my data. But those filters are defined by how the URL is formatted, not by a count or limit.
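
For reference, this kind of format-based filter is what PHPCrawl's addURLFilterRule() provides: any found URL matching the regex is ignored. A sketch (the exact regexes are assumptions):

// Format-based filters as described above; any URL matching a rule is skipped.
$crawler = new TestBot(); // the subclass sketched earlier
$crawler->setURL("http://www.example.com/");
$crawler->addURLFilterRule("#delete#i"); // skip destructive links
$crawler->addURLFilterRule("#remove#i");
$crawler->go();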

I don't understand how to ignore only some of the URLs. I have overridden handleDocumentInfo(), but it is called after the link has already been processed. There "must be" a function that is called when a URL is found, before it is visited, where the URL could be kept or ignored, or am I wrong?

EXAMPLE OF HOW I WANT TO CONTROL IT

I would like to write code similar to this example:

// Hypothetical hook: return true to visit the found URL, false to ignore it.
// $counter is assumed to persist between calls (e.g. a class property).
if (strpos($foundURL, "post.php") !== false)
{
    $counter++;
    if ($counter > 10) // ">" rather than "==", so URLs keep being skipped after the limit
    {
        return false;
    }
    return true;
}
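
As far as I can tell, PHPCrawl doesn't document a hook that runs when a URL is found but before it is requested. The closest documented point seems to be the overridable handleHeaderInfo() method, which is called once the response header has arrived and, according to the class reference, can stop the crawler from downloading the document body by returning a negative integer. Below is a sketch combining that with a per-script counter; the property names and the limit of 10 are my own, and I'm assuming $header->url holds the requested URL:

class TestBot extends PHPCrawler
{
    private $scriptCounts = array();
    private $maxPerScript = 10;

    // Assumption: handleHeaderInfo() fires after the response header
    // arrives, and returning a negative integer skips the body download.
    // Note this does NOT prevent the HTTP request itself.
    function handleHeaderInfo($header)
    {
        // Collapse post.php?id=1 ... post.php?id=n into one key.
        $path = parse_url($header->url, PHP_URL_PATH);

        if (!isset($this->scriptCounts[$path])) {
            $this->scriptCounts[$path] = 0;
        }

        if (++$this->scriptCounts[$path] > $this->maxPerScript) {
            return -1; // enough ids tested for this script, skip the content
        }
        return 1;
    }

    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . " -> " . $DocInfo->http_status_code . "\n";
    }
}

Note that this still sends a request for every found URL (so the server-side script still runs); it mainly saves downloading the response body. Truly skipping the request would seem to need the regex-based filter rules, which can't count.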

Any good ideas? Thanks! (I've asked this question before and have tried to make it clearer this time. Sorry for any bad English.)
