Aggressive Internet spiders (web bots) have become a real pain in the ass for me and for anyone else who writes software that handles Internet traffic. Programmers and developers, please listen: control your programs.
They come running through machines sucking down everything they can find, ignoring robots.txt files and generally throwing network administrators into a panic. They get trapped in my machines and request far more than they should, looping through the same sites over and over again. So if you're writing an Internet agent, for whatever reason, please control how much it grabs from a particular machine and how often.
Rules for Internet Spider Agents
- Obey robots.txt files.
- Be polite: don't run amok through other people's sites, requesting pages thousands of times in a few minutes. (A minimal polite-agent sketch follows this list.)
- Don't set your bot to identify itself as a web browser (Mozilla). We don't want spiders here for a reason, and a browser disguise is only going to work until we discover you requesting 1000 pages every 10 minutes.
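Here's a minimal sketch of what I mean, using LWP::RobotUA from libwww-perl (linked below). The module fetches and honors robots.txt for you and waits between requests to the same host. The bot name, contact address, URL, and delay are just placeholders; put your own in.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::RobotUA;

    # Identify the bot honestly and give a contact address so an admin
    # can reach you instead of just firewalling your netblock.
    my $ua = LWP::RobotUA->new('example-spider/1.0', 'botmaster@example.com');

    $ua->delay(1);        # wait at least 1 minute between hits on the same host
    $ua->use_sleep(1);    # sleep until allowed instead of returning a 503

    # robots.txt is fetched and checked automatically before the request.
    my $response = $ua->get('http://www.example.com/some/page.html');

    if ($response->is_success) {
        print $response->content;
    }
    else {
        warn "Request refused or failed: " . $response->status_line . "\n";
    }

That's all it takes to stay off a blacklist: a real identity, a delay, and respect for robots.txt.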
Basically, have your web bots behave the way you would want other people's bots to behave on your machines. I think many programmers simply don't think when they write their bots, whatever the task. When I first began writing bots I got my own systems banned by other administrators more than once. If everyone writes more polite bots, we won't have this problem, and I won't have to spend a weekend figuring out how to block dipshit spider v1.0 because it got caught up scraping through one of my machines! Thank you and happy programming.
Useful links
libwww-perl-5.800
- written in Perl, this is the best way to write basic but effective agents. Please program responsibly.
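Even if you skip LWP::RobotUA and use plain LWP::UserAgent from the same distribution, you can still be polite. This is a rough sketch, not the only way to do it; the agent string, URLs, and five-second pause are assumptions for illustration. The point is to identify yourself honestly and space out your requests.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Plain UserAgent: no automatic robots.txt handling, so throttle yourself
    # and use an agent string that says what you are (not "Mozilla/...").
    my $ua = LWP::UserAgent->new(
        agent   => 'example-spider/1.0 (botmaster@example.com)',
        timeout => 30,
    );

    my @urls = (
        'http://www.example.com/index.html',
        'http://www.example.com/about.html',
    );

    for my $url (@urls) {
        my $response = $ua->get($url);
        if ($response->is_success) {
            print "fetched $url\n";
        }
        else {
            warn "failed $url: " . $response->status_line . "\n";
        }
        sleep 5;    # a few seconds between requests, not thousands per minute
    }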