Anyway, this post details exactly what I currently do to prevent such spidering. All of the code is in the new git repository I created to share code with you.
- I created a DB named logip, owned by user logip, and then added these tables. The logip table records requests from IP addresses I am tracking. And the badips and badips2 tables hold IP addresses I am presently blocking.
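To give you an idea of the shape of things, here is a rough sketch of what the schema looks like. The real definitions are in the repository; the column names and types below are just placeholders I am using for illustration.

    #!/usr/bin/perl
    # Rough sketch of the schema only -- the actual table definitions are in the repo,
    # and the column names here are guesses for illustration.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=logip', 'logip', '',
                           { RaiseError => 1, AutoCommit => 1 });

    # One row per unique successful page request being tracked.
    $dbh->do(q{
        CREATE TABLE logip (
            ip      inet        NOT NULL,
            vhost   text        NOT NULL,
            path    text        NOT NULL,
            ts      timestamptz NOT NULL DEFAULT now()
        )
    });

    # IPs currently blocked, split by how bad the violation was.
    for my $table (qw(badips badips2)) {
        $dbh->do(qq{
            CREATE TABLE $table (
                ip  inet        PRIMARY KEY,
                ts  timestamptz NOT NULL DEFAULT now()
            )
        });
    }

    $dbh->disconnect;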
- I set up this Perl script (logip.pl) to run all the time. It tails my Web server log file and adds IP addresses to the logip table. There are variables at the top you can use to exclude certain virtual hosts, e.g. a site with few pages where you aren't concerned with crawling. You can also exempt certain IPs, e.g. your own.
logip.pl currently only records IPs for requests that return a 200 (OK) HTTP status. It also skips static assets (images, CSS, JS, etc.), IP addresses known to belong to the major search engines, and repeated requests for the same page by the same IP (hard refreshes). The idea is to record unique, successful requests for actual distinct pages from non-search engines.
I run the script via daemontools, but you could run it through inetd or whatever. If daemontools interests you, the commands I ran to set up management are in logip.sh. If you use those commands, you will want to change them to point to your own svscan and logip.pl directories.
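To make the description concrete, here is a stripped-down sketch of what a logip.pl-style tailer does. It is not the real script; the log format, file path, and exclusion lists are placeholders you would adjust to your own setup.

    #!/usr/bin/perl
    # Sketch of a logip.pl-style tailer: tail the access log, filter, record.
    use strict;
    use warnings;
    use DBI;

    my @skip_vhosts = qw(static.example.com);        # vhosts you don't care about (placeholder)
    my @skip_ips    = qw(127.0.0.1);                 # your own IPs, known search-engine IPs, etc.
    my $logfile     = '/var/log/nginx/access.log';   # adjust to your setup

    my $dbh = DBI->connect('dbi:Pg:dbname=logip', 'logip', '', { RaiseError => 1 });
    my $ins = $dbh->prepare('INSERT INTO logip (ip, vhost, path) VALUES (?, ?, ?)');

    my %last_path;   # remember each IP's last page to skip hard refreshes
                     # (the real script would want to expire these entries)

    open my $tail, '-|', 'tail', '-F', $logfile or die "tail: $!";
    while (my $line = <$tail>) {
        # Assumes a combined log format with the vhost prepended; adjust the regex to yours.
        my ($vhost, $ip, $path, $status) =
            $line =~ /^(\S+) (\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3})/
            or next;

        next unless $status == 200;                                  # only successful requests
        next if $path =~ /\.(?:png|jpe?g|gif|ico|css|js)(?:\?|$)/i;  # skip static assets
        next if grep { $_ eq $vhost } @skip_vhosts;
        next if grep { $_ eq $ip }    @skip_ips;
        next if defined $last_path{$ip} && $last_path{$ip} eq $path; # hard refresh
        $last_path{$ip} = $path;

        $ins->execute($ip, $vhost, $path);
    }

Under daemontools, the service's run file simply execs the script, and supervise restarts it if it ever dies, which is the main reason I prefer that over a plain background process.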
- I set up this Perl script (badips.pl) to run periodically; in particular, I have it run every minute via crontab. The interval at which it runs determines how quickly new violators of my spidering policy get blocked. So if you run it every minute, people will have (on average) half a minute to grab your stuff before you start blocking them. In practice I haven't felt the need to make the interval any smaller, but if enough people want it, I could rewrite the script for that purpose.
badips.pl works on a threshold basis. It looks at various tunable timeframes and checks whether new IP addresses have exceeded a page-request threshold for each of them. The thresholds I currently use are in the script: for example, 20 page requests in the past minute, or 50 over a day. You can tune these to whatever is appropriate for your sites.
The second knob is whether to log a violator in the badips table or the badips2 table. The distinction is whether you think a violation is really bad or just pretty bad: for example, I currently mark passing the minute threshold as really bad and everything else as just pretty bad. A pretty bad block stays around for 10 days, whereas a really bad block stays around for 180 days.
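Here is a sketch of the threshold check, using the example numbers above (20 per minute, 50 per day). I am guessing at which table corresponds to which severity and at the column names, so treat it as an illustration of the approach rather than the real badips.pl.

    #!/usr/bin/perl
    # Sketch of the threshold logic in a badips.pl-style script.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=logip', 'logip', '', { RaiseError => 1 });

    # (timeframe, threshold, table) triples: tune these to your sites.
    # Here I am assuming badips holds the "really bad" minute violators
    # and badips2 the "pretty bad" day violators.
    my @rules = (
        [ '1 minute', 20, 'badips'  ],
        [ '1 day',    50, 'badips2' ],
    );

    for my $rule (@rules) {
        my ($timeframe, $threshold, $table) = @$rule;
        my $offenders = $dbh->selectcol_arrayref(qq{
            SELECT ip FROM logip
            WHERE  ts > now() - interval '$timeframe'
            GROUP  BY ip
            HAVING count(*) >= ?
        }, undef, $threshold);

        for my $ip (@$offenders) {
            # Skip IPs we already block; otherwise record the new block.
            next if $dbh->selectrow_array(
                'SELECT 1 FROM badips WHERE ip = ? UNION SELECT 1 FROM badips2 WHERE ip = ?',
                undef, $ip, $ip);
            $dbh->do("INSERT INTO $table (ip) VALUES (?)", undef, $ip);
        }
    }

Running it every minute from cron is then just a "* * * * * /path/to/badips.pl" entry (path hypothetical).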
- The output of badips.pl is a configuration file that nginx or Apache reads on the fly. The script works with both Web servers; a variable at the top of the script indicates which one you are using. The resulting conf file is a list of deny lines covering the IPs you are currently blocking.
For Apache, there are some additional lines that preemptively block suspicious user agents, e.g. curl and wget. I haven't yet ported these preemptive lines over to nginx. The intention is for Apache to pick up the file via an AllowOverride All directive, i.e. as a changing .htaccess file. For nginx, the configuration is reloaded on the fly.
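For the conf file itself, something like the following sketch captures the idea: query the blocked IPs and write out deny lines in whichever syntax your server wants. The file paths and the exact user-agent rules are placeholders (the Apache part uses 2.2-style access-control syntax); the real output comes from badips.pl.

    #!/usr/bin/perl
    # Sketch of writing the blocklist out for nginx or Apache.
    use strict;
    use warnings;
    use DBI;

    my $server   = 'nginx';                          # or 'apache'
    my $conffile = $server eq 'nginx'
                 ? '/etc/nginx/conf.d/badips.conf'   # included from the http block
                 : '/var/www/example.com/.htaccess'; # picked up via AllowOverride All

    my $dbh = DBI->connect('dbi:Pg:dbname=logip', 'logip', '', { RaiseError => 1 });
    my $ips = $dbh->selectcol_arrayref(
        'SELECT ip FROM badips UNION SELECT ip FROM badips2');

    open my $out, '>', "$conffile.tmp" or die "$conffile.tmp: $!";
    if ($server eq 'apache') {
        # One way to preemptively block suspicious user agents (Apache 2.2 syntax).
        print $out qq{SetEnvIfNoCase User-Agent "curl|wget" bad_bot\n};
        print $out "Order allow,deny\nAllow from all\nDeny from env=bad_bot\n";
        print $out "Deny from $_\n" for @$ips;
    } else {
        print $out "deny $_;\n" for @$ips;
    }
    close $out;
    rename "$conffile.tmp", $conffile or die "rename: $!";

    # Apache re-reads .htaccess on every request; nginx needs a reload to notice.
    system('nginx', '-s', 'reload') if $server eq 'nginx';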
- If new IPs are added, you are sent an email notifying you of the new block(s). The script attempts reverse DNS on the IP and then forward DNS on the resulting hostname to give you some context. For example, if it is a Google IP, you will want to unblock it; in practice, though, I haven't had to do that in a while because those IPs are well exempted in logip.pl. Before exiting, badips.pl also cleans up the DB, deleting expired records and vacuuming the tables (I use PostgreSQL).
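The notification and cleanup steps might look roughly like this. The addresses, retention intervals, and table layout are the same placeholder assumptions as above, and the mail is simply piped to sendmail; the real logic is in badips.pl.

    #!/usr/bin/perl
    # Sketch of the notification and cleanup steps (placeholder addresses and paths).
    use strict;
    use warnings;
    use DBI;
    use Socket qw(inet_aton inet_ntoa AF_INET);

    my $dbh = DBI->connect('dbi:Pg:dbname=logip', 'logip', '',
                           { RaiseError => 1, AutoCommit => 1 });

    # Give each newly blocked IP some context: reverse DNS, then forward DNS on that name.
    sub describe_ip {
        my ($ip) = @_;
        my $host    = gethostbyaddr(inet_aton($ip), AF_INET) || '(no reverse DNS)';
        my $forward = gethostbyname($host);
        my $back    = $forward ? inet_ntoa($forward) : '(no forward DNS)';
        return "$ip -> $host -> $back";
    }

    my @new_blocks = ('192.0.2.10');   # would come from the threshold step above
    if (@new_blocks) {
        open my $mail, '|-', '/usr/sbin/sendmail -t' or die "sendmail: $!";
        print $mail "To: you\@example.com\nSubject: new blocked IPs\n\n";
        print $mail describe_ip($_), "\n" for @new_blocks;
        close $mail;
    }

    # Clean up: drop old tracking rows, expire old blocks, and vacuum.
    # (VACUUM needs AutoCommit on in DBD::Pg.)
    $dbh->do("DELETE FROM logip   WHERE ts < now() - interval '2 days'");   # keep enough history for the longest timeframe
    $dbh->do("DELETE FROM badips  WHERE ts < now() - interval '180 days'"); # really bad blocks
    $dbh->do("DELETE FROM badips2 WHERE ts < now() - interval '10 days'");  # pretty bad blocks
    $dbh->do("VACUUM $_") for qw(logip badips badips2);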
I've evolved this process over the last few years, and it works quite well for me and my sites. Your feedback is of course welcome. I'm always looking for improvements and am willing to make them.
I am aware that the current process has some holes. The two biggest are:
- You can spider successfully via a large number of IPs, most notably the TOR network. In the past I added those IPs dynamically, and I might do so again, but adding a ton of IPs to the block list slows down the Web server considerably, which is why I backed off of that approach.
- You can grab pages really slowly. That is, if you stay under the thresholds, you won't get caught by this system.