How to deal with Web Scraping


Hummmm since i have several galleries, one thing i encounter often is web scrapers, personally i don't mind if anybody takes the entire gallery home or even if they re-post it somewhere else, that is fine with me, this is the web, if i wanted the gallery to be private i would made it so, if its public and free then go right ahead...



However the content itself is not the problem, the problem here is that the vast majority of web scrapers has bad default settings or the users that use them put too aggressive settings, its not uncommon for the load on the server to go from 0.10 to 1 in a heartbeat or even go down, i know its partly my fault i personally like to restrict as little as possible the server or the software (i could use several methods to restrict connections or to ban ip's if there are too many connections), however because i don't, sometimes i get into trouble, so this is what i normally do.

First of all i have this on the .htaccces (mod_rewrite), that helps blocking most of scrapping software (unless it spoofs as a browsers hehehe):

RewriteEngine OnRewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]

RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]

RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]

RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]

RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]

RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]

RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]

RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]

RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]

RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]

RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]

RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]

RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]

RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]

RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]

RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]

RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]

RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]

RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]

RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]

RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]

RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]

RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]

RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]

RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]

RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]

RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]

RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]

RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]

RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]

RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]

RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]

RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]

RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]

RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]

RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]

RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]

RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]

RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]

RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]

RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]

RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]

RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]

RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]

RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]

RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]

RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]

RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]

RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]

RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]

RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]

RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]

RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]

RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]

RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]

RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]

RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]

RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]

RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]

RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]

RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]

RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]

RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]

RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]

RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]

RewriteCond %{HTTP_USER_AGENT} ^Zeus

RewriteRule ^.* - [F,L]

I monitor the load of the server if the load as a spike for more than a couple of minutes i check the apache log, if there are lots of connections to the same site from the same ip, i ban this ip with .htaccess, adding this line:

Order allow,deny
Deny from 100.100.100.100
Allow from all

(100.100.100.100 is the ip on the logs) and check the load after a couple of minutes, if its down fine, if they jumped ip's, i'll do one of two things, if they keep on the same ip range i'll just block that ip range, like so:

Order allow,deny
Deny from 100.100.100.
Allow from all

If they aren't then i limit the connections to 10 on the cache or image directory (not on all the site), i know it will hurt all users but its better than nothing, just adding this line:

MaxClients 10

If it still persists, i'll just close the image directory,adding this line again on the cache or image directory (depending on the type of image gallery software you are using):

deny from all

So the site stays up as well as the thumbnails, just the images won't be accessible for a while, all of these are temporary measures, but for now, for me they do the trick, most of the times banning the ip is enough of a cure, and those i always leave on the .htaccess the other options i normally remove them the next day after the connections storm has passed, bottom line is if you want to scrape instead of bombing the server for an hour, make it so it downloads slowly during a couple of hours it makes a big diference and everyone gets what they want.