Home > Archive > PHP Language > February 2007 > Anti Web Scraping - Slightly o/t
Anti Web Scraping - Slightly o/t
| Simon Harris 2007-02-08, 6:58 pm |
| Hi,
Apologies if this is slightly o/t.
Does anyone know of a LAMP solution to minimize/stop bots scraping website
content? I know there are some out there, just looking for recommendations
really, as any such software obviously has the potential to have a negative
effect on the site, for SEO.
Thanks.
--
--
* Please reply to group for the benefit of all
* Found the answer to your own question? Post it!
* Get a useful reply to one of your posts?...post an answer to another one
* Search first, post later : http://www.google.co.uk/groups
* Want my email address? Ask me in a post...Cos2MuchSpamMakesUFat!
| |
| OmegaJunior 2007-02-09, 3:58 am |
| On Thu, 08 Feb 2007 19:42:10 +0100, Simon Harris
<too-much-spam@makes-you-fat.com> wrote:
> Hi,
>
> Apologies if this is slightly o/t.
>
> Does anyone know of a LAMP solution to minimize/stop bots scraping
> website
> content? I know there are some out there, just looking for
> recommendations
> really, as any such software obviously has the potential to have a
> negative
> effect on the site, for SEO.
>
> Thanks.
>
Check out the robots.txt files and .htaccess.
| |
|
| On Fri, 09 Feb 2007 07:51:16 +0100, OmegaJunior <omegajunior@spamremove.home.nl> wrote:
>On Thu, 08 Feb 2007 19:42:10 +0100, Simon Harris
><too-much-spam@makes-you-fat.com> wrote:
>
>
>Check out the robots.txt files and .htaccess.
Google "htaccess ban list".
Here is the example I started with:
http://www.bluehostforum.com/archiv....php/t-647.html
That is one example, but keep in mind that the list above will trigger a server error because
2 or 3 of its entries are bad, and I can't remember offhand which ones.
The way I found them was cut and paste. I pasted about 20 entries at a time and checked whether I
got an error, then the next twenty. When I got an error, I worked my way back by deleting, pasting,
uploading and retesting. I should have written down which ones caused the error, but I didn't.
This technique does work, though.
Tell you what, I'll just post the one I have, and maybe others in this group can enhance it. It
is the list from the link above, minus the 3 entries that gave me errors.
Note that I have some redirects at the top relating to my application, but you do need to have the
top 2 lines (RewriteEngine On and RewriteBase /) there.
Quick explanation: the rules at the top are for party invitations.
If someone links to www.WEBSITE.COM/rsvp_la, they are taken to the page
www.WEBSITE.COM/rsvp_la.php
The rest filters out the known bots/scrapers in the list. You see all those conditions, and the
last line in the group is RewriteRule ^.* - [F,L]. The F means Forbidden (the server answers with
a 403) and the L means execute this as the last rule.
It's basically saying...
if HTTP_USER_AGENT starts with BlackWidow OR
if HTTP_USER_AGENT starts with Bot mailto:craftbot@yahoo.com OR etc....
if ANY of the above is true, then do NOT allow access; otherwise the request is served as normal
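For anyone who finds mod_rewrite hard to read, the way a chain of RewriteCond lines joined with [OR] is evaluated can be sketched in Python. This is only a rough model, not how Apache is implemented: the names BLOCKED_AGENT_PREFIXES and is_forbidden are made up for illustration, and the list is a small subset of the entries in the .htaccess file.

```python
import re

# Hypothetical subset of the ban list; each entry stands for one
# "RewriteCond %{HTTP_USER_AGENT} ^Name [NC,OR]" line.
BLOCKED_AGENT_PREFIXES = ["BlackWidow", "HTTrack", "Teleport Pro", "Wget"]

# "^" anchors the match at the start of the User-Agent string;
# re.IGNORECASE plays the role of the [NC] flag.
BLOCKED_PATTERNS = [
    re.compile("^" + re.escape(prefix), re.IGNORECASE)
    for prefix in BLOCKED_AGENT_PREFIXES
]


def is_forbidden(user_agent: str) -> bool:
    """Return True if ANY pattern matches, like conditions chained with [OR]."""
    return any(pattern.match(user_agent) for pattern in BLOCKED_PATTERNS)


print(is_forbidden("Wget/1.10.2"))                     # True
print(is_forbidden("Mozilla/5.0 (compatible; Wget)"))  # False
```

The second call returns False because the patterns are anchored with ^, so they only match when the User-Agent starts with the banned name. A scraper that sends a browser-like User-Agent gets past this kind of filter.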
The rules at the bottom prevent people from hotlinking to my graphics, a technique that I found
somewhere. What happens here is: if someone requests one of my .gif or .jpg images and the request
carries a referer that is NOT my site, I swap out the image with a blank spacer.gif.
.htaccess can be tricky. If none of this makes any sense at all, you need to read up on
mod_rewrite. Good luck.
RewriteEngine On
RewriteBase /
RewriteRule rsvp/?$ /rsvp_la.php
RewriteRule RSVP/?$ /rsvp_la.php
RewriteRule rsvpny/?$ /rsvp_ny.php
RewriteRule RSVPNY/?$ /rsvp_ny.php
RewriteRule rsvp_la/?$ /rsvp_la.php
RewriteRule RSVP_LA/?$ /rsvp_la.php
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ClariaBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlickBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^iaea\.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SurveyBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^whsearch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
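# Hotlink protection: skip empty referers, our own site, and the replacement image itself
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?YOURWEBSITE.COM/.*$ [NC]
RewriteCond %{REQUEST_URI} !^/images/spacer\.gif$ [NC]
RewriteRule \.(gif|jpg)$ http://www.YOURWEBSITE.COM/images/spacer.gif [R,L]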
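The decision made by the hotlink rules at the end of the file can be sketched the same way in Python. Again this is a simplified model under assumptions: the function name should_swap_image is invented for illustration, and yourwebsite.com stands in for the YOURWEBSITE.COM placeholder.

```python
import re

# Placeholder host standing in for YOURWEBSITE.COM; [NC] becomes re.IGNORECASE.
OWN_SITE = re.compile(r"^http://(www\.)?yourwebsite\.com/", re.IGNORECASE)

# Same file extensions as the RewriteRule pattern \.(gif|jpg)$
IMAGE_PATH = re.compile(r"\.(gif|jpg)$")


def should_swap_image(path: str, referer: str) -> bool:
    """Return True when the request would be redirected to spacer.gif."""
    if not IMAGE_PATH.search(path):
        return False  # the rule only applies to .gif and .jpg requests
    if referer == "":
        return False  # an empty referer is let through
    # Swap only when the referer is some other site.
    return OWN_SITE.match(referer) is None


print(should_swap_image("/images/logo.gif", "http://example.org/page.html"))     # True
print(should_swap_image("/images/logo.gif", ""))                                 # False
print(should_swap_image("/images/logo.gif", "http://www.yourwebsite.com/a.php")) # False
```

Letting empty referers through matters because some browsers and proxies strip the header, and those visitors would otherwise see blank images on your own pages.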