The importance of checking website server logs
I very rarely find the time to check the raw access logs for my websites. This is mainly because I’ve got Google Analytics installed, and since it provides pretty much all of the information I require, I don’t really need to check my server logs.
However, Google Analytics does not show you which search engine bots have been crawling your website, because bots do not execute the JavaScript that Analytics uses to track visitors. I recently needed to check whether a particular page had been crawled, so I downloaded my server logs. I searched for “Googlebot” and got an audit trail of the pages crawled over the past month or so, but I also noticed a completely unfamiliar IP address claiming to be Googlebot. The Googlebot addresses I usually see start with 66 (e.g. 66.249.65.49), but this one was 114.152.179.5. So I did a reverse DNS check on that IP and, guess what, the IP that claimed to be Googlebot is not Googlebot at all. It belongs to someone in Japan who has written a program to leech content off my website.
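For anyone who wants to repeat this check, here is a minimal Python sketch of the two steps: find the IPs that claim to be Googlebot, then verify each one with a reverse DNS lookup followed by a forward lookup. It assumes the access log is in Apache's combined format, with the IP as the first field and the user agent as the last quoted field. The function names are my own, and the accepted domains (googlebot.com and google.com) are an assumption that should be checked against Google's current guidance.

```python
import re
import socket
from collections import Counter

# Assumes combined log format: IP first, user agent as the last quoted field.
LOG_LINE = re.compile(r'^(\S+) .* "([^"]*)"$')


def googlebot_claims(lines):
    """Count requests per IP for lines whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "googlebot" in match.group(2).lower():
            counts[match.group(1)] += 1
    return counts


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse lookup, domain check, then confirm the name maps back to the IP."""
    try:
        host = reverse(ip)[0]
        # Assumed list of genuine crawler domains.
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in forward(host)[2]
    except OSError:
        # No reverse record or a failed lookup: treat the claim as unverified.
        return False
```

Feeding the lines of an access log to `googlebot_claims` gives a per-IP request count, and any IP for which `is_real_googlebot` returns `False` is worth a closer look. The forward lookup matters because whoever controls an IP can point its reverse DNS record at any name they like.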
The server logs revealed that it had requested a significant number of pages. I thought that Google was really loving my website and had decided to do a deep crawl, but in reality somebody was stealing content off my site. You can guess my frustration! I have now blocked the IP address through .htaccess, and I will be checking my raw access logs more often in an attempt to ban future leechers. If you’re a leecher and reading this, beware, because I’m on to you now!
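As a rough illustration of the blocking step, the sketch below turns a list of unwanted IPs into rules that could be pasted into .htaccess. It assumes Apache 2.4, where access control uses `Require` directives; Apache 2.2 uses `Order` and `Deny from` instead. The helper name `htaccess_deny_rules` is hypothetical, and this is not necessarily the exact rule I used.

```python
def htaccess_deny_rules(ips):
    """Build an Apache 2.4 block that allows everyone except the listed IPs."""
    lines = ["<RequireAll>", "    Require all granted"]
    # De-duplicate and sort so the output is stable between runs.
    lines += [f"    Require not ip {ip}" for ip in sorted(set(ips))]
    lines.append("</RequireAll>")
    return "\n".join(lines)


print(htaccess_deny_rules(["114.152.179.5"]))
```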