How Baidu Drives Web Servers Crazy
July 21st, 2005
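As early as 2001, I called for guarding against the encroachment of the "Du spider" (度蜘蛛), BaiduSpider. More than four years later, this spider is still harming the Internet and eating up its bandwidth. Take a look at the following log excerpt: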
202.108.11.233 - - [21/Jul/2005:14:36:20 +0800] "GET http://nalai.net/content/view/305174/31/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:21 +0800] "GET http://nalai.net/content/view/304951/32/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:23 +0800] "GET http://nalai.net/content/view/306886/27/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:23 +0800] "GET http://nalai.net/content/view/310054/28/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:23 +0800] "GET http://nalai.net/content/view/306963/32/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
61.49.186.145 - - [21/Jul/2005:14:36:24 +0800] "GET http://albertxu.nalai.net/post/1/9124/ HTTP/1.1" 200 197752 "-" "Mozilla/4.0(compatible; MSIE 5.0; Windows 98; DigExt)"
202.108.11.233 - - [21/Jul/2005:14:36:24 +0800] "GET http://nalai.net/content/view/305244/33/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:26 +0800] "GET http://nalai.net/content/view/307014/31/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:26 +0800] "GET http://nalai.net/content/view/306971/32/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:26 +0800] "GET http://nalai.net/content/view/306004/31/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:30 +0800] "GET http://nalai.net/content/view/305282/7/ HTTP/1.1" 403 456 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:30 +0800] "GET http://nalai.net/content/view/306236/33/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:30 +0800] "GET http://nalai.net/content/view/307039/33/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:32 +0800] "GET http://nalai.net/content/view/314483/31/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:32 +0800] "GET http://nalai.net/content/view/306820/32/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:33 +0800] "GET http://nalai.net/content/view/305283/27/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:34 +0800] "GET http://nalai.net/content/view/314486/33/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:34 +0800] "GET http://nalai.net/content/view/306961/30/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
68.142.250.88 - - [21/Jul/2005:14:36:35 +0800] "GET http://oso.nalai.net/print/1/11408/pdf HTTP/1.0" 200 2595 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
202.108.11.233 - - [21/Jul/2005:14:36:35 +0800] "GET http://nalai.net/content/view/305242/7/ HTTP/1.1" 403 456 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:35 +0800] "GET http://nalai.net/content/view/305839/27/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:37 +0800] "GET http://nalai.net/content/view/306842/33/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:37 +0800] "GET http://nalai.net/content/view/306626/30/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
202.108.11.233 - - [21/Jul/2005:14:36:38 +0800] "GET http://nalai.net/content/view/306814/32/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
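This is one continuous stretch of the log. The 403 errors are there because I changed the Apache configuration to identify the User-Agent and block this spider's stupid behavior.
If you take it up with Baidu, they will tell you how excellent their algorithm is and that it would never crawl a site this frantically. But if you are a webmaster and have the time to watch your web server's logs, I believe you will find the same frantic behavior in your own logs.
If your server runs Unix, the following command will show it: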
tail -f /var/log/apache/access_log|grep Baiduspider+
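If you would rather have numbers than a scrolling screen, the same check can be sketched in Python. This is only a rough sketch: it assumes your log uses the same combined format as the excerpt above, and the names `LOG_LINE`, `SAMPLE`, and `crawler_hits_per_second` are my own inventions for illustration, not part of Apache or any library.

```python
import re
from collections import Counter

# Assumption: lines look like the combined-format excerpt above:
# ip - - [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Four lines copied from the log excerpt above (three Baiduspider, one browser).
SAMPLE = [
    '202.108.11.233 - - [21/Jul/2005:14:36:20 +0800] "GET http://nalai.net/content/view/305174/31/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
    '202.108.11.233 - - [21/Jul/2005:14:36:23 +0800] "GET http://nalai.net/content/view/310054/28/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
    '202.108.11.233 - - [21/Jul/2005:14:36:23 +0800] "GET http://nalai.net/content/view/306963/32/ HTTP/1.1" 403 457 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
    '61.49.186.145 - - [21/Jul/2005:14:36:24 +0800] "GET http://albertxu.nalai.net/post/1/9124/ HTTP/1.1" 200 197752 "-" "Mozilla/4.0(compatible; MSIE 5.0; Windows 98; DigExt)"',
]


def crawler_hits_per_second(lines, agent_substring="Baiduspider"):
    """Count requests per second whose User-Agent contains agent_substring."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        if agent_substring not in match.group("agent"):
            continue  # some other client, not the crawler we are counting
        # "21/Jul/2005:14:36:20 +0800" -> "21/Jul/2005:14:36:20"
        second = match.group("time")[:20]
        hits[second] += 1
    return hits


if __name__ == "__main__":
    # Print one line per second in which the crawler made a request.
    for second, count in sorted(crawler_hits_per_second(SAMPLE).items()):
        print(second, count)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `crawler_hits_per_second(open("/var/log/apache/access_log"))`. The path is the one from the command above and may differ on your system.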
This is Baidu, the stupid spider crawling the Internet in a crazy way.
It is about to go public to rake in money, and its scale is said to be only 80 million. If you are an investor and happen to read this article, please think carefully before you buy this company's stock.
Four years ago I called for wiping out the "Du spider" (度蜘蛛), and today I still do.
阿呆 said:
Baidu, just who have you been searching?