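List of crawler bot user agents

August 8th, 2011

To write a routine that decides whether a request comes from a bot, I extracted the user agents of the bots that have accessed the servers I manage.

The program, written in PHP, is: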

/**
 * Returns true if the user agent contains the name of a known crawler.
 * The comment on each branch is the info URL found in that user agent.
 */
function isBot($user_agent) {
    if ( strpos ( $user_agent,'Googlebot' ) !== false ) {
        // Every Googlebot variant is treated as a bot; the branches only record the URLs.
        if ( strpos ( $user_agent,'Mobile' ) !== false ) {
            return true; //http://www.google.com/bot.html
        } elseif ( strpos ( $user_agent,'Googlebot-Image' ) !== false ) {
            return true; // no URL
        } elseif ( strpos ( $user_agent,'YPBot' ) !== false ) {
            return true; //http://www.yellowpages.com/about/legal/crawl
        }
        return true; //http://www.google.com/bot.html
    } elseif ( strpos ( $user_agent,'Yahoo! Slurp' ) !== false ) {
        return true; //http://help.yahoo.com/help/us/ysearch/slurp
    } elseif ( strpos ( $user_agent,'bingbot' ) !== false ) {
        return true; //http://www.bing.com/bingbot.htm
    } elseif ( strpos ( $user_agent,'Yeti' ) !== false ) {
        return true; //http://help.naver.com/robots/
    } elseif ( strpos ( $user_agent,'Baiduspider+' ) !== false ) {
        return true; //http://www.baidu.com/search/spider.htm
    } elseif ( strpos ( $user_agent,'Baiduspider' ) !== false ) {
        return true; //http://www.baidu.com/search/spider.html
    } elseif ( strpos ( $user_agent,'Steeler' ) !== false ) {
        return true; //http://www.tkl.iis.u-tokyo.ac.jp/~crawler/
    } elseif ( strpos ( $user_agent,'ichiro/mobile goo' ) !== false ) {
        return true; //http://help.goo.ne.jp/help/article/1142/
    } elseif ( strpos ( $user_agent,'ichiro' ) !== false ) {
        return true; //http://help.goo.ne.jp/door/crawler.html
    } elseif ( strpos ( $user_agent,'hotpage.fr' ) !== false ) {
        return true; //http://www.hotpage.fr
    } elseif ( strpos ( $user_agent,'Feedfetcher-Google' ) !== false ) {
        return true; //http://www.google.com/feedfetcher.html
    } elseif ( strpos ( $user_agent,'livedoor FeedFetcher' ) !== false ) {
        return true; //http://reader.livedoor.com/
    } elseif ( strpos ( $user_agent,'ia_archiver' ) !== false ) {
        return true; //http://www.alexa.com/site/help/webmasters
    } elseif ( strpos ( $user_agent,'YandexBot' ) !== false ) {
        return true; //http://yandex.com/bots
    } elseif ( strpos ( $user_agent,'SISTRIX Crawler' ) !== false ) {
        return true; //http://crawler.sistrix.net/
    } elseif ( strpos ( $user_agent,'msnbot-media' ) !== false ) {
        return true; //http://search.msn.com/msnbot.htm
    } elseif ( strpos ( $user_agent,'zenback bot' ) !== false ) {
        return true; //http://www.logly.co.jp/
    } elseif ( strpos ( $user_agent,'Y!J-BRI' ) !== false ) {
        return true; //http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html
    } elseif ( strpos ( $user_agent,'TurnitinBot' ) !== false ) {
        return true; //http://www.turnitin.com/robot/crawlerinfo.html
    } elseif ( strpos ( $user_agent,'Google Desktop' ) !== false ) {
        return true; //http://desktop.google.com/
    } elseif ( strpos ( $user_agent,'newzia crawler' ) !== false ) {
        return true; //http://www.logly.co.jp/
    } elseif ( strpos ( $user_agent,'BaiduMobaider' ) !== false ) {
        return true; //http://www.baidu.jp/spider/
    } elseif ( strpos ( $user_agent,'Y!J-BRJ/YATS crawler' ) !== false ) {
        return true; //http://listing.yahoo.co.jp/support/faq/int/other/other_001.html
    } elseif ( strpos ( $user_agent,'Seznam screenshot-generator' ) !== false ) {
        return true; //http://fulltext.sblog.cz/screenshot/
    } elseif ( strpos ( $user_agent,'SiteBot' ) !== false ) {
        return true; //http://www.sitebot.org/robot/
    } elseif ( strpos ( $user_agent,'Purebot' ) !== false ) {
        return true; //http://www.puritysearch.net/
    } elseif ( strpos ( $user_agent,'emBot-GalaBuzz/Nutch' ) !== false ) {
        return true; //http://emining.jp/
    } elseif ( strpos ( $user_agent,'Search17Bot' ) !== false ) {
        return true; //http://www.search17.com/bot.php
    } elseif ( strpos ( $user_agent,'Toread-Crawler' ) !== false ) {
        return true; //http://news.toread.cc/crawler.php
    } elseif ( strpos ( $user_agent,'Tumblr' ) !== false ) {
        return true; //http://www.tumblr.com/
    } elseif ( strpos ( $user_agent,'DotBot' ) !== false ) {
        return true; //http://www.dotnetdotcom.org/
    } elseif ( strpos ( $user_agent,'Chilkat' ) !== false ) {
        return true; //http://www.chilkatsoft.com/ChilkatHttpUA.asp
    }
    return false;
}

Something like this.

My hand got tired from all the copying and pasting...

To use it, call it along these lines:

isBot(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '')

Note: You are welcome to copy and paste this code as-is, but if the program contains mistakes, you use it at your own risk.
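Since every branch of isBot() returns true, the same check can also be written as a loop over a list of substrings, which saves most of the copying and pasting. The following is only a minimal sketch and not the code from this post: the names isBotByList and $botSignatures are hypothetical, the list is a small subset of the strings checked above, and the ?? operator assumes PHP 7 or later.

```php
<?php
// Hypothetical table-driven variant of isBot(); not the original code.
function isBotByList(string $userAgent, array $signatures): bool
{
    foreach ($signatures as $signature) {
        // strpos() is case-sensitive, like the checks in the original function.
        if (strpos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// Subset of the substrings used above; extend it as new bots show up.
$botSignatures = [
    'Googlebot', 'Yahoo! Slurp', 'bingbot', 'Yeti',
    'Baiduspider', 'ichiro', 'YandexBot',
];

// PHP 7+: ?? falls back to '' when the header is missing.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
$isBot = isBotByList($userAgent, $botSignatures);
```

Adding a bot then only means adding one string to the array. The trade-off is that the per-bot URL comments have to live next to the array entries instead of next to each branch.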

In case it is useful to someone, I am pasting the full user agent strings below.

They are sorted by number of accesses, most frequent first.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Steeler/3.5; http://www.tkl.iis.u-tokyo.ac.jp/~crawler/)
ichiro/3.0 (http://help.goo.ne.jp/door/crawler.html)
hotpage.fr (http://www.hotpage.fr)
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=6741455251313593689)
Baiduspider+(+http://www.baidu.com/search/spider.htm)
livedoor FeedFetcher/0.01 (http://reader.livedoor.com/; 1 subscriber)
ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (compatible; zenback bot; powered by logly +http://www.logly.co.jp/)
Y!J-BRI/0.0.1 crawler ( http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html )
TurnitinBot/2.1 (http://www.turnitin.com/robot/crawlerinfo.html)
Mozilla/5.0 (compatible; Google Desktop/5.9.1005.12335; http://desktop.google.com/)
Mozilla/5.0 (compatible; newzia crawler +http://www.logly.co.jp/)
DoCoMo/2.0 P05A(c100;TB;W24H15) (compatible; BaiduMobaider/1.0; +http://www.baidu.jp/spider/)
Y!J-BRJ/YATS crawler (http://listing.yahoo.co.jp/support/faq/int/other/other_001.html)
Mozilla/5.0 (compatible; Seznam screenshot-generator 2.0; +http://fulltext.sblog.cz/screenshot/)
Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
emBot-GalaBuzz/Nutch-1.0 (http://emining.jp/; em@galabuzz.jp)
Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php)
DoCoMo/2.0 P900i(c100;TB;W24H11) (compatible; ichiro/mobile goo; +http://help.goo.ne.jp/help/article/1142/)
YPBot/Raven1.1.3 (compatible; Googlebot/2.1;+http://www.yellowpages.com/about/legal/crawl)
Mozilla/4.0 (Toread-Crawler/1.1; +http://news.toread.cc/crawler.php)
Tumblr/1.0 RSS syndication (+http://www.tumblr.com/) (support@tumblr.com)
Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
Chilkat/1.0.0 (+http://www.chilkatsoft.com/ChilkatHttpUA.asp)
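An ordering by access count like the one used for the list above could be produced roughly as follows. This is a sketch under assumptions: the user agent strings are taken to be already extracted from the access log into an array, the function name rankUserAgents is hypothetical, and the sample counts are made up for illustration.

```php
<?php
// Hypothetical helper: counts identical user agent strings and
// returns them as [user agent => hits], most frequent first.
function rankUserAgents(array $userAgents): array
{
    $counts = array_count_values($userAgents);
    arsort($counts); // sort by hit count, descending, keeping the keys
    return $counts;
}

// Illustrative sample only; real input would come from the access log.
$sample = [
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
];

foreach (rankUserAgents($sample) as $userAgent => $hits) {
    echo $hits, "\t", $userAgent, "\n";
}
```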

Categories: PHP, Server, Development


1 comment

  1. 偽クローラ

    This was really helpful.
    Thank you!
