For spiders that respect the robots protocol, you can block them directly in robots.txt. Just add the following to the robots.txt file in your site's root directory:
User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: MegaIndex.ru
Disallow: /

User-agent: MauiBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: BLEXBot
Disallow: /
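To sanity-check the rules, you can parse the deployed file with Python's standard urllib.robotparser module and ask whether a given bot may fetch a URL. This is only a quick verification sketch; the example.com URL is a placeholder for your own domain.

from urllib.robotparser import RobotFileParser

# Placeholder site URL; replace with your own domain
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

for bot in ["SemrushBot", "AhrefsBot", "MJ12bot", "Googlebot"]:
    allowed = rp.can_fetch(bot, "https://example.com/")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")

With the rules above, the first three bots should report "blocked" while Googlebot stays "allowed".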
For spiders that ignore the robots rules, the only practical way to block them is by user agent or by IP. For these bots we can add the following to the Nginx configuration (inside the server block):
# Block scraping tools such as Scrapy
if ($http_user_agent ~* "(Scrapy|Curl|HttpClient)") {
    return 403;
}
# Block the listed user agents, as well as requests with an empty user agent
if ($http_user_agent ~ "yisouspider|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|YandexBot|ZoominfoBot|PetalBot|petalBot|Ezooms|^$" ) {
    return 403;
}
# Block request methods other than GET, HEAD and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
# Block downloads of archive and backup files
location ~* \.(tgz|bak|zip|rar|tar|gz|bz2|xz|tar\.gz)$ {
    return 400;
}
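After reloading Nginx (nginx -s reload), you can verify the rules by sending requests with different user agents and checking the status codes. The sketch below uses only the Python standard library and a placeholder domain; a blocked UA such as AhrefsBot should receive 403, while a normal browser UA should receive 200.

import urllib.request
import urllib.error

# Placeholder domain; replace with your own site
URL = "https://example.com/"

def status_for(user_agent):
    req = urllib.request.Request(URL, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # 403 responses are raised as HTTPError

print(status_for("AhrefsBot"))                      # expect 403
print(status_for("Mozilla/5.0 (Windows NT 10.0)"))  # expect 200

Note that the UA blacklist above also matches Python-urllib, so any script that does not set its own User-Agent header will be blocked as well.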