
A strange crawler; looking for analysis.

    sanp · 2014-08-05 12:47:25 +08:00 · 3947 views
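    This is an excerpt from my access log. /{xxx} stands for a path on my site; everything else is the original log, unmodified.
    The crawler's IP is not fixed: after I block one, a new IP shows up a while later.
    This crawler started hitting my site about two years ago. In between, the site was shut down for roughly a year. Now that it is back up, to my surprise the crawler is still around. I have no idea who is behind it. Quite possibly it kept crawling the whole time the site was down. Can anyone help analyze this?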

    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a0" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a4" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a9" "Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a6" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_4) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a8" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/10.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:57 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.04 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a7" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
    94.102.49.31 - - [05/Aug/2014:04:43:00 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
    94.102.49.31 - - [05/Aug/2014:04:43:01 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.16) Gecko/20120427 Firefox/15.0a1"
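
To make the pattern in the excerpt easier to see, here is a minimal sketch that tallies requests and distinct User-Agent strings per IP. It assumes the log is in the common "combined" access-log layout, which the lines above appear to follow; `LOG_RE`, `summarize`, and `SAMPLE` are names made up for this illustration, and `SAMPLE` holds three lines copied from the excerpt.

```python
import re
from collections import defaultdict

# Assumed layout: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines):
    """Return {ip: (request_count, distinct_user_agent_count)}."""
    counts = defaultdict(int)
    agents = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        counts[m["ip"]] += 1
        agents[m["ip"]].add(m["agent"])
    return {ip: (counts[ip], len(agents[ip])) for ip in counts}

# Three lines copied from the excerpt above.
SAMPLE = [
    '89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1"',
    '89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"',
    '94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a6" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_4) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.65 Safari/535.11"',
]

print(summarize(SAMPLE))
```

An IP whose request count and distinct User-Agent count are nearly equal, as in the excerpt, is rotating its User-Agent on every request, which an ordinary browser would not do.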
    15 replies · 2014-08-05 15:25:55 +08:00
    #1 · liangdi · 2014-08-05 12:48:57 +08:00
    Where is the log?
    #2 · sanp (OP) · 2014-08-05 12:50:30 +08:00
    @liangdi I pressed Enter just now and it posted automatically... I have just edited the post.
    #3 · plprapper · 2014-08-05 13:16:23 +08:00
    That is hardly a crawler; it is more like a mangy stray dog that refuses to leave...
    #4 · liangdi · 2014-08-05 13:26:59 +08:00
    Probably a content scraper. What kind of site is it, OP?
    #5 · sintrb · 2014-08-05 13:28:21 +08:00
    Poor crawler...
    #6 · popbones · 2014-08-05 14:20:01 +08:00
    IP : 89.248.162.170
    Host : server156950.santrex.net
    Country : Netherlands

    IP : 94.102.49.31
    Host : ?
    Country : Netherlands
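
The Host values above look like the result of a reverse DNS (PTR) lookup. A minimal sketch of doing the same lookup from Python follows; `reverse_lookup` and its `resolver` parameter are hypothetical names for this illustration, and the only library call relied on is the standard `socket.gethostbyaddr`. Results depend on current DNS records, so a lookup made today need not match what was reported in 2014.

```python
import socket

def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the PTR hostname for ip, or None if no record is found.

    resolver is injectable so the function can be exercised without
    network access; by default it is socket.gethostbyaddr.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None  # no PTR record, or the lookup failed
    return hostname
```

Calling `reverse_lookup("89.248.162.170")` performs a live lookup; an IP without a PTR record yields `None`, which corresponds to the "?" shown for the second address.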
    #7 · captainhcg · 2014-08-05 14:34:48 +08:00
    http://www.projecthoneypot.org/ip_94.102.49.213
    It looks like a comment spammer. Is your site built on WordPress or a similar framework?
    #8 · ChanneW · 2014-08-05 14:35:14 +08:00
    How can you tell it is not really coming from Google?
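
One possible tell, offered as a sketch rather than as the thread's own answer: every Referer in the log has the form `http://www.google.com/#q=aN`. RFC 7231 says a user agent must not include the fragment (the part after `#`) when it generates the Referer header, so a Referer carrying a fragment is unlikely to come from a real browser following a Google result. The helper name `referer_has_fragment` is made up for this example.

```python
from urllib.parse import urlsplit

def referer_has_fragment(referer):
    """True if the Referer value carries a URL fragment (text after '#')."""
    return bool(urlsplit(referer).fragment)

# Referer taken from the log excerpt: has a fragment, so it looks forged.
print(referer_has_fragment("http://www.google.com/#q=a3"))

# A Referer without a fragment passes this particular check.
print(referer_has_fragment("http://www.google.com/search?q=a3"))
```

This is only a heuristic: a forged Referer can just as easily omit the fragment, so passing the check proves nothing by itself.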
    #9 · avrillavigne · 2014-08-05 15:04:17 +08:00
    http://antivirus.neu.edu.cn/ssh/lists/base_30days.txt Northeastern University (China) has put it on its blacklist - -
    #10 · vicacheung · 2014-08-05 15:06:38 +08:00
    @sanp Can topics be edited now?
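    #11 · sanp (OP) · 2014-08-05 15:22:05 +08:00
    @liangdi It is a utility site for looking up data, and the other side fetches it by traversing every page. What puzzles me is that the site was down for more than a year, and now that it is back up, the crawler is still here.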
    #12 · sanp (OP) · 2014-08-05 15:22:40 +08:00
    @vicacheung You can edit a topic right after posting it.
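    #13 · sanp (OP) · 2014-08-05 15:24:02 +08:00
    @plprapper Indeed. With an ordinary crawler, banning it is enough. With this one, after I ban an IP another one shows up a while later, and it fetches very frequently, basically crawling non-stop.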
    #14 · sanp (OP) · 2014-08-05 15:25:03 +08:00
    @captainhcg I am not using WordPress. This crawler traverses the site's pages and then just keeps crawling them.
    #15 · sanp (OP) · 2014-08-05 15:25:55 +08:00
    @avrillavigne It is indeed blacklisted in quite a few places on the internet. I just wonder why it keeps crawling without ever stopping.