
Scraping approach: how should I tackle this interesting website?

    redhatping · 2015-06-17 16:49:44 +08:00 · 4101 views
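    Website: http://www.exporivaschuh.it/catalogue/15ES2/search.html

    What I need to do: find all the Chinese companies (country code CN) and scrape their company name, phone number, and email address.

    What frustrates me is that the data appears to be stored inside JavaScript.


    How should I analyze this, and what approach should I take? I would appreciate a suggested plan.

    I normally use BeautifulSoup.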
    Addendum 1  ·  2015-06-17 17:26:08 +08:00
    p[18] = new e ("AIMINER LEATHER PRODUCTS CO., LTD.","","","CN", "AIMINERLEATHERPRODUCTSCOLTD","NO.258 WENCHANG ROAD,JIN HUA QIAO STREET,WUHOU DIS","610043", "CHENGDU", "PALAFIERE - HALL B2 - Stand D08", "+86 28 85017357", "[email protected]", "www.aiminer.net", "09012799","00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000");


    p[17] = new e ("AIDER COMPANY LIMITED","","","CN", "AIDERCOMPANYLIMITED","9/F XINGHU COMMERCIAL BLDG., NO. 46 OF HU LI DA DA","", "XIAMEN", "PALAFIERE - HALL C4 - Stand A23", "+86-592-5699286", "[email protected]", "",

    p[32] = new e ("ALLROUNDER SARL","","","FR", "ALLROUNDERSARL","BP 60007 - ROUTE DE SARREGUEMINES ","57400", "SARREBOURG", "PALAZZO DEI CONGRESSI 1^ - Stand 15", "+33 387 233000", "[email protected]", "www.allrounder.com", "16012812","00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000");


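    The data is stored in the form shown above. Could someone share some code?
    I don't know JS; so far I have only taught myself Python.

    What I need is, for example:

    First confirm that the country field is CN (China), then extract: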

    AIMINER LEATHER PRODUCTS CO., LTD
    +86 28 85017357
    [email protected]
    www.aiminer.net
    6 replies    2016-07-09 06:39:47 +08:00
    mutoulbj · #1 · 2015-06-17 16:57:55 +08:00
    As long as the data can be fetched, it doesn't matter that it sits inside JavaScript. Process the text yourself and extract what you need from it.
    mhycy · #2 · 2015-06-17 17:01:42 +08:00
    Analyze the JS logic. The simplest way is to grab the records with a regular expression and then rebuild the index.
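A minimal sketch of the "regex, then rebuild the index" idea, assuming every record sits on one line in the `p[n] = new e (...);` shape quoted in the addendum. The sample text is abbreviated from the thread, the e-mail values are placeholders (the real ones are hidden in the source), and `parse_records` is a hypothetical helper name.

```python
import json
import re

# Abbreviated sample in the shape quoted in the thread; e-mail values are
# placeholders because the real addresses are hidden in the source.
SAMPLE_JS = '''
p[18] = new e ("AIMINER LEATHER PRODUCTS CO., LTD.","","","CN", "AIMINERLEATHERPRODUCTSCOLTD","NO.258 WENCHANG ROAD","610043", "CHENGDU", "HALL B2 - Stand D08", "+86 28 85017357", "placeholder@example.com", "www.aiminer.net", "09012799","0000");
p[32] = new e ("ALLROUNDER SARL","","","FR", "ALLROUNDERSARL","BP 60007","57400", "SARREBOURG", "Stand 15", "+33 387 233000", "placeholder@example.com", "www.allrounder.com", "16012812","0000");
'''

# Matches: p[<index>] = new e (<arguments>);  with one record per line.
RECORD_RE = re.compile(r'p\[(\d+)\]\s*=\s*new e\s*\((.*)\);')


def parse_records(js_text):
    """Return {index: [field, ...]} for every record found in js_text."""
    records = {}
    for index, args in RECORD_RE.findall(js_text):
        # The arguments are double-quoted strings, so wrapping them in
        # brackets yields a JSON array (assumes no JS-only escapes).
        records[int(index)] = json.loads("[" + args + "]")
    return records


records = parse_records(SAMPLE_JS)
print(sorted(records))
print(records[18][0], records[18][3])
```

Keying the dictionary by the original `p[n]` index keeps each record addressable the same way the page's own script addresses it.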
    hiboshi · #3 · 2015-06-17 17:15:56 +08:00
    If it's in JS, that makes it even simpler: just run a regex directly over the JS.
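One way to run the regex "directly over the JS" is to first pick out the inline script that actually holds the records, rather than assuming its position in the page. The sketch below uses only the standard library; the page skeleton and the helper name `find_data_script` are hypothetical, and the `new e (` marker is taken from the records quoted in the addendum.

```python
import re

# Captures the body of each <script> element (inline or empty).
SCRIPT_RE = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL | re.IGNORECASE)


def find_data_script(html, marker="new e ("):
    """Return the body of the first <script> containing marker, or None."""
    for body in SCRIPT_RE.findall(html):
        if marker in body:
            return body
    return None


# Hypothetical page skeleton, for illustration only.
html = '''<html><head>
<script src="lib.js"></script>
<script>
var p = new Array();
p[0] = new e ("EXAMPLE CO","","","CN");
</script>
</head></html>'''

script = find_data_script(html)
print(script.count("new e ("))
```

Searching by marker means the extraction keeps working if the site adds or reorders other script tags.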
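    fangjinmin · #4 · 2015-06-17 17:54:38 +08:00
    import ast
    import csv
    import re
    import urllib.request
    from bs4 import BeautifulSoup

    url = "http://www.exporivaschuh.it/catalogue/15ES2/search.html"
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
    # The exhibitor records are taken from the first <script> block on the page.
    script = soup.find_all("script")[0].string
    # Capture the argument list of every `new e (...)` call, one per line.
    pattern = re.compile(r'new e \((.*)\)')
    with open("companysofchina.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes names that contain commas, e.g. "CO., LTD."
        for args in pattern.findall(script):
            # Parse the quoted arguments as a list literal instead of calling eval().
            items = ast.literal_eval("[" + args + "]")
            if items[3] == "CN":  # field 3 is the country code
                # Fields 0, 9 and 10 are company name, phone and email.
                writer.writerow([items[0], items[9], items[10]])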
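The script in this reply writes name, phone, and email, while the original post also lists the website (for example www.aiminer.net). A hedged sketch of the filtering step with that extra column follows; the field positions are inferred from the sample records in the addendum, the rows and e-mail values are placeholders, and `chinese_exhibitors` is a hypothetical helper name.

```python
import csv
import io

# Field positions inferred from the sample records in the thread
# (an assumption; the site's script does not name them).
NAME, COUNTRY, PHONE, EMAIL, WEBSITE = 0, 3, 9, 10, 11


def chinese_exhibitors(rows):
    """Yield (name, phone, email, website) for rows whose country is CN."""
    for row in rows:
        # Skip truncated rows that do not reach the website field.
        if len(row) > WEBSITE and row[COUNTRY] == "CN":
            yield row[NAME], row[PHONE], row[EMAIL], row[WEBSITE]


# Hypothetical parsed rows; e-mail values are placeholders.
rows = [
    ["AIMINER LEATHER PRODUCTS CO., LTD.", "", "", "CN", "", "", "", "CHENGDU",
     "", "+86 28 85017357", "placeholder@example.com", "www.aiminer.net"],
    ["ALLROUNDER SARL", "", "", "FR", "", "", "", "SARREBOURG",
     "", "+33 387 233000", "placeholder@example.com", "www.allrounder.com"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(chinese_exhibitors(rows))
csv_text = buffer.getvalue()
print(csv_text)
```

Writing through `csv.writer` matters here because company names such as "CO., LTD." contain commas, which would otherwise split into extra columns.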
    redhatping (OP) · #5 · 2015-06-17 18:19:36 +08:00
    @fangjinmin That's right, it works now. Thank you very much.
    aeshfawre · #6 · 2016-07-09 06:39:47 +08:00
    @redhatping I have sent you an email.