A question about scrapy's JSON output

  brucebot · 2013-12-26 14:33:56 +08:00 · 6405 views
    I am using scrapy to scrape the titles and links of industrial-robot videos on YouTube, and I want to write them to a JSON file.
    Here is my code:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy import log
    from youtube.items import YoutubeItem

    class YoutubeSpider(CrawlSpider):
        name = "youtube"
        allowed_domains = ["youtube.com"]
        start_urls = ['http://www.youtube.com/results?search_query=industrial+robot+Assembling&page=%d' %n for n in range (1,2)]
        rules = ()
        def parse(self,response):
            print "Start scrapping youtube videos info..."
            hxs=HtmlXPathSelector(response)
            bases = hxs.select('//*[@id="results"]//*[@id="search-results"]')
            items=[]
            for base in bases:
                item = YoutubeItem()
                t_title=base.select('//*[@id="search-results"]/li/div/h3/a//text()').extract()
                item['title']=map(lambda s: s.strip(), t_title)
                item['linkID'] = base.select('//*[@id="search-results"]/li/div/h3/a/@href').extract()
                #t_desc=base.select('//*[@id="search-results"]/li/div[2]/div[2]/text()')
                #t_desc="".join(base.select('//*[@id="search-results"]/li/div[2]/div[2]/text()').extract_unquoted())
                #item['description']=t_desc
                #item['thumbnail'] = base.select('//*[@id="search-results"]/li/div/a//img/@src').extract()
                items.append(item)
            return(items)

    But the output is:
    {'linkID': [u'/watch?v=iFKbpbe_9pw',
    u'/watch?v=Fnlzl6sBOsA',
    u'/watch?v=QbrqeJRy0hY',
    u'/watch?v=u6-d5VkOB3I',
    u'/watch?v=9--qNRr1VZI',
    u'/watch?v=89prwGUZjM0',
    u'/watch?v=txahbz9eswk',
    u'/watch?v=52ptIgooZ64',
    u'/watch?v=goNOPztC_qE',
    u'/watch?v=daH5Xs11uQc',
    u'/watch?v=V2V3Cu0nWvg',
    u'/watch?v=TQwN-YeWXfs',
    u'/watch?v=aWDAG3fz-ec',
    u'/watch?v=Xmn06cpqngs',
    u'/watch?v=iuaAEDrrVyg',
    u'/watch?v=TG4yzjV4d8w&list=PLECC02EA2EAE0E159',
    u'/watch?v=GCCW9O7IKhY',
    u'/watch?v=O8HwEXDLug8',
    u'/watch?v=yYCHUT79tFM',
    u'/watch?v=82w_r2D1Ooo'],
    'title': [u'Assembly Line Robot Arms on How Do They Do It',
    u'Engine Assembly Robots - FANUC Robot Industrial Automation',
    u'LR Mate 200iC USB Memory Stick Assembly Robot - FANUC Robot Industrial Automation',
    u'R-2000iA Automotive Assembly Robots - FANUC Robotics Industrial Automation',
    u'M-3iA Flexible Solar Collector Assembly Robot - FANUC Robotics Industrial Automation',
    u'LR Mate 200iB Gas Can Assembly Robot - FANUC Robotics Industrial Automation',
    u'ABB Robotics - Assembly of electrical sockets',
    u'M-1iA Circuit Board Assembly Robots - FANUC Robotics Industrial Automation',
    u'ABB Robotics - Assembly of digital camera',
    u'M-1iA LED Lens Assembly Robots - FANUC Robotics Industrial Automation',
    u'LR Mate Small Piston Engine Assembly Robots - FANUC Robotics Industrial Automation',
    u'M-1iA Keyboard Assembly Robots - FANUC Robotics Industrial Automation',
    u'M-1iA Ball Bearing Assembly Robot - FANUC Robotics Industrial Automation',
    u'M-1iA Intelligent Gear Assembly Robot - FANUC Robotics Industrial Automation',
    u'LR Mate 200iC Small Part Assembly Robots - FANUC Robotics Industrial Automation',
    u'Assembly Robots - FANUC Robotics Application Videos',
    u'M-1iA/LR Mate 200iC Solar Panel Assembly Robots - FANUC Robotics Industrial Automation',
    u'ABB Robotics - Assembly',
    u'Yaskawa Motoman SDA10 Robot Assembly Video',
    u'Toyota Camry Hybrid Factory Robots']}


    What I want is for each linkID to be paired one-to-one with its title. What is going wrong here?
    11 replies
    greatghoul
        1
    greatghoul  
       2013-12-26 14:44:02 +08:00
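    I think it would be better to put the code in a gist and post the link here.
    YouTube should have an API, right? Have you considered getting this done without scraping?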
    brucebot
        2
    brucebot  
    OP
       2013-12-26 14:50:26 +08:00
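    @greatghoul I wanted to edit the post, but the edit window seems to have passed. I am also using scrapy as a learning exercise.

    The code is here: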
    https://gist.github.com/brucebot/734ddc9469d3970fdc02
    brucebot
        3
    brucebot  
    OP
       2013-12-26 14:50:45 +08:00
    734ddc9469d3970fdc02
    brucebot
        4
    brucebot  
    OP
       2013-12-26 14:51:43 +08:00
    muzuiget
        5
    muzuiget  
       2013-12-26 14:54:25 +08:00
    Just stitch the two lists together with zip.
    brucebot
        6
    brucebot  
    OP
       2013-12-26 14:58:55 +08:00
    @greatghoul @livid
    My mistake, I had created a secret gist.

    https://gist.github.com/brucebot/8130663
    brucebot
        7
    brucebot  
    OP
       2013-12-26 14:59:31 +08:00   ❤️ 1
    @muzuiget Stitch them back together? A similar example gives normal output, which is what puzzles me.
    muzuiget
        8
    muzuiget  
       2013-12-26 20:47:40 +08:00   ❤️ 1
    @brucebot Like this: zip(result['linkID'], result['title'])

    In your parse(), items is a list, but what comes back is a dict, so it must have been converted a second time somewhere.
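
    As a rough illustration of the zip suggestion, here is a minimal sketch in plain Python 3. The `result` dict is a shortened copy of the output posted above, and `pair_up` is a hypothetical helper name, not part of scrapy. It only pairs values by position, so it assumes both lists have the same length and order.

```python
# Shortened copy of the dict from the question (three entries only).
result = {
    "linkID": [
        "/watch?v=iFKbpbe_9pw",
        "/watch?v=Fnlzl6sBOsA",
        "/watch?v=QbrqeJRy0hY",
    ],
    "title": [
        "Assembly Line Robot Arms on How Do They Do It",
        "Engine Assembly Robots - FANUC Robot Industrial Automation",
        "LR Mate 200iC USB Memory Stick Assembly Robot - FANUC Robot Industrial Automation",
    ],
}


def pair_up(result):
    """Return one {'linkID', 'title'} dict per video, paired by position."""
    return [
        {"linkID": link, "title": title}
        for link, title in zip(result["linkID"], result["title"])
    ]


pairs = pair_up(result)
for pair in pairs:
    print(pair["linkID"], "->", pair["title"])
```

    Note that zip stops at the shorter list, so if a link ever yields more than one text node the pairs would silently drift out of step. Selecting per result row avoids that risk.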
    rayind
        9
    rayind  
       2013-12-27 11:20:13 +08:00   ❤️ 1
    The XPath selection part is wrong.
    These are the lines to change:
    bases = hxs.select('//*[@id="results"]//*[@id="search-results"]/*')

    t_title=base.select('div/h3/a//text()').extract()

    item['linkID'] = base.select('div/h3/a/@href').extract()
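
    The reason this works: an XPath that starts with `//` is evaluated from the document root even when it is called on a sub-selector, so the original inner queries returned every title and every link on the page, and `bases` matched only the single container. Selecting the rows with `/*` and then using relative paths yields one item per row. Below is a minimal sketch of the same idea using the standard library's `xml.etree.ElementTree` rather than scrapy's selector. The `PAGE` markup is a hypothetical, trimmed-down stand-in for the real search page, and `extract_videos` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup in the shape the XPaths in this thread assume.
PAGE = """
<div id="results">
  <ol id="search-results">
    <li><div><h3><a href="/watch?v=iFKbpbe_9pw"> Assembly Line Robot Arms on How Do They Do It </a></h3></div></li>
    <li><div><h3><a href="/watch?v=Fnlzl6sBOsA"> Engine Assembly Robots </a></h3></div></li>
  </ol>
</div>
"""


def extract_videos(markup):
    """Return one {'title', 'linkID'} dict per result row."""
    root = ET.fromstring(markup)
    # One element per result row: the children of #search-results.
    rows = root.findall(".//*[@id='search-results']/*")
    videos = []
    for row in rows:
        # Relative path: searched inside this row only, not the whole page.
        link = row.find("div/h3/a")
        if link is None:
            continue  # skip rows that do not contain a video link
        title = "".join(link.itertext()).strip()
        videos.append({"title": title, "linkID": link.get("href")})
    return videos


videos = extract_videos(PAGE)
for video in videos:
    print(video["linkID"], "->", video["title"])
```

    ElementTree supports only a subset of XPath and rejects absolute paths on an element, so this sketch shows only the relative form. In scrapy the equivalent calls are the `select` lines quoted above.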
    brucebot
        10
    brucebot  
    OP
       2013-12-27 13:42:59 +08:00
    @rayind Thank you very much, the output is finally correct.
    brucebot
        11
    brucebot  
    OP
       2013-12-27 13:43:48 +08:00
    @muzuiget Thank you all the same. I am not very familiar with that approach, but I got it working with @rayind's method.