How can a single spider-flow crawler crawl several pages of the same type in sequence?

  •   tiRolin · 2023-08-07 16:00:44 +08:00 · 896 views

    I need to use the spider-flow framework to crawl the content of the following three pages:
    https://price.21food.cn/product/939.html
    https://price.21food.cn/product/1505.html
    https://price.21food.cn/product/196.html

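    I have already built a working crawler for one of these pages. The three pages differ only in their data, so all of them could be handled by a single crawler. When I did this with Selenium, I simply built a collection of URLs and iterated over it with a for loop, but I am struggling to do the same thing in spider-flow.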

    My idea is to define a URL list first and then crawl inside a loop, so I built the flow shown below.

    [Screenshot: the spider-flow flow diagram]

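    The first node defines a variable, urlList, holding the three addresses: ["https://price.21food.cn/product/939.html","https://price.21food.cn/product/1505.html","https://price.21food.cn/product/196.html"]

    The second node is a loop. It defines an index variable, urlIndex, and its loop count is the length of urlList, written as ${list.length(urlList)}.

    The third node defines a variable, url, with the value ${urlList[urlIndex]}, which picks the current URL out of the list.

    The fourth node, "开始抓取" (start crawling), uses that variable as its request URL, with the value ${url}.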
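
For comparison, the control flow these four nodes are meant to express can be sketched in plain Java. This is not spider-flow code and makes no claim about how spider-flow evaluates its nodes. The class name `UrlLoopSketch`, the method `crawlAll`, and the injected `fetcher` function are assumptions introduced only for illustration; the fetcher stands in for the real request so the loop can be tried without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java mirror of the four nodes described above (hypothetical names, not spider-flow code). */
class UrlLoopSketch {

    /**
     * Visits every URL by index and collects whatever the fetcher returns.
     * The fetcher is injected so the loop can run without real HTTP requests.
     */
    static List<String> crawlAll(List<String> urlList, Function<String, String> fetcher) {
        List<String> pages = new ArrayList<>();
        // Node 2: loop with index variable urlIndex, count = size of urlList
        for (int urlIndex = 0; urlIndex < urlList.size(); urlIndex++) {
            // Node 3: url = urlList[urlIndex]
            String url = urlList.get(urlIndex);
            // Node 4: request the page at url
            pages.add(fetcher.apply(url));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Node 1: the URL collection from the post
        List<String> urlList = List.of(
                "https://price.21food.cn/product/939.html",
                "https://price.21food.cn/product/1505.html",
                "https://price.21food.cn/product/196.html");
        // Stand-in fetcher (assumption): echoes the URL instead of doing HTTP
        List<String> pages = crawlAll(urlList, url -> "fetched " + url);
        pages.forEach(System.out::println);
    }
}
```

The point of the sketch is that `urlList` must be a real list for the index lookup and the size call to work, which is the behaviour the flow relies on as well.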

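    Everything after that is the data-extraction logic, which works on its own; I have already tested it. The flow looks correct to me, but when I actually run it, execution ends right after the first variable-definition node.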

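    I have searched through many tutorials online, but I could not find any tutorial or example covering this requirement, and for some reason I cannot open the official documentation. I have run out of ideas, so I am asking here. If anyone knows how to do this, I would be grateful for any advice. Thanks in advance.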

    The spider-flow repository on Gitee: https://gitee.com/ssssssss-team/spider-flow

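    To run it, download the project and open it in IntelliJ IDEA, execute the db.sql script shipped with the project against your database, and set the database address in the configuration file. The default address is localhost:8088.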

    Below is the crawler I built. To run it, paste it into spider-flow using the "XML edit" option.

    <mxGraphModel>
      <root>
        <mxCell id="0">
          <JsonProperty as="data">
            {&quot;spiderName&quot;:&quot;食品商务网爬虫(未整合多个网址)&quot;,&quot;submit-strategy&quot;:&quot;random&quot;,&quot;threadCount&quot;:&quot;&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="1" parent="0"/>
        <mxCell id="2" value="开始" style="start" parent="1" vertex="1">
          <mxGeometry x="300" y="80" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;shape&quot;:&quot;start&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="3" value="开始抓取" style="request" parent="1" vertex="1">
          <mxGeometry x="490" y="80" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;开始抓取&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;method&quot;:&quot;GET&quot;,&quot;sleep&quot;:&quot;&quot;,&quot;timeout&quot;:&quot;&quot;,&quot;response-charset&quot;:&quot;&quot;,&quot;retryCount&quot;:&quot;&quot;,&quot;retryInterval&quot;:&quot;&quot;,&quot;body-type&quot;:&quot;none&quot;,&quot;body-content-type&quot;:&quot;text/plain&quot;,&quot;loopCount&quot;:&quot;&quot;,&quot;url&quot;:&quot;${url}&quot;,&quot;proxy&quot;:&quot;&quot;,&quot;request-body&quot;:&quot;&quot;,&quot;follow-redirect&quot;:&quot;1&quot;,&quot;tls-validate&quot;:&quot;1&quot;,&quot;cookie-auto-set&quot;:&quot;1&quot;,&quot;repeat-enable&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;request&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="4" value="定义变量" style="variable" parent="1" vertex="1">
          <mxGeometry x="620" y="80" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;dataList&quot;],&quot;variable-description&quot;:[&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${extract.xpaths(resp.html,&#39;/html/body/div[2]/div[3]/div/div[2]/div[1]/div[2]/div[2]/ul/li&#39;)}&quot;],&quot;shape&quot;:&quot;variable&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="9" value="" style="strokeWidth=2;sharp=1;" parent="1" source="3" target="4" edge="1">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="11" value="循环" style="loop" parent="1" vertex="1">
          <mxGeometry x="620" y="170" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;循环&quot;,&quot;loopItem&quot;:&quot;&quot;,&quot;loopVariableName&quot;:&quot;index&quot;,&quot;loopCount&quot;:&quot;${list.length(dataList)}&quot;,&quot;loopStart&quot;:&quot;0&quot;,&quot;loopEnd&quot;:&quot;-1&quot;,&quot;shape&quot;:&quot;loop&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="12" value="" style="strokeWidth=2;sharp=1;" parent="1" source="4" target="11" edge="1">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="13" value="输出" style="output" parent="1" vertex="1">
          <mxGeometry x="790" y="334" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;输出&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;tableName&quot;:&quot;&quot;,&quot;csvName&quot;:&quot;&quot;,&quot;csvEncoding&quot;:&quot;GBK&quot;,&quot;output-name&quot;:[&quot;产品名&quot;,&quot;市场&quot;,&quot;规格&quot;,&quot;最高价格&quot;,&quot;平均价格&quot;,&quot;最低价格&quot;,&quot;日期&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;output-value&quot;:[&quot;${name}&quot;,&quot;${market}&quot;,&quot;${specifications}&quot;,&quot;${top}&quot;,&quot;${avg}&quot;,&quot;${low}&quot;,&quot;${dataDate}&quot;],&quot;output-all&quot;:&quot;0&quot;,&quot;output-database&quot;:&quot;0&quot;,&quot;output-csv&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;output&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="15" value="定义变量" style="variable" parent="1" vertex="1">
          <mxGeometry x="620" y="250" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;name&quot;,&quot;market&quot;,&quot;specifications&quot;,&quot;top&quot;,&quot;avg&quot;,&quot;low&quot;,&quot;dataDate&quot;],&quot;variable-description&quot;:[&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${dataList[index].selectors(&#39;table tbody tr td a&#39;)[0].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td a&#39;)[1].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[0].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[1].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[3].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[2].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[4].text()}&quot;],&quot;shape&quot;:&quot;variable&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="16" value="" style="strokeWidth=2;sharp=1;" parent="1" source="11" target="15" edge="1">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="18" value="" style="strokeWidth=2;sharp=1;" parent="1" source="15" target="13" edge="1">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="27" value="定义变量" style="variable" parent="1" vertex="1">
          <mxGeometry x="90" y="440" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;urlList&quot;],&quot;variable-description&quot;:[&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;[\&quot;https://price.21food.cn/product/939.html\&quot;,\&quot;https://price.21food.cn/product/1505.html\&quot;,\&quot;https://price.21food.cn/product/196.html\&quot;]&quot;],&quot;shape&quot;:&quot;variable&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="29" value="循环" style="loop" parent="1" vertex="1">
          <mxGeometry x="180" y="440" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;循环&quot;,&quot;loopItem&quot;:&quot;&quot;,&quot;loopVariableName&quot;:&quot;urlIndex&quot;,&quot;loopCount&quot;:&quot;${list.length(urlList)}&quot;,&quot;loopStart&quot;:&quot;0&quot;,&quot;loopEnd&quot;:&quot;-1&quot;,&quot;shape&quot;:&quot;loop&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="31" value="定义变量" style="variable" parent="1" vertex="1">
          <mxGeometry x="262" y="440" width="32" height="32" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;url&quot;],&quot;variable-description&quot;:[&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${urlList[urlIndex]}&quot;],&quot;shape&quot;:&quot;variable&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="42" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="27" target="29">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="43" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="29" target="31">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="44" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="2" target="27">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
        <mxCell id="45" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="31" target="3">
          <mxGeometry relative="1" as="geometry"/>
          <JsonProperty as="data">
            {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
          </JsonProperty>
        </mxCell>
      </root>
    </mxGraphModel>
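
Since the wiring is hard to read from raw XML, a small helper can list the edges of such a flow. The following is a minimal sketch using only the JDK's built-in XML parser. The class name `FlowEdgeLister` and the method `listEdges` are hypothetical, and the sample document in `main` is a trimmed stand-in rather than the full flow above. It assumes only what the XML itself shows: vertices carry `vertex="1"` and a `style`, and edges carry `edge="1"`, `source`, and `target`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Lists the edges of an mxGraphModel document (hypothetical helper, JDK only). */
class FlowEdgeLister {

    /** Returns one "sourceId(style) -> targetId(style)" string per edge cell, in document order. */
    static List<String> listEdges(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse flow XML", e);
        }
        NodeList cells = doc.getElementsByTagName("mxCell");

        // First pass: remember the style of every vertex (start, request, variable, loop, output)
        Map<String, String> styleById = new HashMap<>();
        for (int i = 0; i < cells.getLength(); i++) {
            Element cell = (Element) cells.item(i);
            if ("1".equals(cell.getAttribute("vertex"))) {
                styleById.put(cell.getAttribute("id"), cell.getAttribute("style"));
            }
        }

        // Second pass: describe every edge by its source and target vertex
        List<String> edges = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            Element cell = (Element) cells.item(i);
            if ("1".equals(cell.getAttribute("edge"))) {
                String source = cell.getAttribute("source");
                String target = cell.getAttribute("target");
                edges.add(source + "(" + styleById.get(source) + ") -> "
                        + target + "(" + styleById.get(target) + ")");
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Trimmed stand-in for the flow above: the start node wired to the urlList variable node
        String xml = "<mxGraphModel><root>"
                + "<mxCell id=\"2\" style=\"start\" vertex=\"1\"/>"
                + "<mxCell id=\"27\" style=\"variable\" vertex=\"1\"/>"
                + "<mxCell id=\"44\" edge=\"1\" source=\"2\" target=\"27\"/>"
                + "</root></mxGraphModel>";
        listEdges(xml).forEach(System.out::println);
    }
}
```

Read directly from the XML above, the edges chain the nodes as 2 → 27 → 29 → 31 → 3 → 4 → 11 → 15 → 13, that is: start, urlList, outer loop, url, request, dataList, inner loop, field variables, output.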
    
    
    #1 · tiRolin (OP) · 2023-08-07 16:39:35 +08:00
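    One more question: how do you simulate a click in this framework? In the examples I have seen, a new page is opened by extracting a URL, concatenating it, and starting a new crawl on the result.
    However, some of the pages I want to crawl have no URL present in the HTML; the browser only navigates to the new address after a click is performed. In code I can do this with Selenium, but how would I do it in spider-flow?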