Code for scraping the comments:
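import logging

import requests

# The original post does not show how logger is created; a module-level logger is assumed here.
logger = logging.getLogger(__name__)


def fetch_status_comments_and_total_number(sid):
    comments_url = "https://m.weibo.cn/comments/hotflow"
    payload = {
        'id': sid,
        'mid': sid,
        'page': 1,  # The first page is returned by default; request it explicitly to be safe
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    raw_response = None
    try:
        raw_response = requests.get(comments_url, params=payload,
                                    headers=headers)
        response = raw_response.json()
    except Exception:
        # raw_response is still None when the request itself failed, so fall back to defaults
        logger.error("Error while requesting comments. Request parameter: {}, URL: {}\nResponse text:\n{}".format(
            sid,
            getattr(raw_response, 'url', comments_url),
            getattr(raw_response, 'text', '')))
        return 0, None
    if response is None or response['ok'] == 0:
        return 0, None
    total_number = response['data']['total_number']
    # make_comments is defined elsewhere in the poster's code
    comments = make_comments(response['data']['data'])
    return total_number, comments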
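Sometimes the request succeeds and sometimes it fails. The exception message recorded in the log is:
Error while requesting comments. Request parameter: 4309813452636887, URL: https://m.weibo.cn/comments/hotflow?id=4309813452636887&mid=4309813452636887&page=1
Response text: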
<!DOCTYPE html>
<html lang="zh">
<head>
<meta charset="UTF-8">
<link rel="dns-prefetch" href="https://h5.sinaimg.cn">
<meta id="viewport" name="viewport"
content="width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0">
<meta name="format-detection" content="telephone=no">
<title>微博-出错了</title>
<style>
html {
font-size: 2rem;
}
@media (max-width: 1024px) {
html {
font-size: 1.25rem;
}
}
@media (max-width: 414px) {
html {
font-size: 1.06rem;
}
}
@media (max-width: 375px) {
html {
font-size: 1rem;
}
}
body {
margin: 0;
padding: 0;
background-color: #f2f2f2;
}
p {
margin: 0;
}
.h5-4box {
padding-top: 6.125rem;
text-align: center;
}
.h5-4img {
display: inline-block;
}
.h5-4img img {
max-width: 100%;
}
.h5-4con {
padding-top: 1.875rem;
font-size: 0.875rem;
line-height: 1.2;
color: #636363;
text-align: center;
}
.btn {
display: inline-block;
border: #e86b0f solid 1px;
margin: 0 0 0 5px;
padding: 0 10px;
line-height: 25px;
font-size: .75rem;
vertical-align: middle;
color: #FFF;
border-radius: 3px;
background-color: #ff8200;
}
</style>
</head>
<body>
<div class="h5-4box">
<span class="h5-4img">
<img src="//h5.sinaimg.cn/upload/2016/04/11/319/h5-404.png">
</span>
<p class="h5-4con">认证失败</p>
<br/>
</div>
</body>
</html>
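The failing response above is an HTML page rather than JSON: its title is "微博-出错了" ("Weibo: something went wrong") and its body text is "认证失败" ("authentication failed"). A minimal sketch of telling the two cases apart before parsing might look like the following. The function name `classify_response` and the returned labels are hypothetical, and the regular expressions only assume the markup shown in the log above; the error page may look different in other cases.

```python
import json
import re

# Patterns based only on the error page shown in the log above.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
MESSAGE_RE = re.compile(r'<p class="h5-4con">(.*?)</p>', re.DOTALL)


def classify_response(text):
    """Return ("json", data), ("html_error", details) or ("unknown", text)."""
    stripped = text.lstrip()
    if stripped.startswith("{") or stripped.startswith("["):
        try:
            return "json", json.loads(stripped)
        except ValueError:
            # Looked like JSON but did not parse; fall through to the other checks.
            pass
    lowered = stripped.lower()
    if lowered.startswith("<!doctype html") or lowered.startswith("<html"):
        title = TITLE_RE.search(stripped)
        message = MESSAGE_RE.search(stripped)
        return "html_error", {
            "title": title.group(1).strip() if title else None,
            "message": message.group(1).strip() if message else None,
        }
    return "unknown", text
```

Logging only the extracted title and message, instead of the whole page, would keep the log short while still showing that the server answered with "认证失败".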
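Could this be Weibo's anti-scraping mechanism? I don't know what the mechanism is or whether there is a workaround. Any advice would be appreciated!
Note: opening the request URL above in a browser returns the result normally.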
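Because the failures are intermittent, one thing that could be tried is retrying with a growing delay between attempts. The sketch below is only an illustration: `fetch_with_retry`, `fetch`, and `is_ok` are hypothetical names, and whether a retry actually gets past the "认证失败" page is an assumption, not something confirmed in this thread.

```python
import time


def fetch_with_retry(fetch, is_ok, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until is_ok(result) is true or the attempts run out.

    Returns (result, attempts_used); result is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except Exception:
            # Treat any error (network failure, timeout, ...) as a failed attempt.
            result = None
        if result is not None and is_ok(result):
            return result, attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None, max_attempts
```

With the code from the post, `fetch` could be a small function that returns `requests.get(comments_url, params=payload, headers=headers).text`, and `is_ok` could check that the text parses as JSON instead of being the HTML error page.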
1
t333st 2018-11-24 15:37:02 +08:00
Yes, it's an anti-scraping mechanism...
By the way, do you remember that there was a 3G page version of Weibo?
2
chuanqirenwu OP @t333st Does that version have fewer restrictions? Could you share the address?
3
t333st 2018-11-24 16:32:59 +08:00
@chuanqirenwu I asked you precisely because I can't find it...
4
chwhsen 2018-11-24 17:09:40 +08:00
5
chuanqirenwu OP @chwhsen This should be the same as the m.weibo.cn address above.