How to get the position of every DOM node after a web page is rendered


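A question: for a given DOM node in a web page, how can I get its position after rendering? For example, when rendering at a resolution of 1920x1080, suppose a page's actual rendered size is 1920x3000 pixels. The title's bounding box would have its top-left and bottom-right corners at (200, 200) and (1500, 400), and the body's bounding box would have them at (200, 500) and (1500, 2500).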

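If a page can be rendered headlessly and this information extracted, it might be possible to train a model on it, for example to predict which DOM nodes correspond to the title and the body, or which DOM nodes contribute visible content. Training such a model would probably need a further annotation step: render the page to an image, highlight specific DOM nodes (for example by drawing a box around them), and then label them by hand for the task at hand.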
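One way the highlighting step might look, as a rough sketch only: it assumes a Selenium WebDriver object (anything exposing `execute_script` and `save_screenshot`) and an element already located on the page. The helper name `highlight_and_capture` and its `path` and `color` parameters are made up for illustration. Note that with Chrome, `save_screenshot` captures the current viewport, not the full page height.

```python
def highlight_and_capture(driver, element, path, color='red'):
    """Outline one element, save a screenshot, then restore its original style.

    `driver` is assumed to be a Selenium WebDriver and `element` a WebElement
    that was found through it. Both names are illustrative.
    """
    # Set a CSS outline on the element and remember the previous inline value.
    previous = driver.execute_script(
        "var old = arguments[0].style.outline;"
        "arguments[0].style.outline = arguments[1];"
        "return old;",
        element, '3px solid ' + color)
    try:
        # Returns True on success; the image shows the element with its box.
        return driver.save_screenshot(path)
    finally:
        # Put the old outline back so later screenshots are not affected.
        driver.execute_script(
            "arguments[0].style.outline = arguments[1];", element, previous)
```

An outline is used rather than a border because an outline does not take up layout space, so the highlighted node stays at the position it had before.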

Supplement 1 · Oct 21, 2021

An update: Selenium solves this. With Chromedriver set up locally, the following code does the job:

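from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome headless with a fixed 1920x1080 window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--window-size=1920,1080')

# Current Selenium releases take the options object via `options=`
d = webdriver.Chrome(options=chrome_options)
d.get('http://www.v2ex.com')
# find_elements(By.CLASS_NAME, ...) is the current form of find_elements_by_class_name
elems = d.find_elements(By.CLASS_NAME, 'topic-link')

print(len(elems), 'elements found')

# location is the element's top-left corner in pixels; size is its width and height
for i, elem in enumerate(elems):
    print(i + 1, '/', len(elems), 'location:', elem.location, 'size:', elem.size, 'content:', elem.text)

# Shut down the browser and the Chromedriver process
d.quit()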

The output looks like this:

49 elements found
1 / 49 location: {'x': 490, 'y': 162} size: {'height': 17, 'width': 205} content: WSL 2 拳打 macOS，脚踢 Ubuntu ？
2 / 49 location: {'x': 490, 'y': 233} size: {'height': 17, 'width': 90} content: 想问下有 Pixel 用户吗？
3 / 49 location: {'x': 490, 'y': 304} size: {'height': 17, 'width': 246} content: Windows Subsystem for Android 来了
4 / 49 location: {'x': 490, 'y': 375} size: {'height': 17, 'width': 247} content: Google Authenticator 更新了，之前重复的两步校验消失
...
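The `location` and `size` dictionaries in this output can be turned into the top-left and bottom-right corner form used in the original question. The following is a minimal sketch; the helper name `to_corners` is made up for illustration, and it only assumes the dictionary keys shown in the output above.

```python
def to_corners(location, size):
    """Convert location/size dicts into ((left, top), (right, bottom))."""
    left, top = location['x'], location['y']
    right = left + size['width']
    bottom = top + size['height']
    return (left, top), (right, bottom)


# First row of the output above
print(to_corners({'x': 490, 'y': 162}, {'height': 17, 'width': 205}))
# → ((490, 162), (695, 179))
```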

10 replies • 2021-10-19 15:35:31 +08:00