What to do when the page_source from webdriver is still JavaScript?


While scraping NetEase Cloud Music's dynamic pages with scrapy, selenium, and Chrome together, I ran into a question: why is the page_source returned by webdriver full of JavaScript? Shouldn't it be the fully rendered page? What happened to "what you see is what you get"?

Where does the problem lie? The page pulls in several JS files that generate the HTML, and in the process they load the actual content inside an iframe. The outer document that page_source returns is therefore mostly JS. So how do we turn the JS in page_source into HTML?
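You can see this behaviour directly with a few lines of standalone Selenium. A minimal sketch, assuming chromedriver is on your PATH, using the NetEase Cloud Music homepage and the 'g_iframe' frame name that appears in the code later in this post:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://music.163.com')

# The outer document: mostly <script> tags plus an empty iframe shell
print(driver.page_source[:300])

# After switching into the iframe, page_source returns the rendered
# inner document instead of the outer JS loader
driver.switch_to.frame('g_iframe')
print(driver.page_source[:300])

driver.quit()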

First method: although we cannot make the JS inside page_source finish executing, there is a workaround: have webdriver switch into the frame, then select the elements and extract the information with webdriver directly. The drawback is that our middleware loses its flexibility, which means every spider needs its own modifications in the middleware. The code is as follows:

import json

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from netcloudmusic.settings import WEBDRIVER_PATH


def process_request(self, request, spider):
    # Called for each request that goes through the downloader middleware.
    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # headless Chrome mode
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')

    # Path to the chromedriver binary (kept in settings.py)
    self.driver = webdriver.Chrome(chrome_options=chrome_options,
                                   executable_path=WEBDRIVER_PATH)
    self.driver.implicitly_wait(5)
    self.driver.get(request.url)

    # Switch into the iframe that holds the rendered content
    self.driver.switch_to.frame('g_iframe')

    # Grab the song names and ranks (top ten entries)
    nameselect = self.driver.find_elements_by_xpath(r'//*[@id="top-flag"]/dl[1]/dd/ol/li[position()<11]/a')
    rankselect = self.driver.find_elements_by_xpath(r'//*[@id="top-flag"]/dl[1]/dd/ol/li[position()<11]/span')
    namelist = []
    ranklist = []
    for tmpnameselect in nameselect:
        namelist.append(tmpnameselect.text)
    for tmprankselect in rankselect:
        ranklist.append(tmprankselect.text)

    # Grab each song's URL
    urlselect = self.driver.find_elements_by_xpath(r'//*[@id="top-flag"]/dl[1]/dd/ol/li[position()<11]/a')
    urllist = []
    for tmpurlselect in urlselect:
        urllist.append(tmpurlselect.get_attribute('href'))

    # Pack the extracted data as JSON
    datadict = []
    for i in range(len(namelist)):
        datadict.append({'rank': ranklist[i], 'name': namelist[i], 'url': urllist[i]})

    html = json.dumps(datadict)

    self.driver.quit()

    # Hand the data back to the spider as an HtmlResponse
    return scrapy.http.HtmlResponse(url=request.url, body=html.encode('utf-8'),
                                    encoding='utf-8', request=request)
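For Scrapy to actually route requests through this process_request, the middleware class has to be enabled in the project's settings.py. A minimal sketch, assuming the class is named SeleniumMiddleware in the netcloudmusic project's middlewares module (the class name, priority value, and chromedriver path are placeholders, not from the original post):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'netcloudmusic.middlewares.SeleniumMiddleware': 543,
}

# Path to the chromedriver binary, imported by the middleware above
WEBDRIVER_PATH = '/path/to/chromedriver'

With this registered, every request the spider issues is rendered through headless Chrome before the spider's callback ever sees the response.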

Second method: switch into the iframe first, then locate the outermost div and read its innerHTML attribute, which gives you the rendered HTML text under that div. The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from netcloudmusic.settings import WEBDRIVER_PATH
from lxml import etree

url = 'https://music.163.com'

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')

driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=WEBDRIVER_PATH)
driver.implicitly_wait(10)  # wait up to ten seconds for elements to load
driver.get(url)

# Switch into the iframe
driver.switch_to.frame('g_iframe')

# Locate the outermost div and read the HTML rendered inside it
tmpselect = driver.find_element_by_xpath(r'//*[@id="discover-module"]')
html = tmpselect.get_attribute('innerHTML')

Result: get_attribute('innerHTML') returns the fully rendered HTML of the #discover-module section as a plain string.
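The lxml import in the snippet above points at the natural next step: once the innerHTML string is in hand, you can quit the browser and run XPath queries over the fragment with lxml alone. A minimal sketch continuing from the code above (the XPath here is illustrative, not from the original post):

driver.quit()

# etree.HTML parses the fragment and wraps it in <html><body> as needed,
# so ordinary XPath queries work on the result
tree = etree.HTML(html)
for href in tree.xpath('//a/@href')[:10]:
    print(href)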

