Web crawler: handling lazy-loaded images

Handling dynamically loaded data

I. Lazy loading of pictures

  • What is lazy loading of pictures?
    • Case study: scrape image data from the Chinaz material site (站长素材), http://sc.chinaz.com/

      #!/usr/bin/env python
      # -*- coding:utf-8 -*-
      import requests
      from lxml import etree

      if __name__ == "__main__":
          url = 'http://sc.chinaz.com/tupian/gudianmeinvtupian.html'
          headers = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
          # Fetch the page text
          response = requests.get(url=url, headers=headers)
          response.encoding = 'utf-8'
          page_text = response.text
          # Parse the page data (get the image links in the page)
          # Create an etree object
          tree = etree.HTML(page_text)
          div_list = tree.xpath('//div[@id="container"]/div')
          # Extract each image's address and name
          for div in div_list:
              image_url = div.xpath('.//img/@src')
              image_name = div.xpath('.//img/@alt')
              print(image_url)   # print the image link
              print(image_name)  # print the image name


    • The output shows that the image names are retrieved but the links come back empty. The XPath expression has been checked and is correct, so what is the cause?

    • The concept of lazy loading of images:

      • Lazy loading of images is a web page optimization technique. Images are network resources like any other static resource, so requesting them consumes bandwidth, and loading every image on a page at once greatly increases the time it takes to render the first screen. To solve this, the front end and back end cooperate so that an image is loaded only when it enters the browser's current viewport. This technique, which reduces the number of image requests for the first screen, is called "image lazy loading".

    • How do websites generally implement image lazy loading?

      • In the page source, the img tag stores the real image link in a "pseudo-attribute" (usually src2, original…) instead of directly in the src attribute. When the image enters the visible area of the page, the value of the pseudo-attribute is dynamically moved into the src attribute, which completes the loading of the image.

    • Follow-up analysis of the Chinaz case: a closer look at the page structure shows that the image links are stored in the pseudo-attribute src2.

      #!/usr/bin/env python
      # -*- coding:utf-8 -*-
      import requests
      from lxml import etree

      if __name__ == "__main__":
          url = 'http://sc.chinaz.com/tupian/gudianmeinvtupian.html'
          headers = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
          # Fetch the page text
          response = requests.get(url=url, headers=headers)
          response.encoding = 'utf-8'
          page_text = response.text
          # Parse the page data (get the image links in the page)
          # Create an etree object
          tree = etree.HTML(page_text)
          div_list = tree.xpath('//div[@id="container"]/div')
          # Extract each image's address and name
          for div in div_list:
              image_url = div.xpath('.//img/@src2')  # src2 pseudo-attribute
              image_name = div.xpath('.//img/@alt')
              print(image_url)   # print the image link
              print(image_name)  # print the image name
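
The pseudo-attribute lookup described in this section can be generalised a little. What follows is a minimal sketch, assuming a page may use any of several pseudo-attributes: src2 and original come from the text above, while the other names in LAZY_ATTRS, the extract_images helper, and the sample HTML are illustrative assumptions. It uses Python's standard html.parser instead of lxml so that it runs without third-party packages or network access.

```python
from html.parser import HTMLParser

# Attribute names checked in order. "src2" and "original" are mentioned in
# the text; "data-original" and "data-src" are assumed here for illustration.
LAZY_ATTRS = ("src2", "original", "data-original", "data-src")


class ImageCollector(HTMLParser):
    """Collect (alt, url) pairs, preferring lazy-load pseudo-attributes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        url = None
        # Fall back to the plain src attribute only when no pseudo-attribute is set.
        for name in LAZY_ATTRS + ("src",):
            if attr_map.get(name):
                url = attr_map[name]
                break
        self.images.append((attr_map.get("alt", ""), url))


def extract_images(page_text):
    """Return a list of (alt, url) tuples for every img tag in page_text."""
    collector = ImageCollector()
    collector.feed(page_text)
    return collector.images


# Made-up markup imitating a lazy-loaded listing page.
SAMPLE_HTML = """
<div id="container">
  <div><img src2="http://example.com/a.jpg" alt="first"></div>
  <div><img src="http://example.com/b.jpg" alt="second"></div>
  <div><img alt="third"></div>
</div>
"""

if __name__ == "__main__":
    for alt, url in extract_images(SAMPLE_HTML):
        print(alt, url)
```

An image with neither a pseudo-attribute nor src is reported with a url of None, which makes a missing link easy to spot instead of silently yielding an empty list.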


II. Selenium

  • What is Selenium?
    • Selenium is a third-party Python library. The interface it exposes can operate a browser, letting the browser carry out automated operations.

  • Environment setup

    • Install Selenium: pip install selenium

    • Get the driver for your browser (Google Chrome is used as the example)

      • Google Chrome driver download address: http://chromedriver.storage.googleapis.com/index.html

      • The downloaded driver must match the browser version. The version mapping table at http://blog.csdn.net/huilan_same/article/details/51896672 shows which driver goes with which browser version.

Demo:

    from selenium import webdriver
    from time import sleep

    # The argument is the location of your browser driver; the r'' prefix
    # prevents backslashes in the path from being treated as escapes.
    # Note: the find_element(s)_by_* helpers used below are the Selenium 3 API;
    # Selenium 4 uses find_element(By.<STRATEGY>, value) instead.
    driver = webdriver.Chrome(r'driver path')
    # Open the Baidu page with get
    driver.get("http://www.baidu.com")
    # Find the "Settings" (设置) option on the page and click it
    driver.find_elements_by_link_text('设置')[0].click()
    sleep(2)
    # After opening the settings, find the "Search Settings" (搜索设置) option
    driver.find_elements_by_link_text('搜索设置')[0].click()
    sleep(2)
    # Select 50 entries per page
    m = driver.find_element_by_id('nr')
    sleep(2)
    m.find_element_by_xpath('//*[@id="nr"]/option[3]').click()
    m.find_element_by_xpath('.//option[3]').click()
    sleep(2)
    # Click to save the settings
    driver.find_elements_by_class_name("prefpanelgo")[0].click()
    sleep(2)
    # Handle the pop-up alert: accept() confirms, dismiss() cancels
    driver.switch_to_alert().accept()
    sleep(2)
    # Find Baidu's input box and type 美女 ("beauty")
    driver.find_element_by_id('kw').send_keys('美女')
    sleep(2)
    # Click the search button
    driver.find_element_by_id('su').click()
    sleep(2)
    # Find the "美女_百度图片" (Baidu Images) link on the results page and open it
    driver.find_elements_by_link_text('美女_百度图片')[0].click()
    sleep(3)
    # Close the browser
    driver.quit()


Code introduction:

    # Import the package
    from selenium import webdriver
    # Create a browser object; the browser is operated through this object
    browser = webdriver.Chrome('driver path')
    # Use the browser to request the specified URL
    browser.get(url)

    # Use the methods below to find the specified element and operate on it:
    #   find_element_by_id           find a node by id
    #   find_elements_by_name        find nodes by name
    #   find_elements_by_xpath       find nodes by XPath
    #   find_elements_by_tag_name    find nodes by tag name
    #   find_elements_by_class_name  find nodes by class name
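
In recent Selenium 4 releases the find_element_by_* helpers listed above are no longer available, and the same lookups are written as find_element(strategy, value) or find_elements(strategy, value). The correspondence can be sketched as follows. Note that to_locator is a hypothetical helper introduced only for illustration, and the strategy strings are the values of By.ID, By.NAME and so on from selenium.webdriver.common.by, written out literally so the block runs without Selenium installed.

```python
# Legacy helper suffix -> locator strategy string.
# The strings mirror By.ID, By.NAME, By.XPATH, By.TAG_NAME,
# By.CLASS_NAME and By.LINK_TEXT in selenium.webdriver.common.by.
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "tag_name": "tag name",
    "class_name": "class name",
    "link_text": "link text",
}


def to_locator(legacy_method, value):
    """Translate a legacy call such as ('find_elements_by_class_name', 's_btn')
    into a (strategy, value) tuple."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_method.startswith(prefix):
            suffix = legacy_method[len(prefix):]
            return (STRATEGIES[suffix], value)
    raise ValueError(f"not a legacy locator method: {legacy_method}")


if __name__ == "__main__":
    print(to_locator("find_element_by_id", "kw"))
    print(to_locator("find_elements_by_class_name", "s_btn"))
```

With a real driver, driver.find_element(*to_locator("find_element_by_id", "kw")) would then stand in for driver.find_element_by_id("kw").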


III. PhantomJS

  • PhantomJS is a headless (interface-less) browser. Automating it works the same way as automating Google Chrome above. Because it has no interface, PhantomJS provides a screenshot function so that the automated process can still be observed; it is called through save_screenshot.
  • Code demo:
    from selenium import webdriver
    import time

    # PhantomJS path
    path = r'PhantomJS driver path'
    browser = webdriver.PhantomJS(path)
    # Open Baidu
    url = 'http://www.baidu.com/'
    browser.get(url)
    time.sleep(3)
    browser.save_screenshot(r'phantomjs\baidu.png')
    # Find the search input box
    my_input = browser.find_element_by_id('kw')
    # Type text into the box (美女 means "beauty")
    my_input.send_keys('美女')
    time.sleep(3)
    # Screenshot
    browser.save_screenshot(r'phantomjs\meinv.png')
    # Find the search button
    button = browser.find_elements_by_class_name('s_btn')[0]
    button.click()
    time.sleep(3)
    browser.save_screenshot(r'phantomjs\show.png')
    time.sleep(3)
    browser.quit()


  • Key point: Selenium + PhantomJS is the ultimate solution for crawlers. The content of some websites is built by JavaScript that loads it dynamically, so an ordinary crawler cannot retrieve that content. For example, Douban Movies loads more movie entries dynamically as the user scrolls down.

    • Comprehensive exercise: crawl as much movie information from Douban as possible.

      from selenium import webdriver
      import time

      if __name__ == '__main__':
          url = 'https://movie.douban.com/typerank?type_name=%E6%81%90%E6%80%96&type=20&interval_id=100:90&action='
          # Before the page is crawled, it can be made to load more data dynamically
          path = r'C:\Users\Administrator\Desktop\爬虫授课\day05\ziliao\phantomjs-2.1.1-windows\bin\phantomjs.exe'
          # Create the headless browser object
          bro = webdriver.PhantomJS(path)
          # Request the url
          bro.get(url)
          time.sleep(3)
          # Screenshot
          bro.save_screenshot('1.png')
          # Run JS that scrolls the page down to its current bottom
          # (effect: more movie entries are loaded dynamically)
          js = 'window.scrollTo(0,document.body.scrollHeight)'
          bro.execute_script(js)  # execute_script runs JS code given as a string
          time.sleep(2)
          bro.execute_script(js)
          time.sleep(2)
          bro.save_screenshot('2.png')
          time.sleep(2)
          # Grab the content of the current page
          html_source = bro.page_source  # page_source holds the HTML source of the browser's current page
          with open('./source.html', 'w', encoding='utf-8') as fp:
              fp.write(html_source)
          bro.quit()
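
The demo scrolls a fixed two times. When the aim is to load as much as possible, one option is to keep scrolling until the page height stops growing. The sketch below is an assumption about how that could be done, not the original author's code: scroll_until_stable, its pause and max_rounds parameters, and the FakeDriver stand-in (used so the example runs without a browser) are all hypothetical. The only WebDriver call it relies on is execute_script.

```python
import time

SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight)"
HEIGHT_JS = "return document.body.scrollHeight"


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that caused new content to appear.
    """
    loaded = 0
    last_height = driver.execute_script(HEIGHT_JS)
    for _ in range(max_rounds):
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # give the page time to fetch and render more items
        new_height = driver.execute_script(HEIGHT_JS)
        if new_height == last_height:
            break  # nothing new was appended, so stop scrolling
        last_height = new_height
        loaded += 1
    return loaded


class FakeDriver:
    """Hypothetical stand-in for a WebDriver: the page grows three times."""

    def __init__(self):
        self.height = 1000
        self.growths_left = 3

    def execute_script(self, script):
        if script == SCROLL_JS and self.growths_left > 0:
            self.height += 500
            self.growths_left -= 1
            return None
        if script == HEIGHT_JS:
            return self.height
        return None


if __name__ == "__main__":
    print(scroll_until_stable(FakeDriver(), pause=0))
```

With a real driver, the call would sit between bro.get(url) and the read of bro.page_source. The max_rounds cap keeps the loop from running indefinitely on pages that never stop growing.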


IV. Headless Google Chrome

  • Because PhantomJS is no longer updated or maintained, Google's headless browser, an interface-less version of Google Chrome, is recommended instead.
  • Code demo:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time

    # Create an options object that makes Chrome open in headless mode
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    # Driver path
    path = r'C:\Users\ZBLi\Desktop\1801\day05\ziliao\chromedriver.exe'
    # Create the browser object
    browser = webdriver.Chrome(executable_path=path, chrome_options=chrome_options)
    # Open the page
    url = 'http://www.baidu.com/'
    browser.get(url)
    time.sleep(3)
    browser.save_screenshot('baidu.png')
    browser.quit()


    Reposted from: https://www.cnblogs.com/bobo-zhang/p/9685362.html

