Crawler 03: selenium module summary
One: Installation and use of the selenium module
01: Introduction to selenium
What is selenium? Selenium is a third-party Python library whose API controls a real browser, so that the browser carries out operations automatically.
Selenium was originally an automated testing tool. Crawlers use it mainly to solve the problem that requests cannot execute JavaScript code.
In essence, selenium drives the browser to fully simulate a user's actions, such as navigating, typing, clicking, and scrolling down, and then reads the rendered page. It supports multiple browsers.
02: Installation and use of selenium
1. Download the driver
http://npm.taobao.org/mirrors/chromedriver/2.35/
On macOS: move the decompressed chromedriver to the /usr/local/bin directory.
On Windows: download chromedriver.exe and put it in the Scripts directory of the Python installation path. Note that the latest version at the time of writing is 2.38, not 2.9 (version numbers are compared part by part, so 2.38 is newer than 2.9).
2. Install the pip package
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple selenium
ps: The default webdriver supported by selenium3 is Firefox, and driving Firefox requires geckodriver to be installed.
Download link (https://github.com/mozilla/geckodriver/releases)
03: Applicable browsers for selenium
001: Selenium supports many browsers, such as Chrome, Firefox, and Edge, as well as mobile browsers on Android and BlackBerry. In addition,
it supports the headless (no user interface) browser PhantomJS.
002: The syntax used by each browser:
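from selenium import webdriver

# Keep only the line for the browser you want to drive; each call launches that browser
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()  # deprecated in Selenium 3, removed in Selenium 4
browser = webdriver.Safari()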
04: Selenium element positioning
A: webdriver provides a series of element positioning strategies. The commonly used ones are:
id
name
class name
tag name
link text
partial link text
xpath
css selector
B: The corresponding methods in Python webdriver are:
find_element_by_id()
find_element_by_name()
find_element_by_class_name()
find_element_by_tag_name()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_xpath()
find_element_by_css_selector()
ps:
1. find_element_by_xxx returns the first matching element, and find_elements_by_xxx returns all matching elements.
2. Locating the same element by ID, by CSS selector, or by XPath returns exactly the same result.
3. Selenium also provides a general method find_element(), which takes two arguments: the search strategy By and the value. It is the general form of the find_element_by_xxx methods. For example, find_element_by_id(id) is equivalent to find_element(By.ID, id), and the two return exactly the same result. The find_element_by_xxx shortcuts were removed in Selenium 4.3, while the general form works in both Selenium 3 and Selenium 4.
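The difference between the single-match and the all-matches lookup can be sketched with the general find_element() / find_elements() pair. This is a minimal sketch rather than code from the original post: first_and_all is a hypothetical helper name, and it takes the locator as a (strategy, value) tuple such as (By.ID, "kw") so that the same call works for every strategy listed above.

```python
def first_and_all(driver, locator):
    """Return the first element matching `locator` and the list of all matches.

    `locator` is a (strategy, value) tuple, e.g. (By.ID, "kw").
    """
    first = driver.find_element(*locator)    # raises NoSuchElementException if nothing matches
    every = driver.find_elements(*locator)   # returns an empty list if nothing matches
    return first, every


# Typical call with a real browser (not executed here):
#   from selenium import webdriver
#   from selenium.webdriver.common.by import By
#   browser = webdriver.Chrome()
#   browser.get("https://www.baidu.com")
#   box, boxes = first_and_all(browser, (By.ID, "kw"))
```

Passing the locator as a tuple also matches what the expected_conditions helpers take, so one locator can be reused for lookups and for waits.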
05: Selenium node interaction
Selenium can drive the browser to perform actions, that is, the browser can simulate what a user does. The most common ones are send_keys() to type text,
clear() to empty a text field, and click() to click a button.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
browser.get("https://www.taobao.com/")
# Locate the search box and type a query
search_box = browser.find_element(By.ID, "q")
search_box.send_keys("Beauty")
time.sleep(2)
# Clear the box, then type a different query
search_box.clear()
time.sleep(3)
search_box.send_keys("Travel bag")
# Locate the search button and click it
button = browser.find_element(By.CLASS_NAME, "btn-search")
button.click()
06: Selenium action chain
In the example above, each interaction is performed on a specific node: for an input box we call its methods for typing and clearing text,
and for a button we call its click method. Other operations, such as dragging with the mouse or pressing keyboard keys, have no specific target object. These are performed in another way, through an action chain. For example, the following drags a node from one place to another.
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
# The demo is rendered inside an iframe, so switch into it first
browser.switch_to.frame('iframeResult')
source = browser.find_element(By.CSS_SELECTOR, '#draggable')
target = browser.find_element(By.CSS_SELECTOR, '#droppable')
# Queue the whole drag as one chain; nothing runs until perform() is called
action = ActionChains(browser)
action.click_and_hold(source)
action.pause(1)
action.move_to_element(target)
action.pause(1)
action.move_by_offset(xoffset=50, yoffset=0)
action.release()
action.perform()
07: Execute JavaScript
Some operations are not provided by the Selenium API, for example scrolling the page down. In that case, use the execute_script() method to run JavaScript directly in the browser.
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.jd.com/')
# Scroll to the bottom of the page
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
# The alert box pops up after the scroll bar has moved to the bottom
browser.execute_script('alert("123")')
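On pages that load more content as you scroll, a single scrollTo call may not reach the real bottom. The following is a minimal sketch of repeating the scroll with execute_script() until the page height stops growing. The helper name scroll_to_bottom and the pause and max_rounds defaults are assumptions chosen for illustration, not part of Selenium.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing.

    Returns the final document height in pixels.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return last_height
```

With a real browser this would be called as scroll_to_bottom(browser) after browser.get(...). The max_rounds limit keeps the loop from running forever on pages that never stop loading.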
08: Get node information
ps: You can get the source code of the web page through the page_source attribute, and then use a parsing library (such as regular expressions, Beautiful Soup, or pyquery) to extract information.
However, Selenium's node selection methods return WebElement objects, and WebElement has its own methods and attributes for extracting node information such as attributes and text.
In this case, we can extract information directly without parsing the source code.
from selenium import webdriver
from selenium.webdriver.common.by import By  # Search strategy: By.ID, By.CSS_SELECTOR, ...
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://www.amazon.cn/')
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_all_elements_located((By.ID, 'cc-lm-tcgShowImgContainer')))
tag = browser.find_element(By.CSS_SELECTOR, '#cc-lm-tcgShowImgContainer img')
# Get a tag attribute
print(tag.get_attribute('src'))

# Get the tag's id, location, name, and size
print("1>>", tag.id)  # 0.93926541719349-2
print("2>>", tag.location)  # {'x': -242, 'y': 149}
print("3>>", tag.tag_name)  # img
print("4>>", tag.size)  # {'height': 300, 'width': 1500}
browser.close()
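When several nodes are scraped, it can be convenient to collect these fields into one dictionary per node. The following is a minimal sketch under that assumption; element_summary is a hypothetical helper name, and it only uses the WebElement members mentioned above (tag_name, text, location, size, get_attribute()).

```python
def element_summary(element):
    """Collect the commonly used WebElement fields into a plain dict."""
    return {
        "tag": element.tag_name,               # e.g. "img"
        "text": element.text,                  # visible text inside the node
        "src": element.get_attribute("src"),   # None if the attribute is absent
        "location": element.location,          # {"x": ..., "y": ...}
        "size": element.size,                  # {"height": ..., "width": ...}
    }
```

With a real browser this would be called as element_summary(tag) for the tag located in the example above.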
09: Delayed waiting
In Selenium, the get() method returns once the page frame has finished loading. If you read page_source at that point, it may not yet be the fully loaded page.
If the page makes additional Ajax requests, their results may not be in the page source yet. Therefore, it is necessary to wait for a certain period of time to make sure the node has been loaded.
There are two ways to wait: implicit wait and explicit wait.
001: Implicit wait:
With an implicit wait, if Selenium does not find a node in the DOM, it keeps waiting and retrying. Once the set time is exceeded, it throws an exception saying that the node cannot be found.
In other words, when the node being searched for does not appear immediately, the implicit wait keeps looking in the DOM for up to the set time. The default time is 0.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # Keyboard key operations

browser = webdriver.Chrome()
# Implicit wait: when looking up any element that has not loaded yet, wait at most 10s
browser.implicitly_wait(10)
browser.get('https://www.baidu.com')
input_tag = browser.find_element(By.ID, 'kw')
input_tag.send_keys('tom doll')
input_tag.send_keys(Keys.ENTER)

# Without the implicit wait, this lookup would raise an error because the results have not loaded yet
contents = browser.find_element(By.ID, 'content_left')
print(contents)

browser.close()
002: Explicit wait
An implicit wait is not very effective in practice, because it only sets one fixed time limit, while page load time depends on network conditions. The more suitable approach is an explicit wait, which specifies the node to look for together with a maximum waiting time. If the node is loaded within that time, it is returned; if it is still not loaded when the time runs out, a timeout exception is thrown.
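A minimal sketch of an explicit wait, reusing the WebDriverWait and expected_conditions imports from the node-information example. The helper name wait_for_element is hypothetical, and the content_left id in the usage comment is taken from the implicit-wait example. WebDriverWait polls the condition until it returns a truthy value or the timeout expires, at which point it raises TimeoutException.

```python
def wait_for_element(driver, locator, timeout=10):
    """Block until the element described by `locator` is present in the DOM.

    `locator` is a (strategy, value) tuple, e.g. (By.ID, "content_left").
    Returns the element, or raises selenium's TimeoutException after `timeout` seconds.
    """
    # Imported inside the function so that defining the helper has no side effects.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    return wait.until(EC.presence_of_element_located(locator))


# Typical call with a real browser (not executed here):
#   from selenium import webdriver
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.common.keys import Keys
#   browser = webdriver.Chrome()
#   browser.get("https://www.baidu.com")
#   browser.find_element(By.ID, "kw").send_keys("tom doll", Keys.ENTER)
#   results = wait_for_element(browser, (By.ID, "content_left"))
```

Unlike the implicit wait, the timeout here applies only to this one lookup, so different nodes can be given different limits.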