07 Crawler-related notes

One. HTTP/HTTPS-related knowledge

1. HTTP and HTTPS

1) HTTP (HyperText Transfer Protocol) is the protocol used to publish and receive HTML pages.

2) HTTPS (HyperText Transfer Protocol over Secure Socket Layer) is simply the secure version of HTTP: it adds an SSL layer underneath HTTP.

3) SSL (Secure Sockets Layer) is mainly used as a secure transmission protocol for the Web. It encrypts the network connection at the transport layer to keep data transmitted over the Internet safe.

2. GET and POST

1) GET retrieves data from the server; POST submits data to the server.

2) The parameters of a GET request are visible: they appear in the browser's address bar, and the HTTP server generates the response according to the parameters in the URL of the request. In other words, the parameters of a GET request are part of the URL, for example: http://www.baidu.com/s?wd=Chinese

3) The parameters of a POST request are carried in the request body; the message length is not limited and the data is sent implicitly. POST is usually used to submit relatively large amounts of data to the HTTP server (for example, requests with many parameters, or file uploads). The "Content-Type" header of the request indicates the media type and encoding of the message body.

Note: Avoid submitting forms with the GET method, because it can cause security problems. For example, if a login form uses GET, the user name and password entered by the user are exposed in the address bar.
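A small illustration of the difference, assuming the public httpbin.org echo service as a test endpoint (any echo endpoint would do):

import requests

# GET: the parameters end up in the URL itself
r = requests.get("http://httpbin.org/get", params={"wd": "Great Wall"})
print(r.url)     # http://httpbin.org/get?wd=Great+Wall

# POST: the parameters travel in the request body, not in the URL
r = requests.post("http://httpbin.org/post", data={"user": "u", "password": "p"})
print(r.url)     # http://httpbin.org/post (no parameters in the URL)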

3. Cookie and Session

1) Cookie: identifies the user through information recorded on the client side.

2) Session: identifies the user through information recorded on the server side.
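With Requests, the server-side session is tied to cookies that the client sends back; a requests.Session object keeps those cookies between calls. A minimal sketch, with a placeholder login URL and form fields:

import requests

session = requests.Session()

# cookies set by this response (e.g. a session id) are stored on the session
session.post("http://example.com/login", data={"user": "u", "password": "p"})

# later requests automatically carry those cookies,
# so the server can recognise the same user
profile = session.get("http://example.com/profile")
print(profile.status_code)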

4. Response status code

[The original post showed a table of common HTTP response status codes here.]
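The status code is available directly on the response object; a small example:

import requests

response = requests.get("http://www.baidu.com/")
print(response.status_code)   # e.g. 200 (OK), 404 (Not Found), 500 (server error)

# raise_for_status() raises an HTTPError for 4xx/5xx responses
response.raise_for_status()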

Two. Using Requests

1. The most basic GET request can use the get method directly

response = requests.get("http://www.baidu.com/")

You can also write it like this:

response = requests.request("get", "http://www.baidu.com/")

2. Add headers and Query parameters

If you want to add headers, pass the headers parameter to supply the request headers. If you want to pass parameters in the URL, use the params parameter.

import requests

kw = {'wd': 'Great Wall'}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# params accepts a dictionary or a string of query parameters;
# a dictionary is automatically URL-encoded, so urlencode() is not needed
response = requests.get("http://www.baidu.com/s?", params = kw, headers = headers)

# View the response body; response.text returns Unicode (str) data
print(response.text)

# View the response body as bytes; response.content returns the raw byte stream
print(response.content)

# View the full URL
print(response.url)

# View the response encoding
print(response.encoding)

# View the response status code
print(response.status_code)

3. The most basic POST method

response = requests.post("http://www.baidu.com/", data = data)

import requests

formdata = {
    "type": "AUTO",
    "i": "i love python",
    "doctype": "json",
    "xmlVersion": "1.8",
    "keyfrom": "fanyi.web",
    "ue": "UTF-8",
    "action": "FY_BY_ENTER",
    "typoResult": "true"
}

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}

response = requests.post(url, data = formdata, headers = headers)

print(response.text)

# If the response is JSON, it can be parsed directly
print(response.json())

4. Proxies

import requests

# Choose the proxy that matches the protocol type
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "http://12.34.56.79:9527",
}

response = requests.get("http://www.baidu.com", proxies = proxies)
print(response.text)

5. Handling SSL certificate verification for HTTPS requests

1) To verify a host's SSL certificate, use the verify parameter (or simply omit it, since verification is on by default).

import requests

response = requests.get("https://www.baidu.com/", verify=True)

# The verify argument can also be omitted:
# response = requests.get("https://www.baidu.com/")

print(response.text)

2) If you want to skip SSL certificate verification, set verify to False and the request will proceed normally.

r = requests.get("https://www.12306.cn/mormhweb/", verify = False)
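With verify=False, Requests normally prints an InsecureRequestWarning for every call. If you want to silence it, one option (relying on the urllib3 package that Requests itself uses) is:

import requests
import urllib3

# suppress the warning that Requests/urllib3 emits when
# certificate verification is disabled
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

r = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print(r.status_code)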

Three. Free proxy IPs

1) Xici (xicidaili) free proxy IP: https://www.xicidaili.com/

2) Kuaidaili free proxies: https://www.kuaidaili.com/free/inha/

3) Quanwang (whole-net) proxy IP: http://www.goubanjia.com/

If you have enough proxy IPs, you can randomly pick one to visit the website, just as you would randomly pick a User-Agent.

import urllib.request
import random

proxy_list = [
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"}
]

# Randomly choose a proxy
proxy = random.choice(proxy_list)
# Build a proxy handler object with the chosen proxy
httpproxy_handler = urllib.request.ProxyHandler(proxy)

opener = urllib.request.build_opener(httpproxy_handler)

request = urllib.request.Request("http://www.baidu.com/")
response = opener.open(request)
print(response.read())

However, these free, public proxies are used by many people and tend to have short lifespans, slow speeds, low anonymity, and unreliable HTTP/HTTPS support (you get what you pay for).
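The same random-proxy idea can be expressed with Requests instead of urllib; a small sketch, with illustrative proxy addresses that are unlikely to still work:

import random
import requests

# placeholder proxies for illustration; replace with live ones
proxy_list = [
    {"http": "http://124.88.67.81:80"},
    {"http": "http://117.90.4.230:9000"},
]

proxy = random.choice(proxy_list)
try:
    response = requests.get("http://www.baidu.com/", proxies=proxy, timeout=5)
    print(response.status_code)
except requests.exceptions.RequestException as e:
    # free proxies fail often, so be ready to retry with another one
    print("proxy failed:", e)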

Four. XPath

1. Concept

XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in XML documents.

2. Select node

[The original post showed a table of XPath node-selection syntax here.]

3. Predicate

[The original post showed a table of XPath predicate examples here.]

4. Select unknown nodes

[The original post showed a table of XPath wildcards for selecting unknown nodes here.]

5. Select several paths

[The original post showed a table of XPath path-union examples here.]

6. XPath operator

Operators that can be used in XPath

[The original post showed a table of XPath operators here.]
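As a rough substitute for the lost tables, the sketch below runs a few typical expressions with the lxml library introduced in the next section; the sample HTML is made up for illustration. It shows node selection with //, a positional and an attribute predicate, the * wildcard, and a path union with |.

from lxml import etree

html = etree.HTML('''
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first</a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
  </ul>
</div>
''')

print(html.xpath('//li'))                     # all li nodes anywhere in the document
print(html.xpath('//li[1]/a/text()'))         # predicate: the first li, then its link text
print(html.xpath('//li[@class="item-1"]'))    # predicate on an attribute value
print(html.xpath('//li/*'))                   # wildcard: any child element of li
print(html.xpath('//li/@class | //a/@href'))  # union of two paths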

Five. lxml

1. Parse and extract HTML/XML data

lxml is an HTML/XML parser. Its main purpose is to parse HTML/XML and extract data from it.

# lxml_test.py

# use lxml's etree library
from lxml import etree

# note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)

# serialize the HTML document back to a string
result = etree.tostring(html)

print(result)

Output result:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>

lxml can automatically correct HTML: in this example it not only completed the missing closing </li> tag but also added the <body> and <html> tags.

2. Read File

In addition to reading strings directly, lxml also supports reading content from files. Let’s create a new hello.html file:



<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>

Then use the etree.parse() method to read the file.

# lxml_parse.py

from lxml import etree

# read the external file hello.html
html = etree.parse('./hello.html')
result = etree.tostring(html, pretty_print=True)

print(result)

The output is the same as before.

3. Usage

Get the class attribute of every <li> tag:

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')

print(result)

Result:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
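In the same way you can pull out link targets and link text from hello.html; a small follow-up sketch:

from lxml import etree

html = etree.parse('hello.html')

hrefs = html.xpath('//li/a/@href')      # all href attributes of the links
texts = html.xpath('//li/a/text()')     # the visible text directly inside each link

print(hrefs)   # ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
print(texts)   # note: 'third item' is missing because it sits inside a <span>, not directly in the <a>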

Six. Using JSON in a crawler

    1. import json

The json module provides four functions: dumps, dump, loads, and load, which convert between strings and Python data types.

    2. json.loads()

json.loads() decodes a JSON-formatted string into a Python object. The type mapping from JSON to Python is as follows:

[The original post showed the JSON-to-Python type-mapping table here.]

import json

strList = '[1, 2, 3, 4]'

strDict = '{"city": "北京", "name": "大猫"}'

print(json.loads(strList))
# [1, 2, 3, 4]

print(json.loads(strDict))  # decoded strings are already Unicode (str) in Python 3
# {'city': '北京', 'name': '大猫'}

    3. json.dumps()

json.dumps() converts a Python object into a JSON string and returns a str object.

The type mapping from Python to JSON is as follows:

[The original post showed the Python-to-JSON type-mapping table here.]

Note: json.dumps() uses ASCII escaping by default during serialization; pass ensure_ascii=False to disable it and output UTF-8 text instead.

import json

listStr = [1, 2, 3, 4]
tupleStr = (1, 2, 3, 4)
dictStr = {"city": "北京", "name": "大刘"}

print(json.dumps(dictStr, ensure_ascii=False))
# {"city": "北京", "name": "大刘"}

    4. json.dump()

json.dump() serializes a built-in Python type to JSON and writes it to a file.

import json

listStr = [{"city": "北京"}, {"name": "大刘"}]
json.dump(listStr, open("listStr.json", "w"), ensure_ascii=False)

dictStr = {"city": "北京", "name": "大刘"}
json.dump(dictStr, open("dictStr.json", "w"), ensure_ascii=False)

    5.    json.load()

json.load() reads a JSON-formatted string from a file and converts it into a Python type.

# json_load.py
import json

strList = json.load(open("listStr.json"))
print(strList)
# [{'city': '北京'}, {'name': '大刘'}]

strDict = json.load(open("dictStr.json"))
print(strDict)
# {'city': '北京', 'name': '大刘'}

Seven. The general recipe for a crawler

1. Prepare the URLs

1) Prepare a start_url when:

2) the URL pattern is not obvious and the total number of pages is uncertain;

3) the next page's URL has to be extracted in code:

• 3.1 with XPath, or
• 3.2 by finding URLs whose parameters are in the current response (for example, the current page number and the total page count appear in the current response).

4) Prepare a url_list when:

5) the total number of pages is known, and

6) the URL pattern is obvious.

2. Send requests and get responses

1) Add a random User-Agent to counter anti-crawler measures.

2) Add a random proxy IP to counter anti-crawler measures.

3) Once the other side has identified us as a crawler, add more header fields, including cookies.

4) Cookie handling can be solved with a session.

5) Prepare a pool of usable cookies.

6) If not logging in:

6.1 First obtain cookies that let us request the site successfully, i.e. accept the cookies the site sets in its responses.

6.2 On the next request, use cookies from that list.

7) If logging in:

• 7.1 Prepare several accounts.
• 7.2 Use a program to obtain a cookie for each account.
• 7.3 When later requesting pages that are only accessible after login, randomly pick one of those cookies.

3. Extract data

1) Determine where the data is located.

2) If the data is in the response of the current URL:

• When the data to extract is on the list page:
  • request the list page's URL directly; there is no need to enter the detail page.
• When the data to extract is on the detail page:
  • determine the URL,
  • send the request,
  • extract the data,
  • and return it.

A single-threaded sketch of this whole recipe is given below.
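A minimal single-threaded sketch of the recipe; the site URL, the XPath expression and the output file name are placeholders rather than a real target.

# a sketch of the recipe above; URLs, XPath and the output file are placeholders
import json
import random
import requests
from lxml import etree

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
]

def crawl(start_urls):
    for url in start_urls:
        # 1. send the request, get the response (random User-Agent against anti-crawler checks)
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = requests.get(url, headers=headers, timeout=10)
        html = etree.HTML(response.content.decode())

        # 2. extract the data (placeholder XPath)
        items = [{"title": t} for t in html.xpath('//h2/a/text()')]

        # 3. save the data
        with open("items.json", "a", encoding="utf-8") as f:
            for item in items:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # the URL pattern is known here, so the url_list is built up front
    crawl(["http://example.com/page/{}/".format(i) for i in range(1, 6)])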

Eight. A multithreaded crawler

# coding=utf-8
import requests
from lxml import etree
import json
from queue import Queue
import threading


class Qiubai:

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
        # three queues that hold the URLs, the parsed HTML and the extracted content
        self.url_queue = Queue()
        self.html_queue = Queue()
        self.content_queue = Queue()

    def get_total_url(self):
        '''build all page URLs and put them on the URL queue'''
        url_temp = "https://www.qiushibaike.com/8hr/page/{}/"
        for i in range(1, 36):
            self.url_queue.put(url_temp.format(i))

    def parse_url(self):
        '''send a request, get the response, and parse the HTML with etree'''
        # not_empty is always truthy; the loop relies on daemon threads + queue.join() to exit
        while self.url_queue.not_empty:
            url = self.url_queue.get()
            print("parsing url:", url)
            response = requests.get(url, headers=self.headers, timeout=10)  # send the request
            html = response.content.decode()  # get the HTML string
            html = etree.HTML(html)  # get an Element-type html
            self.html_queue.put(html)
            self.url_queue.task_done()

    def get_content(self):
        '''for each page, build a list with the content of every post'''
        while self.html_queue.not_empty:
            html = self.html_queue.get()
            # list of div elements, one per post
            total_div = html.xpath('//div[@class="article block untagged mb15"]')
            items = []
            for i in total_div:  # iterate over the div tags and collect the info of each post
                author_img = i.xpath('./div[@class="author clearfix"]/a[1]/img/@src')
                author_img = "https:" + author_img[0] if len(author_img) > 0 else None
                author_name = i.xpath('./div[@class="author clearfix"]/a[2]/h2/text()')
                author_name = author_name[0] if len(author_name) > 0 else None
                item = dict(
                    author_name=author_name,
                    author_img=author_img
                )
                items.append(item)
            self.content_queue.put(items)
            self.html_queue.task_done()  # task_done decrements the queue's unfinished-task counter

    def save_items(self):
        '''save the extracted items'''
        while self.content_queue.not_empty:
            items = self.content_queue.get()
            f = open("qiubai.txt", "a")
            for i in items:
                json.dump(i, f, ensure_ascii=False, indent=2)
            f.close()
            self.content_queue.task_done()

    def run(self):
        thread_list = []

        # 1. build the URL list
        thread_url = threading.Thread(target=self.get_total_url)
        thread_list.append(thread_url)

        # send the network requests
        for i in range(10):
            thread_parse = threading.Thread(target=self.parse_url)
            thread_list.append(thread_parse)

        # extract the data
        thread_get_content = threading.Thread(target=self.get_content)
        thread_list.append(thread_get_content)

        # save the data
        thread_save = threading.Thread(target=self.save_items)
        thread_list.append(thread_save)

        for t in thread_list:
            t.daemon = True  # daemon threads exit when the main thread exits
            t.start()

        # let the main thread wait; it only exits when all queues are empty
        self.url_queue.join()
        self.html_queue.join()
        self.content_queue.join()


if __name__ == "__main__":
    qiubai = Qiubai()
    qiubai.run()

Nine. Selenium and PhantomJS

1. Selenium

1) Selenium can drive a browser according to our instructions: load pages automatically, fetch the data we need, take screenshots, or check whether certain actions have happened on a site.

2) Selenium does not ship with a browser of its own and provides no browser functionality itself; it must be combined with a third-party browser to be useful.

2. PhantomJS

1) PhantomJS is a WebKit-based "headless" browser. It loads websites into memory and executes the JavaScript on the page; because it never renders a graphical interface, it runs more efficiently than a full browser.

2) Combining Selenium and PhantomJS gives us a very powerful web crawler that can handle JavaScript, cookies, headers, and anything else a real user would do.

3. Quick start

1) Import the packages

from selenium import webdriver
# the Keys class must be imported to simulate keyboard keys
from selenium.webdriver.common.keys import Keys

2) Create a browser object using the PhantomJS executable found via the environment PATH

• driver = webdriver.PhantomJS()

3) Visit a page

• driver.get("http://www.baidu.com/")

4) Take a screenshot

• driver.save_screenshot("baidu.png")

5) Type text into Baidu's search box

• driver.find_element_by_id("kw").send_keys(u"长城")

6) The Baidu search button; click() simulates a mouse click

• driver.find_element_by_id("su").click()

7) Ctrl+A to select all content in the input box

• driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'a')

8) Simulate the Enter key

• driver.find_element_by_id("su").send_keys(Keys.RETURN)

9) Move forward and back in the page history

• driver.forward()     # forward
• driver.back()        # back

10) Clear the input box

• driver.find_element_by_id("kw").clear()

11) Close the page or quit the browser

• Close the current page: driver.close()
• Quit the browser: driver.quit()

A combined, runnable version of these steps is sketched below.
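The combined script below strings the quick-start steps together; it assumes an older Selenium (3.x) API and a PhantomJS executable on the PATH, as in the examples above.

# combined quick-start sketch (Selenium 3.x style API, PhantomJS on PATH)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.PhantomJS()

driver.get("http://www.baidu.com/")
driver.save_screenshot("baidu.png")                    # screenshot of the start page

driver.find_element_by_id("kw").send_keys(u"长城")      # type the query
driver.find_element_by_id("su").click()                # click the search button
driver.save_screenshot("changcheng.png")               # screenshot of the result page

driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'a')  # select all
driver.find_element_by_id("kw").clear()                # clear the input box

driver.quit()                                          # quit the browser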

4. Notes on using Selenium

1) Getting text and attributes: first locate the element, then call `.text` or the `get_attribute` method.

2) The page data Selenium sees is the content of the rendered DOM, i.e. what the browser shows under Elements.

3) The difference between find_element and find_elements:

• find_element returns a single element and raises an error if nothing is found.
• find_elements returns a list, which is empty if nothing matches.
• When checking whether there is a next page, use find_elements and decide based on the length of the resulting list (see the sketch after this list).

4) If the page contains an iframe or frame, you must first call driver.switch_to.frame to switch into the frame before you can locate elements inside it.

5) When Selenium requests the first page it waits for the page to finish loading before data is fetched, but after clicking to the next page data is fetched immediately; this can raise errors because the data has not loaded yet, so add something like time.sleep(3).

6) find_element_by_class_name only accepts a single class value; you cannot pass several at once.
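A small sketch of notes 3) to 5); the XPath selector and frame name are placeholders, and the old-style find_elements_* API from Selenium 3.x is assumed.

import time

# check for a "next page" link; an empty list means there is no next page
next_buttons = driver.find_elements_by_xpath('//a[text()="下一页"]')
if len(next_buttons) > 0:
    next_buttons[0].click()
    time.sleep(3)              # give the new page time to load before extracting data

# if the content sits inside a frame, switch into it first
# driver.switch_to.frame("login_frame")   # the frame name/id is a placeholder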

Ten. Captcha recognition

1. The URL does not change and the captcha does not change

  Request the captcha's URL, get the response, and recognize it.

2. The URL does not change but the captcha changes

  Idea: when the server returns a captcha, it associates that captcha with the particular user's information; later, when the user sends the POST request, the server checks whether the captcha submitted in the POST matches the one it stored for that user.

1) Instantiate a session.

2) Use the session to request the login page and obtain the captcha's URL.

3) Use the session to request the captcha image and recognize it.

4) Use the session to send the POST request (see the sketch below).
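A minimal sketch of this session-based flow; the URLs, form fields and the recognize() helper are placeholders for the real site and for whatever OCR or manual-input step you use.

import requests
from lxml import etree

session = requests.Session()

# 1. request the login page with the session (its cookies are stored on the session)
login_page = session.get("http://example.com/login").content.decode()
captcha_url = etree.HTML(login_page).xpath('//img[@id="captcha"]/@src')[0]

# 2. request the captcha with the same session and recognize it
captcha_img = session.get(captcha_url).content
captcha_text = recognize(captcha_img)   # hypothetical OCR / manual-input step

# 3. send the POST with the same session, so the cookies match the stored captcha
data = {"username": "user", "password": "pass", "captcha": captcha_text}
response = session.post("http://example.com/login", data=data)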

3. Logging in with Selenium when a captcha appears

 1) URL unchanged, captcha unchanged: same as above.

 2) URL unchanged, captcha changes:

• Use Selenium to request the login page and obtain the captcha's URL at the same time.
• Read the cookies from the driver on the login page, hand them to the requests module to fetch the captcha, and recognize it.
• Type in the captcha and click the login button (a sketch of this flow follows).
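A sketch of handing the Selenium cookies over to requests; the URLs and element ids are placeholders, and recognize() again stands in for the recognition step.

import requests
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://example.com/login")

captcha_url = driver.find_element_by_id("captcha").get_attribute("src")

# copy the browser's cookies into a requests session
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

captcha_text = recognize(session.get(captcha_url).content)  # hypothetical OCR step

driver.find_element_by_id("captcha_input").send_keys(captcha_text)
driver.find_element_by_id("login_button").click()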

