requests module
Requests is an HTTP library for Python, built on top of urllib and released under the Apache2 License. It is far more convenient than urllib and is the simplest, most widely used HTTP library in Python, so crawlers are generally recommended to use it.
1. Installation:
pip install requests
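To see why Requests is considered more convenient than urllib, compare the same GET request in both libraries (a minimal sketch; httpbin.org simply echoes the request back):
import requests
from urllib import request, parse

# With urllib you build the query string and decode the body yourself
resp = request.urlopen('http://httpbin.org/get?' + parse.urlencode({'q': 'test'}))
body = resp.read().decode('utf-8')

# With requests the same call is one line, and decoding is handled for you
body = requests.get('http://httpbin.org/get', params={'q': 'test'}).text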
2. Basic syntax
1. Request methods supported by the requests module:
import requests
requests.get("http://httpbin.org/get ")
requests.post("http://httpbin.org/post ")
requests.put("http://httpbin.org/put ")
requests.delete("http://httpbin.org/delete ")
requests.head("http://httpbin.org/get ")
requests.options("http://httpbin.org/get ")
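Each of these helpers maps to one HTTP verb and returns a requests.Response object, so the result can be inspected right away (a small sketch):
import requests

response = requests.get("http://httpbin.org/get")
print(type(response))        # <class 'requests.models.Response'>
print(response.status_code)  # 200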
GET request
1 Basic request
import requests
response=requests.get('https://www.jd.com/')
with open("jd.html","wb") as f:
    f.write(response.content)
2 Request with parameters
import requests
response=requests.get('https://s.taobao.com/search?q=Mobile')
response=requests.get('https://s.taobao.com/search',params={"q":"beauties"})
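The params dict is URL-encoded and appended to the query string automatically; the final URL can be checked on the response (a sketch against httpbin.org):
import requests

response = requests.get('http://httpbin.org/get', params={'q': 'beauties'})
print(response.url)  # http://httpbin.org/get?q=beauties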
3 Request with header
import requests
response=requests.get('https://dig.chouti.com/',
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
}
)
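httpbin.org can echo back the headers it received, which is handy for verifying that a custom User-Agent was actually sent (a sketch; the User-Agent string here is made up):
import requests

response = requests.get('http://httpbin.org/headers',
                        headers={'User-Agent': 'my-crawler/1.0'})
print(response.json()['headers']['User-Agent'])  # my-crawler/1.0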
4 Request with cookies
import uuid # Module for generating random strings
import requests
url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))
res = requests.get(url, cookies=cookies)
print(res.text)
# Cookie usage:
res = requests.get("https://www.autohome.com.cn/beijing/")
res_cookies = res.cookies
requests.post("https://www.autohome.com.cn/beijing/", cookies=res_cookies)
5 Session request
# Equivalent to the manual cookie passing shown above
import requests

session = requests.session()
res1 = session.get("https://github.com/login/")
res2 = session.post("https://github.com/session", headers=header, data=data)  # header and data are built as in the GitHub case below
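The point of a session is that cookies set by one response are carried automatically on the next request; httpbin.org makes this easy to verify (sketch):
import requests

session = requests.session()
# The server sets a cookie on this first request...
session.get('http://httpbin.org/cookies/set/sbid/123')
# ...and the session sends it back automatically on the next one
print(session.get('http://httpbin.org/cookies').json())  # {'cookies': {'sbid': '123'}}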
POST request
1 data parameter
requests.post() is used exactly like requests.get(); the difference is that requests.post() takes an extra data parameter, which holds the request body data.
response=requests.post("http://httpbin.org/post", params={"a":"1"}, data={"name":"deng"})
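httpbin echoes query-string arguments under 'args' and body data under 'form', which makes the difference between params and data easy to see (sketch):
import requests

response = requests.post('http://httpbin.org/post',
                         params={'a': '1'},      # goes into the query string
                         data={'name': 'deng'})  # goes into the request body
result = response.json()
print(result['args'])  # {'a': '1'}
print(result['form'])  # {'name': 'deng'}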
2 Send json data
import requests
res1=requests.post(url='http://httpbin.org/post', data={'name':'deng'})  # No headers specified; default Content-Type: application/x-www-form-urlencoded
print(res1.json())
res2=requests.post(url='http://httpbin.org/post', json={'age':"11"})  # Default Content-Type: application/json
print(res2.json())
response object
(1) Common attributes
import requests
response=requests.get('https://sh.lianjia.com/ershoufang/')
# response attributes
print(response.text)               # Decoded text body
print(response.content)            # Raw bytes
print(response.status_code)        # Response status code
print(response.headers)            # Response headers
print(response.cookies)            # Cookies set by the server
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)            # Responses received before any redirect
print(response.encoding)           # Encoding used to decode .text
(2) Encoding issues
import requests
response=requests.get('http://www.autohome.com/news')
# response.encoding='gbk'  # The Autohome site returns pages encoded as gb2312, while requests defaults to ISO-8859-1; without setting gbk the Chinese characters come out garbled.
with open("res.html","w") as f:
f.write(response.text)
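Instead of hard-coding gbk, the encoding guessed from the page bytes can be used; requests exposes this guess as apparent_encoding (a sketch, not specific to Autohome):
import requests

response = requests.get('http://www.autohome.com/news')
# Use the encoding detected from the body instead of the ISO-8859-1 default
response.encoding = response.apparent_encoding
print(response.encoding)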
(3) Download binary files (pictures, videos, audios)
import requests
response=requests.get('http://bangimg1.dahe .cn/forum/201612/10/200447p36yk96im76vatyk.jpg')
with open("res.png","wb< span style="color: #800000;">") as f:
    # f.write(response.content)  # For a large download (say a 100 GB video), reading the whole body with response.content and writing it in one go is unreasonable.
    for line in response.iter_content():  # Generator that yields the body chunk by chunk
        f.write(line)
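For genuinely large files, also pass stream=True so the body is not downloaded into memory up front, and read it in larger chunks (sketch; the 1 KB chunk size is an arbitrary choice):
import requests

# stream=True defers downloading the body until it is iterated
response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg',
                        stream=True)
with open("res.png", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):  # 1 KB at a time
        f.write(chunk)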
(4) Parse json data
import requests
import json
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text)  # Works, but cumbersome
res2=response.json()            # Gets the parsed json directly
print(res1==res2)
(5) Redirection and History
By default, Requests automatically handles all redirects for every method except HEAD. You can use the history attribute of the response object to track redirects: Response.history is a list of the Response objects that were created to complete the request, sorted from oldest to most recent.
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
In addition, you can also disable redirect processing via the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
requests advanced usage
Proxies
Free proxy list (Kuaidaili): https://www.kuaidaili.com/free/
Some websites take anti-crawler measures. For example, many sites track how many requests a given IP makes within a period of time, and if the access frequency does not look like a normal visitor's, the IP may be banned. So we set up proxy servers and switch to a different proxy every so often; even if one IP gets banned, we can change IPs and keep crawling.
res=requests.get('http://httpbin.org/ip', proxies={'http':'110.83.40.27:9999'}).json()
print(res)
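proxies maps a URL scheme to a proxy address, so http and https traffic can be routed through different proxies if needed (a sketch; the address below is a placeholder from the free-proxy list, not guaranteed to be alive):
import requests

proxies = {
    'http': 'http://110.83.40.27:9999',   # used for plain-HTTP requests
    'https': 'http://110.83.40.27:9999',  # used for HTTPS requests
}
res = requests.get('http://httpbin.org/ip', proxies=proxies).json()
print(res)  # the reported origin IP should be the proxy's, not yours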
Crawler case
Logging in to GitHub
# The idea:
# First log in to GitHub with your own account and capture the request information that needs to be carried
# Analyze that request information; the key points are: which url the login data is submitted to, and how to obtain the authenticity_token in the data dynamically
import re
import requests

# Get the login page
url = 'https://github.com/login'  # login address
session = requests.session()
login = session.get(url)
# Build the form data; authenticity_token is scraped from the login page
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login.text)[0]
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'dengjn',
    'password': 'd11111'}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
}
res2 = session.post("https://github.com/session", headers=header, data=data)
with open("github.html", "wb") as f:
    f.write(res2.content)
print(login.status_code)
print(res2.status_code)
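A quick sanity check on the result: if the sign-in succeeded, the returned page is the signed-in dashboard and should contain the username (a hedged sketch reusing res2 from above; the marker string is an assumption):
# Hypothetical check: the dashboard HTML normally contains the logged-in username
if 'dengjn' in res2.text:
    print('login ok')
else:
    print('login failed - re-check authenticity_token and credentials')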