requests module
Requests is an HTTP library for Python, built on top of urllib and released under the Apache2 License. It is far more convenient than urllib and is the simplest, most widely used HTTP library in Python, so crawlers are generally recommended to use it.
1. Installation:
pip install requests
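To see why Requests is considered more convenient than urllib, compare the same GET request in both libraries (a minimal sketch; httpbin.org simply echoes the request back):
import requests
from urllib import request, parse

# With urllib you build the query string and decode the body yourself
resp = request.urlopen('http://httpbin.org/get?' + parse.urlencode({'q': 'test'}))
body = resp.read().decode('utf-8')

# With requests the same call is one line, and decoding is handled for you
body = requests.get('http://httpbin.org/get', params={'q': 'test'}).text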
2. Basic syntax
1. Request methods supported by the requests module:
import requests
requests.get("http://httpbin.org/get ")
requests.post("http://httpbin.org/post ")
requests.put("http://httpbin.org/put ")
requests.delete("http://httpbin.org/delete ")
requests.head("http://httpbin.org/get ")
requests.options("http://httpbin.org/get ")
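Each of these helpers maps to one HTTP verb and returns a requests.Response object, so the result can be inspected right away (a small sketch):
import requests

response = requests.get("http://httpbin.org/get")
print(type(response))        # <class 'requests.models.Response'>
print(response.status_code)  # 200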
GET request
1 Basic request
import requests
response=requests.get('https://www.jd.com/')
with open("jd.html","wb") as f:
    f.write(response.content)
2 Request with parameters
import requests
response=requests.get('https://s.taobao.com/search?q=Mobile')
response=requests.get('https://s.taobao.com/search',params={"q":"beauties"})
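The params dict is URL-encoded and appended to the query string automatically; the final URL can be checked on the response (a sketch against httpbin.org):
import requests

response = requests.get('http://httpbin.org/get', params={'q': 'beauties'})
print(response.url)  # http://httpbin.org/get?q=beauties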
3 Request with header
import requests
response=requests.get('https://dig.chouti.com/',
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
}
)
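httpbin.org can echo back the headers it received, which is handy for verifying that a custom User-Agent was actually sent (a sketch; the User-Agent string here is made up):
import requests

response = requests.get('http://httpbin.org/headers',
                        headers={'User-Agent': 'my-crawler/1.0'})
print(response.json()['headers']['User-Agent'])  # my-crawler/1.0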
4 Request with cookies
import uuid # Module for generating random strings
import requests
url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))
res = requests.get(url, cookies=cookies)
print(res.text)
# Cookie usage:
res = requests.get("https://www.autohome.com.cn/beijing/")
res_cookies = res.cookies
requests.post("https://www.autohome.com.cn/beijing/", cookies=res_cookies)
5 Session request
# Equivalent to the manual cookie passing shown above
import requests

session = requests.session()
res1 = session.get("https://github.com/login/")
res2 = session.post("https://github.com/session", headers=header, data=data)  # header and data are built as in the GitHub case below
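The point of a session is that cookies set by one response are carried automatically on the next request; httpbin.org makes this easy to verify (sketch):
import requests

session = requests.session()
# The server sets a cookie on this first request...
session.get('http://httpbin.org/cookies/set/sbid/123')
# ...and the session sends it back automatically on the next one
print(session.get('http://httpbin.org/cookies').json())  # {'cookies': {'sbid': '123'}}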
POST request
1 data parameter
requests.post() is used exactly like requests.get(); the difference is that requests.post() takes an extra data parameter, which holds the request body data.
response=requests.post("http://httpbin.org/post", params={"a":"1"}, data={"name":"deng"})
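httpbin echoes query-string arguments under 'args' and body data under 'form', which makes the difference between params and data easy to see (sketch):
import requests

response = requests.post('http://httpbin.org/post',
                         params={'a': '1'},      # goes into the query string
                         data={'name': 'deng'})  # goes into the request body
result = response.json()
print(result['args'])  # {'a': '1'}
print(result['form'])  # {'name': 'deng'}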
2 Send json data
import requests
res1=requests.post(url='http://httpbin.org/post', data={'name':'deng'})  # No headers specified; default Content-Type: application/x-www-form-urlencoded
print(res1.json())
res2=requests.post(url='http://httpbin.org/post', json={'age':"11"})  # Default Content-Type: application/json
print(res2.json())
response object
(1) Common attributes
import requests
response=requests.get('https://sh.lianjia.com/ershoufang/')
# response attributes
print(response.text)               # Decoded text body
print(response.content)            # Raw bytes
print(response.status_code)        # Response status code
print(response.headers)            # Response headers
print(response.cookies)            # Cookies set by the server
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)            # Responses received before any redirect
print(response.encoding)           # Encoding used to decode .text
(2) Encoding issues
import requests
response=requests.get('http://www.autohome.com/news')
# response.encoding='gbk'  # The Autohome site returns pages encoded as gb2312, while requests defaults to ISO-8859-1; without setting gbk the Chinese characters come out garbled.
with open("res.html","w") as f:
f.write(response.text)
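Instead of hard-coding gbk, the encoding guessed from the page bytes can be used; requests exposes this guess as apparent_encoding (a sketch, not specific to Autohome):
import requests

response = requests.get('http://www.autohome.com/news')
# Use the encoding detected from the body instead of the ISO-8859-1 default
response.encoding = response.apparent_encoding
print(response.encoding)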
(3) Download binary files (pictures, videos, audios)
import requests
response=requests.get('http://bangimg1.dahe .cn/forum/201612/10/200447p36yk96im76vatyk.jpg')
with open("res.png","wb< span style="color: #800000;">") as f:
    # f.write(response.content)  # For a large download (say a 100 GB video), reading the whole body with response.content and writing it in one go is unreasonable.
    for line in response.iter_content():  # Generator that yields the body chunk by chunk
        f.write(line)
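For genuinely large files, also pass stream=True so the body is not downloaded into memory up front, and read it in larger chunks (sketch; the 1 KB chunk size is an arbitrary choice):
import requests

# stream=True defers downloading the body until it is iterated
response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg',
                        stream=True)
with open("res.png", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):  # 1 KB at a time
        f.write(chunk)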
(4) Parse json data
import requests
import json
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text)  # Works, but cumbersome
res2=response.json()            # Gets the parsed json directly
print(res1==res2)
(5) Redirection and History
By default, Requests automatically handles all redirects for every method except HEAD. You can use the history attribute of the response object to track redirects: Response.history is a list of the Response objects that were created to complete the request, sorted from oldest to most recent.
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
In addition, you can also disable redirect processing via the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
requests advanced usage
Proxies
Free proxy list (Kuaidaili): https://www.kuaidaili.com/free/
Some websites take anti-crawler measures. For example, many sites track how many requests a given IP makes within a period of time, and if the access frequency does not look like a normal visitor's, the IP may be banned. So we set up proxy servers and switch to a different proxy every so often; even if one IP gets banned, we can change IPs and keep crawling.
res=requests.get('http://httpbin.org/ip', proxies={'http':'110.83.40.27:9999'}).json()
print(res)
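proxies maps a URL scheme to a proxy address, so http and https traffic can be routed through different proxies if needed (a sketch; the address below is a placeholder from the free-proxy list, not guaranteed to be alive):
import requests

proxies = {
    'http': 'http://110.83.40.27:9999',   # used for plain-HTTP requests
    'https': 'http://110.83.40.27:9999',  # used for HTTPS requests
}
res = requests.get('http://httpbin.org/ip', proxies=proxies).json()
print(res)  # the reported origin IP should be the proxy's, not yours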
Crawler case
Logging in to GitHub
# The idea:
# First log in to GitHub with your own account and capture the request information that needs to be carried
# Analyze that request information; the key points are: which url the login data is submitted to, and how to obtain the authenticity_token in the data dynamically
import re
import requests

# Get the login page
url = 'https://github.com/login'  # login address
session = requests.session()
login = session.get(url)
# Build the form data; authenticity_token is scraped from the login page
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login.text)[0]
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'dengjn',
    'password': 'd11111'}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
}
res2 = session.post("https://github.com/session", headers=header, data=data)
with open("github.html", "wb") as f:
    f.write(res2.content)
print(login.status_code)
print(res2.status_code)
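A quick sanity check on the result: if the sign-in succeeded, the returned page is the signed-in dashboard and should contain the username (a hedged sketch reusing res2 from above; the marker string is an assumption):
# Hypothetical check: the dashboard HTML normally contains the logged-in username
if 'dengjn' in res2.text:
    print('login ok')
else:
    print('login failed - re-check authenticity_token and credentials')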