Introduction
- On many portal websites, if a user fails to log in more than 3 to 5 times in a row, a verification code is dynamically generated on the login page. The verification code serves to throttle traffic and block crawlers.
- 1. Grab the page data that carries the verification code
- 2. Parse the verification code out of the page data and download the verification code image to the local disk
- 3. Submit the verification code image to a third-party platform for recognition, which returns the data value shown on the image
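Step 2 above (parsing the captcha image URL out of the page data) can be sketched with the standard library alone. This is a minimal illustration, not the code used later in this post; the page fragment, the `captcha_image` element id, and the URL are hypothetical stand-ins for the real login page:

```python
from html.parser import HTMLParser

class CaptchaImgFinder(HTMLParser):
    """Collects the src attribute of the <img id="captcha_image"> element."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and d.get("id") == "captcha_image":
            self.src = d.get("src")

# Hypothetical login-page fragment standing in for the real page data
page_text = '<form><img id="captcha_image" src="https://example.com/captcha.jpg"></form>'
finder = CaptchaImgFinder()
finder.feed(page_text)
print(finder.src)  # -> https://example.com/captcha.jpg
```

The full code below uses `lxml` with an XPath query instead, which is terser for this kind of attribute lookup.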
- Cloud coding platform (Yundama):
- 1. Register on the official website (both a regular user and a developer user)
- 2. Log in as the developer user:
- 1. Download the sample code (Development documentation -> Call examples and latest DLL -> PythonHTTP example download)
- 2. Create a software entry: My Software -> Add new software
- 3. Modify the code in the sample's source file so that it recognizes the data value in the verification code image

Code display:
# This function calls the coding platform's API to recognize the specified
# verification code image and returns the data value shown on the image.
# YDMHttp comes from the PythonHTTP sample code downloaded from the platform.
def getCode(codeImg):
    # Username of the regular user on the cloud coding platform
    username = 'bobo328410948'
    # Password of the regular user on the cloud coding platform
    password = 'bobo328410948'
    # Software ID, a required parameter; get it from [My Software] in the developer backend
    appid = 6003
    # Software key, a required parameter; get it from [My Software] in the developer backend
    appkey = '1f4b564483ae5c907a1d34f8e2f2776c'
    # Verification code image file
    filename = codeImg
    # Verification code type. Example: 1004 means 4 alphanumeric characters.
    # Different types are charged differently; fill it in accurately, otherwise the
    # recognition rate suffers. See all types at http://www.yundama.com/price.html
    codetype = 3000
    # Timeout, in seconds
    timeout = 20
    # Check
    if username == 'username':
        print('Please set the relevant parameters and test again')
    else:
        # Initialization
        yundama = YDMHttp(username, password, appid, appkey)
        # Log in to the cloud coding platform
        uid = yundama.login()
        print('uid: %s' % uid)
        # Check the balance
        balance = yundama.balance()
        print('balance: %s' % balance)
        # Start recognition: image path, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result
import requests
from lxml import etree
import re

# 1. Grab the page data that carries the verification code
url = 'https://www.douban.com/accounts/login?source=movie'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text

# 2. Parse the verification code out of the page data and download the image locally
tree = etree.HTML(page_text)
codeImg_url = tree.xpath('//*[@id="captcha_image"]/@src')[0]
# Get the binary data of the verification code image
code_img = requests.get(url=codeImg_url, headers=headers).content
# Get the captcha-id (the regex pattern was lost in the source; it should capture
# the value of the hidden captcha-id field in the login form)
c_id = re.findall('', page_text, re.S)[0]
with open('./code.png', 'wb') as fp:
    fp.write(code_img)

# 3. Get the data value shown on the verification code image
codeText = getCode('./code.png')
print(codeText)

# Sign in
post = 'https://accounts.douban.com/login'
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "15027900535",
    "form_password": "[emailprotected]",
    "captcha-solution": codeText,
    "captcha-id": c_id,
    "login": "Sign in",
}
print(c_id)
login_text = requests.post(url=post, data=data, headers=headers).text
with open('./login.html', 'w', encoding='utf-8') as fp:
    fp.write(login_text)
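A simple follow-up check, not in the original code: after writing login.html you can estimate whether the login succeeded by testing whether the response still contains the login form. The `form_email` marker below is an assumption about Douban's login page markup:

```python
def login_succeeded(html_text):
    """Hypothetical heuristic: if the response still shows the login form
    (the form_email input), the login did not succeed."""
    return 'name="form_email"' not in html_text

print(login_succeeded('<html><p>Welcome back</p></html>'))      # -> True
print(login_succeeded('<input name="form_email" type="text">')) # -> False
```

A heuristic like this is useful in a loop: if the captcha was recognized incorrectly, you can download a fresh image and retry.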
IP proxy:
- What is a proxy?
- A proxy is a third party that handles affairs on behalf of the principal. Examples of proxies in everyday life: purchasing agents, intermediaries, WeChat resellers...
- Why do crawlers need proxies?
- Some websites have anti-crawler measures. For example, many websites detect the number of visits from a given IP within a certain period; if the visit frequency is too fast to look like a normal visitor, the site may ban that IP. So we need to set up some proxy IPs and switch to a new proxy IP every so often; even if one IP is banned, we can switch to another and keep crawling.
- Classification of proxies:
- Forward proxy: proxies the client to fetch data. A forward proxy shields the client from being held accountable.
- Reverse proxy: proxies the server to provide data. A reverse proxy protects the server or handles load balancing.
- Websites that provide free proxy IPs:
- http://www.goubanjia.com/
- Xici proxy
- Kuaidaili proxy
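Building on the idea above of switching to another IP once one is banned, here is a minimal sketch of a rotation helper. The helper function and the banned-set bookkeeping are my own additions; the proxy addresses are the same hypothetical ones used in the code below:

```python
import random

def pick_proxy(proxy_list, banned):
    """Return a random proxy dict whose address is not in the banned set."""
    candidates = [p for p in proxy_list if p["http"] not in banned]
    if not candidates:
        raise RuntimeError("all proxies are banned; refresh the proxy list")
    return random.choice(candidates)

proxy_list = [
    {"http": "112.115.57.20:3128"},
    {"http": "121.41.171.223:3128"},
]
# Pretend the first proxy has already been banned by the target site
proxy = pick_proxy(proxy_list, banned={"112.115.57.20:3128"})
print(proxy["http"])  # -> 121.41.171.223:3128
```

In practice you would add an address to `banned` whenever a request through it times out or is rejected, then call `pick_proxy` again.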
Code
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
import random

if __name__ == "__main__":
    # UAs of different browsers
    header_list = [
        # Maxthon
        {"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
        # Firefox
        {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
        # Chrome
        {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    ]
    # Different proxy IPs
    proxy_list = [
        {"http": "112.115.57.20:3128"},
        {"http": "121.41.171.223:3128"}
    ]
    # Randomly pick a UA and a proxy IP
    header = random.choice(header_list)
    proxy = random.choice(proxy_list)
    url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
    # Third parameter: set the proxy
    response = requests.get(url=url, headers=header, proxies=proxy)
    response.encoding = 'utf-8'
    with open('daili.html', 'wb') as fp:
        fp.write(response.content)
    # Switch back to the original IP
    requests.get(url, proxies={"http": ""})