Introduction
- On many portal websites, if a user fails to log in more than 3 to 5 times in a row, a verification code is dynamically generated on the login page. The verification code serves to throttle traffic and block crawlers.
- 1. Grab the page data that carries the verification code
- 2. Parse the verification code out of the page data and download the verification code image to the local disk
- 3. Submit the verification code image to a third-party platform for recognition, which returns the data value shown on the image
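Step 2 above (parsing the captcha image URL out of the page data) can be sketched with the standard library alone. This is a minimal illustration, not the code used later in this post; the page fragment, the `captcha_image` element id, and the URL are hypothetical stand-ins for the real login page:

```python
from html.parser import HTMLParser

class CaptchaImgFinder(HTMLParser):
    """Collects the src attribute of the <img id="captcha_image"> element."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and d.get("id") == "captcha_image":
            self.src = d.get("src")

# Hypothetical login-page fragment standing in for the real page data
page_text = '<form><img id="captcha_image" src="https://example.com/captcha.jpg"></form>'
finder = CaptchaImgFinder()
finder.feed(page_text)
print(finder.src)  # -> https://example.com/captcha.jpg
```

The full code below uses `lxml` with an XPath query instead, which is terser for this kind of attribute lookup.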
- Cloud coding platform (Yundama):
- 1. Register on the official website (both a regular user and a developer user)
- 2. Log in as the developer user:
- 1. Download the sample code (Development documentation -> Call examples and latest DLL -> PythonHTTP example download)
- 2. Create a software entry: My Software -> Add new software
- 3. Modify the code in the sample's source file so that it recognizes the data value in the verification code image

Code display:
# This function calls the coding platform's API to recognize the specified
# verification code image and returns the data value shown on the image.
# YDMHttp comes from the PythonHTTP sample code downloaded from the platform.
def getCode(codeImg):
    # Username of the regular user on the cloud coding platform
    username = 'bobo328410948'
    # Password of the regular user on the cloud coding platform
    password = 'bobo328410948'
    # Software ID, a required parameter; get it from [My Software] in the developer backend
    appid = 6003
    # Software key, a required parameter; get it from [My Software] in the developer backend
    appkey = '1f4b564483ae5c907a1d34f8e2f2776c'
    # Verification code image file
    filename = codeImg
    # Verification code type. Example: 1004 means 4 alphanumeric characters.
    # Different types are charged differently; fill it in accurately, otherwise the
    # recognition rate suffers. See all types at http://www.yundama.com/price.html
    codetype = 3000
    # Timeout, in seconds
    timeout = 20
    # Check
    if username == 'username':
        print('Please set the relevant parameters and test again')
    else:
        # Initialization
        yundama = YDMHttp(username, password, appid, appkey)
        # Log in to the cloud coding platform
        uid = yundama.login()
        print('uid: %s' % uid)
        # Check the balance
        balance = yundama.balance()
        print('balance: %s' % balance)
        # Start recognition: image path, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result
import requests
from lxml import etree
import re

# 1. Grab the page data that carries the verification code
url = 'https://www.douban.com/accounts/login?source=movie'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text

# 2. Parse the verification code out of the page data and download the image locally
tree = etree.HTML(page_text)
codeImg_url = tree.xpath('//*[@id="captcha_image"]/@src')[0]
# Get the binary data of the verification code image
code_img = requests.get(url=codeImg_url, headers=headers).content
# Get the captcha-id (the regex pattern was lost in the source; it should capture
# the value of the hidden captcha-id field in the login form)
c_id = re.findall('', page_text, re.S)[0]
with open('./code.png', 'wb') as fp:
    fp.write(code_img)

# 3. Get the data value shown on the verification code image
codeText = getCode('./code.png')
print(codeText)

# Sign in
post = 'https://accounts.douban.com/login'
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "15027900535",
    "form_password": "[emailprotected]",
    "captcha-solution": codeText,
    "captcha-id": c_id,
    "login": "Sign in",
}
print(c_id)
login_text = requests.post(url=post, data=data, headers=headers).text
with open('./login.html', 'w', encoding='utf-8') as fp:
    fp.write(login_text)
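A simple follow-up check, not in the original code: after writing login.html you can estimate whether the login succeeded by testing whether the response still contains the login form. The `form_email` marker below is an assumption about Douban's login page markup:

```python
def login_succeeded(html_text):
    """Hypothetical heuristic: if the response still shows the login form
    (the form_email input), the login did not succeed."""
    return 'name="form_email"' not in html_text

print(login_succeeded('<html><p>Welcome back</p></html>'))      # -> True
print(login_succeeded('<input name="form_email" type="text">')) # -> False
```

A heuristic like this is useful in a loop: if the captcha was recognized incorrectly, you can download a fresh image and retry.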
IP proxy:
- What is a proxy?
- A proxy is a third party that handles affairs on behalf of the principal. Examples of proxies in everyday life: purchasing agents, intermediaries, WeChat resellers...
- Why do crawlers need proxies?
- Some websites have anti-crawler measures. For example, many websites detect the number of visits from a given IP within a certain period; if the visit frequency is too fast to look like a normal visitor, the site may ban that IP. So we need to set up some proxy IPs and switch to a new proxy IP every so often; even if one IP is banned, we can switch to another and keep crawling.
- Classification of proxies:
- Forward proxy: proxies the client to fetch data. A forward proxy shields the client from being held accountable.
- Reverse proxy: proxies the server to provide data. A reverse proxy protects the server or handles load balancing.
- Websites that provide free proxy IPs:
- http://www.goubanjia.com/
- Xici proxy
- Kuaidaili proxy
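Building on the idea above of switching to another IP once one is banned, here is a minimal sketch of a rotation helper. The helper function and the banned-set bookkeeping are my own additions; the proxy addresses are the same hypothetical ones used in the code below:

```python
import random

def pick_proxy(proxy_list, banned):
    """Return a random proxy dict whose address is not in the banned set."""
    candidates = [p for p in proxy_list if p["http"] not in banned]
    if not candidates:
        raise RuntimeError("all proxies are banned; refresh the proxy list")
    return random.choice(candidates)

proxy_list = [
    {"http": "112.115.57.20:3128"},
    {"http": "121.41.171.223:3128"},
]
# Pretend the first proxy has already been banned by the target site
proxy = pick_proxy(proxy_list, banned={"112.115.57.20:3128"})
print(proxy["http"])  # -> 121.41.171.223:3128
```

In practice you would add an address to `banned` whenever a request through it times out or is rejected, then call `pick_proxy` again.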
Code
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
import random

if __name__ == "__main__":
    # UAs of different browsers
    header_list = [
        # Maxthon
        {"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
        # Firefox
        {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
        # Chrome
        {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    ]
    # Different proxy IPs
    proxy_list = [
        {"http": "112.115.57.20:3128"},
        {"http": "121.41.171.223:3128"}
    ]
    # Randomly pick a UA and a proxy IP
    header = random.choice(header_list)
    proxy = random.choice(proxy_list)
    url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
    # Third parameter: set the proxy
    response = requests.get(url=url, headers=header, proxies=proxy)
    response.encoding = 'utf-8'
    with open('daili.html', 'wb') as fp:
        fp.write(response.content)
    # Switch back to the original IP
    requests.get(url, proxies={"http": ""})