Crawler verification code processing and IP proxy handling

Introduction

  • On many portal websites, if a user fails to log in more than 3 or 5 times in a row, a verification code is dynamically generated on the login page. The verification code both throttles traffic and blocks crawlers. Handling it takes three steps:

  - 1. Fetch the page data that carries the verification code

  - 2. Parse the verification code out of the page data and download the verification code image locally

  - 3. Submit the verification code image to a third-party platform for recognition and get back the data value on the image

    - Cloud coding platform:

      - 1. Register on the official website (both a regular user account and a developer account)

      - 2. Log in as the developer user:

        - 1. Download the sample code (Development Documentation -> "Call examples and latest DLL" -> "Python HTTP example download")

        - 2. Create a software entry: My Software -> "Add new software"

        - 3. Modify the source file in the sample code so that it recognizes the data value in the verification code image

Code display:

# This function calls the coding platform's API to recognize the specified
# verification code image and returns the data value shown on the image.
# (YDMHttp comes from the Python HTTP sample code downloaded from the platform.)
def getCode(codeImg):
    # Username of the regular-user account on the cloud coding platform
    username = 'bobo328410948'

    # Password of the regular-user account on the cloud coding platform
    password = 'bobo328410948'

    # Software ID, a required developer parameter; get it from [My Software] in the developer console
    appid = 6003

    # Software key, a required developer parameter; get it from [My Software] in the developer console
    appkey = '1f4b564483ae5c907a1d34f8e2f2776c'

    # Verification code image file
    filename = codeImg

    # Verification code type. Example: 1004 means 4 alphanumeric characters.
    # Different types are charged differently, so fill it in accurately or the
    # recognition rate suffers. All types: http://www.yundama.com/price.html
    codetype = 3000

    # Timeout, in seconds
    timeout = 20

    # Sanity check
    if username == 'username':
        print('Please set the relevant parameters and test again')
    else:
        # Initialization
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to the coding platform
        uid = yundama.login()
        print('uid: %s' % uid)

        # Check the account balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Start recognition: image path, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))

        return result

import requests
from lxml import etree
import re

# 1. Fetch the page data that carries the verification code
url = 'https://www.douban.com/accounts/login?source=movie'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text

# 2. Parse the verification code out of the page data and download the image locally
tree = etree.HTML(page_text)
codeImg_url = tree.xpath('//*[@id="captcha_image"]/@src')[0]
# Get the binary data of the verification code image
code_img = requests.get(url=codeImg_url, headers=headers).content

# Get the captcha-id (the original regex was lost during extraction; this
# reconstructed pattern matches the hidden captcha-id field in the login form)
c_id = re.findall(r'name="captcha-id" value="(.*?)"', page_text, re.S)[0]
with open('./code.png', 'wb') as fp:
    fp.write(code_img)

# Get the data value shown on the verification code image
codeText = getCode('./code.png')
print(codeText)

# Sign in
post = 'https://accounts.douban.com/login'
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "15027900535",
    "form_password": "[emailprotected]",
    "captcha-solution": codeText,
    "captcha-id": c_id,
    "login": "Sign in",
}
print(c_id)
login_text = requests.post(url=post, data=data, headers=headers).text
with open('./login.html', 'w', encoding='utf-8') as fp:
    fp.write(login_text)
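A quick sanity check on the result saved above: when the login fails, Douban serves the login form again, and that form contains the `captcha_image` element targeted by the XPath earlier. A minimal, heuristic check (the marker is an assumption based on that XPath, not a documented API):

```python
def login_succeeded(html):
    # The login form embeds <img id="captcha_image"> (see the XPath above);
    # if that id is still present, we are most likely back on the login form.
    return 'captcha_image' not in html
```

For example, call `login_succeeded(login_text)` after the POST, and when it returns False download a fresh captcha image and run getCode again.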

IP proxy:

  • What is a proxy?
    • A proxy is a third party that handles matters on behalf of the principal. Real-life examples: purchasing agents, intermediaries, WeChat resellers…

  • Why crawlers need proxies

    • Some websites deploy anti-crawler measures. For example, many sites track how many times a given IP visits within a certain period; if the visit frequency is too high to be a normal visitor, the site may ban that IP. So we set up a pool of proxy IPs and switch to a different proxy IP every so often: even if one IP gets banned, we can switch IPs and keep crawling.

  • Classification of proxies:

    • Forward proxy: proxies the client to fetch data. A forward proxy shields the client from being traced.

    • Reverse proxy: proxies the server to provide data. A reverse proxy shields the server, or handles load balancing.

  • Websites that provide free proxy IPs

    • http://www.goubanjia.com/

    • Xici proxy

    • Kuaidaili (Quick Proxy)

  • Code

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    import requests
    import random

    if __name__ == "__main__":
        # UA strings for different browsers
        header_list = [
            # Maxthon
            {"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
            # Firefox
            {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
            # Chrome
            {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
        ]
        # Different proxy IPs
        proxy_list = [
            {"http": "112.115.57.20:3128"},
            {"http": "121.41.171.223:3128"}
        ]
        # Pick a UA and a proxy IP at random
        header = random.choice(header_list)
        proxy = random.choice(proxy_list)
        url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
        # Third keyword argument: set the proxy
        response = requests.get(url=url, headers=header, proxies=proxy)
        response.encoding = 'utf-8'

        with open('daili.html', 'wb') as fp:
            fp.write(response.content)
        # Switch back to the original (non-proxied) IP
        requests.get(url, proxies={"http": ""})
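The switching strategy described above ("change to a proxy IP every once in a while, and change again when one gets banned") can be sketched as a small picker that skips proxies already found dead. The pool below reuses the placeholder addresses from the sample and is unlikely to still be live:

```python
import random

# Placeholder pool (the same sample proxies as above; likely no longer live)
proxy_list = [
    {"http": "112.115.57.20:3128"},
    {"http": "121.41.171.223:3128"},
]

def pick_proxy(pool, banned):
    """Pick a random proxy whose address is not in the banned set;
    return None when every proxy in the pool has been banned."""
    alive = [p for p in pool if p["http"] not in banned]
    return random.choice(alive) if alive else None
```

On a timeout or a 403 response, add `proxy["http"]` to `banned` and call `pick_proxy` again; when it returns None the pool is exhausted and needs refreshing from one of the free-proxy sites listed above.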

