Crawler Framework Scrapy (2)

One: Introduction to the core components of scrapy

1: Engine (Scrapy Engine): controls the data flow through the whole system and triggers events (the core).

2: Scheduler: puts the page URLs to be crawled into a queue (URLs are deduplicated automatically) and hands them back to the engine on request.

3: Downloader: downloads page content and returns it to the spider (Scrapy is built on the Twisted asynchronous model).

4: Spider: extracts data (items) from responses.

5: Pipelines: store the extracted data (a minimal pipeline sketch follows below).

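To make component 5 concrete, here is a minimal sketch of an item pipeline that writes every item to a JSON-lines file. The class name JsonWriterPipeline, the output file items.json, and the package name myproject are placeholders, not part of any project shown here.

# pipelines.py -- minimal item pipeline sketch
import json

class JsonWriterPipeline(object):
    # Called once when the spider is opened: open the output file
    def open_spider(self, spider):
        self.file = open('items.json', 'w', encoding='utf-8')

    # Called for every item the spider yields: store it and pass it on
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    # Called once when the spider is closed: release the resource
    def close_spider(self, spider):
        self.file.close()

# settings.py -- register the pipeline ('myproject' is a placeholder for your project name)
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}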

Two: Proxy and cookie

1: Cookie operation

When Scrapy requests a secondary sub-page, it automatically carries the cookies set by earlier responses, so no manual cookie handling is needed.
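A minimal sketch of what this looks like in practice, assuming a hypothetical login page: the POST login response sets a session cookie, and Scrapy's built-in CookiesMiddleware attaches it to the follow-up request for the sub-page automatically. The URLs and form field names below are illustrative only.

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['https://example.com/login']  # hypothetical login page

    def parse(self, response):
        # POST the login form; the response sets a session cookie
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},  # hypothetical fields
            callback=self.after_login,
        )

    def after_login(self, response):
        # No manual cookie handling here: the CookiesMiddleware carries the
        # session cookie on this secondary request automatically
        yield scrapy.Request('https://example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        print(response.text)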

2: scrapy initiates a post request

The first method: override the start_requests method and set the method attribute of scrapy.Request to POST. This method is not recommended.

# -*- coding: utf-8 -*-
import scrapy


class App01Spider(scrapy.Spider):
    # Crawler name: used to locate which crawler file to execute
    name = 'app01'
    # allowed_domains = ['www.baidu.com']  # Allowed domains: only pages under these domains are crawled; may be commented out
    start_urls = ['https://fanyi.baidu.com/sug']  # Start URL
    # Override start_requests and change the method attribute to POST
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, method='POST')
    # Parse method: extracts the specified data from the page content; called once per response
    def parse(self, response):
        print(response.text)
    # Called after the spider finishes; often used to release resources
    def closed(self, reason):
        pass

The second method: override the start_requests method and use scrapy.FormRequest() to send the POST request (recommended).

# -*- coding: utf-8 -*-
import scrapy


# Crawl Baidu translation results
class App01Spider(scrapy.Spider):
    # Crawler name: used to locate which crawler file to execute
    name = 'app01'
    # allowed_domains = ['www.baidu.com']  # Allowed domains: only pages under these domains are crawled; may be commented out
    start_urls = ['https://fanyi.baidu.com/sug']  # Start URL
    # Override start_requests and send a POST request with form data
    def start_requests(self):
        data = {
            'kw': 'man'
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)
    # Parse method: extracts the specified data from the page content; called once per response
    def parse(self, response):
        print(response.text)
    # Called after the spider finishes; often used to release resources
    def closed(self, reason):
        pass

 3: Proxy operation

Principle: when the scheduler hands a queued request to the downloader, the request passes through the downloader middleware, which can intercept the request and change the IP it is sent from by setting a proxy.

Step 1: Define a custom middleware class in the middlewares.py file and override the process_request() method.

class myProxy_Middleware(object):

    def process_request(self, request, spider):
        # Change the outgoing IP: pick a proxy IP from a free proxy site; note the protocol prefix
        request.meta['proxy'] = 'http://120.76.77.152:9999'

Step 2: Modify the settings.py file and register the middleware class you wrote.

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.myProxy_Middleware': 543,  # 'myproject' is a placeholder for your project name
}

Three: log level

  error: errors
  warning: warnings
  info: general information
  debug: debugging information
  Set the log printing level in settings.py:
  --Add LOG_LEVEL = 'ERROR' anywhere in the file
  --Write the log output to a specified file with LOG_FILE = 'log.txt'
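Putting the two settings together, a settings.py fragment might look like this (log.txt is just the file name used above):

# settings.py
LOG_LEVEL = 'ERROR'   # only print messages at ERROR level and above
LOG_FILE = 'log.txt'  # write the log to this file instead of the console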

Four: Passing parameters between requests

--To handle the case where the data you need is not all on the same page, set the meta parameter of the scrapy.Request() method.

--Passing the parameter

yield scrapy.Request(url, callback=self.parse, meta={'item': item})

--Retrieving it

response.meta.get('item')
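A minimal sketch of the two halves together, assuming a hypothetical list page whose remaining data lives on a detail page; the spider name, URLs, selectors and item fields are illustrative only.

import scrapy


class PageSpider(scrapy.Spider):  # hypothetical spider
    name = 'page_demo'
    start_urls = ['https://example.com/list']  # hypothetical list page

    def parse(self, response):
        for row in response.xpath('//div[@class="row"]'):  # hypothetical selector
            item = {'title': row.xpath('./a/text()').get()}
            detail_url = response.urljoin(row.xpath('./a/@href').get())
            # Pass the partially filled item to the next callback via meta
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # Retrieve the item that was attached to the request
        item = response.meta.get('item')
        item['content'] = response.xpath('//div[@class="content"]/text()').get()
        yield item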

Five: CrawlSpider

1: Problem: crawl all the data of the Chouti site (dig.chouti.com).

Solution 1: Send every request manually (inconvenient).

Solution 2: Use CrawlSpider (recommended). It is a subclass of Spider with more powerful features (a link extractor and a rule parser), so it is more capable.

2: Create a CrawlSpider project

  1: Create project: scrapy startproject xxxxxx

  2: Create the spider file: scrapy genspider -t crawl file_name url

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.chouti.com']
    start_urls = ['https://dig.chouti.com/']
    '''
      1: Create a link extractor object
         --allow: a regular expression for the page URLs to extract
         --The link extractor extracts the URLs matching the regular expression from the specified page
         --All extracted URL links are handed to the rule parser
    '''
    link = LinkExtractor(allow=r'all/hot/recent/\d+')

    '''
    2: Instantiate a rule parser object
      --callback: the function that parses the response page returned for each extracted link
      --follow: whether to keep applying the link extractor to the pages it extracted, to extract further matching URLs
         --This may produce many duplicate pages, but they are deduplicated automatically
    '''
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )
    # Callback
    def parse_item(self, response):
        item = {}
        print(response)
        return item

