One: Introduction to the core components of Scrapy
1: Engine (Scrapy): handles the data flow between all the other components and triggers events (the core of the framework)
2: Scheduler: queues the page URLs that need to be crawled (URLs are deduplicated automatically) and hands the next request back to the engine when asked
3: Downloader: downloads page content and returns it to the spider (Scrapy is built on the Twisted asynchronous model)
4: Spider: extracts the data (items) from the responses
5: Pipelines: store the extracted data
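A minimal sketch of how the last two components divide the work (the site, selectors and the names QuotesSpider / TextFilePipeline are illustrative, not from the original notes): the spider only extracts items, and a pipeline registered in ITEM_PIPELINES only stores them.
# Hypothetical example: the spider extracts, the pipeline stores
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes_demo'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Spider: data extraction only; each yielded dict is an item
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

class TextFilePipeline(object):
    # Pipeline: data storage only; enable it in settings.py, e.g.
    # ITEM_PIPELINES = {'myproject.pipelines.TextFilePipeline': 300}
    def open_spider(self, spider):
        self.f = open('quotes.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(item['text'] + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()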
Two: Proxy and cookie
1: Cookie operation
When Scrapy requests a secondary sub-page, it automatically carries the cookies set by the earlier responses, so no manual cookie handling is needed for the follow-up request.
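A minimal sketch of this behaviour (the login URL and form field names are placeholders, not a real site): the second request carries the session cookies from the login response automatically, because Scrapy's built-in CookiesMiddleware is enabled by default; setting COOKIES_DEBUG = True in settings.py prints the cookies being sent and received.
import scrapy

class LoginSpider(scrapy.Spider):
    # Hypothetical spider: example.com and the form fields are placeholders
    name = 'login_demo'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # POST the login form; the Set-Cookie headers of the response are
        # stored by the built-in CookiesMiddleware (enabled by default)
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # This secondary request automatically carries the session cookies;
        # no Cookie header has to be set by hand
        yield scrapy.Request('https://example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        print(response.text)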
2: Sending a POST request with Scrapy
Method 1: override the start_requests method and change the method attribute of scrapy.Request to POST. This approach is not recommended.
# -*- coding: utf-8 -*-
import scrapy


class App01Spider(scrapy.Spider):
    # Spider name: used to decide which spider file gets executed
    name = 'app01'
    # allowed_domains = ['www.baidu.com']  # only pages under these domains may be crawled; can be left commented out
    start_urls = ['https://fanyi.baidu.com/sug']  # start URLs

    # Override start_requests and change the method attribute to POST
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, method='POST')

    # Parse method: runs once per request and extracts data from the page content
    def parse(self, response):
        print(response.text)

    # Called after the spider finishes; often used to close resources
    def closed(self, reason):
        pass
Method 2: override the start_requests method and send the POST request with FormRequest() (recommended).
# -*- coding: utf-8 -*-
import scrapy


# Crawl Baidu Translate results
class App01Spider(scrapy.Spider):
    # Spider name: used to decide which spider file gets executed
    name = 'app01'
    # allowed_domains = ['www.baidu.com']  # only pages under these domains may be crawled; can be left commented out
    start_urls = ['https://fanyi.baidu.com/sug']  # start URLs

    # Override start_requests and send the request as a POST via FormRequest
    def start_requests(self):
        data = {'kw': 'man'}
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    # Parse method: runs once per request and extracts data from the page content
    def parse(self, response):
        print(response.text)

    # Called after the spider finishes; often used to close resources
    def closed(self, reason):
        pass
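FormRequest is the preferred route because it URL-encodes the formdata dict into the request body and switches the request method to POST automatically, so nothing has to be set by hand.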
3: Proxy operation
Principle: when the scheduler hands a queued request to the downloader, the request passes through the downloader middleware. The middleware can intercept the request and modify it, for example replacing the IP (proxy) it is sent through.
Step 1: define a custom middleware class in the middlewares.py file and override its process_request() method
class MyProxy_Middleware(object):
    def process_request(self, request, spider):
        # Replace the outgoing IP: pick a usable IP from a free proxy site.
        # Note that the protocol scheme (http://) must be included.
        request.meta['proxy'] = 'http://120.76.77.152:9999'
Step 2: modify the settings.py file and register the middleware class you wrote.
DOWNLOADER_MIDDLEWARES = {
    # 'myproject' stands for your project package name
    'myproject.middlewares.MyProxy_Middleware': 543,
}
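The key must be the full dotted import path of the class inside your project package, and the number (1 to 1000) sets the middleware order: middlewares with smaller numbers run closer to the engine, larger ones closer to the downloader.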
Three: Log levels
ERROR: errors
WARNING: warnings
INFO: general information
DEBUG: debugging information
Set the log print level in settings.py:
–Add LOG_LEVEL = 'ERROR' anywhere in the file
–Write the log output to a specified file: LOG_FILE = 'log.txt'
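As a small sketch of how the levels show up in practice (the spider name and URL are illustrative), a spider can emit messages through its built-in self.logger; with LOG_LEVEL = 'ERROR' only the last call below is printed.
import scrapy

class LogDemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate log levels
    name = 'log_demo'
    start_urls = ['https://example.com/']

    def parse(self, response):
        self.logger.debug('debugging information')  # shown when LOG_LEVEL = 'DEBUG'
        self.logger.info('general information')     # shown at INFO and below
        self.logger.warning('a warning')            # shown at WARNING and below
        self.logger.error('an error')               # still shown with LOG_LEVEL = 'ERROR'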
Four: Passing parameters between requests
–Use case: the data you need is not all on the same page. Pass it between requests with the meta argument of the scrapy.Request() method.
–Passing the parameter:
yield scrapy.Request(url, callback=self.parse, meta={'item': item})
–Retrieving it in the callback:
response.meta.get('item')
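A minimal sketch putting both sides together (the URLs, CSS selectors and field names are placeholders): the list page fills part of the item, then passes it to the detail-page callback through meta.
import scrapy

class DetailSpider(scrapy.Spider):
    # Hypothetical spider: example.com and the selectors are placeholders
    name = 'detail_demo'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        for row in response.css('div.row'):
            item = {'title': row.css('a::text').get()}
            detail_url = response.urljoin(row.css('a::attr(href)').get())
            # Pass the half-filled item to the next callback via meta
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # Retrieve the item carried over from the previous request
        item = response.meta.get('item')
        item['content'] = response.css('div.content::text').get()
        yield item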
Five: CrawlSpider
1: Problem: crawl all of the data on the Chouti site (dig.chouti.com).
Solution 1: send each page request manually (inconvenient)
Solution 2: use CrawlSpider (recommended). It is a subclass of Spider with extra machinery (a link extractor and a rule parser), which makes it much better suited to full-site crawling.
2: Creating a CrawlSpider project
–Create the project: scrapy startproject xxxxxx
–Create the spider file: scrapy genspider -t crawl file_name url
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # 1: Create a link extractor object
    #    --allow: a regular expression for the page URLs to extract
    #    --the link extractor pulls every URL matching the expression out of the given page
    #    --all extracted URL links are handed to the rule parser
    link = LinkExtractor(allow=r'all/hot/recent/\d+')

    # 2: Instantiate a rule parser object
    #    --callback: the function that parses each result page that comes back
    #    --follow: whether to keep applying the link extractor to the pages it just extracted,
    #      i.e. keep following URLs that match the rule
    #    --many duplicate pages may show up, but they are deduplicated automatically
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    # Callback
    def parse_item(self, response):
        item = {}
        print(response)
        return item
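Note: with CrawlSpider the rule callback must not be named parse, because CrawlSpider uses parse internally to drive the rules; parse_item (or any other name) is fine. The spider is then run as usual with scrapy crawl chouti.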