Introduction:
When browsing the web, we often find that some websites regularly add a new batch of data on top of their existing pages. For example, a movie website may publish a batch of recently popular movies, and a novel website may publish the latest chapters as the author writes them. When we run into this situation while crawling, do we need to update the program regularly just to crawl the most recently updated data on the website?
1. Incremental crawler
- Concept: use a crawler program to monitor a website's data updates, so that only the newly updated data is crawled.
- How to perform incremental crawling:
- Before sending a request, check whether the URL has already been crawled
- After parsing the content, check whether this piece of content has already been crawled
- When writing to the storage medium, check whether the content already exists in that medium
- Analysis:
It is not hard to see that the core of incremental crawling is really de-duplication. As for which step the de-duplication should be performed at, each option has its own advantages and disadvantages. In my opinion, the first two ideas should be chosen according to the actual situation (they may also be combined). The first idea suits websites where new pages keep appearing, such as new chapters of a novel or the news of the day; the second idea suits websites whose existing page content gets updated. The third idea acts as the last line of defense and achieves de-duplication to the greatest possible extent.
- Deduplication methods:
- Store the URLs produced during crawling in a Redis set. On the next crawl, before issuing a request, check whether its URL already exists in the set of stored URLs: if it does, skip the request; otherwise, send it.
- Generate a unique identifier for each piece of crawled content and store it in a Redis set. On the next crawl, before persisting a record, check whether its identifier already exists in the set, and only then decide whether to store it. A minimal sketch of both methods follows this list.
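Below is a minimal sketch of both deduplication methods using plain redis-py, assuming a local Redis instance on the default port; the set names ('crawled_urls', 'crawled_fingerprints') and the helper functions are illustrative and not part of the projects shown later:

import hashlib

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)


def should_request(url):
    # sadd returns 1 if the member was newly added and 0 if it already existed,
    # so a single call both records the URL and tells us whether it is new.
    return conn.sadd('crawled_urls', url) == 1


def should_store(record_text):
    # Fingerprint the content itself, so that the same text reached through a
    # different URL is still recognized as a duplicate.
    fingerprint = hashlib.sha256(record_text.encode('utf-8')).hexdigest()
    return conn.sadd('crawled_fingerprints', fingerprint) == 1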
2. Project case
- Requirement: crawl all of the movie detail data from the 4567tv website.
Crawler file:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementPro.items import IncrementproItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),
    )
    # Create a Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            # Get the URL of the detail page
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # Store the detail page URL in the Redis set; sadd returns 1 only if it was not there yet
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This url has not been crawled yet, its data can be crawled')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('The data has not been updated yet, there is no new data to crawl!')

    # Parse the movie name and type from the detail page for persistent storage
    def parse_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item
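The spider imports IncrementproItem from incrementPro.items, but the items file itself is not shown in the original text. A minimal sketch, inferred from the two fields the spider assigns, might look like this:

# incrementPro/items.py
import scrapy


class IncrementproItem(scrapy.Item):
    name = scrapy.Field()
    kind = scrapy.Field()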
Pipeline file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from redis import Redis


class IncrementproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'kind': item['kind'],
        }
        print(dic)
        # redis-py 3.x only accepts str/bytes/numbers, so serialize the dict before pushing it
        self.conn.lpush('movieData', json.dumps(dic))
        return item
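As the generated comment in the pipeline file reminds us, the pipeline only runs if it is registered in the project's settings.py. A typical entry is shown below; the priority value 300 is the conventional default and is not taken from the original text:

# incrementPro/settings.py (excerpt)
ITEM_PIPELINES = {
    'incrementPro.pipelines.IncrementproPipeline': 300,
}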
- Requirement: crawl the joke posts and author data from the Qiushibaike website.
Crawler file:
# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementByDataPro.items import IncrementbydataproItem


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )
    # Create a Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = IncrementbydataproItem()
            item['author'] = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            # Generate a unique identifier from the parsed data for Redis storage
            source = item['author'] + item['content']
            source_id = hashlib.sha256(source.encode()).hexdigest()
            # Store the identifier in the Redis set 'data_id'; sadd returns 1 only if it is new
            ex = self.conn.sadd('data_id', source_id)
            if ex == 1:
                print('This piece of data has not been crawled yet, it can be crawled...')
                yield item
            else:
                print('This piece of data has already been crawled, there is no need to crawl it again!!!')
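As in the first project, IncrementbydataproItem is imported but its items file is not shown. A minimal sketch based on the author and content fields used above:

# incrementByDataPro/items.py
import scrapy


class IncrementbydataproItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()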
Pipeline file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from redis import Redis


class IncrementbydataproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content'],
        }
        # print(dic)
        # redis-py 3.x only accepts str/bytes/numbers, so serialize the dict before pushing it
        self.conn.lpush('qiubaiData', json.dumps(dic))
        return item
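To check what the pipeline has stored, the Redis list can be read back with redis-py, for example as follows; the key 'qiubaiData' matches the one used in the pipeline above:

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

# lpush prepends, so the most recently stored record comes first;
# lrange returns the raw bytes that were pushed.
for raw in conn.lrange('qiubaiData', 0, -1):
    print(raw.decode('utf-8'))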