Introduction:
When browsing the web, we often find that some websites regularly add a new batch of data on top of their existing pages. For example, a movie website may update a batch of recently popular movies in real time, and a novel website may publish the latest chapters as the author's writing progresses. So when we run into a similar situation while crawling, do we need to update the program regularly so that we can crawl the data the website has most recently added?
1. Incremental crawler
- Concept: use a crawler program to monitor a website's data updates, so that any newly added or updated data can be crawled.
- How to perform incremental crawling:
- Before sending the request, determine whether the URL has been crawled before
- After parsing the content, determine whether this content has been crawled before
- When writing to the storage medium, determine whether the content already exists in the medium
- Analysis:
It is not hard to see that the core of incremental crawling is de-duplication. As for where to perform the de-duplication step, each option has its own advantages and disadvantages. In my opinion, the first two ideas should be chosen according to the actual situation (or both may be used together). The first idea suits websites where new pages keep appearing, such as new chapters of a novel or the latest daily news; the second idea suits websites whose existing page content gets updated. The third idea acts as the last line of defense and achieves de-duplication to the greatest possible extent.
- De-duplication methods:
- Store every URL generated during crawling in a Redis set. On the next crawl, before sending a request, check whether its URL already exists in the set of stored URLs: if it does, skip the request; otherwise, send it.
- Define a unique identifier (a data fingerprint) for the crawled page content and store it in a Redis set. On the next crawl, before persisting the data, first check whether its unique identifier already exists in the set, and then decide whether to persist it. A minimal sketch of both methods is shown right after this list.
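The following sketch illustrates both de-duplication methods, assuming a local Redis instance at 127.0.0.1:6379; the key names seen_urls and seen_fingerprints and the sample values are only illustrative and are not part of the projects below.

import hashlib
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def is_new_url(url):
    # sadd returns 1 if the member was added (not seen before) and 0 if it already existed
    return conn.sadd('seen_urls', url) == 1

def is_new_content(author, content):
    # build a data fingerprint from the record's fields and de-duplicate on it
    fingerprint = hashlib.sha256((author + content).encode('utf-8')).hexdigest()
    return conn.sadd('seen_fingerprints', fingerprint) == 1

if is_new_url('https://example.com/page/1'):
    print('URL not crawled before, send the request')
if is_new_content('some author', 'some content'):
    print('New content, persist it')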
2. Project case
- Requirement 1: crawl all of the movie detail data from the 4567tv website.
- Crawler file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from movieIncrement.items import MovieincrementItem

class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/id/7.html']

    rules = (
        Rule(LinkExtractor(allow=r'/index.php/vod/show/id/7/page/\d+\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        conn = Redis(host='127.0.0.1', port=6379)
        detail_url_list = response.xpath(
            '//li[@class="col-md-6 col-sm-4 col-xs-3"]/div/a/@href').extract()
        for url in detail_url_list:
            url = 'https://www.4567tv.tv' + url
            # ex == 1: the url was not yet in the set, i.e. it has not been crawled before
            ex = conn.sadd('movies_url', url)
            if ex == 1:
                yield scrapy.Request(url=url, callback=self.parse_detail)
            else:
                print('The website has no updated data; there is no new data to crawl!')

    def parse_detail(self, response):
        item = MovieincrementItem()
        item['name'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        item['actor'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        yield item
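Note that Redis sadd returns the number of members actually added to the set, so a return value of 1 means the detail URL has never been seen and is worth requesting, while 0 means it was already crawled in a previous run. Because the set lives in Redis rather than in the program's memory, the de-duplication survives restarts, which is what makes the crawl incremental.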
- Pipeline file:
from redis import Redis

class MovieincrementPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        print('New data has been crawled and is being stored...')
        # a scrapy Item cannot be pushed to Redis directly, so store a string form of it
        self.conn.lpush('movie_data', str(dict(item)))
        return item
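As a quick check, the stored records can be read back from the Redis list; this is a minimal sketch assuming the same local Redis instance and the movie_data key used above.

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
# lrange returns the raw bytes that the pipeline pushed; lpush prepends, so index 0 is the newest record
for raw in conn.lrange('movie_data', 0, -1):
    print(raw.decode('utf-8'))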
- Requirement 2: crawl the post content and author data from the Qiushibaike website.
- Crawler file:
import hashlib

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from increment2_Pro.items import Increment2ProItem

class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
        conn = Redis(host='127.0.0.1', port=6379)
        for div in div_list:
            item = Increment2ProItem()
            item['content'] = div.xpath('.//div[@class="content"]/span//text()').extract()
            item['content'] = ''.join(item['content'])
            item['author'] = div.xpath('./div/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            # generate a data fingerprint from the author and content
            source = item['author'] + item['content']
            hashValue = hashlib.sha256(source.encode()).hexdigest()
            # ex == 1: this fingerprint has not been stored before
            ex = conn.sadd('qiubai_hash', hashValue)
            if ex == 1:
                yield item
            else:
                print('No new data to crawl!')
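Here the list pages themselves are re-crawled on every run, so de-duplicating on URLs alone would not help; instead, a SHA-256 fingerprint of each post's author and content is stored in a Redis set, and an item is yielded only when the post is new or its content has actually changed. This corresponds to the second idea from the analysis above.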
- Pipeline file:
from redis import Redis

class Increment2ProPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'Author': item['author'],
            'Content': item['content']
        }
        # Redis list values must be bytes, strings, or numbers, so push the record as a string
        self.conn.lpush('qiubaiData', str(dic))
        print('A piece of data was crawled and is being stored...')
        return item
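Pushing a plain dict to a Redis list fails with recent versions of redis-py, because list values must be bytes, strings, or numbers; that is why the pipeline above stores a string form of the record. A common alternative, shown here only as a sketch with sample values, is to serialize the record as JSON so that it can be parsed back into a dict later:

import json
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
dic = {'Author': 'some author', 'Content': 'some content'}
# store the record as a JSON string so that it can be parsed back later
conn.lpush('qiubaiData', json.dumps(dic, ensure_ascii=False))
# read the newest record back and decode it
print(json.loads(conn.lrange('qiubaiData', 0, 0)[0]))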