Distributed crawler
Use scrapy_redis.
Dupefilter (removing duplicate requests):
request_fingerprint() computes a request fingerprint:
it hashes request.method, request.url, request.headers and request.body with hashlib.sha1,
producing a 40-character hexadecimal string. The drawback is that storing every fingerprint takes a lot of memory.
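A minimal sketch of how such a fingerprint can be computed, assuming only the method, URL and body are hashed (Scrapy's real implementation in scrapy.utils.request also canonicalizes the URL and only includes headers when asked to):

import hashlib


def request_fingerprint(method, url, body=b''):
    # Sketch only: feed the request's parts into SHA-1 and return the hex digest.
    fp = hashlib.sha1()
    fp.update(method.encode('utf-8'))
    fp.update(url.encode('utf-8'))
    fp.update(body or b'')
    return fp.hexdigest()  # 40 hex characters per request


print(request_fingerprint('GET', 'https://example.com/page?id=1'))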
Optimization:
Replace the default dupefilter in scrapy_redis with a BloomFilter-based one:
a Bloom filter deduplicates with far less memory and is faster (see the sketch below).
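A minimal sketch of a Bloom filter backed by a Redis bitmap (SETBIT/GETBIT); the class name, key name and sizing below are illustrative assumptions, not part of scrapy_redis. A custom dupefilter's request_seen() would call seen()/add() with the request fingerprint, so the memory cost becomes one fixed-size bitmap instead of 40 bytes per fingerprint:

import hashlib

from redis import Redis


class RedisBloomFilter:
    # Sketch of a Bloom filter on a Redis bitmap; false positives are possible,
    # false negatives are not.
    def __init__(self, server=None, key='dupefilter:bloom', bit_size=1 << 30, hash_count=6):
        self.server = server or Redis()
        self.key = key
        self.bit_size = bit_size      # 2**30 bits ~= 128 MB bitmap
        self.hash_count = hash_count

    def _offsets(self, fingerprint):
        # Derive several bit positions from the fingerprint.
        for i in range(self.hash_count):
            digest = hashlib.md5(f'{i}:{fingerprint}'.encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.bit_size

    def seen(self, fingerprint):
        # Seen only if every derived bit is already set.
        return all(self.server.getbit(self.key, off) for off in self._offsets(fingerprint))

    def add(self, fingerprint):
        for off in self._offsets(fingerprint):
            self.server.setbit(self.key, off, 1)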
1. Use the following settings in the project:
# Enable scheduling of the request queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make sure all spiders share the same duplicate filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store the scraped items in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

For detailed usage see: https://github.com/rmax/scrapy-redis
2. Usage:
import scrapy
from scrapy_redis.spiders import RedisSpider  # use RedisSpider

from reading_network.items import ReadingNetworkItem


# Modify the spider generated from the basic template:
class GuoxueSpider(RedisSpider):  # change the base class to RedisSpider
    name = 'guoxue'
    # The redis key holding the starting URLs is 'start_urls'; its values are the starting URLs.
    redis_key = 'start_urls'
Note: the command to run the spider is different from before:
cd into the spiders directory
scrapy runspider guoxue.py  # the spider file
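The spider then waits until a starting URL is pushed onto the redis_key list. A quick way to do that from Python (the URL below is a placeholder for the real entry page); the redis-cli equivalent is lpush start_urls <url>:

from redis import Redis

# Push a starting URL onto the 'start_urls' list the spider is listening on.
Redis(host='localhost', port=6379).lpush('start_urls', 'http://www.example.com/')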
3. Data persistence: store the data in MySQL
import json

import pymysql
from redis import Redis


# Take the items out of redis and save them to MySQL.
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='450502',
                       charset='utf8', database='spider1')
cursor = conn.cursor()
redis_conn = Redis()


def insert_mysql():
    while 1:
        # Get one item from redis.
        # brpop blocks for up to 60 s; it returns a (key, value) tuple such as
        # (b'guoxue:items', b'{"name": "xxx"}'), or None on timeout, in which case
        # the unpacking below raises and the loop ends.
        try:
            _, data = redis_conn.brpop('guoxue:items', timeout=60)
        except Exception:
            cursor.close()
            conn.close()
            break

        # Convert the JSON bytes into a Python dict.
        data = json.loads(data)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))  # '%s, %s, %s'
        sql = f'insert into guoxue({keys}) values({values})'
        try:
            cursor.execute(sql, tuple(data.values()))  # the tuple fills in the %s placeholders
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()


if __name__ == '__main__':
    insert_mysql()
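The script above assumes a guoxue table already exists in the spider1 database. A hedged sketch of creating it with pymysql; the column names are placeholders and must be adjusted to match the fields of ReadingNetworkItem:

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='450502',
                       charset='utf8', database='spider1')
with conn.cursor() as cursor:
    # Placeholder columns -- replace with the actual item fields.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS guoxue (
            id INT PRIMARY KEY AUTO_INCREMENT,
            title VARCHAR(255),
            content TEXT
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()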
Using it distributed (across machines)
1. Modify the configuration
-
On Windows, modify the redis configuration file redis.windows.conf and restart redis:
change bind 127.0.0.1 to bind 0.0.0.0
protected-mode no  # turn off protected mode
requirepass 123456  # other machines must supply this password for remote access
-
On Linux, modify the redis configuration file (redis.conf) and restart redis:
bind 0.0.0.0
requirepass 123456
Other machines connect to this redis with: redis-cli -h <redis server ip>
and then authenticate with the password: auth 123456
-
In settings, change the address of the shared redis server to:
REDIS_URL = 'redis://:123456@<redis server ip>:6379'  # the server's real IP, not 127.0.0.1
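A quick way to confirm that the shared server is reachable with that URL before launching the spiders (the IP below is a placeholder for the real server address):

import redis

# Same URL format as REDIS_URL in settings; replace 192.168.1.100 with the redis server's IP.
conn = redis.from_url('redis://:[email protected]:6379')
print(conn.ping())  # True if the connection and password are accepted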
2. Usage:
Basically the same as above, except the spider is based on the crawl template and inherits from RedisCrawlSpider:
import scrapy
from scrapy_redis.spiders import RedisCrawlSpider  # use RedisCrawlSpider

from reading_network.items import ReadingNetworkItem


# Modify the spider generated from the crawl template:
class GuoxueSpider(RedisCrawlSpider):  # change the base class to RedisCrawlSpider
    name = 'guoxue'
    # The redis key holding the starting URLs is 'start_urls'; its values are the starting URLs.
    redis_key = 'start_urls'
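A fuller sketch of what such a spider might look like with crawl rules attached; the link pattern, callback and item field below are illustrative placeholders, not taken from the actual project:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from reading_network.items import ReadingNetworkItem


class GuoxueSpider(RedisCrawlSpider):
    name = 'guoxue'
    redis_key = 'start_urls'  # starting URLs are pushed onto this redis list

    # Placeholder rule: follow links whose URL contains 'guoxue' and parse each page.
    rules = (
        Rule(LinkExtractor(allow=r'guoxue'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = ReadingNetworkItem()
        # Placeholder extraction -- adjust the selector to the real page layout.
        item['title'] = response.css('title::text').get()
        return item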