Distributed crawler

Use scrapy_redis

dupefilter (request deduplication):

request_fingerprint(): the request fingerprint

hashlib.sha1 is used to hash request.method, request.url, request.body (and optionally the headers) into a 40-character hexadecimal string; the drawback is that storing all these fingerprints takes a lot of memory.
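
For illustration, a minimal sketch of the idea (not Scrapy's exact code, which lives in scrapy.utils.request and canonicalizes the URL first):

import hashlib

def request_fingerprint_sketch(request):
    # Rough illustration only: hash the parts of the request that identify it
    # and return the SHA1 digest as a 40-character hexadecimal string.
    fp = hashlib.sha1()
    fp.update(request.method.encode('utf-8'))
    fp.update(request.url.encode('utf-8'))
    fp.update(request.body or b'')
    return fp.hexdigest()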

Optimization:

Replace the default dupefilter in scrapy_redis:

Use a BloomFilter for deduplication instead; it needs far less memory and is faster.
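
A minimal, self-contained sketch of the Bloom filter idea is shown below; in a real scrapy_redis setup the bit array would live in Redis (via SETBIT/GETBIT) so that every worker shares the same filter, usually through a ready-made Redis-backed implementation rather than this toy in-memory version.

import hashlib

class BloomFilterSketch:
    # Toy in-memory Bloom filter, for illustration only.
    def __init__(self, size=1 << 24, hash_count=6):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size // 8)

    def _positions(self, value):
        # Derive hash_count bit positions from the value.
        for seed in range(self.hash_count):
            digest = hashlib.sha1(f'{seed}:{value}'.encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

A fingerprint is considered seen only if all of its bit positions are set, so false negatives cannot happen; the trade-off is a small false-positive rate that depends on the filter size and number of hashes.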

1. Use the following settings in the project:

# Enable scheduling of requests through a queue stored in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicate filter through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store the scraped items in Redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Specify the Redis host and port to connect to (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

For detailed usage, see: https://github.com/rmax/scrapy-redis
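
Per the scrapy-redis README, one more setting is often useful in a distributed setup: keep the Redis queues when a spider closes so that a crawl can be paused and resumed.

# Do not clean up the Redis queues when the spider closes; allows pausing/resuming crawls.
SCHEDULER_PERSIST = True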

2. Usage

import scrapy
from scrapy_redis.spiders import RedisSpider  # use RedisSpider

from reading_network.items import ReadingNetworkItem


# Modifications made on top of a spider created from the basic template:
class GuoxueSpider(RedisSpider):  # inherit from RedisSpider instead of scrapy.Spider
    name = 'guoxue'
    # Key in Redis that holds the start URLs; its value is the starting url.
    redis_key = 'start_urls'

Note: the command used to run the spider is different from before:

cd into the spiders directory

scrapy runspider guoxue.py   (the spider file)
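
After starting, the spider idles until a start URL is pushed to its redis_key. For example, from redis-cli (the URL below is only a placeholder):

lpush start_urls https://www.example.com/start-page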

3. Data persistence: store the items in MySQL

import json

import pymysql
from redis import Redis

# Take the items out of Redis and save them to MySQL
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='450502',
                       charset='utf8', database='spider1')
cursor = conn.cursor()
redis_conn = Redis()


def insert_mysql():
    while 1:
        # Get one item from Redis. brpop blocks for up to 60 s and returns a pair
        # such as (b'guoxue:items', b'{"name": "xxx"}'); when nothing arrives before
        # the timeout it returns None, so the unpacking raises and we stop.
        try:
            _, data = redis_conn.brpop('guoxue:items', timeout=60)
        except Exception:
            cursor.close()
            conn.close()
            break

        # Convert the JSON bytes into a Python dict
        data = json.loads(data)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))  # '%s,%s,%s'
        sql = f'insert into guoxue({keys}) values({values})'
        try:
            cursor.execute(sql, tuple(data.values()))  # the tuple fills in the %s placeholders
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()


if __name__ == '__main__':
    insert_mysql()
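
The script above assumes that a guoxue table already exists with columns matching the item's fields. A minimal one-off sketch for creating it is shown below; the column names are placeholders and should be adjusted to the real ReadingNetworkItem fields.

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='450502',
                       charset='utf8', database='spider1')
with conn.cursor() as cursor:
    # Placeholder columns; replace them with the actual item fields.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS guoxue (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            content TEXT
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()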


Distributed usage across multiple machines

1. Modify the configuration

  • On Windows, modify the Redis configuration file redis.windows.conf, then restart Redis:

    change bind 127.0.0.1 to bind 0.0.0.0

    protected-mode no   # turn off protected mode

    requirepass 123456   # other servers must supply this password for remote access

  • On Linux, modify the Redis configuration file redis.conf, then restart Redis:

      bind 0.0.0.0

      requirepass 123456

      Other machines connect to this Redis instance with: redis-cli -h <server ip>

      Then authenticate with: auth 123456

  • In settings.py, point the spiders at the shared Redis server:

REDIS_URL = 'redis://:123456@<server ip>:6379'   # use the Redis server's LAN IP, not 127.0.0.1

2. Usage

This is basically the same as above, except the spider is based on the crawl template and the inherited class is changed to RedisCrawlSpider.

import scrapy
from scrapy_redis.spiders import RedisCrawlSpider  # use RedisCrawlSpider

from reading_network.items import ReadingNetworkItem


# Modifications made on top of a spider created from the crawl template:
class GuoxueSpider(RedisCrawlSpider):  # inherit from RedisCrawlSpider instead of CrawlSpider
    name = 'guoxue'
    # Key in Redis that holds the start URLs; its value is the starting url.
    redis_key = 'start_urls'
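
For completeness, a minimal sketch of what such a spider can look like once crawl rules are added; the link-extractor pattern, the callback, and the item field used here are hypothetical placeholders, not part of the original project:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from reading_network.items import ReadingNetworkItem


class GuoxueSpider(RedisCrawlSpider):
    name = 'guoxue'
    redis_key = 'start_urls'

    # Hypothetical rule: follow pagination links and parse every matched page.
    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = ReadingNetworkItem()
        item['title'] = response.css('h1::text').get()  # placeholder field and selector
        yield item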
