Distributed crawler
Use scrapy_redis.
Dupefilter (removing duplicate requests):
request_fingerprint() computes a request fingerprint:
it hashes request.method, request.url, request.headers and request.body with hashlib.sha1,
producing a 40-character hexadecimal string. The drawback is that storing every fingerprint takes a lot of memory.
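A minimal sketch of how such a fingerprint can be computed, assuming only the method, URL and body are hashed (Scrapy's real implementation in scrapy.utils.request also canonicalizes the URL and only includes headers when asked to):

import hashlib


def request_fingerprint(method, url, body=b''):
    # Sketch only: feed the request's parts into SHA-1 and return the hex digest.
    fp = hashlib.sha1()
    fp.update(method.encode('utf-8'))
    fp.update(url.encode('utf-8'))
    fp.update(body or b'')
    return fp.hexdigest()  # 40 hex characters per request


print(request_fingerprint('GET', 'https://example.com/page?id=1'))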
Optimization:
Replace the default dupefilter in scrapy_redis with a BloomFilter-based one:
a Bloom filter deduplicates with far less memory and is faster (see the sketch below).
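A minimal sketch of a Bloom filter backed by a Redis bitmap (SETBIT/GETBIT); the class name, key name and sizing below are illustrative assumptions, not part of scrapy_redis. A custom dupefilter's request_seen() would call seen()/add() with the request fingerprint, so the memory cost becomes one fixed-size bitmap instead of 40 bytes per fingerprint:

import hashlib

from redis import Redis


class RedisBloomFilter:
    # Sketch of a Bloom filter on a Redis bitmap; false positives are possible,
    # false negatives are not.
    def __init__(self, server=None, key='dupefilter:bloom', bit_size=1 << 30, hash_count=6):
        self.server = server or Redis()
        self.key = key
        self.bit_size = bit_size      # 2**30 bits ~= 128 MB bitmap
        self.hash_count = hash_count

    def _offsets(self, fingerprint):
        # Derive several bit positions from the fingerprint.
        for i in range(self.hash_count):
            digest = hashlib.md5(f'{i}:{fingerprint}'.encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.bit_size

    def seen(self, fingerprint):
        # Seen only if every derived bit is already set.
        return all(self.server.getbit(self.key, off) for off in self._offsets(fingerprint))

    def add(self, fingerprint):
        for off in self._offsets(fingerprint):
            self.server.setbit(self.key, off, 1)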
1. Use the following settings in the project:
# Enable scheduling of the request queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make sure all spiders share the same duplicate filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store the scraped items in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

For detailed usage see: https://github.com/rmax/scrapy-redis
2. Usage:
import scrapy
from scrapy_redis.spiders import RedisSpider  # use RedisSpider

from reading_network.items import ReadingNetworkItem


# Modify the spider generated from the basic template:
class GuoxueSpider(RedisSpider):  # change the base class to RedisSpider
    name = 'guoxue'
    # The redis key holding the starting URLs is 'start_urls'; its values are the starting URLs.
    redis_key = 'start_urls'
Note: the command to run the spider is different from before:
cd into the spiders directory
scrapy runspider guoxue.py  # the spider file
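The spider then waits until a starting URL is pushed onto the redis_key list. A quick way to do that from Python (the URL below is a placeholder for the real entry page); the redis-cli equivalent is lpush start_urls <url>:

from redis import Redis

# Push a starting URL onto the 'start_urls' list the spider is listening on.
Redis(host='localhost', port=6379).lpush('start_urls', 'http://www.example.com/')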
3. Data persistence: store the data in MySQL
import json

import pymysql
from redis import Redis


# Take the items out of redis and save them to MySQL.
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='450502',
                       charset='utf8', database='spider1')
cursor = conn.cursor()
redis_conn = Redis()


def insert_mysql():
    while 1:
        # Get one item from redis.
        # brpop blocks for up to 60 s; it returns a (key, value) tuple such as
        # (b'guoxue:items', b'{"name": "xxx"}'), or None on timeout, in which case
        # the unpacking below raises and the loop ends.
        try:
            _, data = redis_conn.brpop('guoxue:items', timeout=60)
        except Exception:
            cursor.close()
            conn.close()
            break

        # Convert the JSON bytes into a Python dict.
        data = json.loads(data)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))  # '%s, %s, %s'
        sql = f'insert into guoxue({keys}) values({values})'
        try:
            cursor.execute(sql, tuple(data.values()))  # the tuple fills in the %s placeholders
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()


if __name__ == '__main__':
    insert_mysql()
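The script above assumes a guoxue table already exists in the spider1 database. A hedged sketch of creating it with pymysql; the column names are placeholders and must be adjusted to match the fields of ReadingNetworkItem:

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='450502',
                       charset='utf8', database='spider1')
with conn.cursor() as cursor:
    # Placeholder columns -- replace with the actual item fields.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS guoxue (
            id INT PRIMARY KEY AUTO_INCREMENT,
            title VARCHAR(255),
            content TEXT
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()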
Using it distributed (across machines)
1. Modify the configuration
-
On Windows, modify the redis configuration file redis.windows.conf and restart redis:
change bind 127.0.0.1 to bind 0.0.0.0
protected-mode no  # turn off protected mode
requirepass 123456  # other machines must supply this password for remote access
-
On Linux, modify the redis configuration file (redis.conf) and restart redis:
bind 0.0.0.0
requirepass 123456
Other machines connect to this redis with: redis-cli -h <redis server ip>
and then authenticate with the password: auth 123456
-
In settings, change the address of the shared redis server to:
REDIS_URL = 'redis://:123456@<redis server ip>:6379'  # the server's real IP, not 127.0.0.1
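A quick way to confirm that the shared server is reachable with that URL before launching the spiders (the IP below is a placeholder for the real server address):

import redis

# Same URL format as REDIS_URL in settings; replace 192.168.1.100 with the redis server's IP.
conn = redis.from_url('redis://:[email protected]:6379')
print(conn.ping())  # True if the connection and password are accepted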
2. Usage:
Basically the same as above, except the spider is based on the crawl template and inherits from RedisCrawlSpider:
import scrapy
from scrapy_redis.spiders import RedisCrawlSpider  # use RedisCrawlSpider

from reading_network.items import ReadingNetworkItem


# Modify the spider generated from the crawl template:
class GuoxueSpider(RedisCrawlSpider):  # change the base class to RedisCrawlSpider
    name = 'guoxue'
    # The redis key holding the starting URLs is 'start_urls'; its values are the starting URLs.
    redis_key = 'start_urls'
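A fuller sketch of what such a spider might look like with crawl rules attached; the link pattern, callback and item field below are illustrative placeholders, not taken from the actual project:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from reading_network.items import ReadingNetworkItem


class GuoxueSpider(RedisCrawlSpider):
    name = 'guoxue'
    redis_key = 'start_urls'  # starting URLs are pushed onto this redis list

    # Placeholder rule: follow links whose URL contains 'guoxue' and parse each page.
    rules = (
        Rule(LinkExtractor(allow=r'guoxue'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = ReadingNetworkItem()
        # Placeholder extraction -- adjust the selector to the real page layout.
        item['title'] = response.css('title::text').get()
        return item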