Distributed crawler: Bilibili comments

This was really necessary for a course, otherwise I would have given up halfway. After more than half a month of on-and-off bug hunting it suddenly all worked, which I am very pleased about. There are introductions to distributed crawlers online, but few real walkthroughs, and they all look alike: they only describe the project flow. Maybe those projects went smoothly; mine kept hitting bugs, so it is worth recording here for reference.

The scrapy-redis environment setup and the framework's workflow are not described here; there is already plenty of material about them online.

The main contents are:

1. How to write a distributed crawler

First create an ordinary Scrapy crawler and make sure it runs correctly; that is the foundation. Then modify it so it becomes a distributed crawler. In my project, requests are stored and allocated through a Redis database running on Linux, and the spider scrapes Bilibili comments. The project layout is as follows:

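The original screenshot is not reproduced here. Based on the module paths that appear in settings.py below, the layout is roughly (file names assumed):

    spider_parallel/
        scrapy.cfg
        spider_parallel/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                bilibili.py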

a. In spiders/bilibili.py, change the spider's base class from scrapy.Spider to RedisSpider.

Comment out: allowed_domains and start_urls

Add: redis_key = 'bilibili:start_urls'
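A minimal sketch of what the modified spider might look like (the class and callback names are my assumption; only the RedisSpider base class and redis_key are what the steps above require):

    from scrapy_redis.spiders import RedisSpider

    class BilibiliSpider(RedisSpider):
        name = 'bilibili'
        # allowed_domains = ['bilibili.com']   # commented out
        # start_urls = ['https://...']         # commented out
        redis_key = 'bilibili:start_urls'      # the spider blocks until a URL is pushed to this key

        def parse(self, response):
            # parsing of the comment JSON goes here (see the sketch in step 2.c below)
            pass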

b. Modify settings.py

# Redis database connection parameters
REDIS_HOST = 'remote ip'    # Windows setting (point at the Linux server)
# REDIS_HOST = 'localhost'  # Linux setting
REDIS_PORT = 6379
REDIS_PARAMS = {
    'password': '123456'    # needed when the Redis server has a password set
}

# Use the scrapy-redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# Use the scrapy-redis de-duplication mechanism
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Queue class used to order pending requests (priority queue, the scrapy-redis default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

# Persist the request queue and dupefilter in Redis so a crawl can be resumed;
# False clears them when the spider closes
SCHEDULER_PERSIST = False

DOWNLOADER_MIDDLEWARES = {
    'spider_parallel.middlewares.SpiderParallelDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'spider_parallel.pipelines.SpiderParallelPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
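With RedisPipeline enabled, scrapy-redis keeps its shared state in Redis under keys derived from the spider name. A quick way to peek at that state while the crawl runs (a sketch only; host and password are the values assumed above, and the key names are the scrapy-redis defaults as I understand them):

    import redis

    # connect with the same parameters as in settings.py
    r = redis.Redis(host='remote ip', port=6379, password='123456')

    # typical keys created for this spider:
    #   bilibili:start_urls  - the list we lpush start URLs into
    #   bilibili:requests    - the shared scheduling queue
    #   bilibili:dupefilter  - request fingerprints for de-duplication
    #   bilibili:items       - serialized items written by RedisPipeline
    for key in r.keys('bilibili:*'):
        print(key, r.type(key))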

2. The order in which to run the crawlers (the deepest pit)

a. First connect to the Redis database on Linux:

redis-cli
auth 123456

b. Run the spider on the server side:

scrapy crawl bilibili

At this point you can see that the program just waits for a start URL to be pushed into Redis.

c. Execute the following in the Redis command line:

lpush bilibili:start_urls https://api.bilibili.com/x/v2/reply?pn=1&type=1&oid=329437&sort=2

This link is the address of the comment API response: oid is the video's av number and pn is the page number. How do you find this address? That is basic crawler work: open the page, press F12, and search through the requests loaded in the Network panel.
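A sketch of how the spider's parse callback might walk this API. The JSON field names (data.replies, content.message, member.uname, like) and the paging logic are my assumptions about the response layout, not something specified above:

    import json
    import scrapy

    # inside the BilibiliSpider class from step 1.a
    def parse(self, response):
        data = json.loads(response.text).get('data', {})
        replies = data.get('replies') or []
        for reply in replies:
            yield {
                'user': reply['member']['uname'],        # commenter's name
                'content': reply['content']['message'],  # comment text
                'like': reply['like'],                   # number of likes
            }
        # follow the next page as long as this one returned comments
        if replies:
            current_pn = int(response.url.split('pn=')[1].split('&')[0])
            next_url = response.url.replace(f'pn={current_pn}', f'pn={current_pn + 1}')
            yield scrapy.Request(next_url, callback=self.parse)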

d. Run the spider on Windows:

scrapy crawl bilibili

If the order is wrong, one of the crawlers will sit there waiting and listening forever. Also, I originally wanted to give the two spiders different name values in bilibili.py so I could tell which crawler grabbed which comments. That does not work: the name must be the same on both machines. My way of distinguishing them is that the server-side crawler stores its results in the server's MongoDB database while the Windows crawler stores them in the Windows MongoDB database. You could also store everything in the same database and record which crawler produced each item, for example by writing an __init__() that calls the parent class and adds an identifying field.
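The MongoDB side is not shown in this post. A minimal sketch of what SpiderParallelPipeline in spider_parallel/pipelines.py could look like, assuming pymongo and a local MongoDB instance on each machine (the database and collection names are made up for illustration):

    import pymongo

    class SpiderParallelPipeline:
        def open_spider(self, spider):
            # each machine writes to its own local MongoDB instance
            self.client = pymongo.MongoClient('mongodb://localhost:27017')
            self.collection = self.client['bilibili']['comments']

        def process_item(self, item, spider):
            self.collection.insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.client.close()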

If everything is in order, you can see the two crawlers crawling and executing together.

The project has been uploaded to git.

Original work is not easy; please respect copyright and indicate the source when reprinting: http://www.cnblogs.com/xsmile/

