
Hands-On Implementation of a Scrapy-Redis Distributed Crawler: From Configuration to Running


1. Scrapy-Redis Environment Preparation

pip install scrapy-redis

Once installed, just make sure it can be imported and used properly.
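For example, a quick sanity check is to import the package and its main classes in a Python shell; if these imports succeed, the installation is working (the __version__ attribute is assumed to be present and may vary between releases):

# Sanity check: if these imports succeed, scrapy-redis is installed correctly.
import scrapy_redis
from scrapy_redis.scheduler import Scheduler
from scrapy_redis.dupefilter import RFPDupeFilter

print(scrapy_redis.__version__)  # prints the installed version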

2. Implementation

Next we can configure the distributed crawler in just a few simple steps.

2.1 Modifying the Scheduler

In the previous lessons we explained the concept of the Scheduler, which handles the scheduling logic for Requests. By default, the Request queue lives in memory. To make the crawler distributed, we need to move this queue into Redis, which means replacing the Scheduler. The change is very simple: just add the following line to the project settings:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

This replaces Scrapy's default Scheduler with the Scheduler class provided by Scrapy-Redis, so the Request queue is stored in Redis while the crawler runs.

2.2 Modifying Redis Connection Information

We also need to set the Redis connection information so that Scrapy-Redis can connect to the Redis database. The format is as follows:

REDIS_URL = 'redis://[user:pass]@hostname:9001'

Fill in this URL according to your own setup. Since my Redis instance runs locally without a username or password, I only need to set:

REDIS_URL = 'redis://localhost:6379'
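If your Redis instance is remote and password-protected, the URL would include the credentials; the host name and password below are placeholders for illustration, not real values:

# Hypothetical remote Redis with authentication (host and password are placeholders)
REDIS_URL = 'redis://:mypassword@redis.example.com:6379'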

2.3 Modifying the de-duplication class

Since the Request queue has been moved to Redis, the corresponding de-duplication also needs to move to Redis. In a previous lesson we explained how the DupeFilter works; here we change the de-duplication class to the Redis-based one:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

2.4 Configuring Persistence

Generally speaking, after switching to the Redis-based queue we do not want the whole queue and all the de-duplication fingerprints to be deleted when the crawler closes, because in many cases we will shut the crawler down manually or it may terminate unexpectedly. To solve this, we can make the Redis queue persistent with the following setting:

SCHEDULER_PERSIST = True

With that, the configuration of the distributed crawler is complete.
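For reference, the settings added in this section can be collected together in settings.py, a minimal sketch assuming the local Redis instance used above:

# settings.py -- Scrapy-Redis configuration from this section
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # store the Request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # Redis-based de-duplication
SCHEDULER_PERSIST = True                                     # keep queue and fingerprints after the crawler closes
REDIS_URL = 'redis://localhost:6379'                         # Redis connection URL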

3. Running

What we have built so far is not yet a truly distributed crawler, because the Redis queue we are using is a local Redis instance, so all the crawler processes have to run on the local machine. To achieve distribution in the true sense, use a remote Redis instance: crawlers running on multiple hosts can then connect to the same Redis and share one queue, giving us a distributed crawler in the real sense.
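As a sketch of how this looks in practice, assume the shared Redis is reachable as redis.example.com and the project's spider is named example (both names are placeholders for illustration). Every participating host points its settings at the shared instance and starts the same spider:

# settings.py on every host (host name is a placeholder)
REDIS_URL = 'redis://redis.example.com:6379'

# then, on each host, start the same spider; all of them will share the Redis queue
# (the spider name "example" is an assumption for illustration)
scrapy crawl example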
