
Getting started with the Scrapy module and putting it into practice: crawling the PenQiGe novel site


Basic use of the Scrapy framework

Create a project (crawling the PenQiGe novel site)

scrapy startproject novels

Creating a spider

cd novels
scrapy genspider bqgui_cc fffffff

Execute the genspider command. The first parameter is the name of the Spider and the second parameter is the domain name of the website. After execution, the content is shown below:

import scrapy


class BqguiCcSpider(scrapy.Spider):
    name = "bqgui_cc"
    allowed_domains = ["fffffffffff"]
    start_urls = ["https://fffffffffff"]

    def parse(self, response):
        pass

There are three attributes here - name, allowed_domains and start_urls - and one method, parse.

  • name: a name unique within the project, used to distinguish this Spider from others.
  • allowed_domains: the domains the Spider is allowed to crawl; initial or follow-up request links that are not under these domains are filtered out. This keeps the crawler from wandering off to ads and other external links.
  • start_urls: the list of URLs the Spider crawls at startup; the initial requests are generated from it.
  • parse: a method of the Spider. By default, when the requests built from the URLs in start_urls finish downloading, each response is passed to this method as its only argument. The method is responsible for parsing the response and extracting data or generating further requests to be processed.

Create item

To create an Item, you need to inherit from the scrapy.Item class and define fields of type scrapy.Field. Looking at the target site, we can extract text, author, and tags.

Define the Item by modifying items.py as follows:

import scrapy

class QuoteItem(scrapy.Item):

    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
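
The practical project later in this article uses a NovelsItem whose fields are assigned in get_info. A minimal sketch of that class in items.py, assuming only the four field names used there, would be:

import scrapy

class NovelsItem(scrapy.Item):
    # Field names taken from the assignments made in get_info() below
    book_author = scrapy.Field()
    book_name = scrapy.Field()
    book_content = scrapy.Field()
    book_type = scrapy.Field()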

Parsing the Response

    def parse(self, response):
        # print()
        # Category links in the site navigation bar
        types = response.xpath('//div[@class="nav"]/ul/li/a/@href').extract()[1:9]
        for index, type_url in enumerate(types):
            print(type_url, index)
            for page in range(1, 31):
                # Interface to access novel information
                yield scrapy.Request(f"/json?sortid={index}&page={page}",
                                     callback=self.get_info)

Using item

Having defined the Item above, we now use it. An Item can be thought of as a dictionary, but it must be instantiated before use. We then assign each field of the Item from the parsed results and finally yield the Item.

    def get_info(self, response):
        # The interface returns a JSON list of books
        books = response.json()
        # Recover the category id from the request URL
        match = re.search(r'sortid=(\d+)', response.url)
        book_type = match.group(1)
        if len(books) > 0:
            for book in books:
                book_author = book['author']
                book_name = book['articlename']
                book_content = book['intro']
                # Use a fresh item for every book
                item = NovelsItem()
                item['book_author'] = book_author
                item['book_name'] = book_name
                item['book_content'] = book_content
                item['book_type'] = book_type
                yield item

Follow-up requests

Follow-up requests are constructed with scrapy.Request. Here we pass two parameters - url and callback, which are described below.

  • url: it is the request link.
  • callback: it is a callback function. When the request for which the callback function is specified completes, a response is retrieved and the engine passes that response as an argument to the callback function. The callback function parses or generates the next request, as shown in parse() above.
yield scrapy.Request(f"/json?sortid={index}&page={page}",
                     callback=self.get_info)

Running the spider

  1. Run from a script file

    from scrapy.cmdline import execute
    
    # equivalent to: scrapy crawl <spider name>
    execute(['scrapy', 'crawl', 'bqgui_cc'])
    
  2. Run from the command line

    scrapy crawl <spider name>        # e.g. scrapy crawl bqgui_cc
    

Save to file

After running Scrapy, we only see the output on the console. What if we want to save the results?

You don't really need any additional code to accomplish this task, Scrapy provides Feed Exports that can easily output the results of the crawl. For example, if we want to save the above results as a JSON file, we can execute the following command:

scrapy crawl quotes -o quotes.json

or, to write one JSON object per line (the JSON Lines format):

scrapy crawl quotes -o quotes.jl

Many other output formats are supported, such as csv, xml, pickle, and marshal, as well as remote destinations such as ftp and s3. In addition, you can implement a custom ItemExporter for other output formats.

For example, the following commands output in csv, xml, pickle, and marshal format, plus ftp remote output:

scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@/path/to/

For the ftp output, the username, password, host address, and output path all need to be configured correctly, otherwise an error will be raised.
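
If you prefer not to pass -o on every run, newer Scrapy versions (2.1 and later) also let you configure the export once via the FEEDS setting in settings.py. A minimal sketch, with an illustrative output file name:

# settings.py -- illustrative feed export configuration
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}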

With the Feed Exports provided by Scrapy, we can easily export the crawl results to a file. For small projects, this should be sufficient. However, for more complex output, such as writing to a database, we can use an Item Pipeline.

Using the Item Pipeline

If you want to perform more complex operations, such as saving the results to a MongoDB database or filtering for certain useful Items, you can define an Item Pipeline to do so.

The Item Pipeline is, as the name suggests, a pipeline for Items. When a Spider generates an Item, it is automatically sent to the Item Pipeline for processing. We often use the Item Pipeline for the following operations.

  • Cleansing of HTML data;
  • Validate the crawled data and check the crawled fields;
  • Check and discard duplicate content;
  • Store the crawl results in a database.

Implementing an Item Pipeline is as simple as defining a class and implementing the process_item method; when the pipeline is enabled, this method is called automatically for every Item. The process_item method must either return a dictionary or Item object containing the data, or raise a DropItem exception.

The process_item method takes two arguments. One is item, which is passed in every time the Spider generates an Item. The other is spider, the Spider instance itself.

Next, we implement an Item Pipeline that trims the text of items longer than 50 characters (dropping items with no text at all) and saves the results to MongoDB.

Modify pipelines.py in the project: delete the auto-generated contents and add a TextPipeline class with the following contents:

from scrapy.exceptions import DropItem

class TextPipeline(object):
    def __init__(self):
        self.limit = 50
    
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                # Truncate long text and append an ellipsis
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')

This code defines a limit of 50 in the constructor and implements the process_item method with item and spider arguments. It first checks whether the item's text attribute exists; if it does not, a DropItem exception is raised. If it does exist and its length exceeds the limit, the text is truncated and an ellipsis is appended, and finally the item is returned.
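
As a quick, hypothetical sanity check (not part of the project files), the pipeline can be exercised directly with a plain dictionary; a real run goes through Scrapy rather than calling process_item by hand:

# Hypothetical check of TextPipeline behaviour on an over-long text
pipeline = TextPipeline()
processed = pipeline.process_item({'text': 'A' * 60}, spider=None)
print(len(processed['text']))  # 53: the 50-character limit plus the appended '...'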

Next, we store the processed item into MongoDB and define another Pipeline. Again, we implement another class, MongoPipeline, as shown below:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the global configuration
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the Item class name as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

The MongoPipeline class implements several additional methods defined by the API.

  • from_crawler: a class method, identified by the @classmethod decorator, and a form of dependency injection. Through its crawler parameter we can access the global settings; there we define MONGO_URI and MONGO_DB to specify the address and database name for the MongoDB connection. The method reads this configuration and returns an instance of the class, so it mainly exists to build the pipeline from the global configuration.
  • open_spider: this method is called when the Spider is opened. Some initialization is done here.
  • close_spider: This method is called when the Spider is closed, where the database connection is closed.

The primary process_item method performs the data insertion operation.

Once the TextPipeline and MongoPipeline classes have been defined, we need to enable them and define the MongoDB connection information in the project settings.

We add the following to settings.py:

ITEM_PIPELINES = {
   'novels.pipelines.TextPipeline': 300,
   'novels.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'

This assigns the ITEM_PIPELINES dictionary, where each key is the import path of a Pipeline class and each value is its priority as a number; the smaller the number, the earlier the corresponding Pipeline is invoked.

Re-run the crawl with the command shown below:

scrapy crawl quotes

Practical project: crawling the PenQiGe novel site

The core crawler code is provided here:

import re

import scrapy
from novels.items import NovelsItem


class BqguiCcSpider(scrapy.Spider):
    name = "bqgui_cc"
    # allowed_domains = ["/"]
    start_urls = ["/"]

    def parse(self, response):
        # print()
        # Category links in the site navigation bar
        types = response.xpath('//div[@class="nav"]/ul/li/a/@href').extract()[1:9]
        for index, type_url in enumerate(types):
            print(type_url, index)
            for page in range(1, 31):
                # Interface to access novel information
                yield scrapy.Request(f"/json?sortid={index}&page={page}",
                                     callback=self.get_info)

    def get_info(self, response):
        # The interface returns a JSON list of books
        books = response.json()
        # Recover the category id from the request URL
        match = re.search(r'sortid=(\d+)', response.url)
        book_type = match.group(1)
        if len(books) > 0:
            for book in books:
                book_author = book['author']
                book_name = book['articlename']
                book_content = book['intro']
                # Use a fresh item for every book
                item = NovelsItem()
                item['book_author'] = book_author
                item['book_name'] = book_name
                item['book_content'] = book_content
                item['book_type'] = book_type
                yield item

Understanding the speed of the Scrapy framework

If the target site has no anti-scraping mechanism and the network is not the bottleneck, Scrapy running at full throttle can pull in roughly 200,000 records per hour, which is more than fast enough.
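
How close you get to that figure depends largely on Scrapy's concurrency settings. A sketch of the relevant knobs in settings.py (the values shown are illustrative, not tuned for any particular site):

# settings.py -- concurrency knobs that largely determine crawl speed
CONCURRENT_REQUESTS = 32             # total concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # cap on concurrent requests per target domain
DOWNLOAD_DELAY = 0                   # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = False         # enable to let Scrapy adapt the delay automatically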

Common-sense context for crawler jobs

  • Smaller companies typically have 3 or 4 crawler engineers
  • 2 servers
  • 10 to 20 crawler projects deployed on each server
  • Hundreds of thousands of records for small websites
  • Millions of records for large websites

More great content: [CodeRealm]