1. Crawling the Douban Music Top Rankings
1.1 Building a Scrapy project
- Installing the Scrapy library:
  pip install scrapy
- Creating a Scrapy project
  Open a command window (cmd) and run scrapy startproject xxxx (where xxxx is the project name) to create the Scrapy project:
  scrapy startproject douban_spider2024
- Creating a spider
  Run scrapy genspider xxx (spider name) xxx (target domain) to generate a spider inside the project; the target domain is required alongside the name. For the Douban music chart the command would look like this (domain assumed here):
  scrapy genspider douban music.douban.com
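For reference, startproject and genspider typically produce a layout like the following (file names are Scrapy's defaults; douban.py comes from the spider name used above):

    douban_spider2024/
        scrapy.cfg                # deploy configuration
        douban_spider2024/
            __init__.py
            items.py              # item definitions (SongItem, section 1.4)
            middlewares.py        # downloader / spider middlewares (section 1.6)
            pipelines.py          # item pipelines (Excel storage, section 1.8)
            settings.py           # project settings (section 1.5)
            spiders/
                __init__.py
                douban.py         # the main spider program (section 1.3)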
1.2 Virtual environment construction
- Use PyCharm to open the created douban_spider2024 folder and enter the project.
- Build a virtual environment (venv); see the command sketch after this list.
- Install the dependency libraries from the requirements file, or pip install them one by one yourself.
- Export the installed libraries: pip freeze > requirements.txt
- Install the listed libraries: pip install -r requirements.txt
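A minimal command sequence for the steps above, assuming Windows (cmd is used earlier): create and activate the venv, install the dependencies from requirements.txt, and, when exporting from an existing environment, regenerate that file with pip freeze:

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
    pip freeze > requirements.txt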
1.3 Writing the main program
The main spider program (douban.py, generated by genspider above) contains the code that parses the main content of the page.
- Get the list of URLs in the start_requests function and wrap each one in a Request (the downloader middleware enabled in the settings acts on these Requests); see the spider sketch after this list.
- Parse the web page in the parse function.
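A minimal sketch of the spider, assuming the chart URL, the CSS selectors and the SongItem fields shown here (all of them must be adapted to the real pages):

    import scrapy
    from douban_spider2024.items import SongItem  # defined in items.py, see section 1.4

    class DoubanSpider(scrapy.Spider):
        name = "douban"

        def start_requests(self):
            # Build the list of chart URLs and wrap each one in a Request.
            urls = ["https://music.douban.com/chart"]  # assumed URL
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Parse the main content of the page; the selectors are placeholders.
            for row in response.css("li.clearfix"):
                item = SongItem()
                item["title"] = row.css("a::text").get()
                item["artist"] = row.css("p::text").get()
                yield item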
1.4 Items
- Define a custom SongItem class (inheriting from scrapy.Item) in items.py and import it into the main program to store the crawled fields; a sketch follows.
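A minimal items.py sketch; the field names are assumptions for a music chart and should match whatever is actually crawled:

    import scrapy

    class SongItem(scrapy.Item):
        # Fields that the spider fills in; the names here are assumptions.
        title = scrapy.Field()
        artist = scrapy.Field()
        rating = scrapy.Field()
        rank = scrapy.Field()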
1.5 Settings
Parameters in settings.py control the components of the Scrapy framework, such as USER_AGENT, cookies, proxies, enabling and disabling middlewares, and so on.
- Modify USER_AGENT so that requests look like they come from a real browser.
- Disable robots.txt compliance by changing ROBOTSTXT_OBEY from True to False.
- Set the download delay (DOWNLOAD_DELAY).
- Enable the downloader middleware (DOWNLOADER_MIDDLEWARES) so that each Request can be intercepted and modified before it is sent; a settings sketch follows.
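A sketch of the corresponding settings.py entries (the user-agent string, delay and middleware priority are illustrative values; the middleware class name follows Scrapy's generated template for this project name):

    USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
    ROBOTSTXT_OBEY = False      # do not obey robots.txt
    DOWNLOAD_DELAY = 2          # seconds between requests
    DOWNLOADER_MIDDLEWARES = {
        "douban_spider2024.middlewares.DoubanSpider2024DownloaderMiddleware": 543,
    }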
1.6 Cookies and proxy settings
- Cookies settings
  In settings.py, add a function that handles cookies and returns a dictionary COOKIE_ITEM containing the cookie key-value pairs. Then apply COOKIE_ITEM to every request in the process_request function of the xxDownloaderMiddleware class; see the sketch after this list.
- Can Scrapy use a SOCKS proxy?
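A sketch of the cookie handling described above, assuming a get_cookies() helper in settings.py that returns the COOKIE_ITEM dictionary (the helper name and cookie string are placeholders):

    # settings.py (excerpt)
    def get_cookies():
        raw = "key1=value1; key2=value2"   # paste the browser's cookie string here
        return dict(pair.split("=", 1) for pair in raw.split("; "))

    # middlewares.py (excerpt)
    from douban_spider2024 import settings

    class DoubanSpider2024DownloaderMiddleware:
        def process_request(self, request, spider):
            # Attach the cookie dictionary to every outgoing Request.
            request.cookies = settings.get_cookies()
            # An ordinary HTTP proxy could also be set here via
            # request.meta["proxy"] = "http://host:port"; SOCKS proxies are not
            # supported by Scrapy's downloader out of the box.
            return None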
1.7 Multi-level URL parsing
- Use a callback to parse multi-level URLs: at the end of the parse function, extract the new URL, yield a new Request, and pass the item along to the parse_detail callback for further parsing (see the sketch after this list).
- Fill in the additional item fields in parse_detail.
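A sketch of the callback chain, adapting the parse function from the section 1.3 spider (both methods live inside the spider class); the item is passed through cb_kwargs (meta works as well), and the detail selector and extra field are assumptions:

    def parse(self, response):
        for row in response.css("li.clearfix"):            # placeholder selector
            item = SongItem()
            item["title"] = row.css("a::text").get()
            detail_url = response.urljoin(row.css("a::attr(href)").get())
            # Hand the partially filled item to the next parsing level.
            yield scrapy.Request(detail_url, callback=self.parse_detail,
                                 cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        # Fill in the fields that only exist on the detail page (field name assumed).
        item["rating"] = response.css("strong.rating_num::text").get()
        yield item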
1.8 Pipelines
- Build an Excel storage pipeline to write the crawled data into an Excel file; a sketch follows.
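A minimal Excel pipeline sketch using openpyxl (the library choice, file name and column order are assumptions); the pipeline also has to be registered in ITEM_PIPELINES:

    # pipelines.py
    from openpyxl import Workbook

    class ExcelPipeline:
        def open_spider(self, spider):
            # Create the workbook and write the header row when the spider starts.
            self.wb = Workbook()
            self.ws = self.wb.active
            self.ws.append(["title", "artist", "rating", "rank"])  # assumed columns

        def process_item(self, item, spider):
            # Append one row per crawled item, in the same order as the header.
            self.ws.append([item.get("title"), item.get("artist"),
                            item.get("rating"), item.get("rank")])
            return item

        def close_spider(self, spider):
            # Save the file when the spider finishes.
            self.wb.save("douban_songs.xlsx")   # assumed file name

    # settings.py
    ITEM_PIPELINES = {
        "douban_spider2024.pipelines.ExcelPipeline": 300,
    }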