1. Crawling the Douban Music Top Rankings
1.1 Building a Scrapy project
- Installing the Scrapy library:
  pip install scrapy
- Creating a Scrapy project
  Open a command window (cmd) and run scrapy startproject xxxx (where xxxx is the project name) to create the Scrapy project:
  scrapy startproject douban_spider2024
- Creating a spider
  Run scrapy genspider xxx (spider name) xxx (target domain) to generate a spider inside the project; the target domain is required alongside the name. For the Douban music chart the command would look like this (domain assumed here):
  scrapy genspider douban music.douban.com
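For reference, startproject and genspider typically produce a layout like the following (file names are Scrapy's defaults; douban.py comes from the spider name used above):

    douban_spider2024/
        scrapy.cfg                # deploy configuration
        douban_spider2024/
            __init__.py
            items.py              # item definitions (SongItem, section 1.4)
            middlewares.py        # downloader / spider middlewares (section 1.6)
            pipelines.py          # item pipelines (Excel storage, section 1.8)
            settings.py           # project settings (section 1.5)
            spiders/
                __init__.py
                douban.py         # the main spider program (section 1.3)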
1.2 Virtual environment construction
- Use PyCharm to open the created douban_spider2024 folder and enter the project.
- Build a virtual environment (venv); see the command sketch after this list.
- Install the dependency libraries from the requirements file, or pip install them one by one yourself.
- Export the installed libraries: pip freeze > requirements.txt
- Install the listed libraries: pip install -r requirements.txt
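A minimal command sequence for the steps above, assuming Windows (cmd is used earlier): create and activate the venv, install the dependencies from requirements.txt, and, when exporting from an existing environment, regenerate that file with pip freeze:

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
    pip freeze > requirements.txt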
1.3 Writing the main program
The main spider program (douban.py, generated by genspider above) contains the code that parses the main content of the page.
- Get the list of URLs in the start_requests function and wrap each one in a Request (the downloader middleware enabled in the settings acts on these Requests); see the spider sketch after this list.
- Parse the web page in the parse function.
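A minimal sketch of the spider, assuming the chart URL, the CSS selectors and the SongItem fields shown here (all of them must be adapted to the real pages):

    import scrapy
    from douban_spider2024.items import SongItem  # defined in items.py, see section 1.4

    class DoubanSpider(scrapy.Spider):
        name = "douban"

        def start_requests(self):
            # Build the list of chart URLs and wrap each one in a Request.
            urls = ["https://music.douban.com/chart"]  # assumed URL
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Parse the main content of the page; the selectors are placeholders.
            for row in response.css("li.clearfix"):
                item = SongItem()
                item["title"] = row.css("a::text").get()
                item["artist"] = row.css("p::text").get()
                yield item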
1.4 Items
- Define a custom SongItem class (inheriting from scrapy.Item) in items.py and import it into the main program to store the crawled fields; a sketch follows.
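A minimal items.py sketch; the field names are assumptions for a music chart and should match whatever is actually crawled:

    import scrapy

    class SongItem(scrapy.Item):
        # Fields that the spider fills in; the names here are assumptions.
        title = scrapy.Field()
        artist = scrapy.Field()
        rating = scrapy.Field()
        rank = scrapy.Field()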
1.5 Settings
Parameters in settings.py control the components of the Scrapy framework, such as USER_AGENT, cookies, proxies, enabling and disabling middlewares, and so on.
- Modify USER_AGENT so that requests look like they come from a real browser.
- Disable robots.txt compliance by changing ROBOTSTXT_OBEY from True to False.
- Set the download delay (DOWNLOAD_DELAY).
- Enable the downloader middleware (DOWNLOADER_MIDDLEWARES) so that each Request can be intercepted and modified before it is sent; a settings sketch follows.
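A sketch of the corresponding settings.py entries (the user-agent string, delay and middleware priority are illustrative values; the middleware class name follows Scrapy's generated template for this project name):

    USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
    ROBOTSTXT_OBEY = False      # do not obey robots.txt
    DOWNLOAD_DELAY = 2          # seconds between requests
    DOWNLOADER_MIDDLEWARES = {
        "douban_spider2024.middlewares.DoubanSpider2024DownloaderMiddleware": 543,
    }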
1.6 Cookies and proxy settings
- Cookies settings
  In settings.py, add a function that handles cookies and returns a dictionary COOKIE_ITEM containing the cookie key-value pairs. Then apply COOKIE_ITEM to every request in the process_request function of the xxDownloaderMiddleware class; see the sketch after this list.
- Can Scrapy use a SOCKS proxy?
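A sketch of the cookie handling described above, assuming a get_cookies() helper in settings.py that returns the COOKIE_ITEM dictionary (the helper name and cookie string are placeholders):

    # settings.py (excerpt)
    def get_cookies():
        raw = "key1=value1; key2=value2"   # paste the browser's cookie string here
        return dict(pair.split("=", 1) for pair in raw.split("; "))

    # middlewares.py (excerpt)
    from douban_spider2024 import settings

    class DoubanSpider2024DownloaderMiddleware:
        def process_request(self, request, spider):
            # Attach the cookie dictionary to every outgoing Request.
            request.cookies = settings.get_cookies()
            # An ordinary HTTP proxy could also be set here via
            # request.meta["proxy"] = "http://host:port"; SOCKS proxies are not
            # supported by Scrapy's downloader out of the box.
            return None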
1.7 Multi-level URL parsing
- Use a callback to parse multi-level URLs: at the end of the parse function, extract the new URL, yield a new Request, and pass the item along to the parse_detail callback for further parsing (see the sketch after this list).
- Fill in the additional item fields in parse_detail.
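A sketch of the callback chain, adapting the parse function from the section 1.3 spider (both methods live inside the spider class); the item is passed through cb_kwargs (meta works as well), and the detail selector and extra field are assumptions:

    def parse(self, response):
        for row in response.css("li.clearfix"):            # placeholder selector
            item = SongItem()
            item["title"] = row.css("a::text").get()
            detail_url = response.urljoin(row.css("a::attr(href)").get())
            # Hand the partially filled item to the next parsing level.
            yield scrapy.Request(detail_url, callback=self.parse_detail,
                                 cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        # Fill in the fields that only exist on the detail page (field name assumed).
        item["rating"] = response.css("strong.rating_num::text").get()
        yield item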
1.8 Pipelines
- Build an Excel storage pipeline to write the crawled data into an Excel file; a sketch follows.
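A minimal Excel pipeline sketch using openpyxl (the library choice, file name and column order are assumptions); the pipeline also has to be registered in ITEM_PIPELINES:

    # pipelines.py
    from openpyxl import Workbook

    class ExcelPipeline:
        def open_spider(self, spider):
            # Create the workbook and write the header row when the spider starts.
            self.wb = Workbook()
            self.ws = self.wb.active
            self.ws.append(["title", "artist", "rating", "rank"])  # assumed columns

        def process_item(self, item, spider):
            # Append one row per crawled item, in the same order as the header.
            self.ws.append([item.get("title"), item.get("artist"),
                            item.get("rating"), item.get("rank")])
            return item

        def close_spider(self, spider):
            # Save the file when the spider finishes.
            self.wb.save("douban_songs.xlsx")   # assumed file name

    # settings.py
    ITEM_PIPELINES = {
        "douban_spider2024.pipelines.ExcelPipeline": 300,
    }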