
VII. Scrapy Framework - Case 1


1. Crawling the Douban Music Top Chart

1.1 Building a Scrapy project

  1. Installing the Scrapy library

    pip install scrapy
    
  2. Creating a Scrapy Project

    Open a command window (cmd) and run scrapy startproject xxxx (xxxx is the Scrapy project name) to create the project.

    scrapy startproject douban_spider2024
    

  3. Creating a Spider

    Run scrapy genspider xxx (spider name) xxx (start URL) to generate the spider.

    scrapy genspider douban 
    


1.2 Virtual environment setup

  1. Open the newly created douban_spider2024 folder in PyCharm to enter the project.

  2. Create a virtual environment (venv).

  3. Install the dependency libraries from a requirements file, or pip install them one by one; the standard commands are shown after this list.

    • Export the installed libraries: pip freeze > requirements.txt

    • Install the dependencies: pip install -r requirements.txt
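
For reference, a minimal command sequence for these steps (Windows cmd shown, matching the original; the requirements file name is the usual convention, not given in the source):

    python -m venv venv
    venv\Scripts\activate
    pip freeze > requirements.txt
    pip install -r requirements.txt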

1.3 Writing the main program

The main program (douban.py, generated by the genspider command) holds the code that parses the main page content.

  • Build the list of start URLs in the start_requests function and wrap each URL in a Request (the downloader middleware enabled in the settings intercepts these requests).

  • Parse the web page in the parse function; a sketch of the spider follows this list.
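
A minimal sketch of douban.py. The start URL and the CSS selectors are placeholders, since the original omits the target page; SongItem comes from section 1.4.

    import scrapy
    from douban_spider2024.items import SongItem

    class DoubanSpider(scrapy.Spider):
        name = "douban"

        def start_requests(self):
            # Build the chart-page URL list and wrap each URL in a Request;
            # the downloader middleware (section 1.6) intercepts these.
            urls = ["https://example.com/chart"]  # placeholder start URL
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Extract one item per song row; selectors are illustrative only.
            for row in response.css("li.song-item"):
                item = SongItem()
                item["title"] = row.css("a::text").get()
                yield item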

1.4 Items

  • Define a custom SongItem class (inheriting from scrapy.Item) and import it into the main program to hold the crawled fields; a sketch follows.
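
A minimal items.py sketch; only the class name SongItem comes from the original, the field names are illustrative.

    import scrapy

    class SongItem(scrapy.Item):
        # One Field per value you extract; the names here are examples.
        title = scrapy.Field()
        artist = scrapy.Field()
        rank = scrapy.Field()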

1.5 Settings

settings.py holds the parameters that control the components of the Scrapy framework: USER_AGENT, cookies, proxies, enabling or disabling middleware, and so on.

  • Set USER_AGENT to mimic a real browser.

  • Disable robots.txt compliance: change ROBOTSTXT_OBEY from True to False.

  • Set a download delay (DOWNLOAD_DELAY).

  • Enable the downloader middleware (DOWNLOADER_MIDDLEWARES) so that requests can be intercepted and modified; example entries follow this list.
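
The corresponding settings.py entries could look like this; the user-agent string, delay value, and middleware class name are illustrative, not taken from the original.

    # Mimic a real browser instead of the default Scrapy user agent.
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

    # Stop obeying robots.txt (the generated default is True).
    ROBOTSTXT_OBEY = False

    # Wait between requests so the site is not hammered.
    DOWNLOAD_DELAY = 2

    # Enable the downloader middleware so requests can be intercepted.
    DOWNLOADER_MIDDLEWARES = {
        "douban_spider2024.middlewares.DoubanSpider2024DownloaderMiddleware": 543,
    }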

1.6 Middlewares

  • Cookie handling

    In settings.py, add a function that processes the cookies; calling it returns a dictionary COOKIES_ITEM containing the cookies.

    Attach COOKIES_ITEM to the request in the process_request method of the xxDownloaderMiddleware class, as sketched after this list.

  • Using a SOCKS proxy with Scrapy: Scrapy's built-in proxy support is HTTP(S)-only, so a SOCKS proxy is usually bridged through a local HTTP proxy and set via request.meta["proxy"].
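
A sketch of the cookie handling described above. The cookies() helper and the raw cookie string are assumptions; only the names COOKIES_ITEM and process_request come from the original.

    # settings.py -- parse a raw cookie string into a dict once at startup.
    RAW_COOKIES = "key1=value1; key2=value2"  # placeholder cookie string

    def cookies(raw=RAW_COOKIES):
        return dict(pair.strip().split("=", 1) for pair in raw.split(";"))

    COOKIES_ITEM = cookies()

    # middlewares.py -- attach the dict to every outgoing request.
    from douban_spider2024 import settings

    class DoubanSpider2024DownloaderMiddleware:
        def process_request(self, request, spider):
            request.cookies = settings.COOKIES_ITEM
            return None  # let the request continue through the chain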

1.7 Multi-level URL parsing

  • Use a callback to parse multi-level URLs: at the end of the parse function, extract the new URL, submit a new Request, and pass the item on to the callback parse_detail for further parsing.

  • Add the newly parsed information to the item inside parse_detail; a sketch follows.
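
A sketch of the two-level parse inside douban.py, passing the partially filled item to parse_detail through cb_kwargs (request.meta works the same way); all selectors are placeholders.

    def parse(self, response):
        for row in response.css("li.song-item"):  # placeholder selector
            item = SongItem()
            item["title"] = row.css("a::text").get()
            detail_url = response.urljoin(row.css("a::attr(href)").get())
            # Submit a new Request and hand the item to the callback.
            yield scrapy.Request(detail_url, callback=self.parse_detail,
                                 cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        # Fill in the fields that only exist on the detail page.
        item["artist"] = response.css("span.artist::text").get()  # placeholder
        yield item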

1.8 Pipelines

  • Build an Excel storage pipeline to write the crawled data to an Excel file; a sketch follows.
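
A sketch of such a pipeline using openpyxl (one common choice; the original does not name a library). The field and file names are examples; remember to register the class under ITEM_PIPELINES in settings.py.

    # pipelines.py
    from openpyxl import Workbook

    class ExcelPipeline:
        def open_spider(self, spider):
            # Create the workbook and write a header row once.
            self.wb = Workbook()
            self.ws = self.wb.active
            self.ws.append(["title", "artist", "rank"])  # example fields

        def process_item(self, item, spider):
            # Append one row per crawled item.
            self.ws.append([item.get("title"), item.get("artist"), item.get("rank")])
            return item

        def close_spider(self, spider):
            self.wb.save("douban_songs.xlsx")  # example output file

    # settings.py
    ITEM_PIPELINES = {
        "douban_spider2024.pipelines.ExcelPipeline": 300,
    }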