
Python Web Crawler Practice Case: Crawling Cat's Eye Movie Top100


The following is a hands-on Python web crawler case that demonstrates how to use Python to crawl the Cat's Eye Movie Top100 for information such as movie name, starring actors, and release date, and save this information to a TXT file. This case uses the requests library to send HTTP requests and the re library to perform regular expression matching, and includes detailed code explanations to ensure that the code is ready to run.

1. Preparatory work

Before we begin, we need to make sure the requests library is installed. We can install it with the following command:


pip install requests

2. Which pages is the Cat's Eye Movie Top100 information crawled from?

The information for the Cat's Eye Movie Top100 is crawled from the board page of the official Cat's Eye Movie website (/board/4). Specifically, this page shows the Cat's Eye Movie Top100 list, containing details such as each movie's ranking, name, starring actors, release date, and rating.

During the crawling process, the crawler program simulates browser behavior by sending HTTP requests to the URL of the page and receiving the HTML content returned by the server. The program then uses regular expressions or a parsing library (e.g., BeautifulSoup or lxml) to parse the HTML content and extract the required information (e.g., movie name, starring actors, release date).
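
For comparison, here is a minimal sketch of parsing the board page with BeautifulSoup, one of the libraries mentioned above. The tag and class names (dd, p.name, p.star, p.releasetime) mirror the page structure assumed by the regular expression in Section 3 and may need adjusting if the site changes:

from bs4 import BeautifulSoup   # pip install beautifulsoup4

def parse_with_bs4(html):
    # Extract movie name, starring actors and release date from one board page
    soup = BeautifulSoup(html, 'html.parser')
    for dd in soup.select('dd'):
        name = dd.select_one('p.name a')
        star = dd.select_one('p.star')
        release = dd.select_one('p.releasetime')
        if name and star and release:
            yield {
                'Movie name': name.get_text(strip=True),
                'Starring': star.get_text(strip=True),
                'Release date': release.get_text(strip=True),
            }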

Since the page structure and anti-crawler mechanism of Cat's Eye Movie may change, the crawler may need to be adjusted and optimized accordingly in practice. In addition, crawling website data should comply with relevant laws and regulations and the website's terms of use, and must not be used for illegal purposes.

It should be noted that since the Cat's Eye Movie Top100 list is dynamically changing, the information crawled may only be a snapshot of a certain moment. If you need to get the latest or real-time list information, the crawler program needs to run regularly and update the data.
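
For example, a minimal sketch of such a periodic refresh, assuming the crawler from Section 3 has been saved as maoyan_spider.py (see Section 6), could simply re-run it on a fixed interval:

import time
from maoyan_spider import save_data   # the save_data function defined in Section 3

REFRESH_INTERVAL = 6 * 60 * 60   # refresh every 6 hours (an arbitrary choice)

while True:
    save_data()                  # re-crawl the Top100 list and overwrite the TXT file
    time.sleep(REFRESH_INTERVAL)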

3. Code realization

Below is the full code example:

import requests
import re

# Request URL
url = '/board/4'

# Request headers, simulating a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse page function
def parse_html(html):
    # Match movie information with a regular expression
    pattern = re.compile(
        r'<p class="name"><a href=".*?" title="(.*?)" data-act="boarditem-click" '
        r'data-val="{movieId:\d+}">(.*?)</a></p>.*?<p class="star">(.*?)</p>'
        r'.*?<p class="releasetime">(.*?)</p>',
        re.S)  # re.S lets . match newlines as well
    items = re.findall(pattern, html)

    # Convert the matched information into dictionary format
    for item in items:
        yield {
            'Movie name': item[1].strip(),
            'Starring': item[2].strip(),
            'Release date': item[3].strip()
        }

# Save data function
def save_data():
    # Open the file for writing
    f = open('maoyan_top100.txt', 'w', encoding='utf-8')

    # Crawl the data page by page, 10 items per page
    for i in range(10):
        # Construct the paging URL (the site's domain was omitted in the original
        # and must be prepended before running)
        page_url = f'/board/4?offset={i*10}'

        # Send an HTTP request for the page content
        response = requests.get(page_url, headers=headers)

        # Parse the page content
        for item in parse_html(response.text):
            # Write the information to the file
            f.write(str(item) + '\n')

    # Close the file
    f.close()

# Main function
if __name__ == '__main__':
    save_data()

4. Code interpretation

  • Request URL and headers: defines the URL of the Cat's Eye Top100 page to be crawled and the request headers that simulate a browser request, to avoid being blocked by anti-crawler mechanisms.
  • Parse page function: the parse_html function uses a regular expression to match the movie information on the page, including the movie name, starring actors, and release date. The re.S flag lets . match all characters, including newlines.
  • Save data function: the save_data function crawls the data page by page and writes the parsed information to a TXT file. It loops 10 times, each time building the paging URL, sending a request, parsing the page content, and writing the results to the file.
  • Main function: the __main__ block calls the save_data function to start crawling the data.

5. The code contains other functional modules

In the provided code, although the main function is to crawl the information of Cat's Eye Movie Top100, the code structure itself reflects several key functional modules. These modules make the code clearer, easier to maintain and extend. Here are the other functional modules included in the code:

(1) Request sending module:

  • Uses requests.get to send an HTTP GET request to the specified URL.
  • Sets the request headers via the headers parameter to simulate browser behavior.

(2) Page parsing module (parse_html function):

  • Uses regular expressions (the re module) to parse the HTML content and extract the required information.
  • The regular expression defines the structure of the content to be matched, including the movie name, starring actors, and release date.
  • Returns the matched information as dictionaries via the yield generator, producing them one at a time to save memory.

(3) Data saving module (save_data function):

  • Responsible for saving the parsed data to a file.
  • Paged crawling is implemented, where URLs for different pages are constructed and requests are sent through a loop.
  • Converts each piece of movie information to a string and writes it to the file, one record per line.

(4) Main program module (the if __name__ == '__main__': part):

  • Serves as the entry point of the program, calling the save_data function to start the crawling task.
  • Ensures that the crawl runs only when the script is executed as the main program, and not when it is imported by another script.

(5) Error handling module (implicit):

  • Although the code does not include an explicit try-except block to handle possible exceptions (e.g., network request failures, parsing errors), adding error handling is important in practice; a sketch is shown after this list.
  • Adding exception handling enhances the robustness and user-friendliness of the code.
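
For instance, a hedged sketch of such error handling, wrapping the request step in a try-except with a timeout (fetch_page is a hypothetical helper, not part of the code above):

import requests

def fetch_page(page_url, headers):
    # Return the page HTML, or None if the request fails
    try:
        response = requests.get(page_url, headers=headers, timeout=10)
        response.raise_for_status()           # raise for 4xx/5xx status codes
        return response.text
    except requests.RequestException as exc:  # timeouts, connection errors, HTTP errors
        print(f'Request for {page_url} failed: {exc}')
        return None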

(6) Scalability module (implicit):

  • The code is clearly structured, making it relatively easy to add new features (e.g., crawling more information or supporting other sites).
  • The code can be extended by modifying the regular expressions, adding new parsing functions, or changing the data-saving logic; one possible extension is sketched below.
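
As one example of extending the data-saving logic, the parsed items could be written to a JSON file instead of a TXT file (a sketch; save_as_json is a hypothetical addition, and the items would come from the parse_html generator in Section 3):

import json

def save_as_json(items, path='maoyan_top100.json'):
    # Write an iterable of movie dicts to a JSON file, keeping non-ASCII characters readable
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(list(items), f, ensure_ascii=False, indent=2)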

It should be noted that although the code contains these modules in its structure, it may need further improvement in practical applications, such as adding logging, optimizing the regular expressions to improve parsing efficiency, or handling dynamically loaded content (which may require tools such as Selenium). In addition, the code may need to be adjusted as the website structure and anti-crawler mechanisms change.
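
For dynamically loaded content, a minimal hedged sketch using Selenium (assuming Chrome and a matching chromedriver are available; fetch_rendered_html is a hypothetical helper, not part of the code above) might look like this:

from selenium import webdriver   # pip install selenium

def fetch_rendered_html(board_url):
    # Load the page in a real browser and return the fully rendered HTML
    driver = webdriver.Chrome()   # requires Chrome plus a matching chromedriver
    try:
        driver.get(board_url)
        return driver.page_source
    finally:
        driver.quit()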

6. Running the code

Save the above code as a Python file (e.g. maoyan_spider.py), and then run the file from the command line:


python maoyan_spider.py

After the run completes, we will find a file named maoyan_top100.txt in the current directory, containing the names, starring actors, and release dates of the movies in the Cat's Eye Top100.

7. Cautions

  • Since the structure of the website and the anti-crawler mechanism may change, the code may need to be adjusted accordingly in practical applications.
  • Crawling website data should comply with relevant laws and regulations and the use agreement of the website, and should not be used for illegal purposes.

Through this case, we can learn the basic steps and methods of using Python for web crawling, including sending HTTP requests, parsing page content, and saving data. Hope this case is helpful to you!