
Write a crawler to download good-looking wallpapers from public accounts


Preface

Many years ago when I was still in college, I wrote a similar article, but at that time I collected beautiful wallpapers from a certain game’s official website.

Recently, the WeChat public account always recommends various wallpapers to me. There are many good-looking ones, but saving them one by one is too troublesome, so I simply wrote a crawler to automatically download them.

Features of this crawler

Let me briefly list the features involved in this project. Not every one of them is covered in this article; the focus here is mainly the crawler part.

If anyone is interested in the other features, I will share them later.

  • Get all articles of the specified official account
  • Download wallpapers that comply with the rules in the article
  • Filter out irrelevant pictures, such as the small icons prompting readers to follow the account
  • Data persistence (try asynchronous ORM and lightweight NoSQL)
  • Image analysis (size information, perceptual hashing, file MD5)
  • All running processes are displayed with a progress bar, which is very user-friendly

Crawler-related articles

I have written many articles related to crawlers in the past few years.

  • Write a crawler to automatically download beautiful wallpapers from a certain game official website
  • Selenium crawler practice: intercept pictures on web pages
  • Selenium crawler practice (pitfall records): ajax request packet capture and browser exit
  • Crawler Notes: Improve data collection efficiency! Using a proxy pool and thread pool
  • C# crawler development summary
  • Put the crawler on your phone and run! A preliminary study on the Flutter crawler framework~

Project structure

As usual, I use the pdm tool for dependency management.

The dependencies used in this project are as follows:

dependencies = [
    "requests>=2.32.3",
    "bs4>=0.0.2",
    "loguru>=0.7.3",
    "tqdm>=4.67.1",
    "tinydb>=4.8.2",
    "pony>=0.7.19",
    "tortoise-orm[aiosqlite]>=0.23.0",
    "orjson>=3.10.14",
    "aerich[toml]>=0.8.1",
    "pillow>=11.1.0",
    "imagehash>=4.3.1",
]

There is also a dev dependency for inspecting the database (I tried a lightweight NoSQL store, which has no GUI viewer).

[dependency-groups]
dev = [
    "jupyterlab>=4.3.4",
]

Data persistence

Every time I do this kind of project, I try different data persistence solutions.

For relational databases, I used the ORM peewee last time.

Later I found that its main problem was that automatic migration is not supported (it may be supported now, but it was a few years ago when I used it).

Everything else about it was fine.

This time I didn't add persistence at the beginning, but a few shutdowns caused the progress to be lost, and writing a bunch of matching rules to recover it was really troublesome.

So later I refactored everything.

I tried tinydb (a single-file document NoSQL store), pony (a relational ORM), and tortoise-orm.

I finally chose tortoise-orm because the syntax is very similar to Django ORM and I didn’t want to step out of my comfort zone.

Model definition

from tortoise.models import Model
from tortoise import fields


class Article(Model):
    id = fields.IntField(primary_key=True)
    raw_id = fields.CharField(max_length=64)
    title = fields.TextField()
    url = fields.TextField()
    created_at = fields.DatetimeField()
    updated_at = fields.DatetimeField()
    html = fields.TextField(null=True)
    raw_json = fields.JSONField()

    def __str__(self):
        return self.title


class Image(Model):
    id = fields.IntField(primary_key=True)
    article = fields.ForeignKeyField('models.Article', related_name='images')
    url = fields.TextField()
    is_downloaded = fields.BooleanField(default=False)
    downloaded_at = fields.DatetimeField(null=True)
    local_file = fields.TextField(null=True)
    size = fields.IntField(null=True, description='unit: bytes')
    width = fields.IntField(null=True)
    height = fields.IntField(null=True)
    image_hash = fields.CharField(max_length=128, null=True)
    md5_hash = fields.CharField(max_length=32, null=True)

    def __str__(self):
        return self.url

These two models cover all the needs of this project, and can even support follow-up features such as similar-image detection and image classification.
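The image-analysis step itself is not covered in this article, but as a rough illustration of how the width/height/image_hash/md5_hash fields could be filled with the pillow and imagehash dependencies listed above, here is a minimal sketch (the analyze_image helper is hypothetical, not part of the project code):

import hashlib

import imagehash
from PIL import Image as PILImage


def analyze_image(path: str) -> dict:
    """Hypothetical helper: gather the values the Image model can store."""
    with PILImage.open(path) as im:
        width, height = im.size
        phash = str(imagehash.phash(im))  # perceptual hash, tolerant of resizing/re-encoding
    with open(path, 'rb') as f:
        md5 = hashlib.md5(f.read()).hexdigest()  # exact-duplicate detection
    return {'width': width, 'height': height, 'image_hash': phash, 'md5_hash': md5}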

Get all articles of the specified official account

This method requires you to have an official account of your own.

The article list is obtained through the "hyperlink" insertion feature in the official account backend.

See the reference materials for the specific steps.

Preparation

Here are just the key points. After entering the hyperlink menu, press F12 to capture the requests.

Focus on the /cgi-bin/appmsg interface. From it, the following need to be extracted:

  • Cookie
  • token
  • fakeid - the official account ID, base64-encoded

The first two change every time you log in. You could use Selenium with a local proxy to capture packets and update them automatically; for details, see the article I wrote before: Selenium crawler practice (pitfall records): ajax request packet capture and browser exit
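For reference, the captured request to /cgi-bin/appmsg usually carries parameters along these lines; the exact field names and values should come from your own packet capture, so treat this dict purely as an assumed example:

# Assumed example only - copy the real values from your own F12 packet capture
payload_data = {
    "action": "list_ex",
    "begin": "0",       # paging offset
    "count": "5",       # articles per page
    "fakeid": "<base64-encoded account id from the capture>",
    "type": "9",
    "query": "",
    "token": "<token from the capture>",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
}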

Code implementation

I encapsulated the operation as a class:

import math
import random
import time
from datetime import datetime

import requests
from loguru import logger
from tortoise.expressions import Q
from tqdm import tqdm

from models import Article  # the Article model defined above


class ArticleCrawler:
    def __init__(self):
        self.base_url = "Interface address, based on the packet capture address"
        self.token = ""
        self.headers = {
            "Cookie": "Fill in the Cookie obtained from packet capture",
            "User-Agent": "Fill in the appropriate UA",
        }
        self.payload_data = {}  # Based on the data obtained from the actual packet capture
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def fetch_html(self, url):
        """Get article HTML"""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            logger.error(f"Failed to fetch HTML for {url}: {e}")
            return None

    @property
    def total_count(self):
        """Get the total number of articles"""
        content_json = self.session.get(self.base_url, params=self.payload_data).json()
        try:
            count = int(content_json["app_msg_cnt"])
            return count
        except Exception as e:
            logger.error(e)
            logger.error(f'response json: {content_json}')

        return None

    async def crawl_list(self, count, per_page=5):
        """Get the article list and store it in the database"""
        logger.info(f'Getting article list, total count: {count}')

        created_articles = []

        page = int(math.ceil(count / per_page))
        for i in tqdm(range(page), ncols=100, desc="Get article list"):
            payload = self.payload_data.copy()
            payload["begin"] = str(i * per_page)
            resp_json = self.session.get(self.base_url, params=payload).json()
            articles = resp_json["app_msg_list"]

            # Save to the database
            for item in articles:
                # Check if it already exists to avoid repeated insertion
                if await Article.filter(raw_id=item['aid']).exists():
                    continue

                created_item = await Article.create(
                    raw_id=item['aid'],
                    title=item['title'],
                    url=item['link'],
                    created_at=datetime.fromtimestamp(item["create_time"]),
                    updated_at=datetime.fromtimestamp(item["update_time"]),
                    html='',
                    raw_json=item,
                )
                created_articles.append(created_item)

            time.sleep(random.randint(3, 6))

        logger.info(f'created articles: {len(created_articles)}')

    async def crawl_all_list(self):
        return await self.crawl_list(self.total_count)

    async def crawl_articles(self, fake=False):
        # Here, filter out wallpaper articles based on the actual situation
        qs = (
            Article.filter(title__icontains='Wallpaper')
            .filter(Q(html='') | Q(html__isnull=True))
        )

        count = await qs.count()

        logger.info(f'Number of eligible articles without HTML: {count}')

        if fake: return

        with tqdm(
                total=count,
                ncols=100,
                desc="⬇ Downloading articles",
                # Optional colors [hex (#00ff00), BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE]
                colour='green',
                unit="page",
                bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} pages [{rate_fmt}]',
        ) as pbar:
            async for article in qs:
                article: Article
                article.html = self.fetch_html(article.url)
                await article.save()
                pbar.update(1)
                time.sleep(random.randint(2, 5))

What does this code do?

Or rather, what functions does this class provide?

  • Get the total number of articles of the specified official account
  • Loop through pages to obtain articles from public accounts, including article titles, addresses, and content.
  • Save article to database

Code analysis

The key is the crawl_list method.

In fact, the code is relatively rough: there is no error handling, and the database is queried in every loop iteration, so performance is definitely not great.

The better approach is to read the article IDs already in the database up front, so the loop does not need to query the database each time (a sketch follows below).

But this is a simple crawler, so I have not optimized it.
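A minimal sketch of that optimization, assuming the same Article model and item structure as above (this is not the article's code):

from datetime import datetime


async def create_missing_articles(articles: list[dict]) -> int:
    """Query the existing raw_ids once, then check membership in memory."""
    existing_ids = set(await Article.all().values_list('raw_id', flat=True))
    created = 0
    for item in articles:
        if item['aid'] in existing_ids:
            continue
        await Article.create(
            raw_id=item['aid'],
            title=item['title'],
            url=item['link'],
            created_at=datetime.fromtimestamp(item['create_time']),
            updated_at=datetime.fromtimestamp(item['update_time']),
            html='',
            raw_json=item,
        )
        existing_ids.add(item['aid'])
        created += 1
    return created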

Each iteration also calls time.sleep(random.randint(3, 6)) to pause for a random few seconds.

progress bar

The tqdm library is used here to implement the progress bar (the Python ecosystem has some simpler progress bar libraries, which I have used before, but most of them are wrappers around tqdm).

Usage of the bar_format parameter: bar_format customizes the layout of the progress bar, e.g. showing the number of processed items, the total, the processing speed, and so on. The placeholders are as follows (a standalone demo comes after this list):

  • {l_bar} is the left part of the progress bar, containing the description and percentage.
  • {bar} is the actual progress bar.
  • {n_fmt}/{total_fmt} displays the current progress and total.
  • {rate_fmt} displays the processing rate.
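As a standalone demo of these placeholders (illustration only, not project code):

import time

from tqdm import tqdm

# Fake workload: 50 items, custom bar_format showing count, total and rate
for _ in tqdm(range(50), ncols=100, colour='green', unit='file',
              bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} files [{rate_fmt}]'):
    time.sleep(0.05)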

Parse web pages

So far we have only downloaded the HTML of each article; we still need to extract the image addresses from the page.

For this, we need a parsing method:

from bs4 import BeautifulSoup


def parse_html(html: str) -> list:
    soup = BeautifulSoup(html, 'html.parser')
    # Images in WeChat articles keep the real address in data-src; adjust the selector to the actual page structure
    img_elements = soup.select('.wxw-img')

    images = []

    for img_element in img_elements:
        img_url = img_element['data-src']
        images.append(img_url)

    return images

It simply uses a CSS selector to extract the image URLs.

Extract pictures

Remember that the models include an Image, right?

We haven't used it so far.

In this section, we extract the image URLs and store them in the database.

async def extract_images_from_articles():
    # Write a query based on the actual situation
    qs = (
        Article.filter(title__icontains='Wallpaper')
        .exclude(Q(html='') | Q(html__isnull=True))
    )

    article_count = await qs.count()

    with tqdm(
            total=article_count,
            ncols=100,
            desc="⬇ extract images from articles",
            colour='green',
            unit="article",
            bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} articles [{rate_fmt}]',
    ) as pbar:
        async for article in qs:
            article: Article
            images = parse_html(article.html)
            for img_url in images:
                if await Image.filter(url=img_url).exists():
                    continue

                await Image.create(
                    article=article,
                    url=img_url,
                )

            pbar.update(1)

    logger.info(f'article count: {article_count}, image count: {await Image.all().count()}')

This method first reads the articles from the database, then extracts the image URLs from each article's HTML, and finally stores them all in the database.

The code here also has the problem of repeatedly querying the database in a loop, but I am too lazy to optimize it...

Download pictures

Similarly, I wrote an ImageCrawler class:

class ImageCrawler:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(headers)  # reuse the same headers as the article crawler
        self.images_dir = os.path.join('output', 'images')
        os.makedirs(self.images_dir, exist_ok=True)

    def download_image(self, url):
        # filename: timestamp + format guessed from the URL
        img_path = os.path.join(self.images_dir, f'{time.time_ns()}.{extract_image_format_re(url)}')
        img_fullpath = os.path.join(os.getcwd(), img_path)

        try:
            response = self.session.get(url)
            with open(img_fullpath, 'wb') as f:
                f.write(response.content)

            return img_path
        except Exception as e:
            logger.error(e)

        return None

This code is much simpler; it just downloads the image.

I used a timestamp for the file name of the image.
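The filename also appends an extension returned by an extract_image_format_re helper that the article doesn't show; here is a hedged sketch of what it might look like, assuming WeChat-style image URLs that carry the format in a wx_fmt query parameter or in the path:

import re


def extract_image_format_re(url: str) -> str:
    """Hypothetical sketch: guess the image format from a WeChat image URL."""
    # e.g. https://mmbiz.qpic.cn/mmbiz_jpg/xxxx/640?wx_fmt=jpeg
    match = re.search(r'wx_fmt=(\w+)', url)
    if match:
        return match.group(1)
    match = re.search(r'mmbiz_(\w+)/', url)
    if match:
        return match.group(1)
    return 'jpg'  # fall back when no format hint is found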

But actually collecting the pictures is not that simple.

Next, write a method to download the images:

async def download_images():
    images = await Image.filter(is_downloaded=False)

    if not images:
        logger.info('no images to download')
        return

    c = ImageCrawler()

    with tqdm(
            total=len(images),
            ncols=100,
            desc="⬇ download images",
            colour='green',
            unit="image",
            bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} images [{rate_fmt}]',
    ) as pbar:
        for image in images:
            image: Image
            img_path = c.download_image(image.url)
            if not img_path:
                continue

            image.is_downloaded = True
            image.local_file = img_path
            await image.save()

            pbar.update(1)
            time.sleep(random.randint(1, 3))

It filters for images that have not been downloaded yet; after each download it updates the database, saving the image's local path.

Run the program

Finally, the various parts of the program need to be strung together like candied haws.

Since asynchronous code is used this time, things look a little different.

async def main():
    await init()
    await extract_images_from_articles()
    await download_images()

Finally, call it at the program entry:

if __name__ == '__main__':
    run_async(main())

The run_async helper is provided by tortoise-orm; it waits for the async function to complete and then cleans up the database connections.
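The init() called in main() is not shown in the article; here is a minimal sketch of what it might look like with tortoise-orm, assuming a local SQLite file and a module named models:

from tortoise import Tortoise


async def init():
    await Tortoise.init(
        db_url='sqlite://db.sqlite3',
        modules={'models': ['models']},
    )
    await Tortoise.generate_schemas()  # create missing tables (aerich can manage real migrations)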

Development records

I exported the git commit records and organized them into this development log.

Date & Time | Message
2025-01-18 19:02:21 | 🍹image_crawler minor modifications
2025-01-18 18:09:11 | 🍹Updated cookies; the crawl_articles method adds a fake function; after the crawl_list method is completed, it will display how many articles have been updated
2025-01-12 15:48:15 | 🥤hash_size has been changed to 32, but I feel that the speed has not changed much.
2025-01-12 15:13:06 | 🍟Added multiple hash algorithm support
2025-01-12 15:00:43 | 🍕The image analysis script is completed, and now the image information is completely filled in
2025-01-11 23:41:14 | 🌭Fixed a bug, you can download it all the time tonight
2025-01-11 23:36:46 | 🍕The logic of downloading images has been completed (not tested); the pillow and imagehash libraries have been added, and the image recognition function will be done later. Download it first.
2025-01-11 23:25:26 | 🥓Preliminary reconstruction of the image crawler, extracting the image link from the article html; I want to use aerich for migration, but it has not been completed yet
2025-01-11 22:27:04 | 🍔Another function completed: collecting the HTML of articles and storing them in the database
2025-01-11 21:19:19 | 🥪Successfully transformed article_crawler to use tortoise-orm

How to export such records?

Use the git command to export the commit records:

git log --pretty=format:"- %s (%ad)" --date=iso

The Markdown list format is used here.

After generation, you can adjust it into a table according to your needs.

Summary

There is not much to say about crawlers themselves. This kind of simple crawler is easy to learn; I am not boasting when I say I could write it in any language. After all, crawlers are entry-level content in many programming courses, so they are not difficult. What is interesting is that every time I write a crawler, I pair it with something new: a different technology stack, or even a different device (just like when I put a crawler on a mobile phone). Maybe in the future I could run a crawler on a microcontroller? (It seems not feasible; the memory and storage space are too small. A Raspberry Pi would be fine, but that is basically a small server.)

PS: I should try writing a crawler in Rust.

After working with Channels, I continued writing asynchronous Python code. Async is indeed nice; unfortunately, Python's async support arrived relatively late and the ecosystem is not yet fully mature.

I really hope the Django ORM gets full async support soon, so that I can happily use it with Channels...