Preface
Many years ago, back in college, I wrote a similar article; at that time I was collecting nice wallpapers from a certain game's official website.
Recently, WeChat official accounts keep recommending various wallpapers to me. Many of them look great, but saving them one by one is too much trouble, so I simply wrote a crawler to download them automatically.
Features of this crawler
Here is a brief list of the features involved in this project. Not all of them are covered in this article, which focuses mainly on the crawler part.
If anyone is interested in the other features, I will share them in a later post.
- Get all articles of the specified official account
- Download wallpapers that comply with the rules in the article
- Filter out irrelevant images, such as the small "follow us" icons
- Data persistence (trying an asynchronous ORM and a lightweight NoSQL store)
- Image analysis (size information, perceptual hashing, file MD5)
- Show a progress bar for every long-running step, which is very user-friendly
Related crawler articles
I have written quite a few crawler-related articles over the past few years.
- Write a crawler to automatically download beautiful wallpapers from a certain game official website
- Selenium crawler practice: intercept pictures on web pages
- Selenium crawler practice (pitfall records): ajax request packet capture, browser exit
- Crawler notes: improve data collection efficiency! Using a proxy pool and a thread pool
- C# crawler development summary
- Put the crawler on your phone and run! A preliminary study on the Flutter crawler framework~
Project structure
As usual, the pdm tool is used for dependency management.
The dependencies used in this project:
dependencies = [
"requests>=2.32.3",
"bs4>=0.0.2",
"loguru>=0.7.3",
"tqdm>=4.67.1",
"tinydb>=4.8.2",
"pony>=0.7.19",
"tortoise-orm[aiosqlite]>=0.23.0",
"orjson>=3.10.14",
"aerich[toml]>=0.8.1",
"pillow>=11.1.0",
"imagehash>=4.3.1",
]
There is also a dev dependency for inspecting the database (I tried a lightweight NoSQL store, which has no GUI viewer, so JupyterLab is used to poke at it).
[dependency-groups]
dev = [
"jupyterlab>=4.3.4",
]
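For reference, dependencies like these are typically added with pdm commands along the following lines (a sketch; the exact group flags depend on your pdm version):

pdm add requests bs4 loguru tqdm tinydb pony "tortoise-orm[aiosqlite]" orjson "aerich[toml]" pillow imagehash
pdm add -dG dev jupyterlab
pdm install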
Data persistence
Every time I do this kind of project, I try different data persistence solutions.
For relational databases, I used the peewee ORM last time.
I later found that its main problem was the lack of automatic migrations (it may be supported now; it was a few years ago when I used it).
Everything else about it was fine.
This time I skipped persistence at the beginning, but a few shutdowns lost my progress, and writing a pile of matching rules to recover it was a real pain.
So later I refactored everything.
I tried tinydb (a single-file document NoSQL store), pony (a relational ORM), and tortoise-orm.
In the end I chose tortoise-orm because its syntax is very similar to Django ORM, and I didn't want to step out of my comfort zone.
Model definition
from tortoise import fields
from tortoise.models import Model


class Article(Model):
    id = fields.IntField(primary_key=True)
    raw_id = fields.CharField(max_length=64)
    title = fields.CharField(max_length=255)
    url = fields.CharField(max_length=1024)
    created_at = fields.DatetimeField()
    updated_at = fields.DatetimeField()
    html = fields.TextField()
    raw_json = fields.JSONField()

    def __str__(self):
        return self.title


class Image(Model):
    id = fields.IntField(primary_key=True)
    article = fields.ForeignKeyField('models.Article', related_name='images')
    url = fields.CharField(max_length=1024)
    is_downloaded = fields.BooleanField(default=False)
    downloaded_at = fields.DatetimeField(null=True)
    local_file = fields.CharField(max_length=512, null=True)
    size = fields.IntField(null=True, description='unit: bytes')
    width = fields.IntField(null=True)
    height = fields.IntField(null=True)
    image_hash = fields.CharField(max_length=128, null=True)
    md5_hash = fields.CharField(max_length=64, null=True)

    def __str__(self):
        return self.url
These two models cover all the needs of this project and can even support follow-up features such as similar-image detection and image classification.
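Since aerich is already in the dependencies, schema migrations can be wired up roughly like this (a sketch, assuming a TORTOISE_ORM config dict exposed in a settings module; the project's actual setup may differ):

pdm run aerich init -t settings.TORTOISE_ORM
pdm run aerich init-db
# after changing the models
pdm run aerich migrate
pdm run aerich upgrade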
Get all articles of the specified official account
This approach requires having an official account of your own.
The article list is obtained through the "insert hyperlink" feature in the official account editor.
See the reference materials for the detailed steps.
Preparation
Only the key points are listed here. After opening the hyperlink dialog, press F12 to capture the requests.
Mainly look at /cgi-bin/appmsg.
From this interface you need to extract:
- Cookie
- token
- fakeid - the official account's ID, base64 encoded
The first two change every time you log in. You could use selenium with a local proxy to capture the requests and update them automatically; for details, see my earlier article: Selenium crawler practice (pitfall records): ajax request packet capture, browser exit
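For reference, the query parameters sent to /cgi-bin/appmsg generally look like the following. These field names come from what a typical packet capture shows and are assumptions here, so verify them against your own capture:

```python
payload_data = {
    "action": "list_ex",
    "begin": "0",    # paging offset
    "count": "5",    # articles per page
    "fakeid": "<base64-encoded account ID>",
    "type": "9",
    "query": "",
    "token": "<token from the logged-in session>",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
}
```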
Code implementation
I encapsulated the operations in a class.
import math
import random
import time
from datetime import datetime

import requests
from loguru import logger
from tortoise.expressions import Q
from tqdm import tqdm


class ArticleCrawler:
    def __init__(self):
        self.url = "Interface address, based on the packet capture address"
        self.token = ""
        self.headers = {
            "Cookie": "Cookie obtained from the packet capture",
            "User-Agent": "Fill in an appropriate UA",
        }
        self.payload_data = {}  # Based on the data obtained from the actual packet capture
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def fetch_html(self, url):
        """Get article HTML"""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            logger.error(f"Failed to fetch HTML for {url}: {e}")
            return None

    @property
    def total_count(self):
        """Get the total number of articles"""
        content_json = self.session.get(self.url, params=self.payload_data).json()
        try:
            count = int(content_json["app_msg_cnt"])
            return count
        except Exception as e:
            logger.error(e)
            logger.info(f'response json: {content_json}')
            return None

    async def crawl_list(self, count, per_page=5):
        """Get the article list and store it in the database"""
        logger.info(f'Getting article list, total count: {count}')
        created_articles = []
        pages = math.ceil(count / per_page)
        for i in tqdm(range(pages), ncols=100, desc="Get article list"):
            payload = self.payload_data.copy()
            payload["begin"] = str(i * per_page)
            resp_json = self.session.get(self.url, params=payload).json()
            articles = resp_json["app_msg_list"]
            # Store in the database
            for item in articles:
                # Check if it already exists to avoid duplicate inserts
                if await Article.filter(raw_id=item['aid']).exists():
                    continue
                created_item = await Article.create(
                    raw_id=item['aid'],
                    title=item['title'],
                    url=item['link'],
                    created_at=datetime.fromtimestamp(item["create_time"]),
                    updated_at=datetime.fromtimestamp(item["update_time"]),
                    html='',
                    raw_json=item,
                )
                created_articles.append(created_item)
            time.sleep(random.randint(3, 6))
        logger.info(f'created articles: {len(created_articles)}')

    async def crawl_all_list(self):
        return await self.crawl_list(self.total_count)

    async def crawl_articles(self, fake=False):
        # Filter out the wallpaper articles according to your own rules
        qs = (
            Article.filter(title__icontains='Wallpaper')
            .filter(Q(html='') | Q(html__isnull=True))
        )
        count = await qs.count()
        logger.info(f'Number of eligible articles without HTML: {count}')
        if fake:
            return
        with tqdm(
            total=count,
            ncols=100,
            desc="⬇ Downloading articles",
            # Optional colours: hex (#00ff00), BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE
            colour='green',
            unit="page",
            bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} pages [{rate_fmt}]',
        ) as pbar:
            async for article in qs:
                article: Article
                html = self.fetch_html(article.url)
                if html:  # skip saving when the download failed
                    article.html = html
                    await article.save()
                pbar.update(1)
                time.sleep(random.randint(2, 5))
What does this code do?
Or rather, what capabilities does this class have?
- Get the total number of articles of the specified official account
- Page through the account's article list, fetching each article's title, URL, and content
- Save the articles to the database
Code analysis
The key method is crawl_list.
Frankly, the code is fairly rough: there is no error handling, and the database is queried on every loop iteration, so performance is certainly not great.
A better approach would be to read the article IDs already in the database up front, so the loop does not have to query the database each time; see the sketch below.
But it is a simple crawler, so I have not optimized it.
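As a rough sketch of that idea (not part of the project code), the stored IDs can be fetched once before the paging loop:

```python
async def load_existing_ids() -> set:
    """Fetch the raw_ids of all stored articles in one query."""
    return set(await Article.all().values_list('raw_id', flat=True))
```

Inside crawl_list, the per-item `await Article.filter(...).exists()` call can then be replaced by a plain `item['aid'] in existing_ids` membership check.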
Each iteration then calls time.sleep(random.randint(3, 6)) to pause for a random number of seconds.
Progress bar
The tqdm library is used for the progress bars (the Python ecosystem has some simpler progress bar libraries that I have used before, but most of them are wrappers around tqdm).
The bar_format parameter customizes the layout of the progress bar, so it can show the number of items processed, the total, the processing rate, and so on:
- {l_bar} is the left part of the progress bar, containing the description and percentage.
- {bar} is the actual progress bar.
- {n_fmt}/{total_fmt} displays the current progress and total.
- {rate_fmt} displays the processing rate.
Parse web pages
So far we have only downloaded the articles' HTML; we still have to extract the image URLs from the pages.
So we need a parsing method.
from bs4 import BeautifulSoup


def parse_html(html: str) -> list:
    soup = BeautifulSoup(html, 'html.parser')
    # WeChat article images carry the real URL in the data-src attribute;
    # the class name may differ, adjust the selector to the actual article markup
    img_elements = soup.select('img.wxw-img')
    images = []
    for img_element in img_elements:
        img_url = img_element['data-src']
        images.append(img_url)
    return images
Simply use a CSS selector to extract the images.
Extract images
Remember that the models include an Image, right?
It has not been used so far.
In this section we extract the images and store them in the database.
async def extract_images_from_articles():
    # Write the query according to your actual situation
    qs = (
        Article.filter(title__icontains='Wallpaper')
        .exclude(Q(html='') | Q(html__isnull=True))
    )
    article_count = await qs.count()
    with tqdm(
        total=article_count,
        ncols=100,
        desc="⬇ extract images from articles",
        colour='green',
        unit="article",
        bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} articles [{rate_fmt}]',
    ) as pbar:
        async for article in qs:
            article: Article
            images = parse_html(article.html)
            for img_url in images:
                # Check if it already exists to avoid duplicate inserts
                if await Image.filter(url=img_url).exists():
                    continue
                await Image.create(
                    article=article,
                    url=img_url,
                )
            pbar.update(1)
    logger.info(f'article count: {article_count}, image count: {await Image.all().count()}')
This method first reads the articles from the database, then extracts the images from each article's HTML, and finally stores all the images in the database.
The code again queries the database repeatedly inside the loop, but I was too lazy to optimize it (a sketch of what that could look like follows)...
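For what it's worth, here is a sketch of that optimization (not part of the project code): load the known URLs once, then insert each article's new images in a single batch with bulk_create.

```python
async def store_images(article: Article, known_urls: set) -> None:
    """Insert all new images of one article in a single batch."""
    new_images = [
        Image(article=article, url=img_url)
        for img_url in parse_html(article.html)
        if img_url not in known_urls
    ]
    known_urls.update(img.url for img in new_images)
    await Image.bulk_create(new_images)
```

The known_urls set would be built once up front, e.g. `set(await Image.all().values_list('url', flat=True))`.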
Download images
Similarly, I wrote an ImageCrawler class.
import os
import time

import requests
from loguru import logger


class ImageCrawler:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(headers)  # the same request headers defined elsewhere
        self.images_dir = os.path.join('output', 'images')
        os.makedirs(self.images_dir, exist_ok=True)

    def download_image(self, url):
        # Use a timestamp as the file name; extract_image_format_re pulls the image format from the URL
        img_path = os.path.join(self.images_dir, f'{time.time_ns()}.{extract_image_format_re(url)}')
        img_fullpath = os.path.join(os.getcwd(), img_path)
        try:
            response = self.session.get(url)
            response.raise_for_status()
            with open(img_fullpath, 'wb') as f:
                f.write(response.content)
            return img_path
        except Exception as e:
            logger.error(e)
            return None
This code is much simpler; it just downloads an image.
I used a timestamp as the image file name.
But actually collecting the images is not quite that simple.
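The extract_image_format_re helper used above is not shown in this article. Here is a minimal sketch of what it might look like, assuming the image format is carried in the wx_fmt query parameter of the URL (as it usually is for WeChat image links):

```python
import re


def extract_image_format_re(url: str) -> str:
    """Guess the image file extension from the URL, falling back to jpg."""
    match = re.search(r'wx_fmt=(\w+)', url)
    return match.group(1) if match else 'jpg'
```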
Next, write a method to download pictures
async def download_images():
    images = await Image.filter(is_downloaded=False)
    if not images:
        logger.info('no images to download')
        return
    c = ImageCrawler()
    with tqdm(
        total=len(images),
        ncols=100,
        desc="⬇ download images",
        colour='green',
        unit="image",
        bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} images [{rate_fmt}]',
    ) as pbar:
        for image in images:
            image: Image
            img_path = c.download_image(image.url)
            if not img_path:
                continue
            image.is_downloaded = True
            image.local_file = img_path
            await image.save()
            pbar.update(1)
            time.sleep(random.randint(1, 3))
Filter out the images that have not been downloaded yet, download them, and then update the database with each image's download path.
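The later image analysis step (filling in size, width/height, perceptual hash, and MD5, which the Image model already anticipates) is not covered in this article. A minimal sketch of what it could look like with pillow, imagehash, and hashlib; the project's actual script may differ:

```python
import hashlib
import os

import imagehash
from PIL import Image as PILImage


async def analyze_images():
    """Fill in size, dimensions, perceptual hash and MD5 for downloaded images."""
    async for image in Image.filter(is_downloaded=True, width__isnull=True):
        path = image.local_file
        image.size = os.path.getsize(path)
        with PILImage.open(path) as im:
            image.width, image.height = im.size
            image.image_hash = str(imagehash.phash(im))
        with open(path, 'rb') as f:
            image.md5_hash = hashlib.md5(f.read()).hexdigest()
        await image.save()
```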
Run the program
Finally, the various parts of the program need to be strung together like candied haws on a stick.
Since the project uses async code this time, things look a little different.
async def main():
    await init()
    await extract_images_from_articles()
    await download_images()
Finally, call it at the program entry point:
from tortoise import run_async

if __name__ == '__main__':
    run_async(main())
The run_async helper is provided by tortoise-orm; it waits for the coroutine to finish and then cleans up the database connections.
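The init() function is not shown above. Here is a minimal sketch of what it could look like; the database URL and the models module path are assumptions, so adjust them to the actual project layout:

```python
from tortoise import Tortoise


async def init():
    await Tortoise.init(
        db_url='sqlite://db.sqlite3',    # assumed database location
        modules={'models': ['models']},  # assumed module containing Article and Image
    )
    # Create the tables on first run (aerich can take over migrations later)
    await Tortoise.generate_schemas()
```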
Development records
I exported the git commit log and tidied it up into the development record table below.
| Date & Time | Message |
|---|---|
| 2025-01-18 19:02:21 | 🍹 Minor tweaks to image_crawler |
| 2025-01-18 18:09:11 | 🍹 Updated cookies; crawl_articles gains a fake option; crawl_list now reports how many articles were added |
| 2025-01-12 15:48:15 | 🥤 Changed hash_size to 32, though the speed does not seem to have changed much |
| 2025-01-12 15:13:06 | 🍟 Added support for multiple hash algorithms |
| 2025-01-12 15:00:43 | 🍕 Image analysis script completed; image information is now fully filled in |
| 2025-01-11 23:41:14 | 🌭 Fixed a bug; it can keep downloading all night |
| 2025-01-11 23:36:46 | 🍕 Image download logic completed (not tested); added the pillow and imagehash libraries, image recognition will come later, download first |
| 2025-01-11 23:25:26 | 🥓 Preliminary refactor of the image crawler, extracting image links from the article HTML; want to use aerich for migrations, not finished yet |
| 2025-01-11 22:27:04 | 🍔 Another feature done: collecting article HTML and storing it in the database |
| 2025-01-11 21:19:19 | 🥪 Migrated article_crawler to tortoise-orm |
How do you export records like these?
Use the git log command to export the commit history:
git log --pretty=format:"- %s (%ad)" --date=iso
This uses the markdown list format.
After generating it, you can adjust it into a table as needed.
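If you want the table rows directly, the format string can emit them as well, for example:

git log --pretty=format:"| %ad | %s |" --date=iso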
Summary
There is not much to say about crawlers themselves. Simple ones like this are easy to learn (not boasting), and I could write one in any language; after all, crawlers are entry-level material in many programming courses, so they are not difficult. What makes it interesting is that every time I write a crawler I pair it with something new to try, or use a different technology stack or even a different device (like when I put a crawler on a mobile phone). Maybe in the future I could make a crawler run on a microcontroller? (Probably not feasible: the memory and storage are too small. A Raspberry Pi would work, but that is really just a small server.)
PS: I should try writing a crawler in Rust.
After working with channels, I continued writing asynchronous Python code. Async is indeed nice; unfortunately, Python's async support arrived relatively late and the ecosystem is still not complete.
I really hope the Django ORM gets full async support soon, so I can happily use it with channels...