Summary: This article records, step by step, the process of building an image classification model from scratch with PyTorch, covering convolutional neural networks (CNNs), data preprocessing, model design, training, debugging, and optimization. Working through the CIFAR-10 dataset, and drawing on both classic literature and research trends current in 2025, it digs into the technical details and walks through the complete, runnable source code and its results. I chose to build an image classifier with PyTorch not only out of interest in deep learning, but also because the framework remains popular in the technology community in 2025. Through this practice I hope to master the core principles of CNNs while recording the process as a reference for other beginners.
Keywords: PyTorch, Image Classification, CNN, Deep Learning, Model Optimization, CIFAR-10, Debugging Experience, Frontier Trends
Introduction
Background
Image classification is a core task of computer vision, widely used in fields such as autonomous driving, medical image analysis, and face recognition.
Deep learning, especially the convolutional neural network (CNN), has greatly advanced this field. In 2012, AlexNet dramatically reduced the classification error rate in the ImageNet challenge, marking the beginning of the deep learning era.
Since then, model architectures have evolved from ResNet to the Vision Transformer (ViT), with performance improving steadily. By 2025, stronger hardware and framework-level optimizations in PyTorch have made image classification tasks far more efficient: on a dataset such as CIFAR-10, a simple CNN can be trained on a modern GPU in a few minutes, a clear sign of the technology's progress.
Objectives and structure of this article
- Objective: Build and optimize a CNN to complete the CIFAR-10 classification task.
- Structure: Theoretical explanation → Source code implementation → Debugging and analysis → Optimization and prospect.
Convolutional Neural Network (CNN)
Why do we need CNNs?
Traditional artificial neural networks (ANNs) require a huge number of parameters for image processing and are computationally expensive. CNNs reduce the amount of computation and improve model capability through local receptive fields, weight sharing, and hierarchical feature extraction.
CNN development began with LeNet-5 (1998), followed by AlexNet (2012), VGG (2014), and ResNet (2015), each greatly advancing computer vision.
The convolutional neural network is the cornerstone of image tasks in deep learning. Its core is extracting spatial features through convolution operations, optimizing parameters layer by layer, and gradually building up high-level semantic representations.
The core composition of CNN
1. Convolutional Layer: "Smart Filter" for Feature Extraction
The core of a CNN is the convolution operation: a small sliding window (the convolution kernel) scans the input image, computes a weighted sum over each local region, and extracts low-level features such as edges and textures.
Convolution has parameters such as padding and stride, and, like a fully connected layer, it also has a bias term. The convolution operation is equivalent to the "filter operation" in classical image processing.
The convolution operation first arranges the input data. Besides the height and width directions, the data also has a channel direction, forming three-dimensional data ordered as (channel, height, width). Adding a batch dimension, the data is stored as four-dimensional data in the order (batch_num, channel, height, width).
Padding and stride then control how the data is processed. Padding fills the border of the input with zeros, mainly to adjust the output size; without it, each convolution shrinks the spatial dimensions until further convolutions become impossible. The stride is the interval at which the kernel moves at each step, and together with the kernel size it determines the size of the output matrix.
The convolution kernel, also called a filter, carries out the actual computation: it is the "filter" tuned to a specific target (detecting edges, focusing on texture, and so on).
Finally, a bias is added before the data is output: a single value applied uniformly to every element, like adding the same bonus points to the whole class.
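To make the effect of padding and stride concrete, here is the standard output-size formula for convolution. For an input of size (H, W), kernel size (FH, FW), padding P, and stride S:

OH = (H + 2P - FH) / S + 1
OW = (W + 2P - FW) / S + 1

For example, a 32x32 input with a 3x3 kernel, padding 1, and stride 1 gives OH = (32 + 2 - 3)/1 + 1 = 32: the feature map size is preserved, which is exactly the configuration used by the model later in this article.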
Simply put, convolution is like sliding a "magnifying glass" over an image, focusing on one small patch at a time to find the key clues.
The convolution process also resembles "blind men touching an elephant". For example, looking at a photo of a cat, the first convolution may find the edges of the ears, the next one the texture of the fur, and finally the pieces are assembled into the complete outline of a "cat". This progressive process is the charm of convolution.
① Input data and convolution kernel
Suppose the input image is a 4x4 grayscale matrix (single channel), the convolution kernel is 3x3, the stride is 1, and there is no padding.
- Input matrix
- Convolution kernel (filter)
② Calculation process
The convolution slides the kernel over the input matrix and computes a dot product block by block.
With no padding and stride 1, the output feature map size follows the formula above: (4 - 3)/1 + 1 = 2 in each direction, i.e. 2x2.
Formula: for input I and convolution kernel K, the output O[i, j] is

O[i, j] = Σ_{m=0}^{2} Σ_{n=0}^{2} I[i + m, j + n] · K[m, n]
③ Manual calculation
- Top left: 2+0+3+0+1+4+3+0+2 = 15
- Top right: 4+0+0+0+2+6+0+0+4 = 16
- Lower left: 0+0+2+0+0+2+2+0+0 = 6
- Lower right: 2+0+3+0+1+4+3+0+2 = 15
The output feature map is:
[[15, 16],
 [ 6, 15]]
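A calculation like this is easy to verify with PyTorch's F.conv2d. The 4x4 input and 3x3 kernel below are made-up values for illustration (the original matrices were shown as images), so the numbers differ from the worked example above, but the mechanics are identical:

```python
import torch
import torch.nn.functional as F

# Hypothetical 4x4 input and 3x3 kernel (illustrative values only).
# conv2d expects 4-D tensors shaped (batch, channel, height, width).
I = torch.tensor([[1., 2., 0., 3.],
                  [4., 1., 2., 0.],
                  [0., 3., 1., 2.],
                  [2., 0., 4., 1.]]).reshape(1, 1, 4, 4)
K = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 1.]]).reshape(1, 1, 3, 3)

# Stride 1, no padding: output size is (4 - 3)/1 + 1 = 2, i.e. 2x2.
O = F.conv2d(I, K, stride=1, padding=0)
print(O.squeeze())
```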
2. Pooling Layer: dimensionality reduction and focus on key information
Pooling is an operation that shrinks the feature map in the height and width directions. It compresses feature maps and reduces the amount of computation while retaining the important information.
LeCun's LeNet-5 introduced Average Pooling: take the average of the target region.
AlexNet popularized Max Pooling: take the maximum of the target region.
Imagine looking at a painting while squinting: pooling keeps only the brightest highlights (the maximum values) and ignores the fine detail. This not only reduces the resolution but also makes the network focus on distinctive features, such as the whiskers on a cat's face rather than background noise.
Modern uses of pooling are more flexible and reflect the evolution of CNNs. For example, Global Average Pooling is often used in place of the fully connected layer to improve generalization (a short sketch follows).
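As a small aside illustrating that trend (this is not part of the model built later in this article), global average pooling collapses each feature map to a single number, so only a tiny linear head is needed for classification:

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(32, 64, 8, 8)  # (batch, channels, H, W)

gap = nn.AdaptiveAvgPool2d(1)             # average each 8x8 map down to 1x1
pooled = gap(feature_maps).flatten(1)     # -> shape (32, 64)
logits = nn.Linear(64, 10)(pooled)        # classifier head: 64 -> 10 classes
print(pooled.shape, logits.shape)         # torch.Size([32, 64]) torch.Size([32, 10])
```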
Example of max pooling (2×2 window, stride 2):
Applying a 2×2 window with stride 2 to a 4×4 input splits it into 4 regions and produces a 2×2 output:
- Top left: [1, 2, 5, 6], maximum 6
- Top right: [2, 4, 8, 7], maximum 8
- Lower left: [9, 10, 13, 15], maximum 15
- Lower right: [12, 11, 14, 16], maximum 16
The pooled output feature map is:
[[ 6,  8],
 [15, 16]]
The above is max pooling. For average pooling, take the mean of each region instead; the procedure is otherwise identical.
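The four regions listed above pin down the 4x4 input exactly, so the example can be checked in code with F.max_pool2d (a short sketch; the input tensor below is reconstructed from those regions):

```python
import torch
import torch.nn.functional as F

# 4x4 input reconstructed from the four 2x2 regions above,
# shaped (batch, channel, height, width) for max_pool2d.
x = torch.tensor([[ 1.,  2.,  2.,  4.],
                  [ 5.,  6.,  8.,  7.],
                  [ 9., 10., 12., 11.],
                  [13., 15., 14., 16.]]).reshape(1, 1, 4, 4)

out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out.squeeze())
# tensor([[ 6.,  8.],
#         [15., 16.]])
```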
3. Activation Function: Granting Nonlinear Expressive Power
Features from convolution and pooling must pass through an activation function to introduce nonlinearity; otherwise the network can only learn linear transformations.
ReLU is the most commonly used nonlinearity in CNNs: ReLU(x) = max(0, x).
It is computationally efficient and alleviates the vanishing gradient problem.
ReLU works like a "switch": negative values are cut off and positive values pass through, allowing the network to learn more complex patterns, such as distinguishing the different outlines of cats and dogs.
Modern variants such as Leaky ReLU and Swish exist, but ReLU is still the starting point for understanding nonlinearity in CNNs. Intuitively, ReLU keeps only the "useful evidence" when sifting through clues and discards the rest.
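A two-line sketch makes the "switch" behavior concrete, alongside the Leaky ReLU variant mentioned above:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))                        # negatives clipped to zero: [0, 0, 0, 1.5, 3]
print(F.leaky_relu(x, negative_slope=0.1))  # negatives are shrunk, not zeroed
```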
4. Fully Connected Layer (FC): Feature integration and classification
After convolution and pooling, the CNN flattens the features and feeds them into a fully connected layer for classification or regression.
The fully connected layer is like the brain's "decision center", combining all the clues into a final judgment: is this picture a "cat" or a "dog"?
Modern architectures, however, gradually reduce the reliance on fully connected layers; ResNet (2015), for example, simplifies the output with global average pooling to mitigate the risk of overfitting.
Applications of CNN in Computer Vision
- Image classification (AlexNet, ResNet)
- Object detection (YOLO, Faster R-CNN)
- Semantic segmentation (U-Net, DeepLab)
- Medical image analysis (CT diagnosis)
Practical environment and preparation
Tools and dependencies
- Python 3.9, PyTorch 2.6, torchvision 0.21 (full environment details below).
- Hardware: a Mac with 16GB RAM, set up via Homebrew; training in this walkthrough runs on the CPU (see the run log below).
Anaconda environment configuration
1. Install Anaconda
Install Anaconda using Homebrew:
brew install --cask anaconda
2. Initialize Conda
Add conda to PATH:
echo 'export PATH="/opt/homebrew/anaconda3/bin:$PATH"' >> ~/.zshrc
Initialize conda to use in zsh:
conda init zsh
Reload shell configuration:
source ~/.zshrc
3. Create and configure project environment
Create a new conda environment:
conda create -n simpletorch python=3.9 -y
Activate the environment:
conda activate simpletorch
Install the project dependencies (assuming the project ships a requirements.txt):
pip install -r requirements.txt
4. Verify the environment
Check Python version and installed packages:
python --version
pip list | grep -E "torch|numpy|matplotlib"
5. Commonly used commands
- Activate the environment:
conda activate simpletorch
- Exit environment:
conda deactivate
- View all environments:
conda env list
- Delete the environment:
conda env remove -n simpletorch
6. Environment information
- Environment name: simpletorch
- Python version: 3.9.21
- Main package versions:
- PyTorch 2.6.0
- torchvision 0.21.0
- NumPy 2.0.2
- Matplotlib 3.9.4
Dataset selection and loading
Overview of the CIFAR-10 dataset
- Dataset size: 50,000 training images (plus 10,000 test images)
- Image size: 3x32x32 (RGB)
- Categories: 10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
Data file content
The files in the data directory are the CIFAR-10 dataset files, downloaded automatically by datasets.CIFAR10 when the training script or show_dataset.py runs. Specifically, these files come from:
- Official page: https://www.cs.toronto.edu/~kriz/cifar.html
- When the code executes this line:
  trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
- root='./data' specifies that the data is downloaded into a data folder under the current directory
- download=True means the dataset is downloaded automatically if it is not already present
- The downloaded archive is automatically extracted into the data/cifar-10-batches-py/ directory
The downloaded files include:
- data_batch_1 to data_batch_5: training data
- batches.meta: category metadata
- test_batch: test data (not used at this stage)
These files are part of the dataset and normally need no manual management; PyTorch handles the download and extraction automatically. To re-download the dataset:
- Delete the data directory
- Rerun the program, and it will download again automatically
Data preprocessing
```python
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
```
ToTensor converts each image to a float tensor in [0, 1]; Normalize then maps every channel through (x - 0.5) / 0.5, rescaling the values to [-1, 1].
Dataset visualization
Run the following command to view the dataset samples and distribution:
python show_dataset.py
This generates two visualization files:
- cifar10_samples.png: sample images from the dataset
- class_distribution.png: the class distribution
Source code analysis
```python
import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt


def show_dataset():
    # Data preprocessing
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # Load the training set
    print("Loading CIFAR-10 dataset...")
    trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

    # Define categories
    classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

    # Display dataset information
    print(f"\nDataset size: {len(trainset)} images")
    print(f"Image size: {trainset[0][0].shape}")
    print(f"Number of classes: {len(classes)}")

    # Show sample pictures
    plt.figure(figsize=(15, 5))
    for i in range(5):
        img, label = trainset[i]
        img = img / 2 + 0.5                 # undo the normalization back to [0, 1]
        img = img.numpy()
        plt.subplot(1, 5, i + 1)
        plt.imshow(img.transpose(1, 2, 0))  # (C, H, W) -> (H, W, C)
        plt.title(f'Class: {classes[label]}')
        plt.axis('off')
    plt.tight_layout()
    plt.savefig('cifar10_samples.png')

    # Count the number of samples per class
    class_counts = torch.zeros(len(classes))
    for _, label in trainset:
        class_counts[label] += 1
    plt.figure(figsize=(10, 5))
    plt.bar(classes, class_counts)
    plt.title('Number of samples per class')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('class_distribution.png')

    print("\nVisualization files saved:")
    print("- cifar10_samples.png")
    print("- class_distribution.png")


if __name__ == '__main__':
    show_dataset()
```
Running results
Loading CIFAR-10 dataset...
Dataset size: 50000 images
Image size: torch.Size([3, 32, 32])
Number of classes: 10
Visualization files saved:
- cifar10_samples.png
- class_distribution.png
Model design and source code analysis
Network structure design
- Structure: 2 convolutional layers + 2 pooling layers + 1 fully connected layer.
- Source code:

```python
import torch
import torch.nn as nn


class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        # First convolutional layer
        # Input: 3 channels (RGB) -> output: 16 feature maps
        # e.g. one 3x32x32 color image -> 16 feature maps of 32x32
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)   # 448 parameters
        # Second convolutional layer
        # Input: 16 feature maps -> output: 32 feature maps
        # e.g. 16 maps of 16x16 -> 32 maps of 16x16
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)  # 4,640 parameters
        # Fully connected layer
        # Input: 32 * 8 * 8 = 2048 features -> output: 10 classes
        # e.g. 32 flattened 8x8 feature maps -> scores for the 10 classes
        self.fc1 = nn.Linear(32 * 8 * 8, 10)          # 20,490 parameters

    def forward(self, x):
        # Input shape: [batch_size, 3, 32, 32]
        # e.g. [32, 3, 32, 32] is a batch of 32 RGB images of 32x32
        # First convolution + ReLU
        # Output shape: [batch_size, 16, 32, 32]
        x = torch.relu(self.conv1(x))
        # Max pooling halves the feature map size
        # Output shape: [batch_size, 16, 16, 16]
        x = torch.max_pool2d(x, 2)
        # Second convolution + ReLU
        # Output shape: [batch_size, 32, 16, 16]
        x = torch.relu(self.conv2(x))
        # Max pooling halves the feature map size again
        # Output shape: [batch_size, 32, 8, 8]
        x = torch.max_pool2d(x, 2)
        # Flatten the feature maps into a one-dimensional vector per image
        # Output shape: [batch_size, 32 * 8 * 8] = [batch_size, 2048]
        x = x.view(x.size(0), -1)
        # Fully connected layer produces the class predictions
        # Output shape: [batch_size, 10]
        x = self.fc1(x)
        return x
```
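A quick sanity check (not in the original source, just a sketch) confirms the parameter counts quoted in the comments and the output shape:

```python
model = SimpleNet()
total = sum(p.numel() for p in model.parameters())
print(total)  # 25578 = 448 + 4640 + 20490

dummy = torch.randn(4, 3, 32, 32)  # a fake batch of four CIFAR-10-sized images
print(model(dummy).shape)          # torch.Size([4, 10])
```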
Parameter and design analysis
- Overall parameters: this is a model that is "light on feature extraction, heavy on classification", with 25,578 parameters in total
  - The fully connected layer accounts for most of them (about 80%)
  - The convolutional layers account for relatively few (about 20%)
  - The total parameter count is moderate, well suited to introductory learning
- Convolution kernel: 3x3
  - Computationally efficient (9 weights per kernel)
  - Moderate receptive field (captures local features)
  - The standard choice in CNNs
- Padding design: padding=1, so a 32x32 input yields a 32x32 output
  - Keeps the feature map size unchanged
  - Avoids losing edge information
  - Simplifies network design
The 3x3 kernel with padding=1 is a classic configuration in CNNs: it preserves feature-extraction quality while staying computationally efficient.
The three most important points of the overall design:
- A 3x3 kernel with padding=1 extracts local features efficiently while maintaining the feature map size.
- Two convolutions (3→16→32 channels) and two poolings (32x32→16x16→8x8) progressively move from basic features to complex ones.
- A single fully connected layer (2048→10) does the final classification; 25,578 parameters in total, a simple but effective structure.
Training and debugging
Training Process
- Source code:

```python
# Train for one epoch.
# Assumes model, trainloader, criterion, optimizer, and device
# are created earlier in the script (the full file is in the project source).
print("Training for 1 epoch...")
model.train()  # set training mode (enables dropout and other train-only layers)

# Iterate over the data loader
for i, (inputs, labels) in enumerate(trainloader):
    # Move the batch to the target device (GPU/CPU)
    inputs, labels = inputs.to(device), labels.to(device)

    optimizer.zero_grad()              # clear accumulated gradients
    outputs = model(inputs)            # forward pass
    loss = criterion(outputs, labels)  # compute the loss
    loss.backward()                    # backward pass: compute gradients
    optimizer.step()                   # update the model parameters

    # Print the loss every 100 batches
    if (i + 1) % 100 == 0:
        print(f'Batch [{i + 1}], Loss: {loss.item():.4f}')

print("Training finished!")
```
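The snippet references model, trainloader, criterion, optimizer, and device without defining them. The original article does not show that setup, so the following is a hedged sketch of what it plausibly looks like; the optimizer choice (SGD with momentum) and learning rate are assumptions, not confirmed values, though a batch size of 32 is consistent with the 1,500+ batches in the run log (50,000 / 32 ≈ 1,563):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Assumed setup, a sketch rather than the article's confirmed code.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleNet().to(device)                     # the model defined above
trainloader = DataLoader(trainset, batch_size=32,  # trainset from the loading step
                         shuffle=True)
criterion = nn.CrossEntropyLoss()                  # standard classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```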
Debugging experience
- Data first: check data loading and preprocessing; make sure the shape, value range, and distribution of the inputs match expectations.
- Monitoring is king: watch the trend of the loss; print the loss, learning rate, and gradient statistics regularly to diagnose training.
- Tune hyperparameters deliberately: adjust batch size, learning rate, and so on based on the training behavior to avoid over- or underfitting.
Test execution and result analysis
Test execution
python
Results and Analysis
Using device: cpu
Loading CIFAR-10 dataset...
Training for 1 epoch...
Batch [100], Loss: 2.2419
Batch [200], Loss: 2.2013
Batch [300], Loss: 1.8651
Batch [400], Loss: 1.9359
Batch [500], Loss: 1.9718
Batch [600], Loss: 1.9448
Batch [700], Loss: 1.7974
Batch [800], Loss: 1.6378
Batch [900], Loss: 1.5137
Batch [1000], Loss: 1.5045
Batch [1100], Loss: 1.8072
Batch [1200], Loss: 1.9754
Batch [1300], Loss: 1.8177
Batch [1400], Loss: 1.7377
Batch [1500], Loss: 1.8140
Training finished!
- Device usage: The program runs on the CPU
- Dataset: CIFAR-10 dataset was successfully loaded
- Training process:
- Completed 1 epoch of training
- The loss is printed every 100 batches
- The loss falls from an initial 2.24 to roughly 1.5-1.8 (1.81 at the final print), fluctuating along the way, which indicates the model is learning
- The entire training run completed smoothly
Loss measures the gap between the model's predictions and the ground truth, like an exam score in reverse. For example, if the model predicts a picture is a cat with 60% probability but it really is a cat (100%), that 40% gap shows up in the loss value. The smaller the loss, the more accurate the prediction.
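Concretely, with a cross-entropy loss (the standard choice for classification, assumed in the sketch above), a prediction that assigns probability p to the correct class costs -ln(p): 60% confidence costs -ln(0.6) ≈ 0.51, while 99% confidence costs only -ln(0.99) ≈ 0.01. A minimal sketch with hypothetical logits:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one image over 10 classes; class 3 ('cat') is correct.
logits = torch.tensor([[0.2, 0.1, 0.0, 1.5, 0.3, 0.1, 0.0, 0.2, 0.1, 0.0]])
target = torch.tensor([3])

p_cat = F.softmax(logits, dim=1)[0, 3]
loss = F.cross_entropy(logits, target)
print(f'p(cat) = {p_cat:.2f}, loss = {loss:.2f}')  # loss equals -ln(p(cat))
```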
Optimization and cutting-edge exploration
Architectural advantages
- Lightweight design: about 25K parameters in total, suitable for rapid deployment and iteration
- Clear structure: the classic convolution + pooling combination, easy to understand and optimize
- Modular implementation: the code is organized sensibly and is easy to extend
The gap with cutting-edge trends
- No attention mechanism (as in Transformer architectures)
- No residual connections (the hallmark of ResNet)
- No regularization strategy (such as Dropout)
Optimization directions (a sketch of the first two follows this list)
- Add BatchNorm to improve training stability
- Introduce modern activation functions (such as GELU or Swish)
- Implement a learning-rate scheduling strategy
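As a hedged sketch of the first two directions (not part of this article's verified code), one convolution stage of SimpleNet could look like this with BatchNorm and GELU added:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolution stage: Conv -> BatchNorm -> GELU -> MaxPool."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)  # stabilizes training
        self.act = nn.GELU()              # modern smooth activation

    def forward(self, x):
        return torch.max_pool2d(self.act(self.bn(self.conv(x))), 2)

# Stacking two blocks reproduces SimpleNet's 3 -> 16 -> 32 channel plan.
blocks = nn.Sequential(ConvBlock(3, 16), ConvBlock(16, 32))
print(blocks(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 32, 8, 8])
```

A learning-rate schedule can then be layered on top with torch.optim.lr_scheduler (for example StepLR or CosineAnnealingLR).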
Summary and reflection
Gains and shortcomings
- Gained a working grasp of the basic CNN and PyTorch workflow.
- Shortcomings: the model is shallow and its accuracy still needs improvement.
Next steps
- Add validation-set evaluation and a model-saving mechanism
- Visualize the training process
From toy to tool, from black box to transparency.
Project source code
References
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. DOI: 10.1109/5.726791
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097-1105. DOI: 10.1145/3065386
- Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.