Preface
What is the ultimate quest of every programmer? A user experience that stays silky smooth as system traffic grows. However, in scenarios where large files are transferred and data is shuffled around, traditional "data handling" drags performance down. To address this pain point, Linux introduced zero-copy: a technology that transfers data efficiently without making the CPU do all the moving. In this article I will explain, in plain language, how zero-copy works, its common implementations, and its practical applications, so that you can thoroughly understand this technology!
1. Traditional copying: the "old days" of data handling
In order to understand zero-copy, let's look at how traditional data transfers work. Imagine we need to read a large file from a hard disk and send it to the network. This sounds simple enough, but in reality, traditional data transfers involve multiple steps and take up a lot of CPU resources.
1.1 A typical file transfer process (without DMA technology):
Suppose we want to read a large file from the hard disk and send it to the network. Here are the detailed steps for the traditional copy method:
- Read data into the kernel buffer: via the read() system call, data is read from the hard disk into the kernel buffer. The CPU has to coordinate and execute the instructions that complete this step.
- Copy data to the user buffer: the data is copied from the kernel buffer into a buffer in user space. This copy is triggered by the same read() call, and the CPU performs it entirely on its own.
- Write data back to a kernel buffer: via the write() system call, the data is copied from the user buffer back into a kernel buffer. Once again the CPU steps in and does the copying.
- Transfer data to the network card: finally, the data in the kernel buffer is transferred to the network card and sent onto the network. Without DMA, the CPU has to copy the data to the NIC as well.

A minimal sketch of this traditional read()/write() loop follows below.
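To make the copies and the user/kernel switches concrete, here is a hedged sketch of the traditional approach in C. It assumes file_fd is an open file and sock_fd a connected socket; the names are illustrative only.

```c
#include <unistd.h>

/*
 * Traditional transfer: read() pulls each chunk from the kernel buffer
 * into a user buffer, write() pushes it back into a kernel (socket)
 * buffer -- two CPU copies and two user/kernel switches per chunk.
 * file_fd and sock_fd are assumed to be an open file and a connected socket.
 */
ssize_t copy_traditional(int file_fd, int sock_fd)
{
    char buf[4096];              /* user-space buffer */
    ssize_t total = 0, n;

    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {      /* kernel -> user */
        ssize_t written = 0;
        while (written < n) {
            ssize_t w = write(sock_fd, buf + written, n - written); /* user -> kernel */
            if (w < 0)
                return -1;
            written += w;
        }
        total += written;
    }
    return n < 0 ? -1 : total;
}
```

Each loop iteration pays for one kernel-to-user copy in read() and one user-to-kernel copy in write(), plus two round trips between user mode and kernel mode.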
1.2 Let's take a look at a graph for a more straightforward view:
1.3 "Four copies" of data transmission
In this process, the data undergoes four copies in the system:
- Hard disk -> kernel buffer (CPU involved, responsible for reading and moving the data)
- Kernel buffer -> user buffer (triggered by the read() call; the CPU performs the copy)
- User buffer -> kernel buffer (triggered by the write() call; the CPU performs the copy)
- Kernel buffer -> network card (the final send; the CPU takes part in the transfer)
1.4 Performance bottleneck analysis
The problems with this traditional copying method are obvious:
- High CPU usage: every read() and write() call requires the CPU to copy data multiple times, which seriously consumes CPU resources and affects the execution of other tasks.
- Memory footprint: when the amount of data is large, memory usage rises significantly, which may degrade system performance.
- Context-switch overhead: every read() and write() call switches between user mode and kernel mode, adding to the CPU's burden.
These problems are especially noticeable when dealing with large files or high-frequency transfers, where the CPU is forced to act as a "porter" and performance is severely limited. So, is there a way to minimize the CPU's "carrying" work? This is where DMA (Direct Memory Access) technology comes in.
2. DMA: Prelude to Zero Copies
DMA (Direct Memory Access) is a technique that lets data move directly between the hard disk and memory without the CPU shuttling it byte by byte. In short, DMA is a "helper" for the CPU that takes work off its hands.
2.1 How does DMA help the CPU?
In traditional data transfer, the CPU needs to physically move the data from the hard disk to the memory and then to the network, which consumes CPU resources. The advent of DMA allows the CPU to do less work:
- Hard disk to kernel buffer: handled by DMA; as soon as the CPU issues the instruction, DMA copies the data into the kernel buffer automatically.
- Kernel buffer to NIC: The DMA handles this part as well, sending the data directly to the NIC, with the CPU simply overseeing the overall process.
With DMA, all the CPU needs to do is say, "Hey DMA, move the data from the hard disk to memory!" Then the DMA controller takes over the job and automatically passes the data from the hard disk to the kernel buffer, with the CPU just supervising from the sidelines.
2.2 With DMA in place, let's look at the data transfer process:
To better understand the role of DMA in the overall data handling, let's illustrate with a diagram:
Explanation:

- DMA is responsible for the hard-disk-to-kernel-buffer and kernel-buffer-to-NIC transfers.
- The CPU still has to handle the transfer of data between the kernel buffer and the user buffer.
2.3 Which steps still require CPU involvement?
While DMA takes some of the work off the CPU's hands, it does not take over all data copying. The CPU is still responsible for two things:
- Kernel buffer to user buffer: The data needs to be copied by the CPU into user space for use by the program.
- User buffer back to kernel buffer: After the program has processed the data, the CPU has to copy the data back to the kernel in preparation for subsequent transfers.
It's like hiring a helper, but you still have to do some of the detailed work yourself. So, during high concurrency or large file transfers, the CPU will still be stressed by these copying tasks.
2.4 To summarize
To summarize, DMA does relieve the CPU of the burden of data transfers, allowing data to be transferred from the hard disk to the kernel buffer and from the kernel buffer to the NIC with little or no CPU involvement. However, DMA does not completely solve the problem of copying data between kernel and user space; the CPU still needs to perform two data transfers, a limitation that becomes particularly acute in highly concurrent and large file transfer scenarios.
3. Zero-copy: making data "direct"
Therefore, to further reduce CPU involvement and improve transfer efficiency, Linux introduced zero-copy technology. Its core goal is to let data flow directly within kernel space, avoiding redundant copies to and from user space, thereby minimizing the CPU's memory-copy operations and improving system performance.
Next, let's take a closer look at some of the major zero-copy implementations in Linux.
Note: zero-copy in Linux relies on hardware DMA support.
3.1 sendfile: the earliest zero-copy approach
sendfile is the earliest zero-copy method introduced in Linux, designed specifically for file transfers.
3.2 Workflow of sendfile
- DMA (Direct Memory Access) loads file data directly into kernel buffers.
- Data goes from the kernel buffer directly into the socket kernel buffer in the network stack.
- The data is processed through the network protocol stack and sent directly to the network via the network card.
With sendfile, the CPU only needs to copy the data once during the whole transfer (from the kernel buffer into the socket buffer), which reduces CPU usage.
3.3 Simple illustrations:
Description of the sendfile illustration:
- Reading data from a hard disk: File data is read from the hard disk via DMA and loaded directly into the kernel buffer, a process that does not require CPU involvement.
- Copy data to the socket buffer of the network stack: Instead of entering user space, the data goes from the kernel buffer directly to the socket buffer in the network stack, where it undergoes the necessary protocol processing (e.g., TCP/IP encapsulation).
- Data sent via network card: The data is eventually sent directly to the network via the network card.
3.4 sendfile Interface Description
The sendfile function is defined as follows:
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
- out_fd: the destination file descriptor, typically a socket descriptor, used to send data over the network.
- in_fd: the source file descriptor, usually a file read from the hard disk.
- offset: a pointer to the offset at which reading should start; if it is NULL, reading starts from the file's current offset.
- count: the number of bytes to transfer.

The return value is the number of bytes actually transferred; on error it returns -1 and sets errno to indicate the cause.
3.5 Simple Code Example
```c
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int input_fd = open("", O_RDONLY);   /* file path omitted in the original */
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in address;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);
    bind(server_fd, (struct sockaddr *)&address, sizeof(address));
    listen(server_fd, 3);
    int client_fd = accept(server_fd, NULL, NULL);
    /* copy up to 1024 bytes from the file to the socket inside the kernel */
    sendfile(client_fd, input_fd, NULL, 1024);
    close(input_fd);
    close(client_fd);
    close(server_fd);
    return 0;
}
```
This example shows how to use sendfile to send a local file to a client connected over the network. A single sendfile call moves the data directly from input_fd to client_fd. Note that sendfile may transfer fewer bytes than requested, so real code usually loops, as in the sketch below.
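As a hedged sketch of that looping pattern (assuming file_fd is an open regular file and sock_fd a connected socket; the names are illustrative), fstat supplies the file size and sendfile is called repeatedly until everything has been sent:

```c
#include <sys/sendfile.h>
#include <sys/stat.h>

/* Sketch: send an entire file over a connected socket with sendfile.
 * Assumes file_fd is an open regular file and sock_fd a connected socket. */
int send_whole_file(int file_fd, int sock_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;                      /* sendfile advances this for us */
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0)
            return -1;                     /* error (or unexpected end of data) */
    }
    return 0;
}
```

Passing a non-NULL offset lets the kernel track the progress, so the loop only has to compare offset against the file size.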
3.6 Applicable scenarios
sendfile is mainly used to send file data directly to the network, and it is ideal where large files must be transferred efficiently, such as file servers, streaming media, and backup systems.
In the traditional method of data transfer, the data has to go through multiple steps:
- First, data is read from the hard disk into kernel space.
- The data is then copied from kernel space to user space.
- Finally, the data is copied from user space back to the kernel to be sent out to the NIC.
To summarize, sendfile makes data transfers more efficient and less CPU-intensive, and is particularly suited to simple large file transfer scenarios. However, for more complex transfers, such as moving data between multiple file descriptors of different types, splice provides a more flexible approach. Let's take a look at how splice accomplishes this.
4. splice: piped zero copy
splice is another Linux system call that implements zero-copy data transfer. It is designed to move data efficiently between different kinds of file descriptors, keeping the data inside the kernel to avoid unnecessary copies.
4.1 Workflow of splice
- Reading data from a file: a splice call moves data from the input file descriptor (e.g., a hard-disk file) into a kernel pipe buffer; the data reaches the kernel via DMA (direct memory access).
- Transfer to the network socket: a second splice call then moves the data from the pipe buffer directly to the file descriptor of the target network socket.

Note that splice requires one of the two file descriptors to refer to a pipe, which is why the data typically flows file -> pipe -> socket.
The whole process is completed in the kernel space, avoiding the round-trip copying of data from the kernel space to the user space, which greatly reduces the involvement of the CPU and improves the system performance.
4.2 Simple illustrations:
Similar to the sendfile illustration, just with a different interface.
Description of the splice illustration: the data is moved by splice from the file descriptor to the network socket. It first enters a kernel (pipe) buffer via DMA and is then transferred directly to the socket. The whole path stays out of user space, which significantly reduces the CPU's copying work.
4.3 splice interface description
The splice function is defined as follows:
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);
- fd_in: the source file descriptor from which data is read.
- off_in: a pointer to the source offset; if NULL, the current offset is used (it must be NULL when fd_in is a pipe).
- fd_out: the destination file descriptor to which data is written.
- off_out: a pointer to the destination offset; if NULL, the current offset is used.
- len: the number of bytes to transfer.
- flags: flags that control behavior, e.g. SPLICE_F_MOVE, SPLICE_F_MORE.

At least one of fd_in and fd_out must refer to a pipe. The return value is the number of bytes actually transferred; on error it returns -1 and sets errno to indicate the cause.
4.4 Simple Code Example
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int input_fd = open("", O_RDONLY);   /* file path omitted in the original */
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in address;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);
    bind(server_fd, (struct sockaddr *)&address, sizeof(address));
    listen(server_fd, 3);
    int client_fd = accept(server_fd, NULL, NULL);
    /* splice requires a pipe on one side, so relay the data file -> pipe -> socket */
    int pipe_fd[2];
    pipe(pipe_fd);
    ssize_t n = splice(input_fd, NULL, pipe_fd[1], NULL, 1024, SPLICE_F_MORE);
    if (n > 0)
        splice(pipe_fd[0], NULL, client_fd, NULL, n, SPLICE_F_MORE);
    close(pipe_fd[0]);
    close(pipe_fd[1]);
    close(input_fd);
    close(client_fd);
    close(server_fd);
    return 0;
}
```
This example shows how to use splice to send a local file to a network socket. Because splice needs a pipe on one side, the data is first spliced from the file into a pipe and then from the pipe to the socket, all inside the kernel.
4.5 Applicable scenarios
splice is ideal for efficient, direct data transfer between file descriptors, such as from a file to a network socket, or for routing data among files, pipes, and sockets. The transfer happens entirely in kernel space without entering user space, which significantly reduces the number of copies and the CPU's involvement. splice is particularly well suited to scenarios that need flexible data routing with a low CPU burden, such as logging and real-time data streaming.
4.6 Difference between sendfile and splice
Although sendfile and splice are both zero-copy technologies provided by Linux for efficiently transferring data in kernel space, there are some significant differences in their application scenarios and functionality:
Data flow:

- sendfile: transfers data directly from the kernel buffer to the socket buffer; suited to simple, efficient file-to-network transfers.
- splice: more flexible; it can move data between arbitrary file descriptors, including files, pipes and sockets (with a pipe on one side), so it can build more complex data paths among them.
Applicable Scenarios:
- sendfile: mainly used for file transfer to the network, ideal for file servers, streaming media and other scenarios that require efficient file transfer.
- splice: better suited for complex data flow scenarios, such as those that require multi-step transfers between files, pipes and networks or flexible control of data flow.
Flexibility:

- sendfile: a single-purpose operation, but a very efficient way to send files straight to the network.
- splice: can be combined with pipes to achieve finer control over the data flow, e.g. passing data through a pipe before it reaches its destination, as in the sketch after this list.
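As a hedged sketch of the logging case mentioned above (drain_pipe_to_log, pipe_rd and log_fd are illustrative names; the pipe's read end might be the stdout of a child process), splice can relay whatever arrives on a pipe straight into a log file without touching user-space buffers:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the "logging" case: whatever arrives on a pipe (e.g. the read
 * end of a child process's stdout) is spliced straight into a log file,
 * so the log data never passes through user-space buffers.
 * pipe_rd and log_fd are assumed to be an open pipe read end and a log
 * file opened O_WRONLY (splice cannot write to a file opened O_APPEND);
 * the names are illustrative only. */
int drain_pipe_to_log(int pipe_rd, int log_fd)
{
    for (;;) {
        ssize_t n = splice(pipe_rd, NULL, log_fd, NULL, 64 * 1024, SPLICE_F_MOVE);
        if (n == 0)            /* writer closed the pipe: done */
            return 0;
        if (n < 0)
            return -1;         /* error, errno describes it */
    }
}
```

With off_out left as NULL, the kernel advances the log file's offset after each call, so the loop needs no bookkeeping of its own.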
5. mmap + write: memory-mapped zero copy
In addition to the two approaches above, mmap + write is also a common way to implement zero-copy. This approach focuses on reducing the number of copy steps through memory mapping.
5.1 mmap + write workflow
- Use the mmap system call to map the file into the process's virtual address space, so that the data can be shared between kernel space and user space without an extra copy operation.
- Use the write system call to write the mapped memory region directly to the target file descriptor (e.g., a network socket), completing the transfer.
This approach reduces data copying and improves efficiency, and is suitable for scenarios where data needs to be flexibly manipulated before being sent. Instead of explicitly copying data from kernel space to user space, data is shared through mapping in this way, thus reducing unnecessary copies.
5.2 Simple illustrations:
Description of the mmap + write illustration:

- mmap maps the file data into the process's virtual address space, avoiding an explicit data copy.
- write sends the mapped memory region directly to the target file descriptor (e.g., a network socket).
5.3 mmap Interface Description
The mmap function is defined as follows:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
- addr: the desired start address of the mapping; usually NULL, letting the system choose.
- length: the size of the memory region to map.
- prot: protection flags for the mapped region, e.g. PROT_READ, PROT_WRITE.
- flags: properties that affect the mapping, such as MAP_SHARED, MAP_PRIVATE.
- fd: the file descriptor of the file to be mapped.
- offset: the offset in the file at which the mapping starts.

The return value is a pointer to the mapped memory region; on error it returns MAP_FAILED and sets errno.
5.4 Simple Code Examples
```c
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int input_fd = open("", O_RDONLY);   /* file path omitted in the original */
    struct stat file_stat;
    fstat(input_fd, &file_stat);
    /* map the whole file read-only into this process's address space */
    char *mapped = mmap(NULL, file_stat.st_size, PROT_READ, MAP_PRIVATE, input_fd, 0);
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in address;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);
    bind(server_fd, (struct sockaddr *)&address, sizeof(address));
    listen(server_fd, 3);
    int client_fd = accept(server_fd, NULL, NULL);
    /* send the mapped region straight to the socket */
    write(client_fd, mapped, file_stat.st_size);
    munmap(mapped, file_stat.st_size);
    close(input_fd);
    close(client_fd);
    close(server_fd);
    return 0;
}
```
This example shows how to use mmap to map the file into memory and then write to send the data to the client over the network connection.
5.5 Applicable scenarios
mmap + write is ideal for scenarios that need flexible handling of the file data, such as modifying or partially processing it before sending. Compared with sendfile, mmap + write offers greater flexibility because the data is accessible in user space, which is useful for applications that pre-process files, e.g. compression, encryption, or data conversion.
However, this approach also introduces more overhead: the data still crosses between user mode and kernel mode, which raises the cost of the system calls. mmap + write is therefore better suited to situations where some customization is needed before the data is transferred, and less suited to the pure, efficient transfer of large files. A small sketch of such pre-processing follows below.
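As a hedged sketch of that pre-processing idea (send_with_checksum, file_fd and sock_fd are illustrative names, and the 4-byte additive checksum header is made up for this example), the mapped bytes are inspected in user space before being written to the socket:

```c
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch of the "pre-processing" case: map the file, inspect its bytes in
 * user space (here a trivial additive checksum, purely illustrative), then
 * send a small header followed by the mapped data with write().
 * file_fd and sock_fd are assumed to be already open; the checksum header
 * format is made up for this example. */
int send_with_checksum(int file_fd, int sock_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, file_fd, 0);
    if (data == MAP_FAILED)
        return -1;

    uint32_t sum = 0;
    for (off_t i = 0; i < st.st_size; i++)    /* user space can read the mapping */
        sum += data[i];

    write(sock_fd, &sum, sizeof(sum));        /* hypothetical 4-byte header */
    write(sock_fd, data, st.st_size);         /* then the file contents */

    munmap(data, st.st_size);
    return 0;
}
```

This is exactly the kind of access sendfile cannot give you: the payload is visible to the program before it leaves, at the cost of page-cache mapping and a user-to-kernel copy in write().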
6. tee: zero-copy approach to data replication
tee is a zero-copy system call in Linux that copies data from one pipe to another while leaving the data in the original pipe. The same data can therefore be sent to several destinations without disturbing the original stream, which makes tee ideal for scenarios such as logging and real-time data analysis, where the same data has to go to different places.
6.1 Workflow of tee
- Copying data to another pipe: the tee system call duplicates data from one pipe into another without consuming the original data. The same data can thus be used for different purposes at the same time, entirely in kernel space, without a detour through user space.
6.2 tee interface description
The tee function is defined as follows:
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);
- fd_in: the source pipe file descriptor from which data is read.
- fd_out: the destination pipe file descriptor to which data is written.
- len: the number of bytes to copy.
- flags: flags that control behavior, e.g. SPLICE_F_NONBLOCK.

The return value is the number of bytes actually copied; on error it returns -1 and sets errno to indicate the cause.
6.3 Simple Code Examples
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int src_pipe[2], copy_pipe[2];
    pipe(src_pipe);
    pipe(copy_pipe);
    /* put some data into the source pipe so there is something to duplicate */
    write(src_pipe[1], "hello, zero-copy\n", 17);
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in address;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);
    bind(server_fd, (struct sockaddr *)&address, sizeof(address));
    listen(server_fd, 3);
    int client_fd = accept(server_fd, NULL, NULL);
    /* tee duplicates the data into copy_pipe without consuming it from src_pipe */
    ssize_t n = tee(src_pipe[0], copy_pipe[1], 1024, 0);
    /* the original data is still in src_pipe and can be spliced to the socket */
    if (n > 0)
        splice(src_pipe[0], NULL, client_fd, NULL, n, SPLICE_F_MORE);
    close(src_pipe[0]);
    close(src_pipe[1]);
    close(copy_pipe[0]);
    close(copy_pipe[1]);
    close(client_fd);
    close(server_fd);
    return 0;
}
```
This example shows how to use tee to duplicate the data in a pipe and then splice the original data to a network socket, combining efficient transfer with replication. The duplicate left in copy_pipe could be consumed by another reader, for example a logger.
6.4 Applicable scenarios
tee is ideal for scenarios where the same data needs to reach multiple targets at once, such as real-time data processing and logging. By letting the kernel replicate the data entirely in kernel space, tee improves system performance and reduces the CPU's burden.
Summary comparison:
Below I have summarized several zero-copy methods for Linux, so that you can compare and learn from them:
| Method | Description | Zero-copy type | CPU involvement | Typical scenarios |
|---|---|---|---|---|
| sendfile | Sends file data directly to the socket without copying it to user space. | Complete zero-copy | Very little; direct data transfer. | File servers, video streaming, and other large-file scenarios. |
| splice | Efficiently transfers data between file descriptors (via a pipe) in kernel space. | Complete zero-copy | Very little; stays entirely in the kernel. | Complex transfers between files, pipes and sockets. |
| mmap + write | Maps files into memory for flexible handling and sends the data with write. | Partial zero-copy | Medium; requires mapping and writing. | Data that must be processed or modified first, such as compression or encryption. |
| tee | Copies data from one pipe to another without consuming the original data. | Complete zero-copy | Very little; data is duplicated in the kernel. | Multi-target scenarios such as log processing and real-time monitoring. |
Finally:
I hope this article has given you a more comprehensive and clearer understanding of Linux's zero-copy techniques! They may look a bit complicated at first, but once mastered you'll find they're quite simple, and very useful in real-world projects.
If you found this article helpful, remember to give it a like 👍 and share it with anyone who needs it! You're also welcome to follow my WeChat official account, "Learn Programming with Xiaokang".
What will you get by following it?
- Content on Linux C, C++ and Go development, computer fundamentals, and programming interviews, written to be in-depth yet easy to understand, so that learning stays light and fun.
- Whether you are preparing for interviews or want to sharpen your programming skills, the account focuses on practical, interesting, and insightful technical sharing. Follow along and let's grow together!
How do you follow the account?
It's very simple! Scan the QR code below to follow with one click.
In addition, Xiaokang has recently created a technical exchange group dedicated to discussing technical issues and answering readers' questions. When reading the article, if there is any knowledge point that you do not understand, you are welcome to join the exchange group to ask questions. I will try my best to answer for everyone. I look forward to making progress with you all!