In high-performance computing scenarios, all-flash architectures and kernel-space parallel file systems are often used to meet performance requirements. However, as data volumes and distributed-system cluster sizes grow, the high cost of all-flash storage and the operational complexity of kernel clients become major challenges.
JuiceFS, a fully userspace cloud-native distributed file system, dramatically improves I/O throughput through distributed caching and uses lower-cost object storage for data persistence, making it well suited to large-scale AI workloads.
A JuiceFS read starts with a request from the application, which reaches the JuiceFS client through FUSE, passes through the readahead buffer layer and then the cache layer, and finally accesses the object store if needed. To improve read efficiency, JuiceFS applies several strategies in its architectural design, including readahead, prefetch, and caching.
In this article, we analyze in detail how these strategies work and share our test results for specific scenarios, so that readers can better understand the performance benefits of JuiceFS as well as its limitations, and apply it more effectively to their own use cases.
Given the depth and technical nature of the content, some knowledge of operating systems is assumed; you may want to bookmark this article and revisit it when needed.
01 JuiceFS Architecture Introduction
The JuiceFS Community Edition architecture is divided into three parts: client, data storage, and metadata. Data access supports multiple interfaces, including POSIX, the HDFS API, the S3 API, and the Kubernetes CSI driver, to cover different application scenarios. For data storage, JuiceFS supports dozens of object stores, including public cloud services and self-hosted solutions such as Ceph and MinIO. The metadata engine supports a variety of common databases, including Redis, TiKV, and MySQL.
The main differences between the Enterprise Edition and the Community Edition lie in the metadata engine and data caching, shown in the lower left of the diagram. The Enterprise Edition includes a self-developed distributed metadata engine and supports distributed caching, while the Community Edition only supports local caching.
02 A few concepts about "reading" in Linux
On Linux systems, data reading is accomplished in several ways:
- Buffered I/O: This is the standard way of reading files, where data passes through the kernel buffer and the kernel performs a pre-read operation to optimize read efficiency;
- Direct I/O: Allows direct file I/O operations to bypass kernel buffers to reduce data copying and memory footprint for large data transfers;
- Asynchronous I/O: Usually used with Direct I/O. It allows an application to issue multiple I/O requests in a single thread without having to wait for each request to complete, thus improving I/O concurrency performance;
- Memory Map: Mapping a file into the address space of a process allows direct access to the contents of the file via pointers. With memory mapping, applications can access the mapped file area as if it were normal memory, with the kernel automatically handling the reading and writing of data.
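To make the buffered I/O and memory-mapping modes above concrete, here is a minimal Go sketch (JuiceFS itself is written in Go) that reads a file both ways. The file path is hypothetical, and this only illustrates the generic Linux interfaces, not JuiceFS code.

```go
// Minimal sketch: buffered read through the page cache vs. mmap access. Linux only.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	path := "/mnt/jfs/data.bin" // hypothetical file on a JuiceFS mount

	// Buffered I/O: data passes through the kernel page cache, and kernel readahead applies.
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	buf := make([]byte, 4<<20) // 4 MB read, matching the JuiceFS block size
	n, _ := f.Read(buf)
	fmt.Println("buffered read bytes:", n)

	// Memory mapping: the file is mapped into the process address space and
	// accessed like ordinary memory; the kernel pages data in on demand.
	fi, _ := f.Stat()
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)
	fmt.Println("first byte via mmap:", data[0])
}
```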
Several major read modes and the challenges they pose to storage systems:
- Random reads, including random large I/O reads and random small I/O reads, mainly test the latency and IOPS of the storage system.
- Sequential reads, a major test of the storage system's bandwidth.
- A large number of small file reads mainly test the performance of the storage system's metadata engine and the overall IOPS capability of the system.
03 JuiceFS Read Process Principle Analysis
JuiceFS stores files in chunks: a file is logically divided into chunks with a fixed size of 64MB, and each chunk is further divided into 4MB blocks, which are the actual storage units in the object store. Several performance optimizations in the JuiceFS design are closely tied to this chunking strategy (learn more about the JuiceFS storage model). To optimize read performance, JuiceFS applies several techniques, including readahead, prefetch, and caching.
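As a rough illustration of this chunking scheme, the following Go sketch shows how a file offset maps to a 64MB chunk, a 4MB block within that chunk, and an offset inside that block. The constants mirror the sizes in the text; the function is illustrative and not part of the JuiceFS code base.

```go
// Sketch of the chunk/block arithmetic implied by the 64 MB / 4 MB layout.
package main

import "fmt"

const (
	chunkSize = 64 << 20 // 64 MB logical chunk
	blockSize = 4 << 20  // 4 MB block, the object-storage unit
)

// locate returns which chunk, which block inside that chunk, and the byte
// offset inside that block a given file offset falls into.
func locate(off int64) (chunk, block, blockOff int64) {
	chunk = off / chunkSize
	inChunk := off % chunkSize
	block = inChunk / blockSize
	blockOff = inChunk % blockSize
	return
}

func main() {
	c, b, o := locate(200 << 20) // a read at file offset 200 MB
	fmt.Printf("chunk %d, block %d, offset-in-block %d\n", c, b, o)
	// -> chunk 3, block 2, offset-in-block 0
}
```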
Readahead
Readahead reduces access latency and improves effective I/O concurrency by predicting future read requests and loading data from the object store into memory in advance. The following diagram is a simplified schematic of the read process: the part below the dotted line is the application layer, and the part above it is the kernel layer.
When a user process (the application layer, highlighted in blue in the lower left corner) issues a system call to read or write a file, the request first passes through the kernel VFS and then reaches the kernel FUSE module, which communicates with the JuiceFS client process through the /dev/fuse device.
The process shown in the lower right corner is the readahead optimization that follows in JuiceFS. The system tracks a series of consecutive reads by introducing "sessions". Each session records the offset of the last read, the length of the consecutive reads, and the size of the current readahead window; this information is used to determine whether a new read request hits the session and to automatically adjust or shift the readahead window. By maintaining multiple sessions, JuiceFS can also easily support high-performance concurrent sequential reads.
To further improve sequential read performance, the design increases concurrency: each 4MB block in the readahead window is fetched by its own goroutine. Note that the concurrency is limited by the buffer-size parameter: with the default 300MB, the theoretical maximum concurrency is 75 (300MB divided by 4MB), which is not enough in some high-performance scenarios. Users need to tune this parameter according to their own resources and workload; we test different values in a later section.
As an example, in the second line of the figure below, when the system receives the second consecutive read request, it actually issues a request covering three consecutive blocks: the requested data plus the readahead window. According to the readahead settings, the next two requests will hit the readahead buffer directly and return immediately.
If the first and second requests had not been read ahead but had gone directly to the object store, the latency would be higher (usually >10ms). When the latency drops below 100 microseconds, it means the I/O request was served by readahead, as with the third and fourth requests, which hit the readahead data in memory directly.
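To summarize the mechanism, here is a hedged Go sketch of the two ideas above: a session that tracks consecutive reads and grows a readahead window, and one goroutine per 4MB block in that window, bounded by a semaphore derived from buffer-size. The structure and names are illustrative and heavily simplified, not the actual JuiceFS implementation.

```go
// Sketch: sequential-read sessions plus bounded per-block readahead goroutines.
package main

import (
	"fmt"
	"sync"
)

const blockSize = 4 << 20 // 4 MB block, the object-storage unit

type session struct {
	lastOff int64 // offset right after the last read of this stream
	seqLen  int64 // total consecutive bytes read so far
	window  int64 // current readahead window size
}

// hit reports whether a new read continues this sequential stream;
// on a hit the session advances and the readahead window may grow.
func (s *session) hit(off, size int64) bool {
	if off != s.lastOff {
		return false // not sequential: belongs to another session
	}
	s.lastOff = off + size
	s.seqLen += size
	if s.window == 0 {
		s.window = blockSize
	} else if s.window < 32*blockSize {
		s.window *= 2 // grow as the stream proves itself sequential
	}
	return true
}

// readAhead fetches every block in [off, off+window) concurrently, one
// goroutine per block, bounded by bufferSize/blockSize slots
// (e.g. 300 MB / 4 MB = 75 with the default settings).
func readAhead(off, window, bufferSize int64, fetch func(blockOff int64)) {
	sem := make(chan struct{}, bufferSize/blockSize)
	var wg sync.WaitGroup
	start := off / blockSize * blockSize // align down to a block boundary
	for b := start; b < off+window; b += blockSize {
		wg.Add(1)
		sem <- struct{}{}
		go func(blockOff int64) {
			defer wg.Done()
			defer func() { <-sem }()
			fetch(blockOff) // download one 4 MB block from object storage
		}(b)
	}
	wg.Wait()
}

func main() {
	s := &session{}
	s.hit(0, 1<<20) // first 1 MB read starts the stream
	s.hit(1<<20, 1<<20)
	readAhead(s.lastOff, s.window, 300<<20, func(blockOff int64) {
		fmt.Println("reading ahead block at offset", blockOff)
	})
}
```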
Prefetch
Prefetch: When a small piece of data in a file is read at random, we assume that the area around this piece of data may also be read, so the client asynchronously downloads the entire block where this small piece of data is located.
However, there are scenarios where prefetch is not appropriate. For example, if the application performs sparse random reads at widely scattered offsets in large files, prefetch will fetch unnecessary data and cause read amplification. If users understand the read pattern of their application well and confirm that prefetch is not needed, they can disable this behavior with --prefetch=0.
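The idea can be pictured with the following minimal Go sketch, under the assumption that the client simply aligns the read offset down to a 4MB block boundary and downloads that whole block in the background; fetchBlock is a placeholder for the object-storage download, and the enabled flag models the effect of --prefetch=0. This is an illustration, not the actual client logic.

```go
// Sketch: asynchronously prefetch the whole 4 MB block around a random small read.
package main

import (
	"fmt"
	"time"
)

const blockSize = 4 << 20 // 4 MB block

// prefetch asynchronously downloads the entire block containing offset off.
// enabled=false models --prefetch=0, useful for sparse random reads of large
// files where fetching whole blocks would cause read amplification.
func prefetch(off int64, enabled bool, fetchBlock func(blockOff int64)) {
	if !enabled {
		return
	}
	blockOff := off / blockSize * blockSize // align down to the block start
	go fetchBlock(blockOff)                 // download happens in the background
}

func main() {
	prefetch(9<<20, true, func(blockOff int64) {
		fmt.Println("fetching whole block starting at offset", blockOff) // 8 MB
	})
	time.Sleep(100 * time.Millisecond) // give the background fetch time to run
}
```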
Cache
In a previous post, our architect Changjian Gao described the JuiceFS caching mechanism in detail; you can also refer to the caching documentation. This article only covers the basic concepts.
Page cache
The page cache is a mechanism provided by the Linux kernel. One of its core functions is readahead: by reading data into the cache ahead of time, it ensures a fast response when the data is actually requested.
The page cache is also critical in certain scenarios. For example, for random read workloads, strategically warming the page cache, such as reading the whole file sequentially once in advance while memory is free, can significantly improve the performance of subsequent random reads and thus the overall performance of the workload.
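This warming strategy can be sketched as follows in Go: one full sequential pass over the file fills the page cache, so later random reads on the same file can be served from memory. The file path is hypothetical, and this is generic Linux behavior rather than a JuiceFS-specific API.

```go
// Sketch: warm the page cache with one sequential pass, then do random reads.
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	path := "/mnt/jfs/train.data" // hypothetical file on a JuiceFS mount

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// One full sequential read; the kernel fills the page cache as we go.
	warmed, err := io.Copy(io.Discard, f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("warmed %d bytes into the page cache\n", warmed)

	// Subsequent random reads on the same file can now hit the page cache.
	if _, err := f.ReadAt(make([]byte, 4096), 123*4096); err != nil && err != io.EOF {
		panic(err)
	}
}
```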
Local cache
JuiceFS's local cache can save blocks in local memory or on local disk, allowing for local hits when the application accesses the data, reducing network latency and improving performance. The default unit of data caching is a block of 4MB, which is asynchronously written to the local cache after the first read from the object store.
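The asynchronous cache fill can be pictured with the Go sketch below: after a block arrives from object storage, it is written to the cache directory in the background so the read path does not wait for the disk. The directory layout and key format here are invented for illustration and do not reflect JuiceFS's actual on-disk cache format.

```go
// Sketch: write a freshly downloaded 4 MB block to the local cache asynchronously.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const cacheDir = "/var/jfsCache" // hypothetical --cache-dir value

// cacheBlock persists one block in the background; the caller does not wait,
// so first-read latency is unaffected and later reads can be served locally.
func cacheBlock(key string, data []byte, done chan<- error) {
	go func() {
		path := filepath.Join(cacheDir, key)
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			done <- err
			return
		}
		done <- os.WriteFile(path, data, 0o644)
	}()
}

func main() {
	block := make([]byte, 4<<20) // pretend this 4 MB block came from object storage
	done := make(chan error, 1)
	cacheBlock("blocks/0/123_0_4194304", block, done) // hypothetical cache key
	fmt.Println("cache write finished:", <-done)
}
```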
For local cache configuration details such as --cache-dir and --cache-size, Enterprise Edition users can refer to the documentation.
Distributed cache (cache group)
Distributed caching is an important feature of the Enterprise Edition. Compared with local caching, it aggregates the local caches of multiple nodes into a single cache pool to improve the cache hit rate. However, because distributed caching adds a network hop, its latency is usually slightly higher than that of the local cache: random read latency is typically 1-2ms for the distributed cache versus 0.2-0.5ms for the local cache. For the specific architecture of the distributed cache, see the official documentation.
04 FUSE & Object Storage Performance
JuiceFS read requests go through FUSE and data is read from the object store, so understanding the performance of FUSE and the object store is fundamental to understanding the performance of JuiceFS.
About FUSE Performance
We ran two sets of FUSE performance tests. In the test scenario, when an I/O request reaches the FUSE mount process, the data is filled directly in memory and returned immediately. The tests evaluate total FUSE bandwidth, average bandwidth per thread, and CPU usage at different thread counts. For hardware, Test 1 runs on an Intel Xeon architecture and Test 2 on an AMD EPYC architecture.
Threads | Bandwidth(GiB/s) | Bandwidth per Thread (GiB/s) | CPU Usage(cores) |
---|---|---|---|
1 | 7.95 | 7.95 | 0.9 |
2 | 15.4 | 7.7 | 1.8 |
3 | 20.9 | 6.9 | 2.7 |
4 | 27.6 | 6.9 | 3.6 |
6 | 43 | 7.2 | 5.3 |
8 | 55 | 6.9 | 7.1 |
10 | 69.6 | 6.96 | 8.6 |
15 | 90 | 6 | 13.6 |
20 | 104 | 5.2 | 18 |
25 | 102 | 4.08 | 22.6 |
30 | 98.5 | 3.28 | 27.4 |
FUSE Performance Test 1, based on Intel Xeon CPU architecture
- In single-threaded tests, the maximum bandwidth reached 7.95GiB/s while CPU usage was less than one core;
- The bandwidth essentially scales linearly as the number of threads increases, with the total bandwidth increasing to 104 GiB/s when the number of threads increases to 20;
Note that FUSE bandwidth can vary across different hardware models and operating systems even on the same CPU architecture. We tested several machine models, and on one of them the maximum single-thread bandwidth was only 3.9GiB/s.
Threads | Bandwidth(GiB/s) | Bandwidth per Thread (GiB/s) | CPU Usage(cores) |
---|---|---|---|
1 | 3.5 | 3.5 | 1 |
2 | 6.3 | 3.15 | 1.9 |
3 | 9.5 | 3.16 | 2.8 |
4 | 9.7 | 2.43 | 3.8 |
6 | 14.0 | 2.33 | 5.7 |
8 | 17.0 | 2.13 | 7.6 |
10 | 18.6 | 1.9 | 9.4 |
15 | 21 | 1.4 | 13.7 |
FUSE Performance Test 2, based on AMD EPYC CPU architecture
- In Test 2, the bandwidth does not scale linearly; in particular, once concurrency reaches 10, each thread achieves less than 2GiB/s;
Under multi-threaded concurrency, Test 2 (AMD EPYC architecture) peaks at about 20GiB/s, while Test 1 (Intel Xeon architecture) shows more headroom. The peak is typically reached once CPU resources are exhausted, i.e., when both the application process and the FUSE process hit their CPU limits.
In practice, due to overhead at each stage of the pipeline, actual I/O performance is often lower than the test peak of 3.5GiB/s above. For example, in a model-loading scenario, reading a model file in pickle format achieves only 1.5 to 1.8GiB/s on a single thread. This is mainly because reading the pickle file and deserializing the data at the same time hits a single-core CPU bottleneck. Even when reading directly from memory without FUSE, the bandwidth reaches at most 2.8GiB/s.
About Object Storage Performance
We ran tests with the juicefs objbench tool, covering loads at 1, 10, 200, and 800 concurrency. Note that performance can differ significantly between object storage services.
Concurrency | Upload bandwidth (MiB/s) | Download bandwidth (MiB/s) | Average upload time (ms/object) | Average download time (ms/object) |
---|---|---|---|---|
1 | 32.89 | 40.46 | 121.63 | 98.85 |
10 | 332.75 | 364.82 | 10.02 | 10.96 |
200 | 5590.26 | 3551.65 | 0.67 | 1.13 |
800 | 8270.28 | 4038.41 | 0.48 | 0.99 |
When we increase the concurrency of GET operations on the object store to 200 and 800, we are able to achieve very high bandwidth. This shows that single concurrency bandwidth is very limited when reading data directly from the object store, and that increasing concurrency is critical to overall bandwidth performance.
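The following Go sketch illustrates this point (it is not the objbench tool): downloading many objects with concurrent GET requests, bounded by a configurable concurrency limit, is what raises aggregate bandwidth. The bucket URL, object names, and concurrency value are placeholders.

```go
// Sketch: concurrent object-store GETs raise aggregate download throughput.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	urls := make([]string, 64)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://bucket.example.com/block-%d", i) // hypothetical objects
	}

	concurrency := 16 // raise this (toward 200-800) to approach peak bandwidth
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	var mu sync.Mutex
	var total int64

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{}
		go func(url string) {
			defer wg.Done()
			defer func() { <-sem }()
			resp, err := http.Get(url)
			if err != nil {
				return // a real client would retry
			}
			defer resp.Body.Close()
			n, _ := io.Copy(io.Discard, resp.Body)
			mu.Lock()
			total += n
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	fmt.Println("downloaded bytes:", total)
}
```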
05 Sequential vs. Random Read Tests
To give you a visual reference for benchmarking, we used the fio tool to test the performance of JuiceFS Enterprise Edition under sequential read and random read scenarios.
Sequential read
As the graph below shows, 99% of requests complete in under 200 microseconds. The readahead window works well for sequential reads, so latency stays low.
We can also enlarge the readahead window to increase I/O concurrency and thus bandwidth. When we raise buffer-size from the default 300MiB to 2GiB, read concurrency is no longer the bottleneck and single-thread read bandwidth rises from 674MiB/s to 1418MiB/s, which is the single-threaded FUSE performance ceiling; further bandwidth gains require more I/O concurrency in the application code.
buffer-size | Bandwidth |
---|---|
300MiB | 674MiB/s |
2GiB | 1418MiB/s |
Bandwidth performance test with different buffer-sizes (single-threaded)
When the number of business threads is increased to 4, the bandwidth reaches 3456MiB/s; at 16 threads, the bandwidth reaches 5457MiB/s, at which point the network bandwidth is saturated.
Threads | Bandwidth |
---|---|
1 thread | 1418MiB/s |
4 threads | 3456MiB/s |
16 threads | 5457MiB/s |
Bandwidth performance test with different number of threads (buffer-size: 2GiB)
Random read
For small I/O random reads, the performance is mainly determined by latency and IOPS, and since the total IOPS can be scaled linearly by adding more nodes, we first focus on the latency data on a single node.
"FUSE data bandwidth" refers to the amount of data transferred through the FUSE tier, representing the data transfer rate that can actually be observed and operated by the user application; "underlying data bandwidth" refers to the bandwidth of the storage system itself to process data at the physical or operating system level. The "underlying data bandwidth" refers to the storage system's own processing of data at the physical or operating system level.
From the table, we can see that latency is lower for both local and distributed cache hits than for requests that fall through to the object store; when optimizing random read latency, improving the cache hit rate is the first lever. We can also see that using asynchronous I/O interfaces and increasing the number of threads greatly improves IOPS.
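As a rough illustration of the multi-threaded side of that observation, the Go sketch below issues random reads from many goroutines so that requests no longer wait on each other, which raises the achievable IOPS. The file path, read size, and thread count are arbitrary, and true asynchronous I/O interfaces (e.g., libaio or io_uring) are a separate mechanism not shown here.

```go
// Sketch: concurrent random reads from multiple goroutines to raise IOPS.
package main

import (
	"fmt"
	"math/rand"
	"os"
	"sync"
	"time"
)

func main() {
	f, err := os.Open("/mnt/jfs/random.bin") // hypothetical file, assumed larger than 128 KB
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fi, _ := f.Stat()

	const threads = 16
	const readsPerThread = 1000
	readSize := int64(128 << 10) // 128 KB random reads

	start := time.Now()
	var wg sync.WaitGroup
	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed))
			b := make([]byte, readSize)
			for i := 0; i < readsPerThread; i++ {
				off := r.Int63n(fi.Size() - readSize) // random offset within the file
				f.ReadAt(b, off)                      // *os.File.ReadAt is safe for concurrent use
			}
		}(int64(t))
	}
	wg.Wait()
	iops := float64(threads*readsPerThread) / time.Since(start).Seconds()
	fmt.Printf("achieved ~%.0f IOPS\n", iops)
}
```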
Unlike the small I/O scenario, large random I/O reads also require attention to read amplification. As the table below shows, the underlying data bandwidth is higher than the FUSE data bandwidth: because of prefetch, the amount of data actually fetched can be 1-3 times what the application requests. In this case, you can try disabling prefetch and tuning the maximum readahead window.
Category | FUSE data bandwidth | Underlying data bandwidth |
---|---|---|
1MB buffered IO | 92MiB | 290MiB |
2MB buffered IO | 155MiB | 435MiB |
4MB buffered IO | 181MiB | 575MiB |
1MB direct IO | 306MiB | 306MiB |
2MB direct IO | 199MiB | 340MiB |
4MB direct IO | 245MiB | 735MiB |
JuiceFS (Distributed Cache Enabled) Large I/O Random Read Test Results