Over the past year, the LLM field has grown rapidly, and the number of foundation models such as LLaMA, ChatGLM, Baichuan, Qwen, and Yi has increased significantly. Many enterprises have started post-training on these foundation models to build domain-specific models and bring applications to production.
The parameter size of AI models is growing exponentially, and more and more general-purpose large models with hundreds of billions or even trillions of parameters have emerged. For example, the latest Llama 3 series offers 405B, 70B, and 8B versions, and parameter counts keep being pushed higher, which poses a great challenge to the cost and planning of enterprise storage. Beyond model size, the complex workflow of large model development and the need for efficient data flow also place higher demands on the storage system, which directly affects the efficiency of the entire pipeline.
In addition, the demand for GPU compute for large-scale training and inference has pushed compute supply increasingly toward a multi-cloud, multi-region model, which in turn raises the challenge of distributing, synchronizing, and managing the consistency of datasets and models across multiple regions.
In this article, we share our thinking on storage selection for large models and the workload requirements at different stages of development, covering dimensions such as cost, performance, and functionality, and explore how JuiceFS can help users optimize these aspects.
01 Thinking about Large Model Storage Selection
To address the challenges of large data volumes and complex data flows, a commonly adopted solution is to establish unified data management, also known as a data lake. By unifying the namespace, storing different types of data in a common backend, and providing diverse data access methods at the front end, this approach effectively reduces the difficulty and cost of data movement in complex data pipelines.
While the ideal sounds appealing, in reality it is challenging to design a single technology stack or product that achieves all of these goals. From the perspective of supporting front-end data processing frameworks, a file system that is fully compatible with the POSIX protocol currently appears to be the best option.
In addition, the file system needs to handle a huge number of files: it must not only store tens of billions of files but also maintain file access performance at that scale. This requires efficient metadata management and high-performance implementations such as multi-level caching.
Other key factors to consider when choosing a data lake product include support for storage orchestration on container platforms, flexible horizontal scalability, and data synchronization across multiple clouds and data centers.
The following table lists some common file system products on the market and compares them side by side across multiple dimensions, which we hope will serve as a reference during product selection.
JuiceFS users often ask about CephFS, Lustre, and Alluxio during selection, so we will focus on the differences between JuiceFS and these three products.
Let's compare these products from a product positioning perspective.
In terms of support for data processing frameworks, better POSIX protocol compatibility is needed.
- JuiceFS, CephFS, and Lustre are all file systems; as such, they all provide relatively complete POSIX support and ensure strong data consistency across sessions;
- Alluxio is mainly positioned as a data orchestration and acceleration tool; it only partially supports POSIX and does not guarantee strong data consistency, so its applicable scenarios are different.
In terms of supporting massive numbers of small files, the challenge lies mainly in metadata management capability and performance.
- CephFS and Lustre store metadata on disk and accelerate it with in-memory caching; beyond a certain file count, the cost of managing and accessing metadata increases significantly;
- Alluxio provides two metadata storage options, one based on RocksDB and one based on heap memory; users can choose the appropriate one according to their needs;
- JuiceFS uses object storage as the backend, which offers a flat namespace, high scalability, and low cost, making it well suited to storing massive numbers of files. The metadata engine of the Community Edition adopts an open, pluggable architecture: users can choose from a variety of mainstream open-source databases as the metadata store according to file count, performance requirements, and operational preferences. The Enterprise Edition ships a self-developed metadata engine that manages the file directory tree entirely in memory and is specifically optimized for metadata access performance and horizontal scalability.
For multi-cloud data management, file-system-level replication that is transparent to users is required.
- CephFS and Lustre do not provide file-system-level replication themselves; it has to be implemented through the replication features of the underlying storage or third-party tools, which makes data consistency hard to guarantee and adds extra management cost;
- Both JuiceFS and Alluxio provide multi-cloud, multi-region data distribution capabilities that are transparent to the user. In addition, JuiceFS provides the ability to make the file system writable at the mirror site.
To summarize, the benefits of JuiceFS as data lake storage for users are:
- Unified namespace with multi-protocol access;
- Full POSIX protocol compatibility and support for tens of billions of files;
- Strong data consistency;
- Highly concurrent shared access;
- Performance optimization for different workloads;
- Multi-cloud, multi-region data distribution and access.
02 JuiceFS Practice in Large Model Scenarios
Stage 1: Dataset Loading
There are two key points when loading a dataset for large model training: the dataset is traversed iteratively over multiple rounds, i.e., multiple epochs, and it is randomly shuffled before the start of each epoch. This ensures that the model fully learns the patterns in the data and avoids overfitting. The shuffled data is then loaded into GPU memory in batches for processing.
Since these steps rely heavily on CPU resources, the GPU tends to sit idle during this phase. Users therefore need to improve the efficiency of dataset loading, for example by using caching, to reduce GPU idle time. Datasets usually take the form of large structured files (e.g., FFRecord, TFRecord, Arrow) or massive numbers of small files such as unpacked text, images, and audio.
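To make this access pattern concrete, here is a minimal PyTorch-style sketch of multi-epoch loading with per-epoch shuffling; it assumes PyTorch is available, and the directory /jfs/dataset of small sample files on a JuiceFS mount is hypothetical:

```python
# Minimal sketch of multi-epoch data loading with per-epoch shuffling.
# Assumes PyTorch; "/jfs/dataset" is a hypothetical directory of many
# small sample files on a JuiceFS mount.
import os
from torch.utils.data import Dataset, DataLoader

class SmallFileDataset(Dataset):
    def __init__(self, root):
        self.paths = [os.path.join(root, name) for name in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Random access to many small files -> needs high IOPS and low latency.
        with open(self.paths[idx], "rb") as f:
            return f.read()

def collate(batch):
    return batch  # keep raw samples; a real pipeline would decode/stack here

loader = DataLoader(
    SmallFileDataset("/jfs/dataset"),
    batch_size=32,
    shuffle=True,       # the sampler re-shuffles at the start of every epoch
    num_workers=8,      # CPU-side loading overlaps with GPU compute
    collate_fn=collate,
)

for epoch in range(3):   # each epoch traverses the full dataset again
    for batch in loader:
        pass             # hand the batch to the GPU here
```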
Randomly shuffling the dataset requires random reads of the data files. Combined with the dataset file formats, the I/O pattern that dataset loading presents to storage is random reads of both large and small files. Random read I/O demands high IOPS and low I/O latency from the file system, and handling random reads of massive numbers of small files in particular places higher demands on the performance of the metadata engine. In addition, the dataset is read repeatedly during training, so if the data can be cached on high-performance storage media with a high cache hit rate, read performance can be maximized.
JuiceFS is designed to help users find the best balance between performance and cost. By using object storage for back-end data persistence combined with a multi-level cache acceleration architecture, it provides high IOPS and low I/O latency for dataset loading while remaining cost-effective. Some performance figures for small random-read I/O with JuiceFS are listed below:
- When hitting the local cache, latency is between 0.05 and 0.1 milliseconds;
- When hitting the distributed cache (an Enterprise Edition feature), latency is between 0.1 and 0.2 milliseconds;
- When reading directly from the object store, latency reaches 30 to 50 milliseconds or more.
Higher IOPS can be achieved by using libaio, but this requires integrating framework-specific extensions that support the libaio interface. In addition, JuiceFS Enterprise Edition provides a self-developed, fully in-memory metadata engine that keeps the average metadata request processing time at around 100 microseconds even when managing a massive number of files, and it also performs very well in dataset loading scenarios dominated by small files. Its multi-partition metadata architecture further allows dynamic, linear scaling as the file count grows.
Below is a set of large-file random read test results from users to help illustrate JuiceFS performance. The test was run against a single 1.8 TB file.
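For readers who want to reproduce this kind of measurement in their own environment, the following plain-Python sketch times small random reads against a large file; the path /jfs/data/shard-000.bin is hypothetical, and the results depend on whether reads hit the local cache, the distributed cache, or the object store:

```python
# Minimal sketch: measure small random-read latency on a large file.
# "/jfs/data/shard-000.bin" is a hypothetical path on a JuiceFS mount.
import os
import random
import time

PATH = "/jfs/data/shard-000.bin"
READ_SIZE = 4096          # 4 KiB random reads
SAMPLES = 1000

size = os.path.getsize(PATH)
latencies = []
with open(PATH, "rb", buffering=0) as f:   # unbuffered, so each read hits the mount
    for _ in range(SAMPLES):
        offset = random.randrange(0, max(1, size - READ_SIZE))
        start = time.perf_counter()
        f.seek(offset)
        f.read(READ_SIZE)
        latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"p50 {latencies[len(latencies) // 2] * 1000:.3f} ms, "
      f"p99 {latencies[int(len(latencies) * 0.99)] * 1000:.3f} ms")
```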
Stage 2: Checkpoint Saving During Training
Checkpoint files are saved during training so that, if training is interrupted, it can resume from the most recent checkpoint instead of starting from scratch, saving time and compute. If checkpoints are written synchronously, the GPU waits for the entire time window in which they are saved. Checkpoint writes are typically sequential writes of large files, so to minimize this window the file system needs to provide high write throughput.
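One common way to shrink this wait, sketched below under the assumption that PyTorch is used and that /jfs/ckpt is a JuiceFS mount path (both hypothetical here), is to snapshot the state and hand the large sequential write to a background thread so the GPU can keep training:

```python
# Minimal sketch: write checkpoints in a background thread so the GPU
# does not sit idle for the full duration of the large sequential write.
# Assumes PyTorch; "/jfs/ckpt" is a hypothetical JuiceFS mount path.
import threading
import torch

def save_checkpoint_async(state, path):
    # Snapshot tensors to CPU first so training can keep mutating them.
    cpu_state = {k: v.detach().cpu().clone() if torch.is_tensor(v) else v
                 for k, v in state.items()}
    t = threading.Thread(target=torch.save, args=(cpu_state, path), daemon=True)
    t.start()
    return t   # join() before the next save or at the end of training

# Usage inside a training loop (model and step are assumed to exist):
# handle = save_checkpoint_async(model.state_dict(), f"/jfs/ckpt/step-{step}.pt")
```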
JuiceFS uses object storage as its data persistence layer, so write throughput is bounded by the write bandwidth of the object storage, the bandwidth of the dedicated line, and the NIC bandwidth of the node where the JuiceFS client runs. JuiceFS uses a chunked storage design and increases concurrency to the object storage so that its bandwidth is fully utilized, raising the throughput of large sequential writes.
In addition, JuiceFS provides a write cache. If the back-end object storage becomes a performance bottleneck, you can consider enabling writeback caching, which uses the NVMe high-speed storage on the GPU node as a local cache: checkpoint files are written locally first and then uploaded to the object storage asynchronously, reducing write latency. Note that enabling writeback caching may, in some scenarios, leave checkpoint files inconsistent and unable to be loaded for recovery; in that case you need to load the checkpoint from the previous window and rerun some training epochs.
Based on validation in real user deployments, when checkpoints are saved to a JuiceFS file system, a single checkpoint-writing process per GPU can achieve write throughput of 1.7 GB/s or higher, which is sufficient for this scenario.
Stage 3: Model Loading in Training and Inference
Typical model loading scenarios include:
- Loading checkpoint files to resume training: the efficiency of loading checkpoint files determines how long the GPU waits for training to resume, which directly affects GPU utilization.
- Loading model files for inference services: deploying a trained model to an inference service generally requires loading the complete model file when each inference node starts. When there are many inference nodes, for example thousands, loading the model simultaneously means every node reads a complete model file from file storage, generating very high read throughput; if network throughput becomes the bottleneck, model loading becomes very slow. If the file storage is object storage on a public cloud, then object storage bandwidth and dedicated-line bandwidth can easily become bottlenecks, and having every node fetch the complete model file from object storage also incurs significant extra object-storage bandwidth and API call costs.
It is worth noting that different model file formats impose different I/O patterns on storage during model loading. Model files may be .pt or .bin files, or .safetensors files saved with the safetensors library. Among them, safetensors files are special: they are loaded via mmap, so the I/O pattern on storage is random reads of large files, whereas files in other formats are loaded via sequential reads of large files. Let's look at how JuiceFS meets the loading requirements of these two kinds of formats.
JuiceFS optimization practices for loading safetensors model files (large-file random reads): warm these files into the cache ahead of time to obtain high read IOPS and low latency; the final performance is bounded by the IOPS capability of the cache media and the network environment of the JuiceFS client. To minimize the read amplification that prefetching may cause, consider turning prefetch off. To further speed up loading of safetensors files, you can read the files into the kernel page cache before the model is loaded; compared with reading from the JuiceFS cache, this step can improve safetensors loading speed by an order of magnitude.
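As a minimal illustration of that last step, the sketch below assumes the safetensors and torch packages and a hypothetical model path; it first reads the file sequentially to pull it into the kernel page cache and then performs the mmap-based load:

```python
# Minimal sketch: prewarm a .safetensors file into the kernel page cache,
# then load it; safetensors uses mmap, so the subsequent random page
# faults are served from memory instead of going back to the mount.
# Assumes the "safetensors" and "torch" packages; the path is hypothetical.
from safetensors.torch import load_file

MODEL_PATH = "/jfs/models/llm/model-00001-of-00004.safetensors"

def prewarm(path, block_size=8 << 20):
    # A sequential read discards the data but leaves it in the page cache.
    with open(path, "rb") as f:
        while f.read(block_size):
            pass

prewarm(MODEL_PATH)
tensors = load_file(MODEL_PATH)   # mmap-based load, now mostly page-cache hits
```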
JuiceFS optimization strategies for loading model files in other formats (large-file sequential reads): JuiceFS improves sequential read performance for large files with its read-ahead feature, which predicts and loads upcoming data blocks into memory in advance; by configuring the read-ahead window appropriately and increasing concurrency to the object store, the object store bandwidth can be fully utilized. The ultimate read throughput is limited mainly by the bandwidth of the object store or of the dedicated line to it. Warming checkpoint or model files into the cache can further improve performance: for example, when loading a checkpoint on a single 8-GPU node with sufficient NIC bandwidth, cached (hot) reads can reach more than 10 GB/s of throughput, which is sufficient for this scenario. In the inference model loading scenario, warming up means the cache cluster only needs to fetch the complete model file from the object store once; the inference nodes then read from the cache, which maximizes read throughput and saves the bandwidth and API call costs of public cloud object storage.
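To sanity-check sequential read throughput for this pattern, for example before and after warming the cache, a simple measurement along these lines can be used (plain Python; the checkpoint path is hypothetical):

```python
# Minimal sketch: measure sequential read throughput of a large
# checkpoint/model file on the mount. The path is hypothetical.
import time

PATH = "/jfs/ckpt/step-12000.pt"
BLOCK = 16 << 20          # 16 MiB reads keep the read-ahead pipeline busy

total = 0
start = time.perf_counter()
with open(PATH, "rb") as f:
    while True:
        chunk = f.read(BLOCK)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.perf_counter() - start
print(f"read {total / 1e9:.1f} GB in {elapsed:.1f}s "
      f"({total / 1e9 / elapsed:.2f} GB/s)")
```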
Stage 4: Rapid Data Distribution in Hybrid Cloud Architectures
With the spread of large models, GPU compute has become a scarce resource, especially in the domestic market, unlike the CPU resources that were easy to obtain in the past. Users who pre-train general-purpose models in particular often have to coordinate compute across locations and data centers, which requires storage to be provisioned flexibly according to where the compute is located, so that data can follow the compute.
JuiceFS automatically synchronizes data from the primary site to the mirror sites through its mirror file system feature. Metadata synchronization completes within a defined time window, with millisecond latency within the same city and roughly 10-30 milliseconds between cities in the same country. Within that window the mirror sites guarantee data consistency, while the actual data is replicated asynchronously in the background. Once metadata synchronization completes, the data can be accessed at every site; if the actual data has not yet been fully replicated to a mirror site, the JuiceFS client automatically routes the request to the primary site's object storage. Access is never interrupted and the process is transparent to the user, albeit with some performance loss.
In this hybrid cloud architecture, for better cost control, it is recommended to deploy object storage in one location and deploy cache clusters at each site that are large enough to hold the training data, achieving a cache hit rate close to 100%. Data is prewarmed from the shared object storage into the cache cluster at each site; this deployment pattern is the one most widely adopted by users in AI scenarios.
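A simple way to drive this prewarming from the compute side is to walk the dataset directory and read every file once so the site-local cache group is populated before training starts; the sketch below (hypothetical dataset path, plain Python) shows the idea, and the juicefs CLI's warmup subcommand can be used for the same purpose.

```python
# Minimal sketch: populate the site-local JuiceFS cache by reading every
# dataset file once before training starts. "/jfs/dataset" is hypothetical.
import os
from concurrent.futures import ThreadPoolExecutor

ROOT = "/jfs/dataset"
BLOCK = 4 << 20   # 4 MiB reads

def warm(path):
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass

paths = [os.path.join(dirpath, name)
         for dirpath, _, names in os.walk(ROOT)
         for name in names]

# Concurrency hides object-store latency while the cache is being filled.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(warm, paths))
```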
The figure below shows the architecture of one user's hybrid cloud computing clusters, with one primary site and three mirror sites. Each site accelerates data access through its cache group.
It is worth noting that the latest Enterprise Edition of JuiceFS adds write support for mirror clusters. In a multi-site training scenario, when the training cluster at a mirror site saves a checkpoint file and writes it to that site's JuiceFS file system, the data is automatically written to the metadata and object storage of the primary site's file system and distributed to the other mirror sites. Metadata replication is synchronous, ensuring that the data versions written from mirror clusters stay consistent. This feature simplifies and speeds up checkpoint recovery for subsequent training tasks and model loading for inference services, better guaranteeing data consistency between training and inference across mirror clusters. The write performance of mirror sites depends on the condition of the dedicated network between sites.
03 Summary
A common solution to the challenges posed by large data volumes and complex data flows is to establish a unified data management system, i.e., a data lake. By adopting a unified namespace, storing different types of data in a common backend, and providing diverse data access methods at the front end, this strategy effectively reduces the difficulty and cost of data movement in complex data pipelines.
When choosing a file storage system, users can weigh aspects such as infrastructure, performance, compatibility, and ease of use to select the most suitable product. JuiceFS uses inexpensive object storage for back-end data persistence and combines it with a multi-level cache acceleration architecture to provide users with a cost-effective storage solution.
In addition, this article discussed JuiceFS practices and optimization strategies in the following key areas:
- Dataset loading: random reads of large dataset files and massive numbers of small files require a file system with high IOPS, low I/O latency, and high-performance metadata access;
- Checkpoint saving during training: sequential writes of large checkpoint files require the file system to provide high write throughput;
- Model loading in training and inference: random reads of large safetensors-format checkpoints/models require high IOPS and low I/O latency from the file system; sequential reads of large checkpoints/models in other formats require high read throughput; and large-scale concurrent model loading by inference nodes raises bandwidth bottleneck issues;
- Data distribution in hybrid cloud architectures: mirror read/write capability with guaranteed data consistency supports dataset distribution, checkpoint saving in multi-region collaborative training, and model deployment for multi-region inference services.
We hope this article has been helpful. If you have any other questions, feel free to join the JuiceFS community and discuss with everyone.