
Large Model Training: Storage Best Practices for Thousand-Node Clusters in a Kubernetes Environment


Today's blog comes from full-stack engineer Zhu Weiwei, who shared this topic at the recent KubeCon China conference.

Kubernetes has become the de facto standard for application orchestration, and more and more applications are moving towards cloud native. At the same time, the rapid development of AI, especially the advancement of large language models (LLMs), has led to a dramatic increase in the amount of data that organizations need to process. For example, the Llama 3.1 model has 405 billion parameters and a model file size of 231 GB; as model parameters grow, so does the size of the model file.

01 Storage Challenges for Large Model Training in Kubernetes

As data clusters continue to grow in size, managing them in a Kubernetes environment presents multiple challenges:

  1. Complex Permission Management: Large-scale AI training often involves hundreds of algorithm engineers, placing complex demands on file system permission management. Permission management at this scale is particularly difficult in a Kubernetes environment, where access to highly dynamic and distributed resources must be controlled at a granular level without compromising the efficiency of development and operations.
  2. Stability Challenge: In an extremely elastic cloud-native environment, the stability of the file system also faces great challenges. How to ensure that business is not affected when restarting or upgrading file system services?
  3. System Observability: How to increase system observability and simplify O&M and troubleshooting in a complex Kubernetes system?

In addition to the challenges in a Kubernetes environment, the storage system faces performance requirements of high concurrency, high throughput and low latency, as well as the challenge of maintaining data consistency in a multi-cloud architecture.

02 How is the JuiceFS architecture designed to address these challenges?

JuiceFS stores metadata and data separately. Metadata is stored in databases such as Redis, MySQL, or our self-developed high-performance cloud metadata engine, while data is split into chunks and stored in object storage; almost all object stores on the market are supported. Storing files in chunks allows every file I/O request to be located precisely within a specific chunk by its offset, which is especially suitable for reading and writing large files and ensures data consistency.

As shown in the figure below, the JuiceFS client sits at the top of the architecture, handles all file I/O requests, and provides multiple access methods to upper-layer applications, including the POSIX interface, the JuiceFS CSI Driver, and the S3 Gateway.

JuiceFS in a Kubernetes Environment

JuiceFS provides a CSI Driver that enables users to consume the file system in Kubernetes through native PersistentVolumeClaims (PVCs), supporting both static and dynamic provisioning.

With static provisioning, the administrator creates a separate PersistentVolume (PV) for the application Pod. Users can use JuiceFS in a Pod by simply creating a PVC and declaring it in the Pod.
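As a concrete illustration, a minimal static-provisioning setup might look like the sketch below; the volume name, capacity, and the juicefs-secret holding the file system credentials are placeholders rather than values from the original talk.

```yaml
# Administrator-created PV backed by the JuiceFS CSI driver (all names are placeholders).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: juicefs-pv
spec:
  capacity:
    storage: 10Pi             # required by the PV API; JuiceFS itself does not enforce it here
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.juicefs.com
    volumeHandle: juicefs-pv  # must be unique per volume
    fsType: juicefs
    nodePublishSecretRef:     # Secret with the file system name and metadata engine credentials
      name: juicefs-secret
      namespace: default
---
# PVC created by the user and referenced from the application Pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: juicefs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Pi
  storageClassName: ""        # empty so the claim binds to the statically created PV
  volumeName: juicefs-pv
```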

Dynamic provisioning simplifies the administrator's work. Instead of creating a separate PV for each Pod, the administrator creates a PV template, i.e., a StorageClass. The user still creates a PVC, just as with static provisioning, and the system automatically generates the required PV from the StorageClass at runtime.
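A minimal sketch of dynamic provisioning is shown below; the secret references use the standard CSI provisioner parameter keys, and all names are placeholders.

```yaml
# StorageClass acting as the PV template (secret names are placeholders).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc
provisioner: csi.juicefs.com
reclaimPolicy: Retain
parameters:
  csi.storage.k8s.io/provisioner-secret-name: juicefs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-publish-secret-name: juicefs-secret
  csi.storage.k8s.io/node-publish-secret-namespace: default
---
# The user still creates a PVC; the PV is generated automatically from the StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: juicefs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: juicefs-sc
```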

The following illustration demonstrates the flow of operations when the JuiceFS CSI Driver receives a mount request from Kubernetes. A separate Pod is created to run the JuiceFS client. This design provides several significant benefits:

  1. Increased system stability and scalability: The FUSE client is completely decoupled from the CSI Driver component so that restarts and upgrades of the CSI Driver do not affect the operation of the FUSE client.
  2. Ease of Management: This architecture allows for intuitive management of FUSE daemons in the same way as Kubernetes, increasing process transparency and management efficiency.

How to Run the CSI Driver in a Serverless Environment

In serverless environments, services are typically not tied to a specific node, which means a DaemonSet cannot be run on the node, preventing the CSI Node component from working properly. To address this issue, we adopted a solution based on the Sidecar pattern to support JuiceFS in elastic serverless environments, ensuring high availability and flexibility for the storage client.

Specifically, we register a webhook with the API server in the CSI Controller. When the API server needs to create a Pod, it calls this webhook, and through this mechanism we inject a sidecar container running the JuiceFS client into the application Pod. The JuiceFS client thus coexists with the application container in the same Pod and shares its lifecycle, improving overall operational efficiency and stability. See the JuiceFS documentation to learn more about sidecar mode.
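To make the mechanism concrete, the sketch below shows roughly what an application Pod can look like after the webhook has injected the client sidecar. The container names, images, and the shared host path are illustrative assumptions, not the exact output produced by the CSI Controller.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-trainer
spec:
  containers:
    - name: trainer                          # original application container
      image: pytorch/pytorch                 # illustrative image
      volumeMounts:
        - name: jfs-dir
          mountPath: /data
          mountPropagation: HostToContainer  # see the mount performed by the sidecar
    - name: jfs-mount                        # injected sidecar running the JuiceFS client (illustrative name)
      image: juicedata/mount                 # illustrative image
      securityContext:
        privileged: true                     # FUSE mounts generally require elevated privileges
      volumeMounts:
        - name: jfs-dir
          mountPath: /data
          mountPropagation: Bidirectional    # propagate the FUSE mount out of the sidecar
  volumes:
    - name: jfs-dir
      hostPath:                              # shared path between the two containers (illustrative)
        path: /var/lib/juicefs/volume/model-data
        type: DirectoryOrCreate
```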

Enabling data security in multi-tenant environments

Ensuring data security in a multi-tenant environment is a major challenge. JuiceFS employs a variety of security mechanisms to address it:

Data isolation: JuiceFS isolates data between different workloads by assigning a separate storage directory to each PVC dynamically provisioned through a StorageClass.

Data encryption: With static data encryption enabled for the file system, users can put the key passphrase and key file in a Kubernetes Secret to turn on JuiceFS data encryption.
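A rough sketch of such a Secret is shown below. The JFS_RSA_PASSPHRASE environment variable is the usual way to supply the key passphrase to the JuiceFS client, while the encrypt_rsa_key field name is an assumption here; consult the CSI Driver documentation for the exact keys.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: juicefs-secret
type: Opaque
stringData:
  name: my-juicefs                        # file system name (placeholder)
  metaurl: redis://redis.default:6379/0   # metadata engine address (placeholder)
  # Passphrase for the RSA key, passed to the client as an environment variable.
  envs: '{"JFS_RSA_PASSPHRASE": "<passphrase>"}'
  # PEM-encoded RSA private key used for static data encryption
  # (field name is an assumption; check the CSI Driver docs for the exact key).
  encrypt_rsa_key: |
    -----BEGIN RSA PRIVATE KEY-----
    ...
    -----END RSA PRIVATE KEY-----
```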

Permission control: Users can manage file permissions with Unix-style UIDs and GIDs by setting the uid and gid directly in the Pod. In addition, JuiceFS supports POSIX ACLs for more fine-grained permission control.
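For example, running the application under a specific uid and gid is done with a standard Kubernetes securityContext; the IDs and names below are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  securityContext:
    runAsUser: 1000    # uid the application process runs as
    runAsGroup: 3000   # primary gid
    fsGroup: 2000      # supplemental group applied to mounted volumes
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: juicefs-pvc
```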

Unlimited expandable storage space

JuiceFS is built on an object store, so it has virtually no upper limit on storage capacity. We have implemented a logical data management system based on the object store.

Users can set a capacity quota for a JuiceFS volume by specifying the storage request in a PersistentVolumeClaim (PVC) created from the StorageClass. Expanding the capacity later is just as simple: modify the storage request of the PVC with kubectl, and the volume's quota is adjusted accordingly.
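A hedged sketch of this workflow: the PVC below requests 100Gi, which becomes the volume's quota, assuming the StorageClass (juicefs-sc here, a placeholder) has volume expansion enabled.

```yaml
# The storage request acts as the capacity quota for the volume.
# Later expansion requires allowVolumeExpansion: true on the StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: juicefs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: juicefs-sc
  resources:
    requests:
      storage: 100Gi   # initial quota
```

To expand, raise the request, for example with kubectl patch pvc juicefs-pvc -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}', or simply edit the PVC with kubectl edit.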

How to achieve high performance?

When a large number of clients need to access the same dataset frequently, distributed caching enables multiple clients to share the same cached data, significantly improving performance. This mechanism is particularly suitable for scenarios where a GPU cluster is used for deep learning model training.

Here is a typical deployment of a distributed cache cluster. The JuiceFS client runs on the GPU compute nodes and uses local NVMe drives as its local cache. The distributed cache cluster is typically deployed close to the compute nodes; it warms data up from the remote object store so that clients on the GPU nodes can read it nearby.
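In the CSI Driver, cache behavior is usually configured through mount options on the StorageClass or PV. The sketch below uses the standard JuiceFS cache-dir and cache-size options; the cache-group option for joining a distributed cache cluster is an enterprise-edition feature and is included only as an assumption of how such a deployment is typically wired up (secret parameters are omitted for brevity).

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc-cached
provisioner: csi.juicefs.com
mountOptions:
  - cache-dir=/mnt/nvme/jfscache   # local NVMe cache directory on the GPU node
  - cache-size=1048576             # local cache limit in MiB (placeholder value)
  - cache-group=train-cluster      # join the distributed cache group (enterprise feature; illustrative)
```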

Here we share a performance test of sequential and random reads on large files to illustrate the effect of distributed caching. As shown in the figure below, in the sequential-read test the bandwidth is 4.7 GB/s without caching and 13.2 GB/s with caching; in the random-read test the latency is 29 milliseconds without caching and 0.3 milliseconds with caching, a significant performance improvement.

How do you maintain data consistency in a multi-cloud environment?

As model parameters and dataset sizes continue to grow, GPU resources in the public cloud are a more appropriate choice due to their ample quantity and high flexibility. To reduce costs and meet the demands of multi-cloud architectures, more and more companies are choosing to allocate GPU resources across different cloud platforms.

We introduced the mirror file system feature so that users can access JuiceFS data across different cloud platforms while maintaining data consistency. The system synchronizes data from the original file system to the object store asynchronously.

In addition, we enhance data consistency by periodically synchronizing the Raft changelog with the metadata engine. In a mirror file system, clients initiate write requests against the original file system, while read requests can be served from either side; this keeps the original and mirror file systems consistent. This design ensures stable data consistency in multi-cloud architectures.

03 Practice and Optimization for Thousand-Node Clusters

In a cluster containing thousands of nodes, one of the biggest challenges is managing the large number of nodes and their associated resources, such as Deployments and DaemonSets. As the number of requests for these resources increases, the API server comes under extreme pressure.

Optimization 1: Visual Monitoring

When there are many resources in a cluster, troubleshooting often becomes tedious and difficult. For this reason, we provide a visual dashboard that lists all application Pods using JuiceFS PVCs and shows the mount Pod corresponding to each of them, along with their operational status.

Additionally, if a Pod is in an abnormal state, the dashboard shows hints about the likely cause and gives direction for further troubleshooting, which greatly simplifies the troubleshooting process. See the documentation to learn more about this feature.

Optimization 2: Resources & Performance

For application Pods, creating a separate mount point for each Pod is impractical and wastes resources. Therefore, all application Pods that use the same PersistentVolumeClaim (PVC) share a single mount Pod by default, and in some configurations all application Pods that use the same StorageClass can also share a single mount Pod to further optimize resource usage.
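In the open-source CSI Driver, StorageClass-level sharing is, to the best of our knowledge, toggled by an environment variable on the CSI node service; the variable and container names below are assumptions to be verified against the driver documentation. The snippet is written as a strategic merge patch for the node DaemonSet.

```yaml
# patch-share-mount.yaml -- apply with:
#   kubectl -n kube-system patch daemonset juicefs-csi-node --patch-file patch-share-mount.yaml
# (environment variable and container names are assumptions; check the CSI Driver docs)
spec:
  template:
    spec:
      containers:
        - name: juicefs-plugin
          env:
            - name: STORAGE_CLASS_SHARE_MOUNT
              value: "true"
```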

On the other hand, the CSI Driver uses list-watch to manage the lifecycle of mount Pods. In a large-scale cluster, the full List request issued when the CSI component starts puts great pressure on the API server and may even bring it down. We therefore adopted a different strategy: the CSI component on each node polls only the kubelet on that node, reducing the pressure on the API server.

Optimization 3: Stability

Due to the unique nature of FUSE clients, their mount points may remain unavailable after a restart, which affects all application data requests. We previously implemented an optimization for this: when a mount Pod is restarted because of OOM or other reasons, the CSI performs a remount for the application Pod. While this restores the mount point, file requests in flight at that moment may still be affected.

For further optimization, the mount Pod obtains the FUSE file descriptor (fd) it uses from /dev/fuse at startup and passes this fd to the CSI Driver via IPC. The CSI Driver maintains an in-memory mapping between each mount Pod and its FUSE fd. If a mount Pod is restarted due to OOM, the CSI immediately deletes it and starts a new Pod to replace it.

When the new Pod starts up, it retrieves the previously used FUSE fd from the CSI Driver via IPC and continues processing requests. The impact on the user side is relatively small: there is only a short lag in file read operations, and subsequent processing is unaffected.

Optimization 4: Smooth Upgrade

Smooth upgrade of a mount Pod is similar to the failure recovery described above, with one significant difference: during the upgrade, the old client saves all the data requests it is currently processing to a temporary file. With the smooth upgrade feature, the new client performs two operations at startup: first, it obtains its FUSE file descriptor (fd) from the CSI Driver; second, it reads the pre-upgrade data requests from the temporary file. This ensures that no business requests are missed during the upgrade, achieving a truly smooth upgrade.

04 Summary

Since the JuiceFS CSI Driver was first launched in July 2021, application scenarios have become more and more complex as the number of Kubernetes users grows and cluster sizes continue to expand. Over the past three years, we have continuously optimized and improved the JuiceFS CSI Driver in key areas such as stability and permission management, enabling JuiceFS to adapt effectively to a wide range of complex requirements and making it an ideal choice for data persistence in Kubernetes environments.

Finally, here is a recap of the core features and key optimizations of JuiceFS to help users make better storage choices in a Kubernetes environment.

  • Data security: JuiceFS guarantees data security through data isolation, encryption, and permission control. In addition, its distributed caching technology not only improves system performance but also effectively controls costs.
  • Data elasticity: In serverless environments, JuiceFS uses the sidecar design pattern and dynamic data scaling to enhance data elasticity.
  • Data consistency: JuiceFS's mirrored file system feature ensures data consistency under multi-cloud architecture, ensuring data stability and reliability across different cloud platforms.
  • High performance and cost control: JuiceFS supports quickly pulling the necessary file data from the object store into the local cache, typically in 10 to 20 seconds, compared with four to five hundred seconds without caching, dramatically reducing data fetching time. The warm-up process is parallelized and can complete before the model is loaded, effectively reducing startup time.
  • POSIX Compatibility: JuiceFS offers full POSIX compatibility, ensuring a high degree of compatibility with a wide range of applications and systems.