
Good Future: Building a Low-Operation Model Repository Based on JuiceFS in a Multi-Cloud Environment


Good Future, formerly known as Xueersi, was listed on the New York Stock Exchange in 2010. The company actively applies large model research to its teaching products and has recently launched a hundred-billion-parameter large model in the field of mathematics.

In the context of large models, storage systems need to handle huge amounts of data and complex file operations, and are required to support high concurrency and high throughput. In addition, they need to address the challenges of version management, model training performance optimization, and multi-cloud distribution.

To solve these problems, the team built a model repository based on JuiceFS that lets users store checkpoints from the training process, while the control plane lets users upload models and manage them uniformly across cloud environments. Through the JuiceFS CSI component, Good Future mounts the model repository into each cluster; mounting and configuring large model files takes only 1-3 minutes, which makes elastic scaling of AI applications easier.

In addition, strategies such as permission control and clone-based backups effectively reduce losses caused by user misoperation and improve data security. At present, Good Future has deployed two sets of metadata and data storage across multiple clouds and regions; object storage usage has reached 6TB, holding more than 100 models.

01 Challenges of model repositories in the context of large models

In the traditional DevOps R&D process, we usually use container images as deliverables, i.e., Docker Builder builds an image and pushes it to Docker Hub or another image repository via the docker push command. The client data plane pulls the image locally from the central image repository via Docker, and this process can be accelerated to improve efficiency.

When it comes to AI scenarios, things are different. Take a training task as an example: it might use torch.save, Megatron's save_checkpoint, or some other means to generate a serialized file, which is then written to storage through the Linux POSIX interface. That storage can be either an object store (e.g., AliCloud's OSS) or a file system (e.g., GPFS or NFS). In short, the training task uploads the model by writing it to remote storage through the file system. Some evaluation steps are also involved in the overall process, but we omit them here to keep the description concise. Unlike traditional IT delivery, which only involves pushing and pulling images, AI scenarios deal with larger-scale data and more complex file operations, which place higher demands on the storage system and often require high concurrency and high throughput.
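To make this write path concrete, here is a minimal sketch of a training job saving a checkpoint through an ordinary POSIX path. The mount point, model, and file layout are assumptions for illustration, not Good Future's actual code:

```python
# Minimal sketch: a training job writes a checkpoint through the POSIX
# interface of the mounted storage. The mount path and model are assumptions;
# the path could be backed by OSS, GPFS, NFS, or JuiceFS.
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)            # placeholder for the real network
optimizer = torch.optim.AdamW(model.parameters())

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": 10_000,
}

# The training code only sees an ordinary file path on the mounted storage.
torch.save(checkpoint, "/mnt/model-repo/team-a/math-llm/step-10000.pt")
```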

At the inference stage, in the container scenario, you need to mount NFS, GPFS, or object storage via CSI to pull the remote model into the container. On the surface, this process seems to work fine. However, during actual operation we found a number of obvious problems.

  • The number one problem is the lack of versioning. For traditional container images, we have explicit deliverables and versioning information, including details such as uploader, size, etc. However, in model repositories, since models are usually stored as Linux filesystems (in the form of files or folders), there is a lack of versioning and metadata information, such as uploaders, modifiers, and so on.

  • Second, model repositories lack acceleration and multi-cloud distribution. In a Docker scenario, we can use tools such as Dragonfly for acceleration. However, when using storage systems such as NFS, GPFS, or OSS, there is no effective acceleration solution, and multi-cloud distribution is not possible either. Take NFS as an example: it is basically closed-loop within a single IDC and cannot be mounted across IDCs, let alone across clouds. Even if it can be mounted, the speed is very slow, so in practice it cannot support multi-cloud distribution.

  • Finally, poor security. In inference scenarios, the entire model repository needs to be mounted into the container. If the client's mount permissions are too broad (e.g., a directory containing a large number of models is mounted), models may be leaked or accidentally deleted. Especially when the mount is read-write, the client can even delete model files.

This gives rise to storage requirements for model repositories in different scenarios.

Storage requirements for training scenarios:

  • Model download and processing: In the algorithm modeling phase, various models need to be downloaded and possibly transformed and sliced; these include open-source models, reference models, and self-designed network structures. For example, models are sliced for TP (Tensor Parallelism) and PP (Pipeline Parallelism), as in the sketch after this list.
  • High-performance read/write: The training phase requires a storage system with very high read and write throughput in order to store large-scale checkpoint files. These files can be very large, with a single checkpoint file approaching 1TB.
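As an illustration of the slicing step, the following is a minimal sketch of tensor-parallel slicing of a downloaded checkpoint. The paths, layer names, and partitioning rules are assumptions; a real pipeline would follow the partitioning logic of the target framework (e.g., Megatron-LM):

```python
# Minimal sketch of tensor-parallel (TP) slicing of a downloaded checkpoint.
# Paths, layer names, and the TP degree are hypothetical.
import torch

TP_SIZE = 4  # number of tensor-parallel ranks (assumption)

state_dict = torch.load("/models/open-source/llm/pytorch_model.bin", map_location="cpu")

shards = [{} for _ in range(TP_SIZE)]
for name, tensor in state_dict.items():
    if name.endswith("attn.qkv.weight"):          # column-parallel layers: split output dim
        for rank, piece in enumerate(torch.chunk(tensor, TP_SIZE, dim=0)):
            shards[rank][name] = piece.clone()
    elif name.endswith("mlp.down_proj.weight"):   # row-parallel layers: split input dim
        for rank, piece in enumerate(torch.chunk(tensor, TP_SIZE, dim=1)):
            shards[rank][name] = piece.clone()
    else:                                         # replicate everything else
        for rank in range(TP_SIZE):
            shards[rank][name] = tensor

for rank, shard in enumerate(shards):
    torch.save(shard, f"/models/processed/llm/tp{rank:02d}.pt")
```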

Storage requirements for inference scenarios:

  • Model versioning and serving: When a model is updated to a new version, a version release and approval process is required. In addition, while a model is being served, the GPU resources it uses may need to be scaled out and in frequently; resources are usually released at night and expanded during the day.
  • High read throughput: Since model copies need to be pulled frequently during the day to cope with scale-out, the storage system needs to support efficient reads so that model pulls can be served quickly.

In addition, the following needs exist from a manager's perspective:

  • Team-level model management: The model repository should support segregated management of models by team to ensure privacy and independence of models between different teams.
  • Comprehensive version control: The storage system should clearly record meta-information such as model iteration time and version usage, and support uploading, downloading, auditing, and sharing of models.

02 Storage Selection: How to choose between cost, performance, and stability?

Core Considerations

First, reduce dependence on cloud vendors. The technical solution should be consistent and unified across self-built IDCs and multiple cloud vendors;

Secondly, cost is also an important consideration. While adequate funding can support better solutions, a cost-benefit analysis is equally important. We need to consider the costs of various storage options together, including local disk, GPFS, and object storage (e.g., OSS);

Third, performance is a key factor. As described above, model repositories have high requirements for both read and write performance. Therefore, we need to keep model read/write traffic closed-loop within a single IDC or a single cloud to ensure high performance;

Finally, stability cannot be ignored. We do not want to introduce excessive O&M complexity to support the model repository, so there are high requirements for component simplicity and stability.

Comparison of major technology options

Fluid + Alluxio + OSS: This solution is relatively mature and has received widespread attention in recent years. It combines cloud-native orchestration capabilities with object storage acceleration to achieve a unified technology stack across clouds, and it can be used on AliCloud, Tencent Cloud, or self-built IDCs. The solution is also quite widely used in the community.

However, it has some shortcomings. For example, it cannot be integrated with Ceph RADOS, which is a pre-existing technology stack within our group. The solution is also more complex to operate and maintain, with more components and higher resource consumption. Its read performance for large files is not ideal, and the stability of the client still needs further verification.

GPFS: GPFS is a commercial parallel file system with strong read and write performance, and our group has already purchased this product. GPFS also has significant advantages in handling large volumes of small files. However, its disadvantages are equally obvious. First, it cannot synchronize across clouds: the GPFS purchased in our IDC cannot serve GPUs on another cloud, so we would have to purchase another set of GPFS there, which is costly. Second, GPFS capacity is also very expensive, several times the price of OSS.

CephFS: We have in-house experience and technical accumulation with this technology. However, it likewise cannot synchronize across clouds, and its operating costs are high.

JuiceFS: Its advantages are support for multi-cloud synchronization, simple operation and maintenance, few components, and good observability. In terms of cost, it is essentially just the cost of object storage; apart from metadata management there is no additional cost. In addition, JuiceFS can be used both on the cloud with object storage and in the IDC with Ceph RADOS. Considering the above factors, we chose JuiceFS as the underlying storage system for the large model repository.

03 Good Future's Model Repository Practice

Model Repository Read/Write Design in Training Scenarios: Single Cloud Training

First, let's focus on the read/write design of the model repository in the training scenario. We adopt a single-cloud training strategy, i.e., model training is done on a single cloud platform rather than across multiple clouds, mainly for feasibility and efficiency in practice.

To meet the read/write requirements of the training scenario, we devised the following scheme: we build a Ceph cluster from the spare NVMe disks on a large number of GPU machines, and use JuiceFS to connect to RADOS, so that reads and writes go to the Ceph cluster. During model training, the training job mounts a volume through JuiceFS CSI; when a checkpoint needs to be saved or loaded, it reads and writes RADOS directly. The measured checkpoint write speed reaches up to 15GB/s during large model training.
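For reference, a figure of this kind can be sanity-checked with a simple timing of torch.save against the JuiceFS CSI mount. This is a minimal sketch with an illustrative mount path and checkpoint size, not the production benchmark:

```python
# Minimal throughput check: time a large torch.save to the JuiceFS mount.
# The mount path and checkpoint size are illustrative assumptions.
import time
import torch

checkpoint = {"weights": torch.randn(1024, 1024, 1024)}  # ~4 GiB of fp32 data
path = "/mnt/jfs/checkpoints/bench.pt"

start = time.time()
torch.save(checkpoint, path)
elapsed = time.time() - start

size_gib = checkpoint["weights"].numel() * 4 / 1024**3   # fp32 = 4 bytes per element
print(f"wrote {size_gib:.1f} GiB in {elapsed:.1f}s -> {size_gib / elapsed:.2f} GiB/s")
```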

For metadata management, we chose Redis over more complex metadata engines such as OceanBase or TiKV for the following reason: the repository stores only large model files, each of which may be several GB in size, so the number of files, and hence the metadata volume, is relatively small (on the order of a hundred models rather than millions of small files). A full-blown metadata engine is therefore unnecessary, and using Redis reduces the O&M burden.

Model Repository Read/Write Design in Inference Scenarios: Multi-Cloud Inference

Unlike the training scenario, inference resources are usually distributed across multiple cloud platforms, such as AliCloud and Tencent Cloud. On these platforms we do not buy large numbers of NVMe machines, because the clouds themselves provide object storage. Therefore, we adopt the classic JuiceFS setup, i.e., JuiceFS plus Redis, backed by the cloud vendor's object storage. During inference, model files are mounted read-only to prevent the program from modifying them. In addition, we designed an intermittent synchronization scheme for the multi-cloud environment to ensure that models are synchronized to JuiceFS on every cloud, as sketched below.
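A minimal sketch of such intermittent synchronization, assuming the JuiceFS volumes of the IDC and of one cloud are both mounted locally on a sync node and the juicefs CLI is available; the mount paths and interval are hypothetical, and the real pipeline runs as a scheduled job:

```python
# Periodically mirror the model directory from the IDC volume to a cloud volume.
# juicefs sync works like rsync and copies only missing or changed files.
# Paths and the interval are assumptions for illustration.
import subprocess
import time

SRC = "/jfs-idc/models/"     # JuiceFS volume backed by Ceph RADOS in the IDC
DST = "/jfs-ali/models/"     # JuiceFS volume backed by OSS on AliCloud

while True:
    subprocess.run(["juicefs", "sync", SRC, DST], check=True)
    time.sleep(30 * 60)      # batch interval, e.g. every 30 minutes
```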

We also face a challenge: when inference services need to scale out massively during the day, for example via HPA (Horizontal Pod Autoscaler) or scheduled scaling, a large number of inference instances start at the same time and need to read a large number of model files quickly. Without local cache support, the bandwidth consumption would be enormous.

To cope with this challenge, we adopt a "warm up" strategy: before the scheduled scale-out, the model files that are about to be read are preloaded into the cache. This significantly speeds up elastic scale-out and ensures that the inference service can start serving quickly.
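A minimal sketch of the warm-up step, assuming the model directory sits on a JuiceFS mount with local caching enabled; the path and concurrency are illustrative, and the juicefs warmup subcommand can serve a similar purpose:

```python
# Read every model file once before the scheduled scale-out so the data
# lands in the local JuiceFS cache. Path and worker count are assumptions.
import os
from concurrent.futures import ThreadPoolExecutor

MODEL_DIR = "/mnt/jfs/models/math-llm/v3"

def read_through(path: str) -> int:
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(16 << 20):  # 16 MiB reads populate the cache
            total += len(chunk)
    return total

files = [os.path.join(root, name)
         for root, _, names in os.walk(MODEL_DIR) for name in names]

with ThreadPoolExecutor(max_workers=8) as pool:
    warmed = sum(pool.map(read_through, files))

print(f"warmed {warmed / 1024**3:.1f} GiB into the local cache")
```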

Administration: Design of model repository uploading and downloading

The management side focuses mainly on upload and download functionality. We developed a client in-house that uploads model files via the S3 protocol; the S3 gateway receives and transforms these requests and then interacts with the metadata system (Redis).

Another important design in our scenario is de-duplication of identical files. We adopted an idea similar to Docker image repositories: an MD5 value is calculated for each file, and if two files have the same MD5 they are not uploaded again. This design not only saves storage space but also improves upload efficiency.
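The following sketch illustrates the idea of MD5-based deduplicated upload through an S3-compatible gateway. The endpoint, bucket, credentials, and key layout are assumptions, not Good Future's actual client:

```python
# Minimal sketch of MD5-deduplicated upload through an S3-compatible gateway.
# Endpoint, credentials, bucket, and key layout are hypothetical.
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="http://jfs-s3-gateway:9000",
                  aws_access_key_id="AK", aws_secret_access_key="SK")

def upload_dedup(local_path: str, bucket: str = "model-repo") -> str:
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(8 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    key = f"blobs/{digest}"                      # content-addressed, like image layers
    try:
        s3.head_object(Bucket=bucket, Key=key)   # already present: skip the upload
    except ClientError:
        s3.upload_file(local_path, bucket, key)  # first copy: upload it
    return key
```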

In addition, we keep snapshots when updating models. When a file is copied using the JuiceFS snapshot capability, no new objects are added to OSS; the copy is recorded only in the metadata. This makes updating models and keeping snapshots easier and more efficient.
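Assuming the snapshot is taken with the clone subcommand available in recent JuiceFS releases (a metadata-only copy), a minimal sketch might look like this; the paths are hypothetical and both must live inside the same mounted JuiceFS volume:

```python
# Minimal sketch: create a metadata-only snapshot of a model directory with
# the JuiceFS clone subcommand. Paths are hypothetical; no objects are copied
# in the underlying OSS bucket.
import subprocess

subprocess.run(
    ["juicefs", "clone", "/jfs/models/math-llm/v3", "/jfs/models/math-llm/v3-snapshot"],
    check=True,
)
```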

04 Future prospects

On-demand synchronization of multi-cloud model repositories: Our current practice is to synchronize JuiceFS data across clouds with regular batch synchronization. This approach is relatively crude and may not meet higher data synchronization demands in the future. We therefore plan to implement a token-based synchronization system that, upon receiving model upload events, identifies the regions that need to be synchronized and automatically synchronizes the data to the multi-cloud platforms. In addition, we will introduce warm-up strategies to optimize the synchronization process and improve its efficiency and accuracy.

Extending standalone cache to distributed cache: We currently use a 3TB NVMe cache on each machine, which is still sufficient in the short term. In the long run, however, to meet greater demands for data storage and access, we will develop a distributed cache component on the client side based on consistent hashing. This component will increase overall cache capacity and hit rate to meet higher future requirements for data storage and access.
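To illustrate the routing principle behind such a component, here is a minimal consistent-hashing sketch; the node names, virtual-node count, and key layout are illustrative assumptions rather than the planned implementation:

```python
# Minimal consistent-hash ring for routing model files to cache nodes.
# Node names and the virtual-node count are illustrative.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` positions on the ring to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("models/math-llm/v3/tp00.pt"))  # which cache node serves this file
```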

I hope this has been of some help to you. If you have any other questions, feel free to join the JuiceFS community and communicate with everyone.