In its early stages, the Vivo AI computing platform used GlusterFS as its underlying storage layer. As data volumes grew and more business scenarios were onboarded, performance and maintenance problems began to emerge. Vivo therefore turned to the self-developed Regulus file system, a distributed file storage solution built on the open-source version of JuiceFS.
In this article, we introduce the features Vivo built into the Regulus file system on top of JuiceFS, as well as Vivo's optimizations for key scenarios such as slow sample-data reads and checkpoint writes. We also cover Vivo's technical roadmap for FUSE, the metadata engine, and RDMA communication, in the hope of providing a reference for users running JuiceFS in large-scale AI scenarios.
01 Why the AI Computing Platform Introduced Regulus File Storage
Initially, Vivo's AI computing platform used GlusterFS, which the platform team maintained themselves. Over time, the team ran into several problems: first, GlusterFS became very slow when handling small files; second, expanding the cluster and rebalancing data caused significant disruption to the business.
Subsequently, the computing team chose to build new clusters because the earlier ones were full and were not expanded further. This left multiple clusters to maintain, which increased management complexity. Moreover, as a platform provider, the team had limited manpower to invest in storage, making it difficult to develop new features.
After learning that our internet business department was developing a file storage solution, the team held in-depth discussions with us and ran tests. Eventually, they decided to migrate their data storage to our Regulus file storage system.
The Regulus file system is developed on top of the JuiceFS open-source version and supports multiple standard access protocols, including POSIX, HDFS, and CIFS for Windows. In addition, we provide a file recovery function, modeled on commercial solutions, that can restore data to its original path.
The system also supports client-side hot upgrades, which we have implemented on the open-source version. We additionally support username-based permission management; by default, authentication uses the local uid/gid. On top of this, we implemented username authentication by referring to JuiceFS Enterprise Edition.
The following diagram shows the architecture of the Regulus file system, which is similar to that of JuiceFS. At the base, we use TiKV to store metadata, while the data itself is stored in our in-house object storage system. Of particular note, for the Windows scenario we developed a plugin in Samba that calls the JuiceFS API directly, giving Windows users a gateway to our file storage.
The AI computing platform's current storage flow is as follows: raw data is first acquired and processed by 40,000 batch tasks to generate sample libraries. These sample libraries are then used for training on GPUs to produce model files, which are delivered to an online system for inference. Both the raw data and the processed sample libraries are stored directly in the Regulus file system, and Spark can process them directly thanks to Regulus's compatibility with the HDFS API. The model files are also stored in Regulus; through the CSI plugin it provides, the online inference system can mount and read these files directly.
02 Storage Performance Optimization
The training phase involves two storage-critical operations: sample reading and checkpoint saving.
Scenario 1: Accelerating Sample Reads
To speed up sample loading, we developed a distributed read cache layer. Before a model is trained, we preload the data needed for that training run into the read cache layer using the warmup function provided by JuiceFS. This way, training data can be fetched directly from the read cache layer instead of being pulled from object storage. Reading data directly from the object store typically takes anywhere from over ten to several tens of milliseconds, while the read cache layer brings read latency below 10 milliseconds, significantly speeding up data loading to the GPUs.
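As a rough illustration of the preloading step (not Vivo's actual tooling), a training job could invoke the open-source `juicefs warmup` subcommand against the dataset directory before the data loader starts; the mount point and dataset path below are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Hypothetical dataset path on a Regulus/JuiceFS mount.
	dataset := "/mnt/regulus/datasets/sample-lib-2024"

	// Pre-load the dataset into the distributed read cache layer with the
	// open-source `juicefs warmup` subcommand, so that subsequent training
	// reads hit the cache instead of object storage.
	cmd := exec.Command("juicefs", "warmup", dataset)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Fatalf("warmup failed: %v\n%s", err, out)
	}
	fmt.Printf("warmup finished:\n%s", out)
}
```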
Scenario 2: Checkpoint Writing
For checkpoint writes, we referred to Baidu's solution. Specifically, checkpoint data is first written to a temporary staging area (we call it a "co-managed" area, essentially an intermediate cache) and then progressively flushed to the object store. In this process we use a single-replica model, because checkpoints are saved at regular intervals, so even if one checkpoint is lost the impact on overall training is limited. We have, of course, also developed strategies to protect critical data, and not all data goes through this intermediate cache: typically only checkpoint files and training-phase log files are written to it. If training is interrupted, the checkpoint files can be read back from this intermediate cache.
In addition, when data is flushed to the object store, we do not immediately clear it from the checkpoint cache. Because training can be interrupted at any time, clearing the cached data would force a slow re-download from the object store. We therefore set a TTL (time-to-live) on cached data. For example, if checkpoint data is flushed to the object store every hour, we can set the TTL to 1.5 hours. That way, even if training is interrupted, an up-to-date copy is guaranteed to be available in the checkpoint cache.
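The following Go sketch illustrates the general idea under assumed paths and intervals (it is not Vivo's implementation): checkpoints land in a staging cache directory first, are flushed to the object-store-backed path asynchronously, and are evicted from the cache only after a TTL slightly longer than the flush interval.

```go
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
	"time"
)

const (
	cacheDir   = "/data/ckpt-cache"         // hypothetical staging cache directory
	backingDir = "/mnt/regulus/checkpoints" // hypothetical path backed by object storage
	cacheTTL   = 90 * time.Minute           // keep cached copies longer than the hourly flush interval
)

// saveCheckpoint writes the checkpoint into the staging cache first, then
// copies it to the object-store-backed path in the background. A lost copy
// only costs one checkpoint interval, so a single replica is acceptable.
func saveCheckpoint(name string, r io.Reader) error {
	cached := filepath.Join(cacheDir, name)
	f, err := os.Create(cached)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	go func() {
		src, err := os.Open(cached)
		if err != nil {
			log.Printf("flush open: %v", err)
			return
		}
		defer src.Close()
		dst, err := os.Create(filepath.Join(backingDir, name))
		if err != nil {
			log.Printf("flush create: %v", err)
			return
		}
		defer dst.Close()
		if _, err := io.Copy(dst, src); err != nil {
			log.Printf("flush copy: %v", err)
		}
	}()
	return nil
}

// evictExpired removes cached checkpoints older than the TTL, so a recent
// copy stays locally available if training is interrupted.
func evictExpired() {
	entries, err := os.ReadDir(cacheDir)
	if err != nil {
		log.Printf("readdir: %v", err)
		return
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		if time.Since(info.ModTime()) > cacheTTL {
			os.Remove(filepath.Join(cacheDir, e.Name()))
		}
	}
}

func main() {
	for range time.Tick(10 * time.Minute) {
		evictExpired()
	}
}
```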
During development of the write cache, we encountered a challenge. Communication between our client and the write cache uses gRPC, which allocates memory to hold the parsed data when messages are deserialized. When write operations are highly concentrated in a short period (e.g., within a few tens of seconds), this results in a large number of memory allocations and releases. Since the write cache is developed in Go, whose garbage collection (GC) performs slowly in such situations, this could exhaust the write cache's memory.
To solve this problem, we investigated other deserialization schemes and ultimately adopted Google's FlatBuffers. Unlike the protobuf (Pb) deserialization used with gRPC, FlatBuffers lets us use the data directly after deserialization without an additional parsing step. This reduced memory usage by up to 50% compared with Pb, and our tests showed a 20% improvement in write performance.
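The benefit comes from FlatBuffers' zero-copy access: the receiver reads fields directly out of the wire buffer instead of materializing a decoded object graph. Below is a minimal sketch in Go; the `cachefbs` package and its `WriteRequest` table are hypothetical stand-ins for code generated by `flatc --go` from a schema with a `data:[ubyte]` field.

```go
package main

import (
	"fmt"

	flatbuffers "github.com/google/flatbuffers/go"

	// Hypothetical package generated by `flatc --go` from the cache schema.
	"example.com/cache/cachefbs"
)

// encodeWriteRequest builds a FlatBuffers message containing the payload.
func encodeWriteRequest(payload []byte) []byte {
	b := flatbuffers.NewBuilder(len(payload) + 64)
	data := b.CreateByteVector(payload)
	cachefbs.WriteRequestStart(b)
	cachefbs.WriteRequestAddData(b, data)
	b.Finish(cachefbs.WriteRequestEnd(b))
	return b.FinishedBytes()
}

// decodeWriteRequest accesses the payload in place: unlike protobuf
// Unmarshal, no intermediate objects are allocated, which keeps GC
// pressure low during write bursts.
func decodeWriteRequest(buf []byte) []byte {
	req := cachefbs.GetRootAsWriteRequest(buf, 0)
	return req.DataBytes() // a slice into buf, no extra copy
}

func main() {
	wire := encodeWriteRequest([]byte("checkpoint chunk ..."))
	fmt.Printf("payload length: %d\n", len(decodeWriteRequest(wire)))
}
```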
Scenario 3: High Model-Loading Traffic in Online Inference
While users were running online inference, we noticed that model downloads generated enormous traffic, sometimes even saturating the bandwidth of the object storage gateway. On closer analysis, we found that there were many instances, each independently loading the full model into memory, and that they all started loading the model almost simultaneously, creating significant traffic pressure.
To address this, we borrowed from commercial solutions and implemented logical grouping of Pods. Under this strategy, each group reads only one complete copy of the model from the underlying storage: each node within the group reads part of the model, and the nodes share data with one another (similar to a P2P approach), reducing the overall traffic demand. This significantly reduces the bandwidth consumed on the underlying object storage and effectively relieves the traffic pressure.
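A highly simplified sketch of the grouping idea follows (hypothetical model path, shard assignment, and transport; a real implementation would handle peer discovery, retries, and shard assembly): each node reads only its own byte range of the model from the Regulus mount and serves it to peers in the same logical group, so the group as a whole pulls one full copy from storage.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// loadShard reads this node's byte range of the model from the Regulus
// mount; peers fetch it from us instead of hitting object storage.
func loadShard(modelPath string, shardIdx, shardCount int) ([]byte, error) {
	f, err := os.Open(modelPath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	shardSize := (info.Size() + int64(shardCount) - 1) / int64(shardCount)
	offset := int64(shardIdx) * shardSize
	if offset >= info.Size() {
		return nil, nil
	}
	if offset+shardSize > info.Size() {
		shardSize = info.Size() - offset
	}
	buf := make([]byte, shardSize)
	if _, err := f.ReadAt(buf, offset); err != nil && err != io.EOF {
		return nil, err
	}
	return buf, nil
}

func main() {
	// Hypothetical configuration: node 0 of a 4-node logical group.
	const shardIdx, shardCount = 0, 4
	shard, err := loadShard("/mnt/regulus/models/llm-v1.bin", shardIdx, shardCount)
	if err != nil {
		log.Fatal(err)
	}

	// Serve our shard to peers in the same group; each peer assembles the
	// full model from its local shard plus shards fetched from the others.
	http.HandleFunc("/shard", func(w http.ResponseWriter, r *http.Request) {
		w.Write(shard)
	})
	log.Println("serving shard", shardIdx)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```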
03 Technology Planning
Bypassing the FUSE kernel with libc calls to improve read/write performance
The following diagram is from a paper published in an ACM journal. It shows that with a FUSE mount, requests travel from user space into the kernel and back again, and the context switches along this path are quite costly.
In the chart, the taller bars represent native FUSE, while the shorter bars represent the optimized solution.
- Small file scenario: native FUSE shows roughly a 1,000x difference in the number of context switches compared to the optimized solution;
- Large file scenario: the difference in context switches between native FUSE and the optimized solution is about 100x;
- Mixed workload scenario: again, a huge difference in the number of context switches.
As the paper notes, the main overhead on this path comes from context switching. We therefore plan to optimize at the FUSE layer, focusing on metadata and small-file scenarios. We are currently evaluating candidate schemes.
Self-developed metadata engine with file semantics pushed down
We also plan to develop our own metadata engine. The engine we use today is based on TiKV, but TiKV has no file semantics; all file semantics are implemented on the client side, which makes feature development very inconvenient for us.
Moreover, when multiple nodes write to the same key concurrently, transaction conflicts are very frequent. Recently, we have also run into a problem where the process would suddenly get stuck for anywhere from a few minutes to ten minutes; this issue remains unresolved.
In addition, the TiKV PD component serves requests only on the active (master) node. Once request volume exceeds 100,000, latency rises significantly and CPU utilization on the PD node (112 cores) approaches saturation. We are therefore trying approaches to reduce the master node's CPU utilization and see whether the latency problem improves. Referring to papers such as Baidu's CFS paper, we aim to turn metadata operations into single-machine transactions as much as possible to reduce the overhead of distributed transactions.
Implementing RDMA communication for the cache layer
The GPU nodes in our server room currently communicate over an RDMA network, but communication with the cache layer still uses TCP. We plan to develop an RDMA-based communication path between the client and the cache to achieve lower latency and lower CPU consumption.
Looking at the client's flame graph, the time spent on RPC communication is still very prominent. While processing the data in the write cache takes only one or two milliseconds, the client may spend five or six milliseconds, or even ten, uploading the data across the link; when the client CPU is very busy, this can reach twenty or thirty milliseconds. RDMA itself consumes little CPU and relatively little memory, so we consider it a solution worth trying.