As large models gain popularity, GPU computing resources are becoming increasingly scarce, and the traditional strategy of "compute follows storage" is shifting toward "storage follows compute". To ensure data consistency and ease of management, enterprises usually choose object storage in a specific region of a public cloud as the centralized repository for all model data. When compute tasks are scheduled elsewhere, manual intervention is often required; copying and migrating data by hand is not only costly, but also brings management and maintenance complexity, including thorny issues such as access control.
JuiceFS Enterprise Edition's "Mirror File System" feature allows users to automatically replicate metadata from one region to multiple regions, creating a one-to-many replication model. In a multi-cloud architecture, this feature ensures data consistency while significantly reducing the workload of manual operations and maintenance.
In the latest JuiceFS Enterprise Edition 5.1, the mirror file system supports not only reads but also direct writes. In this article, we will discuss how reads and writes to a mirror file system are implemented.
01 Why you need a mirrored file system
Let's envision a scenario where a user's file system is deployed in Beijing, but GPU resources there are in short supply, while the user still has GPU resources available in Shanghai. If the user wants to run a model training task in Shanghai, there are two straightforward options:
- Directly mount the Beijing file system in Shanghai. In theory, as long as the network between Beijing and Shanghai is stable, clients in Shanghai can access the data for training. In practice, however, file system access usually involves frequent metadata operations, and performance often falls short of expectations because of the high network latency between the two locations.
- Create a new file system in Shanghai and copy the required datasets there before training. The advantage is that the performance of the Shanghai training task is guaranteed. The disadvantages are also obvious: on the one hand, building a new file system requires additional hardware cost; on the other hand, synchronizing the data before each training run increases operational complexity.
To summarize, neither of these simple solutions is satisfactory. For this reason, JuiceFS Enterprise Edition offers the Mirror File System feature. It allows users to create one or more complete mirrors of an existing file system, which automatically synchronize metadata from the source so that clients in the mirror region can access the file system nearby and get a high-performance experience. Since it is possible to mirror only the metadata and the synchronization process is automated, a mirror file system has significant advantages in cost and O&M complexity over the second option above.
02 Principles of the Mirror File System
The architecture of JuiceFS Enterprise Edition is similar to that of the Community Edition: both consist of a client, object storage, and a metadata engine. The difference is that the Community Edition's metadata engine usually uses third-party databases such as Redis, TiKV, or MySQL, while the Enterprise Edition ships with a self-developed high-performance metadata service whose metadata engine consists of one or more Raft groups. Its architecture is shown below:
Thanks to the separation of metadata and data, users can independently choose whether to mirror metadata and whether to mirror data when creating a mirrored file system. The architecture for mirroring both is as follows:
At this point, the mirror metadata service actually belongs to the same Raft group as the source metadata service, except that its members take the role of learners. When a metadata update occurs at the source, the service automatically pushes the changelog to the mirror, which replays it locally. In this way, the existence of the mirror file system does not affect the performance of the source file system, although the mirror's metadata version will lag slightly behind.
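To make the replication direction concrete, here is a minimal, hypothetical Go sketch of the learner side: the source pushes changelog entries, and the mirror only replays them and advances its metadata version. The type and field names are illustrative, not JuiceFS internals.

```go
package main

import "fmt"

// changelogEntry is an illustrative stand-in for a metadata change
// produced by the source Raft group and streamed to learner mirrors.
type changelogEntry struct {
	Version int64  // monotonically increasing metadata version
	Op      string // e.g. "create /d1/newf"
}

// mirrorReplica models the learner side: it only replays entries,
// it never votes and never accepts writes of its own.
type mirrorReplica struct {
	applied int64
}

func (m *mirrorReplica) replay(e changelogEntry) {
	if e.Version <= m.applied {
		return // already applied, ignore duplicates
	}
	// apply the metadata change locally, then advance the version
	fmt.Printf("mirror applied v%d: %s\n", e.Version, e.Op)
	m.applied = e.Version
}

func main() {
	mirror := &mirrorReplica{}
	// the source pushes entries asynchronously; here we feed them in order
	for _, e := range []changelogEntry{
		{1, "mkdir /d1"},
		{2, "create /d1/newf"},
	} {
		mirror.replay(e)
	}
	fmt.Println("mirror metadata version:", mirror.applied)
}
```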
Data can also be mirrored through asynchronous replication, with synchronization performed automatically by designated nodes. The difference is that a client in the mirror region only accesses metadata in its own region, but can access the object storage of both regions. When actually reading data, the client first tries to read from its own region; if it cannot find the desired object, it then reads from the source region.
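This "local region first, source region second" order can be summed up in a few lines. The Go sketch below uses a made-up objectStore type standing in for a region's bucket; it is only an illustration of the fallback logic described above.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("object not found")

// objectStore is a minimal stand-in for one region's object storage.
type objectStore map[string][]byte

func (s objectStore) get(key string) ([]byte, error) {
	if data, ok := s[key]; ok {
		return data, nil
	}
	return nil, errNotFound
}

// readObject reads from the mirror region's store first and only falls
// back to the source region when the object is missing locally.
func readObject(local, source objectStore, key string) ([]byte, error) {
	if data, err := local.get(key); err == nil {
		return data, nil
	}
	return source.get(key)
}

func main() {
	source := objectStore{"chunk-1": []byte("hello")}
	local := objectStore{} // mirror-region bucket not yet populated
	data, err := readObject(local, source, "chunk-1")
	fmt.Println(string(data), err)
}
```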
In general, the data volume itself is large and making another full copy is relatively expensive, so a more recommended approach is to mirror only the metadata and build a distributed cache group in the mirror region to speed up data reads, as illustrated below:
Recommended usage of the JuiceFS mirror file system: both regions share the same object storage, and the mirror region builds a distributed cache group to improve performance.
This usage is especially suitable for model training and other scenarios where datasets can be prepared in advance. Before running a training task, the user first pulls the required data objects into the cache group of the mirror region with the juicefs warmup command; the subsequent training can then be completed entirely in the mirror region, with performance essentially the same as in the source region (assuming a similar distributed cache group is also configured there).
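The usual way to prepare the cache is the juicefs warmup command. Purely as an illustration of what warming up amounts to, the hypothetical Go sketch below reads every file under a dataset directory on the mount point so that the blocks end up in the mirror region's distributed cache group; the mount path is an assumption, not a real deployment detail.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// warmDir reads every file under dir once so its data lands in the
// mirror region's cache. In practice `juicefs warmup` does this for you.
func warmDir(dir string) error {
	return filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		n, err := io.Copy(io.Discard, f) // pull the blocks into the cache
		fmt.Printf("warmed %s (%d bytes)\n", path, n)
		return err
	})
}

func main() {
	// hypothetical mount path of the mirror-region file system
	if err := warmDir("/jfs/dataset"); err != nil {
		fmt.Fprintln(os.Stderr, "warmup failed:", err)
	}
}
```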
03 Experimental New Feature: Writable Mirror File System
In previous releases, the mirror client defaulted to read-only mode because the mirror metadata itself only supported reads, and all modification operations had to be performed at the source. However, as user requirements have grown, we have noticed some new use cases, such as temporary data generated during training. Users want to avoid maintaining two different file systems and expect the mirror side to support a small amount of writes as well.
To fulfill these needs, we introduced the "Writable Mirror File System" feature in version 5.1. When designing this feature, we considered three main aspects: first, the stability of the system, which must be guaranteed; second, the consistency of data at both ends; and finally, write performance.
Initially, a straightforward option we explored was to allow the mirror metadata service to handle write operations as well. However, during development we found that merging metadata updates from both ends involves very complex corner cases and consistency issues. Therefore, we kept the design of "metadata is writable only at the source". To handle write requests from mirror clients, there were two alternative options:
Option 1: The client sends the write request to the mirror's metadata service, which forwards it to the source. The source executes the operation, synchronizes the metadata back to the mirror when done, and finally returns. The advantage of this approach is the simplicity of the client, which only needs to send the request and wait for the response. However, it complicates the metadata service, which now has to manage request forwarding and metadata synchronization. In addition, because the chain is long, a failure at any point may cause the request to fail.
Option 2: The client connects not only to the mirror's metadata service but also directly to the source's metadata service, and separates reads from writes internally: read requests are still sent to the mirror, while write requests are sent to the source. This complicates the processing logic on the client side, but simplifies the implementation of the metadata services, which only need minor adaptations, and it is more stable for the system as a whole.
Considering the simplicity and reliability of the service, we finally chose Option 2, as shown in the figure below. Compared with the original architecture, the main addition is that the mirror client sends write requests directly to the source metadata service.
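A minimal sketch of Option 2's client-side routing is shown below. The metaService and mirrorClient types are hypothetical placeholders for the real client internals; the point is only that reads go to the nearby mirror while writes go to the source.

```go
package main

import "fmt"

// metaService stands in for a connection to a metadata service.
type metaService struct{ name string }

func (m *metaService) handle(op string) string {
	return fmt.Sprintf("%s handled %s", m.name, op)
}

// mirrorClient keeps connections to both regions and separates reads
// from writes: reads stay in the mirror region, writes go to the source.
type mirrorClient struct {
	source *metaService // writable, far away
	mirror *metaService // read-only, nearby
}

func (c *mirrorClient) dispatch(op string, write bool) string {
	if write {
		return c.source.handle(op)
	}
	return c.mirror.handle(op)
}

func main() {
	c := &mirrorClient{
		source: &metaService{name: "source(A)"},
		mirror: &metaService{name: "mirror(B)"},
	}
	fmt.Println(c.dispatch("lookup /d1", false))
	fmt.Println(c.dispatch("create /d1/newf", true))
}
```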
Let's take a create request for a new file as an example. Assuming the metadata services on the source and mirror side are A and B respectively, and the mirror client is C, completing the request roughly involves five steps:
- Client sends write request: C first sends a create request to A to create a file.
- Source service response: After processing the request, A sends create OK to inform C that the file has been successfully created, and attaches A's metadata version number (assumed to be v1) to the response.
- Changelog push: at the same time as it replies to C, A also generates a changelog and pushes it to B.
- Client sends a wait request: after C receives the successful reply from the source, it checks its own mirror metadata cache to see whether its version has also reached v1. If not, it sends a wait message to B carrying the version number v1.
- Mirror service response: B receives the wait message and checks its own metadata version. If it has already reached v1, it replies wait OK to C immediately; otherwise, it puts the request into the internal queue and waits for its own version number to be updated to v1 before sending a reply.
Once C has confirmed in step 4 that the mirror version has reached v1, or has received wait OK in step 5, it returns to the upper-layer application. In either case, B has already incorporated the changes from this create, so C will see the latest metadata when it reads later. Moreover, since steps 2 and 3 happen almost simultaneously, in most cases the wait message is processed and answered immediately.
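The version handshake can be illustrated with a small, self-contained Go sketch. It models only the ordering guarantee: the source replies with a version number, the changelog is pushed asynchronously, and the client's wait on the mirror does not return until the mirror has caught up. All names here are illustrative, not the actual protocol messages.

```go
package main

import (
	"fmt"
	"sync"
)

// mirrorMeta models the mirror metadata service (B): it replays changelog
// entries pushed by the source and answers "wait until you reach version v".
type mirrorMeta struct {
	mu      sync.Mutex
	cond    *sync.Cond
	applied int64
}

func newMirrorMeta() *mirrorMeta {
	m := &mirrorMeta{}
	m.cond = sync.NewCond(&m.mu)
	return m
}

// replay is called when a changelog entry from the source is applied.
func (m *mirrorMeta) replay(version int64) {
	m.mu.Lock()
	m.applied = version
	m.mu.Unlock()
	m.cond.Broadcast()
}

// wait blocks until the mirror's metadata version reaches v (the "wait" message).
func (m *mirrorMeta) wait(v int64) {
	m.mu.Lock()
	for m.applied < v {
		m.cond.Wait()
	}
	m.mu.Unlock()
}

func main() {
	b := newMirrorMeta()

	// Steps 1-2: C sends create to the source A, which replies with version v1.
	const v1 = int64(1)
	fmt.Println("source A: create OK, version", v1)

	// Step 3: A pushes the changelog to B asynchronously.
	go b.replay(v1)

	// Steps 4-5: C asks B to wait until it has caught up to v1, then returns
	// to the application knowing B already contains the new file.
	b.wait(v1)
	fmt.Println("mirror B caught up to version", v1, "- create visible on the mirror")
}
```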
The read path of the mirror client has a similar version-checking mechanism. Specifically, before sending a read request, C compares the metadata version numbers of the source and mirror services in its cache; if the source's version is newer, it first sends a wait message to B and waits for the mirror to catch up before processing the original read request. However, the source version cached by C is not always up to date (for example, if C has not sent a write request for a long time), so this mechanism only lets C read data that is as fresh as possible; it does not guarantee that the data is always the latest (there may be a lag of under one second, the same as with the original read-only mirror).
Finally, we will briefly illustrate the practical benefits of the JuiceFS mirror file system with a slightly more complex mixed read/write example.
The requirement is that client C wants to create a new file newf in the directory /d1/d2/d3/d4. According to the design of the file system, C needs to look up each directory and file on the path one by one, and confirm that the target file does not exist before sending the create request. Assume that the network latency from C to A and from C to B is 30 ms and 1 ms respectively, that C has not yet built up a metadata cache, and that the request processing time of A and B is negligible.
With the mirror file system: C's read requests are handled by B, and only the final file creation request needs to be sent to A. The total time is about 1 * 2 * 6 (mirror lookups) + 30 * 2 (source create) + 1 * 2 (mirror wait) = 74 ms.
Without the mirror file system: if the source file system is mounted directly in the mirror region, every request from C has to interact with A, so the total time is about 30 * 2 * 6 (source lookups) + 30 * 2 (source create) = 420 ms, more than five times as long.
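For reference, the arithmetic behind these two totals can be checked with a few constants (30 ms and 1 ms one-way latencies, six lookups, one create, one wait):

```go
package main

import "fmt"

func main() {
	const (
		rttSource = 30 * 2 // round trip to the source A, in ms
		rttMirror = 1 * 2  // round trip to the mirror B, in ms
		lookups   = 6      // path components resolved before the create
	)
	withMirror := lookups*rttMirror + rttSource + rttMirror // lookups on B, create on A, wait on B
	withoutMirror := lookups*rttSource + rttSource          // everything goes to A
	fmt.Printf("with mirror: %dms, without: %dms\n", withMirror, withoutMirror)
	// prints: with mirror: 74ms, without: 420ms
}
```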
04 Summary
In AI research, multi-cloud architectures have become standard for many organizations due to the extremely high cost of GPU resources. With the JuiceFS mirror file system, users can create one or more complete mirrors of a file system that automatically synchronize metadata from the source, allowing clients in the mirror region to access the file system nearby, delivering high performance while reducing O&M effort.
In the latest JuiceFS 5.1 release, we have made important improvements to the mirror file system, adding write support so that enterprises can access data in any data center under a unified namespace, while still enjoying the acceleration of nearby caching with data consistency guaranteed. We hope the implementation ideas and attempts shared in this article provide users with some insights and inspiration.