Hugging Face Transformers is a powerful machine learning framework that provides a range of APIs and tools for downloading and training pre-trained models. To avoid repeated downloads and improve training efficiency, Transformers automatically downloads and caches model weights, tokenizer vocabularies, and other resources, which are stored by default in the ~/.cache/huggingface/hub directory.
However, when multiple users or nodes work on the same or related tasks, each device has to download the same models and datasets repeatedly, which increases management overhead and wastes network resources.
To solve this problem, place the Hugging Face cache directory on shared storage so that every user who needs the resources can access the same copy of the data.
As for the choice of shared storage: if there are only a few devices and they are all on the local network, Samba or NFS sharing may be enough. If the computing resources are spread across different clouds or data centers in different regions, you need a distributed file system with stronger guarantees on performance and consistency, and JuiceFS is a good fit for this scenario.
With JuiceFS's distributed architecture, multi-node sharing, and strong consistency, training resources can be shared and migrated efficiently between compute nodes without preparing the same data repeatedly, which significantly optimizes resource usage and storage management and improves the overall efficiency of AI model training.
JuiceFS Architecture
JuiceFS is an open-source, cloud-native distributed file system. Its architecture separates data from metadata: object storage serves as the underlying storage for file data, while a key-value store or relational database serves as the metadata engine for file metadata. These resources can be self-hosted or purchased from a cloud provider, which makes JuiceFS easy to set up and use.
Underlying Data Storage
In terms of underlying storage, JuiceFS supports almost all mainstream public cloud object storage services, such as Amazon S3, Google Cloud Storage, and Alibaba Cloud OSS, as well as privately deployed object storage such as MinIO and Ceph.
Metadata Engine
For the metadata engine, JuiceFS supports a variety of databases such as Redis, MySQL, and PostgreSQL. Alternatively, you can use the JuiceFS Cloud Service, which comes with a high-performance distributed metadata engine developed by Juicedata and is suited to scenarios with higher performance requirements.
JuiceFS Client
JuiceFS comes in an open-source edition and a cloud service edition. They use different clients, but the basic usage is the same. This article uses the open-source edition as an example.
JuiceFS Community Edition provides a cross-platform client that supports Linux, macOS, Windows and other operating systems and can be used in a variety of environments.
Once a JuiceFS file system has been created on top of these resources, you can access it through the APIs provided by the JuiceFS client or through the FUSE mount point to read and write files, query metadata, and perform other operations.
For Hugging Face's cached data, you can mount JuiceFS at the ~/.cache/huggingface/ directory so that Hugging Face data is stored in JuiceFS. Alternatively, you can customize the Hugging Face cache location by setting an environment variable that points to a directory where JuiceFS is mounted.
The next section describes how to create a JuiceFS file system and two ways to use JuiceFS as a cache directory for Hugging Face.
Creating a JuiceFS File System
It is assumed that the following object storage and metadata engine have been prepared:
- Object storage bucket:
- Object storage Access Key: your-access
- Object storage Secret Key: your-secret
- Redis database: :6379
- Redis password: redis-password
Note that the object storage information is only needed once, when the JuiceFS file system is created; it is written to the metadata engine, and afterwards only the metadata engine address and password are required.
Installing the JuiceFS Client
For Linux and macOS systems, the JuiceFS client can be installed with the following command:
curl -sSL /install | sh -
For Windows systems, it is recommended to use the JuiceFS client in a Linux environment under WSL 2. Alternatively, you can download a pre-compiled JuiceFS client yourself; refer to the official JuiceFS documentation for details.
Creating a JuiceFS File System
Use the format command to create a JuiceFS file system:
juicefs format \
--storage s3 \
--bucket \
--access-key your-access \
--secret-key your-secret \
"redis://:redis-password@:6379" \
hf-jfs
Where:
- hf-jfs is the name of the JuiceFS file system and can be customized.
- --storage s3 specifies the object storage type as S3. Refer to the official documentation for other supported object storage types.
- The metadata engine URL starts with redis://, followed by the Redis username and password separated by a colon, then @ and the Redis address and port number. It is recommended to wrap the entire URL in quotes.
Pre-mounting JuiceFS to the Hugging Face Cache Directory
If Hugging Face Transformers is not already installed, you can pre-create a data cache directory and mount JuiceFS to it.
# Create the Hugging Face cache directory
mkdir -p ~/.cache/huggingface
# Mount JuiceFS at the Hugging Face cache directory
juicefs mount -d "redis://:redis-password@:6379" ~/.cache/huggingface
Then install Hugging Face Transformers, and it will automatically cache data into JuiceFS. For example:
pip install transformers datasets evaluate accelerate
The packages required vary with your hardware and environment, so install and configure them according to your actual situation; this is not covered in detail here.
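As a quick illustration, the following Python snippet triggers a first download whose files land in ~/.cache/huggingface/hub, now backed by JuiceFS (bert-base-uncased is used here purely as an example model):

from transformers import AutoModel, AutoTokenizer

# The first call downloads the model and tokenizer files into
# ~/.cache/huggingface/hub, which now lives on JuiceFS.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Later calls, on this node or any other node sharing the same JuiceFS
# file system, reuse the cached copy instead of downloading again.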
Specifying the Hugging Face Cache Directory via an Environment Variable
Another approach is to specify the Hugging Face cache directory via an environment variable, which lets you point the cache to the JuiceFS mount directory without modifying any code.
For example, mount JuiceFS at /mnt/jfs:
juicefs mount -d "redis://:redis-password@:6379" /mnt/jfs
Specify the Hugging Face cache directory via the HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE environment variables:
export HUGGINGFACE_HUB_CACHE=/mnt/jfs
In this way, Hugging Face Transformers caches the data into a directory mounted by JuiceFS.
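If you prefer to set the variable from Python rather than the shell, a minimal sketch (assuming JuiceFS is mounted at /mnt/jfs as above, and again using bert-base-uncased only as an example model) is to set it before importing transformers:

import os

# Point the Hugging Face cache at the JuiceFS mount point before
# transformers is imported, so the setting is picked up.
os.environ["HUGGINGFACE_HUB_CACHE"] = "/mnt/jfs"

from transformers import AutoModel

# Files downloaded here are written to /mnt/jfs, i.e. into JuiceFS.
model = AutoModel.from_pretrained("bert-base-uncased")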
Using Hugging Face Cached Data Anywhere
Thanks to JuiceFS's distributed shared storage, a user only needs to set the Hugging Face cache directory to the JuiceFS mount point and download the model resources once.
After that, mount JuiceFS on any node that needs the resources and configure it in either of the two ways described above to reuse the cached data. JuiceFS uses a "close-to-open" consistency mechanism to keep the shared data consistent when it is read and written across multiple nodes.
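As a rough sketch of what reuse looks like on a second node (assuming JuiceFS is already mounted at the cache location and the example model was downloaded earlier on another node), you can even force transformers to read only from the shared cache:

from transformers import AutoModel

# local_files_only=True makes transformers load the model directly from
# the shared JuiceFS cache without contacting the Hugging Face Hub.
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)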
Note that the speed of downloading model resources from Hugging Face depends on the network environment and may vary across countries, regions, and network routes. You may face similar latency issues when using the JuiceFS shared cache directory. To improve speed, it is recommended to maximize bandwidth and reduce the network latency between the worker nodes and both the JuiceFS underlying object storage and the metadata engine.
Summary
This article introduced two ways to use JuiceFS as the data cache directory for Hugging Face Transformers: pre-mounting JuiceFS at the Hugging Face cache directory, and specifying the cache directory through an environment variable. Both methods enable AI training resources to be shared and reused across multiple nodes.
If you have multiple nodes that need access to the same Hugging Face cached data, or want to load the same training resources in different environments, JuiceFS may be a very desirable option.
We hope this article helps you in your AI model training work. If you have related questions, you are welcome to join the JuiceFS community groups and discuss them with other users.