MiniMax: How to Build a High-Performance, Low-Cost AI Platform for Large Models Based on JuiceFS

2024-09-02 17:01:28

Founded in December 2021, MiniMax is a leading general artificial intelligence technology company dedicated to co-creating intelligence with users. MiniMax has independently developed general-purpose large models across multiple modalities, including a trillion-parameter MoE text model, a speech model, and an image model.
Building on these general-purpose models, MiniMax has launched native applications such as "Conch AI", a productivity tool, and "Starfield", an immersive AI content community. The MiniMax open platform provides enterprises and developers with safe, flexible, and reliable API services to help them quickly build AI applications.

01 Storage Challenges in Multimodal Large Model Development

As a startup, MiniMax prioritized flexibility and cost efficiency when building its infrastructure. The company chose to deploy critical workloads, such as GPU resources, in local data centers and to run other resources in the cloud, where it can take advantage of the technical capabilities and elasticity of cloud platforms. The result is a hybrid cloud solution that combines local data centers with a multi-cloud environment. To manage the complexity of this underlying infrastructure, the company adopted Kubernetes as the unified management layer.

The storage layer, a key component of the infrastructure platform, faces the following challenges:

  • High performance: Training and inference for large models involve processing and storing huge amounts of data, which requires not only a high-capacity storage solution but also fast reads and writes;
  • POSIX compatibility: Deep learning frameworks and algorithm engineers rely on the POSIX interface in their daily work, so the storage system must be fully POSIX-compatible, or AI tasks will not run properly;
  • Hybrid cloud architecture: Computing resources, especially GPUs, are distributed across geographic regions and supplied by different service providers. For compute tasks to be scheduled efficiently, the storage system must adapt to a variety of providers and hardware environments and be flexible enough to support data replication, access, and migration across regions;
  • Storage cost optimization: As data volumes keep growing, especially in big data and AI applications, expanding storage capacity while controlling costs is a challenge. The storage technology must be cost-effective and integrate seamlessly with the existing IT architecture.

02 Why JuiceFS Enterprise Edition?

At the beginning of the selection process, MiniMax evaluated CephFS and found bottlenecks in its metadata service. MiniMax also tried several high-performance file storage services from public cloud providers, but gave them up because of their high cost. MiniMax wanted a storage system that was flexible and highly scalable, that kept costs under control, and that met the requirements of a hybrid cloud architecture.

In the end, MiniMax chose JuiceFS Enterprise Edition as the storage foundation of the company's AI platform. It supports the high-performance data access that the upper-layer models (text, speech, image, and multimodal models) need for data cleansing, model training, and model inference. In distributed training on large-scale GPU clusters in particular, JuiceFS's performance plays a key role in speeding up model iteration and improving GPU utilization.

  • Compatibility: Supports the POSIX, HDFS, and S3 interfaces, providing a unified storage solution that reduces data copying and migration;
  • I/O efficiency: Multi-level caching, read-ahead, and concurrent reads significantly improve I/O performance;
  • High-performance metadata service: The self-developed metadata service handles millions of requests per second with sub-millisecond response times, meeting the stringent requirements of AI training;
  • Multi-cloud/hybrid cloud data management: Automated cross-cloud and cross-region data replication keeps data synchronized as compute migrates, meeting globally distributed compute needs;
  • Low cost: JuiceFS's object storage-based design lets MiniMax use cost-effective object storage and significantly reduces data storage costs. In addition, JuiceFS's ease of operation and maintenance helps reduce MiniMax's overall costs.
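
To make the concurrent read strategy in the list above more concrete, here is a minimal Python sketch of the general idea: split a file into fixed-size chunks and fetch them in parallel. It only illustrates the technique and does not show how JuiceFS implements it internally. The function names, chunk size, and worker count are assumptions chosen for the example, not JuiceFS settings.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative chunk size, not a JuiceFS parameter


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read one byte range using a private file handle, so threads never share a seek position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def concurrent_read(path: str, chunk_size: int = CHUNK_SIZE, workers: int = 8) -> bytes:
    """Fetch all chunks of a file in parallel and reassemble them in file order."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so chunks stay in file order.
        chunks = pool.map(lambda off: read_range(path, off, chunk_size), offsets)
        return b"".join(chunks)


if __name__ == "__main__":
    # A temporary local file stands in for a file on a POSIX mount point.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(10_000))
    try:
        data = concurrent_read(tmp.name, chunk_size=1024, workers=4)
        print(len(data))
    finally:
        os.remove(tmp.name)
```

Because the file system is exposed through POSIX, the sketch uses only ordinary file calls. The benefit of issuing ranges in parallel depends on how much latency each underlying request carries, which is typically higher for object storage than for local disks.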

03 How to build a unified storage system based on JuiceFS in a hybrid cloud architecture?

Initially, MiniMax's workload consisted of frequent reads and relatively few writes. The company therefore adopted JuiceFS's distributed caching feature and used all-flash NVMe storage to accelerate reads. As data processing requirements grew and more clusters were built, the capacity of a single cluster was no longer sufficient. MiniMax therefore built a centralized metadata distribution engine based on JuiceFS, together with an edge cluster architecture capable of high-speed reads and writes.

MiniMax uses JuiceFS's mirror file system feature to automatically replicate metadata from the central cluster to each edge cluster. Instead of storing the actual data, edge clusters warm up JuiceFS's distributed cache with data from the central cluster over a dedicated line, using the high-performance cache cluster to increase read bandwidth and reduce data duplication.
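
The interaction described above can be modeled with a small Python sketch: an edge cluster keeps a replica of the metadata and a cache, warms the cache before a job starts, and falls back to the central cluster on a cache miss. This is a toy in-memory model under stated assumptions, not JuiceFS code. The class names `CentralCluster` and `EdgeCluster`, their methods, and the `remote_reads` counter are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CentralCluster:
    """Hypothetical central cluster: holds authoritative metadata and the actual data."""
    objects: dict[str, bytes] = field(default_factory=dict)

    def metadata(self) -> dict[str, int]:
        # Metadata is reduced to "path -> size" for this sketch.
        return {path: len(blob) for path, blob in self.objects.items()}

    def fetch(self, path: str) -> bytes:
        return self.objects[path]


@dataclass
class EdgeCluster:
    """Hypothetical edge cluster: metadata replica plus cache, no permanent data copy."""
    central: CentralCluster
    meta: dict[str, int] = field(default_factory=dict)
    cache: dict[str, bytes] = field(default_factory=dict)
    remote_reads: int = 0  # number of fetches that crossed the dedicated line

    def sync_metadata(self) -> None:
        """Stand-in for automatic metadata replication from the central cluster."""
        self.meta = self.central.metadata()

    def _fill(self, path: str) -> None:
        self.cache[path] = self.central.fetch(path)
        self.remote_reads += 1

    def warm_up(self, prefix: str) -> int:
        """Pull every file under `prefix` into the cache ahead of time; return how many were fetched."""
        warmed = 0
        for path in self.meta:
            if path.startswith(prefix) and path not in self.cache:
                self._fill(path)
                warmed += 1
        return warmed

    def read(self, path: str) -> bytes:
        """Serve from the cache when possible; fetch from the central cluster on a miss."""
        if path not in self.meta:
            raise FileNotFoundError(path)
        if path not in self.cache:
            self._fill(path)
        return self.cache[path]


if __name__ == "__main__":
    central = CentralCluster({"/data/a": b"aa", "/data/b": b"bbb", "/ckpt/c": b"c"})
    edge = EdgeCluster(central)
    edge.sync_metadata()
    edge.warm_up("/data/")   # fetches both files under /data/ once
    edge.read("/data/a")     # served from the edge cache, no remote fetch
    print(edge.remote_reads)
```

In this model, reads of warmed paths never leave the edge, which is why warming the cache before a training job raises effective read bandwidth. A real cache would also need eviction and invalidation, which the sketch omits.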

In addition, data can be downloaded on demand to a JuiceFS cluster at the edge, with its lifecycle managed by upper-layer services.

Based on this unified storage system, MiniMax built a large ring network covering the whole country, with several cities serving as core access points. Each IDC connects to the nearest of these access points to support efficient data distribution.

"JuiceFS not only provides a storage system that adapts to our hybrid cloud needs, but also optimizes the data processing flow through high-performance metadata services and multi-interface compatibility, significantly reduces O&M costs, and is ideal for us to operate a large model platform."

-- Xinglong, Technical Director, MiniMax