Hazel Innovations is a technology company focused on the research, development, and design of box-type warehousing robotic systems. Its simulation platform reproduces real warehouse environments and equipment through digital simulation, using imported data such as maps, orders, inventories, and strategy configurations to validate and optimize warehousing solutions, ensuring that a design is efficient and sound.
Initially, Hazel's simulation platform ran in a stand-alone environment, but as data volume grew, operation and maintenance became increasingly challenging. As a result, the platform was migrated to a Kubernetes environment in a private cloud, and the team then began looking for a distributed file system suitable for a K8s environment.
The data characteristics of the simulation platform include a large number of small files, concurrent writes, and a cross-cloud architecture. After comparing Longhorn, Ceph, and other systems, Hazel chose JuiceFS. At present, the platform holds 11 million files in total; more than 6,000 files are written per day on average, mostly small files with an average size of 3.6 KB, and there are more than 50 mount points.
01 Simulation Platform Storage Challenges: From Commercial Software to Building Our Own
A simulation platform (simulator) is a tool platform built on a discrete event engine. Through underlying digital simulation, it connects to upper-level business systems and achieves capacity equivalent to physical equipment without any physical equipment being present. Simply put, the simulation platform establishes a virtual warehouse, and all simulation runs against that virtual warehouse.
Traditional companies usually use commercial software, but such software can neither schedule at large scale nor keep compute consumption down. For this reason, we developed our own simulation platform covering IaaS, PaaS, and other services. Our simulation is mainly based on discrete events; unlike commercial software that relies on GPUs, we significantly reduce the amount of computation by simplifying computation steps and optimizing the abstraction of events, and can even complete multiple-speed simulation using only CPUs.
Our simulation system consists of several key components. The first is the simulation ontology, such as the robot simulation, which covers the logical simulation of its motion, rotation, and movement speed. The simulated equipment interfaces directly with our self-developed scheduling system and WMS (Warehouse Management System), a capability not directly available in commercial software.
The simulation process also includes mechanical simulation, picking operation simulation, and business integration with the upstream system environment. After a simulation completes, a large amount of event data is generated and recorded as small files. We store and process these small files on a separate node for extraction and analysis. In this way, we can effectively gain insights into operational efficiency and potential problems from the simulation data and further optimize system performance.
Our simulation platform started as a pure stand-alone system. The stand-alone version consisted of a starter and a core engine (analogous to a game engine), which connected to a real business system, the RCS (Robot Control System), through the real communication protocol, providing a high degree of field reproduction.
However, the stand-alone system had some drawbacks: the original core engine was written in Python, which could not meet the single machine's high-concurrency IO needs, and O&M became especially difficult once the system scaled up to 50 VMs. Given the size of our team, O&M at that scale was a huge challenge.
02 Simulation Platform to the Cloud: From Private to Hybrid Cloud Architecture
The stand-alone system lasted for about two years. As data volume grew, it faced increasingly complex O&M issues, so we migrated to a Kubernetes architecture, adopting a software-as-a-service (SaaS) approach and deploying all components in a K8s environment. We were then faced with the question of how to select storage in a Kubernetes environment.
K8s Environment Storage Selection: JuiceFS vs CephFS vs Longhorn
When we choose a distributed file system, performance is not the primary consideration, but more critical is how to effectively implement a cross-cloud network structure. We briefly evaluated JuiceFS, CephFS and Longhorn.
Longhorn only supports file sharing within a single cluster, and stretching it across multi-cloud Kubernetes clusters requires cumbersome inter-cluster networking, which is too costly for us to operate. CephFS is too difficult to maintain for a small team like ours.
As a result, we chose JuiceFS, mainly because it is particularly easy to run and maintain, making it ideal for small teams. The plug-and-play nature of JuiceFS means that even beginners can get started with few problems, which reduces operational risk. Additionally, JuiceFS has had no failures since deployment, which is very commendable.
In terms of the current scale of JuiceFS usage, we have processed about 11 million files in the last six months. We perform data cleansing every six months to remove data that is no longer needed, so the total amount of data does not keep growing. Average daily writes are about 6,400 files, reaching up to 60,000 files on some days, with file sizes ranging from 5 KB to 100 KB. The average daily write volume is about 5 GB, with a maximum of 8 GB, and we plan to scale up to 50 concurrent writes over time.
Building a Simulation Platform in a Private Cloud K8s Environment
In mid-2023, we started building the simulation platform in a K8s environment. When the system scaled to 1,000 robots, we realized that Python's IO processing power was insufficient. We therefore adopted the Go language, a change that dramatically improved the IO performance of the system, making it easy to support more than 10,000 robots concurrently on a single machine.
The file processing of the whole system is characterized by highly concurrent writes of small files; the current concurrency level is 50, which meets our needs.
Recently, we transformed our system from a monolithic architecture to a microservice architecture. This shift solves the problem of mixed and coupled code of the original team and ensures the independence of code between each service, thus improving the maintainability and scalability of the system.
We also implemented a storage-compute separation strategy: the simulation node writes small files of simulation process data to JuiceFS and renames each one to a *.fin file when writing finishes; a separate analysis node detects the *.fin files in real time and starts its computation. This separates simulation from analysis and prevents the analysis workload from stealing CPU and distorting the simulation process.
Hybrid Cloud SaaS Simulation Services
In a private cloud K8s environment, our team had to manage numerous components such as MySQL, OSS, etc. To solve the problem of fragmented resources and effort, we started moving to a hybrid cloud SaaS solution in January 2024 and chose to migrate our storage services to AliCloud. For small teams, if conditions allow, I recommend using public cloud services as much as possible, especially for storage: data security is a basic need, and there is no room for mistakes.
We used AliCloud's OSS and MySQL services to share data between the two clusters. This configuration not only improves data-processing efficiency but also brings cost-effectiveness and flexibility. For example, when machine resources in the local data center run short, JuiceFS's natural cross-cloud storage capability lets us rent machines from any cloud vendor and elastically scale the cluster to meet demand, realizing cost savings.
03 Pitfalls with JuiceFS
Before migrating to the cloud, we ran fine with Redis and MinIO locally, because we had plenty of memory and storage and thus no performance issues. After migrating to the cloud, however, we encountered some issues.
Default cache is too large, causing Pod evictions
JuiceFS's default cache size is 100 GB, while standard AliCloud servers usually come with a 20 GB disk. This mismatch can lead to Pod evictions because the cache is too large for the actually available disk space.
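One way to avoid this is to cap the cache via the StorageClass mount options of the JuiceFS CSI driver. The sketch below is illustrative; the StorageClass and Secret names are placeholders, and the exact value should be sized to the node's real free disk (JuiceFS's `cache-size` is in MiB, defaulting to 102400, i.e. 100 GiB):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc            # placeholder name
provisioner: csi.juicefs.com
parameters:
  csi.storage.k8s.io/provisioner-secret-name: juicefs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/node-publish-secret-name: juicefs-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
mountOptions:
  - cache-size=10240          # 10 GiB, well below a 20 GB node disk
```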
Problems with bucket settings for storageClasses in hybrid cloud scenarios
When implementing cross-cloud functionality, although the official documentation may recommend an intranet endpoint, practice shows that an extranet address should be used, especially with the Container Storage Interface (CSI). By default, the first operation performed is the format pod, which writes the current configuration parameters to the metadata database. If that bucket address is an intranet one, other nodes that do not specify their own bucket setting fall back to the value in the database, and extranet access fails.
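Concretely, this comes down to the `bucket` value in the Secret that the CSI driver reads. A hedged sketch, with placeholder names, region, and credentials (the point is the public `oss-cn-….aliyuncs.com` endpoint rather than the `-internal` intranet variant):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: juicefs-secret
  namespace: kube-system
type: Opaque
stringData:
  name: sim-platform                     # JuiceFS volume name (placeholder)
  metaurl: "mysql://user:pass@(mysql-host:3306)/juicefs"  # placeholder
  storage: oss
  # Extranet endpoint, reachable from nodes outside AliCloud;
  # the intranet form would be my-bucket.oss-cn-hangzhou-internal.aliyuncs.com
  bucket: "https://my-bucket.oss-cn-hangzhou.aliyuncs.com"
  access-key: "<ACCESS_KEY>"
  secret-key: "<SECRET_KEY>"
```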
Object storage takes up more space than it actually does
Although 6 TB of storage is occupied under our AliCloud account, in actual use we found that only about 1.27 TB of data was stored; the space taken up by the object storage was significantly inflated, far exceeding the actual amount of data. This is probably because leaked objects are not garbage-collected automatically, so we must periodically perform GC operations manually to reduce the storage footprint.
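With the JuiceFS community edition, the manual cleanup is done with the `juicefs gc` subcommand against the metadata URL (the URL below is a placeholder):

```shell
# Dry run: report leaked objects without deleting anything
juicefs gc "mysql://user:pass@(mysql-host:3306)/juicefs"

# Actually delete the leaked objects to reclaim space
juicefs gc --delete "mysql://user:pass@(mysql-host:3306)/juicefs"
```

Running the dry-run form first makes it easy to see how much space a cleanup would reclaim before committing to the deletion.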
04 Future prospects
We plan to achieve elastic scaling in the cloud, adopt a hybrid platform strategy, and launch a high-speed version of the service. We have already made significant progress in efficient simulation in the machine learning space, successfully achieving simulation speedups of up to 100x.
To further improve efficiency, we plan to improve the closed-loop management of the whole system. Since our simulation system already generates a large amount of data, it makes sense to put that data to full use by feeding it into machine learning training to close the loop. This will greatly enhance the intelligence and automation of our system.