Clobotics applies computer vision and machine learning to the wind power and retail industries. In wind power, Clobotics uses drones to inspect wind turbine blades, significantly reducing reliance on manual labor. In retail, the company analyzes captured images of packaged goods to provide insights based on real-time data, helping to increase sales and reduce operational costs.
For storage, Clobotics originally called cloud SDKs directly in some systems and used internal wrappers in others, so there was no unified storage layer. The team also faced the challenges of a multi-cloud architecture, massive numbers of small files, and compatibility. While reworking the storage layer, Clobotics compared file system solutions such as Ceph, SeaweedFS, and JuiceFS, and finally chose JuiceFS.
JuiceFS supports access to almost all major public cloud platforms and efficiently handles the storage of large numbers of small files. Its full POSIX compatibility allows us to implement the entire data flow on JuiceFS, significantly reducing technical engineering efforts and costs.
Currently, two business scenarios at Clobotics, wind power and retail, are connected to JuiceFS, covering business access, data labeling, and model training, and we are still expanding to new scenarios.
01 Clobotics Business Architecture and Storage Requirements
Clobotics has two main business modules: wind power and retail. The figure below shows our technical architecture. At the infrastructure level, we use standardized service components, including a configuration center (e.g., Apollo), a service registry (e.g., Nacos), and monitoring, logging, and alerting systems. These systems rely on widely recognized open-source components, such as Elasticsearch and Grafana for visualizing logging and monitoring data, and Prometheus for collecting monitoring metrics.
Above this sits the universal service layer, whose core is the centralized management of asset data across multiple domains, such as wind turbines in the wind power industry, stores and supermarkets in the retail industry, and our own assets, such as drones and freezers for retail use. In addition, the IAM (Identity and Access Management) system allocates and manages user permissions to keep the system secure.
To address the real-time, near-real-time, and batch needs of data processing, we designed and implemented a unified workflow and scheduling center. It incorporates open-source components such as Apache Airflow for batch processing; we also developed custom services to cover scheduling needs that Airflow cannot fully meet. At the public service level, we extracted AI model services so that AI capabilities can be shared and reused.
Data Characteristics of Computer Vision Scenarios
Our core data type consists of various types of captured images, which vary significantly in terms of specifications and pixel clarity. Approximately 50 million raw captured images are added each month, covering both wind power and retail, and the data is characterized as follows:
Massive small files: Original images in wind power scenarios can exceed 10 megabytes, and even after compression their size is not negligible. These images must be examined in detail one by one during annotation. For network transmission efficiency and smooth annotation work, we adopt a "tile image" technique, which cuts each large image into small images similar to map tiles to improve loading speed and viewing efficiency. However, this approach also causes a surge in the number of files, especially at the bottom layer of tiles, which are small in size but huge in number. In the retail scenario, we process about 2 to 3 million slice commands per month, with peaks of 500 or more.
More than twenty types of model training: These cover both general and vertical domains, with different iteration cycles (weekly, monthly, quarterly), so that the models can adapt to the needs of different scenarios.
High metadata performance requirements: CSV and JS files are indispensable data input formats for AI model training. In addition, model files, as a key component of online services, are large and need to be updated and iterated frequently, placing higher demands on storage performance.
Management of new data: As new sites are added, data must be updated or refreshed regularly, generating additional I/O operations. In addition, generated reports need to be stored temporarily in a specific location so that users can download or share them.
Versioning: Versioning is something we cannot ignore, especially for raw data and image datasets. In the retail scenario, rapidly changing customer requirements call for fine-grained version control of datasets. In the wind power scenario, managing different blade shapes in detail likewise requires finer-grained dataset slicing and versioning.
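To make the file-count growth caused by tiling concrete, the following minimal sketch counts the tiles in an image pyramid. The 256-pixel tile size, the halving of resolution at each level, and the 8000 x 6000 example image are illustrative assumptions, not values from our production pipeline.

```python
import math


def tile_pyramid(width, height, tile=256):
    """Count tiles per pyramid level for one image.

    Level 0 is the full-resolution image; each following level halves
    the resolution until the whole image fits in a single tile.
    Returns a list of (level, cols, rows, tile_count) tuples.
    The tile size and halving scheme are assumptions for illustration.
    """
    levels = []
    level, w, h = 0, width, height
    while True:
        cols = math.ceil(w / tile)
        rows = math.ceil(h / tile)
        levels.append((level, cols, rows, cols * rows))
        if cols == 1 and rows == 1:
            break
        # Halve the resolution for the next, coarser level.
        w = math.ceil(w / 2)
        h = math.ceil(h / 2)
        level += 1
    return levels


if __name__ == "__main__":
    pyramid = tile_pyramid(8000, 6000)
    for level, cols, rows, count in pyramid:
        print(f"level {level}: {cols} x {rows} = {count} tiles")
    print("total tiles:", sum(count for *_, count in pyramid))
```

Under these assumptions, a single 8000 x 6000 image becomes 1,025 tile files, 768 of them at full resolution, which is why the most detailed layer dominates the file count.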
Challenges of Building a Storage Layer for Multi-Cloud Architectures
We use multiple cloud storage services, including Azure Blob Storage, AliCloud OSS, Google Cloud Storage (GCS), and Amazon S3, as well as MinIO in standalone or small-cluster mode. This mix stems primarily from the need to adapt to different customer environments.
Because different customers choose different cloud providers, we must keep adapting to support data access from different technology stacks (e.g., .NET, Go, Python, C++, and Java), which increases architectural complexity and the burden of operation and maintenance. In addition, because business platforms such as wind power and retail differ in functionality and scenarios, we face a certain amount of duplicated development work, which puts considerable pressure on the R&D resources of a startup.
Furthermore, in a cross-cloud architecture, operations such as data annotation and model training need to pull data from multiple cloud storage services, which not only complicates data migration but may also incur unnecessary costs from frequent data reads. How to optimize cross-cloud data storage and access while ensuring data consistency and security therefore became an urgent problem for us.
02 File Storage Selection: POSIX, Cloud Native, Low O&M
Given the data characteristics of our scenarios and the challenges that multi-cloud architectures pose to data storage, we rethought how to build a more lightweight and flexible storage layer. This architecture needs to be flexible enough to meet the data storage needs of different business scenarios, while ensuring that new cloud storage services can be connected quickly at very low or even no cost.
During the initial selection, we considered the mainstream open-source storage solutions on the market. After in-depth research, we first excluded HDFS, even though it is widely used by many companies in China. Its design is geared toward large data volumes and high-throughput scenarios rather than the large number of files and the regular data cleanup that we face. The HDFS NameNode comes under pressure as the number of files grows, and the high cost of data deletion, coupled with the lack of POSIX compatibility, made it unsuitable for our needs.
Name | POSIX-compatible | CSI Driver | Scalability | Operation Cost | Documentation |
---|---|---|---|---|---|
HDFS | No | No | Good | High | Good |
Ceph | Yes | Yes | Medium | High | Good |
SeaweedFS | Basic | Yes | Medium | High | Medium |
GlusterFS | Yes | Not mature | Medium | Medium | Medium |
JuiceFS | Yes | Yes | Good | Low | Good |
Next, the CSI Driver became a necessary criterion when evaluating storage solutions, given that most companies now prefer to deploy and operate their services on Kubernetes. Our current data volume is only 700TB and is growing slowly, so scalability is not our primary concern. O&M costs, on the other hand, are something we have to keep a tight rein on: as a startup, we want to minimize the manpower invested in infrastructure so that we can focus resources on our core business.
When evaluating Ceph, we found it relatively easy to install and deploy, but its operation and maintenance costs were high, especially around capacity planning and scaling. Ceph's documentation, while rich, is poorly organized, which makes it harder to get started.
SeaweedFS, an open-source project with good performance, came to our attention because colleagues had relevant operation and maintenance experience, but we ultimately abandoned it because of its high operation and maintenance costs and insufficient documentation. GlusterFS attracted some attention for its lightweight operation, maintenance, and scaling; although a self-built storage layer brings some operation and maintenance cost, overall it remained within an acceptable range.
Ultimately, we chose JuiceFS, which caught our eye with its full POSIX compatibility and support for cloud-native environments. In terms of O&M costs, JuiceFS relies heavily on lightweight metadata engines such as MySQL or Redis, which are part of our existing technology stack, and there is no need to introduce new components, thus greatly reducing O&M complexity. Additionally, JuiceFS is documented in a clear and easy-to-understand manner, making it easy for newcomers to get started.
03 JuiceFS Application Practice
After selecting JuiceFS as our storage tier solution, our overall storage architecture has been built and optimized to its current form. Here I will focus on just a few key aspects of our real-world application.
First, the FUSE module plays a central role in model training. Model training requires processing a large amount of data, which is usually stored in the cloud, and we use high-performance physical machines with sufficient GPU resources to meet its computation and scheduling requirements. All model training tasks are centralized on a single high-performance machine. We therefore use FUSE mounts to synchronize data from different cloud storage sources to a local directory, forming a locally accessible storage space. The largest single training dataset we have processed this way reaches the scale of millions, with high data stability, and is mainly used to train fast recognition models for retail scenarios.
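Because a FUSE mount behaves like a local directory, training code can enumerate a dataset with ordinary file APIs. The sketch below is a minimal illustration of that idea, not our actual training loader; the mount point `/mnt/jfs`, the directory layout, and the list of image suffixes are hypothetical.

```python
import os

# Hypothetical set of image extensions; adjust to the real dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def iter_images(dataset_root):
    """Yield image file paths under dataset_root in a stable order.

    os.walk streams directory by directory, so the full listing of a
    dataset with millions of files is never held in memory at once.
    """
    for dirpath, dirnames, filenames in os.walk(dataset_root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in IMAGE_SUFFIXES:
                yield os.path.join(dirpath, name)


def dataset_summary(dataset_root):
    """Return (file_count, total_bytes) for the images under dataset_root."""
    count = 0
    total_bytes = 0
    for path in iter_images(dataset_root):
        count += 1
        total_bytes += os.stat(path).st_size
    return count, total_bytes


if __name__ == "__main__":
    # Hypothetical mount point and dataset directory.
    root = "/mnt/jfs/retail/training-set"
    if os.path.isdir(root):
        print(dataset_summary(root))
```

The same functions work unchanged on a plain local directory, which is the practical benefit of going through the mount rather than a cloud SDK.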
Second, for resource management and access control, we use the CSI Driver mainly in Mount Pod mode. This approach simplifies the deployment process and the organizational structure of Pods, and, through fine-grained control by our internal scheduler, effectively avoids resource access conflicts and concurrent read/write problems among different Pods. Although we encountered occasional deadlocks at the beginning, we achieved efficient concurrency control by optimizing dataset management and the access scheduling strategy.
The S3 Gateway was an unexpected and important bonus of choosing JuiceFS. We had originally planned to build a standalone file service for sharing internal files, but that would have meant dealing with complicated permission and expiration issues. S3 Gateway not only provides role-based permission control that meets our basic needs, but also enables fine-grained management of how long shared links remain valid through the Security Token mechanism, effectively reducing the risk of malicious data scraping.
The benefits of using JuiceFS are as follows:
- Unified storage layer: First and foremost, JuiceFS achieves the core goal of our selection process, which is to provide a unified storage layer. This layer not only simplifies data storage management but also improves overall data access efficiency.
- Flexibility in cloud storage access: As our business grows, we can connect new cloud storage services or types more easily, without large-scale changes to the existing architecture, which improves the scalability and adaptability of our system.
- Simplified permission management: The built-in ACL mechanism in JuiceFS satisfies our permission management needs in most scenarios. Although particularly large or complex business environments may require additional extensions, it is sufficient for our daily needs.
- Cross-cloud storage versioning: JuiceFS allows us to manage data versions effectively across different cloud storage services, ensuring data consistency and traceability and providing solid data support for business decisions.
- Performance monitoring and optimization: With JuiceFS, we can collect and analyze storage layer performance metrics to assess and optimize system performance more accurately. This is difficult to achieve with bare cloud storage, where the management of the underlying data is often opaque to the average user.
- Transparency in metadata management: JuiceFS makes it easier to access and manage file metadata, such as when a file was written and when it was updated, which is critical for advanced operations such as data repair and tiered storage.
- POSIX compatibility: JuiceFS's POSIX compatibility means that developers can use standard file APIs regardless of programming language or technology stack, with no additional learning cost, which improves development efficiency and system compatibility.
- Simplified operations and maintenance: Operating JuiceFS is relatively simple and focuses mainly on the health of the metadata service, such as Redis or MySQL. This reduces the difficulty of operation and maintenance and minimizes the risk of downtime caused by improper system maintenance.
- Cost savings: Most unexpectedly, effective dataset management on JuiceFS has significantly reduced the uploading and storing of duplicate data. This not only lowers storage costs but also saves operational costs by avoiding unnecessary data copies. In addition, cleaning up duplicate data has further improved storage efficiency.
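As an illustration of the metadata and POSIX points above, file timestamps on a mounted file system can be read with the standard `os.stat` call and used to pick candidates for tiered storage. The sketch below is an assumption-laden example rather than our tiering job: the 30-day threshold and the mount path are hypothetical, and it only lists candidates without moving any data.

```python
import os
import time

SECONDS_PER_DAY = 86400


def stale_files(root, max_age_days, now=None):
    """Return sorted paths under root not modified within max_age_days.

    Uses st_mtime from os.stat, which any POSIX-compatible mount exposes.
    `now` can be injected (seconds since the epoch) to make runs repeatable.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical mount point and age threshold.
    root = "/mnt/jfs/reports"
    if os.path.isdir(root):
        for path in stale_files(root, max_age_days=30):
            print("candidate for colder storage:", path)
```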
When using JuiceFS, we have adopted several strategies to optimize data storage and management:
- Separate instance architecture for data segregation and consolidation: We prioritize a separate instance architecture that uses different metadata engines to manage different data storage requirements precisely. Compared with building one large unified storage cluster, this approach reduces complexity and management difficulty. Considering the need to segregate data among different customers and the challenges of consolidating data across general-purpose scenarios, we assign data to separate instances based on its characteristics and usage. This not only enables quick access for specific domains such as experimental data, but also reduces the difficulty and cost of data recovery. In model training, redundant nodes and retry mechanisms help training resume quickly and reduce the impact on the training cycle.
- Dataset versioning and isolation: We manage data versions through a multi-layer directory structure and specific naming conventions to cope with frequent updates, such as product packaging changes in retail. A unified coding prefix scheme ensures that the specific dataset version required can be located quickly during model training or data reading. At the same time, by arranging and combining sub-nodes under the multi-layer directory, we can manage and quickly combine different dataset versions efficiently, which improves the flexibility and efficiency of data processing.
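The directory-based versioning described above can be sketched as follows. The layout `<root>/<business>/<dataset>/v<NNNN>` and every name in the example are hypothetical stand-ins; our real prefixes and naming conventions are not shown in this article.

```python
from pathlib import PurePosixPath


def dataset_path(root, business, dataset, version):
    """Build the directory for one dataset version, e.g. .../beverage/v0003."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    return PurePosixPath(root) / business / dataset / f"v{version:04d}"


def latest_version(directory_names):
    """Return the highest version among names like 'v0007', or 0 if none."""
    versions = [
        int(name[1:])
        for name in directory_names
        if name.startswith("v") and name[1:].isdigit()
    ]
    return max(versions, default=0)


def compose_training_set(root, business, selection):
    """Combine chosen dataset versions into one list of input directories.

    `selection` maps dataset name -> version number, so a training run can
    mix, for example, an old packaging dataset with a new shelf dataset.
    """
    return [
        dataset_path(root, business, name, version)
        for name, version in sorted(selection.items())
    ]


if __name__ == "__main__":
    # Hypothetical mount point, business line, and dataset names.
    paths = compose_training_set(
        "/mnt/jfs", "retail", {"beverage": 3, "snack": 12}
    )
    for p in paths:
        print(p)
```

Zero-padded version directories sort correctly in a plain directory listing, so the newest version can be found without any extra index.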
04 Future Planning
Optimize the data warm-up process: Currently, we mount JuiceFS locally and copy the data to the local directory where the model is trained, an approach we identified as inefficient in the initial implementation. Since JuiceFS already provides features such as caching and prefetching, we plan to investigate and make full use of these built-in features to cache data intelligently, manage datasets more efficiently, and improve data access speed.
Optimize data access across geographies: In certain scenarios, we need to access data located in Europe that cannot be transferred out of Europe because of data protection policies, although temporary access is allowed. Currently, we meet this need with an on-premise CDN solution, in order to control costs and avoid an off-the-shelf CDN service that may not be economical. Going forward, we expect to use JuiceFS's caching mechanisms to enable short-term data sharing and efficient access, further streamlining how we handle data across geographies.
Deploying multiple JuiceFS instances: We will carry out in-depth tuning and optimization work. By fine-tuning configuration parameters, optimizing resource allocation, and monitoring performance, we aim to further improve the overall performance and stability of the system so that JuiceFS can continue to support our business needs efficiently.
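For the warm-up item above, one possible starting point is the `juicefs warmup` command, which fills the local cache for given paths ahead of time. The sketch below only assembles and optionally runs that command; it assumes the `juicefs` CLI is installed and the file system is already mounted, and the dataset path and thread count are hypothetical. Options can differ between JuiceFS versions, so check `juicefs warmup --help` before relying on them.

```python
import shlex
import subprocess


def build_warmup_command(paths, threads=50, background=False):
    """Assemble a `juicefs warmup` invocation for the given mounted paths."""
    if not paths:
        raise ValueError("at least one path is required")
    cmd = ["juicefs", "warmup", "--threads", str(threads)]
    if background:
        cmd.append("--background")  # return immediately, warm up in background
    cmd.extend(paths)
    return cmd


def warm_up(paths, threads=50, background=False):
    """Run the warm-up; raises CalledProcessError if the CLI reports failure."""
    cmd = build_warmup_command(paths, threads=threads, background=background)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical dataset directory on a mounted JuiceFS volume.
    print(shlex.join(build_warmup_command(["/mnt/jfs/retail/beverage/v0003"], threads=20)))
```

Warming the cache in place would remove the separate copy step, since training could then read directly from the mount.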