Today's blog comes from JuiceFS Cloud Service user Jerry, who has innovated version control of their data using the JuiceFS snapshot feature. Jerry, a North America-based tech company, uses artificial intelligence and machine learning to streamline the comparison and purchase process for customers buying auto and home insurance.
Rigorous testing and controlled releases have been standard practice in software development for decades. But what if we could apply these principles to databases and data warehouses? Imagine being able to define a set of standards for data infrastructure with test cases that are automatically applied to each new "release" to ensure that customers always see accurate and consistent data. This would dramatically improve data quality.
01 Challenge: Why end-to-end testing is not common in data management
The idea may seem intuitive, but end-to-end testing is not common in data management because it requires the database or data warehouse to support clones or snapshots, a capability most data systems do not provide.
Modern data warehouses are essentially organized, mutable stores that change over time and that we manipulate through data pipelines. The data is usually visible to end customers as soon as it is generated, and there is no concept of a "release". Without that notion, end-to-end testing of a data warehouse makes little sense: there is no way to ensure that what a test sees is what the customer will see, because the data is constantly changing as the pipeline modifies it.
The core of the problem, then, is to build a data publishing mechanism: one that can capture the state of the data warehouse at a given moment as a "snapshot" and control the visibility of that snapshot to end users. The snapshot then becomes a "release artifact", and we control when, and under what conditions, users can finally see it.
02 Existing methods and their limitations
Some teams have developed version control systems on top of data warehouses. Instead of directly modifying the tables queried by end-users, they create new versions of the tables for the changes and use atomic exchange operations to "publish" the tables. While this approach works to some extent, it presents significant challenges:
- Implementing the "create and exchange" model efficiently is not easy;
- Ensuring consistency across multiple tables (e.g., verifying that every row in the order table has a corresponding row in the price table) requires "packaging" changes to multiple tables into a single "transaction." This is challenging not only because it is difficult to implement, but also because it requires the data pipeline to be organized fairly tightly, as the sketch after this list illustrates.
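For illustration, here is a minimal sketch of the "create and exchange" pattern in Python with the clickhouse-driver package. The table names and the staging source are assumptions; EXCHANGE TABLES is ClickHouse's atomic table-swap statement (available with the Atomic database engine):

```python
# Sketch of the "create and exchange" publishing pattern.
# Table names (orders, orders_next, orders_source) are hypothetical.
from clickhouse_driver import Client

client = Client(host="localhost")

# 1. Build the next version of the table off to the side.
client.execute("DROP TABLE IF EXISTS orders_next")
client.execute("CREATE TABLE orders_next AS orders")
client.execute("INSERT INTO orders_next SELECT * FROM orders_source")

# 2. Atomically swap it with the table end users query.
client.execute("EXCHANGE TABLES orders_next AND orders")

# Caveat: each EXCHANGE is atomic on its own; ClickHouse offers no way
# to swap, say, orders and prices in one transaction, which is exactly
# the multi-table consistency problem described above.
```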
03 Solution: ClickHouse Database Cloning Powered by JuiceFS
We have developed a system that utilizes the JuiceFS snapshot feature to "clone" ClickHouse databases into replicas. This approach is described in detail in our earlier post "Low Cost Read/Write Separation: Jerry Builds a Master-Slave ClickHouse Architecture".
It works as follows:
- We run the ClickHouse database on JuiceFS, a POSIX-compatible shared file system backed by an object storage service.
- JuiceFS provides a "snapshot" feature with semantics similar to git branches.
- Using a simple command such as
juicefs snapshot src_dir des_dir
we can create a clone of src_dir as it exists at that moment.
This approach allows us to easily clone a running ClickHouse instance, creating a frozen snapshot that can be treated as a "release artifact".
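As a hedged sketch of how this might be scripted: the directory layout, naming scheme, and wrapper function below are assumptions; only the juicefs snapshot command itself comes from the text above.

```python
# Sketch: freeze the live ClickHouse data directory into a release
# artifact. Paths and naming are assumptions; `juicefs snapshot` is the
# command shown above.
import subprocess
from datetime import datetime, timezone

SRC = "/jfs/clickhouse/data"  # live ClickHouse data directory on JuiceFS

def create_release_snapshot() -> str:
    tag = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    dst = f"/jfs/clickhouse/releases/{tag}"
    subprocess.run(["juicefs", "snapshot", SRC, dst], check=True)
    return dst  # a frozen copy a replica ClickHouse server can serve from

if __name__ == "__main__":
    print("release artifact at", create_release_snapshot())
```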
04 Implementing End-to-End Testing with Database Cloning
With this mechanism, we can run end-to-end tests on ClickHouse clones and control their visibility based on the results, as sketched below.
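A rough sketch of that gate, reusing the snapshot helper above; the pytest invocation and the promote step are hypothetical stand-ins for whatever mechanism repoints end-user queries in a real deployment:

```python
# Sketch of the release gate: run the end-to-end suite against a clone
# and only make it visible to users if every test passes.
import os
import subprocess

def promote(snapshot_dir: str) -> None:
    # Hypothetical: flip a symlink the query layer reads from; in
    # practice this could be service discovery or a config update.
    subprocess.run(["ln", "-sfn", snapshot_dir, "/jfs/clickhouse/current"],
                   check=True)

def publish(snapshot_dir: str) -> None:
    # Tell the test suite which clone to inspect via the environment.
    env = dict(os.environ, SNAPSHOT_DIR=snapshot_dir)
    result = subprocess.run(["pytest", "tests/e2e"], env=env)
    if result.returncode == 0:
        promote(snapshot_dir)  # all tests green: users may now see it
    else:
        print(f"release {snapshot_dir} rejected; users keep the old data")
```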
It is now possible to develop, organize, and iterate on end-to-end data tests using a common unit testing framework (we use pytest). This approach allows us to encode infrastructure and business standards for data availability and reliability as data tests.
A typical test is a table size check, which helps prevent data problems caused by accidental truncation or temporary table corruption. Business standards can also be defined to protect data reporting and analytics from unintended changes in the data pipeline that could lead to data errors. For example, users can enforce uniqueness on a column or group of columns to avoid duplicates, a key factor when calculating marketing costs.
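Hedged examples of what such tests can look like in pytest; the table names, the row-count threshold, and the connection details are hypothetical, and the client fixture would point at the cloned replica rather than the live database:

```python
# Illustrative pytest data tests; table names, the threshold, and the
# connection target are hypothetical.
import pytest
from clickhouse_driver import Client

@pytest.fixture
def client():
    # Would connect to the cloned replica under test, not the live DB.
    return Client(host="localhost")

def test_orders_table_size(client):
    # Guards against accidental truncation or temporary table corruption.
    rows = client.execute("SELECT count() FROM orders")[0][0]
    assert rows > 1_000_000

def test_marketing_cost_keys_unique(client):
    # Duplicate (campaign_id, day) rows would inflate marketing costs.
    dups = client.execute(
        "SELECT campaign_id, day, count() AS c FROM marketing_costs "
        "GROUP BY campaign_id, day HAVING c > 1"
    )
    assert dups == []
```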
At Jerry, this architecture has played a critical role in recent quarters, effectively preventing virtually all P0-level data issues that could be exposed to end customers.
This approach is not limited to ClickHouse: if you run any kind of data lake or lakehouse on top of JuiceFS, the publishing mechanism described in this article may be even easier to adopt.
05 Conclusion
By bringing modern software development practices to the world of data management, we can significantly improve data quality, reliability and consistency. The combination of database cloning and end-to-end testing provides a powerful toolset for ensuring that customers always see the right data, just as they would expect to see the right functionality in a fully tested software release.
The following diagram illustrates the workflow of our database publishing and end-to-end testing process.
The creation of this architecture marks an important step in bridging the gap between software development and data management, opening up entirely new possibilities for innovation and quality assurance in the data domain.
I hope this has been of some help to you. If you have any other questions, feel free to join the JuiceFS Community and communicate with everyone.