
Volcano Engine Digital Intelligence Platform: A Technical Interpretation and Implementation Practice of High-Performance ChatBI

Published: 2024-09-19 11:18:34

Introduction: The development and maturity of large language model capabilities have given rise to a new generation of intelligent BI: ChatBI, which greatly simplifies data analysis, improves efficiency, and lowers the barrier to analysis by combining natural language processing (NLP) with large language models (LLMs). DataWind, the intelligent data insight product of Volcano Engine's Digital Intelligence Platform, recently launched ChatBI capabilities that provide intelligent repair, multi-grammar support, and more, delivering second-level responses and one-click generation. Users only need to describe their needs in text to generate indicators, quickly acquire, analyze, and compute data, and build charts, significantly lowering the barrier to data consumption. This article examines how the Volcano Engine Digital Intelligence Platform delivers ChatBI, covering its technical architecture, implementation path, and future outlook.

 

BI is a long-established term. The connotation of the "I", intelligence, has gradually shifted over time and with the development of the industry.

 

Initially, the ability to change filters and drill down through dimensions on data dashboards was considered a sign of intelligence.

 

As platforms were updated and iterated, more advanced and complex functions were delivered in increasingly user-friendly forms, so that people without a computer science or programming background could also experience the sense of intelligence brought by code, computing, and big data.

 

With the advent of the AI era, we have more expectations for intelligence. For example, can it "guess" what you're thinking and make smart recommendations? Or, when it sees data anomalies, can it help find out why?

 

 

Objectively speaking, Douyin Group's internal BI platform, under development since 2018, was a late starter. This allowed it to skip the early stages of BI platform evolution: from the very start of the project, its goal was to become a data analysis platform that could meet almost all data analysis needs within the company.

 

Within Douyin Group, the BI platform was built in the following stages:

 

The first stage was development around 2020. Substantial resources were invested in this phase to build functionality around outcome attribution, in the hope of helping users solve attribution problems.

 

Second, in April 2021, a low-code visual modeling tool was released. The motivation was that the team did not want users who discovered mid-analysis that the data had not been prepared to have to contact warehouse developers for a new copy. The visual modeling tool was developed so that users could process data themselves with simple drag-and-drop operations.

 

Third, the end of 2023. Faced with the rapid development of ChatGPT, the internal team believed it would have a disruptive, table-overturning impact on BI. After a period of trial and error, the product was released in April this year. Results so far have been good: several thousand people already use this internal product at high frequency.

 

 

Currently, DataWind, the intelligent data insight product under the Volcano Engine Digital Intelligence Platform, has built a fairly complete capability matrix covering data preparation and management, data analysis, and multi-device display, while also giving the product system strong operations and maintenance advantages.

 

 

To date, 80% of Douyin Group's internal employees have become monthly active users of the product, and its minimum single-day query volume on weekdays is generally above 2 million.

 

 

The Volcano Engine Digital Intelligence Platform's High-Performance Data Analytics Architecture

 

Data-driven decision-making is a concept deeply rooted within Douyin Group, and it fits well with the OKR philosophy the company practices. Since OKRs are usually measured by indicators, data analysis becomes essential whenever an indicator goes wrong and the cause must be investigated. During troubleshooting, a user typically holds several analysis ideas in mind at once, and if data analysis takes too long it interrupts that train of thought. To enable high-speed analysis, enterprise users therefore place great demands on the performance of the analysis platform.

 

While performance is important, BI platform developers tend to treat it as largely an engine problem: a weak engine means the BI layer cannot do much.

 

However, users do not care about the dataset, the data source, the data volume, or the complexity of the query; they care only about whether their analysis finishes quickly. Delivering performance improvements is therefore a major challenge.

 

 

To meet users' performance needs, development took a different approach from mainstream BI vendors. Although DataWind supports direct connections to most of the data sources, data engines, and databases in common use today, enterprise users rely more on "extraction": a very heavy ingestion pipeline built around the self-developed analytical database ByteHouse, into which almost all of the data the company needs to analyze is imported.

 

Because the way data is stored has a huge impact on query efficiency, the BI team uses a large number of ByteHouse clusters to meet users' needs for real-time ingestion, offline ingestion, and access to different table engines.

 

At the same time, fully and effectively utilizing the high-performance ByteHouse engine is just as important: which cluster a given dataset should be placed on, which table engine to select, and which sharding and sorting-key strategy to use. All of these decisions are critical to performance.
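Decisions like these can be captured in rule-based recommendations. The sketch below is purely illustrative: the thresholds, cluster names, and sizing rule are assumptions, not DataWind's actual policy.

```python
# Hypothetical rule-based recommendation: given basic dataset traits, suggest a
# cluster type, table engine, and sorting key. All names and thresholds here
# are illustrative assumptions.

def recommend_storage(row_count, has_realtime_updates, has_unique_key,
                      frequent_filter_cols):
    """Return a (cluster, engine, sorting_key, shards) suggestion for a dataset."""
    cluster = "realtime" if has_realtime_updates else "offline"
    if has_unique_key:
        engine = "HaUniqueMergeTree"   # supports fast upserts / mark-deletes
    else:
        engine = "HaMergeTree"         # append-only analytical workloads
    # Put the most frequently filtered columns first in the sorting key so
    # range scans can skip data granules.
    sorting_key = frequent_filter_cols[:3] or ["event_date"]
    shards = max(1, row_count // 500_000_000)  # illustrative sizing rule
    return {"cluster": cluster, "engine": engine,
            "sorting_key": sorting_key, "shards": shards}

print(recommend_storage(2_000_000_000, True, False, ["event_date", "app_id"]))
```

In practice such rules would be informed by observed query patterns rather than fixed constants, but the shape of the decision is the same.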

 

 

Let's start with a brief introduction to ByteHouse, which is optimized for performance in a number of areas compared to native ClickHouse.

First, there is the optimization of HaMergeTree, which is indispensable for most enterprise users. Native ClickHouse depends heavily on Apache ZooKeeper (ZK), especially for managing the part metadata of data files. As a result, when ClickHouse processes large-scale datasets, it tends to strain ZK's resources and inflate the number of znodes under management, affecting system performance and stability.

 

ByteHouse has made many optimizations to reduce its dependency on ZK. Currently, ByteHouse touches ZK only in a few scenarios, such as storing schema information and generating auto-incrementing sequences, which safeguards ByteHouse's overall performance and availability.

 

Next is HaUniqueMergeTree, ByteHouse's counterpart to native ClickHouse's ReplacingMergeTree. Where the native engine is relatively inefficient at read time, ByteHouse builds additional indexes to enable fast updates and mark-based deletions of records, improving performance.
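The upsert and mark-delete idea can be modeled in a few lines: an index maps each unique key to its latest visible version, and deletes are recorded as tombstones rather than rewrites. This is only a toy illustration of the concept, not ByteHouse's on-disk implementation.

```python
# Toy model of a unique-key table: append-only storage (like immutable data
# parts), a key index pointing at the latest visible version, and tombstones
# for mark-deletes. Illustrative only.

class UniqueTable:
    def __init__(self):
        self.rows = []        # append-only storage
        self.index = {}       # unique key -> position of the visible version
        self.tombstones = set()

    def upsert(self, key, value):
        self.rows.append((key, value))
        self.index[key] = len(self.rows) - 1  # fast update: repoint the index
        self.tombstones.discard(key)

    def delete(self, key):
        self.tombstones.add(key)  # mark-delete: no rewrite of stored rows

    def read(self, key):
        if key in self.tombstones or key not in self.index:
            return None
        return self.rows[self.index[key]][1]

t = UniqueTable()
t.upsert("u1", 10)
t.upsert("u1", 42)   # update is just an index repoint, old version kept on disk
t.delete("u2")
print(t.read("u1"))  # -> 42
```

A real engine would additionally merge away superseded versions in the background; the point here is that reads resolve through the index instead of scanning and collapsing duplicates.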

 

In addition, native ClickHouse's join capability has long been criticized for the pressure it places on coordinator nodes. ByteHouse implements a truly distributed join and layers substantial optimizer work on top of it. For example, when a large table joins a small table, ByteHouse uses the small table's data to decide which parts of the large table can be skipped entirely, avoiding unnecessary reads.
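The small-table-driven pruning idea can be sketched as a runtime filter: build a filter from the small side's join keys first, then skip big-table blocks that cannot possibly match. A real engine would use per-block min/max statistics or Bloom filters; a plain set stands in for that here.

```python
# Runtime-filter sketch: the small side's keys are collected once and used to
# prune whole blocks of the big side before the row-level join.

def join_with_runtime_filter(big_blocks, small_rows, key=0):
    small_keys = {r[key] for r in small_rows}   # built once, shared by readers
    out = []
    for block in big_blocks:
        # Skip reading/deserializing blocks with no possible match at all.
        if small_keys.isdisjoint(r[key] for r in block):
            continue
        out.extend(r for r in block if r[key] in small_keys)
    return out

big = [[(1, "a"), (2, "b")], [(7, "x"), (8, "y")]]   # two "data blocks"
small = [(2, "dim2"), (3, "dim3")]
print(join_with_runtime_filter(big, small))          # -> [(2, 'b')]
```

The second block above (keys 7 and 8) is never scanned row by row, which is exactly the saving the optimizer aims for when a large fact table joins a small dimension table.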

 

Overall, importing large amounts of data into ByteHouse does not mean creating many tables. Within Douyin Group, people prefer to import more comprehensive, more granular data into the ByteHouse cluster, to avoid discovering during analysis that some data does not exist or is not granular enough.

 

 

In scenarios with very fine-grained data, the team observed that most queries aggregate a handful of high-frequency metrics and dimensions over that detailed data. The natural solution follows: build Cubes or materialized views and add automated routing on top.

 

These operations are completely transparent to the user. On the engineering side, it is worth noting that the team did not use ByteHouse's built-in materialized views or projections, as practice tests during development found them to hurt cluster stability and overall performance. Instead, Cubes are built through Hadoop-based and Spark-based pipelines, with automatic routing layered on top. From the user's point of view, there is just one wide, detailed table.
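The routing step can be sketched simply: if a pre-built cube covers the query's dimensions and metrics, answer from the cube; otherwise fall back to the detail table. The cube definitions below are hypothetical.

```python
# Transparent cube routing sketch: pick the most aggregated cube that covers
# the query, else the detail table. Cube/table names are illustrative.

CUBES = [
    {"name": "daily_app_cube", "dims": {"date", "app_id"}, "metrics": {"pv", "uv"}},
    {"name": "daily_cube",     "dims": {"date"},           "metrics": {"pv"}},
]

def route(query_dims, query_metrics, detail_table="events_detail"):
    candidates = [c for c in CUBES
                  if query_dims <= c["dims"] and query_metrics <= c["metrics"]]
    if not candidates:
        return detail_table
    # Prefer the smallest covering cube: fewest dimensions means most
    # pre-aggregation and the least data to scan.
    return min(candidates, key=lambda c: len(c["dims"]))["name"]

print(route({"date"}, {"pv"}))                  # -> daily_cube
print(route({"date", "app_id"}, {"pv", "uv"}))  # -> daily_app_cube
print(route({"date", "country"}, {"pv"}))       # -> events_detail
```

Because routing happens behind the query interface, users keep writing queries against the one wide table while the cheapest covering source answers them.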

 

However, this approach has a side effect: since the product's lines of business pay per use, building a large number of aggregation tables per dataset inevitably drives up costs. This raises a new question: how can costs be reduced while still meeting users' speed requirements?

 

 

The solution is hot/cold tiering. The most frequently used data, such as the last 7 or 14 days, is stored on the standard ByteHouse cluster. Data further from the present goes to a ByteHouse cluster with separated storage and compute, where it can be queried more cheaply. Older data, such as data from the past year, is stored in Hive tables and can be queried using Python or Spark.
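Tier selection reduces to routing by data age. The sketch below uses the 7/14-day figures from the text as the hot window; the 90-day warm cutoff and the tier names are illustrative assumptions.

```python
# Date-based hot/cold routing sketch: recent partitions on the fast cluster,
# older ones on the storage/compute-separated cluster, the long tail in Hive
# behind Spark. Cutoffs and tier names are illustrative.

from datetime import date, timedelta

def pick_tier(query_date, today=None, hot_days=14, warm_days=90):
    today = today or date.today()
    age = (today - query_date).days
    if age <= hot_days:
        return "bytehouse_hot"    # local-disk cluster, fastest
    if age <= warm_days:
        return "bytehouse_cold"   # storage/compute separated, cheaper
    return "hive_via_spark"       # archival, slowest but always available

today = date(2024, 9, 19)
print(pick_tier(today - timedelta(days=3), today))    # -> bytehouse_hot
print(pick_tier(today - timedelta(days=40), today))   # -> bytehouse_cold
print(pick_tier(today - timedelta(days=400), today))  # -> hive_via_spark
```

A query spanning several tiers would be split by partition range and the results merged, but the per-partition decision is this simple.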

 

The reason for retaining Spark-based query support is that MPP databases are notoriously prone to latency spikes and instability. Spark serves as a fallback, ensuring that users can still query their data even in extreme cases.

 

 

In addition, performance optimization and cost management are issues worth examining. Here, development adopted a more "people's war" approach: since it is difficult for the platform operations team alone to monitor all performance indicators and individual database tables across the group's huge business, the team chose to embed this monitoring and optimization capability directly in the product.

 

From there, every business-line leader, and indeed every project manager, can see at a glance which datasets consume more resources and which are less cost-effective, that is, those that cost a lot but are queried infrequently. The strategy also helps them identify where multilevel aggregations could help and how to implement cost-control measures for more efficient resource allocation without compromising performance.

 

In this way, the pressure of management and optimization is spread out, and company-wide attention to and participation in resource efficiency is encouraged, ensuring that the entire group maintains optimal cost-effectiveness and service performance as it scales.

 

BI + AI for Intelligent Data Insights

During the construction phase, Douyin Group invested heavily in the intelligence side of its internal BI platform.

 

 

The intelligence work can be roughly divided into three parts.

 

The first is data development, which aims to help those who prepare data to be able to prepare more valuable data; the second is data analysis, which is expected to help users query anomaly indicators and attribute anomalies; and the third is data consumption, which improves the efficiency of extracting information through conversational Q&A.

 

 

The data development scenarios are relatively simple, and the team's work focuses on integrating a variety of AI algorithms into a low-code visual modeling tool, with a heavy use of predictive capabilities. The scenarios in which predictions are used are also very easy to understand.

 

Suppose a user has a table where one column represents values a few days in the future. If the values and history of the other columns are already known, the user will want to predict that column's new values with machine learning. Such needs are common. On the operator side, the product has integrated more than 40 operator types, of which feature-engineering operators and prediction operators are the two most frequently used.

 

 

Next is the data analysis scenario. Here, the development team wanted to help users detect and attribute anomalies more quickly, without making users configure static alert thresholds such as "warn when the metric falls by 10% or 5%".

 

The team wanted a more flexible early-warning system that could reflect an indicator's seasonality. Development therefore used the STL algorithm combined with indicator smoothing: residuals and historical data are used to compute the indicator's expected fluctuation range, and an alert fires when the value exceeds that range.
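A minimal stand-in for the seasonal-decomposition-plus-smoothing approach described above: remove a weekly seasonal component, estimate the residual spread from history, and alert when a new point falls outside the band. A production system would use a proper STL implementation (e.g. statsmodels) rather than this simplified per-weekday decomposition; the data and the 3-sigma width are illustrative.

```python
# Simplified seasonal anomaly detection: per-weekday seasonal means stand in
# for a full STL fit; the alert band is mean ± k*sigma of historical residuals.

def dynamic_band(history, period=7, k=3.0):
    n = len(history)
    seasonal = [0.0] * period
    for p in range(period):                      # per-weekday seasonal mean
        vals = [history[i] for i in range(p, n, period)]
        seasonal[p] = sum(vals) / len(vals)
    resid = [history[i] - seasonal[i % period] for i in range(n)]
    mu = sum(resid) / n
    sigma = (sum((r - mu) ** 2 for r in resid) / n) ** 0.5
    return seasonal, mu - k * sigma, mu + k * sigma

def is_anomaly(value, day_index, history, period=7, k=3.0):
    seasonal, lo, hi = dynamic_band(history, period, k)
    r = value - seasonal[day_index % period]     # deseasonalized residual
    return not (lo <= r <= hi)

# Two weeks of a metric with a weekend dip, then a new Monday observation.
hist = [100, 102, 98, 101, 99, 60, 58,
        103, 100, 97, 102, 100, 62, 59]
print(is_anomaly(101, 14, hist))  # normal Monday -> False
print(is_anomaly(40, 14, hist))   # collapsed Monday -> True
```

Because the band is derived from residuals, the weekend dip in the history does not trigger alerts, which is exactly the advantage over a fixed percentage threshold.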

 

 

When analyzed in terms of product form, attribution can be divided into the following categories.

 

Instant attribution is easy to understand as a product form: when users spot an anomaly, a single click triggers attribution. For dimension selection, the development team drew on a method based on the Gini coefficient, which is often used, for example by the United Nations, to compare wealth inequality. Translated into dimensions, each dimension can be treated as a "country": if a dimension's member values contribute to an indicator fairly evenly, with no significant differences, that dimension is probably not the main cause.

 

After a dimension is selected, the contribution rate of each dimension value is calculated through a series of methods. Instant attribution has strict latency requirements and returns results quickly, in roughly 15 seconds. In exchange, it analyzes relatively little: it does not analyze related indicators, performs only single-dimension analysis, and does no multi-dimensional combination analysis, so its overall functionality is simple.
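The Gini-based screening step can be sketched directly: for each candidate dimension, measure how unevenly its member values contribute to the metric change. A near-uniform split (low Gini) suggests the dimension is not the driver; a concentrated one (high Gini) makes it an attribution candidate. The contribution numbers below are illustrative.

```python
# Gini-based dimension screening sketch for instant attribution.

def gini(contributions):
    xs = sorted(abs(c) for c in contributions)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    # Standard formula over sorted values: G = 2*sum(i*x_i)/(n*total) - (n+1)/n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

# A metric drop of -50, split by two candidate dimensions (illustrative data).
by_os      = {"ios": -48, "android": -2}               # concentrated in iOS
by_channel = {"a": -12, "b": -13, "c": -12, "d": -13}  # spread evenly

scores = {"os": gini(by_os.values()), "channel": gini(by_channel.values())}
best = max(scores, key=scores.get)
print(best)  # -> os: the dimension with concentrated contributions
```

The per-value contribution rates within the winning dimension would then be computed and ranked, which is the second step the text describes.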

 

 

 

The other type of attribution, the insight report, is relatively feature-rich. It handles more complex requirements through an asynchronous notification model and can analyze different metrics and multiple combinations of dimensions.

 

 

Users configure attribution by selecting metrics, dimensions, and a general combination strategy; the generated insight report can then be viewed in the system or pushed to the appropriate IM tool.

 

 

There is also a type of attribution used more internally: the indicator analysis tree.

 

 

Within the group, when people align OKRs, indicators tend to form an indicator system: higher levels emphasize indicators such as GMV, while lower levels focus more on indicators such as PV, and this difference forms a tree-shaped indicator structure. When an indicator is in doubt, a solidified, dimension-based analysis is run. The general idea is to solidify the indicator-analysis process so that, when viewing OKRs or indicators, one can clearly see which boards, dimensions, and nodes are abnormal.

 

From an implementation standpoint, the attribution functionality is relatively simple overall; the difficulty lies mainly in product design and algorithm-related processing, which are also relatively tractable engineering problems. In addition, the scheduling of asynchronous insight reports and indicator analysis trees is designed to avoid affecting online queries and to minimize the online query resources it occupies.

 

 

While the team was actively working on metrics attribution, large models emerged. The team then invested substantial time and resources in exploring large-model capabilities. In terms of results, several thousand people within the group are now regular users of ABI's Copilot, so overall the results have been good.

 

 

From an exploratory perspective, the team built diverse scenarios, but the results on the ground were partly successful and partly not.

 

Looking back at the less successful scenarios, one thing they have in common is that the quality of the generated content was not high. In other words, while product interaction usually leaves plenty of room for improvement, content quality is often far harder to tune.

 

For example, users expect large models to help with attribution, telling them why the data looks wrong or which direction to investigate next.

 

The team initially shipped this capability because it performed beyond expectations. From a developer's perspective, it is easy to approach large-model features with low expectations and be impressed by how well they do. In reality, from the standpoint of solving users' problems, the quality of the generated content is often not high enough. Teams thus easily form overly optimistic expectations, which often leads to disappointing results or an awkward rollout.

 

The current success stories share some common features. First, if a feature targets a task where users would otherwise have to search for information while solving a problem, combining it with ChatGPT tends to produce a better solution.

 

In code-development scenarios, the team's product features have landed well internally, including the SQL query features and the higher-order Notebook practices supported by ABI. These features replace the pattern of users searching the web, reading piles of Stack Overflow posts, and then distilling code-editing ideas. In the large-model-enabled code editor, DataWind provides functions such as interpreting SQL, optimizing SQL, generating comments, and repairing reported errors.

 

Another success is the first step of data preparation: entering source data. When entering indicators, a lot of translation work is often required just to name them.

 

In a multilingual environment, indicators also need foreign-language names. In the past, this usually meant consulting public references to find the correct English terms. Large models perform well on this kind of work and can save users considerable effort.

 

In exploratory dashboard analysis and interpretation, large models are a big help in polishing write-ups: a user may need to send data results to a superior quickly, and writing the interpretation alone can be relatively difficult. In this scenario, DataWind not only interprets the data behind a chart but also suggests follow-up questions, which are themselves generated by the model.

 

 

When a user clicks on a question, the system also describes the next action it will take, which answers a further question: how to let users know about the model's subsequent behavior. This matters because, at the current stage of large-model capability, answers are not 100% accurate.

 

In serious settings, what is needed is a precise number, and users very much want to understand the statistical caliber behind it and how the conclusion was reached. An extremely important principle in production scenarios is therefore to let users know clearly what the large model did, or what part it did, how the next request will be handled, and how the data was actually obtained. Otherwise, even if the model reaches 95%+ accuracy, a data product built on it will still struggle to land.

 

 

Moving on to the SQL query scenario, the product can interpret and optimize SQL. In the editor, natural language can be used to generate the appropriate SQL and some of the code associated with notebooks.

 

 

On implementation: the first issue is compliance and content auditing. Douyin Group's internal practice initially adopted GPT models and made various attempts, including tuning GPT-3, and chose GPT-4 after comparison, while also trying to integrate the company's self-developed models. This makes content auditing harder. For example, if a table's schema is considered less sensitive while its data is sensitive, how should dimension values be handled? Is vectorized matching needed? This involves a series of technical and engineering issues.

 

Then there is fine-tuning: whether to use it, and when. Even now, the team has not completely abandoned the fine-tuning route; it understands fine-tuning mostly as trading space for time. Sometimes, when a user describes a private-domain problem in great detail, the prompt becomes too long: it may not fit in the context window, and it may also hurt the overall efficiency of prompt use. In such cases, partial fine-tuning can reduce the amount of prompting required.

 

Summary and outlook

 

To briefly summarize a few key points of the future outlook:

First, enterprise-wide BI is becoming a new trend. It was once common for individual business units to purchase their own BI while company-wide, employee-wide BI was deprioritized; today, the latter is increasingly important.

 

Second, metrics governance as well as AI capabilities are also critical components.

 

Third, the team believes data consumption can drive data construction. The overall idea is to make the upper layer of data consumption flourish; as it prospers, it continually pushes new demands down to the lower layers of data construction, such as ETL, data warehouses, and data lakes, thereby driving the build-out of the underlying infrastructure.

 
