StarRocks Materialized View Refresh Process and Principle

A while back, I added a newcharacterizationThat was also my first experience with StarRocks, and since I was completely unfamiliar with the database, I had to learn a lot about the concepts from scratch.

In order to be able to add this feature smoothly (you can see the details later), I needed to string together the whole process of materializing views, hence this post.

Before we begin a brief overview of the basic concepts of materialized views:

Simply put, views are similar in concept to traditional databases such as MySQL, and are used to resolve large amounts of performance-consuming SQL, which can be queried ahead of time and then placed in a separate table, so that when queried again the performance consumption will be relatively low.

refresh condition

In order to ensure real-time view data, you also need to be able to refresh the data in the view in time when the data changes, currently there are several places that will trigger a view refresh:

To refresh the view manually, use theREFRESH MATERIALIZED VIEW order_mv; statement
Sets the view to the active state:ALTER MATERIALIZED VIEW order_mv ACTIVE;
A refresh is triggered when the base table data changes.
Triggers a refresh when truncate the base table:truncate table trunc_db.t1;
Triggered when drop partition:ALTER TABLE <tbl_name> DROP PARTITION(S) p0, p1 [, ...];

There was a bug in the current version of truncate table and drop partition that did not trigger an auto-refresh when the base table and materialized view were not in the same database, which has now been fixed.

/StarRocks/starrocks/pull/52618
/StarRocks/starrocks/pull/52295

refresh process

As shown in the figure, when a refresh is triggered the main thing is that you need to calculate the partitions that need to be refreshed.

The first time a refresh is triggered it is not going to take a period (e.g. a time range) with it, and then based on the period calculated by the filter, by default only the first period will be used (which we can do by using thepartition_refresh_number parameter to adjust the number of partitions for a single refresh).

Then if there are remaining cycles, the refresh task will be retriggered once for those cycles (taking with it the data from the remaining cycles just now), and so on for recursive execution.

You will see the returned partition data through the log.

New Optimization Parameters

We ran into a scenario when using materialized views:

CREATE TABLE IF NOT EXISTS test.par_tbl1
(
    datekey DATETIME,
    k1      INT,
    item_id STRING,
    v2      INT
)PRIMARY KEY (`datekey`,`k1`)
 PARTITION BY date_trunc('day', `datekey`);

 CREATE TABLE IF NOT EXISTS test.par_tbl2
(
    datekey DATETIME,
    k1      INT,
    item_id STRING,
    v2      INT
)PRIMARY KEY (`datekey`,`k1`)
 PARTITION BY date_trunc('day', `datekey`);

 CREATE TABLE IF NOT EXISTS test.par_tbl3
(
    datekey DATETIME,
    k1      INT,
    item_id STRING,
    v2      INT
)
 PRIMARY KEY (`datekey`,`k1`);

But we have three base tables, where 1 and 2 are both partitioned, but 3 is non-partitioned.

At this point a new materialized view is created based on them:

CREATE
MATERIALIZED VIEW test.mv_test
REFRESH ASYNC
PARTITION BY a_time
PROPERTIES (
"excluded_trigger_tables" = "par_tbl3"
)
AS
select date_trunc("day", ) as a_time, date_trunc("day", ) as b_time,date_trunc("day", ) as c_time
from test.par_tbl1 a
         left join test.par_tbl2 b on  =  and a.k1 = b.k1
         left join test.par_tbl3 c on a.k1 = c.k1;

When I update the data in both partitioned and non-partitioned tables:

UPDATE `par_tbl1` SET `v2` = 2 WHERE `datekey` = '2024-08-05 01:00:00' AND `k1` = 3;
UPDATE `par_tbl3` SET `item_id` = '3' WHERE `datekey` = '2024-10-01 01:00:00' AND `k1` = 3;

The expected result is that onlypar_tbl1 Data modified in the table is synchronized to the view ("excluded_trigger_tables" = "par_tbl3"has been set to not trigger a view refresh), but the reality is that thepar_tbl1 cap (a poem)par_tbl2 All data in the table is refreshed to the materialized view.

We can use this SQL to query the running status of the Brushless View task:

SELECT * FROM information_schema.task_runs order by create_time desc;

This results in resource loss, and if the data in these two base tables is very large, this refresh will be very time consuming.

So our requirement is to refresh only the modified data even in such scenarios.

So we added a new parameter to the new materialized view:

CREATE
MATERIALIZED VIEW test.mv_test
REFRESH ASYNC
PARTITION BY a_time
PROPERTIES (
"excluded_trigger_tables" = "par_tbl3",
"excluded_refresh_tables"="par_tbl3"
)
AS
select date_trunc("day", ) as a_time, date_trunc("day", ) as b_time,date_trunc("day", ) as c_time
from test.par_tbl1 a
         left join test.par_tbl2 b on  =  and a.k1 = b.k1
         left join test.par_tbl3 c on a.k1 = c.k1;

This way, when the data is refreshed in theexcluded_refresh_tables Whether the configured table has any data changes, if so then you can't refresh all of the currently calculated partitions (the full amount of data from the 1,2 tables), but continue to find an intersection and only calculate the data that has changed in the base table.

This avoids full refresh of data for par_tbl1, par_tbl2 and only refreshes the modified data.

Such a scenario is usually one in which there is a dictionary table in the associated base table, and usually the amount of data is not large, so there is no need for partitioning.

This allows you to use these two parameters when creating a materialized viewexcluded_trigger_tables，excluded_refresh_tables Ruled it out.

The overall refresh logic isn't complicated, it's mainly a few different refresh portals and the logic for calculating partitions during the refresh process.

Reference Links:

/zh/docs/using_starrocks/async_mv/Materialized_view/#understanding-starrocks-materialized_view
/zh/docs/using_starrocks/async_mv/use_cases/data_modeling_with_materialized_views/#Partition modeling
/StarRocks/starrocks/pull/52295
/StarRocks/starrocks/pull/52618