ClickHouse Materialized View Learning Summary

2024-12-10 03:43:12

Materialized view

Materialized View Source Table - Base Data Source

First, create the source table. Because our goal is to report aggregated data rather than individual records, we can parse the incoming rows, pass the information to the materialized view, and discard the actual incoming data. This meets our goals and saves storage space, so we'll use the Null table engine.

CREATE DATABASE IF NOT EXISTS analytics;

CREATE TABLE analytics.hourly_data
(
    `domain_name` String,
    `event_time` DateTime,
    `count_views` UInt64
)
ENGINE = Null;

Note: Materialized views can be created on Null tables. Data written to the table therefore still reaches the view, but the original raw data is discarded.

Monthly summary tables and materialized views

For the first materialized view, you need to create the target table (in this case analytics.monthly_aggregated_data). The example stores the sum of the views by month and domain name.

CREATE TABLE analytics.monthly_aggregated_data
(
    `domain_name` String,
    `month` Date,
    `sumCountViews` AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (domain_name, month);

The materialized view that forwards data to the target table is as follows:

CREATE MATERIALIZED VIEW analytics.monthly_aggregated_data_mv
TO analytics.monthly_aggregated_data
AS
SELECT
    toDate(toStartOfMonth(event_time)) AS month,
    domain_name,
    sumState(count_views) AS sumCountViews
FROM analytics.hourly_data
GROUP BY domain_name, month;

Annual summary tables and materialized views

Now, create a second materialized view that reads from the previous target table, monthly_aggregated_data.
First, create a new target table that will store the sum of views aggregated per domain per year.

CREATE TABLE analytics.year_aggregated_data
(
    `domain_name` String,
    `year` UInt16,
    `sumCountViews` UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (domain_name, year);

The materialized view is then created, and this step defines the cascade. The FROM clause uses the monthly_aggregated_data table, which means the data flow will be:
1. Data arrives in the hourly_data table.
2. ClickHouse forwards the received data to the target table of the first materialized view, monthly_aggregated_data.
3. Finally, the data received in step 2 is forwarded to year_aggregated_data.

CREATE MATERIALIZED VIEW analytics.year_aggregated_data_mv
TO analytics.year_aggregated_data
AS
SELECT
    toYear(toStartOfYear(month)) AS year,
    domain_name,
    sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data
GROUP BY domain_name, year;

Note:

A common misconception when working with materialized views is that data is read from the table. That is not how materialized views work: the data forwarded is the inserted data block, not the final result in the table.

Imagine that, in this example, the engine used for monthly_aggregated_data were CollapsingMergeTree. The data forwarded to the second materialized view, year_aggregated_data_mv, would not be the final result of the collapsed table; it would be the block of data with the fields defined in the SELECT ... GROUP BY.

If you are using CollapsingMergeTree, ReplacingMergeTree, or even SummingMergeTree and plan to create cascading materialized views, you need to understand the limitations described here.

Sample data

Now it's time to test our cascading materialized views by inserting some data:

INSERT INTO analytics.hourly_data (domain_name, event_time, count_views)
VALUES ('', '2019-01-01 10:00:00', 1),
       ('', '2019-02-02 00:00:00', 2),
       ('', '2019-02-01 00:00:00', 3),
       ('', '2020-01-01 00:00:00', 6);

Querying analytics.hourly_data will not return any records, because the table engine is Null, but the data has been processed.

SELECT * FROM analytics.hourly_data;

Output:

domain_name|event_time|count_views|
-----------+----------+-----------+

Results

If you query the sumCountViews field of the target table, you will see the value in its binary representation (in some terminals), because it is stored not as a number but as an AggregateFunction type. To get the final result of the aggregation, use the -Merge suffix.

With the following query, the sumCountViews values are not displayed properly:

SELECT sumCountViews FROM analytics.monthly_aggregated_data

Output:

sumCountViews|
-------------+
             |
             |
             |

Use the -Merge suffix to get the sumCountViews value:

SELECT sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data;

Output:

sumCountViews|
-------------+
           12|

In the AggregatingMergeTree table, the AggregateFunction is defined as sum, so we use sumMerge. If the AggregateFunction used avg, you would use avgMerge, and so on.

SELECT month, domain_name, sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data
GROUP BY domain_name, month

Output:

month     |domain_name   |sumCountViews|
----------+--------------+-------------+
2020-01-01||            6|
2019-01-01||            1|
2019-02-01||            5|

Now we can check whether the materialized views meet the goals we defined.

The data is now stored in the target table monthly_aggregated_data, where the data for each domain can be aggregated by month:

SELECT month, domain_name, sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data
GROUP BY domain_name, month;

Output:

month     |domain_name   |sumCountViews|
----------+--------------+-------------+
2020-01-01||            6|
2019-01-01||            1|
2019-02-01||            5|

Aggregate data for each domain on a yearly basis.

SELECT year, domain_name, sum(sumCountViews)
FROM analytics.year_aggregated_data
GROUP BY domain_name, year;

Output:

year|domain_name   |sum(sumCountViews)|
----+--------------+------------------+
2019||                 6|
2020||                 6|
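The note above about forwarded blocks can be observed with two separate inserts. The following is a minimal sketch, assuming the tables and views created earlier; the domain 'example.com' and the 2021 dates are placeholders that do not appear in the original data. Each INSERT is pushed through the cascade as its own block, so year_aggregated_data may hold one partial row per insert until a background merge sums them.

```sql
-- Two separate inserts: each one travels through both materialized views
-- as an independent block. 'example.com' is a hypothetical domain.
INSERT INTO analytics.hourly_data (domain_name, event_time, count_views)
VALUES ('example.com', '2021-03-01 00:00:00', 4);

INSERT INTO analytics.hourly_data (domain_name, event_time, count_views)
VALUES ('example.com', '2021-03-15 00:00:00', 5);

-- Reading the raw rows: before a merge this may return two rows for 2021
-- (partial sums 4 and 5), after a merge a single row.
SELECT year, domain_name, sumCountViews
FROM analytics.year_aggregated_data
WHERE domain_name = 'example.com';

-- Aggregating at query time gives the same total (4 + 5 = 9)
-- whether or not the parts have been merged yet.
SELECT year, domain_name, sum(sumCountViews) AS total_views
FROM analytics.year_aggregated_data
WHERE domain_name = 'example.com'
GROUP BY domain_name, year;
```

This is why the yearly query above uses sum(...) with GROUP BY instead of selecting sumCountViews directly.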

Combine multiple source tables to create a single target table

Materialized views can also be used to combine multiple source tables into a single target table. This is useful for creating a materialized view that behaves like UNION ALL logic.

First, create two source tables representing different sets of metrics:

CREATE TABLE 
(
    `event_time` DateTime,
    `domain_name` String
) ENGINE = MergeTree ORDER BY (domain_name, event_time);

CREATE TABLE 
(
    `event_time` DateTime,
    `domain_name` String
) ENGINE = MergeTree ORDER BY (domain_name, event_time);

Then create the target table with the combined set of metrics:

CREATE TABLE analytics.daily_overview
(
    `on_date` Date,
    `domain_name` String,
    `impressions` SimpleAggregateFunction(sum, UInt64),
    `clicks` SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree ORDER BY (on_date, domain_name);

Create two materialized views that point to the same target table. There is no need to explicitly include the missing columns:

CREATE MATERIALIZED VIEW analytics.daily_impressions_mv
TO analytics.daily_overview
AS
SELECT
    toDate(event_time) AS on_date,
    domain_name,
    count() AS impressions,
    0 clicks --<<<--- if this column is omitted, clicks defaults to 0
FROM
    
GROUP BY toDate(event_time) AS on_date, domain_name;

CREATE MATERIALIZED VIEW analytics.daily_clicks_mv
TO analytics.daily_overview
AS
SELECT
    toDate(event_time) AS on_date,
    domain_name,
    count() AS clicks,
    0 impressions --<<<--- if this column is omitted, impressions defaults to 0
FROM
    
GROUP BY toDate(event_time) AS on_date, domain_name;

Now, when values are inserted, they are aggregated into the corresponding columns of the target table:

INSERT INTO  (domain_name, event_time)
VALUES ('', '2019-01-01 00:00:00'),
       ('', '2019-01-01 12:00:00'),
       ('', '2019-02-01 00:00:00'),
       ('', '2019-03-01 00:00:00')
;

INSERT INTO  (domain_name, event_time)
VALUES ('', '2019-01-01 00:00:00'),
       ('', '2019-01-01 12:00:00'),
       ('', '2019-03-01 00:00:00')
;

Query the target table:

SELECT
    on_date,
    domain_name,
    sum(impressions) AS impressions,
    sum(clicks) AS clicks
FROM
    analytics.daily_overview
GROUP BY
    on_date,
    domain_name
;

Output:

on_date   |domain_name   |impressions|clicks|
----------+--------------+-----------+------+
2019-01-01||          2|     2|
2019-03-01||          1|     1|
2019-02-01||          1|     0|
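The query above uses sum(...) with GROUP BY because the two materialized views write separate rows for the same (on_date, domain_name) key, and those rows are only combined when ClickHouse merges the parts. A minimal sketch of this, assuming the analytics.daily_overview table defined above:

```sql
-- SimpleAggregateFunction columns hold plain values, so they can be read
-- without a -Merge function. Before a merge, the same key may appear twice:
-- once with impressions filled in and once with clicks filled in.
SELECT on_date, domain_name, impressions, clicks
FROM analytics.daily_overview
ORDER BY on_date, domain_name;

-- Force a merge for demonstration purposes only; on large tables this is
-- an expensive operation and is normally left to background merges.
OPTIMIZE TABLE analytics.daily_overview FINAL;

-- After the merge, rows sharing a sorting key have been summed into one row.
SELECT on_date, domain_name, impressions, clicks
FROM analytics.daily_overview
ORDER BY on_date, domain_name;
```

Queries that must be correct at any moment should keep the sum(...) and GROUP BY form, since it does not depend on merge timing.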

Reference Links

/docs/en/guides/developer/cascading-materialized-views

AggregateFunction

Aggregate functions have an implementation-defined intermediate state that can be serialized to the AggregateFunction(...) data type and stored in a table, usually by means of a materialized view. A common way to produce an aggregate function state is to call the aggregate function with the -State suffix. To get the final result of the aggregation later, you must use the same aggregate function with the -Merge suffix.

AggregateFunction(name, types_of_arguments...) - parametric data type.

Parameters:

  • The name of the aggregate function. If the function is parametric, specify its parameters as well.
  • The types of the aggregate function arguments.

Example

CREATE TABLE testdb.aggregated_test_tb
(   
    `__name__` String, 
    `count` AggregateFunction(count),
    `avg_val` AggregateFunction(avg, Float64),
    `max_val` AggregateFunction(max, Float64),
    `time_max` AggregateFunction(argMax, DateTime, Float64),
    `mid_val` AggregateFunction(quantiles(0.5, 0.9), Float64) 
) ENGINE = AggregatingMergeTree() 
ORDER BY (__name__);

Note: If the above SQL omits the ORDER BY clause, execution reports an error similar to the following:

SQL Error [42]: ClickHouse exception, code: 42, host: 192.168.88.131, port: 8123; Code: 42, () = DB::Exception: Storage AggregatingMergeTree requires 3 to 4 parameters:
name of column with date,
[sampling element of primary key],
primary key expression,
index granularity

Create a data source table and insert test data

CREATE TABLE testdb.test_tb 
(
    `__name__` String, 
    `create_time` DateTime, 
    `val` Float64
) ENGINE = MergeTree() 
PARTITION BY toStartOfWeek(create_time) 
ORDER BY (__name__, create_time);

INSERT INTO testdb.test_tb(`__name__`, `create_time`, `val`) VALUES
('xiaoxiao', now(), 80.5),
('xiaolin', addSeconds(now(), 10), 89.5),
('xiaohong', addSeconds(now(), 20), 90.5),
('lisi', addSeconds(now(), 30), 79.5),
('zhangshang', addSeconds(now(), 40), 60),
('wangwu', addSeconds(now(), 50), 65);

Insert data

Use INSERT SELECT with aggregate functions that carry the -State suffix to insert data. For example, to obtain the mean of a target column, i.e. avg(target_column), the aggregate function used when inserting is avgState. A *State aggregate function returns the state rather than the final value; in other words, it returns a value of type AggregateFunction.

INSERT INTO testdb.aggregated_test_tb (`__name__`, `count`, `avg_val`, `max_val`, `time_max`, `mid_val`)
SELECT `__name__`,
countState() AS count,
avgState(val) AS avg_val, 
maxState(val) AS max_val,
argMaxState(create_time, val) AS time_max,
quantilesState(0.5, 0.9)(val) AS `mid_val`
FROM testdb.test_tb
GROUP BY `__name__`, toStartOfMinute(create_time);

Note: Each field in the SELECT statement must either be passed to an aggregate function (like the val field above) or be left unchanged (like the __name__ field above). A field that is left unchanged must appear in the GROUP BY clause; otherwise an error similar to the following is reported:

SQL Error [215]: ClickHouse exception, code: 215, host: 192.168.88.131, port: 8123; Code: 215, () = DB::Exception: Column `__name__` is not under aggregate function and not in GROUP BY (version 20.3.5.21 (official build))

Query Data

When querying data from an AggregatingMergeTree table, use a GROUP BY clause and the same aggregate functions as when inserting data, but with the -Merge suffix. For example, if the aggregate function used when inserting data is avgState, the one used for the query is avgMerge.

An aggregate function with the -Merge suffix accepts a set of states, combines them, and returns the result of the complete data aggregation.

For example, the following two queries return the same result:

SELECT `__name__`, 
create_time,
avgMerge(avg_val) AS avg_val, 
maxMerge(max_val) AS max_val
FROM ( 
SELECT `__name__`, 
toStartOfMinute(create_time) AS create_time,
avgState(val) AS avg_val, 
maxState(val) AS max_val
FROM testdb.test_tb
GROUP BY `__name__`, create_time
)
GROUP BY `__name__`, create_time;

SELECT `__name__`, 
toStartOfMinute(create_time) AS create_time,
avg(val) AS avg_val, 
max(val) AS max_val
FROM testdb.test_tb
GROUP BY `__name__`, create_time;

Example:

SELECT `__name__`, 
countMerge(`count`), 
avgMerge(`avg_val`), 
maxMerge(`max_val`),
argMaxMerge(`time_max`),
quantilesMerge(0.5, 0.9)(`mid_val`)
FROM testdb.aggregated_test_tb
GROUP BY `__name__`;
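The relationship between -State and -Merge can also be tried without creating any table. The following is a small sketch that uses the built-in numbers() table function as a stand-in data source (it is not part of the example above); the column aliases are illustrative.

```sql
-- A -State function returns an intermediate state, not a number.
-- The reported type is AggregateFunction(avg, Float64).
SELECT toTypeName(avgState(toFloat64(number))) AS state_type
FROM numbers(10);

-- Build two partial states (even and odd numbers from 0 to 9),
-- then combine them with avgMerge.
SELECT avgMerge(partial_state) AS merged_avg
FROM
(
    SELECT avgState(toFloat64(number)) AS partial_state
    FROM numbers(10)
    GROUP BY number % 2
);

-- The same value computed directly, for comparison.
SELECT avg(toFloat64(number)) AS direct_avg
FROM numbers(10);
```

Both merged_avg and direct_avg evaluate to 4.5: each state carries the running sum and count, so merging the states is not the same as averaging two averages, and the result stays exact even when the groups differ in size.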

Reference Links

/docs/en/sql-reference/data-types/aggregatefunction

AggregatingMergeTree

The engine inherits from MergeTree and changes the logic of data part merging. ClickHouse replaces all rows that have the same primary key (or, more precisely, the same sorting key) with a single row (within one data part) that stores a combination of the aggregate function states.

Note: A data part is the basic unit in which ClickHouse stores MergeTree table data.

You can use AggregatingMergeTree tables for incremental data aggregation, including for aggregated materialized views.

The engine processes all columns of the following types:

  • AggregateFunction

  • SimpleAggregateFunction

AggregatingMergeTree is appropriate when it reduces the number of rows by orders of magnitude.

Creating a table

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = AggregatingMergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[TTL expr]
[SETTINGS name=value, ...]

For a description of the request parameters, see the request description.

Query clauses

Creating an AggregatingMergeTree table requires the same clauses as creating a MergeTree table.

Queries and inserts

To insert data, use an INSERT SELECT query with aggregate -State functions. When querying data from an AggregatingMergeTree table, use a GROUP BY clause and the same aggregate functions as when inserting data, but with the -Merge suffix.

In the results of a SELECT query, values of the AggregateFunction type have an implementation-specific binary representation in all ClickHouse output formats. For example, if you dump data in TabSeparated format with a SELECT query, that dump can be loaded back with an INSERT query.

An example of a materialized view

CREATE DATABASE testdb;

Create the table that holds the raw data:

CREATE TABLE 
(
    StartDate DateTime64, 
    CounterID UInt64,
    Sign Nullable(Int32),
    UserID Nullable(Int32)
) ENGINE = MergeTree 
ORDER BY (StartDate, CounterID);

Note: If StartDate DateTime64 above is written as StartDate DateTime64 NOT NULL, the statement fails with the following error:

Expected one of: CODEC, ALIAS, TTL, ClosingRoundBracket, Comma, DEFAULT, MATERIALIZED, COMMENT, token (version 20.3.5.21 (official build))

Next, create an AggregatingMergeTree table that stores AggregateFunction states, used to track the total number of visits and the number of unique users.

Create the AggregatingMergeTree target table for the materialized view that watches the source table, using the AggregateFunction type:

CREATE TABLE testdb.agg_visits (
    StartDate DateTime64,
    CounterID UInt64,
    Visits AggregateFunction(sum, Nullable(Int32)),
    Users AggregateFunction(uniq, Nullable(Int32))
)
ENGINE = AggregatingMergeTree() ORDER BY (StartDate, CounterID);

Note: With the Nullable(Int32) argument types above, pushing data to the materialized view reports an error like the following:

SQL Error [70]: ClickHouse exception, code: 70, host: 192.168.88.131, port: 8123; Code: 70, () = DB::Exception: Conversion from AggregateFunction(sum, Int32) to AggregateFunction(sum, Nullable(Int32)) is not supported: while converting source column Visits to destination column Visits: while pushing to view testdb.visits_mv (version 20.3.5.21 (official build))

Declaring the argument types as Int32 avoids the error:

CREATE TABLE testdb.agg_visits (
    StartDate DateTime64,
    CounterID UInt64,
    Visits AggregateFunction(sum, Int32),
    Users AggregateFunction(uniq, Int32)
)
ENGINE = AggregatingMergeTree() ORDER BY (StartDate, CounterID);

Create a materialized view that populates testdb.agg_visits from the source table:

CREATE MATERIALIZED VIEW testdb.visits_mv TO testdb.agg_visits
AS SELECT
    StartDate,
    CounterID,
    sumState(Sign) AS Visits,
    uniqState(UserID) AS Users
FROM 
GROUP BY StartDate, CounterID;

Insert data into the source table:

INSERT INTO  (StartDate, CounterID, Sign, UserID)
 VALUES (1667446031000, 1, 3, 4), (1667446031000, 1, 6, 3);

The data is inserted into both the source table and testdb.agg_visits.

To get the aggregated data, run a SELECT ... GROUP BY ... statement against testdb.agg_visits, the target table of the materialized view testdb.visits_mv:

SELECT
    StartDate,
    sumMerge(Visits) AS Visits,
    uniqMerge(Users) AS Users
FROM testdb.agg_visits
GROUP BY StartDate
ORDER BY StartDate;

Output:

StartDate          |Visits|Users|
-------------------+------+-----+
2022-11-03 11:27:11|     9|    2|

Add another two records to the source table, this time using a different timestamp for one of them:

INSERT INTO  (StartDate, CounterID, Sign, UserID)
 VALUES (1669446031000, 2, 5, 10), (1667446031000, 3, 7, 5);

The query is repeated and the output is as follows:

StartDate          |Visits|Users|
-------------------+------+-----+
2022-11-03 11:27:11|    16|    3|
2022-11-26 15:00:31|     5|    1|
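The query above groups by StartDate only, so rows for different counters are merged together. As a sketch under the same assumptions (the testdb.agg_visits table and the four rows inserted above), grouping by the full sorting key shows what is stored per counter:

```sql
-- Group by both sorting-key columns to see one result row per
-- (StartDate, CounterID) pair instead of one row per StartDate.
SELECT
    StartDate,
    CounterID,
    sumMerge(Visits) AS Visits,
    uniqMerge(Users) AS Users
FROM testdb.agg_visits
GROUP BY StartDate, CounterID
ORDER BY StartDate, CounterID;
```

With the data inserted above, the first timestamp splits into CounterID 1 (Visits 9, Users 2) and CounterID 3 (Visits 7, Users 1), and the later timestamp has CounterID 2 (Visits 5, Users 1). Summing the first two rows gives the Visits value of 16 shown in the previous output.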

Reference Links

/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree