Location>code7788 >text

SQL] Lag/Rank/Over Window Functions Revealed, A Journey of Data Analysis

Popularity:797 ℃/2024-07-22 08:06:37

The summer sunshine in July was like fire, but Xiaoyue's heart was cool and passionate. In her new position in the data analysis project team, she seems to have found her new stage, where she can deeply integrate her profound back-end technical strength with the essence of data analysis. Every day, she is immersed in the analysis of business needs and data-driven decision-making, working closely with business, product and other departments to unravel the secrets behind the data and contribute to the development of the enterprise.

Just when she was fully engaged in learning and exploring the new environment, a message from the WeChat workgroup was like a fresh breeze in summer, bringing her new opportunities. Yichen, as the representative of the A-side in a project's WeChat group, had released an urgent data analysis task through the group, and he hadn't realized at the time that it was Xiaoyue who would be assigned to handle the task. Although their chance encounter at the café last week was just a quick glance, it seems to have set the stage for this collaboration.

In the face of this sudden challenge, Xiaoyue did not have the slightest panic and hesitation. She knows that every challenge in the work is a valuable opportunity for self-improvement and growth, whether it is a familiar partner or an unfamiliar collaborator. So, she quickly adjusted her mindset, carefully read the task requirements, and replied to Yidian with a professional attitude and rigorous spirit.


Xiaoyue was given the task of analyzing the sales performance of different departments within each company based on raw data tables and calculated the ranking of each department's sales performance, the gap to the next highest sales performance, the gap to the highest sales performance in the department, and the difference to the average of all departmental sales.

--Build Test Sheets on Demand
create table temp1 (
comp_name varchar(100),
dept_name varchar(100),
sales numeric(18,4)
)
--Write test data:
INSERT INTO temp1  VALUES ('CompanyA', 'Dept 1', 100);
INSERT INTO temp1  VALUES ('CompanyA', 'Dept 3', 80);
INSERT INTO temp1  VALUES ('CompanyA', 'Dept 4', 80);
INSERT INTO temp1  VALUES ('CompanyA', 'Dept 2', 50);
INSERT INTO temp1  VALUES ('CompanyB', 'Dept 4', 120);
INSERT INTO temp1  VALUES ('CompanyB', 'Dept 3', 90);
INSERT INTO temp1  VALUES ('CompanyB', 'Dept 2', 80);
INSERT INTO temp1  VALUES ('CompanyB', 'Dept 1', 70);

Faced with the need to analyze the sales performance of different departments within the company and calculate the rankings, gaps and variances, Xiaoyue's first reaction was to use a temporary table, option one:

-- Create a temporary table to store the maximum and average sales for each company
CREATE TABLE temp_max_avg AS
SELECT comp_name,
       MAX(sales) AS max_sales,
       AVG(sales) AS avg_sales
FROM temp1
GROUP BY comp_name;

-- Create a temporary table to store rankings and previous sales
CREATE TABLE temp_rank_lag AS
SELECT comp_name,
       dept_name,
       sales,
       ROW_NUMBER() OVER (PARTITION BY comp_name ORDER BY sales DESC) AS rank,
       LAG(sales) OVER (PARTITION BY comp_name ORDER BY sales DESC) AS prev_sales
FROM temp1;

-- Create the final temporary table to store all the results
CREATE TABLE temp_final AS
SELECT t1.comp_name,
       t1.dept_name,
       ,
       ,
       COALESCE(t2.prev_sales - , 0) AS next_behind,
       tma.max_sales -  AS total_behind,
        - tma.avg_sales AS diff_from_avg
FROM temp1 t1
JOIN temp_rank_lag t2 ON t1.comp_name = t2.comp_name AND  = 
JOIN temp_max_avg tma ON t1.comp_name = tma.comp_name;

-- Check the final result
SELECT * FROM temp_final
ORDER BY comp_name, rank;

This SQL query is used to analyze the sales performance of different departments within each company and calculates the rank of each department's sales performance, the gap to the next highest sales performance, the gap to the highest sales performance in the department, and the difference to the sales average. Below is an explanation of each field:

  • comp_name: Company name
  • dept_name: Name of department
  • sales: Sales performance
  • rank: Ranking within the company in descending order of sales performance
  • next_behind: Gap in sales performance to next place (0 if first place)
  • total_behind: Gap to the highest sales performance in the sector
  • diff_from_avg: Difference from sales average

Inquiry results

comp_name dept_name sales rank next_behind total_behind diff_from_avg
CompanyA Dept 1 100 1 20 50 15
CompanyA Dept 3 80 2 0 70 -15
CompanyA Dept 4 80 2 0 70 -15
CompanyA Dept 2 50 4 30 100 -45
CompanyB Dept 4 120 1 30 40 22.5
CompanyB Dept 2 80 2 10 60 -12.5
CompanyB Dept 3 90 3 0 50 2.5
CompanyB Dept 1 70 4 20 70 -27.5

Attention:

  • rankfield, if both departments have the same sales performance, they will share the same rank and the next ranked value will be skipped (e.g., Dept 3 and Dept 4 for CompanyA in the table above).
  • next_behindThe field calculates the difference in sales performance from the next in line, or if there is no next in line (i.e., it is already the last in line), it shows zero.
  • total_behindThe field calculates the difference from the highest sales performance in the department.
  • diff_from_avgField calculates the difference between the current department's sales performance and the average of the sales performance of all departments in the company. A positive value indicates above average and a negative value indicates below average.

Considering the efficiency and simplicity of the SQL query, Xiaoyue then optimized Option 1 and decided to use a window function (e.g.RANK()LEAD()MAX()AVG()) to realize.

The optimized solution uses the window function to perform all calculations directly in the SELECT query, eliminating the need to create temporary tables to store intermediate results.RANK()function is used to calculate the ranking of sales performance.LEAD()function (orLAG()(selected on demand) is used to identify gaps in sales performance with the next person in line.MAX()cap (a poem)AVG()The window function is then used to calculate the highest sales performance and the average sales value for the department, respectively, and thus the difference from these values. Optimized scenario two:

--Use a single statement to implement the temporary table logic in Option 1

--mysql8, oracle10g and mssql2012 or above
SELECT temp1.*,  
       RANK() OVER (PARTITION BY comp_name ORDER BY sales DESC) AS rank,  
       COALESCE(LAG(sales) OVER (PARTITION BY comp_name ORDER BY sales DESC) - sales, 0) AS next_behind,  
       MAX(sales) OVER (PARTITION BY comp_name) - sales AS total_behind,  
       sales - AVG(sales) OVER (PARTITION BY comp_name) AS diff_from_avg  
FROM temp1  
ORDER BY comp_name, rank;

--PostgreSQL Version
SELECT *,
         RANK()
    OVER w AS rank_by_sales, MAX(sales)
    OVER w - sales AS total_behind, sales - AVG(sales)
    OVER w AS diff_from_avg, COALESCE(LAG(sales)
    OVER w - sales, 0) AS next_behind
FROM temp1 WINDOW w AS (PARTITION BY comp_name
ORDER BY  sales DESC);

Option 2 has the same query results as Option 1.

In the SQL query in Scenario 2, several window functions (also known as analytic functions) are used to perform the analysis of thetemp1The data in the table is grouped and sorted, and sales performance metrics are calculated for each department relative to other departments within their company. The following is a detailed explanation of these window functions:

  1. RANK() OVER (...) AS rank:
    • RANK() is a window function that is used to assign a unique ranking to rows within a partition. However, if two or more rows have the same ranking value (in this casesales), then those rows will receive the same ranking, and the next ranking will skip the corresponding number (i.e., if there are two rows tied for first, the next row will be ranked third, not second).
    • OVER (PARTITION BY comp_name ORDER BY sales DESC) Specifies how the window function is partitioned and sorted. Here.PARTITION BY comp_name representcomp_namefield divides the data into groups (i.e., each company department is considered a group), and theORDER BY sales DESC Indicates that within each group, the data will follow thesalesThe fields are sorted in descending order.
  2. COALESCE(LAG(sales) OVER (...) - sales, 0) AS next_behind:
    • LAG(sales) OVER (...) is a window function that accesses the line before the current line of thesalesvalue (which in this case is the value of thesales(the descending order of the rows). If there is no previous row (i.e., the current row is the first row in the partition), theLAG()function will return NULL.
    • COALESCE(expression, 0) function is used to return the first non-NULL expression in its argument list as the result. If theLAG(sales) - salesresults in NULL (i.e., the current row is the first row in the partition), theCOALESCEwill return 0.
    • Therefore.next_behindThe column calculates the gap between the sales performance of the current department and the sales performance of the department immediately behind it. If the current department is in first place, the gap is 0.
  3. MAX(sales) OVER (...) - sales AS total_behind:
    • MAX(sales) OVER (...) is a window function that computes the partitionedsalesThe maximum value of the field. Since the same partitioning and sorting is used here (PARTITION BY comp_name), so it will provide each division with the highest sales performance within that company.
    • Then, from this maximum value, subtract the current row'ssalesvalue to get thetotal_behindThe gap between the current department's sales performance and the highest sales performance within that company.
  4. sales - AVG(sales) OVER (...) AS diff_from_avg:
    • AVG(sales) OVER (...) is another window function that is used to compute the partitioningsalesaverage of the fields. Again, since the use of thePARTITION BY comp_nameThis will provide each department with an average of the sales performance within that company.
    • Then, from the current line of thesalesSubtracting this average from the value givesdiff_from_avgA positive value indicates that the current department's sales performance is different from the average sales performance within the company. A positive value indicates that the current department's sales performance is above average, while a negative value indicates that it is below average.

Finally.ORDER BY comp_name, rank is used to sort the entire query results, first by company name (comp_name) are sorted in ascending order and then ranked by sales within each firm (rank) is sorted in ascending order. However, sincerankAlready based onsalesin descending order, so that actually within each company, therankThe ordering is already descending, but the externalORDER BYIt ensures that the results are also organized between different companies. However, due to therankitself may contain ties in SQL, so the actual ordering may vary slightly depending on the SQL implementation (e.g., in some databases, the rankings of ties may be displayed in their original order in the table).


Here is an introduction to the common window analyzer functions in major databases.RANK()OVER()LAG() 。

  1. RANK() Function.

    • This function first appeared in the IBM DB2 database, dating back to 1993.
    • Subsequently, in the 2000 release of Oracle9i, Oracle introduced theRANK() function.
    • Microsoft SQL Server, on the other hand, has added a new version of SQL Server to the 2005 version of SQL Server.RANK() function support.
    • RANK() The introduction of functions provides data analysts and developers with a more flexible and powerful sorting method, which greatly improves the efficiency of data analysis.
    • The RANK() function is used to assign a rank value to rows in a data set.
      It will sort the rows according to the specified columns (or expressions) and then assign a rank value to each row.
      If there are multiple rows with the same value, they will receive the same rank, and the rank value of subsequent rows will be incremented accordingly.
      This function is useful when you need to sort data and get ranking information, such as finding the top 5 products in terms of sales.

  2. OVER() Clause.

    • OVER() Clauses first appeared in the OLAP (Online Analytical Processing) domain back in the 1990s.
    • Oracle introduced in the 2000 Oracle9i release theOVER() clause, making the use of window functions more convenient and flexible.
    • Microsoft SQL Server has also added support for theOVER() clause support.
    • OVER() The introduction of clauses greatly simplifies the complex task of data analysis and improves the efficiency of developers.
    • The OVER() clause is used to define the scope of the window function.
      It is possible to specify a range of row groups, thus allowing the window function to perform calculations within that range.
      Common uses include: calculating moving averages, calculating cumulative values, and calculating rankings.

  3. LAG() cap (a poem)LEAD() Function.

    • These two functions first appeared in the IBM DB2 database, dating back to 2001.
    • Oracle introduced in the 2006 Oracle 10g release theLAG() cap (a poem)LEAD() function.
    • Microsoft SQL Server added support for these two functions in the 2008 version of SQL Server 2008.
    • LAG() cap (a poem)LEAD() The introduction of the function makes it easier and more efficient to analyze the time series and compare the data.
    • The LAG() function is used to get the value of the N rows before the current row.
      The LEAD() function is used to get the value of the line N lines after the current line.
      These two functions are very useful for time series analysis and data comparison, for example.
      Calculate the difference between current sales and last month's sales
      Determine whether the current quarter's performance is higher than that of the previous quarter

In addition to common window functions such as Rank, Lag, and Over, SQL provides a number of other useful window functions, including.

  1. Row_Number(): assigns a unique number to each row in the partition, counting from 1.

  2. Partition_By: can be used together with other window functions to specify the basis of window partitioning.

  3. First_Value() and Last_Value(): get the first and last value in the window respectively.

  4. Ntile(): distributes the rows in the partition equally among the specified number of groups.

  5. Ratio_To_Report(): calculates the ratio of the current row value to the total value of the partition.

  6. Cume_Dist(): calculates the ratio of the number of rows whose current row is less than or equal to its own value to the total number of rows in the partition.

  7. Percent_Rank(): Calculates the percentile rank of the current row within the partition.

  8. Dense_Rank(): similar to Rank(), but without the concatenation of serial numbers.

These commonly used window functions are the product of the development of database technology, with the increasing demand for OLAP and data analysis, database vendors continue to improve and refine these functions, making data analysis work more simple and efficient. The introduction of these window functions, greatly promoting the development of the field of data analysis.