The window function looks like this:
表达式 OVER (PARTITION BY 分组字段 ORDER BY 排序字段)
There are two abilities:
- When the expression is
rank()
dense_rank()
row_number()
, it has the ability to group and sort. - When the expression is an aggregation function such as
sum()
, it has the ability to accumulate aggregation.
Regardless of capability, the window function does not affect the number of rows of data, but rather amortizes the computation over each row .
These two abilities need to be understood separately.
bottom table
The above is the sample bottom table, there are 8 pieces of data, city 1, city 2 two cities, below each have area 1~4, each piece of data has the population number of the data.
Sort by group
If we sort by population, ORDER BY people
will do, but what if we want to sort within cities?
At this point, the grouping and sorting capabilities of the window function are used:
SELECT *, rank() over (PARTITION BY city ORDER BY people) FROM test
The SQL says to sort by people within the city group.
Actually PARTITION BY is also optional, if we ignore it:
SELECT *, rank() over (ORDER BY people) FROM test
It is also valid, but this statement is equivalent to ordinary ORDER BY. Therefore, PARTITION BY is generally used when using window functions for grouping and sorting.
Differences between grouping sorting functions
We print out the results of rank()
dense_rank()
row_number()
:
SELECT *,
rank() over (PARTITION BY city ORDER BY people),
dense_rank() over (PARTITION BY city ORDER BY people),
row_number() over (PARTITION BY city ORDER BY people)
FROM test
In fact, it can be guessed from the results that when these three functions encounter the same value in sorting, they have the following differences in the ranking statistical logic:
rank()
: Same rank for the same value, but takes the rank number.dense_rank()
: The ranking is the same when the value is the same, but does not occupy the ranking number, and the overall ranking is more compact.row_number()
: Force ranking by row number regardless of whether the values are the same.
The above example can be optimized a bit, because all window logic is the same, we can use WINDOW AS to extract as a variable:
SELECT *,
rank() over wd, dense_rank() over wd, row_number() over wd
FROM test
WINDOW wd as (PARTITION BY city ORDER BY people)
Cumulative Aggregation
As we said before, everything that uses an aggregate function turns the query into an aggregate mode. If GROUP BY is not used, the number of returned rows will be compressed into one row after aggregation. Even if GROUP BY is used, the number of returned rows will generally be greatly reduced because the grouping is aggregated.
However, aggregation using window functions does not result in a reduction in the number of returned rows, so how is this aggregation calculated? Let's just look at the following example:
SELECT *,
sum(people) over (PARTITION BY city ORDER BY people)
FROM test
It can be seen that in each city group, after sorting by people, accumulated (the same value will be merged together), which is the implementation idea of RUNNGIN_SUM generally said by BI tools. The dates will be repeated, so you won't encounter the problem of the merge calculation in the first red box.
There are also cumulative functions such as avg()
min()
, etc. These can also act on the window function, and its logic can be understood according to the following figure:
You may have doubts, isn't it more convenient to directly sum (the result of the previous line, the next line)? To verify the conjecture, let's try the result of
avg()
:
It can be seen that if the cache of the result of the previous row is directly used, the avg result must be inaccurate, so the window cumulative aggregation is recalculated for each row. Of course, the possibility of additional performance optimization for sum, max, and min is not ruled out, but avg can only be recalculated for each line.
Use with GROUP BY
The window function can be used in combination with GROUP BY. The rule is that the window range is valid for the following query results, so it doesn't really care whether GROUP BY is performed. Let's look at the following example:
The cumulative aggregation after grouping by region is performed on the granularity of data rows after GROUP BY, not the previous detail rows.
Summarize
Window functions are very useful in scenarios such as sorting or accumulating GVM within a calculation group, we just need to keep two knowledge points in mind:
- Grouping sorting only makes sense when combined with PARTITION BY.
- Cumulative aggregation is used for query result row granularity and supports all aggregation functions.
The discussion address is: Intensive Reading "SQL Window Functions" Issue #405 ascoders/weekly
If you want to participate in the discussion, please click here , there are new topics every week, released on weekends or Mondays. Front-end intensive reading - help you filter reliable content.
Pay attention to front-end intensive reading WeChat public number
<img width=200 src="https://img.alicdn.com/tfs/TB165W0MCzqK1RjSZFLXXcn2XXa-258-258.jpg">
Copyright notice: Free to reprint - non-commercial - non-derivative - keep attribution ( Creative Commons 3.0 license )
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。