Why does SQL support aggregate queries?
This may seem like a naive question, but let's think about it step by step. Data is stored in behavioral granularity. The simplest SQL statement is select * from test
, which obtains the entire two-dimensional display, but this is not enough. For the following two purposes, SQL is required to provide aggregate functions:
- The detailed data has no statistical significance. For example, I want to know the total turnover of today, but I don't care how much the guests at a certain table spend.
- Although it is possible to check the data into memory before aggregation, it is easy to burst the memory when the amount of data is very large. Maybe the amount of data in a table is 10TB a day, and even if 10TB of data can be read into memory, Aggregate computations can also be unacceptably slow.
In addition, the aggregation itself has a certain logic complexity, and SQL provides aggregation functions and grouping aggregation capabilities, which can easily and quickly count aggregated data with business value, which lays the analytical value of the SQL language, so most analysis software directly uses SQL as a direct user-facing expression.
Aggregate function
Common aggregation functions are:
- COUNT: count.
- SUM: Summation.
- AVG: Average.
- MAX: Find the maximum value.
- MIN: Find the minimum value.
COUNT
COUNT is used to calculate how many pieces of data there are. For example, let's see how many pieces of data there are in the id column:
SELECT COUNT(id) FROM test
But we found that the COUNT of any column is the same, so what's the point of passing in the id? There is no need to find a specific column to refer to, so it can also be written as:
SELECT COUNT(*) FROM test
But there are subtle differences between the two. There is a very special value type NULL
in SQL. If COUNT specifies a specific column, the row whose value is NULL
will be skipped during statistics, and COUNT(*)
not specify a specific column, so even if it contains NULL
, even a row has all The columns are all NULL
and will also be included. Therefore, the result detected by COUNT(*)
must be greater than or equal to COUNT(c1)
.
Of course, any aggregate function can follow the query condition WHERE, for example:
SELECT COUNT(*) FROM test
WHERE is_gray = 1
SUM
SUM sums all items, so must be used on numeric fields, not strings.
SELECT SUM(cost) FROM test
SUM treats a NULL value as 0 because this is equivalent to ignore.
AVG
AVG averages all terms, so must work on numeric fields, not strings.
SELECT AVG(cost) FROM test
When AVG encounters a NULL value, it adopts the most thorough ignoring method, that is, NULL does not participate in the calculation of the numerator and denominator at all, as if this row of data does not exist.
MAX、MIN
MAX and MIN are used to find the maximum and minimum values respectively. When the above are different, they can also be applied to strings. Therefore, the size can be judged according to the letters. From large to small, it corresponds to a-z
, but even if it can be calculated, it is meaningless and not good. understand, so extrema on strings is not recommended.
SELECT MAX(cost) FROM test
Multiple aggregate fields
Although they are all aggregate functions, MAX and MIN are not strictly aggregate functions because they only find rows that satisfy the conditions. You can see the comparison of the following two query results:
SELECT MAX(cost), id FROM test -- id: 100
SELECT SUM(cost), id FROM test -- id: 1
The first query can find the id of the row with the maximum value, but the id of the second query is meaningless, because it does not know which row it belongs to, so only the id of the first data is returned.
Of course, if MAX and MIN are calculated at the same time, then id will only return the value of the first data, because this query result corresponds to plural rows:
SELECT MAX(cost), MIN(cost), id FROM test -- id: 1
Based on these characteristics, it is best not to mix aggregation and non-aggregation, that is, once a query has one field that is aggregated, then all fields must be aggregated.
Many custom fields of BI engines now have this limitation, because mixing aggregation and non-aggregation can handle many edge cases when customizing memory calculations. Although SQL can support it, business-defined functions may not.
Group aggregation
Group aggregation is GROUP BY, which can actually be regarded as an advanced conditional statement.
For example, to query the total GDP of each country:
SELECT COUNT(GDP) FROM amazing_table
GROUP BY country
The returned results will be grouped by country, in which case the aggregation function becomes aggregated within the group.
In fact, if we only want to look at the GDP of China and the United States, we can also check it with non-grouping, but it needs to be divided into two SQLs:
SELECT COUNT(GDP) FROM amazing_table
WHERE country = '中国'
SELECT COUNT(GDP) FROM amazing_table
WHERE country = '美国'
So GROUP BY can also be understood as finding out all the enumerable situations of a field and integrating them into a table, each row represents an enumeration situation, and does not need to be decomposed into WHERE queries one by one.
Multi-field grouping aggregation
GROUP BY can be used for multiple dimensions, and the meaning is equivalent to dragging rows/columns into multiple dimensions when querying a table.
The above is the perspective of the BI query tool. If there is no context, you can see the following progressive description:
- Group aggregation by multiple fields.
- The combination of multiple fields becomes a unique Key, that is,
GROUP BY a,b
means that a and b are combined to describe a group. GROUP BY a,b,c
The first column of the query result may see many repeated a rows, and the second column sees repeated b rows, but not within the same a value, and c is the same in b row.
Below is an example:
SELECT SUM(GDP) FROM amazing_table
GROUP BY province, city, area
The query result is:
浙江 杭州 余杭区
浙江 杭州 西湖区
浙江 宁波 海曙区
浙江 宁波 江北区
北京 .........
GROUP BY + WHERE
WHERE is a conditional filter based on rows. Therefore, GROUP BY + WHERE is not filtering within the group, but filtering the whole.
But because of filtering by row, the results are exactly the same within or without, so we can barely perceive the difference:
SELECT SUM(GDP) FROM amazing_table
GROUP BY province, city, area
WHERE industry = 'internet'
However, ignoring this difference can cause us to hit a wall in aggregate filtering.
For example, to filter out the sum of the grades of students with an average score greater than 60, if you do not use subqueries, you cannot add aggregate functions to WHERE in ordinary queries. For example, the following is an example of a syntax error:
SELECT SUM(score) FROM amazing_table
WHERE AVG(score) > 60
Don't imagine that the above SQL can be executed successfully, and don't use aggregate functions in WHERE.
GROUP BY + HAVING
HAVING is conditionally filtered based on groups. So aggregate functions can be used in HAVING:
SELECT SUM(score) FROM amazing_table
GROUP BY class_name
HAVING AVG(score) > 60
The above example can be queried normally, which means that the total score is grouped by class, and only the classes with an average score greater than 60 are filtered out.
So why HAVING can use aggregation conditions? Because HAVING filters groups, it makes sense to filter out groups that do not meet the conditions after group aggregation. WHERE is for row granularity. After aggregation, there is only one piece of data in the whole table, no matter whether it is filtered or not, it is meaningless.
However, it should be noted that index filtering cannot be used to generate derived tables by GROUP BY, so WHERE can use indexing for fields to optimize performance, while HAVING does not work for indexed fields.
Summarize
Aggregate function + grouping can meet most simple SQL requirements. When writing SQL expressions, you need to think about how such expressions are calculated. For example, MAX(c1), c2
is reasonable, and SUM(c1), c2
, c2
, is meaningless.
Finally remember that WHERE is performed before GROUP BY, and HAVING filters for groups.
The discussion address is: Intensive Reading "SQL Aggregation Query" Issue #401 ascoders/weekly
If you want to participate in the discussion, please click here , there are new topics every week, released on weekends or Mondays. Front-end intensive reading - help you filter reliable content.
Pay attention to front-end intensive reading WeChat public number
<img width=200 src="https://img.alicdn.com/tfs/TB165W0MCzqK1RjSZFLXXcn2XXa-258-258.jpg">
Copyright Notice: Free to reprint - non-commercial - non-derivative - keep attribution ( Creative Commons 3.0 license )
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。