This series is the beginning of the SQL series, introducing some macro and basic content.
What is SQL?
SQL is a structured query language used to manage relational databases. 90% of us are exposed to query syntax, but it actually includes complete addition, deletion, modification, and transaction processing functions.
Declarative features
SQL is a declarative programming language, whereas modern general-purpose programming languages are generally imperative. But don't blindly worship declarative languages, for example, it will replace low-level imperative languages in the future, because declarative itself has its shortcomings, and it also has similarities with imperative languages.
Why do we think declarative programming languages are more advanced? Because the declarative language has a higher degree of abstraction, for example, select * from table1
only describes the query data from table1, but the specific steps of the query are not mentioned at all. There may be complex index optimization and locking mechanisms behind this, but we don't need to care about it. It's the pinnacle of programming.
So why is all common business code now imperative? Because the imperative gives us the opportunity describe specific implementations, and general-purpose programming needs to be built on rigorous implementation details. For example, verifying user permissions, even if AI programming provides the ability to convert "logged in users can only access authorized resources" into code, we don't know what resources specifically refer to, and the ownership of resources in the process of permission transfer. to whom.
The reason why SQL can retain declarative features is entirely because it locks down the specific field of relational data management, and it is precisely because the requirements for this field are standardized and enumerable that declarative is possible.
Declarative capabilities can also be fully extended based on imperative languages. For example, many ORMs provide syntaxes like select({}).from({}).where({})
, and even a login()
function is also a manifestation of declarative programming, because the caller does not need to care how to log in, in short, just call it. After completing the landing, isn't that the whole essence of declarative?
Grammar classification
As a relational database management tool, SQL needs to define, manipulate, and control data.
Data definition is to modify the database and table-level structures. These are data structures, or data metadata, which do not represent specific data, but describe the attributes of the data.
Data manipulation is to modify the specific data row by row, add, delete, modify and check.
Data control is the management and control of transactions and user rights.
data definition
DDL (Data Definition Language) data definition, including CREATE
DROP
ALTER
methods.
data manipulation
DML (Data Manipulation Language) data manipulation, including SELECT
INSERT
UPDATE
DELETE
methods.
data control
DCL (Data Control Language) data control, including COMMIT
, ROLLBACK
, etc.
All SQL operations revolve around these three types, of which data manipulation accounts for almost 90% of the code volume. After all, the appeal of data query is far greater than that of writing. Data writing corresponds to data collection, while data query corresponds to data analysis, and the field of data analysis can be played. The tricks out are far more than the data collection.
PS: In some cases, the most important SELECT
will be mentioned under the DQL (Data Query Language) classification, so that the classification becomes four.
set operations
The first citizen of the SQL world is a collection, just like the first citizen of the JAVA world is an object. We can only understand SQL better if we look at it from the perspective of collections.
What is the set perspective, that is, all queries and operations are carried out in a two-dimensional data structure, rather than the addition, subtraction, multiplication and division relationships between single numbers in elementary school arithmetic.
The operations of sets generally include UNION
union, EXCEPT
difference, and INTERSECT
intersection, which are all operations in units of rows, while various JOIN statements are set operations in units of columns, which are also the join queries mentioned later.
As long as you stand and think in a two-dimensional data structure, operations are nothing more than horizontal or vertical operations.
data paradigm
The data paradigm is divided into five layers, and each layer has more stringent requirements than the previous layer, so it is a paradigm that can be followed step by step. Data paradigms require data to be increasingly decoupled and redundant.
For example, the first normal form requires that each column is atomic, that is, it is the smallest indivisible data unit. If a column is stored as a string when the data is collected, and the provinces and cities are separated by "|", then it is not atomic.
Of course, the actual production process often does not follow this standard, because the table is not isolated. In the data processing flow, the column may be atomized at a certain stage, and the original data is processed by column merging in order to compress the volume.
It is not only the underlying table that I hope to violate the paradigm. In the current big data processing scenario, more and more businesses use a large-width table structure, and even deliberately implement data redundancy to improve query efficiency. The column storage engine is designed for this scenario. , so the data paradigm can be flexible in big data scenarios, but it is still worth learning.
polymerization
When using GROUP BY to group aggregated data, if you want to filter on aggregated values, you cannot use WHERE qualifications, because WHERE is based on row filtering, not for combination. (GROUP BY groups data, we call these groups "combinations"), so you need to use the filter statement HAVING for combinations:
SELECT SUM(pv) FROM table
GROUP BY city
HAVING AVG(uv) > 100
In this example, it is meaningless to replace HAVING with WHERE, because when adding aggregation conditions to WHERE, all data needs to be merged, which does not conform to the level of detail of the current view. (About the view detail level, there are detailed instructions in the intensive reading "What is LOD Expression" that I wrote earlier).
Aggregation is so important because we have to look at the high LEVEL perspective for analyzing data, and detailed data cannot see trends. While complex requirements are often accompanied by filters with aggregations, it is important to understand how SQL is supported.
CASE expression
CASE expressions are divided into simple and search CASE expressions, simple expressions:
SELECT CASE pv WHEN 1 THEN 'low' ELSE 'high' END AS quality
The above example uses a simple CASE expression to form a new field. This mode is equivalent to generating a business custom temporary field, which is very useful when processing data in the current table. The search CASE expression capability completely covers simple CASE expressions:
SELECT CASE WHEN pv < 100 THEN 'low' ELSE 'high' END AS quality
It can be seen that the search CASE expression can use "expression" to describe the conditions, more complex tasks can be easily completed, and even subqueries, aggregations and other means can be used in expressions. These are the idioms used by experts to write SQL, so CASE expressions are well worth learning in depth.
complex query
SELECT is the most complex part of SQL, which includes three complex query modes, namely join query and subquery.
connection query
Refers to JOIN queries, such as LEFT JOIN, RIGHT JOIN, and INNER JOIN.
When we introduced aggregation, we mentioned that a join query is essentially to expand a column, and two tables will not be synthesized for no reason, so there must be a foreign key as a relational link:
SELECT A.pv, B.uv
FROM table1 as t1 LEFT JOIN table2 AS P t2
ON t1.productId = t2.productId
Join queries not only expand the columns, but also expand the rows, and the way of expansion depends on the type of query being joined. In addition to connecting and querying other tables, you can also connect and query yourself, for example:
SELECT t1.pv AS pv1, P2.pv AS pv2
FROM tt t1, tt t2
The result of this sub-connection query is the Cartesian product of itself and itself, which can be filtered through WHERE to remove duplicates. There will be an article dedicated to it later.
Subqueries and Views
A subquery is a SELECT within a SELECT. Generally speaking, a SELECT will be executed from the inside to the outside, and only in the correlated subquery mode will it be executed from the outside to the inside.
And if the subquery is saved, it is a view. This view is not an entity table, so it is very flexible, and the data will change with the original table data:
CREATE VIEW countryGDP (country, gdp)
AS
SELECT country, SUM(gdp)
FROM tt
GROUP BY country
After that, the view countryGDP
can be used as a temporary table.
This mode is actually a bit contrary to the declarative features of SQL, because defining a view is similar to defining a variable. If you continue to write, it will inevitably form a certain imperative thinking logic, but this is unavoidable.
affairs
When SQL executes a series of operations, it is inevitable to encounter the problem of dirty data if it is not executed, so the transaction can guarantee the atomicity of the operation. Generally speaking, each DML operation is a built-in transaction, and the START TRANSACTION provided by SQL allows us to customize the transaction scope, so that a series of business operations can be packaged together and become an atomic operation.
For SQL, atomic operation is very safe, that is, if it fails, it will leave no trace, and if it succeeds, it will all succeed, and there will be no intermediate state.
OLAP
OLAP (OnLine Analytical Processing) is real-time data analysis, which is the basis for the realization of the computing engine behind BI tools.
More and more SQL databases now support the implementation of window functions, which are used to implement functions such as runningSum or runningAvg in business, which are very common in data analysis.
Taking runningSum as an example, for example, the data of the Double Eleven real-time table is real-time GMV in minutes, and we want to make a GMV summary line graph accumulated to the current time, the Y-axis needs to support expressions such as running_sum(GMV)
, and This may be achieved through window functions behind this.
Of course, not all business functions are directly provided by SQL. The business layer still needs to implement a large number of memory functions. For calculation in the JAVA layer, some of these functions need to be pushed down to SQL for execution. Only the combination of memory functions and pushdown functions can Forms the complex calculated field effects we see in BI tools.
Summarize
SQL is a declarative language. A seemingly simple query statement often corresponds to a complex implementation at the engine layer. This is why SQL is so important and yet so popular.
Although SQL is easy to use, to understand it systematically, it is necessary to start with the concept of structured data and collections.
Don't underestimate the CASE syntax. It is not only confused with the CASE syntax of programming languages, but also combines expressions to perform conditional branch judgment, which is the longest routine used by many data analysts in their daily work.
There are fewer and fewer scenarios for creating applications using simple SQL, but in BI scenarios, there are more and more SQL-based enhanced expression scenarios. I created this series with the goal of understanding query expressions in BI scenarios. I hope Be able to apply what you have learned.
The discussion address is: Intensive Reading "Introduction to SQL" Issue #398 ascoders/weekly
If you would like to join the discussion, please click here , there are new topics every week, released on weekends or Mondays. Front-end intensive reading - help you filter reliable content.
Pay attention to front-end intensive reading WeChat public
<img width=200 src="https://img.alicdn.com/tfs/TB165W0MCzqK1RjSZFLXXcn2XXa-258-258.jpg">
Copyright notice: Free to reprint - non-commercial - non-derivative - keep attribution ( Creative Commons 3.0 license )
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。