Abstract: This article explains the hard-core M-SQL paper: a multi-task representation learning method that converts natural language into SQL statements.
This article is shared from the Huawei Cloud community article "[Cloud-based co-creation] M-SQL, a super multi-task representation learning method, you deserve it", by Qiming.
Overall introduction to the data set
Definition introduction
Following international practice, let's start with a definition (courtesy of Baidu Baike): Text-to-SQL, as the name suggests, is the task of generating a SQL statement from a user's question, given a database (or table). Its mathematical description is as follows:
Let X denote the user's natural language question, D the database (or table) related to the question, and Y the corresponding SQL statement.
The SQL generation task can then be expressed as: for each independent triple (X, D, Y), map (X, D) to the corresponding Y.
Take a familiar scenario as an example. Suppose we have a student information table and ask in natural language: "Which students are older than 18?" The model should return the corresponding SQL statement:
SELECT Name FROM Student Information WHERE Age > 18
Scenario classification
Text-to-SQL tasks can be classified in several ways. The first is by question type:
One is context-independent (no correlation between questions): Spider
The other is context-dependent (consecutive questions are related): SparC
"Context-independent" means the questions have no relation to one another, while "context-dependent" means that two questions have some referential or logical relationship.
Similarly, we give a simple example to illustrate:
Question 1: What is the ID of the doctor who has an appointment?
SELECT physician FROM appointment
Question 2: What is their name?
SELECT T2.name FROM appointment AS T1 JOIN physician AS T2 ON T1.Physician = T2.EmployeeID
The above is a context-dependent example: the first question finds the doctor's ID, and the second finds the doctor's name based on that ID.
The second classification is by domain: single-domain or cross-domain.
If all the questions are about aviation, the data set is single-domain. In a cross-domain data set, both the training set and the test set cover many domains, but the domains in the training set and in the test set do not overlap, which requires the model to have a certain generalization ability.
The third classification is by database:
Single-table: WikiSQL, where each question is asked against a single table (or a database containing only one table)
Multi-table: Spider, where the database contains many tables and the generated SQL statements may involve joins between multiple tables
The fourth classification is by annotation type:
Final answer only: WikiTableQuestions
SQL statement: WikiSQL, Spider
Some data sets do not provide the SQL statement, only the final answer. Taking the earlier example, "Which students are older than 18?", the annotation may be the SQL statement, or just the final result (the list of names) without any SQL. The "weakly supervised learning" setting this implies will not be covered here.
TableQA data set
The paper discussed here uses the TableQA data set, which is also single-table data: each question targets exactly one table. TableQA has many similarities to WikiSQL, but also some differences, as shown in the following figure:
Paper overview
Having introduced the data set, let's give a brief overview of the model proposed in the paper.
First, consider the question: what methods could generate SQL from natural language? The simplest idea is a sequence-to-sequence model: the input is the natural language question, and the output is the corresponding SQL statement, generated token by token.
In our earlier example, the Encoder input is "Which students are older than 18?", and the output is SELECT Name FROM XX-table WHERE Y-condition.
This method is simple, but it brings problems: SQL is a structured query language with a fixed grammar, which makes the task somewhat different from general language generation.
In general language generation, changing one or two words may barely change the semantics. In a SQL statement, a single wrong token can make it unexecutable. We therefore need to exploit the grammatical, i.e. structural, information inside SQL statements. This is what the paper proposes: generating the SQL according to a fixed framework (sketch).
M-SQL
Because the TableQA data set targets a single table, the FROM clause can be omitted. A SQL statement can then be roughly divided into two parts: the SELECT clause and the WHERE clause.
The SELECT clause has two components: the names of the selected columns and the aggregation operation, e.g. taking the maximum or minimum of a column, or summing a column.
For the WHERE clause, the parts are as follows:
$WOP: WHERE condition connector (and / or / Null)
$COLUMN: a column name in the database
$AGG: operation applied to the selected column (Null, AVG, MAX, MIN, COUNT, SUM)
$OP: condition operator in the WHERE clause (>, <, ==, !=)
$VALUE: condition value in the WHERE clause
According to statistics on the TableQA data set, SELECT contains at most 2 columns and WHERE at most 3 conditions, so the sketch is:
SELECT ($AGG $COLUMN)*
WHERE $WOP ($COLUMN $OP $VALUE)*
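As a concrete illustration, here is how the slots of this sketch might be filled for our running example (a hypothetical encoding for exposition, not the paper's exact data format):

```python
# Slot filling for: SELECT Name FROM Student WHERE Age > 18
sketch = {
    "select": [("Null", "Name")],    # ($AGG $COLUMN)*: no aggregation on Name
    "wop": "Null",                   # $WOP: a single condition needs no connector
    "where": [("Age", ">", "18")],   # ($COLUMN $OP $VALUE)*
}
```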
M-SQL model
The model can be roughly divided into three parts, from bottom to top.
Encoder: the input is encoded with a plain BERT model, version wwm-ext. "wwm" means whole-word masking was used in pre-training, and "ext" means the training data was extended and the number of training steps increased.
The input consists of the question and the column names. Again taking "Which students are older than 18?" as an example: as shown in the figure above, T1 to TL (the question tokens) are followed by the name of each column in the queried table, e.g. Name, Student ID, or Age. One difference from the standard BERT input is that [XLS] tokens are used in place of [CLS].
Column representation: enhancing the representation of each column. A column name may consist of multiple tokens; for example, the column "name" may be two characters, each with its own embedding. How do we combine these embeddings into a single embedding representing the column? We can use the preceding [XLS] representation to enhance the column representation, as follows:
First, use the preceding [XLS] representation to compute attention over all the token representations of the column. Then add the embedding of the [XLS] representation itself; the sum of the two forms the enhanced representation of the column.
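A minimal PyTorch sketch of this enhancement (the vector shapes and the exact attention form are assumptions; the paper's formula may differ in detail):

```python
import torch
import torch.nn.functional as F

def enhance_column(xls, col_tokens):
    """xls: (d,) embedding of the [XLS] token preceding the column;
    col_tokens: (n, d) embeddings of the column-name tokens."""
    scores = col_tokens @ xls            # (n,) attention scores against [XLS]
    weights = F.softmax(scores, dim=0)   # normalize over the column's tokens
    context = weights @ col_tokens       # (d,) attention-weighted token summary
    return context + xls                 # enhanced column representation
```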
After the above steps, we get the representation of each token in the question and the representation of each column in the table.
Sub-models: 8 sub-models, trained jointly through multi-task learning.
As mentioned earlier, the SQL statement is divided into parts, each generated separately, which gives 8 sub-tasks:
• Number of SELECT columns
• Number of WHERE conditions and the connector
• SELECT columns
• Aggregation operator for each SELECT column
• WHERE columns
• Condition operator for each WHERE column
• Value extraction
• Value matching
Next, we describe how each of these 8 sub-tasks is handled.
Task 1: S-num: the number of columns in SELECT. [1, 2] (2 classes)
For the TableQA data set, SELECT can contain only one or two columns, so we can treat this as a binary classification problem: apply a linear transformation to the [XLS] embedding, then obtain the probability through a sigmoid.
Task 2: W-num-op: the connector and the number of conditions in WHERE. [null-1, and-1, or-1, and-2, or-2, and-3, or-3] (7 classes)
The second task predicts the connector and the number of conditions in WHERE. The "connector" is "and" or "or"; the number of conditions is how many conditions (e.g. ">", "<", "=") appear in WHERE. The part before the "-" is the connector and the part after is the number of conditions, giving 7 classes. The two predictions could be made separately, but doing so noticeably hurts performance compared with predicting them jointly.
With 7 classes in total, this is a 7-way classification: a linear transformation of the [XLS] representation followed by a softmax yields the probability distribution over the 7 classes.
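A sketch of these two sentence-level heads over the [XLS] embedding (the hidden size of 768 and the head shapes are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

d = 768                               # BERT hidden size (assumed)
s_num_head = nn.Linear(d, 1)          # Task 1: 1 vs 2 SELECT columns
w_num_op_head = nn.Linear(d, 7)       # Task 2: 7 connector/condition-count classes

xls = torch.randn(d)                  # stand-in for the leading [XLS] representation
p_two_cols = torch.sigmoid(s_num_head(xls))              # P(SELECT has 2 columns)
p_w_num_op = torch.softmax(w_num_op_head(xls), dim=-1)   # distribution over 7 classes
```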
The third and fourth sub-tasks predict the columns appearing in the SELECT clause and the WHERE clause. Since the number of SELECT columns and WHERE conditions has already been predicted, this part only needs to predict a probability for each column.
Task 3: S-col: the columns appearing in SELECT
Each column's enhanced representation is passed through a linear transformation and then a softmax to obtain that column's probability.
Task 4: W-col: the columns appearing in the WHERE condition
Similarly, a different linear transformation yields each column's probability of appearing in WHERE.
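A sketch of this column scoring (Tasks 3 and 4), with separate heads normalized over the table's columns; sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

d, num_cols = 768, 5
cols = torch.randn(num_cols, d)        # stand-in enhanced column representations
s_col_head = nn.Linear(d, 1)           # Task 3 head
w_col_head = nn.Linear(d, 1)           # Task 4 head

p_select = torch.softmax(s_col_head(cols).squeeze(-1), dim=0)  # SELECT-column probs
p_where = torch.softmax(w_col_head(cols).squeeze(-1), dim=0)   # WHERE-column probs
```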
Task 5: S-col-agg: the aggregation operator for each column in SELECT
- [Null, AVG, MAX, MIN, COUNT, SUM] (6 classes)
The fifth task predicts the operators appearing in SELECT, also called aggregation operations; for example, taking the maximum, minimum, average, or sum of all values in a column.
In TableQA there are 5 operators plus Null, 6 classes in total, so this is a 6-way classification problem. As before, a linear transformation is applied to each enhanced column representation, and a softmax gives the probability distribution over the classes.
Task 6: W-col-op: the condition operator for each column in the WHERE condition
- [> / < / == / !=] (4 classes)
The operators in the WHERE condition are handled the same way. A condition may test whether a column is greater than, less than, equal to, or not equal to some value: 4 classes in total, so this is a 4-way classification problem. The method matches the SELECT operators above: a linear mapping over the enhanced column representation followed by a softmax gives the distribution over the 4 classes, and the largest is chosen as the column's operator.
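A sketch of the two per-column operator heads (Tasks 5 and 6), again with assumed sizes:

```python
import torch
import torch.nn as nn

d, num_cols = 768, 5
cols = torch.randn(num_cols, d)   # stand-in enhanced column representations
s_agg_head = nn.Linear(d, 6)      # Task 5: Null, AVG, MAX, MIN, COUNT, SUM
w_op_head = nn.Linear(d, 4)       # Task 6: >, <, ==, !=

agg_per_col = torch.softmax(s_agg_head(cols), dim=-1).argmax(dim=-1)
op_per_col = torch.softmax(w_op_head(cols), dim=-1).argmax(dim=-1)
```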
The last two sub-tasks predict the condition value. In our earlier example, "Which students are older than 18?", the WHERE condition should contain Age > 18, so how do we obtain the 18? The authors split this into two sub-tasks based on the question:
Task 7: extract phrases that may be values
- Tag the tokens in the question with 0/1 (1 means value, 0 means non-value); each run of consecutive
1-tagged tokens is treated as one phrase
The first step is to extract phrases from the question that may be values. For the question "Which students are older than 18?", this sub-task should extract "18". We tag each token appearing in the question: "18" is tagged 1, and all other tokens are tagged 0. A linear transformation is applied to each token's representation in the question, and a sigmoid predicts whether its tag is 1 or 0.
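A sketch of this 0/1 tagging head over the question-token representations (sizes assumed):

```python
import torch
import torch.nn as nn

d, seq_len = 768, 12
tokens = torch.randn(seq_len, d)   # stand-in question-token representations
tag_head = nn.Linear(d, 1)

p_value = torch.sigmoid(tag_head(tokens)).squeeze(-1)  # per-token value probability
tags = (p_value > 0.5).long()      # 1 = token belongs to a candidate value phrase
```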
Task 8: match the extracted phrases with the columns appearing in WHERE
Building on Task 7, we need to match the extracted phrases with the columns appearing in WHERE.
In the previous step, "18" was tagged with 1, producing the phrase "18". It is a value that may appear in some condition, but which column it belongs to is determined in this step.
If a phrase consists of multiple tokens, the representations of all its tokens are averaged. The formula shown in the figure below essentially computes a similarity between the phrase and a column appearing in WHERE and passes it through a sigmoid; the u in it is a learnable parameter. The sigmoid output tells whether the phrase matches the column: if it does, the phrase is used as that column's value. For example, 18 matches Age, so we can write Age > 18.
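A sketch of the matching step; the exact similarity form is an assumption, here the phrase and column vectors are concatenated and projected with the learnable parameter u mentioned above:

```python
import torch
import torch.nn as nn

d = 768
u = nn.Parameter(torch.randn(2 * d))  # the learnable parameter u from the text

phrase_tokens = torch.randn(2, d)     # token representations of an extracted phrase
col = torch.randn(d)                  # enhanced representation of a WHERE column

phrase = phrase_tokens.mean(dim=0)    # average over a multi-token phrase
p_match = torch.sigmoid(torch.cat([phrase, col]) @ u)  # probability of a match
```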
Execution-Guided Decoding
The 8 sub-tasks above yield a SQL statement that is guaranteed to follow the grammar sketch. However, the generated statement may still fail to execute.
This is because SQL statements are subject to some internal constraints:
• If a string-type column appears in the SELECT clause, the corresponding operation cannot be 'SUM', 'MIN', or 'MAX'
• If a string-type column appears in the WHERE clause, the corresponding operator cannot be '<' or '>'
• The columns appearing in the SELECT clause and the WHERE clause are distinct from each other (based on analysis of the data set)
Under these constraints, we can use execution-guided decoding: during decoding, discard SQL statements that cannot be executed, e.g. those whose execution result is empty or that raise an error when run, and select the highest-probability statement that satisfies the conditions above.
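A minimal sketch of this filtering, assuming the table lives in a SQLite database and candidates come with probabilities:

```python
import sqlite3

def execution_guided_pick(candidates, db_path):
    """candidates: list of (sql, prob). Return the highest-probability SQL
    that executes without error and returns a non-empty result."""
    conn = sqlite3.connect(db_path)
    for sql, _ in sorted(candidates, key=lambda c: -c[1]):
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue          # cannot be executed at all: discard
        if rows:              # empty result: also discard
            return sql
    return None
```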
Experimental results
Next are the experimental results.
First, let's briefly introduce the evaluation metrics used: LX, X, and MX.
LX is the logical-form accuracy: if the generated SQL statement is exactly the same as the gold SQL statement, the example is correct; if there is any difference, e.g. a ">" is wrong or the wrong column was selected, the example counts as an error.
X is the execution accuracy: two SQL statements may differ in logical form, but if their execution results agree, the prediction is considered correct.
MX is the average of LX and X. Two variants are reported: a single model and an ensemble (suffixed Ens), which combines the results of multiple training runs to obtain a better result. As the figure shows, it outperforms previous models.
The previous models were built for WikiSQL. TableQA differs from WikiSQL and is harder, so those models did not perform very well on the TableQA data set.
Subtask performance
The following figure compares the performance of 8 different sub-models:
We can see that each sub-model performs very well on its own, and the ensemble achieves even better results.
Ablation experiment
In the last part of the experiment we did a series of ablation experiments.
From the experimental results, the BERT-wwm-ext version outperforms BERT-base, and using [XLS] as the prefix token works better than [CLS]. The lower part of the figure covers the value-extraction and value-matching methods used, which are introduced in more detail below.
Detail processing in reproduction
Next, we introduce some details of the reproduction process.
First comes data preprocessing. The data in this data set is not standardized, and the following situations can occur (the ambiguous part and its normalized form are shown in parentheses):
• Numbers: Which cities sold more than 150,000 square meters of first-hand housing last week? (15万, 150000)
• Year: Do you know the area of land sold in '10? (10年, 2010)
• Unit: Which cities had more than 50,000 units of new housing inventory last week? (5万, 50000)
• Date: Which company was established on December 28, 2018? (2018年12月28日, 2018/12/28)
• Synonym: Can you help me calculate the total play count of Mango's dramas? (芒果, 芒果TV)
The first few cases can be converted directly with rules; the last (synonyms) can be handled by looking up related words in the database, as the sketch below illustrates for the rule-based cases.
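A minimal rule-based normalization sketch; the specific regular expressions are illustrative assumptions, not the reproduction's actual rules:

```python
import re

def normalize(text):
    # "15万" -> "150000": expand the Chinese unit 万 (ten thousand)
    text = re.sub(r'(\d+(?:\.\d+)?)万',
                  lambda m: str(int(float(m.group(1)) * 10000)), text)
    # "10年" -> "2010年": expand bare two-digit years (assumed to mean 20xx)
    text = re.sub(r'(?<!\d)(\d{2})年', r'20\1年', text)
    return text

print(normalize("上周一手房成交面积超过15万方"))  # -> ...超过150000方
```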
Value extraction
For value extraction we tried many methods, such as bert+crf, bert+bilstm+crf, and bert+half-pointer. In the end, the 0/1 tagging method was adopted because it works best:
• bert + crf, val_acc: 0.8785
• bert + bilstm + crf, val_acc: 0.8801
• bert + half-pointer, val_acc: 0.8891
• bert + 0/1 tagging, val_acc: 0.8922
How is the 0/1 approach implemented? Let's take the question "青秀南城百货有限公司在哪里？" (Where is Qingxiu Nancheng Department Store Co., Ltd.?) as a detailed example.
query: 青秀南城百货有限公司在哪里？
bert_tokenizer: ['[XLS]', '青', '秀', '南', '城', '百', '货', '有', '限', '公', '司', '在', '哪', '里', '[SEP]']
value: 青秀南城百货有限公司 (Qingxiu Nancheng Department Store Co., Ltd.)
tag: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
The question is first tokenized to obtain the token sequence. The tokens belonging to the value "青秀南城百货有限公司", which appears in the SQL statement, are tagged 1; all other tokens are tagged 0.
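A sketch of how such tags could be constructed from the token list and the value string (a simple span match, assumed rather than taken from the paper's code):

```python
def make_tags(tokens, value):
    """Tag tokens with 1 where the value string appears as a contiguous span."""
    chars, tags = list(value), [0] * len(tokens)
    for i in range(len(tokens) - len(chars) + 1):
        if tokens[i:i + len(chars)] == chars:
            tags[i:i + len(chars)] = [1] * len(chars)
    return tags

tokens = ['[XLS]', '青', '秀', '南', '城', '百', '货', '有', '限',
          '公', '司', '在', '哪', '里', '[SEP]']
print(make_tags(tokens, '青秀南城百货有限公司'))
# -> [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```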
Detail processing
Value Search
The extracted value may not be standardized, or the wording in the question may not match what is stored in the database, e.g. "人人" (Renren) in the example below:
Query1: What is Renren's weekly gain/loss?
Value: 人人 (Renren)
In this case, we search over all values appearing in the candidate column and select the closest one as the final value. Many search methods are possible, such as rouge-L matching and several machine learning methods: logistic regression, SVR, and Bayesian. Comparing their effects, logistic regression is the best method, with an accuracy of 97%.
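A simple value-search sketch using string similarity; the paper compares rouge-L and several ML classifiers, and stdlib difflib merely stands in here:

```python
import difflib

def search_value(extracted, column_values):
    """Return the stored value most similar to the extracted phrase."""
    return max(column_values,
               key=lambda v: difflib.SequenceMatcher(None, extracted, v).ratio())

print(search_value('人人', ['人人网', '腾讯', '阿里巴巴']))  # -> 人人网
```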
Table-Column information enhancement:
The last technique uses the contents of the table to enhance the column representation.
As shown in the figure above, a cell value is randomly sampled from the column, e.g. "Guangxi" from a "Region" column. The column is then represented as "Region, Guangxi", and this whole string is fed to the input end to derive the column representation. Enhancing columns this way yields a final improvement of 0.4 points.
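A tiny sketch of this enhancement at the text level (the separator and sampling strategy are assumptions):

```python
import random

def enhanced_column_text(col_name, cell_values):
    """Append a randomly sampled cell value to the column name."""
    return f"{col_name}, {random.choice(cell_values)}"   # e.g. "Region, Guangxi"

print(enhanced_column_text("Region", ["Guangxi", "Guangdong", "Yunnan"]))
```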
Recurring problems and suggestions
1. The data set is not standardized. It is recommended to extract a standardized subset of the data for training and prediction;
2. Don't reproduce from scratch; refer to existing code built on existing models.
M-SQL: A multi-task representation learning method that converts natural language into SQL statements
To view the interpretation video and algorithm link of this issue, please click: