
Introduction:

SyntaxSQLNet is the first baseline method designed for the Spider dataset; it was proposed by the Spider authors and released together with the dataset itself.

Introduction to SyntaxSQLNet

SyntaxSQLNet can be thought of as SQLNet extended to handle complex SQL statement structures. Compared with SQLNet, SyntaxSQLNet makes the following improvements:

  • The decoding process introduces structural information: the decoding target is a tree-structured representation of the SQL statement.
  • Prediction is decomposed into nine modules, each corresponding to one component of a SQL statement. The order in which the nine modules are called is determined by a predefined SQL grammar, which is how the structural information is introduced.
  • A data augmentation method for Text2SQL tasks is provided to generate cross-domain, more diverse training data.

Methods

Module overview

SyntaxSQLNet introduces structural information into the decoding process: the decoding target is a tree-structured representation of the SQL statement. Prediction is broken down into nine modules, each corresponding to one component of a SQL statement. During decoding, the order in which the nine modules are called is determined by a predefined SQL grammar, which introduces the structural information. The tree is generated in depth-first order (for example, a column is predicted first, then its AGG). The nine modules are as follows:

  • IUEN module: predicts INTERSECT, UNION, EXCEPT, NONE (related to nested queries)

  • KW module: predicts the WHERE, GROUP BY, ORDER BY, SELECT keywords

  • COL module: predicts column names

  • OP module: predicts operators such as >, <, =, LIKE

  • AGG module: predicts aggregation functions such as MAX, MIN, SUM

  • Root/Terminal module: predicts subqueries or terminal values

  • AND/OR module: predicts the relationship between conditional expressions

  • DESC/ASC/LIMIT module: predicts the keywords associated with ORDER BY

  • HAVING module: predicts HAVING clauses associated with GROUP BY
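The grammar-driven calling order can be pictured as a lookup table from the current token to the modules allowed next. The dispatch table below is a hypothetical, simplified sketch for illustration; it is not the authors' exact grammar.

```python
# Hypothetical, simplified sketch of a predefined SQL grammar that decides
# which module is called after each token (module/token names follow the
# paper; the table's contents are illustrative assumptions).
GRAMMAR = {
    "ROOT": ["IUEN"],        # every (sub)query starts with the IUEN module
    "NONE": ["KW"],          # no set operation -> predict clause keywords
    "SELECT": ["COL"],       # SELECT keyword -> predict its columns
    "WHERE": ["COL"],        # WHERE keyword  -> predict condition columns
    "ORDER BY": ["COL"],     # ORDER BY       -> predict sort columns
}

def next_modules(current_token: str):
    """Return the modules the grammar allows after current_token."""
    return GRAMMAR.get(current_token, [])

print(next_modules("ROOT"))    # ['IUEN']
print(next_modules("SELECT"))  # ['COL']
```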

SQL generation example

The SQL statement generation process, as illustrated in the paper's generation figure, is as follows:

  • First, look at the “Current Token” section in the middle of the figure, where ROOT marks the start of a SELECT statement. The “Current Token” determines which module is called to predict subsequent tokens.
  • Second, look at “Module to Call” on the right. Each square represents a different module; invoking a module generates the corresponding tokens.
  • Then, look at the “History” section on the left. Besides the “Current Token”, SyntaxSQLNet also conditions on the SQL tokens predicted so far to improve prediction accuracy. When SQL statements are long, the history also captures dependencies between clauses.
  • Finally, look at the “Stack” section in the upper-left corner. The stack is used to generate the SQL syntax tree recursively.

For example, at the beginning the algorithm pushes a ROOT token onto the stack, then pops it to trigger the IUEN module. If IUEN predicts the NONE token, the algorithm pushes the predicted NONE onto the stack. Next, NONE is popped as the “Current Token” to trigger the subsequent modules. The recursion continues in this way until the stack is empty, at which point the prediction ends.
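The stack-driven loop above can be sketched in a few lines. Here the neural modules are stubbed out with canned predictions so the control flow can be followed end to end; the stub and its outputs are illustrative assumptions, not the model's actual predictions.

```python
# Minimal sketch of the stack-driven decoding loop: pop the current token,
# call the module(s) it triggers, and push the predicted children so the
# SQL syntax tree is expanded depth-first.
def decode(predict):
    stack = ["ROOT"]             # initialization: ROOT is pushed first
    history = []                 # SQL history consumed by later modules
    while stack:
        token = stack.pop()      # "Current Token" triggers a module
        history.append(token)
        for child in reversed(predict(token, history)):
            stack.append(child)  # children pushed -> depth-first expansion
    return history               # decoding ends when the stack is empty

# Stubbed module outputs: IUEN predicts NONE, KW predicts SELECT, and the
# COL module terminates the branch.
stub = {"ROOT": ["NONE"], "NONE": ["SELECT"], "SELECT": ["COL"], "COL": []}
print(decode(lambda tok, hist: stub[tok]))
# -> ['ROOT', 'NONE', 'SELECT', 'COL']
```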

Model structure

The model input consists of Question, Table Schema and SQL History.

  • The Question is encoded with a BiLSTM.

  • For the Table Schema, SQLNet only encodes the column names. Because the Spider task is more complex, SyntaxSQLNet takes the table name, column name, and column type as input, and a BiLSTM encodes the table and column information.

  • For the SQL History, the SQL tokens preceding the current token need to be encoded. Training and testing differ here: during training, the gold query provided in the annotated data is used directly as the SQL history; at test time, the SQL tokens predicted so far are used instead.

  • In addition, different words in the Question and the SQL History matter differently when predicting a specific token. To address this, SyntaxSQLNet applies an attention mechanism in each module to improve prediction accuracy.
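The per-module attention can be sketched as a dot-product score between each encoded question word and the current decoder state, followed by a softmax-weighted sum. This is a generic attention sketch under assumed dimensions, not the paper's exact parameterization.

```python
import numpy as np

# Generic attention sketch: score each question hidden state against the
# current state, softmax the scores, and return the weighted context vector.
def attend(question_h, state):
    scores = question_h @ state            # one score per question word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over question words
    return weights @ question_h            # context vector, shape (d,)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))  # 5 question words, hidden size 8 (illustrative)
s = rng.normal(size=8)       # current decoder state
ctx = attend(q, s)
print(ctx.shape)             # (8,)
```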

Data augmentation

The paper also proposes a data augmentation method for Text2SQL tasks to generate cross-domain, more diverse training data.

First, general question-SQL templates are extracted from the annotated Spider examples; after filtering out overly simple ones, 280 high-complexity templates remain. Then, for each table in the WikiSQL dataset, 10 templates are randomly selected, and columns from the table are randomly chosen to fill their slots, with the data type of the selected column matched to the slot's required type (for example, numeric). About 98,000 question-SQL pairs are generated for training in this way.
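The template-filling step can be sketched as below. The template, table name, and schema are invented examples; the type-matching filter mirrors the constraint described above, not the authors' exact implementation.

```python
import random

# Hypothetical question-SQL template with one typed slot.
TEMPLATE = {
    "question": "What is the maximum {COL} ?",
    "sql": "SELECT MAX({COL}) FROM {TABLE}",
    "slot_type": "number",   # the slot must be filled with a numeric column
}

def fill(template, table, columns):
    """Fill the template's slot with a randomly chosen type-matching column."""
    # keep only columns whose type matches the slot's required type
    candidates = [name for name, ctype in columns
                  if ctype == template["slot_type"]]
    col = random.choice(candidates)
    return (template["question"].format(COL=col),
            template["sql"].format(COL=col, TABLE=table))

# Invented example schema: only "age" matches the numeric slot.
q, s = fill(TEMPLATE, "players", [("name", "text"), ("age", "number")])
print(q, "|", s)
```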

Experiments

With the methods above, SyntaxSQLNet achieved good results at the time: Exact Match accuracy improves by more than ten points over the SQLNet and TypeSQL models.

Conclusion

SyntaxSQLNet is the first baseline method for Spider. Compared with SQLNet on WikiSQL, SyntaxSQLNet exploits richer SQL syntax information and can predict more complex SQL statements. The paper was accepted at EMNLP 2018 together with the Spider dataset paper. The next article will introduce an approach based on intermediate representations: IRNet.