
Introduction:

In my previous blog post, I introduced IE-SQL, a model that approaches text-to-SQL from the perspective of sequence labeling. This post, the last one on the WikiSQL dataset, introduces a complete Seq2seq method: SeaD, which is also the current SOTA model on WikiSQL.

Introduction to SeaD

SeaD stands for Schema-aware Denoising. The authors cast the text-to-SQL problem back as a Seq2seq generation task, solved by direct autoregressive generation, i.e., predicting the SQL sequence token by token. The paper's main contributions can be summarized as follows:

  • The main task is a Seq2seq model that directly translates the question into the SQL result, rather than relying on slot filling or decoder constraints as previous models did. A hybrid pointer network concatenates the scores over the vocabulary with the unnormalized attention scores over the input tokens, then applies a softmax to obtain the output probabilities.

  • Two denoising objectives are set up for training, following the ideas of BART (see arxiv.org/abs/1910.13…) and MASS (see arxiv.org/abs/1905.02…): Erosion and Shuffle operations that require the model to either recover the original inputs or predict the denoised outputs.

  • Execution-guided (EG) decoding is improved into clause-sensitive EG decoding, to better handle the influence of the aggregation function on the candidate output SQL statements.

Details are as follows.

Methods

Similar to masked language modeling and other denoising tasks, the authors propose two schema-aware objectives, Erosion and Shuffle, in which the model is trained to either reconstruct the original sequence from noisy inputs or predict the noisy outputs. The denoising process is shown in Figure 2.

Erosion

For a given input sequence X = {Q, S}, where Q is the input question and S is the input database schema, Erosion perturbs the schema with three main operations:

  1. Permutation. Rearrange the columns of the schema.
  2. Removal. Each column is discarded randomly with some probability.
  3. Addition. With some probability, take a column from another schema in the database and insert it into the current schema.

In all of the above operations, the special separator tags keep their relative order, so the corresponding anonymized entities in the SQL query must be updated to match the eroded schema order. If a column no longer exists in the schema because it was removed, the corresponding target token becomes <unk>.

As shown in Figure 2 (a), the original schema information is:

```
<col0> week <col1> data <col3> opponent ...
```

A probabilistic Erosion operation is performed on it, for example deleting the week column and then randomly reordering the rest. The resulting schema is:

```
<col0> attendance <col1> venue <col3> result ...
```

Since one of the columns was deleted and the order of the remaining columns changed (e.g., attendance became col0), the SQL statement referencing them must be modified accordingly:

```
SELECT '<unk>' FROM 'table' WHERE '<col0>' = '53,677'
```

By making such joint changes to the schema and SQL sequences, the model must identify the schema entities that are truly relevant to the question, and must learn to throw an unknown exception when the schema information is insufficient to form the target SQL.
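The erosion steps above can be sketched in a few lines. This is my own illustration rather than the authors' code; the function name and arguments are hypothetical, and the column-id remapping follows the <unk> rule described above:

```python
import random

def erode_schema(columns, sql_col_ids, p_drop=0.1, p_add=0.1, other_columns=()):
    """Hypothetical sketch of the Erosion noising step: remove, add, and
    permute schema columns, then remap the anonymized column ids that the
    target SQL refers to (dropped columns map to <unk>)."""
    cols = list(columns)
    # Removal: discard each column with probability p_drop.
    cols = [c for c in cols if random.random() > p_drop]
    # Addition: splice in columns taken from another schema.
    for c in other_columns:
        if random.random() < p_add:
            cols.append(c)
    # Permutation: rearrange the surviving columns.
    random.shuffle(cols)
    # Remap: SQL entities follow the new column order;
    # a column that was dropped becomes <unk> in the target.
    new_id = {name: i for i, name in enumerate(cols)}
    remapped = [f"<col{new_id[columns[i]]}>" if columns[i] in new_id else "<unk>"
                for i in sql_col_ids]
    return cols, remapped
```

With `p_drop=1.0` every column referenced by the SQL maps to `<unk>`, which is exactly the "throw an unknown exception" behavior the model is trained to produce.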

Shuffle

Shuffle keeps the schema sequence S fixed and scrambles the entity mentions in the user question Q and the target SQL Y. The task is for the model to restore the correct order of Q and Y.

As shown in Figure 2 (b), the original Q and Y are:

```
SELECT '<col0>' FROM 'table' WHERE '<col4>' = '53,677'
Which week had an attendance of 53,677
```

After the Shuffle, they become:

```
SELECT '53,677' FROM 'table' WHERE '<col0>' = '<col4>'
Which 53,677 had an week of attendance
```

One of the model's training objectives is to restore such out-of-order Q and Y to their normal order.

The goal of recovering the scrambled entity order is to train the model to capture the intrinsic relationships between entities, thereby improving schema-linking performance. Notably, Q and Y participate in the denoising task as self-supervised targets and are trained separately. Although distinguishing the value entities in the question relies on the value entities in the SQL statement, using only the column entities is enough to obtain decent performance. And since no parallel data is required, an additional monolingual corpus of SQL queries and questions could help the reordering task, which the authors mention as a direction for future work.
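The shuffle objective above can be sketched as follows. Again this is my own minimal illustration, not the paper's code; the token/flag representation is an assumption for clarity. Only the positions flagged as entity mentions are permuted, while all other tokens stay in place:

```python
import random

def shuffle_entities(tokens, is_entity):
    """Hypothetical sketch of the Shuffle noising step: permute only the
    entity-mention positions of a token sequence, leaving the remaining
    tokens (keywords, function words) untouched."""
    idx = [i for i, flag in enumerate(is_entity) if flag]  # entity positions
    perm = idx[:]
    random.shuffle(perm)
    out = list(tokens)
    for src, dst in zip(idx, perm):
        out[dst] = tokens[src]  # move each entity to a shuffled position
    return out
```

The model is then trained to map the shuffled sequence back to the original one, which forces it to learn which entity belongs in which slot.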

Proposed framework

The paper adopts a Seq2seq Transformer architecture in which the decoder uses a Hybrid Pointer-Generator Network, i.e., a pointer network. Its advantage is that tokens can be copied directly from the input sequence to the output, which solves OOV problems (for details, see blog.csdn.net/qq_38556984…). The scores over the vocabulary and the unnormalized attention scores over the input tokens are concatenated, and a softmax is applied to obtain the probability distribution, which yields the final output.
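The pointer mechanism described above can be written down concretely. This is a minimal sketch of the "concatenate, then softmax" step only (the function name is mine, and real implementations work on batched tensors with learned attention):

```python
import math

def pointer_generator_probs(vocab_logits, attn_scores):
    """Sketch of the hybrid pointer mechanism: vocabulary logits and
    unnormalized attention scores over the input tokens are concatenated
    and passed through a single softmax, so the model can either generate
    a vocabulary token or copy an input token."""
    scores = list(vocab_logits) + list(attn_scores)
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    n = len(vocab_logits)
    return probs[:n], probs[n:]  # (generate-from-vocab, copy-input-token)
```

Because the two score sets share one softmax, generation and copying compete directly in a single distribution, which is what lets the decoder emit rare schema or value tokens verbatim from the input.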

Clause-sensitive EG Decoding

After adopting a full Seq2seq architecture, the original execution-guided (EG) decoding technique needs to be modified before it can be applied. The authors adapt it into clause-sensitive EG decoding; since this part is not the focus of the paper, it is not covered in detail here.

Experiments

With the improved EG decoding, SeaD achieves the current (and possibly long-standing) SOTA performance, with the following results:

Due to errors in the WikiSQL dataset itself, further significant performance improvements are unlikely.

Conclusion

This post introduced SeaD, the latest method on the WikiSQL dataset, which models WikiSQL text-to-SQL as a Seq2seq problem. Borrowing the training ideas of BART and MASS, it trains the model with the Erosion and Shuffle denoising operations and achieves SOTA results.

Because the WikiSQL dataset is relatively simple, involving only the most basic conditional elements of SQL statements, its practical applicability is limited. Since 2020, the leaderboard for the dataset has exceeded 90%, essentially solving the text-to-SQL problem on it. Researchers are increasingly turning to the more complex and general Spider dataset. The next blog post will introduce Spider, currently the most widely studied dataset in the text-to-SQL field.