
Introduction:

IRNet is one of the early research methods targeting the Spider benchmark, published at ACL 2019. Compared with the previous best model, IRNet combined with BERT improved exact-match (EM) accuracy from 27.2% to 54.7% in one stroke.

Introduction to IRNet

IRNet is used to solve two challenges:

  1. The context information provided by the NL question alone is not sufficient to resolve the mismatch problem. As shown in Figure 1, the column used in the GROUP BY clause of the SQL query is student_id, which is never mentioned in the question. The root cause of this problem is that SQL is designed to make queries efficient to execute, not to represent intent.
  2. Lexical problems caused by a large number of out-of-domain (OOD) words. In the cross-domain setting, approximately 35% of the words in Spider's validation set do not appear in the training set.

To this end, IRNet defines a set of context-free grammar (CFG) rules and introduces an intermediate representation (IR) called SemQL, which encodes SQL as a syntax tree. Instead of an end-to-end task, the whole problem is broken down into three sub-tasks:

  1. Schema linking between the question and the database schema;
  2. Generating SemQL with a grammar-based neural model;
  3. Inferring the final SQL query from SemQL using domain knowledge.

The contributions of this paper are as follows:

  • IRNet outperforms the previous best baseline on the Spider dataset by 19.5 percentage points, reaching 46.7% exact-match accuracy and topping the Spider leaderboard at the time;
  • Adding BERT further improves IRNet's accuracy to 54.7%.

In addition, the paper demonstrates that other text-to-SQL methods, such as SQLNet, TypeSQL, and SyntaxSQLNet, can be greatly improved by learning to compose SemQL queries instead of composing SQL queries directly. This suggests that establishing an IR is effective and a promising direction for tackling complex cross-domain text-to-SQL tasks.

Methods

The Intermediate Representation

Inspired by Lambda DCS, the authors designed a tree-structured text-to-SQL intermediate representation named SemQL. The nature of the mismatch problem is that the implementation details of the SQL query (such as the HAVING and GROUP BY clauses) are not specified in the NL question, so it is natural to hide these implementation details in SemQL.

As shown in Figure 2, SemQL is defined by a context-free grammar. SemQL queries are tree-structured; the figure below shows the SemQL tree representation of the SQL statement in Figure 1.
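A context-free grammar of this kind can be sketched as a mapping from nonterminals to their alternatives. The rules below are only an illustrative subset written from the description above, not the complete grammar of Figure 2; the nonterminal names (Z, R, Select, A, C, T) loosely follow the paper's notation.

```python
# Illustrative subset of SemQL-style production rules (not the full grammar).
# Terminals such as C (column) and T (table) have no expansion here.
SEMQL_RULES = {
    "Z": ["R", "intersect R R", "union R R", "except R R"],  # root of the query
    "R": ["Select", "Select Filter", "Select Order"],        # clause combinations
    "Select": ["A", "A A"],                                  # one or more projected items
    "A": ["none C T", "count C T", "max C T", "min C T"],    # aggregator + column + table
}

def expand(symbol):
    """Return the alternatives for a nonterminal, or the symbol itself if it is terminal."""
    return SEMQL_RULES.get(symbol, [symbol])
```

A derivation starts from Z and repeatedly expands nonterminals until only columns and tables remain, which is exactly the tree structure the decoder later builds rule by rule.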

Schema Linking

The goal of schema linking in IRNet is to identify the columns and tables mentioned in the question and to assign a type to each column based on how it is mentioned. The paper first defines three types of entities that may appear in an NL question: table, column, and value. It then enumerates all n-gram phrases of length at most 6 in the question and matches them against the schema by string comparison. Phrases that match a column are labeled exact match or partial match, depending on whether the match is complete.
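The n-gram enumeration and the exact/partial distinction can be sketched as follows. This is a minimal simplification of the procedure described above (plain substring matching, no lemmatization or value linking), with function names of my own choosing:

```python
def ngrams(tokens, max_n=6):
    """Enumerate all n-grams of length <= max_n, longest first."""
    for n in range(min(max_n, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            yield tokens[i:i + n]

def link_columns(question_tokens, columns):
    """Label question phrases that match a column as 'exact' or 'partial'."""
    cols = {c.lower() for c in columns}
    links = []
    for gram in ngrams(question_tokens):
        phrase = " ".join(gram).lower()
        if phrase in cols:                                        # complete match
            links.append((phrase, "exact"))
        elif any(phrase in c or c in phrase for c in cols):       # overlap only
            links.append((phrase, "partial"))
    return links
```

For example, linking the question "show the name of all students" against columns `name` and `student id` marks "name" as an exact match.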

Model

The paper proposes a neural model for synthesizing SemQL queries that takes the question, the database schema, and the schema-linking results as input. Its architecture is shown in Figure 4.

NL Encoder

The NL Encoder encodes the question, augmented with the auxiliary type information obtained from schema linking (the exact or partial n-gram matches against columns).

Schema Encoder

The Schema Encoder is used to encode the database schema information. Let $s=(c,t)$ denote the schema, where $c=\{(c_1,\phi_1),\ldots,(c_n,\phi_n)\}$ represents all columns together with their type information $\phi_i$, and $t$ represents all tables.
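The schema input s = (c, t) can be written down concretely as a small container; the field names and the example values below are my own illustration, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Schema:
    """The schema encoder's input s = (c, t)."""
    columns: list  # pairs (c_i, phi_i): column name and its type annotation
    tables: list   # table names t

# Hypothetical example in the spirit of the paper's running student/pets example.
s = Schema(
    columns=[("name", "text"), ("student_id", "number")],
    tables=["student", "has_pet", "pets"],
)
```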

The Schema Encoder takes the entire $s$ as input and outputs the representations of all columns, $E_c$, and of all tables, $E_t$.

Decoder

The decoder produces the SemQL output. It is a grammar-based decoder; at each time step it takes one of three actions:

  • ApplyRule(R): apply a production rule R to the current partial tree of the SemQL query;
  • SelectColumn(C): select a column C from the database schema;
  • SelectTable(T): select a table T from the database schema.
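The three actions above can be represented as plain action records, so that a SemQL derivation is just a sequence of them (the class layout is my own sketch; the rule strings are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ApplyRule:
    rule: str    # a SemQL production rule, e.g. "R -> Select"

@dataclass
class SelectColumn:
    column: str  # a column name from the schema

@dataclass
class SelectTable:
    table: str   # a table name from the schema

# A toy derivation: expand the grammar, then ground it in the schema.
actions = [
    ApplyRule("Z -> R"),
    ApplyRule("R -> Select"),
    SelectColumn("name"),
    SelectTable("student"),
]
```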

IRNet also designs a memory-augmented pointer network for selecting columns during synthesis. Unlike a vanilla pointer network, when selecting a column it first decides whether to select from memory. The motivation comes from the authors' observation that a vanilla pointer network tends to select the same column repeatedly; deciding first whether to draw from memory alleviates this.
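The effect of the memory can be illustrated with a much-simplified sketch. IRNet's actual mechanism is a learned decision over memory versus schema; here I only mimic the resulting behavior by masking previously selected columns, which is an assumption-laden simplification rather than the paper's model:

```python
def select_column(column_scores, memory):
    """Pick a column from pointer-network scores, avoiding repeats.

    column_scores: dict mapping column name -> score.
    memory: set of columns selected at earlier decoding steps (mutated in place).
    A vanilla pointer network would simply take the argmax over all columns;
    masking the memory keeps the decoder from re-selecting the same field.
    """
    fresh = {c: s for c, s in column_scores.items() if c not in memory}
    pool = fresh if fresh else column_scores  # fall back if everything was used
    choice = max(pool, key=pool.get)
    memory.add(choice)                        # remember the selection
    return choice
```

With scores {"a": 0.9, "b": 0.5}, two successive calls select "a" and then "b", whereas an unmasked argmax would select "a" twice.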

Experiments

The experimental results of IRNet on Spider are as follows:

The authors also modified several other neural models to predict SemQL instead of SQL. The results show that introducing SemQL brings a clear performance improvement to the earlier SQLNet, TypeSQL, and SyntaxSQLNet methods.

Conclusion

This article introduced IRNet, an approach that brings the intermediate representation SemQL into the text-to-SQL task. It represents a significant improvement over the previous baselines on the Spider dataset. The authors also retrofitted SemQL into several earlier baseline methods and obtained performance gains for each, which supports the value of such an intermediate representation. The next blog post will introduce several approaches that solve text-to-SQL with graph neural networks (GNNs).