This article was submitted to the “Good Writing” challenge (back-end and big front-end dual-track submission).

1: Background

Crowd-profiling, offline data warehouse, and report-analysis systems all involve large volumes of SQL that must be entered, validated, compressed, and analyzed, and data-warehouse lineage must be maintained automatically. We need automated tools to verify, parse, and maintain this SQL so that the data entering the warehouse is accurate.

2: Common solutions

1: Alibaba Druid SQL formatting and parsing
<dependency>
   <groupId>com.alibaba</groupId>
   <artifactId>druid</artifactId>
   <version>1.2.6</version>
</dependency>

Alibaba Druid's package is small but capable: its JDBC connection pool performs well and its ecosystem of supporting tools is comprehensive. (Spring Boot officially recommends HikariCP as its default pool; pick whichever JDBC pool you prefer.)

SQL formatting supports many data-source dialects and is very thorough. A single line of code does the job:

SQLUtils.format(sql, JdbcConstants.HIVE);
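As a rough illustration of the clause-per-line, upper-cased-keyword layout such a formatter produces, here is a toy sketch. This is not Druid's actual algorithm, and the class and method names are invented for illustration:

```java
public class FormatSketch {
    // Toy illustration only: put each major clause on its own line and
    // upper-case its keyword. Druid's real formatter is far more complete
    // (it tokenizes the SQL instead of doing string replacement).
    static String toyFormat(String sql) {
        String formatted = sql;
        for (String kw : new String[]{"select", "from", "where"}) {
            formatted = formatted.replace(kw, "\n" + kw.toUpperCase());
        }
        return formatted.trim();
    }

    public static void main(String[] args) {
        System.out.println(toyFormat("select id, name from user_info where id = 1"));
    }
}
```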

SQL parsing is more powerful still: it identifies how each table is used (read or written) and supports multiple data-source dialects. The relevant code is as follows:

/**
 * Parse a SQL statement and extract its source and sink table names.
 * @param sql the SQL to parse
 * @return a map with keys "sourceTableList" and "sinkTable"
 */
public static Map<String, Object> sqlParser(String sql) {
    Map<String, Object> resultMap = Maps.newHashMap();
    SQLStatementParser parser = new HiveStatementParser(sql);
    SQLStatement statement = parser.parseStatement();
    HiveSchemaStatVisitor visitor = new HiveSchemaStatVisitor();
    statement.accept(visitor);
    Map<TableStat.Name, TableStat> tableOpt = visitor.getTables();
    Set<String> tableSet = new HashSet<>();
    String sinkTable = null;
    for (TableStat.Name tableStatKey : tableOpt.keySet()) {
        if (null != tableStatKey) {
            String key = tableStatKey.toString();
            // Only consider fully qualified names (db.table)
            if (key.contains(".")) {
                String operation = visitor.getTableStat(key).toString();
                if (TableStat.Mode.Insert.name().equals(operation)) {
                    sinkTable = key;   // table being written
                } else {
                    tableSet.add(key); // tables being read
                }
            }
        }
    }
    resultMap.put("sourceTableList", tableSet);
    resultMap.put("sinkTable", sinkTable);
    return resultMap;
}
2: sql-table-name-parser
<dependency>
    <groupId>com.github.mnadeem</groupId>
    <artifactId>sql-table-name-parser</artifactId>
    <version>0.0.5</version>
</dependency>

Before handing SQL to this parser, the core preprocessing step is to strip \r\n line breaks and collapse runs of spaces into a single space:

/**
 * Normalize a SQL string: trim, lower-case, and collapse all
 * whitespace (including line breaks) into single spaces.
 * @param sql the raw SQL
 * @return the normalized SQL, or null when the input is blank
 */
public static String formatSql(String sql) {
    if (StringUtils.isBlank(sql)) {
        return null;
    }
    String formatSql = sql.trim().toLowerCase();
    // Replace line breaks with a space (not ""), otherwise tokens on
    // adjacent lines would be glued together; then collapse runs of
    // spaces into one. Note: "+" on its own is not a valid regex.
    formatSql = formatSql
            .replaceAll("\r\n", " ")
            .replaceAll("\r", " ")
            .replaceAll("\n", " ")
            .replaceAll(" +", " ");
    return formatSql;
}
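To see the normalization in action, here is a minimal stand-alone sketch of the same regex chain (the class and method names are illustrative, not from the library):

```java
public class NormalizeSketch {
    // Same normalization as formatSql above: trim, lower-case,
    // turn line breaks into spaces, and collapse space runs.
    static String normalize(String sql) {
        return sql.trim()
                  .toLowerCase()
                  .replaceAll("\r\n", " ")
                  .replaceAll("\r", " ")
                  .replaceAll("\n", " ")
                  .replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("SELECT  *\r\nFROM   t\nWHERE x = 1"));
    }
}
```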

sql-table-name-parser's recognition is weak: it cannot distinguish the sink table, and it does not handle WITH clauses and similar constructs. I work around these limitations as follows:

private static final String WITH_PREFIX = "with ";

private static final String SELECT_PREFIX = " select ";

private static final String INSERT_INTO_PREFIX = "insert into";

/**
 * Extract source and sink table names from a JSON array of SQL statements.
 * @param formatSql a JSON array of SQL strings
 * @return a map with keys "sourceTableList" and "sinkTable"
 */
public static Map<String, Object> getTableNameMap(String formatSql) throws Exception {
    Map<String, Object> resultMap = Maps.newHashMap();

    if (StringUtils.isBlank(formatSql)) {
        return resultMap;
    }
    List<String> sqlList = JSON.parseArray(formatSql.toLowerCase(), String.class);

    Set<String> tableSet = new HashSet<>();
    String sinkTable = null;
    for (String sql : sqlList) {
        sql = formatSql(sql);
        if (sql.startsWith(WITH_PREFIX)) {
            throw new Exception("sql format not support with keyword sql");
        }
        // "insert into table ..." cannot be resolved by the parser,
        // so rewrite it as "insert overwrite"
        if (sql.startsWith(INSERT_INTO_PREFIX)) {
            sql = sql.replaceAll(INSERT_INTO_PREFIX, "insert overwrite ");
        }
        TableNameParser tableNameParser = new TableNameParser(sql);
        tableSet.addAll(tableNameParser.tables());

        // The sink table is the one mentioned before the first " select "
        int selectIndex = sql.indexOf(SELECT_PREFIX);
        if (selectIndex < 0) {
            continue; // no select clause, nothing to split on
        }
        String sinkSql = sql.substring(0, selectIndex);
        Collection<String> sinkTables = new TableNameParser(sinkSql).tables();

        if (CollectionUtils.isNotEmpty(sinkTables) && StringUtils.isEmpty(sinkTable)) {
            sinkTable = Lists.newArrayList(sinkTables).get(0);
            tableSet.remove(sinkTable);
        }
    }
    resultMap.put("sourceTableList", tableSet);
    resultMap.put("sinkTable", sinkTable);
    return resultMap;
}
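The sink-table trick above (look only at the text before the first " select ") can be illustrated with plain JDK string handling. The regex below is a simplified, hypothetical stand-in for TableNameParser and only handles the "insert overwrite table db.name" shape:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinkTableSketch {
    // Simplified stand-in for TableNameParser: grab the qualified name
    // right after "insert overwrite table". Real SQL needs a real parser.
    private static final Pattern SINK =
            Pattern.compile("insert overwrite table\\s+([\\w.]+)");

    static String findSinkTable(String sql) {
        int selectIdx = sql.indexOf(" select ");
        if (selectIdx < 0) {
            return null; // no select clause, nothing to split on
        }
        // Only the text before the first select can name the sink table
        Matcher m = SINK.matcher(sql.substring(0, selectIdx));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findSinkTable(
            "insert overwrite table dw.user_profile select id from ods.user"));
    }
}
```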

3: Summary

SQL formatting and SQL parsing are widely used in big-data scenarios. Settling on a solid underlying standard first makes it easier to introduce other components later. Comparing the two options, Alibaba Druid is the recommended choice overall. If you know of better components, you are welcome to share them with us.