Welcome toTencent Cloud + community, get more Tencent mass technology practice dry goods oh ~

This article was published in cloud + Community column by Tencent Cloud database TencentDB

Author introduction: Wang Xiaoyu, a member of Tencent database TDSQL team, is currently involved in the development of TDSQL database kernel.

Tencent and university cooperation papers into the database will be the top

Tencent TDSQL team cooperated with School of Information, Renmin University of China, School of Computer, Wuhan University, DEMO paper “MSQL+: A Plugin Toolkit for Similarity Search under Metric Spaces in Distributed Relational Database Systems was selected by VLDB 2018 admissions.

This paper designs a plug-in approximate query tool MSQL+ based on RDBMS. MSQL+ follows THE SQL standard and supports approximate query oriented to metric space (a more concise and general expression than text space, vector space, etc.). Relying on the distributed database system TDSQL, it realizes universal, easy to use and efficient similar query processing technology.

During the meeting, the team demonstrated MSQL+, a similar query tool based on Tencent distributed database TDSQL, which is used to realize similar queries in distributed system TDSQL. More computing functions are integrated in TDSQL database to give the database richer computing power.

The original paper can be found at www.vldb.org/pvldb/vol11… . The paper information is as follows:

Wei Lu, Xinyi Zhang, Zhiyu Shui, Zhe Peng, Xiao Zhang, Xiaoyong Du, Hao Huang, Xiaoyu Wang, Anqun Pan, Haixiang Li: MSQL+: a Plugin Toolkit for Similarity Search under Metric Spaces in Distributed Relational Database Systems. VLDB 2018 Demonstration

If you want to know more technical details, please refer to the following content (MSQL+ is mainly introduced in the background, functions, architecture and design) :

Thesis unscramble

The following focuses on the background, function, architecture and design of MSQL+. Please see www.vldb.org/pvldb/vol11 for the original paper. .

MSQL+ generates the background

Similar query is the basic operation of many database applications.

For example, similar queries play a significant role in text retrieval, spell checking, fingerprint authentication, face recognition and other scenarios.

So how do these applications perform similar queries? Given the object q and the set R, return the elements in R whose similarity differs from q within θ. The most direct way is to traverse R ∈R and calculate the similarity between R and Q. It is conceivable that this method is very inefficient.

Various fields have developed various ways to optimize the above similar query methods, but there are still the following problems:

1. Separation from the existing database system: existing similar query methods, a large number of new systems or new indexes are established to improve efficiency, such as M-tree, D-index, KD-Tree, etc. Although the performance is improved, it is difficult to integrate them into the existing RDBMS. Other methods implement similar queries based on B+ -Tree, but require new apis to be developed on existing RDBMS, and are inefficient. These methods lack unified standards, poor compatibility, each contact with new methods, have to pay extra learning costs.

2 limited data space and poor universality: many application scenarios have different definitions of “similarity”, different measurement dimensions and data types, making it difficult to establish a general similarity query model. With the help of custom pruning rules, performance of similar queries in specific scenarios is improved, but it is almost impossible to migrate to other application scenarios. As a basic operation, similar queries should be universal and guaranteed to perform well across different RDBMS applications.

3 only applicable to centralized systems, difficult to deal with “big data” scenarios: in the era of big data, it is the general trend to maintain increasing data by means of distributed systems. Unfortunately, existing approaches to similar queries do not support distributed systems well.

In order to avoid the above problems, MSQL+ is designed as: based on RDBMS, follow SQL standard, make use of distributed database, to achieve universal, easy to use, efficient. In the actual production system, MSQL+ relies on the distributed database TDSQL of Tencent, and effectively realizes the ideas and functions proposed in this paper.

MSQL+ main functions

MSQL+ consists of two modules:

1. Build indexes: MSQL+ generates comparable signatures for each data object, and establishes B+-tree indexes on the signatures. Objects whose Signature values are within the similarity range are used as candidates for similar queries.

2 Query processing: the user submits select-from-WHERE statement, which must provide two constraints: a) user-defined similarity function, b) similarity range, b) preliminary screening of candidates, and a) refining candidates and returning similar result sets.

Compared with existing similar query methods, MSQL+ has the following advantages:

1 based on the existing functions of RDBMS, using B+-tree index data, using select-from -WHERE statement similar query;

2 support wide data space: any type of data can be reasonably indexed (see below design), through a unified interface similar query;

3 Can be run in single machine and distributed RDBMS, relying on distributed relational database system TDSQL, can speed up the process of pretreatment and similar query.

MSQL+ design scheme

This section gives a brief introduction to the MSQL+ approximate query scheme. See the original paper for details.

1. Similarity Search in Metric Spaces

MSQL+ adopts divide-and-conquer strategy to divide the complete data set into multiple independent fragments, and each fragment selects several relatively similar candidates, which will be selected twice later.

How does MSQL+ divide data sets? The paper states that some objects in the data set are selected as Pivot (the strategy for electing pivot is described in the next section), and the remaining data objects are allocated to a unique pivot (i.e., the nearest pivot) according to a certain strategy, and these pivot and the assigned data objects form a shard. Thus, the complete dataset is divided into multiple disjoint smaller datasets, and similar candidates are screened within each fragment.

What are the rules for selecting candidates? Let’s start with an example: given an object Q and a data set R, a similar query returns data objects in R that differ from Q within θ. Screening for partition Pi r ∈ Pi, and | q, r | ≦ r theta object as a candidate.

Theorem 1* : *

Pi for partitions (its pivot for Pi), ∀ r ∈ Pi, | q, r | ≦ theta necessary condition is:

LBi = | PI, q | – theta ≦ | PI, r | ≦ | PI, q | + theta = UBi

Pivot selection is the basis of this process. How does MSQL+ select Pivot?

2. Pivot Selection

Choosing an appropriate pivot can speed up the process of selecting candidates and selecting result sets. Four pivot selection methods are proposed in this paper:

1Random: Pick an object randomly from the set R as pivot;

2MaxVariance: Select the group of objects with the largest variance from the set R as pivots;

3MaxProb: Pivot needs to meet the minimum number of candidates expected to be screened;

4Heuristic: A k-means-like heuristic algorithm is adopted that, on the whole, elements in each partition are as close to pivot as possible.

At this point, you can screen out relatively similar candidates, so how do you select a more similar result set from them?

3. Processing similarity queries in RDBMS

In order to quickly select the result set, MSQL+ builds a B+-tree index on the data set. The following two steps introduce how to build and use the index.

The paper makes A definition: A table stores data set R, and there are M attributes (M columns) on the table. Some attributes, as measures of similarity, are denoted as A:{A1, A2… , An} n≦M, for r∈R, r[A] represents the data r attribute {A1, A2… The value of An}.

3.1 the Index Building

A +-tree index {A1, A2… B+-tree index {A1, A2… , An} are all comparable, b) the candidate can be selected by comparing the values of each field of A. With this index, similar queries can be easily implemented. So how do you build such an index? The paper makes such a design:

For r ∈ r, A “Signature list” (Signature generation schema) recorded the Signature of r S (r) [A], [A] (r) = S < I, | r, Pi | >, which is A partition ID, I | r, Pi r | is partition data object and the gap between the pivot Pi, The signature comparison rule is as follows:

The original table storage data set (R) on the new column I record signing < I, | R, Pi | >, and set up on the I B + – tree index, the index meet the “comparable” and “comparative index to determine the candidate for” two elements, can be approximate query conveniently using this index.

3.2 Query Processing

Now that we have built a suitable B+-tree index, how can we use this index to select candidates?

DIST(r[A], q[A], θ). This function determines that the distance between r[A] and q[A] does not exceed θ. This design extends the data space and types supported by MSQL+. With DIST, the user enters a select-from-where statement like this:

SELECT R.A1,… ,R.An

FROM R

[A] WHERE DIST (r, q [A], theta)

The above SQL, directly from the data set R to precisely filter the result set, efficiency is worrying.

This is where candidates come in handy. Theorem 1 (see Similarity Search in Metric Spaces) describes how candidates can be screened, reducing the amount of data to be accurately screened and speeding up the sorting process. Combining theorem 1 and DIST, the user enters a select-from-WHERE statement like this:

SELECT R.A1,… ,R.An

FROM R, PivotsRangeSet PRS

WHERE I BETWEEN PRS.LB and PRS.UB AND

[A] DIST (r, q [A], theta)

The temporary table PivotsRangeSet maintains the LU and UB of each pivot. Because of the PivotsRangeSet’s small size, the query optimizer will always scan the index for candidates and refine the result set with the DIST function.

MSQL+ Distributed architecture

MSQL+ works on both local and distributed RDBMSS. This paper presents the architecture of MSQL+ based on TDSQL.

1. System Architecture

1.1 introduce TDSQL

TDSQL is a high consistency, distributed database cluster solution launched by Tencent for financial online transaction scenarios, which can ensure high availability under strong consistency. It has a flexible global deployment architecture, achieves multiple performance improvement, enhances the original security mechanism of MySQL, and can be distributed in the horizontal direction. With automatic operation system and perfect supporting facilities.

TDSQL consists of the following key components:

1Routing Node: load balancer.

2ZooKeeprt: maintains system meta information, such as tables, indexes, and partitions.

3Global Executor: receives SQL, delivers local Executor, collects local results, and generates execution plans.

4Local executor: stores, fetches, and computes local data.

1.2 TDSQL gain

MSQL+ is a plug-in tool implemented by user-defined functions and stored procedures that can be seamlessly integrated into TDSQL.

How does MSQL+ work on TDSQL?

ZooKeeper maintains MSQL+ special meta-information and synchronizes it to each Local Executors. The Global Executor receives similar query requests, distributes them to each Local Executors for execution, collects the final result and presents an execution plan. The Local executor completes the similar query for Local shards and returns the result.

What benefits can TDSQL bring to MSQL+?

First of all, reliability and availability. TDSQL achieves strong consistency of multiple copies and maximizes the security, availability and reliability of a large number of sample data required by MSQL+.

Second, TDSQL supports horizontal distributed expansion, eliminating the worry of insufficient stand-alone storage capacity, no matter how big MSQL+ sample data, TDSQL can easily cope with.

TDSQL optimization in the security mechanism, to a large extent to ensure the security and confidentiality of MSQL+ sample data.

We are most concerned about the performance problem, from the perspective of distribution, TDSQL multiple local nodes parallel query, global similar query efficiency greatly improved; Specific to the local node, TDSQL has made a lot of optimization in the database kernel, so that the efficiency of single node has also been greatly improved.

2. Index Building

ZooKeeper maintains all pivot information and sends the Pivot information to the Local Executors by the Global Executor. The Global Executor coordinates the Local Executors to build the index. Each Local Executor maintains A certain number of partitions and corresponding Pivots. Based on these Pivots, the Local Executor generates the signature S(R [A]). The index is then built.

3. Query Processing

When a user initiates a similar query request, the Routing node selects a Global Executor. The Global Executor coordinates the Local Executors to run similar queries, collect local execution results, and generate an execution plan.

MSQL+ interface display

The operation interface shown in this paper is as follows. MSQL+ supports similar query, index construction, client connection, cluster management, data import, query status display, execution plan visualization and other functions.

Conclusion:

MSQL+ is a plug-in query tool based on RDBMS, based on Tencent TDSQL implementation, has the characteristics of general, easy to use, efficient: unified interface support a variety of data space; Follow the SQL standard, you can run the select-from-where command to complete similar query tasks. MSQL+ relies on TDSQL, a distributed database of Tencent, to achieve load balancing and multi-point parallel, and can efficiently complete similar queries.

Question and answer

PHP + MSQL + functions use requests in functions

reading

TDSQL participates in VLDB 2018 Review

Tencent cloud database MySQL game industry data security practice sharing

MySQL 8.0: Machine learning has changed from MySQL 8.0 to MySQL 8.0. Quick introduction to online advertising business and CTR knowledge

This article has been authorized by the author to Tencent Cloud + community, more original text pleaseClick on the

Search concern public number “cloud plus community”, the first time to obtain technical dry goods, after concern reply 1024 send you a technical course gift package!

Massive technical practice experience, all in the cloud plus community!