Abstract: As the core computing component of the Alibaba Cloud big data platform, MaxCompute handles most of the distributed computing workloads both inside and outside the Alibaba Group.


Activate the big data service for free: www.aliyun.com/product/odp…


Preface


As the core computing component of the Alibaba Cloud big data platform, MaxCompute handles most of the distributed computing workloads both inside and outside the Alibaba Group. MaxCompute SQL, as the main entry point for distributed data processing, provides strong support for fast and convenient processing and storage of EB-scale offline data. As big data businesses keep expanding, new data usage scenarios keep emerging, and the MaxCompute computing framework is evolving with them: the powerful computing capability that used to serve only internal, proprietary data formats is gradually being opened up to external data sources. In this article, we introduce how data in TableStore (OTS) is integrated into the MaxCompute computing ecosystem to achieve a seamless connection between multiple data sources.


For application scenarios such as online services, NoSQL KV stores (e.g., BigTable, HBase) offer flexible schemas, easy scaling, and strong real-time performance compared with traditional databases. Alibaba Cloud TableStore (OTS) is a large-scale NoSQL data storage service built on Alibaba's Apsara distributed system, providing massive KV data storage and real-time access. It is widely used by business units inside the Alibaba Group and across the Alibaba Cloud ecosystem outside. In particular, TableStore features such as row-level real-time updates and overwritable writes nicely complement the append-only batch operations of MaxCompute built-in tables. However, TableStore is primarily a storage service and lacks the capability for large-scale batch parallel computation over the massive data stored in it. Against this background, connecting the data link between MaxCompute and TableStore becomes particularly important.


1. Computing TableStore data with MaxCompute


1.0 Prerequisites and assumptions


1.0.1 Enabling the unstructured data processing features of the MaxCompute 2.0 computing framework


The unstructured data processing framework described in this article depends on MaxCompute 2.0, which is not enabled by default, so you need to apply for access to the new MaxCompute 2.0 unstructured data processing framework. In addition, the following flags must be set for each SQL query:


set odps.task.major.version=2dot0_demo_flighting;
set odps.sql.planner.mode=lot;
set odps.sql.ddl.odps2=true;
set odps.sql.preparse.odps2=lot;


These settings are not repeated in the examples below, but all of the features described in this article assume them. Once the 2.0 computing framework is fully launched, these special settings can be omitted.


1.0.2 TableStore basic concepts and network connectivity


If you are not familiar with TableStore or with KV tables in general, you can pick up the basic concepts (primary key, partition key, attribute column, etc.) from the TableStore documentation; this article does not explain these basics in depth. For an overview of the MaxCompute unstructured data framework, see the earlier article on how MaxCompute handles unstructured data stored on OSS, which covers the concepts of External Table and StorageHandler and the use of SERDEPROPERTIES.


MaxCompute and TableStore are two independent services (big data computing and big data storage, respectively), so network connectivity between them must be ensured. When the MaxCompute public cloud service accesses TableStore storage, it is recommended to use the TableStore private network address, that is, a host name ending with ots-internal.aliyuncs.com, for example tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com.


1.0.3 Mapping between TableStore types and MaxCompute types


TableStore and MaxCompute each have their own type system. When MaxCompute computes over TableStore data, the types map as follows:

MaxCompute Type    TableStore Type
STRING             STRING
BIGINT             INTEGER
DOUBLE             DOUBLE
BINARY*            BLOB


The BINARY type (marked with * above) is introduced in the new MaxCompute 2.0 type system, so using it requires an additional setting:


set odps.sql.type.system.odps2=true;


1.1 Using STS/RAM to authorize access to TableStore data


It is important to point out that the MaxCompute compute service needs a secure, authorized channel to access TableStore data. To solve this, MaxCompute combines Alibaba Cloud's access control service (RAM) and security token service (STS) to secure data access.
First, grant MaxCompute permission to access TableStore (OTS) in RAM. Log in to the RAM console, create a role named AliyunODPSDefaultRole, and set its policy content to:


{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "odps.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}


Then edit the role's authorization policy and grant the AliyunODPSRolePolicy permission to the role.


If these steps feel cumbersome, you can also log in with your Alibaba Cloud account and use the one-click authorization page to complete the setup.


1.2 Creating an External Table to associate MaxCompute with TableStore


MaxCompute connects to TableStore data the same way it connects to OSS data: the user introduces a description of a TableStore table into the MaxCompute meta system through a CREATE EXTERNAL TABLE DDL statement, after which it can be used like an ordinary table to process TableStore data. The following simple example illustrates the concepts involved in connecting ODPS (MaxCompute) to TableStore.


DROP TABLE IF EXISTS ots_table_external;
CREATE EXTERNAL TABLE IF NOT EXISTS ots_table_external
(
  odps_orderkey bigint,
  odps_orderdate string,
  odps_custkey bigint,
  odps_orderstatus string,
  odps_totalprice double
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'    -- (1)
WITH SERDEPROPERTIES (                                  -- (2)
  'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_custkey,o_orderstatus,o_totalprice', -- (3)
  'tablestore.table.name'='ots_tpch_orders'             -- (4)
)
LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com'; -- (5)


This DDL statement maps a TableStore table to a MaxCompute External Table; subsequent operations on the TableStore data can then be performed directly through this External Table. Several parts of this CREATE EXTERNAL TABLE DDL statement deserve a closer look:


1. com.aliyun.odps.TableStoreStorageHandler is the StorageHandler built into MaxCompute for processing TableStore data. It defines how MaxCompute and TableStore interact, and the related logic is implemented by MaxCompute.


2. SERDEPROPERTIES can be understood as an interface for providing parameter options. When TableStoreStorageHandler is used, two options must be specified: tablestore.columns.mapping and tablestore.table.name. Additional optional settings are mentioned in later examples.


3. tablestore.columns.mapping: a required option that describes the columns of the TableStore table that MaxCompute will access, including both primary key columns and attribute columns. Names starting with : denote TableStore primary keys, such as :o_orderkey and :o_orderdate in this example; all other columns are attribute columns. TableStore supports at least one and at most four primary keys, of type bigint or string, and the first primary key is the partition key. When specifying the mapping, the user must provide all primary keys of the TableStore table, whereas for attribute columns only the ones that MaxCompute needs to access have to be provided.


4. tablestore.table.name: the name of the TableStore table to be accessed. If the specified TableStore table does not exist, an error is reported; MaxCompute does not create the TableStore table.


5. The LOCATION clause specifies the TableStore connection information, including the instance name and endpoint. Note that secure access to the TableStore data here relies on the RAM/STS authorization described earlier.
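As a further illustration, the hedged sketch below (the table, column, and instance names are hypothetical, not from the example above) shows how an external table with a BINARY column, mapping to a TableStore BLOB attribute as listed in section 1.0.3, might be declared; it assumes the odps.sql.type.system.odps2 flag has been set:

-- Hypothetical sketch only: names are illustrative.
set odps.sql.type.system.odps2=true;
CREATE EXTERNAL TABLE IF NOT EXISTS ots_blob_external
(
  odps_rowkey  bigint,  -- maps to the TableStore INTEGER primary key
  odps_payload binary   -- maps to a TableStore BLOB attribute column
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping'=':rowkey,payload',
  'tablestore.table.name'='my_blob_table'
)
LOCATION 'tablestore://my-instance.cn-shanghai.ots-internal.aliyuncs.com';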


1.3 Accessing TableStore data through the External Table for computation


Once the External Table is created, TableStore data can be processed just like ordinary MaxCompute table data, for example:


SELECT odps_orderkey, odps_orderdate, SUM(odps_totalprice) AS sum_total
FROM ots_table_external
WHERE odps_orderkey > 5000
  AND odps_orderkey < 7000
  AND odps_orderdate >= '1996-05-03'
  AND odps_orderdate < '1997-05-01'
GROUP BY odps_orderkey, odps_orderdate
HAVING sum_total > 400000.0;


As you can see, this example uses the familiar MaxCompute SQL syntax directly; all the details of accessing TableStore are handled internally by MaxCompute. This includes the mapping of column names: for instance, odps_orderkey and odps_totalprice in the query correspond to the TableStore columns o_orderkey and o_totalprice, a mapping we already defined in the External Table DDL. Of course, each user controls this mapping, so it is also perfectly fine to keep the original TableStore primary key and column names when creating the External Table.
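As an illustration, a variant of the earlier DDL that simply keeps the original TableStore column names could look like the following sketch (the external table name is hypothetical; the same instance and TableStore table as above are assumed):

CREATE EXTERNAL TABLE IF NOT EXISTS ots_orders_samenames
(
  o_orderkey    bigint,
  o_orderdate   string,
  o_custkey     bigint,
  o_orderstatus string,
  o_totalprice  double
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_custkey,o_orderstatus,o_totalprice',
  'tablestore.table.name'='ots_tpch_orders'
)
LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com';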


Under the hood, the MaxCompute framework optimizes for the storage characteristics of TableStore data, including concurrent reads and pushing filter conditions from SQL statements down to the storage layer. For example, for a filter like the WHERE clause above, MaxCompute checks whether the filtered columns are primary keys of the TableStore table and, if so, uses the TableStore GetRange API to read only the minimum amount of data needed, rather than unconditionally reading all the data and filtering it afterwards on the MaxCompute compute nodes. These optimizations are transparent to the user; the MaxCompute computing framework is responsible for providing an efficient implementation.
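As a hedged illustration of why primary-key filters matter, the two queries below (using the columns of the example table above) differ in how much data MaxCompute has to pull from TableStore: the first constrains the primary key and can, in principle, be narrowed to a GetRange scan over a small key range, while the second filters only on an attribute column, so the rows must be read first and then filtered on the MaxCompute side.

-- Filter on the primary key odps_orderkey: can be narrowed to a key range read.
SELECT COUNT(*) FROM ots_table_external
WHERE odps_orderkey BETWEEN 5000 AND 7000;

-- Filter only on an attribute column: rows are read and then filtered on MaxCompute nodes.
SELECT COUNT(*) FROM ots_table_external
WHERE odps_totalprice > 100000.0;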


In addition, if the same data needs to be computed over multiple times, it is more efficient to import the required data into a MaxCompute internal table once, rather than repeatedly reading it from TableStore. For example:


CREATE TABLE internal_orders AS
SELECT odps_orderkey, odps_orderdate, odps_custkey, odps_totalprice
FROM ots_table_external
WHERE odps_orderkey > 5000;


internal_orders is now a regular MaxCompute table with all the features of an internal table: an efficient compressed columnar storage format and complete internal metadata and statistics. Because it is stored inside MaxCompute, accessing it is faster than accessing an external TableStore table, which makes it especially suitable for hot data that needs to be computed over many times.
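For instance, once the data is in internal_orders, repeated analyses read only the MaxCompute-internal copy without touching TableStore again; a simple follow-up query might look like this:

-- Repeated analysis now reads the MaxCompute-internal copy only.
SELECT odps_custkey, COUNT(*) AS num_orders, SUM(odps_totalprice) AS total_spent
FROM internal_orders
GROUP BY odps_custkey
ORDER BY total_spent DESC
LIMIT 100;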


2. Writing data from MaxCompute to TableStore


The interaction between MaxCompute and TableStore is bidirectional: MaxCompute can not only read TableStore data, but also write computation results back to TableStore. Using TableStore features such as real-time updates and single-row overwrites, the results of offline computation can be quickly fed back to online applications. This kind of output to TableStore can be implemented with the INSERT OVERWRITE statement of MaxCompute SQL.


Note again that MaxCompute does not actively create the external TableStore table, so before exporting data you must make sure that the target table has already been created on TableStore. This is by design: creating a TableStore table may involve CU settings, billing, and data time-to-live options that must be decided by the owner of the data, and MaxCompute, which does not own the external data, cannot make these choices.


Continuing the example above, suppose we have already created the external table ots_table_external with the DDL statement above to link MaxCompute to the TableStore table ots_tpch_orders, and that we now want to write the data stored in the MaxCompute table internal_orders back to TableStore. This is done with an INSERT OVERWRITE TABLE statement:


INSERT OVERWRITE TABLE ots_table_external
SELECT odps_orderkey, odps_orderdate, odps_custkey,
       CONCAT(odps_custkey, 'SHIPPED'), CEIL(odps_totalprice)
FROM internal_orders;


Here the data is processed a little before being written back to TableStore. For a NoSQL KV store like TableStore, the output from MaxCompute only affects the rows whose primary keys appear in the output: in this example, only the rows matching the written odps_orderkey + odps_orderdate primary key values are touched. Moreover, on those TableStore rows only the attribute columns specified when creating the External Table (ots_table_external) are updated; columns that do not appear in the External Table are left unchanged.
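As a hedged sketch of this behavior, one could define a narrower external table (the name ots_orders_status_only is hypothetical) over the same ots_tpch_orders table that maps only the o_orderstatus attribute column; writing through it would then update only that column on the matched rows, leaving the other attribute columns untouched:

-- Hypothetical narrower external table: only o_orderstatus is mapped.
CREATE EXTERNAL TABLE IF NOT EXISTS ots_orders_status_only
(
  odps_orderkey    bigint,
  odps_orderdate   string,
  odps_orderstatus string
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_orderstatus',
  'tablestore.table.name'='ots_tpch_orders'
)
LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com';

-- Writing through it updates only the o_orderstatus attribute on the matched rows.
INSERT OVERWRITE TABLE ots_orders_status_only
SELECT odps_orderkey, odps_orderdate, 'ARCHIVED'
FROM internal_orders;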


3. Technical details and advanced usage


MaxCompute tables are strictly structured and require all rows to follow a consistent schema, whereas TableStore stores semi-structured NoSQL KV data. These differences in the underlying data formats determine how the two systems behave when they interact. At the same time, as a distributed computing system, MaxCompute reads and writes TableStore concurrently, which requires a mechanism for splitting TableStore data. By default, MaxCompute uses what the system deems the most appropriate way to access and compute over TableStore data, and these defaults satisfy most users' needs; however, MaxCompute also exposes additional configuration options for advanced users with special requirements who know both systems well.


3.1 Relationship between MaxCompute external tables and TableStore tables


A MaxCompute external table has a many-to-one (N:1) relationship with a TableStore table; that is, multiple MaxCompute external tables can describe the same TableStore table. This N:1 relationship has two dimensions (see the sketch after this list):
  • Different MaxCompute external tables can describe different subsets of the attribute columns of a TableStore table. For example, a TableStore table with 3 primary key columns may have up to 20 attribute columns; a MaxCompute external table must still provide all 3 primary key columns, but it may expose only a single attribute column, in which case all operations through that external table are based only on the primary keys and that one attribute column.
  • Different MaxCompute external tables can describe different ranges of a TableStore table. In practice, additional options can be used to specify the range start and range end of the data to cover.
This feature is not covered in detail here; contact the MaxCompute technical team if you need it.
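Below is a hedged sketch of the first dimension: a second external table (the name ots_orders_price_only is hypothetical) that describes the same ots_tpch_orders table as ots_table_external from section 1.2, but exposes only the primary keys and the o_totalprice attribute column:

-- A second, narrower external table over the same TableStore table (N:1 relationship).
CREATE EXTERNAL TABLE IF NOT EXISTS ots_orders_price_only
(
  odps_orderkey   bigint,
  odps_orderdate  string,
  odps_totalprice double
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_totalprice',
  'tablestore.table.name'='ots_tpch_orders'
)
LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com';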


3.2 Concurrency when MaxCompute reads TableStore data


TableStore is a distributed KV storage system in which each data table may be stored across multiple back-end servers and partitioned according to the partition key; the partitioning strategy is determined by TableStore. Currently, when MaxCompute reads TableStore data, the default concurrency equals the number of partitions on the TableStore back end. The only exception is when the partition key is of type INTEGER64 and the TableStore table has more than one partition, in which case MaxCompute automatically adjusts the degree of parallelism upward to read the data with higher concurrency. In addition, TableStore itself keeps evolving and will provide more powerful APIs for MaxCompute to use; MaxCompute will then be able to split the data precisely according to back-end data sizes and control more accurately the amount of data processed, and the compute time spent, by each concurrent MaxCompute worker. These capabilities will be described in a later update once they are available.


Finally, if the user understands the data stored in TableStore well, for instance the amount of data in different key ranges, MaxCompute also provides options that let the user specify the read concurrency, down to specifying which key range each worker should process. Users with this requirement can contact the MaxCompute technical team.


3.3 Concurrency when MaxCompute writes TableStore data


When writing MaxCompute internal data out to TableStore, the concurrency is automatically chosen by MaxCompute based on the amount of data. Users can also manually adjust the number of mappers/reducers used during SQL execution to tune the concurrency, but in most cases the concurrency chosen by MaxCompute is reasonable, and manually setting the mapper/reducer counts is not recommended. In addition, in some scenarios the throughput of MaxCompute and TableStore needs to be matched: MaxCompute can generally schedule a very large number of compute nodes, but when many nodes write to TableStore simultaneously, the network bandwidth can become the bottleneck, and simply increasing the number of MaxCompute nodes writing concurrently will not yield additional speed. Therefore, for particularly large-scale writes, it is best to communicate with the TableStore service in advance to ensure that TableStore can provide enough throughput.
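If manual tuning is really required, a hedged sketch is shown below. The flag names odps.sql.mapper.split.size and odps.sql.reducer.instances are general MaxCompute SQL settings rather than anything specific to TableStore, and should be verified against the MaxCompute documentation for your version; as noted above, the defaults are usually reasonable.

-- Hedged sketch: coarse concurrency tuning via general MaxCompute SQL flags
-- (verify the exact flag names and units against your MaxCompute version's documentation).
set odps.sql.mapper.split.size=512;   -- larger input split size (MB) => fewer mappers
set odps.sql.reducer.instances=32;    -- cap the number of reducers writing to TableStore

INSERT OVERWRITE TABLE ots_table_external
SELECT odps_orderkey, odps_orderdate, odps_custkey,
       CONCAT(odps_custkey, 'SHIPPED'), CEIL(odps_totalprice)
FROM internal_orders;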


3.4 Network connectivity when MaxCompute accesses TableStore


Because MaxCompute and TableStore are two separate cloud services, network connectivity between different deployment clusters may affect whether MaxCompute can reach TableStore data. For the concepts of TableStore nodes, instances, and service addresses, refer to the TableStore documentation. As described above, when the MaxCompute public cloud service accesses TableStore storage, it is recommended to use the TableStore private network address (a host name ending with ots-internal.aliyuncs.com).


4. Conclusion: building a big data ecosystem


With the launch of the MaxCompute unstructured data processing framework, MaxCompute has opened up its interfaces for handling external data, including OSS data as described earlier and TableStore data as described in this article. We hope to bring Alibaba Cloud computing and data together into one ecosystem: in various projects we have already seen the great potential of MaxCompute for processing massive unstructured data such as videos and images on OSS. With the addition of TableStore support, we expect that bringing computation to even more data will open up more application scenarios, so that OSS data, TableStore data, and MaxCompute internal data can all generate greater value in the MaxCompute core computing engine.

