Use MaxCompute to build a database

Abstract: MaxCompute is a data warehouse solution independently developed by Alibaba. In addition to features, performance, simplicity and other advantages, products can also save a lot of money. Ink weather with MaxCompute, in addition to improved performance and stability, the overall storage and computation costs are 70% less than before. How does this work? There are some common rules.

Original link:click.aliyun.com/m/43695/

MaxCompute is a data warehouse solution developed by Alibaba. In addition to features, performance, simplicity and other advantages, products can also save a lot of money. Ink weather with MaxCompute, in addition to improved performance and stability, the overall storage and computation costs are 70% less than before. How does this work? There are some common rules.

Before we talk about how to do that, let’s look at how MaxCompute is charged. According to the current document, the current billing method includes data storage, data download and computing fees. The calculation fee is divided into I/O post-paid and pre-paid two charging modes. I/O post-payment is charged according to the actual calculation. However, pre-payment is equivalent to the purchase of several CU can only be used by the buyer, these several CU calculation will not incur computational costs. So how to reduce costs translates into how to reduce storage costs, how to reduce download costs, and how to reduce computing costs.

storage

MaxCompute uses column storage and compression. The underlying data storage mode does not need to be maintained by the user. However, users can reduce the storage cost by reducing the amount of data stored on MaxCompute. You can reduce storage costs by deleting unwanted data in a timely manner. A feature called lifecycle is recommended to recycle expired data periodically. After the life cycle is set, if data on a partition in a partitioned table does not change within a specified period of time, the system automatically deletes the partition as unnecessary. If the table is not partitioned, the entire table will be dropped if no data changes in the specified time. This feature can be used to configure some history tables. For example, if you do not want history records that are more than a year old, you can partition the table according to write time and set the lifetime to a year. When data is written, data is only written to the current partition. After a year, the previous data has not been written for a whole year, triggering the life cycle condition, and the background scheduled deletion of data in the corresponding partition.

In actual use, in addition to the extra storage costs caused by the failure to delete expired and unwanted historical records in time, more because of the data management is not in place, some unnecessary tables have not been deleted resulting in space occupation. Create table xx as SELECT (); Create table xx as SELECT (); But if you really need to find out which data needs to be deleted and which data is business data that can’t be moved, use list tables; Hundreds of thousands of forms are also very difficult to start. Keep layering of data warehouse, completes the naming conventions, for each layer according to the different life cycle of use need to configure the recovery strategy, with a good data management to ensure that the data warehouse tables stored in order, can not only save money for a store, also can be used in real time, quickly find need to watch, to facilitate data management and use. If some temporary tables are generated in the calculation process and later confirmed that they are not needed, they can be deleted at the end of the calculation. For details, please refer to the construction ideas of enterprise big data warehouse architecture based on AliYunshujia written by Yixiu.

download

Not all data downloads are charged, and in some cases they are free. Taking advantage of these rules and choosing the right download method can reduce the cost of downloading (and free downloads are also the fastest way to download). You can refer to the documentation for specific rules.

In addition, the cost of downloading, and download the size of the file is related. Reducing downloads is also a strategy to reduce costs by properly structuring tables and generating data in one table (or partition) at a time. For example, when you need to download the information of grade 3 students, it is better to download all the information of students and then perform filtering than to partition students according to grade first and then download only grade 3.

To calculate

You have to pay more attention to the calculation. But first let’s look at what the calculated costs are related to. I/O postpaid, now only charge SQL fees, so this article will only talk about SQL part. The cost of seeing SQL is based on the amount of data entered and the algorithmic complexity of SQL.

When computing engine is calculating, the partition field of the table can be set properly to reduce the data input. See here for partitioning information. When parsing execution plans, the computing engine can read only the data in the partition to reduce the input data if the query filtering criteria are partition fields.

In addition to reducing the amount of input data, reducing computational complexity is another approach. Optimization of algorithmic complexity generally involves optimization at the code level and sometimes needs to be combined with business. For example, there is no duplication of a certain field in the calculated data after some data preprocessing, so distinct or group by can be added one less time in this calculation. Optimizing SQL statements together with business logic at the code level to reduce computational complexity is a way to reduce computational costs.

Another way is to reduce the number of calculations. For example, some calculations need to use the data of a real-time table, and all of them are reprocessed after some filtering and summarizing, so it is completely possible to save the calculation process that can be reused in the middle, and do an appropriate amount of mild summary or even high summary in the data warehouse. Subsequent calculations can take the summary data for further calculations. In this way, although more data is stored in the middle, the subsequent calculation can be less logic. This is a space-for-time method often used in data warehouses. With the aforementioned lifecycle, you can achieve the greatest reduction in computing overhead at the expense of minimal storage overhead. Of course, reasonable use of this method can not only reduce calculation cost, but also reduce calculation time and improve efficiency by reducing repeated calculation.

In addition to the above mentioned, you can also consider the post-payment method to run the calculation, that is, the calculation cost is the annual package. Generally speaking, the following conditions can be considered to use postpaid

Most of the tasks are periodic tasks, and the number of tasks is relatively uniform. If a task is given priority to with accidental temporary query after paying the task may not apply the submit event of scattered, if is concentrated in a certain time point calculate together, also is not good After a period of trial operation, the task number has been basically can estimate, not a short-term more obvious increase or decrease If you do not satisfy these conditions, It is easy to appear in the peak buy CU is not enough, tasks appear queue; In the trough, resources cannot be used again, resulting in waste.

other

There are actually two very important questions about cost

1. How much do I spend every day and how can I optimize it? Take a look at the daily bill here and check how much computing, storage, and downloading costs each. Thus from the corresponding Angle to do optimization. In addition, you can also sort the amount in the table to see what the highest number of costs are, and see if you can make some targeted optimization.

2. How much will it cost after I submit this SQL? If you don’t know the cost, you can’t commit an SQL with confidence. In data development of big data development suite, running SQL will jump out of the consumption alert, so as to know in advance the cost will be incurred after submission. If you find that the amount is particularly high, consider making some optimizations before submitting to avoid unnecessary fees. If you want to run SQL directly in the MaxCompute client, you can use cost SQL; Estimate the cost of SQL.

Identify the qr code below to read more about dry goods

Related Posts

Interviewer: Brother, tell me the difference between basic type and packaging type

java.lang.IndexOutOfBoundsException

Take you through the high performance network Model Reactor and Proactor