Brief introduction: Since its founding, the company has had a big data team and a big data platform, running a self-built Cloudera Hadoop cluster on cloud servers. However, with the rapid expansion of the company's Internet finance business, the big data team shoulders more and more responsibilities: real-time data warehousing, log analysis, ad hoc query, data analysis and other business demands all put the capacity of the Cloudera Hadoop cluster to the test. To relieve the pressure on the Cloudera cluster, we took stock of our own business situation and built a data lake on Alibaba Cloud suited to our current reality.

1. Shuhe Technology

Founded in August 2015, Shuhe Technology is a Series C fintech company jointly invested in by Focus Media, Sequoia Capital, Sina and others. The company's vision is to be an intelligent financial steward that accompanies users throughout their lives; adhering to the values of openness, challenge, professionalism and innovation, it aims to let everyone enjoy the best financial services solutions. Its main products are Huibei and Latte Investment, which mainly provide credit, wealth management, e-commerce and other services to its 80 million registered users. As a representative fintech enterprise in China, Shuhe Technology has taken the lead in applying big data and AI to intelligent customer acquisition, intelligent risk control, intelligent operations, intelligent customer service and other areas. To date, the company has cooperated with more than 100 financial institutions, including banks, trusts, licensed consumer finance companies, funds and insurers.

2. Self-built CDH on cloud

Since its inception, the company has had a big data team and a big data platform built on another cloud vendor. We purchased EC2 instances from that vendor and built our own Cloudera Hadoop cluster on top of them. In the early days, this Cloudera Hadoop cluster was only used for the T+1 offline data warehouse. At midnight, after the business day's cutoff, we used Sqoop to extract the full or incremental data of the business databases into the Hadoop cluster, ran a series of ETL cleaning jobs in Hive on the offline warehouse, and then either emailed the resulting data to leadership to support decisions, pushed it to a reporting database for Tableau dashboards, or inserted it into the business databases for the business systems to use. However, with the rapid expansion of the company's Internet finance business, the big data team took on more and more responsibilities: real-time data warehousing, log analysis, ad hoc query, data analysis and other business demands all put the capacity of the Cloudera Hadoop cluster to the test. To meet the real-time warehouse requirement we installed HBase on the Cloudera cluster; for log analysis we installed Flume and Kafka; for ad hoc query we installed Presto; for data analysis we installed Jupyter. Every additional business requirement was a huge challenge to the stability of the existing system.
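
As an illustration of the early T+1 extraction step, here is a minimal, hedged sketch of invoking Sqoop from Python to pull a business table into the Hive ODS layer after the daily cutoff; the connection string, table name and partition value are hypothetical placeholders.

```python
# Hypothetical T+1 extraction wrapper: pull yesterday's business table into ODS.
import subprocess

biz_date = "2019-06-30"  # the business date (T-1), normally computed at run time
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://biz-rds-host:3306/loan",
    "--username", "etl_reader",
    "--password-file", "/user/etl/.sqoop_pwd",   # keep credentials off the command line
    "--table", "loan_apply",
    "--hive-import",
    "--hive-table", "ods.loan_apply",
    "--hive-partition-key", "dt",
    "--hive-partition-value", biz_date,
    "--num-mappers", "4",
], check=True)
```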

Cloudera cluster

On top of the growing business requirements, the organization became more complex, headcount grew, and the volume of data of every kind increased exponentially; the disadvantages of the Cloudera cluster surfaced and it became increasingly unable to withstand these challenges.

  • Poor scalability

Expanding the cluster requires operating Cloudera Manager, which demands certain skills from operations staff and carries operational risk. In addition, for an emergency or temporary large-scale expansion, a large number of EC2 machines must first be purchased and then joined to the cluster through a series of complex operations, and later released through another series of complex operations; these online operations cause great trouble for the stability of the cluster's online business.

  • High cost

On the storage side, we did not anticipate how fast data volume would grow. We used three-replica HDFS storage in the Cloudera cluster and configured SSD disks on the EC2 machines, and the weekly data backups also consumed a lot of disk, so disk cost stayed high. On the compute side, compute resources were insufficient for the many jobs at night and sat idle for the few jobs during the day; this mismatch between demand and resources leads to wasted cost.

  • Difficult to upgrade the cluster

We run Cloudera 5.5.1. For the sake of stable operation we have not dared to upgrade it for several years, and building a new Cloudera cluster and migrating to it would cost a lot of manpower and resources, so this old version has stayed in service. Compatibility constraints mean we either cannot use new open source components or must refactor them at great effort, which blocks the introduction of new technology.

  • High maintenance threshold

Building a Cloudera cluster and maintaining it afterwards place high technical demands on operations staff, and troubleshooting real problems demands even more. In addition, Cloudera Manager is not open source and the Cloudera community is not very active, which adds to the burden of cluster operations.

  • Poor cluster disaster tolerance

For data disaster tolerance, the three HDFS replicas cannot span availability zones; for service disaster tolerance, service nodes cannot be deployed across availability zones. An availability-zone failure therefore affects the stability of the entire cluster.

3. Hybrid architecture on the cloud

To relieve the pressure on the Cloudera cluster, we decided to migrate part of the workload to the cloud vendor's managed products, gradually forming a hybrid architecture on the cloud.

  • We built several EMR clusters on the cloud, split by service and function

These cloud EMR clusters share storage and metadata. However, because the EMR Hive version and the Cloudera Hive version were incompatible, the metadata could not be unified, and two sets of metadata formed: Cloudera Hive and EMR Hive. These EMR clusters relieved some of the pressure on the Cloudera cluster.

  • To further relieve the pressure on Cloudera, we designed Chive, an EMR Hive hybrid architecture

The Chive architecture points EMR Hive's metadata at Cloudera Hive, which amounts to using Cloudera HDFS for storage while using EMR compute resources. This Hive hybrid architecture also greatly reduces the pressure on the Cloudera cluster.

  • Separation of hot and cold data

Hot data stays in HDFS on the Cloudera cluster, while cold data is written via Cloudera Hive to an S3 bucket, where lifecycle rules move it into cold storage on a regular schedule.

Hybrid architecture on the cloud

Through the practice of the hybrid architecture on the cloud, a prototype of a big data lake had in fact taken shape. As the company migrated from that cloud vendor to Alibaba Cloud, we wanted to land a data lake on Alibaba Cloud suited to our current reality.

4. Aliyun first-generation data lake


4.1 What is a data lake

A data lake is a centralized repository that allows you to store all structured and unstructured data at any scale. You can store data as it is, without having to structure it first, and then analyze it with different types of engines, including big data processing, visualization, real-time analytics and machine learning, to guide better decisions. Data lakes compared with data warehouses:

  • Data — data warehouse: relational data from transactional systems, operational databases and line-of-business applications; data lake: non-relational and relational data from IoT devices, websites, mobile applications, social media and enterprise applications.
  • Schema — data warehouse: designed before the warehouse is implemented (schema-on-write); data lake: written at analysis time (schema-on-read).
  • Price/performance — data warehouse: fastest query results at higher storage cost; data lake: query results getting faster using low-cost storage.
  • Data quality — data warehouse: highly curated data that can serve as the basis for important facts; data lake: any data, curated or not (for example raw data).
  • Users — data warehouse: business analysts; data lake: data scientists, data developers and business analysts (using curated data).
  • Analytics — data warehouse: batch reporting, BI and visualization; data lake: machine learning, predictive analytics, data discovery and analysis.

Basic elements of a data lake solution

  • Data movement

A data lake lets you import any amount of data acquired in real time. You can collect data from multiple sources and move it into the data lake in its raw form. This process scales to data of any size while saving the time of defining data structures, schemas and transformations.

  • Securely store and catalog data

Data lakes allow you to store both relational and non-relational data, and to understand the data in the lake by crawling, cataloging and indexing it. Finally, the data must be secured so that your data assets are protected.

  • Analytics

Data lakes allow a variety of roles in an organization, such as data scientists, data developers, and business analysts, to access data through their choice of analysis tools and frameworks. This includes open source frameworks such as Apache Hadoop, Presto, and Apache Spark, as well as commercial products from data warehousing and business intelligence vendors. Data lakes allow you to run analytics without having to move data to a separate analytics system.

  • Machine learning

A data lake allows an organization to generate different types of insight, including reporting on historical data, machine learning (building models to forecast likely outcomes) and suggesting a set of prescribed actions to achieve the best result. Following this definition and these basic elements, we landed a first-generation data lake on Alibaba Cloud suited to our current reality.

4.2 Aliyun Data Lake Design

4.2.1 Overall architecture of Aliyun Data Lake

Overall architecture of Ali Cloud Data Lake

A Virtual Private Cloud (VPC) is a custom private network created by the user on Alibaba Cloud. Different VPCs are logically isolated from each other at layer 2, and users can create and manage cloud product instances such as ECS, load balancers and RDS inside their own VPC. We place the company's workloads in two VPCs: the business VPC and the big data VPC. The data-extraction EMR pulls data from the RDS, OSS and Kafka of the business VPC and lands it as the ODS layer in the data lake OSS. The core data warehouse EMR runs T+1 ETL over the ODS layer and generates the CDM (common dimensional model) layer and the ADS (data mart) layer for the other big data EMRs and the business EMRs. The following sections introduce our solutions and practices on the Aliyun data lake.

4.2.2 Unified storage and metadata management

Unified storage means that the data lake storage sits on the OSS object store and is shared by several EMR clusters. Alibaba Cloud Object Storage Service (OSS) is a massive, secure, low-cost and durable cloud storage service whose data durability is designed to be no less than twelve nines. OSS offers platform-independent RESTful APIs, so any type of data can be stored and accessed from any application, anytime, anywhere, and its APIs, SDKs and migration tools make it easy to move large amounts of data into or out of OSS. Once data is in OSS, Standard storage can serve as the main storage class, while Infrequent Access, Archive and Cold Archive offer cheaper, longer-lived options for rarely accessed data. These characteristics make OSS well suited to data lake storage. Unified metadata means that the components of the several EMRs using the data lake, such as Hive, Ranger and Hue, share one set of metadata. ApsaraDB RDS (Relational Database Service) is a stable, reliable and elastically scalable online database service. Built on Alibaba Cloud's distributed file system and high-performance SSD storage, it lets you quickly set up a stable and reliable database service; compared with a self-built database it is inexpensive and easy to use, with flexible billing, on-demand sizing, out-of-the-box high performance, a high-availability architecture, multiple disaster recovery options and strong security. It is therefore also well suited to serving as the unified metadata store.

4.2.3 Design of multi-EMR and multi-OSS buckets

On top of the unified OSS storage and unified metadata architecture, we designed a framework of multiple EMR clusters and multiple OSS buckets.

Multi-EMR multi-OSS bucket design on data lake

The extraction EMR extracts the business RDS data into the data lake T+1; the core EMR then runs a series of layered ETL operations to generate the CDM common dimensional layer; the business EMRs run ETL on the CDM layer to generate ADS mart-layer data; and EMR Presto runs ad hoc queries over the CDM and ADS data. A business EMR mainly provides ad hoc query and DAG scheduling services. Users can only submit their ad hoc queries and scheduled tasks to their own department's EMR, and we use YARN queue resources to separate the resources of the two kinds of tasks.

Services provided by a business EMR

4.2.4 Distributed scheduling system design

Airflow is a platform to programmatically author, schedule and monitor workflows. Based on directed acyclic graphs (DAGs), Airflow can define a group of tasks with dependencies and execute them in dependency order. It provides rich command-line tools for administration, and its web UI makes it easy to control and schedule tasks and monitor their status in real time, which simplifies operating and managing the system. Airflow runs a number of daemons that together provide all of its functionality: the web server, the scheduler, the execution units (workers), the queue-monitoring tool Flower, and so on. These daemons are independent of one another: none depends on or is aware of the others, and each only handles the tasks assigned to it at run time. Based on this characteristic, we built a highly available, distributed Airflow scheduling cluster on the data lake.

Airflow Distributed Scheduling System on Data Lake

To make it easy to run tasks on each EMR, we deploy the Airflow workers on the EMR Gateways, because a Gateway already has the client commands and configuration for all of the components deployed on its EMR. We can increase the number of worker daemons on a single node to scale a worker vertically, or add more Gateways (one worker per Gateway) to scale horizontally, both of which raise the cluster's task concurrency. In practice, to improve concurrency and reduce the load on any single Gateway, we configure two Gateways with Airflow workers for the core warehouse cluster and the extraction cluster, which submit jobs at high concurrency. In the future we plan to run the Airflow master on two nodes to remove the master's single point of failure.
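
As an illustration, the sketch below is a minimal Airflow DAG (Airflow 2.x imports assumed) that pins a T+1 ETL task, via a Celery queue, to the worker running on a particular EMR Gateway; the queue name, JDBC address and HQL path are hypothetical placeholders.

```python
# Hypothetical daily ETL DAG routed to the core-warehouse Gateway's worker queue.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "bigdata",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="ods_to_cdm_daily",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="30 0 * * *",   # T+1 run shortly after the daily cutoff
    catchup=False,
) as dag:
    build_cdm = BashOperator(
        task_id="build_cdm_layer",
        queue="core_dw_gateway",      # consumed only by the workers on that Gateway,
                                      # so its beeline client and configs are available
        bash_command=(
            "beeline -u jdbc:hive2://emr-core-gateway:10000 "
            "-f /opt/etl/cdm_daily.hql"
        ),
    )
```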

4.2.5 User privilege system design

The user permission system is always at the core of architecture design. We designed a three-layer permission system on the data lake: the first layer is RAM access control, the second layer is access to the EMR execution engines, and the third layer is access to the big data interactive analysis platform.

Three-layer user permission system on the data lake

The first layer is Resource Access Management (RAM), an Alibaba Cloud service for managing user identities and access to resources. RAM lets you create and manage multiple identities under one Alibaba Cloud account and grant different permissions to a single identity or a group of identities, so that different users have different access rights to resources. Each EMR is bound to an ECS application role, and each role can only access the corresponding OSS buckets in the data lake. The second layer is access to the EMR execution engines, including HiveServer2, Presto, Spark and others. First, note that authentication verifies that the identity a user presents is genuine, while authorization verifies whether that identity is allowed to perform an operation. HiveServer2 supports several authentication methods: NONE, NOSASL, KERBEROS, LDAP, PAM, CUSTOM and so on; authorization can use Hive's own permission model or open source components such as Ranger or Sentry. Through Presto's Hive connector, Presto and Hive can share the same permission system, and with support from Alibaba Cloud's EMR big data team the Spark client can use this permission system as well. In the end we use EMR OpenLDAP to store user and group information and EMR Ranger as the centralized permission-management framework. The users and groups in EMR OpenLDAP are synchronized with the company's AD: information about new hires and departures in AD is synced to EMR OpenLDAP in T+1 mode.
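
As an illustration of the second layer, here is a minimal sketch (using the PyHive client, which is an assumption rather than our exact tooling) of an ad hoc query authenticating against HiveServer2 with an OpenLDAP account; the host, account and table names are hypothetical placeholders.

```python
# Hypothetical LDAP-authenticated HiveServer2 session; Ranger then authorizes
# (and, for sensitive columns, masks) whatever the query returns.
from pyhive import hive

conn = hive.connect(
    host="emr-header-1.cluster",   # HiveServer2 endpoint (or the unified HAProxy address)
    port=10000,
    username="zhang.san",          # OpenLDAP account synced from AD
    password="********",
    auth="LDAP",                   # HiveServer2 LDAP authentication
    database="cdm",
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM cdm.dim_user LIMIT 10")
for row in cursor.fetchall():
    print(row)
```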

OpenLDAP and Ranger user rights management system

The third layer is access to the big data interactive analysis platform. We built a unified, Hue-based interactive query system for big data; by restricting which EMR the query system can reach, users can only access their own department's EMR. Together, the three layers of the permission system cover essentially all of our data-access scenarios.

4.2.6 EMR elastic scaling design

EMR's elastic scaling lets you set scaling policies according to business requirements and strategy. Once elastic scaling is enabled and configured, EMR automatically adds Task nodes when business demand rises to maintain computing power and automatically removes Task nodes when demand falls to save costs. We run a large number of EMR clusters on the data lake, and thanks to elastic scaling we can save costs and improve execution efficiency while still meeting business requirements; this is one of the biggest advantages big data on the cloud has over a traditional self-built IDC cluster. We set a number of scaling rules, mainly following the principle that the threshold for scaling out should be lower than the threshold for scaling in.

4.2.7 Load balancing management

An EMR cluster is stateless and can be created and destroyed at any time, but creating and destroying clusters must not affect the stability of the service interfaces exposed to the outside, so we designed a unified service layer in front of the EMR clusters on the data lake. HAProxy is a free, fast and reliable solution that provides high availability, load balancing and proxying for TCP and HTTP applications. We use HAProxy's layer-4 (TCP/UDP) load balancing to provide unified services: concretely, HAProxy proxies each EMR's HiveServer2, ResourceManager, Hive Metastore and Presto HTTP interfaces, and we enable HAProxy's include support to load multiple module configuration files, which makes maintenance and restarts easier.

4.2.8 OSS bucket life cycle management

Compared with the data in the other warehouse layers, the ODS-layer data is not reproducible (the business RDS databases are purged regularly, so the ODS layer also serves as the data backup). We put the ODS data in a multi-version (versioned) bucket, which, like taking regular HDFS snapshots on the self-built Cloudera Hadoop cluster, provides regular data backups. We therefore need to set a lifecycle on the ODS bucket to keep the ODS-layer data safe while keeping data growth under control.

Lifecycle settings for the ODS multi-version bucket

The Hadoop HDFS file system has a trash mechanism that collects deleted data into a recycle bin so that important files deleted by mistake can be recovered. HDFS creates a recycle bin per user at /user/<username>/.Trash/, where that user's deleted files and directories land, and after a retention period (fs.trash.interval) HDFS deletes the data permanently. Under the data lake architecture the recycle-bin directory sits on the OSS bucket and HDFS does not clean these garbage files up on schedule, so we set an OSS lifecycle rule (delete data older than 3 days) to remove them regularly.
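
As an illustration, here is a hedged sketch, using the Alibaba Cloud oss2 Python SDK, of the lifecycle rule described above: expire objects under an HDFS trash prefix after 3 days. The bucket name, endpoint and trash prefix are hypothetical placeholders, and in practice one rule per trash prefix (or a broader prefix) would be configured.

```python
# Hypothetical lifecycle rule: purge HDFS trash objects on OSS after 3 days.
import oss2
from oss2.models import BucketLifecycle, LifecycleExpiration, LifecycleRule

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "dw-cdm-bucket")

trash_rule = LifecycleRule(
    "expire-hdfs-trash",               # rule id
    "user/hadoop/.Trash/",             # trash prefix of one user (placeholder)
    status=LifecycleRule.ENABLED,
    expiration=LifecycleExpiration(days=3),
)
bucket.put_bucket_lifecycle(BucketLifecycle([trash_rule]))
```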

Trash directory lifecycle settings

4.2.9 Log management

Log Service (SLS) is a one-stop service for log data. Without any development, users can collect, consume, ship, query and analyze logs, which improves operations efficiency and provides the capacity to process massive logs in the DT era. Because the EMR component logs are deleted periodically, the components' historical logs need to be collected in one place to support later troubleshooting, and SLS fits the scenario of collecting logs from multiple EMRs on a data lake. We collect the common logs of the EMR components into SLS.

4.2.10 Terminal permission management

Developers need to have login permissions for a specific EMR instance to facilitate development operations.

Terminal privilege management

Terminal logins work as shown above: through the company's bastion host, a user logs in to a specific Linux jump server in the big data VPC and from there logs in to the EMR instances. Operators in different roles have specific login permissions; big data operations staff can use the unified key pair to log in to any instance of an EMR Hadoop cluster as root, or log in to any EMR instance after switching to the hadoop account.

4.2.11 Component UI management

As shown above, the Knox addresses are not easy to remember, so we used the Cloud DNS product. Alibaba Cloud DNS is a secure, fast, stable and scalable authoritative DNS service that lets enterprises and developers translate easy-to-manage, recognizable domain names into the numeric IP addresses used for communication, routing users' visits to the right website or application server. By creating alias (CNAME) records that point easy-to-remember domains at the Knox domains, we solved this problem nicely.

4.2.12 Monitoring and alarm management

EMR APM provides EMR users, especially cluster operators, with a complete set of tools for monitoring the cluster, its services and the overall state of jobs, and for locating and resolving cluster problems. A common example is the YARN-HOME dashboard, where you can see the history of elastically scaled instances.

EMR APM dashboard: YARN-HOME chart

The YARN-QUEUE dashboard shows the historical resource usage and task execution of each queue.

EMR APM dashboard: YARN-QUEUE chart


CloudMonitor is a service that monitors Alibaba Cloud resources and Internet applications. It can collect Alibaba Cloud or user-defined monitoring metrics, probe service availability and raise alerts on metrics, giving you a comprehensive view of resource usage, business status and health on Alibaba Cloud and notifying you of anomalies in time so applications keep running smoothly. We route the alerts of the core components of the EMRs on the data lake into CloudMonitor, which notifies the people responsible by phone, DingTalk and email.

4.2.13 Ad hoc query design

Ad hoc query capability is a test of a data lake's ability to deliver data. Our unified big data interactive query platform supports two execution engines: HiveServer2 and Presto. By restricting the query entry points, users can only submit ad hoc queries on their own department's EMR. Because Presto's compute resources would contend with Hadoop's YARN resources, we set up a separate EMR Presto cluster to provide a unified Presto ad hoc query service.
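
As an illustration, here is a minimal sketch of an ad hoc query submitted to the dedicated Presto cluster using the presto-python-client; the coordinator address, catalog and table are hypothetical placeholders, and in our setup the entry point would be the unified HAProxy address.

```python
# Hypothetical ad hoc query against the dedicated EMR Presto cluster.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.cluster",
    port=8080,
    user="zhang.san",
    catalog="hive",    # Presto's Hive connector shares the unified Hive metastore
    schema="ads",
)
cur = conn.cursor()
cur.execute(
    "SELECT dt, count(*) FROM ads.loan_apply_detail "
    "GROUP BY dt ORDER BY dt DESC LIMIT 7"
)
print(cur.fetchall())
```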

Ad hoc query design on the data lake

Beyond meeting users' basic ad hoc query needs, the unified query platform also implements a number of customized features:

  • Integration with the company's work-order approval system
  • Monitoring and reminders for component service status
  • Interoperability between HiveSQL syntax and PrestoSQL syntax
  • Metadata display, including sample data, data lineage, scheduling information, statistics and so on

4.2.14 Cluster security group design

An ECS security group is a virtual firewall with stateful inspection and packet filtering, used to delimit security domains in the cloud. A security group is a logical grouping of instances in the same region that have the same security requirements and trust each other. Every EMR on the data lake must be bound to a specific security group in order to provide services, and we assign different security groups to the different instance groups of the big data clusters.

4.2.15 Data desensitization design

Sensitive data mainly includes customer data, technical data, personal information and other high-value data. It exists in different forms across the big data warehouse, and leaking it would cause serious financial and brand damage to the company. EMR Ranger supports data masking for Hive: the results returned by SELECT are desensitized so that sensitive information is hidden from the user. However, this Ranger feature currently only covers the HiveServer2 scenario, not Presto.

We scan the data lake for sensitive fields according to preset rules, with minute-level incremental scans and day-level full scans. The scan results are written into the Ranger metadata database through Ranger's masking REST API. When a user's ad hoc query goes through HiveServer2 and hits a sensitive field, only the first few characters of the field are shown in plain text and the remaining characters are masked with x.
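
As an illustration, here is a hedged sketch of how a scanner could push a column-masking policy to Ranger over its public REST API after finding a sensitive column; the Ranger service name, database/table/column and group are hypothetical placeholders, and the exact policy fields depend on the Ranger version.

```python
# Hypothetical Ranger masking policy: show the first 4 characters, mask the rest.
import requests

policy = {
    "service": "emr-hive",                 # Ranger Hive service name (assumed)
    "name": "mask-ods_user-id_card_no",
    "policyType": 1,                       # 1 = data masking policy
    "resources": {
        "database": {"values": ["ods"]},
        "table":    {"values": ["ods_user"]},
        "column":   {"values": ["id_card_no"]},
    },
    "dataMaskPolicyItems": [{
        "accesses": [{"type": "select", "isAllowed": True}],
        "groups": ["analysts"],
        "dataMaskInfo": {"dataMaskType": "MASK_SHOW_FIRST_4"},
    }],
}

resp = requests.post(
    "http://ranger-admin:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "********"),            # Ranger admin credentials
)
resp.raise_for_status()
```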

Ranger data masking

4.2.16 YARN queue design

As noted earlier, a business EMR mainly provides ad hoc query and DAG scheduling services, and users can only submit these to their own department's EMR. We set up separate YARN queues so the two kinds of workload do not compete for resources; a sketch of routing a session to a specific queue follows.
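
The sketch below (again assuming a PyHive client) shows how an ad hoc session could be steered to a dedicated YARN queue so it does not compete with scheduled DAG jobs; the queue, host and table names are hypothetical placeholders.

```python
# Hypothetical ad hoc session pinned to the "adhoc" YARN queue.
from pyhive import hive

conn = hive.connect(
    host="emr-business-a.cluster",
    port=10000,
    username="li.si",
    password="********",
    auth="LDAP",
    # Session-level Hadoop/Hive settings: run this session's jobs in the ad hoc queue.
    configuration={"mapreduce.job.queuename": "adhoc"},
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM ads.repay_summary WHERE dt = '2021-11-01'")
print(cur.fetchone())
```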

4.3 EMR management of data lake

EMR governance plays an important role in data lake governance and covers stability governance, security governance, execution-efficiency governance and cost governance.

4.3.1 Adjust the EMR pre-scale-out time

The data warehouse's midnight T+1 jobs have timeliness requirements, so sufficient compute resources must be ready in advance when the midnight warehouse jobs start. Owing to limitations of EMR's current elastic-scaling architecture, graceful decommissioning means scale-out and scale-in cannot run in parallel.

  • On the condition that the midnight warehouse jobs are not affected, postpone the pre-scale-out time as far as possible, scheduling it through the EMR OpenAPI. With the graceful-decommission parameters available, the pre-scale-out time was pushed back from 22:00 to 23:30 for the time being.
  • Watch the task-execution monitoring and bring the scale-in time forward as far as possible. By checking the EMR APM dashboards and observing task execution times, we advanced the time at which the elastic lower bound is restored from 10:00 to 6:00. With this optimization, the average number of online nodes between 22:00 and 10:00 dropped from 52 to 44.

4.3.2 Change the EMR elastic scaling policy

Elastic scaling lets you set scaling policies according to business requirements and strategy. Once enabled and configured, EMR automatically adds Task nodes when business demand rises to guarantee computing power and automatically removes Task nodes when demand falls to save costs. Task nodes can be paid for as subscription (monthly/annual), pay-as-you-go or preemptible (spot) instances. With fully elastic scaling we use preemptible instances as much as possible; see Alibaba Cloud's "EMR Elastic Low-Cost Offline Big Data Analysis Best Practices" for reference.

  • This scheme balances the computing capacity, cost and stability of elastic scaling: it prefers preemptible instances and falls back to pay-as-you-go instances only when the availability zone's ECS inventory is low.

Elastic scaling configuration

  • Availability-zone migration: inventories differ between zones, so we try to deploy or migrate EMR clusters to well-stocked zones so that preemptible instances can be used to cut costs as much as possible.
  • Night and day workloads have different characteristics: night jobs are mainly scheduled tasks using the DW queue, while daytime jobs are mainly ad hoc queries using the DEFAULT queue. We refresh the queue resources on a schedule so that queue resources are used effectively rather than wasted. After this series of optimizations, EMR cluster cost fell by about one fifth.

4.3.3 Optimize EMR cloud disk space

EMR elastic instances can use cloud disks, which come in three types: efficient cloud disks, SSD and ESSD.

  • ESSD cloud disk: an ultra-high-performance disk built on the new generation of distributed block storage. Combined with a 25 GE network and RDMA, a single disk delivers up to one million random read/write IOPS with low single-path latency. Recommended for large OLTP databases, NoSQL databases and ELK distributed logging.
  • SSD cloud disk: a high-performance disk with stable, high random read/write performance and high reliability. Recommended for I/O-intensive applications, small and medium relational databases and NoSQL databases.
  • Efficient cloud disk: a cost-effective disk with medium random read/write performance and high reliability. Recommended for development and test workloads and for system disks.

For price/performance we currently choose ESSD cloud disks, and we size the number and capacity of the data disks on elastic instances according to the daily cloud-disk monitoring of the elastic nodes.

4.3.4 Selection of EMR machine groups

A business EMR mainly provides ad hoc query and DAG scheduling services. Elastic scaling suits the DAG scheduling scenario but not ad hoc queries, which are short and frequent, so we also keep a fixed number of reserved Task instances, for which subscription payment is appropriate. We therefore set up two Task machine groups: a subscription Task group, which mainly serves ad hoc queries, and a pay-as-you-go elastic Task group, which serves DAG scheduled tasks.

4.3.5 EMR cost control

In our company's spend breakdown, ECS accounts for a large share of total cost, and EMR elastic instances account for most of the ECS spend, so we need to watch the EMR bill closely to control cost effectively. Using the bill-detail subscription service, we call SubscribeBillToOSS to export the Alibaba Cloud billing detail data to OSS and load it into a big data Hive table, and after a series of ETL we compute a daily cost report for each EMR. EMR cost mainly consists of subscription instance cost, pay-as-you-go instance cost, preemptible instance cost, cloud disk cost and reserved-instance-coupon deductions. Alibaba Cloud supports tagging resources for cost allocation, and we tag the EMR clusters to split the bill among the business clusters; see the best practice "single-account separate accounting for enterprises" (https://bp.aliyun.com/detail/…). From the bill we found that the cost of EMR-A's 30 machines was out of proportion to that of EMR-B's 50 machines. Analyzing the cost composition showed that EMR-A sits in a resource-scarce availability zone and uses a large number of pay-as-you-go instances plus reserved instance coupons, while EMR-B sits in a resource-rich zone and uses a large number of preemptible instances; pay-as-you-go plus reserved coupons costs far more than preemptible instances. In addition, we compute the cost of each SQL on an EMR to push the business to optimize large SQLs and retire useless ones: we pull the MemorySeconds metric from the ResourceManager and compute SQL fee = MemorySeconds of the SQL / total MemorySeconds of the EMR × total EMR fee.
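
As an illustration of the per-SQL cost formula above, here is a hedged sketch that pulls each finished application's memorySeconds from the YARN ResourceManager REST API and apportions a day's EMR fee by its share of the total; the ResourceManager address and the fee value are hypothetical placeholders (in practice the fee comes from the bill-detail Hive table).

```python
# Hypothetical per-job cost allocation:
# fee = job memorySeconds / total memorySeconds * total EMR fee.
import requests

RM = "http://emr-header-1:8088"
EMR_DAILY_FEE = 1800.0  # placeholder daily fee for this EMR, in yuan

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "FINISHED"})
apps = (resp.json().get("apps") or {}).get("app") or []

total_mem_seconds = sum(a["memorySeconds"] for a in apps)
for a in sorted(apps, key=lambda a: a["memorySeconds"], reverse=True)[:20]:
    share = a["memorySeconds"] / total_mem_seconds if total_mem_seconds else 0.0
    print(f'{a["id"]}\t{a.get("name", "")[:60]}\t{share * EMR_DAILY_FEE:.2f}')
```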

4.3.6 Purchase reserved instance (RI) coupons

A reserved instance coupon is a discount coupon that can offset the bills of pay-as-you-go instances (preemptible instances excluded) and can also reserve instance resources. Compared with subscription instances, the combination of reserved instance coupons and pay-as-you-go instances balances flexibility and cost. Reserved instance coupons come at region level and zone level: a region-level coupon matches pay-as-you-go instances across the availability zones of the specified region, while a zone-level coupon only matches pay-as-you-go instances in the same zone. They support three payment options, all upfront, partial upfront and no upfront, each with its own billing standard. Since our elastic strategy is preemptible instances first with pay-as-you-go instances as the fallback, we bought some cross-zone, no-upfront reserved instance coupons to offset the pay-as-you-go instances used in elastic scaling. The figure below shows the daily usage of the reserved instance coupons.

As can be seen, the utilization of the two reserved instance coupons for the ECS specification was 0% and 62.5% respectively, short of the expected 100%. The reason is that the underlying resources later switched from pay-as-you-go to preemptible instances, and reserved instance coupons do not cover preemptible instances. Overall, using reserved instance coupons saves about 40 percent compared with pure pay-as-you-go. For more details, see "Full-Link Use Practices for RI and SCU".

Elasticity assurance

Elasticity assurance provides 100% resource certainty for day-to-day elastic resource demand under flexible payment. For a low assurance fee, you get guaranteed resources for a fixed term (from one month to five years). Attributes such as availability zone and instance type are set when the assurance is purchased, and the system reserves matching resources in a private pool; pay-as-you-go instances created against that private pool are guaranteed to succeed. We know that Alibaba Cloud resources become tight for a period around Double Eleven, and some of the company's T+1 tasks are extremely important, so to guarantee elastic resources at low cost during Double Eleven we bound some of the important EMRs on the data lake to elasticity-assurance private pools, ensuring those EMRs can obtain elastic resources during that period.

4.4 Data lake OSS management

The sections above covered governance of the execution engine (EMR) on the data lake; the sections below cover governance of the storage medium (OSS).

4.4.1 Data warehouse ODS multi-version bucket management

Versioning is a data-protection feature applied at the level of an OSS bucket. With versioning enabled, overwrites and deletions are saved as historical versions, so after an object is overwritten or deleted by mistake you can restore it to a historical version at any time. On the self-built Cloudera Hadoop we used HDFS snapshots to protect data; in the data lake architecture we use OSS versioning to protect the data on the lake. OSS also supports lifecycle rules that automatically delete expired files and fragments, or transition expired files to Infrequent Access or Archive storage, to save storage cost. We therefore also set a lifecycle on the multi-version bucket to save cost: keep the current version and automatically delete historical versions after 3 days.
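
As an illustration, here is a hedged sketch (Alibaba Cloud oss2 SDK, assuming a version with versioning support) of the protection described above: enable versioning on the ODS bucket, then expire non-current (historical) versions after 3 days; the bucket name and endpoint are hypothetical placeholders.

```python
# Hypothetical ODS bucket protection: versioning on, historical versions kept 3 days.
import oss2
from oss2.models import (BucketLifecycle, BucketVersioningConfig,
                         LifecycleRule, NoncurrentVersionExpiration)

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "dw-ods-bucket")

# 1. Enable versioning so overwrites and deletes keep a recoverable history.
versioning = BucketVersioningConfig()
versioning.status = oss2.BUCKET_VERSIONING_ENABLE
bucket.put_bucket_versioning(versioning)

# 2. Keep the current version; drop historical versions after 3 days.
rule = LifecycleRule(
    "expire-noncurrent-versions",
    "",                                   # empty prefix = the whole bucket
    status=LifecycleRule.ENABLED,
    noncurrent_version_expiration=NoncurrentVersionExpiration(3),
)
bucket.put_bucket_lifecycle(BucketLifecycle([rule]))
```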

4.4.2 Data warehouse log bucket management

As the figure below shows, standard storage grew linearly until September 28. The cold-storage lifecycle rule was set on September 28, after which archive storage grows linearly while standard storage stays roughly flat. The unit price of the Standard class is 0.12 yuan/GB/month, and that of the Archive class is 0.033 yuan/GB/month.

4.4.3 Management of the warehouse and mart buckets

Under the data lake architecture, the EMR HDFS recycle-bin directory sits on the OSS bucket, and HDFS does not purge these garbage files on schedule, so we set a lifecycle rule on the recycle-bin directory to delete the garbage files in it regularly.

4.4.4 Monitor the objects in the bucket

OSS supports the bucket inventory feature, which periodically exports information about the objects in a bucket to a specified bucket, helping you understand object status and simplifying and accelerating workflows and big data jobs. The inventory scans the objects in a bucket weekly; when a scan completes, an inventory report in CSV format is generated and stored in the specified bucket. The report can optionally include metadata about each object, such as its size and encryption status. We load these CSV files into a Hive table and publish regular reports to monitor changes in the objects in each bucket and to find and deal with abnormal growth.
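
As an illustration, here is a hedged sketch of exposing the weekly inventory CSVs to Hive as an external table and running a simple growth report; the bucket path and column list are hypothetical placeholders, since the actual columns depend on the fields chosen in the inventory configuration.

```python
# Hypothetical external table over OSS inventory CSVs plus a weekly size report.
from pyhive import hive

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS dw_meta.oss_bucket_inventory (
  bucket            STRING,
  object_key        STRING,
  size_bytes        BIGINT,
  last_modified     STRING,
  storage_class     STRING,
  encryption_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://dw-inventory-bucket/inventory/dw-cdm-bucket/data/'
"""

conn = hive.connect(host="emr-header-1.cluster", port=10000,
                    username="bigdata", password="********", auth="LDAP")
cur = conn.cursor()
cur.execute(DDL)

# Object count and total size per top-level prefix, to spot abnormal growth.
cur.execute("""
  SELECT split(object_key, '/')[0] AS prefix,
         count(*)                  AS objects,
         sum(size_bytes)           AS total_bytes
  FROM dw_meta.oss_bucket_inventory
  GROUP BY split(object_key, '/')[0]
  ORDER BY total_bytes DESC
""")
print(cur.fetchall())
```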

5. Aliyun second-generation data lake


In the first-generation data lake the execution engine was EMR and the storage medium was OSS. When the company introduced the Dataphin data middle platform, whose execution engine and storage are MaxCompute, our data warehouse ended up running on two heterogeneous execution engines, EMR and MaxCompute, which brought the following problems:

  • Storage redundancy: EMR's storage lives on the OSS object store while MaxCompute's storage lives on Pangu, so storage resources are redundant.
  • Metadata is not unified: EMR's metadata is consolidated in an external RDS database while MaxCompute's metadata lives in the MaxCompute metadata store, so the two cannot be shared.
  • The user permission systems are not unified: EMR's is built on OpenLDAP and Ranger, while MaxCompute uses its own permission system.
  • The execution engine cannot be chosen freely: high-throughput, high-complexity, low-concurrency jobs are better suited to EMR, while low-throughput, low-complexity, high-concurrency jobs are better suited to MaxCompute. We would also like to move the Double Eleven compute load onto MaxCompute to cover EMR resource shortfalls, but the current setup does not let us choose the engine freely.

Alibaba Cloud provides two lake-warehouse integration solutions. One is based on HDFS storage and maps Hive metadata into MaxCompute by creating an external project (see the best practice at https://bp.aliyun.com/detail/169). We adopted the other, based on Data Lake Formation (DLF): we migrate both the EMR metadata and the MaxCompute metadata into DLF and use OSS as the unified underlying storage, connecting the data lake built with EMR and the data warehouse built with MaxCompute so that data and compute can flow freely between lake and warehouse, truly realizing lake-warehouse integration. This is the essence of the data lake: unified storage, unified metadata and freely interchangeable execution engines. Data Lake Formation (DLF) is a fully managed service that helps users quickly build a data lake on the cloud, providing unified permission management, unified metadata management and automatic metadata extraction.
  • Unified storage: the Alibaba Cloud data lake uses Object Storage Service (OSS) as the unified storage on the cloud. Different compute engines on the cloud, such as open source big data (E-MapReduce), real-time compute, MaxCompute, interactive analytics (Hologres) and Machine Learning PAI, can serve different big data computing scenarios against the same data lake storage, avoiding the complexity and operating cost of data synchronization.
  • Data ingestion: the Alibaba Cloud data lake can ingest data from a variety of sources; it currently supports relational databases (MySQL), Alibaba Cloud Log Service (SLS), Tablestore (OTS), Object Storage Service (OSS), Kafka and so on, and users can specify the storage format to improve compute and storage efficiency.
  • Metadata management: users can define the format of data lake metadata and manage it centrally to ensure data quality. We mainly use DLF's unified metadata management and unified user permission management. In this architecture, EMR and MaxCompute share the metadata, user and permission management capabilities of DLF. The data flow of the data lake is shown below:
  • The extraction EMR (ETLX) extracts the data of the business RDS and business OSS into the data lake, landing the ODS layer in the lake.
  • The Dataphin data middle platform performs dimensional modeling on the data in the lake (the intermediate tables of the modeling, i.e. the fact logic tables and dimension logic tables, use MaxCompute internal tables and do not land in the lake), and the final modeling results are written to the CDM or ADS layer on the data lake.
  • EMR or other execution engines run ad hoc query analysis or scheduled jobs against the ADS-layer data on the data lake.

This article is original content from Aliyun and may not be reproduced without permission.