Content source: On May 13, 2017, architect Li Ximing gave the talk "A Fast Solution for a Big Data Platform" at "The Beauty of Java | Java Developers Conference [Shanghai]". As the exclusive video partner, IT Dakashuo releases this content with the review and authorization of the organizers and the speaker.

Word count: 1,891 | Reading time: 4 minutes



Abstract

Big data refers to collections of data that cannot be captured, managed, and processed by conventional software tools within a reasonable time. It is a massive, fast-growing, and diversified information asset that requires new processing models to deliver stronger decision-making, insight, discovery, and process-optimization capabilities. Drawing on his own hands-on experience, architect Li Ximing shared a fast solution for building a big data platform.

Video and PPT of the guest talk: t.cn/R9an7Rr

The story of building our own platform

When we decided to do big data, there were two options. The first was to build our own platform on top of the native open-source big data stack; the second was ODPS, Alibaba's hosted data processing service.

In the end, we chose to build the big data platform ourselves on the native stack, both because we already had some accumulated experience and because we wanted to build up our own big data expertise.

In the mobile Internet era, all user data, behaviors, browsing, and other recorded and collected activity, is captured and analyzed. Periodic, staged summaries tell us what has accumulated over the previous period. The same data also supports deeper mining: segmenting different kinds of users and targeting marketing precisely.

Every company surfaces this data at the reporting level. Our initial implementation put all user behavior data into a traditional relational database and read the tables with a pure Java application. Evaluating a single metric meant joining several sub-tables: the main table held tens of millions of rows, and the sub-tables held millions or even tens of millions more. Doing that in plain Java forced us into heavy multithreading.

As a result, the traditional pure-Java-program-plus-relational-database approach to reporting ran into both storage and computation performance problems, and report requests grew slower and slower.

Against this background, we switched to a big data stack for this scenario.

An overview of the technology

Hadoop is the foundation of all big data computing and storage; the big data products that followed are all derived from it.



This is the current big data platform architecture.

Native Hadoop includes HDFS (file storage), YARN (resource scheduling), and MapReduce (the computation model).

Spark is a computing framework similar to MapReduce, and it performs far better than native MapReduce in many scenarios, especially iterative computation.

Sqoop is a data migration tool.

Hive abstracts the files on HDFS into something resembling a MySQL-style relational database; in other words, Hive is a relational-style database built on Hadoop.

Oozie is a framework for orchestrating and scheduling tasks.

Hue is the management console for the big data stack.

Zookeeper is a distributed coordination tool.

1. Component classification

Basic data: MySQL, files. The basic data layer is separate from the big data stack; it is the traditional data source.

Big data storage: HDFS and Hive. HDFS is the most basic file storage, and on top of it Hive abstracts a big data relational-style database.

Big data computing: MapReduce, Spark, and Sqoop. MapReduce is native to Hadoop, Spark is the newer framework, and Sqoop is a data transfer tool.

Big data coordination and scheduling: YARN, ZooKeeper, and Oozie. YARN is native to Hadoop. ZooKeeper is a distributed coordination tool that ensures atomic operations. Oozie is a scheduling tool.

Big data presentation: Hue, the presentation layer for CRUD operations.

2. Typical execution flow



As we said at the beginning, we ran into problems with MySQL tables that could no longer hold the data or handle the computation. In this scenario, you take the data, move it from MySQL into the big data platform, do the calculations there, and then present the results.

Sqoop synchronizes the MySQL data into a Hive table, either in one pass or incrementally. You then write a query in Hive SQL; under the hood the Hive SQL is converted into MapReduce tasks for execution, and the data is displayed.
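To make the query step concrete, here is a minimal Java sketch, assuming a HiveServer2 JDBC endpoint and a hypothetical user_behavior table that Sqoop has already synced over; the host, credentials, table, and columns are illustrative, not from the talk. The SQL reads like ordinary MySQL, but Hive compiles it into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReportQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint (hypothetical host and database).
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Looks like MySQL SQL, but Hive turns it into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT dt, COUNT(*) AS pv FROM user_behavior GROUP BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("pv"));
            }
        }
    }
}
```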

Most of the time, the controller layer of a backend service has an entry point and an exit point. We log the parameters at both, so they are available later for troubleshooting or statistical applications.

The application periodically writes these messages to a message queue, and Spark periodically reads from the queue and processes the messages as Spark jobs. The work ultimately lands on Hadoop's underlying computation layer, is scheduled by YARN, and the results are written to Hive, completing a streaming computation.
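The talk does not name the message queue, so the sketch below assumes Kafka purely for illustration; the broker address, topic, consumer group, and 60-second batch interval are all hypothetical. Each micro-batch pulls the queued log messages and computes a trivial count; a real job would aggregate the records and write the results into Hive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class LogStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("LogStreamJob");
        // Read the queue in 60-second micro-batches.
        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(60));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092"); // hypothetical broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "log-stream");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("service-logs"), kafkaParams));

        // Count messages per batch; a real job would aggregate and persist to Hive.
        stream.map(ConsumerRecord::value)
              .count()
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```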

3. Hue



Hive SQL is written in much the same way as traditional MySQL SQL. After the Hive SQL is written, click Execute: Hive's SQL parsing engine translates the query into MapReduce jobs, and MapReduce runs the job on Hadoop, scheduled by YARN, to produce the result.

4. Storage: Hadoop HDFS

Hadoop HDFS is the basic storage layer.



Hadoop HDFS has only two node types: the NameNode and the DataNode. The NameNode is the management node, while DataNodes are responsible only for data storage and redundancy.
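As a small sketch of how a client touches both node types, the standard Hadoop FileSystem API can be used from Java; the NameNode URI and file path below are hypothetical. The client asks the NameNode where blocks live, then streams the actual bytes to and from DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // The client contacts the NameNode for metadata; block data is
        // written to and read from DataNodes (URI is hypothetical).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            // DataNodes store the blocks, replicated for redundancy.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```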

5. Computing: MapReduce & Spark

Hadoop's native computing framework is MapReduce, while Spark is an emerging computing framework that is faster and more versatile.
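To show why Spark shines at iterative computation, here is a minimal Java sketch; the input path and the ERROR filter are illustrative. The dataset is loaded once and cached in memory, so every later pass skips the disk I/O that a chain of MapReduce jobs would pay on each iteration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load once and cache in memory; a MapReduce job chain would
        // re-read the input from HDFS on every pass.
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/events.log").cache();

        // Each iteration reuses the cached RDD instead of hitting disk.
        for (int i = 0; i < 10; i++) {
            long errors = lines.filter(l -> l.contains("ERROR")).count();
            System.out.println("pass " + i + ": " + errors + " errors");
        }

        sc.stop();
    }
}
```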



6. Resource manager: YARN (Apache Hadoop YARN)



The resource manager's architecture consists of the ResourceManager and NodeManagers. The ResourceManager is the management node and the NodeManagers are the worker nodes.

A task is submitted to the ResourceManager, which distributes it to the individual nodes and tracks their state changes. Finally, the ResourceManager aggregates the results, and after processing they are handed back to the client.

7. Hive



The Hive architecture is not very complex. The top layer is the user-facing APIs, web pages, and command lines. At its core is an execution engine that translates SQL into tasks the big data platform can accept. The bottom layer is the storage, which can live on HDFS.

8. Sqoop

It is mainly used for data transfer between Hadoop and traditional databases.
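For illustration, Sqoop 1 can also be driven from Java through Sqoop.runTool, which accepts the same arguments as the sqoop command line; the connection string, credentials, and table names below are hypothetical, and the incremental flags mirror the full-or-incremental sync described in the execution flow above.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportJob {
    public static void main(String[] args) {
        // Same arguments as the `sqoop import` command line
        // (connection details and table names are hypothetical).
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://mysql-host:3306/appdb",
            "--username", "report",
            "--password", "secret",
            "--table", "user_behavior",
            "--hive-import",            // land the rows in a Hive table
            "--hive-table", "user_behavior",
            "--incremental", "append",  // omit for a full one-time sync
            "--check-column", "id",
            "--last-value", "0"
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}
```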

9. Oozie

Oozie handles big data task scheduling.

Learning and adoption path

If you want to learn big data, I recommend first mastering some of the basics, then finding a scenario where you can fold the technology in and practice quickly. Rapid practice forces you to solve many problems and fill in a lot of missing knowledge, so move forward through practice and patch the gaps through your mistakes.

That concludes my sharing. Thank you!

Recommended articles

  • Ye Jinrong, co-founder of Zhidustang: The new era of MySQL 5.7

  • Big data platform architecture technology selection and scenario application
