There are many big data products on the market, but unlike the IaaS layer they are far from standardized, and the differences between products are not always clear. When building a big data platform or solution, many enterprises do not know which products will meet their needs. The usual approach is to research, study, build an environment, test, and integrate the various products, but this process is typically long and expensive.

We want this work to be left to the cloud platform, so that every product on the cloud can be deployed and scaled with one click, and nodes can be added or removed directly from the UI. For an enterprise, the real core is its business; it should not have to spend much time figuring out how to build, deploy, and manage big data tools. The operation and management of big data products is better handed over to dedicated big data service providers, who enjoy economies of scale, so that users can focus on their own business.

QingCloud provides a complete, multi-layered infrastructure cloud and technology platform cloud. The bottom layer is IaaS, which covers standard computing, storage, and networking: networking resources such as routers and load balancers; storage services for different scenarios such as block storage, shared storage, and object storage; and computing resources such as hosts, containers, and images.

Above the IaaS layer is PaaS. QingCloud began building its PaaS platform a few years ago, and one principle runs through it: capabilities such as resource scheduling, SDN, and performance optimization are shared with PaaS through the IaaS layer, so QingCloud remains a unified architecture.

On top of the PaaS layer, QingCloud also provides higher-level management services such as orchestration, timers, and monitoring alarms, and supports various deployment models for customers (e.g., VPC, private cloud, hosted cloud).

QingCloud is a complete enterprise-level big data platform

QingCloud’s big data platform covers the complete data life cycle: Kafka handles data transmission; data can be stored in object storage, HBase, or MongoDB; and each style of computation has a mainstream tool, with Storm for real-time processing, Spark for quasi-real-time processing, and Hadoop and Hive for batch processing. There are also big data components that are heavily used on the public cloud, such as Elasticsearch, which has strong performance, very clear scenarios, and is easy to use whenever you need to search massive data, as well as Redis, Memcached, ZooKeeper, and other components that sit close to the user.
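
To make the transmission stage concrete, here is a minimal sketch of an application pushing events into Kafka with the kafka-python client; the broker address, topic name, and event fields are placeholders rather than anything specific to QingCloud's platform.

```python
# Hypothetical producer feeding the "transmission" stage described above.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.example.internal:9092",           # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # events as JSON
)

# Downstream, Storm or Spark would consume this topic and write the
# processed results into HBase, MongoDB, or object storage.
producer.send("user-events", {"user_id": 42, "action": "click", "ts": 1690000000})
producer.flush()
```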

Because QingCloud is a cloud service provider, its big data platform is a generic service. Unlike the big data stacks of Meituan, Xiaomi, Baidu, and other Internet companies, which are tightly coupled to their own businesses, QingCloud serves all users on the cloud. The platform is a common architecture with flexible relationships between components; QingCloud mainly provides the management of mainstream big data components and of the connections between them.


How to select big data products?

Many users face the same question when approaching big data: which products should they choose? There is no definitive answer, but we can analyze today's mainstream big data products along several dimensions:

Real-time stream processing engine comparison

Mainstream real-time streaming engines include Storm, Storm Trident, Spark Streaming, Samza, Flink, and so on, and there are many dimensions to consider when choosing among them. One is processing semantics: at-least-once (a failure may cause a message to be retransmitted and therefore processed more than once) versus exactly-once (each message is guaranteed to be processed exactly once, even when errors occur). Another is latency: Storm is a native stream processor, handling one record at a time, so its latency is very low; Spark Streaming is micro-batching, processing the stream in small batches over a fixed interval, so its latency is higher. A third is throughput: Storm's record-at-a-time model yields comparatively low throughput, while Spark Streaming's micro-batching yields high throughput.
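
As a rough illustration of micro-batching, the sketch below uses the classic Spark Streaming (DStream) word count: the stream is cut into fixed 5-second batches, which is where the extra latency relative to Storm's record-at-a-time model comes from. The socket source and interval are illustrative only.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batch-demo")
ssc = StreamingContext(sc, batchDuration=5)   # each micro-batch covers 5 seconds

# Placeholder source; in the platform described here the source would be Kafka.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # results are emitted once per completed micro-batch

ssc.start()
ssc.awaitTermination()
```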

When choosing among these products, an enterprise architect or solution designer needs to weigh which of these dimensions the streaming business actually cares about: throughput, latency, or processing semantics.

Storage – HBase vs Cassandra

HBase and Cassandra are two very similar products that many people have trouble telling apart. Both are column-oriented stores, both can handle huge amounts of data, and both offer excellent read and write performance. HBase provides strong consistency at the row level, while Cassandra provides eventual consistency: a read made while replicas are still converging may not return the latest data.

In terms of stability, HBase relies on HA for its HMaster and the HDFS NameNode, while Cassandra is a decentralized peer-to-peer architecture with no single point of failure. In terms of partitioning, HBase uses range partitioning over sorted row keys, while Cassandra uses consistent hashing, which can also be customized. In terms of availability, when an HBase node goes down its regions become temporarily unavailable, whereas when a Cassandra node goes down data can still be read and written.

From these comparisons it is clear that HBase sacrifices availability to emphasize strong consistency, while Cassandra offers high availability but gives up strong consistency. In terms of application scenarios, HBase's strong consistency makes it suitable for some OLTP and transactional workloads. Cassandra suits workloads with heavy concurrent reads and writes and high performance requirements that do not need strongly consistent, exactly-unified data at every moment; storing monitoring data is a typical use.
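
As a small illustration of the trade-off, the sketch below uses the DataStax Python driver to write monitoring data with a relaxed consistency level; the node addresses, keyspace, and table schema are hypothetical.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.11", "10.0.0.12"])   # hypothetical Cassandra nodes
session = cluster.connect("monitoring")          # hypothetical keyspace

# ConsistencyLevel.ONE: the write succeeds as soon as one replica accepts it;
# the remaining replicas converge later (eventual consistency), which is
# acceptable for high-volume metrics.
write = SimpleStatement(
    "INSERT INTO cpu_usage (host, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, ("web-01", 1690000000, 0.73))

# Raising the level (e.g. QUORUM) buys stronger reads at the cost of latency
# and availability -- the knob HBase effectively fixes at "strong".
read = SimpleStatement(
    "SELECT value FROM cpu_usage WHERE host = %s AND ts = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("web-01", 1690000000)).one()
```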

Comparison of common ad-hoc and OLAP analysis products

There are also many ad-hoc and OLAP products commonly used for data warehousing; the choice can be measured along three dimensions: data volume, query flexibility, and performance.

Hive:

Based on MapReduce, it can process massive data and offers flexible queries, but its performance is low.

Phoenix + HBase:

It can also process massive data with high performance, but it can only be queried efficiently through the rowkey, which makes it less flexible.
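
For a sense of what rowkey-based access looks like, here is a hedged sketch using the phoenixdb driver through the Phoenix Query Server; the endpoint, table, and columns are assumptions made for illustration.

```python
import phoenixdb  # Python driver that talks to the Phoenix Query Server

conn = phoenixdb.connect("http://phoenix-qs.example.internal:8765/", autocommit=True)
cursor = conn.cursor()

# Fast path: lookups and range scans keyed on the primary key (the HBase rowkey).
cursor.execute(
    "SELECT metric_value FROM device_metrics WHERE device_id = ? AND ts >= ?",
    ("sensor-001", 1690000000),
)
print(cursor.fetchall())

# Filtering on non-key columns is possible, but without a secondary index it
# falls back to full table scans -- the loss of flexibility mentioned above.
```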

Elasticsearch:

It can be used for data analysis, log analysis, and similar scenarios, and is characterized by flexible queries and high performance. However, it does not scale to truly massive data, typically handling only terabyte-scale datasets, because Elasticsearch uses a fully connected topology in which every node communicates with every other node.
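
To show the kind of flexible, high-performance query this refers to, here is a hedged log-analysis sketch with the official Python client (elasticsearch-py 8.x style); the endpoint, index name, and field names are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es.example.internal:9200")   # placeholder endpoint

# Ad-hoc query: errors from the last hour, bucketed per service.
resp = es.search(
    index="app-logs",                                    # hypothetical index
    size=0,
    query={"bool": {"filter": [
        {"range": {"@timestamp": {"gte": "now-1h"}}},
        {"term": {"level": "ERROR"}},
    ]}},
    aggs={"by_service": {"terms": {"field": "service.keyword"}}},
)

for bucket in resp["aggregations"]["by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```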

Kylin:

Internet companies such as Baidu, Xiaomi, and Meituan use it for data warehouse analysis. It handles massive data with high performance but low flexibility: because Kylin relies on pre-aggregation, the cube's dimensions and facts must be pre-computed in the data warehouse and stored in HBase to achieve high performance, and that is exactly what costs it flexibility.

Druid:

It satisfies all three dimensions: massive data, performance, and query flexibility. Its limitation is that every record in the data source must carry a timestamp, so it is essentially built for time-series data; it is mostly used in real-time scenarios such as advertising analysis.

HashData (GreenPlum):

GreenPlum is a traditional data warehouse. HashData's founders come from the Yahoo Hadoop R&D team, from teams operating some of the largest Hadoop clusters in the world, and from the core R&D engineers of GreenPlum. They open-sourced Apache HAWQ, a Hadoop-based data warehouse project under the Apache Foundation, to which they contributed more than half of the code. The team therefore has strong data warehouse R&D capabilities, and it worked with QingCloud to deliver HashData, a product with flexible queries and high performance. Its limitation is that it can only handle structured data, not unstructured data.

Thinking about the relationship between computation and storage

When doing big data, one question always comes up: where should the data live? On local disks or in object storage? Stored locally, performance is sufficient but capacity is limited; in object storage, capacity can grow almost without bound, but performance cannot match a local disk. Take Hadoop or Spark: if the underlying data sits on the cluster's local machines, you get big data's data-locality benefit. But this has its own problems, for example when you run into huge numbers of small files. What helps there is object storage, which applies special merging and optimization for small files and handles the problem easily.

So QingCloud is working to let computing and data be not only co-located but also separable. Once data volume passes the PB level, keeping it in the distributed storage of the IaaS and PaaS layers becomes expensive, and an object storage solution is needed: historical data and massive numbers of small files that are rarely touched go into object storage. When you need to compute, you have two options: if real-time results and peak performance do not matter, compute directly against the object store; if high performance is required, pull the data onto HDFS for the computation. This makes the whole setup very flexible.
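
A hedged sketch of the two options with PySpark follows; the paths, bucket name, and s3a scheme are placeholders, and an S3-compatible object store would also need the appropriate connector and credentials configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compute-storage-separation").getOrCreate()

# Option 1: compute directly against object storage (cheap, slower) --
# fine for historical data where latency does not matter.
history = spark.read.parquet("s3a://data-lake/events/year=2016/")

# Option 2: pull hot data onto HDFS first when performance matters.
recent = spark.read.parquet("hdfs:///warehouse/events/current/")

# The same engine can combine both copies in one job.
(history.unionByName(recent)
        .groupBy("event_type")
        .count()
        .show())
```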

So what is the relationship between computing and storage? Our understanding is: when you need performance, keep them together; when the scale of data and computation grows large, separate them. In the past there would be a Hadoop cluster here and a Spark cluster there, with data scattered everywhere. Once all the data is kept in object storage, the data no longer has to move: the computing engine can be Hadoop, Spark, or Hive, and all of them can compute over the same data.

Big data case studies

Big data analysis flow of QingCloud's online business

QingCloud uses its own big data services to analyze its business, marketing, and sales. We parse and synchronize data from a number of online databases, including typical ETL processing. The processed data is stored in HDFS, Spark computes over it, and the results are saved in products that the business side finds easy to consume, such as PostgreSQL and Elasticsearch, and are then exposed to the front end through an API server. This is a typical big data analysis pipeline, and we use it to validate the various services provided by the QingCloud big data platform.
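
A minimal sketch of that pipeline's compute-and-serve step follows, assuming hypothetical HDFS paths, column names, and PostgreSQL endpoint (the PostgreSQL JDBC driver must be on Spark's classpath).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("business-analytics-etl").getOrCreate()

# 1. Read the snapshots that the ETL/synchronization step landed in HDFS.
orders = spark.read.parquet("hdfs:///etl/orders/")            # placeholder path

# 2. Compute the aggregates the business side actually consumes.
daily = (orders
         .groupBy(F.to_date("created_at").alias("day"))
         .agg(F.sum("amount").alias("revenue"),
              F.countDistinct("customer_id").alias("customers")))

# 3. Save the results into an easy-to-serve store (PostgreSQL here);
#    an API server then exposes this table to the front end.
(daily.write.format("jdbc")
      .option("url", "jdbc:postgresql://pg.example.internal:5432/analytics")
      .option("dbtable", "daily_revenue")
      .option("user", "etl_user")
      .option("password", "change-me")
      .mode("overwrite")
      .save())
```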

A large Internet social networking platform

This is a large Internet social platform running on QingCloud's public cloud. Its architecture is very typical: there is a data service layer covering MySQL, caches, Elasticsearch, MongoDB, and other data stores, and beneath it a data layer containing ZooKeeper and Kafka. Application-level system logs are fed into the Spark platform for analysis, for example to drive a recommendation system that suggests friends a user might add.

An innovative comprehensive financial company

This is a QingCloud private cloud case: QingCloud deploys its entire software stack into the user's own environment. The architecture consists of three main parts: data extraction, data processing, and data services. Data sources include logs and relational data, handled by ETL tools, log processing frameworks, and real-time data synchronization systems. Processing covers both offline and real-time computing. Once computation is complete, the system offers offline services such as Hive and Spark for internal data analysis, alongside online services that emphasize performance and ease of use.

The architectures in these cases vary, but look closely and they follow the same pattern: each is layered into data sources, collection, transmission, computation, storage, and so on; only the specific products differ, depending on the scenario. QingCloud's goal is to provide a big data platform flexible enough to meet the needs of any application scenario, whether the user is a traditional enterprise or an Internet company.

QingCloud big data platform roadmap

Big data platform management architecture

When there are many big data components, a dedicated platform is needed to manage them. QingCloud provides an interface for executing Hive SQL and Spark scripts. Within this platform you can view Storm and ZooKeeper data, browse the file structure of HDFS and object storage directly from the browser, submit HBase queries in real time, and see what data currently sits in the Kafka transmission queues. It is, in effect, an administrative UI through which many big data components can be managed. Previously, users had to manage each component separately, and as systems grew more complex, ease of use kept degrading.

Big data platform + AppCenter 2.0

Big data technology is evolving so fast that QingCloud faces a dilemma: There are endless big data products. We can’t offer them all, but what do we do if our users want them?

First, QingCloud's big data platform will be delivered through QingCloud AppCenter. Products such as Hadoop, Spark, and Elasticsearch will each be an app in AppCenter, and typical combinations will also be available as combined apps. For users, this greatly improves the ease of use of big data products. QingCloud AppCenter provides not only an app development framework but also an app operation framework, so several simple apps can be orchestrated into one larger app.

Second, there are too many big data scenarios: what if QingCloud does not offer the product you want to use? If you know that product well, you can package it as a cloud application with the QingCloud AppCenter framework and offer it to others, much as app developers do in Apple's App Store. Whether you are an enterprise user or an individual developer with strong big data skills, you can use the AppCenter framework to turn a big data product you know well into a cloud service in a short time and make it available to everyone on QingCloud's AppCenter.

Something like this would have been unthinkable before, or at least very expensive. QingCloud AppCenter can help you turn an existing product into a cloud service in a matter of days, and the operations management and life cycle framework of a cloud service are built into AppCenter.