Speaking of big data, I have to mention Hadoop, so let's talk about it first.

Comparison of Apache Hadoop with CDH and HDP

1. Overview of Hadoop versions

There are three main free distributions of Hadoop (all from foreign vendors): Apache (the original version on which all distributions are based), Cloudera's Distribution Including Apache Hadoop (CDH), and Hortonworks Data Platform (HDP). Domestically, the vast majority choose CDH. The main differences between CDH and the Apache version are as follows:

2. Comparing the Community edition with third-party distributions

1. The Apache community edition

Advantages:

Completely open source and free

The community is active

The documents and information are detailed

Disadvantages:

1. Version management is chaotic. Various versions emerge in an endless stream, which makes it difficult to select.

2. Cluster installation and deployment are complex. A large number of configuration files need to be written and distributed to each node.

3. Cluster O&M is complex and requires third-party tools.

2. Third Party Distribution (CDH)

Advantages:

1. Clear version management (CDH3, CDH4, etc.).

2. Fast release cadence. For example, CDH normally ships an update every quarter and a major release every year.

3. Simple cluster installation and deployment. Deployment, installation, and configuration tools are provided, greatly improving the efficiency of cluster deployment.

4. Simple operation and maintenance. Cloudera Manager provides tools for management, monitoring, diagnosis, and configuration changes, making management and configuration convenient, fault location fast and accurate, and O&M simple and effective.

Disadvantages:

There is a vendor lock-in issue.

3. Third-party distribution comparison

Cloudera: The most established distribution with the most deployment cases. Provides powerful deployment, administration, and monitoring tools.

Hortonworks: The only provider that uses 100% open-source Apache Hadoop without any proprietary (non-open-source) modifications. Hortonworks was the first provider to use the metadata service feature of Apache HCatalog, and their Stinger initiative was groundbreaking in optimizing the Hive project. Hortonworks provides a great, easy-to-use sandbox for getting started, and has developed and contributed a number of enhancements to the core trunk that enable Apache Hadoop to run natively on Microsoft Windows platforms, including Windows Server and Windows Azure.

4. Version selection

When choosing whether to adopt a piece of open-source software, we usually need to consider:

(1) Whether it is open-source software, that is, whether it is free to use.

(2) Whether there is a stable version; the official website will usually say so.

(3) Whether it has been proven in practice, which can be checked by seeing whether larger companies use it in production environments.

(4) Whether there is strong community support, so that when a problem occurs, a solution can be found quickly through communities, forums, and other online resources.

Having compared Apache Hadoop with CDH and HDP, let's take a look at another comparison.

Comparison of CDH5 with CDH6

1 Overview

Cloudera officially released CDH6.0.0 on August 30, 2018, and iterated to CDH6.1.1 by February 19, 2019; the latest CDH release is now CDH6.3.3. CDH6 is based on Hadoop 3 and includes a large number of other component updates. Since future CDH releases will be based on CDH6 and CDH5 will slowly stop being updated, many users are considering CDH6 for new clusters or upgrading existing CDH5 clusters to CDH6. The first question is the differences between CDH5 and CDH6: by analyzing them, you can determine whether existing applications can be migrated or deployed directly to CDH6, and whether there are compatibility or stability problems. To that end, this article compares CDH5 and CDH6 in detail from several angles so that you can make an informed judgment and the appropriate choice. The comparison below is based mainly on the latest CDH5.16.1 and CDH6.1.1. Finally, a reminder: there is no best technology, and the newest technology is not necessarily the best; there is only the most appropriate technology.

2 Component version comparison

Component | CDH5.16.1 | CDH6.1.1
Cloudera Manager | 5.16.1 | 6.1.1
CDSW | 1.5 | 1.5
Cloudera Navigator | 2.15.1 | 6.1.1
Flume-ng | 1.6.0 | 1.8.0 (bundled)
Hadoop | 2.6.0 | 3.0.0
HBase | 1.2.0 | 2.1.0
Hive | 1.1.0 | 2.1.1
Hue | 3.9.0 | 3.9.0
Impala | 2.12.0 | 3.1.0
Oozie | 4.1.0 | 5.0.0
Pig | 0.12.0 | 0.17.0
Sentry | 1.5.1 | 2.1.0
Solr | 4.10.3 | 7.4.0
Spark | 1.6.0/2.3.0 | 2.4.0
Sqoop | 1.4.6 | 1.4.7
Sqoop2 | 1.99.5 | Not included
Zookeeper | 3.4.5 | 3.4.5
Kafka | 1.0.1 | 2.0.0
Kudu | 1.7.0 | 1.8.0 (bundled)
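To see at a glance which components cross a major version boundary between the two releases, the table can be diffed mechanically. A minimal sketch in Python, using a subset of the versions transcribed from the table above:

```python
# Component versions transcribed from the table above (CDH5.16.1 vs CDH6.1.1).
CDH5 = {"HBase": "1.2.0", "Hive": "1.1.0", "Impala": "2.12.0",
        "Solr": "4.10.3", "Spark": "2.3.0", "Kafka": "1.0.1"}
CDH6 = {"HBase": "2.1.0", "Hive": "2.1.1", "Impala": "3.1.0",
        "Solr": "7.4.0", "Spark": "2.4.0", "Kafka": "2.0.0"}

def major(version):
    """Leading X of an X.Y.Z version string."""
    return int(version.split(".")[0])

def major_upgrades(old, new):
    """Components whose major version increases between the two releases."""
    return sorted(c for c in old if c in new and major(new[c]) > major(old[c]))

print(major_upgrades(CDH5, CDH6))
```

Running this flags HBase, Hive, Impala, Kafka, and Solr as major upgrades, which is exactly why application compatibility testing matters when moving from CDH5 to CDH6.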

3 Operating system support

Operating system | CDH5.16.1 | CDH6.1.1
RHEL/CentOS/OL with RHCK kernel | 7.6, 7.5, 7.4, 7.3, 7.2, 7.1, 6.10, 6.9, 6.8, 6.7, 6.6, 6.5, 6.4, 5.11, 5.10, 5.7 | 7.6, 7.5, 7.4, 7.3, 7.2, 6.10, 6.9, 6.8
Oracle Linux (OL) | 7.5, 7.4, 7.3, 7.2, 7.1 (UEK default), 6.9, 6.8 (UEK R2, R4), 6.7, 6.6, 6.5 (UEK R2, R3), 6.4 (UEK R2), 5.11, 5.10, 5.7 (UEK R2) | 7.4, 7.3, 7.2 (UEK default), 6.10 (UEK default)
SLES | 12 SP3, 12 SP2, 12 SP1, 11 SP4, 11 SP3 | 12 SP3, 12 SP2
Ubuntu | 16.04 LTS (Xenial), 14.04 LTS (Trusty) | 16.04 LTS (Xenial)
Debian | 8.9, 8.4, 8.2 (Jessie), 7.8, 7.1, 7.0 (Wheezy) | Not supported

4 Metadata database support

Database | CDH5.16.1 | CDH6.1.1
MySQL | 5.1, 5.5, 5.6, 5.7 | 5.7
MariaDB | 5.5, 10.0 | 5.5, 10.0
PostgreSQL | 8.1, 8.3, 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 10 | 8.4, 9.2, 9.4, 9.5, 9.6, 10.x
Oracle | 11g R2, 12c R1, 12.2 | 12.2

5 JDK support

JDK | CDH5.16.1 | CDH6.1.1
JDK7 | 1.7u55, 1.7u67, 1.7u75, 1.7u80 | Not supported
JDK8 | 1.8u31, 1.8u74, 1.8u91, 1.8u102, 1.8u111, 1.8u121, 1.8u131, 1.8u144, 1.8u162, 1.8u181 | 1.8u31, 1.8u74, 1.8u91, 1.8u102, 1.8u111, 1.8u121, 1.8u131, 1.8u141, 1.8u162, 1.8u181
OpenJDK | 1.8u181 (earlier CDH5.16 releases do not support OpenJDK) | 1.8u181

6 Single user mode

When we install CDH, we usually install it with root or sudo privileges. If you have noticed, the Cloudera Manager Agent service (cloudera-scm-agent) that manages the Hadoop processes on each host runs as root. In some enterprises, however, the operations department strictly requires that CDH services be managed by a dedicated user, for example that a non-root user start, stop, and manage cloudera-scm-agent. To meet this requirement, Cloudera has provided single user mode since Cloudera Manager 5.3. In single user mode, the Cloudera Manager Agent and all the service processes managed by Cloudera Manager are started by a configured user. Single user mode prioritizes isolation between the Hadoop processes and the rest of the system, rather than isolation among the Hadoop processes themselves. For a Cloudera Manager deployment, single user mode is global and applies to all clusters managed by that Cloudera Manager. By default, the single user is cloudera-scm.

Starting from CDH6, the single-user mode is no longer supported and the installation must be performed as root or a user with sudo permission.

7 Summary

Compared with CDH5, CDH6 is a major version upgrade of each component. To understand what a major version update means, first consider the version numbering convention of the Hadoop-related components. For a version number X.Y.Z: Z is the maintenance (patch) version, which mainly fixes bugs, does not change APIs, and adds no new features; Y is the minor version, whose upgrades mainly add new features and new APIs; X is the major version, which often adds significant features and may even change APIs. CDH5 to CDH6 is a major version upgrade that adds many features, and API changes may make your old programs incompatible, requiring modification or redevelopment. That said, by the usual conventions of software iteration, a new version remains backward compatible for a period of time, so in theory the changes required to your applications should not be too large; but this must be judged against your actual situation, with strict testing before any evaluation.
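The X.Y.Z convention described above can be sketched as a small classifier that reports which level of the version number changed, and therefore what kind of upgrade you are facing:

```python
def upgrade_type(old, new):
    """Classify an upgrade by the X.Y.Z convention described above:
    X = major (significant features, possible API changes),
    Y = minor (new features/APIs, backward compatible),
    Z = maintenance/patch (bug fixes only)."""
    o = [int(p) for p in old.split(".")]
    n = [int(p) for p in new.split(".")]
    for label, a, b in zip(("major", "minor", "maintenance"), o, n):
        if b != a:
            return label
    return "same"

print(upgrade_type("5.16.1", "6.1.1"))   # CDH5 -> CDH6 crosses the X boundary
```

By this rule, 5.16.1 to 6.1.1 is a major upgrade, while, say, Hive 2.1.0 to 2.1.1 would only be a maintenance release.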

CDH5 is still the most widely used and promoted version on the market, with high customer recognition. After nearly five years of iteration, it is also the most mature and stable version: with the community's update cycle, the bugs in each component have been thoroughly fixed. There are also many support and implementation cases, both abroad and in China, that can be consulted for reference, which saves O&M manpower and costs. Finally, CDH5 has rich surrounding ecosystem support, whether open source or commercial: ETL tools, scheduling tools, reporting tools, BI tools, and so on. If you choose CDH5 now or are already using it, Cloudera has announced that it will continue to support it for another three years.

CDH6 is a large update: many components get major upgrades, introducing a number of exciting new features while also fixing many known issues and security vulnerabilities in each component. For example, HDFS erasure coding can be used for cold data to reduce storage costs while keeping data available; NameNode federation and YARN federation address the performance bottlenecks of very large clusters; and YARN introduces GPU support, with Docker support to come. In the long term, upgrading from Hadoop 2 to Hadoop 3, or CDH5 to CDH6, is inevitable, as both the community and Cloudera will now focus on Hadoop 3 and CDH6, while CDH5 will only be maintained and patched.

Advice:

1. If you are building a new, not very large cluster (fewer than 50 nodes) and your Hadoop-based applications are newly developed, CDH6 is a good choice: you avoid the later inconvenience of upgrading from CDH5 to CDH6 and the application testing and migration that come with it, and it will also be easier to migrate in the future to CDP, the merged successor of CDH and HDP.

2. If you already have a CDH5 cluster that has been running stably in production for a long time, and the new version offers nothing you need in terms of features or performance, you can wait and see rather than upgrade at this stage. Once you decide to upgrade, you need to be very careful: consider the OS, JDK, and metadata database upgrades; test all your applications on CDH6, including Hive/Impala/Spark SQL, MapReduce/Spark code, scripts, and Python/R algorithm projects; and also verify surrounding tools such as ETL, scheduling, reporting, and BI tools against the new version. Only after all tests pass should you plan a reasonable downtime window and carry out the upgrade.

Note: The above summary analysis and recommendations are not universal and are for reference only. If you have problems selecting the CDH version, please contact Cloudera sales representatives and technical support for appropriate analysis and recommendations based on your actual situation.

Some CDH5 versions are no longer supported.

Discontinued support means that updates and bug fixes are no longer provided and the vendor no longer offers technical support for that version. CDH5 will be fully discontinued by the end of this year.

Product (version) | End-of-support date
Cloudera Enterprise 6.3 | March 2022
Cloudera Enterprise 6.2 | March 2022
Cloudera Enterprise 6.1 | December 2021
Cloudera Enterprise 6.0 | August 2021
Cloudera Enterprise 5.16 | December 2020
Cloudera Enterprise 5.15 | December 2020
Cloudera Enterprise 5.14 | December 2020
Cloudera Enterprise 5.13 | October 2020
Cloudera Enterprise 5.12 | July 2020
Cloudera Enterprise 5.11 | Already ended
Cloudera Enterprise 5.10 | Already ended
Cloudera Enterprise 5.9 | Already ended
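A table like the one above is easy to turn into a quick planning check. A minimal sketch, assuming each end-of-support date falls at the end of the listed month (the table gives only month and year):

```python
from datetime import date

# End-of-support dates transcribed from the table above,
# assuming the last day of the listed month.
EOS = {"6.3": date(2022, 3, 31), "6.2": date(2022, 3, 31),
       "6.1": date(2021, 12, 31), "6.0": date(2021, 8, 31),
       "5.16": date(2020, 12, 31), "5.13": date(2020, 10, 31)}

def supported(release, today):
    """True if the release is still within its support window on `today`."""
    eos = EOS.get(release)
    return eos is not None and today <= eos

print(supported("5.16", date(2021, 6, 1)))
```

For example, checking CDH 5.16 against mid-2021 reports it as out of support, while 6.3 is still covered until March 2022.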

Changes to CDP components that CDH users and developers should be concerned about

Following the merger of Cloudera and Hortonworks, Cloudera launched its next-generation Data platform product, CDP Data Center (CDP), on November 30, 2019.

The CDP version number continues the previous CDH version number, starting from 7.0, and the latest version number is 7.0.3.0.

What's the difference between CDP and Cloudera Enterprise Data Hub (CDH Enterprise) or HDP Enterprise Plus (HDP Enterprise)?

Since HDP's share of the domestic market is very small and most companies use CDH, CDH users and developers will be unfamiliar with some of the things HDP brings. The following walks through the component changes in CDP in detail, which should also help you prepare for future study.

1 Components shared by CDP, CDH, and HDP

Apache Hadoop (HDFS/YARN/MR)

Apache HBase

Apache Hive

Apache Oozie

Apache Spark

Apache Sqoop

Apache Zookeeper

Apache Parquet

Apache Phoenix (requires additional installation in CDH)

These are basically just version upgrades of the same components. If you are a long-time CDH user, it is worth noting that CDP ships Hive 3.1.2, a significant upgrade over CDH6 (where Hive is 2.1.1). Cloudera has been conservative in its component choices, while Hortonworks was more aggressive (HDP was already on Hive 3) and stayed closer to the latest community releases.

Some important new features of Hive 3:

The default execution engine is changed to Tez

ACID support is enabled by default to support transactions

LLAP supports second- and millisecond-level real-time queries

The Hive CLI is completely deprecated on the client side, which means beeline is the only option
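With the Hive CLI gone, any scripted access in Hive 3 goes through Beeline's JDBC URL. A minimal sketch of assembling such an invocation in Python; the hostname and user below are placeholders, not values from this article:

```python
import shlex

def beeline_cmd(host, port=10000, db="default", user=None):
    """Assemble a Beeline command line. In Hive 3 the old `hive` CLI is
    deprecated, so Beeline over JDBC to HiveServer2 is the supported client."""
    url = f"jdbc:hive2://{host}:{port}/{db}"
    cmd = ["beeline", "-u", url]
    if user:
        cmd += ["-n", user]          # login user for the HiveServer2 session
    return " ".join(shlex.quote(c) for c in cmd)

# hostname and user are illustrative placeholders
print(beeline_cmd("hs2.example.com", user="etl"))
```

The `-u` (JDBC URL) and `-n` (username) flags are standard Beeline options; additional flags such as Kerberos principals would be appended the same way.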

2 Components that exist in CDH but are still in preparation in CDP

Apache Accumulo

Navigator Encrypt

Key HSM

In practice these are not used much, and they will be added to CDP sooner or later, so they do not need much attention.

3 Components included in both CDP and CDH

Apache Avro

Cloudera Manager

Hue

Apache Impala

Key Trustee Server

Apache Kudu

Apache Solr

Apache Kafka (requires additional installation in CDH)

4 Components that exist in HDP but are still in preparation in CDP

Apache Druid

Apache Knox

Apache Livy

Ranger KMS

Apache Zeppelin

One of the biggest points of interest is Apache Druid, a real-time big data analytics engine. Note that it has exactly the same name as Druid, the database connection pool produced by Alibaba, but they are two different pieces of software with no relationship to each other.

Apache Druid enables fast real-time ingestion and multi-dimensional real-time analysis of massive data, with good support for fast queries, horizontal scaling, and real-time data intake and analysis. It is essentially a time-series database that lets you quickly analyze and query time-driven data.

Note that Druid sacrifices a number of features to achieve high performance; for example, it does not support full SQL semantics (only limited join support). So Druid is not a replacement for Hive or Impala; they complement each other. A typical scenario: Hive or Impala builds the offline data warehouse or data mart, and on top of that, the data that needs multi-dimensional analysis is loaded into Druid, which provides the data interface for multi-dimensional analysis systems.
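The core idea behind Druid's speed, ingest-time rollup, can be illustrated in plain Python: rows are bucketed by a truncated timestamp plus the dimension columns, and metrics are pre-aggregated, so time-bucketed queries touch far fewer rows. A toy sketch with invented data (real Druid configures this through granularity and dimension specs):

```python
from collections import defaultdict

# Toy event stream: (timestamp, dimension, metric) -- invented data.
events = [("2020-01-01T10:05", "cn", 3), ("2020-01-01T10:40", "cn", 2),
          ("2020-01-01T10:15", "us", 7), ("2020-01-01T11:01", "cn", 1)]

def rollup(rows):
    """Druid-style ingest rollup: truncate timestamps to the hour and
    pre-sum the metric per (hour, dimension) bucket."""
    out = defaultdict(int)
    for ts, dim, val in rows:
        hour = ts[:13]               # keep "YYYY-MM-DDTHH"
        out[(hour, dim)] += val
    return dict(out)

print(rollup(events))
```

Four raw events collapse into three pre-aggregated rows here; at massive scale this is the trade that makes sub-second multi-dimensional queries possible at the cost of losing the raw rows.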

5 Components included in both CDP and HDP

Apache Atlas

Apache Ranger

Apache ORC

Apache Tez

In CDP, two former CDH components are removed: Apache Sentry, used for security authorization, and Cloudera Navigator, used for metadata management and auditing.

They were removed because their functions are replaced by other components: CDP uses a security authorization and auditing scheme centered on Ranger + Knox + Atlas. For CDH users, this is where the changes are biggest.

The current EoS date (end-of-support date) for CDH 6.3 is March 2022, so CDH6-based implementation projects will remain the mainstream choice over the next two years, and Sentry-based solutions will still be used in day-to-day project delivery. So if you are not yet familiar with Hadoop security, you cannot simply skip Sentry. Moreover, upgrading an existing CDH cluster will foreseeably raise permission-migration issues, so both Sentry and Ranger need to be understood.

Although Apache ORC could be used in CDH before, Impala's support for the ORC format is poor, and using ORC in either Impala or Hive was never supported or recommended by CDH (which recommended Parquet in both cases). This was partly business strategy: before the merger, the Parquet project was led by Cloudera while ORC was led by Hortonworks. Both formats are widely used across the Hadoop ecosystem, so after the merger both naturally require official support.

Tez is basically the same story (business strategy), which is why, until now, Hive in CDH could not use Tez directly (MR and Spark are the only engines selectable on the options page). Now that business consideration no longer exists.

6 New components in CDP

Apache Ozone (Technical Preview)

Ozone is a scalable, redundant, distributed Hadoop object storage.

Applications based on Apache Spark, YARN, and Hive can work on Ozone without modification.

It is built on the Hadoop Distributed Data Store (HDDS).

In short, Ozone is designed to solve HDFS's long-standing problems with extreme scalability and small-file storage, both of which are severely limited by the NameNode. Ozone is designed to support tens of billions of files and blocks (and more in the future).
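The NameNode limitation can be made concrete with the commonly quoted rule of thumb that each HDFS object (file or block) costs on the order of 150 bytes of NameNode heap; the constant is a rough heuristic, not an exact figure:

```python
def namenode_heap_gb(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode heap estimate: every file and every block is an
    in-memory object (~150 bytes each is a widely quoted rule of thumb)."""
    objects = num_files * (1 + blocks_per_file)   # file entries + block entries
    return objects * bytes_per_object / 1024**3

# 1 billion small files, one block each -- far beyond a comfortable JVM heap
print(round(namenode_heap_gb(1_000_000_000), 1))
```

A billion small files works out to roughly 280 GB of heap under this heuristic, which is why small files cripple HDFS and why Ozone moves this metadata out of a single JVM.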

At present, Ozone is still in technical preview. It is worth a basic look if you have the time, but it should not yet be used in a real production environment.

CDP7 summary

As many of you know, Cloudera and Hortonworks, the two major Hadoop providers, merged during the 2018 National Day holiday.

In the HDP cluster era, Hive on Tez with ORC storage plus Spark SQL could already deliver what many vendors now call HTAP on a big data platform: Sqoop + Pig + Tez + Hive + Oozie + Falcon for batch, and Phoenix + HBase + Kafka + Spark for streaming. At that time I thought ORC + Hive + HPL/SQL was really good, much better than Vertica.

When I first came into contact with a CDH cluster and used Impala, large numbers of tasks often blew out the memory, and I later learned about Impala tuning. Impala is read-only for the ORC format, but Cloudera also offers Kudu as their middle ground between OLAP and OLTP, and it works well in practice.

Half a year after the merger, CDF appeared, and it turned out to be NiFi.

This time, CDP7 is not as exciting as the upgrade from CDH5 to CDH6. CDH6 is currently good to use, and the occasional small bug can be tolerated. Many CDH5 component versions are simply too old, yet many enterprises have not dared to upgrade their online CDH5 clusters for years. Of course, you can install higher versions of individual applications yourself, but then you forgo the vendor's support. Overall, what CDP7 says compared with CDH6 boils down to one phrase: "to the cloud!"

There has been a lot of talk about the cloud: public cloud, private cloud, and on-premises deployment.

Abroad there are AWS, Azure, and Google; in China there are Alibaba, Tencent, and Huawei. Unfortunately, CDP7 currently supports only AWS and Azure and has announced no cooperation with domestic cloud vendors, so domestic public cloud is off the table in the short term.

Ozone will not be supported until CDP7.2.

As for private cloud, the prerequisite for a CDP7 private cloud is deploying the CDP7 data center edition, whose installation and deployment are similar to CDH6: deployed locally, with HDFS as the storage layer and no Ozone object storage support. So what advantages does the private cloud deployment offer over the data center edition alone, DevOps or storage-compute separation? Whether those benefits are worth the extra bandwidth consumed deserves a second thought.

Finally, the data center edition, CDP-DC (essentially CDH7), is a component upgrade of CDH6 in which some CDH components are replaced with HDP components: for example, Sentry becomes Ranger, Navigator becomes Atlas, and Hive 2 is upgraded to Hive 3. The rest of the components are much the same, not very different from CDH6.3.

During an upgrade, how are Sentry permissions synchronized to Ranger, and what are the risks? Currently, only upgrading from CDH5 to CDP7 is supported; upgrading from CDH6 to CDP7 is not.

HDFS has long been criticized for its small-file problem, and Ozone object storage arrives a bit late. CDH and HDP fought for a long time and each built a lot of wheels; after the merger they began replacing the overlapping functions. I really want to ask: why does big data have so many wheels? DeltaLake, Hudi, and Iceberg have not been sorted out yet, and plenty of HTAP database vendors keep emerging. Why so many overlapping components, alas! May the Hadoop ecosystem get better and cleaner.

The most important point: CDP component code is not available on GitHub. It is no longer open source, and there is no community edition after CDP7.