CDH offline installation

I actually spent a long time with CDH three years ago. At the time I planned to use it as the company's big data platform, but I was not skilled enough with it then, so I later switched to Ambari.

Cloudera recently merged with Hortonworks, the company behind HDP, which is the all-in-one big data distribution I have been using. This time I want to try CDH for a change.

There are only two test machines, ES01 and ES02. They are what we used to call the elastic cluster.

Before the installation

Version compatibility between Cloudera Manager and CDH must be considered.

For clusters of 3 to 10 machines, Cloudera recommends planning the roles as follows. With only two machines, I will just install ZooKeeper and CM to get a feel for it.

Hardware requirements

The hard disk resources required by CM

Cloudera gives suggested memory and CPU values for each role (Agent, DataNode, HBase, and so on), which is very thorough. I don't know whether HDP lacks these recommendations or I was simply too careless to find them. CDH is very friendly here.

I’m not going to focus on that. After all, it’s just a test environment.

Software requirements

Dependent packages

  • Python: Cloudera Enterprise 6, with the exception of Hue, is supported on the Python version included in the operating system by default, and on later versions, but is not compatible with Python 3.0 or later. For example, Cloudera Enterprise 6 requires Python 2.6 or higher on RHEL 6 compatible operating systems, but Python 2.7 or higher on RHEL 7 compatible operating systems. Hue in CDH 6 requires Python 2.7 or higher on all operating systems. On a RHEL 6 compatible operating system running Hue, you must install Python 2.7 manually. Spark 2 requires Python 2.7 or higher. If the correct Python version is not selected by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executables before running the pyspark command. Python 3 is not supported.
  • Perl – Cloudera Manager requires Perl
  • python-psycopg2: Cloudera Manager 6 depends on the python-psycopg2 package. Hue in CDH 6 requires a higher version of psycopg2 than the Cloudera Manager dependency requires. For more information, see Installing the psycopg2 Python package.
  • iproute: Cloudera Enterprise 6 depends on the iproute package. This package is required on every host that runs Cloudera Manager Agent. The required version varies by operating system. CentOS 7.6 is used this time, and the corresponding version is iproute-3.10.
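
The Python rule above can be checked with a small helper. This is only a sketch: the function name python_version_ok is made up for illustration, and it encodes the RHEL 7 case only (2.7 or higher, never Python 3).

```bash
# Returns success if a version string such as "2.7.5" satisfies the
# "2.7 or higher, but not Python 3" rule described above for RHEL 7.
# python_version_ok is a hypothetical helper name, not a Cloudera tool.
python_version_ok() {
    major=${1%%.*}      # text before the first dot, e.g. "2"
    rest=${1#*.}        # text after the first dot, e.g. "7.5"
    minor=${rest%%.*}   # e.g. "7"
    [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]
}

# Example: check the interpreter on this host, if one is on the PATH.
if command -v python >/dev/null 2>&1; then
    # "python --version" prints e.g. "Python 2.7.5" (Python 2 writes it to stderr).
    found=$(python --version 2>&1 | awk '{print $2}')
    if python_version_ok "$found"; then
        echo "python $found is acceptable"
    else
        echo "python $found is not acceptable for CDH 6 on RHEL 7"
    fi
fi
```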

Operating system requirements

File system requirements

The Hadoop Distributed File System (HDFS) is designed to run on top of the underlying file system in the operating system. Cloudera recommends that you use any of the following file systems on the supported operating systems

  • ext3: This is the most tested underlying file system for HDFS.
  • ext4: This scalable extension of ext3 is supported in more recent Linux releases. Cloudera does not support in-place upgrades from ext3 to ext4. Cloudera recommends that you format disks as ext4 before using them as data directories.
  • XFS: This is the default file system in RHEL 7.
  • S3: Amazon Simple Storage Service

CDH also includes Kudu, a storage engine developed by Cloudera; it requires the file system to be ext4 or XFS.

In addition, CDH recommends tuning your file system

File access time

The Linux file system keeps metadata that records when each file was last accessed. This means that even a read results in a write to disk. To speed up file reads, Cloudera recommends that you disable this behavior, called atime, using the noatime mount option in /etc/fstab.

/dev/sdb1 /data1 ext4 defaults,noatime 0

To apply the change without rebooting:

mount -o remount /data1
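
To see which entries still lack noatime, something like the following sketch may help. The function name mounts_without_noatime is my own, and the demo runs against a temporary file rather than the real /etc/fstab. Note that it lists every entry without the option, including swap or tmpfs lines that do not need it.

```bash
# Prints the mount points in an fstab-style file whose options lack "noatime".
# $1 is the path to the file; comment lines and blank lines are skipped.
# mounts_without_noatime is a hypothetical helper name.
mounts_without_noatime() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }   # skip comments and short lines
        {
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") found = 1
            if (!found) print $2
        }
    ' "$1"
}

# Demo on a throwaway file so nothing on the real system is touched.
demo=$(mktemp)
printf '%s\n' \
    '# device  mountpoint  type  options' \
    '/dev/sdb1 /data1 ext4 defaults,noatime 0' \
    '/dev/sdc1 /data2 ext4 defaults 0' > "$demo"
mounts_without_noatime "$demo"    # prints /data2
rm -f "$demo"
```

On a real host the argument would be /etc/fstab.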

File system mount options

File system mount options include a sync option that makes writes synchronous. Using the sync mount option degrades the performance of services that write data to disk, such as HDFS, YARN, Kafka, and Kudu. In CDH, most writes are already replicated, so synchronous writes to disk are unnecessary, expensive, and do not measurably improve stability. NFS and NAS mounts are not supported as DataNode data directories, even with the hierarchical storage feature.

Nproc configuration

Cloudera Manager automatically sets the nproc configuration in /etc/security/limits.conf, but this configuration can be overridden by individual files in /etc/security/limits.d/. This can cause problems with Apache Impala and other components. Be sure to set the nproc limit high enough, such as 65536 or 262144.
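
A quick way to spot an override that is too low is to scan the limits files for nproc entries under the chosen threshold. The sketch below assumes the usual "domain type item value" layout of limits.conf; the function name low_nproc_entries is made up for illustration.

```bash
# Lists nproc entries below a threshold in limits.conf-style files.
# Usage: low_nproc_entries THRESHOLD FILE...
# Each match is printed as "file: domain type value"; non-numeric values
# such as "unlimited" are ignored. low_nproc_entries is a hypothetical name.
low_nproc_entries() {
    threshold=$1
    shift
    awk -v min="$threshold" '
        /^[[:space:]]*#/ || NF < 4 { next }
        $3 == "nproc" && $4 ~ /^[0-9]+$/ && $4 + 0 < min + 0 {
            print FILENAME ": " $1, $2, $4
        }
    ' "$@"
}

# Read-only scan of the files mentioned above, skipping any that do not exist.
for f in /etc/security/limits.conf /etc/security/limits.d/*.conf; do
    if [ -f "$f" ]; then
        low_nproc_entries 65536 "$f"
    fi
done
```

Any line it prints is a candidate for raising to 65536 or higher.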

Database requirements

I use MySQL here. If you need another database, see the official documentation.

Java version requirements

Configure the network name

sudo hostnamectl set-hostname es01.ljktest.com

[root@es01 ~]# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
::1 localhost.localdomain localhost
172.17.0.11 ES01.ljktest.com ES01
172.16.0.11 ES02.ljktest.com ES02

Edit /etc/sysconfig/network and set HOSTNAME to this host's FQDN only

[root@es01 ~]# cat /etc/sysconfig/network
# Created by cloud-init on instance boot automatically, do not edit.
#
NETWORKING=yes
HOSTNAME=es01.ljktest.com

Disable the firewall

sudo systemctl disable firewalld
sudo systemctl stop firewalld

Set the SELinux mode

  1. Check the SELinux status with getenforce. If the output is Permissive or Disabled, you can skip this task. If the output is Enforcing, proceed to the next step.
  2. Open the /etc/selinux/config file (on some systems, the /etc/sysconfig/selinux file). Change SELINUX=enforcing to SELINUX=permissive
  3. Restart the system, or run the following command to switch SELinux to permissive mode immediately

    setenforce 0
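
Step 2 can also be done non-interactively with sed. This is a sketch rather than the documented procedure: set_selinux_permissive is a name I made up, and the demo edits a temporary file instead of /etc/selinux/config.

```bash
# Rewrites SELINUX=enforcing to SELINUX=permissive in the given config file,
# keeping a ".bak" copy. Other lines (including SELINUXTYPE=) are left alone.
# set_selinux_permissive is a hypothetical helper name.
set_selinux_permissive() {
    sed -i.bak 's/^SELINUX=enforcing[[:space:]]*$/SELINUX=permissive/' "$1"
}

# Demo on a temporary file; on a real host the argument would be
# /etc/selinux/config and the command would need root.
demo=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$demo"
set_selinux_permissive "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```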

Enable the NTP service

This assumes the hosts can reach the external network. On an intranet you need your own internal NTP server. You can also synchronize time with chrony as described in my other post; that keeps the time consistent within the cluster, but it may differ from the time of a real NTP server.

  1. Install the NTP package

    yum install ntp

  2. Edit the /etc/ntp.conf file to add an NTP server, as shown in the following example.

    server 0.pool.ntp.org
    server 1.pool.ntp.org
    server 2.pool.ntp.org

  3. Start the ntpd service

    systemctl start ntpd

  4. Enable the ntpd service to start at boot

    systemctl enable ntpd

  5. Synchronize the system clock to the NTP server

    ntpdate -u <ntp_server>

  6. Synchronize the hardware clock with the system clock

    hwclock --systohc

Development

Use the CDH 6 Maven repository

Configure the local package repository

Install httpd

sudo yum install httpd

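Edit the Apache configuration file (/etc/httpd/conf/httpd.conf by default) and add or edit the following line in the <IfModule mime_module> section, so that parcel files are served with the correct type:

AddType application/x-gzip .gz .tgz .parcel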

sudo systemctl start httpd

Download the CM and CDH packages

The official instructions are to run the following commands on the server where httpd is installed

  • CM
sudo mkdir -p /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/cm6/6.3.0/redhat7/ -P /var/www/html/cloudera-repos
sudo wget https://archive.cloudera.com/cm6/6.3.0/allkeys.asc -P /var/www/html/cloudera-repos/cm6/6.3.0/

sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cm6

  • CDH
sudo mkdir -p /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/cdh6/6.3.0/redhat7/ -P /var/www/html/cloudera-repos

sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/gplextras6/6.3.0/redhat7/ -P /var/www/html/cloudera-repos

sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cdh6
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/gplextras6

I downloaded the packaged archives directly with a download tool, because fetching everything with the commands above was too slow. If your network speed is fine, I recommend the official method, mainly because it is harder to get wrong.

Downloading manually, you end up with the following files (CM package, CDH parcel, GPL Extras parcel):

cm6.3.0-redhat7.tar.gz
allkeys.asc
CDH-6.3.0-1.cdh6.3.0.p0.1279813-el7.parcel
CDH-6.3.0-1.cdh6.3.0.p0.1279813-el7.parcel.sha1
manifest.json
GPLEXTRAS-6.3.0-1.gplextras6.3.0.p0.1279813-el7.parcel
GPLEXTRAS-6.3.0-1.gplextras6.3.0.p0.1279813-el7.parcel.sha1
manifest.json

Note that the allkeys.asc file must not be missing; otherwise installing the agent reports an error.
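
Since the parcels were fetched by hand, it is worth comparing each one against its .sha1 file before serving it. The sketch below assumes the .sha1 file starts with the hex digest and that sha1sum (GNU coreutils) is available; verify_parcel is a name I made up.

```bash
# Compares a parcel's SHA-1 digest with the digest stored next to it in
# "<parcel>.sha1". Returns success only when they match.
# verify_parcel is a hypothetical helper name, not part of Cloudera Manager.
verify_parcel() {
    parcel=$1
    # Take the first field of the first line, in case a file name follows the digest.
    expected=$(awk '{print $1; exit}' "$parcel.sha1")
    actual=$(sha1sum "$parcel" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK   $parcel"
    else
        echo "FAIL $parcel"
        return 1
    fi
}

# Usage, run from the directory that holds the downloaded files:
#   verify_parcel CDH-6.3.0-1.cdh6.3.0.p0.1279813-el7.parcel
#   verify_parcel GPLEXTRAS-6.3.0-1.gplextras6.3.0.p0.1279813-el7.parcel
```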

Configure the internal repository

touch /etc/yum.repos.d/cloudera-repo.repo

Write the following into it:

[cloudera-cm6.3.0]
name=cloudera-cm6.3.0
baseurl=http://49.234.43.99/cloudera-repos/cm6.3.0
gpgcheck=0
enabled=1
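
Before running yum, it can save time to confirm that the baseurl really serves repository metadata. A yum repository exposes repodata/repomd.xml under its baseurl, so the sketch below derives that URL from a .repo file; repomd_urls is a made-up helper name.

```bash
# Prints the repomd.xml URL for every baseurl found in a .repo file.
# repomd_urls is a hypothetical helper name.
repomd_urls() {
    sed -n 's/^[[:space:]]*baseurl[[:space:]]*=[[:space:]]*//p' "$1" |
    while read -r url; do
        echo "${url%/}/repodata/repomd.xml"   # avoid a double slash
    done
}

# Usage: each URL should answer with an HTTP 200 if the repository is served correctly.
#   repomd_urls /etc/yum.repos.d/cloudera-repo.repo | xargs -n1 curl -fsSI
```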

Install JDK 1.8

With Oracle JDK installed, installing cloudera-manager-agent and cloudera-manager-server reported that the JDK version did not match.

So I had to use the OpenJDK that ships with the system repository.

yum install java-1.8.0-openjdk-devel

Install Cloudera Manager Packages

sudo yum install cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server

MySQL installation

I originally planned to install MySQL from a local package, but the CDH documentation is so considerate that it even covers installing MySQL. Let's follow the documentation.

  1. Install MySQL

    wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
    
    sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm
    
    sudo yum update
    
    sudo yum install mysql-server
  2. Move the old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location.

    mv ib_logfile0 ib_logfile1 ~/mysql_backup/

  3. Modify the MySQL configuration file my.cnf

    • To prevent deadlocks, set the isolation level to READ-COMMITTED.
    • Configure the InnoDB engine. Cloudera Manager will not start if its tables use the MyISAM engine. (In general, if the InnoDB engine is misconfigured, tables fall back to MyISAM.) To check which engine your tables use, run SHOW TABLE STATUS from the MySQL shell.
    • The default settings of MySQL installations in most distributions use conservative buffer sizes and memory usage. Cloudera Management Service roles need high write throughput because they may insert many records into the database. Cloudera recommends that you set the innodb_flush_method property to O_DIRECT.
    • Set the max_connections property based on the size of the cluster

      Fewer than 50 hosts: you can store more than one database on the same host (for example, both the Activity Monitor and the Service Monitor). If you do this, put each database on its own storage volume, and allow a maximum of 100 connections per database plus 50 extra connections. For example, for two databases, set the maximum number of connections to 250. If you store five databases on a single host (the Cloudera Manager Server, Activity Monitor, Reports Manager, Cloudera Navigator, and Hive Metastore databases), set the maximum number of connections to 550.

Finally, the CDH documentation provides a suggested configuration, which is really nice.


```
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
symbolic-links = 0

key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1

max_connections = 550
#expire_logs_days = 10
#max_binlog_size = 100M

#log_bin should be on a disk with enough free space.
#Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your
#system and chown the specified folder to the mysql user.
log_bin=/var/lib/mysql/mysql_binary_log

#In later versions of MySQL, if you enable the binary log and do not set
#a server_id, MySQL will not start. The server_id must be unique within
#the replicating group.
server_id=1

binlog_format = mixed

read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M

# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit  = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

sql_mode=STRICT_ALL_TABLES
```
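
The max_connections = 550 value in this configuration follows the sizing rule quoted earlier: 100 connections per database plus 50 extra. As a small worked example (the function name max_connections_for is my own):

```bash
# max_connections for a host that stores N Cloudera databases:
# 100 connections per database plus 50 extra.
# max_connections_for is a hypothetical helper name.
max_connections_for() {
    echo $(( $1 * 100 + 50 ))
}

max_connections_for 2    # prints 250
max_connections_for 5    # prints 550
```

For this test setup, which only creates the scm database, the same rule would give 150, so 550 leaves plenty of headroom.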
  4. Start MySQL

    systemctl start mysqld

  5. Set the MySQL root password and other security options

    sudo /usr/bin/mysql_secure_installation

    The CDH documentation is considerate enough to list the prompts you will encounter.

    [...]
    Enter current password for root (enter for none):
    OK, successfully used password, moving on...
    [...]
    Set root password? [Y/n] Y
    New password:
    Re-enter new password:
    Remove anonymous users? [Y/n] Y
    [...]
    Disallow root login remotely? [Y/n] N
    [...]
    Remove test database and access to it [Y/n] Y
    [...]
    Reload privilege tables now? [Y/n] Y
    All done!

  6. Install the MySQL JDBC driver

    Install the JDBC driver on the Cloudera Manager Server host and on any other host running services that require database access. For more information about Cloudera software that uses databases, see Required Databases. Cloudera recommends only using version 5.1 of the JDBC driver.

    wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz

    tar zxvf mysql-connector-java-5.1.46.tar.gz

    Copy the JDBC driver, renamed, to /usr/share/java/. If the destination directory does not already exist, create it. For example:

    sudo mkdir -p /usr/share/java/
    cd mysql-connector-java-5.1.46
    sudo cp mysql-connector-java-5.1.46-bin.jar /usr/share/java/mysql-connector-java.jar

Create a database for Cloudera software

In theory, you need to create a database for every component of your CDH deployment that requires one. For now, let's simply create the one for Cloudera Manager Server.

  1. Log in to MySQL

    mysql -u root -p

  2. Create the database

    CREATE DATABASE scm DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

    GRANT ALL ON scm.* TO 'scm'@'%' IDENTIFIED BY 'scm';

  3. Verify that the above command was successfully executed

    SHOW DATABASES;

    SHOW GRANTS FOR 'scm'@'%';
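
If you later need databases for more components, the same two statements can be generated in a loop instead of typed by hand. This is only a sketch: print_db_sql is a made-up helper, and the database names, users, and passwords in the example call are placeholders.

```bash
# Prints CREATE DATABASE and GRANT statements for each "db:user:password"
# triple given as an argument, using the same form as the scm example above.
# print_db_sql is a hypothetical helper name.
print_db_sql() {
    for spec in "$@"; do
        db=${spec%%:*}       # part before the first colon
        rest=${spec#*:}      # "user:password"
        user=${rest%%:*}
        pass=${rest#*:}
        echo "CREATE DATABASE $db DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;"
        echo "GRANT ALL ON $db.* TO '$user'@'%' IDENTIFIED BY '$pass';"
    done
}

# Placeholder names and passwords, for illustration only.
print_db_sql amon:amon:changeme rman:rman:changeme
```

The output can be reviewed first and then piped into mysql -u root -p.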

Set up the Cloudera Manager database

Cloudera Manager Server includes a script that creates and configures a database for itself. The script can:

  • Create Cloudera Manager Server database configuration file.
  • (MariaDB, MySQL, and PostgreSQL) Create and configure databases for Cloudera Manager Server to use.
  • (MariaDB, MySQL, and PostgreSQL) Create and configure user accounts for Cloudera Manager Server.

/opt/cloudera/cm/schema/scm_prepare_database.sh mysql scm scm

Install CDH and other software

  1. Start cloudera-scm-server

    sudo systemctl start cloudera-scm-server

  2. Wait a few minutes for Cloudera Manager Server to start. To observe the startup process, run the following command on the Cloudera Manager Server host

    sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log

    When you see this log entry, the Cloudera Manager administration console is ready

    INFO WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.

  3. In your web browser, go to http://<server_host>:7180, where <server_host> is the FQDN or IP address of the host running Cloudera Manager Server.
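
Instead of watching the log by eye in step 2, a script can wait for the "Started Jetty server" entry. The following is a minimal sketch; wait_for_log is a name I made up, and the 600 second timeout in the usage line is an arbitrary choice.

```bash
# Waits until a log file contains the given text, checking once per second.
# Usage: wait_for_log FILE TEXT TIMEOUT_SECONDS
# Returns success as soon as the text appears, failure on timeout.
# wait_for_log is a hypothetical helper name.
wait_for_log() {
    file=$1 text=$2 remaining=$3
    while [ "$remaining" -gt 0 ]; do
        if [ -f "$file" ] && grep -qF -- "$text" "$file"; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}

# Usage on the Cloudera Manager Server host:
#   wait_for_log /var/log/cloudera-scm-server/cloudera-scm-server.log \
#       'Started Jetty server' 600 && echo "admin console is ready"
```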

Distribution of the CDH and GPL Extras parcels.

So far, CDH has been successfully installed offline.

Some warnings during installation in the web interface

Generally you will get two warning messages

  • Cloudera recommends setting /proc/sys/vm/swappiness to a maximum of 10. It is currently set to 30. Use the sysctl command to change this setting at run time, and edit /etc/sysctl.conf to keep the setting after a reboot. You can proceed with the installation, but Cloudera Manager may report that your hosts are unhealthy because of swapping. The following hosts will be affected

    echo 'vm.swappiness=10'>> /etc/sysctl.conf

    sysctl -p

  • Transparent Huge Page compaction is enabled and can cause significant performance problems. Run "echo never > /sys/kernel/mm/transparent_hugepage/defrag" and "echo never > /sys/kernel/mm/transparent_hugepage/enabled" to disable it, and then add the same commands to an init script such as /etc/rc.local so that they are applied again when the system reboots. The following hosts will be affected
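
After applying both fixes, the current values can be read back. The transparent_hugepage files mark the active choice with square brackets (for example "always madvise [never]"), so the sketch below extracts it; thp_active_value is a made-up helper name, and the script only reads files.

```bash
# Prints the active value from a transparent_hugepage file, whose content
# looks like "always madvise [never]" with the active choice in brackets.
# thp_active_value is a hypothetical helper name.
thp_active_value() {
    sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p' "$1"
}

# Both files should report "never" once the warning is resolved.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_active_value "$f")"
    fi
done

# swappiness should be at most 10 according to the first warning.
if [ -r /proc/sys/vm/swappiness ]; then
    echo "vm.swappiness: $(cat /proc/sys/vm/swappiness)"
fi
```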

Conclusion

Compared to Ambari, I personally feel that CDH is friendlier in terms of documentation quality.

Although I have not covered the internal mechanisms in detail, it is clear that the CDH documentation is more detailed and walks through more of the steps.

Of course CDH is not open source, but it seems possible to add custom services as well. I will try this later.

Appendix