In this chapter, we formally set up the big data environment. The goal is a stable environment that can be operated, maintained, and monitored. We will use Ambari to build the underlying Hadoop environment, and install the real-time computing components such as Flink, Druid, and Superset natively. In other words, we build the environment with a combination of a big data provisioning tool and native installation.

Using Ambari to build the underlying big data environment

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog, and manages them all from one central place. It is one of the top Hadoop management tools.

Ambari is currently at version 2.7, and more and more components are supported.

There are many distributions of Hadoop, including the Huawei distribution, the Intel distribution, Cloudera's distribution (CDH), the MapR distribution, and the Hortonworks distribution. All of them are derived from Apache Hadoop; these distributions exist because the Apache Hadoop open source license allows anyone to modify it and redistribute or sell it as an open source or commercial product.

Paid versions: paid versions usually add extra features. Most of the distributions issued by domestic companies, such as those from Intel and Huawei, are charged for.

Free versions: there are three free versions (all from foreign vendors): Apache Foundation Hadoop, Cloudera's Distribution Including Apache Hadoop (CDH), and the Hortonworks Data Platform (HDP). Although CDH and HDP have commercial offerings, they are open source and only charge for services, so strictly speaking they are not paid versions.

Ambari is installed on top of HDP, but each Ambari version maps to particular HDP versions.

For example, Ambari 2.7 pairs with HDP 3.x and HDF 3.x.

The latest version is HDP 3.1.5, which contains the basic big data components (HDFS, YARN, MapReduce, Hive, HBase, ZooKeeper, Kafka, Spark, and so on).

The component list is already very rich, so let's start the Ambari installation.

Preparation

The preliminary preparation is divided into four parts:

host, database, browser, and JDK.

Hosts

First prepare the hosts that Ambari will be installed on. Three machines are fine for a development environment; for other environments, size the cluster according to the company's machine scale.

Suppose the three machines in the development environment are:

192.168.12.101 master
192.168.12.102 slave1
192.168.12.103 slave2

The minimum requirements for hosts are as follows:

Software requirements

On each host:

  • yum and rpm (RHEL/CentOS/Oracle/Amazon Linux)
  • zypper and php_curl (SLES)
  • apt (Debian/Ubuntu)
  • scp, curl, unzip, tar, wget, and gcc*
  • OpenSSL (v1.01, build 16 or later)
  • Python (with python-devel*)

The Ambari host should have at least 1 GB of RAM and 500 MB of free space.

To check the available memory on any host, run:

free -m
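Likewise, to check the available disk space (the 500 MB requirement above), you can run:

df -h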

Local repository

If the connection is not fast enough, we can download the packages and set up a local repository. If the Internet connection is fast enough, skip this step.

Download the installation package first

Install the HTTPD service

yum install yum-utils createrepo
[root@master ~]# yum -y install httpd
[root@master ~]# service httpd restart
Redirecting to /bin/systemctl restart httpd.service
[root@master ~]# chkconfig httpd on

Then set up a local YUM source

mkdir -p /var/www/html/

Unzip the package you just downloaded into this directory.
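For example, a sketch of unpacking into the web root (the archive names below are assumptions; substitute the files you actually downloaded):

tar -zxvf ambari-2.7.5.0-centos7.tar.gz -C /var/www/html/
tar -zxvf HDP-3.1.5.0-centos7-rpm.tar.gz -C /var/www/html/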

Verify through the browser that the repository is accessible.

createrepo ./

# Modify the source addresses in the local repo files
vi ambari.repo
vi hdp.repo

#VERSION_NUMBER=2.7.5.0-72
[ambari-2.7.5.0]
#json.url = http://public-repo-1.hortonworks.com/HDP/hdp_urlinfo.json
name=ambari Version - ambari-2.7.5.0
baseurl=https://username:password@archive.cloudera.com/p/ambari/centos7/2.x/updates/2.7.5.0
gpgcheck=1
gpgkey=https://username:password@archive.cloudera.com/p/ambari/centos7/2.x/updates/2.7.5.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

[root@master ambari]# yum clean all
[root@master ambari]# yum makecache
[root@master ambari]# yum repolist

Software preparation

To make future management easier, we need to do some configuration on the machines.

Install the JDK. Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

rpm -ivh jdk-8u161-linux-x64.rpm
java -version
Change the machine name via vi /etc/hostname, mainly so that we can find the corresponding server by name. Rename the nodes to master, slave1, and slave2.

vi /etc/hosts
192.168.12.101 master
192.168.12.102 slave1
192.168.12.103 slave2

vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=master
Disable the firewall:

[root@master ~]# systemctl disable firewalld
[root@master ~]# systemctl stop firewalld
Set up passwordless SSH:

ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub remote-host
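A small sketch to distribute the key to all three nodes at once, using the hostnames configured above:

for host in master slave1 slave2; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub root@$host
done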

Different environments will run into different problems; refer to the official installation manual for the corresponding steps.

Install ambari-server

ambari-server is what will finally take us through the installation of the big data cluster.

yum install ambari-server

Installing : postgresql-libs-9.2.18-1.el7.x86_64      1/4
Installing : postgresql-9.2.18-1.el7.x86_64           2/4
Installing : postgresql-server-9.2.18-1.el7.x86_64    3/4
Installing : ambari-server-2.7.5.0-124.x86_64         4/4
Verifying  : ambari-server-2.7.5.0-124.x86_64         1/4
Verifying  : postgresql-9.2.18-1.el7.x86_64           2/4
Verifying  : postgresql-server-9.2.18-1.el7.x86_64    3/4
Verifying  : postgresql-libs-9.2.18-1.el7.x86_64      4/4

Installed:
  ambari-server.x86_64 0:2.7.5.0-72

Dependency Installed:
  postgresql.x86_64 0:9.2.18-1.el7
  postgresql-libs.x86_64 0:9.2.18-1.el7
  postgresql-server.x86_64 0:9.2.18-1.el7

Startup and Setup

Run the setup:

ambari-server setup

Using the embedded PostgreSQL directly is not recommended; since other services already use MySQL, we will use MySQL instead.

Configure the MySQL installation:

yum install wget -y
wget -i -c http://dev.mysql.com/get/mysql57-community-release-el7-10.noarch.rpm
rpm -ivh mysql57-community-release-el7-10.noarch.rpm
yum -y install mysql-community-server
systemctl enable mysqld

systemctl start mysqld.service

systemctl status mysqld.service

grep "password" /var/log/mysqld.log

mysql -uroot -p

set global validate_password_policy=0;

set global validate_password_length=1;

set global validate_password_special_char_count=0;

set global validate_password_mixed_case_count=0;

set global validate_password_number_count=0;

select @@validate_password_policy,@@validate_password_special_char_count,@@validate_password_mixed_case_count,@@validate_password_number_count,@@validate_password_length;

ALTER USER 'root'@'localhost' IDENTIFIED BY 'password';

grant all privileges on *.* to 'root'@'%' identified by 'password' with grant option;

flush privileges;

exit

yum -y remove mysql57-community-release-el7-10.noarch

# Place the JDBC driver at /opt/ambari/mysql-connector-java-5.1.48.jar, then:

mysql -uroot -p

create database ambari;
use ambari;
source /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql

CREATE USER 'ambari'@'localhost' IDENTIFIED BY 'bigdata';
CREATE USER 'ambari'@'%' IDENTIFIED BY 'bigdata';
GRANT ALL PRIVILEGES ON ambari.* TO 'ambari'@'localhost';
GRANT ALL PRIVILEGES ON ambari.* TO 'ambari'@'%';
FLUSH PRIVILEGES;

Now we can complete the Ambari configuration:

[root@localhost download]# ambari-server setup
Using python  /usr/bin/python
Setup ambari-server
Checking SELinux...
SELinux status is 'enabled'
SELinux mode is 'permissive'
WARNING: SELinux is set to 'permissive' mode and temporarily disabled.
OK to continue [y/n] (y)? y
Customize user account for ambari-server daemon [y/n] (n)? y
Enter user account for ambari-server daemon (root):
Adjusting ambari-server permissions and ownership...
Checking firewall status...
Checking JDK...
[1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
[2] Custom JDK
==============================================================================
Enter choice (1): 2
WARNING: JDK must be installed on all hosts and JAVA_HOME must be valid on all hosts.
WARNING: JCE Policy files are required for configuring Kerberos security. If you plan to use Kerberos, please make sure JCE Unlimited Strength Jurisdiction Policy Files are valid on all hosts.
Path to JAVA_HOME: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre
Validating JDK on Ambari Server... done.
Check JDK version for Ambari Server...
JDK version found: 8
Minimum JDK version is 8 for Ambari. Skipping to setup different JDK for Ambari Server.
Checking GPL software agreement...
GPL License for LZO: https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Enable Ambari Server to download and install GPL Licensed LZO packages [y/n] (n)? y
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? y
Configuring database...
==============================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
[3] - MySQL / MariaDB
[4] - PostgreSQL
[5] - Microsoft SQL Server (Tech Preview)
[6] - SQL Anywhere
[7] - BDB
==============================================================================
Enter choice (1): 3
Hostname (localhost):
Port (3306):
Database name (ambari):
Username (ambari):
Enter Database Password (bigdata):
Configuring ambari database...
Enter full path to custom jdbc driver:
Copying /opt/ambari/mysql-connector-java-5.1.48.jar to /usr/share/java
Configuring remote database connection properties...
WARNING: Before starting Ambari Server, you must run the following DDL directly from the database shell to create the schema: /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
Proceed with configuring remote database connection properties [y/n] (y)? y
Extracting system views...
...
Ambari repo file contains latest json url http://public-repo-1.hortonworks.com/HDP/hdp_urlinfo.json, updating stacks repoinfos with it...
Adjusting ambari-server permissions and ownership...
Ambari Server 'setup' completed successfully.

Then you can start

ambari-server start

ambari-server status

ambari-server stop

Visit the following address

http://<your.ambari.server>:8080

Cluster installation

Next, install the cluster. This includes naming the cluster, configuring SSH, selecting the version, and planning the service layout.

With the cluster installation finally complete, we can manage our cluster on the page.

For the detailed installation walkthrough, reply "Ambari" to the "Real-time Streaming Computing" WeChat public account.

Real-time computing environment construction

Because the Druid version Ambari supports is dated and Flink is not supported at all yet, the real-time computing components other than Kafka need to be installed manually, which also makes future upgrades easier.

Install Flink on your Linux system

Cluster installation

Cluster installation consists of the following steps:

1. Copy the extracted Flink directory to each machine.

2. Modify conf/flink-conf.yaml on all machines:

jobmanager.rpc.address: master

3. Modify conf/slaves and list all the worker nodes:

work01
work02

4. Start the cluster on master

bin/start-cluster.sh
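To verify the cluster came up, you can query the JobManager's web UI, which listens on port 8081 by default (a quick check, assuming the master hostname above):

curl http://master:8081/overview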

Installing on Hadoop

We can choose to have Flink run on a Yarn cluster.

Download the Flink for Hadoop package

Make sure HADOOP_HOME is set correctly

Start bin/yarn-session.sh
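A sketch of starting a YARN session; the HADOOP_HOME path is an assumption, and the memory flags are illustrative (available flags vary by Flink version):

export HADOOP_HOME=/opt/hadoop   # assumed Hadoop install path
bin/yarn-session.sh -jm 1024m -tm 2048m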

Run the Flink sample program

Batch example:

Submit Flink's batch example program:

bin/flink run examples/batch/WordCount.jar

The example counts the number of occurrences of each word in the default input data set.

$ bin/flink run examples/batch/WordCount.jar
Starting execution of program
Executing WordCount example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
(a,5)
(action,1)
(after,1)
(against,1)
(all,2)
(and,12)
(arms,1)
(arrows,1)
(awry,1)
(ay,1)

Druid Cluster deployment

Deployment recommendations

The allocation of cluster deployment is as follows:

  • The Coordinator and Overlord processes are deployed on the master node
  • The two data nodes run the Historical and MiddleManager processes
  • A query node deploys the Broker and Router processes

We can add more master nodes and query nodes in the future

8 vCPUs and 32 GB of memory are recommended for the master node

The configuration files are located in:

conf/druid/cluster/master

Data Node Suggestion

16 vCPU 122GB memory 2 x 1.9TB SSD

The configuration files are located in:

conf/druid/cluster/data

For the query nodes, 8 vCPUs and 32 GB of memory are recommended

The configuration files are located in:

conf/druid/cluster/query

Start the deployment

Download the latest 0.17.0 release

Unpack it:

tar -xzf apache-druid-0.17.0-bin.tar.gz
cd apache-druid-0.17.0

The main configuration files for cluster mode are located in:

conf/druid/cluster

Configure the metadata store

conf/druid/cluster/_common/common.runtime.properties

and replace the following properties:

druid.metadata.storage.connector.connectURI
druid.metadata.storage.connector.host

For example, to configure MySQL as the metadata store:

First, execute the following statements in MySQL:

-- create a druid database, make sure to use utf8mb4 as encoding
CREATE DATABASE druid DEFAULT CHARACTER SET utf8mb4;

-- create a druid user
CREATE USER 'druid'@'localhost' IDENTIFIED BY 'druid';

-- grant the user all the permissions on the database we just created
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'localhost';

Configure it in Druid

druid.extensions.loadList=["mysql-metadata-storage"]
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://<host>/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid
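Note that the mysql-metadata-storage extension does not bundle MySQL's JDBC driver, so the connector jar has to be placed into the extension directory; a sketch, reusing the jar we downloaded for Ambari above:

cp /opt/ambari/mysql-connector-java-5.1.48.jar extensions/mysql-metadata-storage/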

Configuring deep storage

Configure the data store as S3 or HDFS

For example, configure HDFS

conf/druid/cluster/_common/common.runtime.properties
druid.extensions.loadList=["druid-hdfs-storage"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs

Put the Hadoop configuration XML files (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml) into Druid's

conf/druid/cluster/_common/

Configure the ZooKeeper connection

Modify the file under

conf/druid/cluster/_common/

and set

druid.zk.service.host

to the ZooKeeper server addresses.
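For example, a sketch assuming ZooKeeper runs on the three hosts of our Ambari cluster:

druid.zk.service.host=master:2181,slave1:2181,slave2:2181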

Start the cluster

Before starting, make sure the following ports are open between the nodes (a firewalld sketch follows this list):

The master node:

derby 1527

zk 2181

Coordinator 8081

Overlord 8090

Data node:

Historical 8083

MiddleManager 8091, 8100-8199

Query nodes:

Broker 8082

Router 8088
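If firewalld is enabled on the Druid hosts (earlier we disabled it on the Ambari machines), a sketch of opening the master-node ports; repeat for the data and query node ports accordingly:

firewall-cmd --permanent --add-port=8081/tcp
firewall-cmd --permanent --add-port=8090/tcp
firewall-cmd --reload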

Remember to copy the Druid directory you just configured to each node.
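A sketch of syncing the directory out, assuming Druid lives at /opt/apache-druid-0.17.0 on every node:

for host in slave1 slave2; do
  rsync -az /opt/apache-druid-0.17.0/ $host:/opt/apache-druid-0.17.0/
done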

Start the primary node

Since we are using an external ZooKeeper, we start the master with the no-zk script:

bin/start-cluster-master-no-zk-server

Start the data server

bin/start-cluster-data-server

Start the query server

bin/start-cluster-query-server

With that, the cluster has started successfully!

At this point, our big data environment is basically set up. In the next chapter, we will ingest data and start developing the profile tags. To be continued.


For more blog posts and news about real-time data analysis, please follow the "Real-time Streaming Computing" WeChat public account; for the Ambari official installation document in PDF, reply "Ambari" to the account.