About Hadoop 2.x: there is an IBM developerWorks article that covers it in great detail (see the article link). The most obvious improvements are the addition of the YARN resource manager to optimize resource allocation, and HA mode to prevent a single point of failure. You can apply for free cloud servers for a year or six months, or use virtual machines (VMs). About Hadoop 3.x: I am trying it purely to show off; refer to the official documentation for what changed. Remember: never use the latest version in production. Why? Take the Flink updates as an example. Hadoop is a big family, and what determines how "new" the family can be is not the latest component version but the version supported by the most conservative member of the family.

Preparation stage

  • Understand the Hadoop HA architecture

Active and standby NameNode: 1. The active NameNode serves client requests, while the standby NameNode synchronizes the active NameNode's metadata so it can take over on failover. 2. When the active NameNode's metadata changes, the changes are written to a shared edits storage system, and the standby NameNode merges them into its own memory. 3. All DataNodes send heartbeat information (block reports) to both NameNodes at the same time.

Two switchover modes are available: 1. Manual switchover: the active/standby switchover is triggered by commands, which is useful for HDFS upgrades (a sketch of the manual-failover commands follows the table below). 2. Automatic switchover: based on ZooKeeper. The ZooKeeper Failover Controller (ZKFC) registers the NameNode with ZooKeeper and monitors its health; when the active NameNode fails, the ZKFCs compete for a lock, and the NameNode whose ZKFC obtains the lock becomes active. Multiple JournalNodes (JNs) form a cluster (recommended). The basic principle is that edits are written to all JNs at the same time; an odd number of JNs is usually configured, and the more JNs, the better the fault tolerance. For example, with three JNs, a write succeeds once two JNs have acknowledged it, so at most one JN is allowed to fail. The node distribution across the four machines is as follows; 1 indicates that the service runs on that machine:

        NN   DN   ZK   ZKFC   JN   RM   DM
node1   1         1    1            1
node2   1    1    1    1       1         1
node3        1    1            1         1
node4        1                 1         1
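
As mentioned in the manual-switchover note above, once the cluster is running the hdfs haadmin command drives active/standby transitions. This is only a brief sketch using the nn1/nn2 names configured later in this article; with automatic failover enabled, transitions are normally left to the ZKFCs.

hdfs haadmin -getServiceState nn1    # prints active or standby
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2       # manually move the active role from nn1 to nn2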
  • Install the cluster OS: check your computer's configuration first. You need at least 3 virtual machines to have any fun; I used 4. Even at 2 GB of memory each that is 8 GB, so you had better have more than 12 GB of RAM and an i5-class processor, otherwise the VMs will lag badly and ruin the experience (of course, if you have several old computers at home, you can get better results by connecting them to a router and building a small LAN). Download VMware Workstation Pro and a Linux image (I used the CentOS 7 ISO, download link), install 4 CentOS 7 systems, set the hostnames to node1, node2, node3 and node4 respectively, and set the network connection mode to bridged; otherwise network problems become very troublesome. Note: in a Hadoop cluster the value of fs.defaultFS in core-site.xml cannot contain an underscore (_). Be careful with underscores in the other configuration files as well: use "-", not "_".
  • Find an easy-to-use remote Linux tool such as Xshell or MobaXterm. PuTTY is not recommended because it is not flexible enough. MobaXterm is recommended because it supports uploading and downloading files through its interface, which is very intuitive and makes moving small files between Linux and Windows convenient.
  • Configure password-free login in the cluster: edit the /etc/hosts file with vi and append the IP addresses and hostnames, so that nodes can be reached both by IP and by hostname. You can then run scp /etc/hosts node2:/etc/hosts to copy the file across servers and replace /etc/hosts on node2, node3 and node4.
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.10 node1
192.168.1.9  node2
192.168.1.11 node3
192.168.1.12 node4

Generate the public and private keys on every node in the cluster and copy the public keys of all nodes into the authorized_keys file as follows:

#######################################################################
# Run the following two commands on every node
cd ~                                  # go back to the home directory
ssh-keygen -t rsa -f ~/.ssh/id_rsa    # generate the private key id_rsa and the public key id_rsa.pub
cd .ssh                               # enter the .ssh folder
cat id_rsa.pub >> authorized_keys     # append the public key id_rsa.pub to the authorized_keys file in this folder
#######################################################################
# Run on node2, node3 and node4: copy each node's public key into authorized_keys on node1
ssh-copy-id -i node1
# On node1, check that the public keys of all nodes have been collected
cat /root/.ssh/authorized_keys
# Copy the merged authorized_keys from node1 to the other nodes
scp /root/.ssh/authorized_keys node2:/root/.ssh/
scp /root/.ssh/authorized_keys node3:/root/.ssh/
scp /root/.ssh/authorized_keys node4:/root/.ssh/
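
A quick way to confirm that password-free login works, assuming the node1–node4 hostnames from the /etc/hosts file above: each hop should print the remote hostname without prompting for a password.

# run from any node
for h in node1 node2 node3 node4; do
    ssh "$h" hostname    # should print the remote hostname with no password prompt
done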
  • Install the JDK

Hadoop 2.x goes with JDK 7 and Hadoop 3.x with JDK 8. Choose carefully, or the incompatible build will make you doubt life.

mkdir /usr/java                                      # create a root folder for Java
tar -zxvf jdk-8u211-linux-x64.tar.gz -C /usr/java    # extract the JDK tarball into /usr/java (adjust the file name to your download)
# The folder name after extracting the tarball is very long. For ease of use, create a soft
# link, similar to a shortcut in Windows:
ln -sf /usr/java/jdk1.8.0_211 /usr/java/jdk8
# Configure the environment variables:
vi /etc/profile
# append the following lines:
export JAVA_HOME=/usr/java/jdk1.8.0_211
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
source /etc/profile    # reload the environment variables
java -version          # run in any folder to check the Java version; if it prints the version number, the installation succeeded

Install Hadoop 3.x + HA (this section focuses on HDFS; the other components will be covered in later updates)

  • Create a new user

Create a myhadoop group and a myhadoop user and grant it sudo permission. In a normal production environment nobody hands out root, so get into the habit of not relying too much on root even when playing around. Granting sudo involves editing a file with vi; if you have any questions about vi, look up the common vi/vim commands in Linux.

sudo groupadd -g 9000 myhadoop                 # add the myhadoop group with gid 9000
sudo useradd myhadoop -g 9000 -u 100000 -m     # add the myhadoop user with uid 100000 and create its home directory under /home
sudo passwd myhadoop                           # set a password for myhadoop
# Grant sudo permission: edit /etc/sudoers, find the line "root ALL=(ALL) ALL",
# add "myhadoop ALL=(ALL) ALL" below it, then save and exit
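
A small check, as a sketch, that the sudoers entry took effect for the new user:

su - myhadoop    # switch to the new user
sudo -l          # should list (ALL) ALL for myhadoop
sudo whoami      # should print root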
  • The formal installation
wget http://mirror.bit.edu.cn/apache/hadoop/core/hadoop-3.2.0/hadoop-3.2.0.tar.gz    # download Hadoop 3.2.0
tar -zxvf hadoop-3.2.0.tar.gz -C /usr    # extract to /usr/hadoop-3.2.0
cd /usr/hadoop-3.2.0                     # all configuration files are in the hadoop directory under etc in this directory

Since Hadoop 2.x, all configuration files are under hadoop-xx.xx/etc/hadoop. 1) Run the vi hadoop-env.sh command to modify the hadoop-env.sh configuration file

export JAVA_HOME=/usr/java/jdk1.8.0_211

2) Run the vi hdfs-site.xml command to modify the hdfs-site.xml configuration file

<configuration>
    <!-- HA mode is a cluster mode. First give the nameservice (cluster) a name; I used a pinyin
         name. Note: the name cannot contain underscores (ruo_ying would be invalid). Once named,
         use this cluster name consistently wherever it appears below. -->
    <property>
        <name>dfs.nameservices</name>
        <value>ruoying</value>
    </property>
    <!-- nn1 and nn2 are logical NameNode names, not the machines' hostnames -->
    <property>
        <name>dfs.ha.namenodes.ruoying</name>
        <value>nn1,nn2</value>
    </property>
    <!-- RPC addresses: the RPC protocol transfers data within the Hadoop cluster -->
    <property>
        <name>dfs.namenode.rpc-address.ruoying.nn1</name>
        <value>node1:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ruoying.nn2</name>
        <value>node2:8020</value>
    </property>
    <!-- HTTP ports of node1 and node2, available for web queries -->
    <property>
        <name>dfs.namenode.http-address.ruoying.nn1</name>
        <value>node1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ruoying.nn2</name>
        <value>node2:50070</value>
    </property>
    <!-- JournalNodes run on node2, node3 and node4, service port 8485.
         The cluster name must match the nameservice defined in the first step. -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://node2:8485;node3:8485;node4:8485/ruoying</value>
    </property>
    <!-- Failover proxy provider; the cluster name must match the first step -->
    <property>
        <name>dfs.client.failover.proxy.provider.ruoying</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <!-- Configure sshfence and the private key file. Use the actual directory and file name
         of your own private key. -->
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
    </property>
    <!-- JournalNode working directory. Pick whatever directory you like, but never a temporary
         directory such as /tmp, because its data is lost when the OS restarts. -->
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/opt/journal/node/local/data</value>
    </property>
    <!-- Enable automatic failover -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
</configuration>
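
Once this file is in place under /usr/hadoop-3.2.0/etc/hadoop, a quick sanity check (a sketch, run on any configured node) is to ask hdfs getconf whether the nameservice was picked up:

cd /usr/hadoop-3.2.0
bin/hdfs getconf -confKey dfs.nameservices    # should print: ruoying
bin/hdfs getconf -namenodes                   # should print: node1 node2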

3) Run the vi core-site.xml command to modify the core-site.xml configuration file

<configuration>
    <!-- Hadoop entry point. Note that the NameNode is a cluster here, so you cannot write a
         fixed IP; use the nameservice name instead. -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ruoying</value>
    </property>
    <!-- ZooKeeper nodes and ports -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>node1:2181,node2:2181,node3:2181</value>
    </property>
    <!-- Working directory. It defaults to a temporary directory; change it to a persistent one. -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop3</value>
    </property>
    <!-- Recycle bin: a deleted HDFS file is kept in the trash for 1440 minutes and can be
         recovered with the mv command before it is purged. -->
    <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>
</configuration>
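
As a quick illustration of the recycle-bin setting, once the cluster is up a deleted file sits in the current user's trash for 1440 minutes before it is purged; the paths below are a sketch that assumes you run the commands as root.

hdfs dfs -put /etc/hosts /tmp/hosts_test
hdfs dfs -rm /tmp/hosts_test      # the file is moved to the trash, not deleted immediately
# recover it within the 1440-minute window
hdfs dfs -mv /user/root/.Trash/Current/tmp/hosts_test /tmp/hosts_test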

4) Configure yarn-site.xml. Pay special attention to the yarn.resourcemanager.ha.id property: its value is rm1 on node1 and rm2 on node2, and the other nodes do not need this property.

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>rmcluster</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>node1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>node2</value>
    </property>
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>node1:2181,node2:2181,node3:2181</value>
    </property>
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.id</name>
        <value>rm1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>
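
Once YARN is started later, you can confirm which ResourceManager is active with yarn rmadmin; a brief sketch using the rm1/rm2 ids configured above:

yarn rmadmin -getServiceState rm1    # e.g. active
yarn rmadmin -getServiceState rm2    # e.g. standby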

5) Configure the mapred-site.xml file. If your Hadoop distribution does not contain the file, you can generate it with cp mapred-site.xml.template mapred-site.xml.


<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

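To exercise the MapReduce-on-YARN setting once HDFS and YARN are running, you can submit the example job shipped with Hadoop; the jar path below is a sketch that assumes the /usr/hadoop-3.2.0 installation directory used in this article.

hadoop jar /usr/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 2 10    # estimate pi with 2 maps of 10 samples each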

6) Install and configure the ZooKeeper cluster

wget http://39.137.36.61:6310/mirror.bit.edu.cn/apache/zookeeper/zookeeper-3.4.14/zookeeper-3.4.14.tar.gz    # download ZooKeeper
tar -zxvf zookeeper-3.4.14.tar.gz     # extract ZooKeeper (to /usr/zookeeper-3.4.14)
ln -sf /usr/zookeeper-3.4.14 /usr/zk  # create a soft link for convenience
cd /usr/zk/conf
cp zoo_sample.cfg zoo.cfg             # copy the zoo_sample.cfg source file to zoo.cfg as the configuration file
vi zoo.cfg
# 1. Change the working directory (do not leave it under a temporary directory):
#       dataDir=/opt/zookeeper
# 2. Append the server list; the number after "server." must match the number in the
#    /opt/zookeeper/myid file on that node:
#       server.1=node1:2888:3888
#       server.2=node2:2888:3888
#       server.3=node3:2888:3888
# 3. Create the /opt/zookeeper directory on node1, node2 and node3 as the working directory
# 4. On each node, create the myid file under /opt/zookeeper containing the matching number (1, 2 or 3)
# 5. Copy the working directory across servers to node2 and node3:
scp -r /opt/zookeeper root@node2:/opt/zookeeper
scp -r /opt/zookeeper root@node3:/opt/zookeeper
# 6. Copy the ZooKeeper installation to node2 and node3 and recreate the soft link there:
scp -r /usr/zookeeper-3.4.14/ root@node2:/usr/
scp -r /usr/zookeeper-3.4.14/ root@node3:/usr/
ln -sf /usr/zookeeper-3.4.14/ /usr/zk
# 7. Configure the ZooKeeper environment variable:
vi /etc/profile          # add: export PATH=$PATH:/usr/zookeeper-3.4.14/bin
source /etc/profile      # reload the environment variables
# 8. Turn off the firewall, otherwise the ZooKeeper nodes cannot reach each other:
sudo systemctl stop firewalld
# 9. Start ZooKeeper on node1, node2 and node3; all zkServer commands are in /usr/zk/bin:
zkServer.sh start
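
A quick check, after starting ZooKeeper on node1, node2 and node3, that the ensemble actually formed:

zkServer.sh status             # one node should report "Mode: leader", the other two "Mode: follower"
zkCli.sh -server node1:2181    # optional: open a client session to confirm port 2181 answers (type quit to leave)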

Start ZK on node1, node2 and node3 respectively. Main troubleshooting points: 1. an unsuitable number of nodes (too many or too few) can be a pitfall; 2. check the zoo.cfg configuration, in particular whether the hostnames and IPs of node1, node2 and node3 are correct; 3. check that each myid file exists and its content is correct.

7) Edit the slaves file with vi to configure the DataNodes. Note that in Hadoop 3.x the file is called workers; only the name changed.

node2
node3
node4

8) The HDFS configuration is complete. Use scp to send the configured Hadoop to the other 3 machines.

scp -r /usr/hadoop-3.2.0 root@node2:/usr/
scp -r /usr/hadoop-3.2.0 root@node3:/usr/
scp -r /usr/hadoop-3.2.0 root@node4:/usr/

9) Start Hadoop

1. Start the JournalNodes on node2, node3 and node4:
cd /usr/hadoop-3.2.0/sbin            # switch to the sbin directory
./hadoop-daemon.sh start journalnode
2. Format one of the NameNodes (only once; never format again afterwards):
cd /usr/hadoop-3.2.0/bin             # switch to the bin directory on that NameNode
./hdfs namenode -format
3. Start that NameNode:
cd /usr/hadoop-3.2.0/sbin
./hadoop-daemon.sh start namenode    # check with jps whether it started; if not, look at the logs:
cd ../logs && tail -n50 hadoop-root-namenode-node1.log
4. On the other NameNode, synchronize the metadata:
cd /usr/hadoop-3.2.0/bin
./hdfs namenode -bootstrapStandby    # if this fails, check the logs the same way:
cd ../logs && tail -n50 hadoop-root-namenode-node2.log
# the synchronized metadata should appear under /opt/hadoop3/dfs/name/current/
5. Initialize ZooKeeper for automatic failover:
cd /usr/hadoop-3.2.0/bin
./hdfs zkfc -formatZK
6. In Hadoop 3.x, when running as root you must declare the daemon users in the scripts under /usr/hadoop-3.2.0/sbin. Edit start-dfs.sh and stop-dfs.sh and add at the top:
HDFS_DATANODE_USER=root
HDFS_JOURNALNODE_USER=root
HDFS_ZKFC_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Edit start-yarn.sh and stop-yarn.sh and add at the top:
YARN_NODEMANAGER_USER=root
YARN_RESOURCEMANAGER_USER=root
Otherwise you will get errors such as:
ERROR: Attempting to operate on yarn nodemanager but there is no YARN_NODEMANAGER_USER defined. Aborting operation.
7. Add Hadoop to the PATH so the scripts can be run from anywhere. In /etc/profile add:
export PATH=$PATH:/usr/hadoop-3.2.0/bin:/usr/hadoop-3.2.0/sbin
then run source /etc/profile.
8. Restart the DFS service. Stop all Hadoop services before starting HDFS; because the environment variables are configured, the commands can be run from anywhere. The scripts in the sbin directory are:
-rwxrwxr-x 1 hadoop supergroup 2752 Sep 10 2018 distribute-exclude.sh
-rwxrwxr-x 1 hadoop supergroup 6509 Oct  3 2019 hadoop-daemon.sh
-rwxrwxr-x 1 hadoop supergroup 1360 Sep 10 2018 hadoop-daemons.sh
-rwxrwxr-x 1 hadoop supergroup 1427 Sep 10 2018 hdfs-config.sh
-rwxrwxr-x 1 hadoop supergroup 2339 Sep 10 2018 httpfs.sh
-rwxrwxr-x 1 hadoop supergroup 3763 Sep 10 2018 kms.sh
-rwxrwxr-x 1 hadoop supergroup 4134 Sep 10 2018 mr-jobhistory-daemon.sh
-rwxrwxr-x 1 hadoop supergroup 1648 Sep 10 2018 refresh-namenodes.sh
-rwxrwxr-x 1 hadoop supergroup 2145 Sep 10 2018 slaves.sh
-rwxrwxr-x 1 hadoop supergroup 1471 Sep 10 2018 start-all.sh
-rwxrwxr-x 1 hadoop supergroup 1128 Sep 10 2018 start-balancer.sh
-rwxrwxr-x 1 hadoop supergroup 3734 Sep 10 2018 start-dfs.sh
-rwxrwxr-x 1 hadoop supergroup 1357 Sep 10 2018 start-secure-dns.sh
-rwxrwxr-x 1 hadoop supergroup 1347 Sep 10 2018 start-yarn.sh
-rwxrwxr-x 1 hadoop supergroup 1462 Sep 10 2018 stop-all.sh
-rwxrwxr-x 1 hadoop supergroup 1179 Sep 10 2018 stop-balancer.sh
-rwxrwxr-x 1 hadoop supergroup 3206 Sep 10 2018 stop-dfs.sh
-rwxrwxr-x 1 hadoop supergroup 1340 Sep 10 2018 stop-secure-dns.sh
-rwxrwxr-x 1 hadoop supergroup 1340 Sep 10 2018 stop-yarn.sh
-rwxrwxr-x 1 hadoop supergroup 4331 Oct  3 2019 yarn-daemon.sh
-rwxrwxr-x 1 hadoop supergroup 1353 Sep 10 2018 yarn-daemons.sh
./stop-all.sh && ./start-dfs.sh      # starts the storage side: the JournalNodes (several), the two NameNodes (one active, one standby), the DataNodes and the ZKFCs; check the output logs. The compute side, as in Hadoop 2.x, is YARN's ResourceManager and NodeManager by default.
./stop-yarn.sh && ./start-yarn.sh    # starts the ResourceManager and NodeManagers on the nodes configured in your yarn-site.xml
// A digression:
// 1. There are stop-all.sh and start-all.sh in this directory. Can they be used? They can, but in principle it is not advocated; generally, decent people do not use them.
// 2. Later, when your Hadoop is running normally and some daemons die while the cluster as a whole stays up, start those daemons individually instead of restarting everything at once. For example, if one NameNode hangs, the other NameNode keeps working thanks to high availability and the cluster does not go down, so you must not restart the whole cluster; just start the dead NameNode directly on that node:
cd /usr/hadoop-3.2.0/sbin               # switch to sbin
./hadoop-daemon.sh start namenode       # start the NameNode on this machine alone; check with jps whether the daemon is up
./hadoop-daemon.sh start journalnode    # start a JournalNode on this machine alone
./yarn-daemon.sh start nodemanager      # start a NodeManager on this machine alone
./yarn-daemon.sh start resourcemanager  # start a ResourceManager on this machine alone
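
A minimal health check after everything is started, assuming the PATH settings above are in place; this is only a sketch of commands worth running once:

jps                                  # on each node, compare the daemons with the planning table
hdfs haadmin -getServiceState nn1    # one NameNode should report active, the other standby
hdfs haadmin -getServiceState nn2
hdfs dfsadmin -report                # node2, node3 and node4 should show up as live DataNodes
yarn node -list                      # the NodeManagers should be listed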

At this point, the whole pure HDFS 3.x + HA configuration is complete. Running jps on node1, node2, node3 and node4 (screenshots omitted) shows that the daemons on each machine match our original table exactly. You can open http://node1:50070 from any machine to monitor the cluster through the web UI.

The default Hadoop ports are listed below. Of course, if you changed any of them during configuration, use your own values.

port    role
9000    fs.defaultFS, e.g. hdfs://172.25.40.171:9000
9001    dfs.namenode.rpc-address, the port DataNodes connect to
50070   dfs.namenode.http-address
50470   dfs.namenode.https-address
50100   dfs.namenode.backup.address
50105   dfs.namenode.backup.http-address
50090   dfs.namenode.secondary.http-address, e.g. 172.25.39.166:50090
50091   dfs.namenode.secondary.https-address, e.g. 172.25.39.166:50091
50020   dfs.datanode.ipc.address
50075   dfs.datanode.http.address
50475   dfs.datanode.https.address
50010   dfs.datanode.address, DataNode data transfer port
8480    dfs.journalnode.http-address
8481    dfs.journalnode.https-address
8032    yarn.resourcemanager.address
8088    yarn.resourcemanager.webapp.address, YARN HTTP port
8090    yarn.resourcemanager.webapp.https.address
8030    yarn.resourcemanager.scheduler.address
8031    yarn.resourcemanager.resource-tracker.address
8033    yarn.resourcemanager.admin.address
8042    yarn.nodemanager.webapp.address
8040    yarn.nodemanager.localizer.address
8188    yarn.timeline-service.webapp.address
10020   mapreduce.jobhistory.address
19888   mapreduce.jobhistory.webapp.address
2888    ZooKeeper: port on which the Leader listens for follower connections
3888    ZooKeeper: used for Leader election
2181    ZooKeeper: listens for client connections
60010   hbase.master.info.port, HMaster HTTP port
60000   hbase.master.port, HMaster RPC port
60030   hbase.regionserver.info.port, HRegionServer HTTP port
60020   hbase.regionserver.port, HRegionServer RPC port
8080    hbase.rest.port, HBase REST server port
10000   hive.server2.thrift.port
9083    hive.metastore.uris
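
If a web UI or RPC connection is refused, a quick way (shown as a sketch, run on the node in question) to see whether the daemon is actually listening on its port:

ss -lntp | grep -E '50070|8020|2181'                          # which processes listen on the NameNode web, NameNode RPC and ZooKeeper ports
curl -s -o /dev/null -w '%{http_code}\n' http://node1:50070   # 200 means the NameNode web UI answers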