
Hadoop 3.4.0 Fully Distributed Cluster Mode


Built on a personal Windows 10 host with VMware Workstation 16; the VMs use NAT networking.

Note: the following covers the main configuration common to all three nodes.

#Basic Information


Fully distributed cluster mode:
3 nodes: 1 master node and 2 worker nodes.
#Be sure to use the external IPs, not the internal IPs
Master node: 10.0.0.11 node01
Worker node: 10.0.0.12 node02
Worker node: 10.0.0.13 node03

/etc/hosts:
10.0.0.11 node01
10.0.0.12 node02
10.0.0.13 node03

Operating system:
CentOS Linux release 7.9.2009 (Core)
Linux <hostname> 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


JDK version (JDK 8, chosen to be as close as possible to the Hadoop release date):
(https://www.oracle.com/cn/java/technologies/downloads/#java8)
java version "1.8.0_431"

Hadoop version:
(https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz)
Hadoop 3.4.0

Hardware per node:
2 GB RAM, dual-core CPU, 20 GB disk.

cat /etc/profile
#java
JAVA_HOME=/data/software/install/jdk
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export CLASSPATH
#hadoop
HADOOP_HOME=/data/software/install/hadoop
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_HOME
export PATH
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
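After appending these lines on each node, a quick sanity check confirms that both toolchains resolve from the updated PATH:

source /etc/profile
java -version      #should report java version "1.8.0_431"
hadoop version     #should report Hadoop 3.4.0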

#SSH Configuration

[root@localhost system]# chmod +x /data/scripts/system/createssh.sh
[root@localhost system]# /bin/sh /data/scripts/system/createssh.sh

[root@localhost system]# ll ~/.ssh/id_rsa.pub
-rw-r--r-- 1 root root 396 Oct 31 02:53 /root/.ssh/id_rsa.pub
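The contents of createssh.sh are not shown above; a minimal sketch, assuming it does nothing more than generate root's RSA key pair non-interactively, would be:

#!/bin/sh
#hypothetical sketch of createssh.sh: generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa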

[root@localhost system]# vim /data/scripts/system/fen.sh
#!/bin/sh
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node02
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node03

[root@localhost system]# /bin/sh /data/scripts/system/fen.sh
[root@localhost system]# cat ~/.ssh/id_rsa.pub
[root@localhost system]# cat ~/.ssh/authorized_keys


[root@localhost ~]# ll -ld ~/.ssh
drwx------ 2 root root 4096 Oct 31 02:58 /root/.ssh
[root@localhost ~]# ll ~/.ssh/authorized_keys
-rw-r--r-- 1 root root 1188 Oct 31 02:58 /root/.ssh/authorized_keys

[root@localhost ~]# chmod 700 ~/.ssh
[root@localhost ~]# chmod 644 ~/.ssh/authorized_keys

[root@localhost system]# ssh root@node01
[root@localhost system]# ssh root@node02
[root@localhost system]# ssh root@node03

#HDFS Cluster Configuration


hadoop-env.sh file:
export JAVA_HOME=/data/software/install/jdk

workers file:
node02
node03

core-site.xml file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/temp/hadoop</value>
  </property>
</configuration>

hdfs-site.xml file:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/temp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/temp/hdfs/datanode</value>
  </property>
</configuration>
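Hadoop creates these directories itself on format/startup, but pre-creating them is a cheap way to surface permission problems early (the namenode directory matters on node01, the datanode directory on node02/node03):

mkdir -p /data/temp/hdfs/namenode /data/temp/hdfs/datanode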

mapred-site.xml file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/data/software/install/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/data/software/install/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/data/software/install/hadoop</value>
  </property>
</configuration>

yarn-site.xml file:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- value obtained from the output of: hadoop classpath -->
  <property>
    <name>yarn.application.classpath</name>
    <value>/data/software/workspace/hadoop-3.4.0/etc/hadoop:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/mapreduce/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/*</value>
  </property>
</configuration>
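These files live under $HADOOP_HOME/etc/hadoop and must be identical on all three nodes. After editing them on node01, one way to distribute them (a sketch, assuming the same installation path on every node) is:

for host in node02 node03; do
  scp $HADOOP_HOME/etc/hadoop/{hadoop-env.sh,workers,core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml} root@$host:$HADOOP_HOME/etc/hadoop/
done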

#Starting HDFS

Format HDFS on the master node node01:
hdfs namenode -format
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node01/10.0.0.11
************************************************************/
The SHUTDOWN_MSG above should show the external IP 10.0.0.11; if it shows 127.0.0.1 instead, see Problem 2 below.
Start the cluster from the Hadoop installation directory on the master node node01:
sbin/start-all.sh
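After start-all.sh completes, running jps on each node should show the expected daemons for this layout (NameNode, SecondaryNameNode and ResourceManager on the master; DataNode and NodeManager on the workers):

[root@node01 ~]# jps
[root@node02 ~]# jps
[root@node03 ~]# jps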

#Testing the Cluster

[root@localhost ~]# cat ~/text 
hello
world
[root@localhost ~]# hdfs dfs -put ~/text  /input
[root@localhost ~]# hdfs dfs -ls /input
-rw-r--r--   2 root supergroup         12 2024-10-31 04:29 /input
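Reading the file back should return the two lines that were uploaded:

[root@localhost ~]# hdfs dfs -cat /input
hello
world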

[root@localhost sbin]# hdfs dfsadmin -report
Configured Capacity: 38420357120 (35.78 GB)
Present Capacity: 25072865280 (23.35 GB)
DFS Remaining: 25072816128 (23.35 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups: 
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 10.0.0.12:9866 (node02)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 19210178560 (17.89 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5674422272 (5.28 GB)
DFS Remaining: 12536283136 (11.68 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.26%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Thu Oct 31 06:46:39 CST 2024
Last Block Report: Thu Oct 31 06:41:00 CST 2024
Num of Blocks: 0


Name: 10.0.0.13:9866 (node03)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 19210178560 (17.89 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5674172416 (5.28 GB)
DFS Remaining: 12536532992 (11.68 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.26%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Thu Oct 31 06:46:39 CST 2024
Last Block Report: Thu Oct 31 06:41:00 CST 2024
Num of Blocks: 0

[root@node01 ~]# hadoop fs -mkdir -p /user/root/input
[root@node01 ~]# echo "Hello Hadoop" > input.txt
[root@node01 ~]# hadoop fs -put input.txt /user/root/input/
[root@node01 ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/root/input /user/root/output
2024-10-31 07:28:29,233 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node01/10.0.0.11:8032
2024-10-31 07:28:29,637 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1730330555537_0002
2024-10-31 07:28:29,880 INFO input.FileInputFormat: Total input files to process : 1
2024-10-31 07:28:29,942 INFO mapreduce.JobSubmitter: number of splits:1
2024-10-31 07:28:30,053 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1730330555537_0002
2024-10-31 07:28:30,054 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-10-31 07:28:30,182 INFO conf.Configuration: resource-types.xml not found
2024-10-31 07:28:30,183 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-10-31 07:28:30,232 INFO impl.YarnClientImpl: Submitted application application_1730330555537_0002
2024-10-31 07:28:30,263 INFO mapreduce.Job: The url to track the job: http://node01:8088/proxy/application_1730330555537_0002/
2024-10-31 07:28:30,263 INFO mapreduce.Job: Running job: job_1730330555537_0002
2024-10-31 07:28:36,345 INFO mapreduce.Job: Job job_1730330555537_0002 running in uber mode : false
2024-10-31 07:28:36,347 INFO mapreduce.Job:  map 0% reduce 0%
2024-10-31 07:28:40,449 INFO mapreduce.Job:  map 100% reduce 0%
2024-10-31 07:28:45,485 INFO mapreduce.Job:  map 100% reduce 100%
2024-10-31 07:28:45,497 INFO mapreduce.Job: Job job_1730330555537_0002 completed successfully
2024-10-31 07:28:45,568 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=31
                FILE: Number of bytes written=619369
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=117
                HDFS: Number of bytes written=17
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=1926
                Total time spent by all reduces in occupied slots (ms)=1973
                Total time spent by all map tasks (ms)=1926
                Total time spent by all reduce tasks (ms)=1973
                Total vcore-milliseconds taken by all map tasks=1926
                Total vcore-milliseconds taken by all reduce tasks=1973
                Total megabyte-milliseconds taken by all map tasks=1972224
                Total megabyte-milliseconds taken by all reduce tasks=2020352
        Map-Reduce Framework
                Map input records=1
                Map output records=2
                Map output bytes=21
                Map output materialized bytes=31
                Input split bytes=104
                Combine input records=2
                Combine output records=2
                Reduce input groups=2
                Reduce shuffle bytes=31
                Reduce input records=2
                Reduce output records=2
                Spilled Records=4
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=85
                CPU time spent (ms)=850
                Physical memory (bytes) snapshot=574197760
                Virtual memory (bytes) snapshot=5532917760
                Total committed heap usage (bytes)=413663232
                Peak Map Physical memory (bytes)=324878336
                Peak Map Virtual memory (bytes)=2765701120
                Peak Reduce Physical memory (bytes)=249319424
                Peak Reduce Virtual memory (bytes)=2767216640
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=13
        File Output Format Counters 
                Bytes Written=17

[root@node01 ~]# hadoop dfs -ls /user/root/output/
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

Found 2 items
-rw-r--r--   2 root supergroup          0 2024-10-31 07:28 /user/root/output/_SUCCESS
-rw-r--r--   2 root supergroup         17 2024-10-31 07:28 /user/root/output/part-r-00000
[root@node01 ~]# hadoop dfs -cat /user/root/output/part-r-00000
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

Hadoop  1
Hello   1

#Web Console IPs and Ports

Web console access:
http://10.0.0.11:8088/cluster/nodes

(screenshot: YARN cluster nodes page)

http://10.0.0.11:9870/dfshealth.html#tab-datanode

(screenshots: HDFS web UI, datanode tab)
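Both UIs can also be probed from the command line on any machine with the hosts mapping in place (a 200 or 302 response indicates the service is up):

curl -I http://node01:8088/cluster/nodes
curl -I http://node01:9870/dfshealth.html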

#Dataset

Source:
https://archive.ics.uci.edu/dataset/450/sports+articles+for+objectivity+analysis
Data size: 1,000 .txt text files. (Note: see the screenshots of the site and the downloaded archive below.)
File content: 1,000 sports articles for machine learning.
Actual total size of all files: about 3.8 MB.

(screenshot: dataset page)

(screenshot: downloaded archive)

#Source Program


Dependency configured in the Maven project's pom.xml:
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-client</artifactId>
	<version>3.4.0</version>
</dependency>


Build and run environment for the program below:

Configure the hosts mapping on the Windows machine:
10.0.0.11 node01
10.0.0.12 node02
10.0.0.13 node03

IDE version: idea64-2023.1.3.exe
JDK version: java version "1.8.0_431" (the same version as on the HDFS cluster servers).
Maven version: Apache Maven 3.3.9

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import java.io.File;
import java.net.URI;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.atomic.AtomicInteger;

public class HadoopDemo {

    //timestamp format for logging
    public static final DateTimeFormatter DATE_TIME_FORMATTER
            = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    //HDFS connection URL
    public static final String HADOOP_CONNECT_URL = "hdfs://node01";
    //HDFS user to operate as
    public static final String HADOOP_CONNECT_USER = "root";
    //remote HDFS directory to create
    public static final String HADOOP_REMOTE_PATH = "/user/root/input/example/sports01";
    //local directory containing the dataset to upload
    public static final String HADOOP_UPLOAD_LOCAL_PATH
            = "D:\\sports+articles+for+objectivity+analysis\\Raw data";

    public static void main(String[] args) throws Exception {

        LocalDateTime beginTime = LocalDateTime.now();
        System.out.println("Start time : " + beginTime.format(DATE_TIME_FORMATTER));

        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI(HADOOP_CONNECT_URL), configuration, HADOOP_CONNECT_USER);
        /**
         * Step 1: create the remote HDFS directory (recreate it if it already exists)
         */
        Path hadoopRemotePath = new Path(HADOOP_REMOTE_PATH);
        boolean hadoopRemotePathExistsFlag = fileSystem.exists(hadoopRemotePath);
        if(hadoopRemotePathExistsFlag){
            fileSystem.delete(hadoopRemotePath, true);
        }
        fileSystem.mkdirs(hadoopRemotePath);

        /**
         * Step 2: upload the files from the local directory to the remote HDFS directory
         */
        File localFile = new File(HADOOP_UPLOAD_LOCAL_PATH);
        File[] localListFiles = localFile.listFiles();
        if (localListFiles == null) {
            throw new IllegalStateException("Local directory not found: " + HADOOP_UPLOAD_LOCAL_PATH);
        }
        for (File file : localListFiles) {
            //delSrc=false keeps the local copy; overwrite=true replaces any existing remote file
            fileSystem.copyFromLocalFile(false, true, new Path(file.getPath()), hadoopRemotePath);
        }

        /**
         * Step 3: recursively list the HDFS directory and count the files in the dataset
         */
        RemoteIterator<LocatedFileStatus> hadoopRemoteIterator = fileSystem.listFiles(hadoopRemotePath, true);
        AtomicInteger hadoopRemoteSize = new AtomicInteger(0);
        while(hadoopRemoteIterator.hasNext()){
            hadoopRemoteIterator.next();
            hadoopRemoteSize.getAndIncrement();
        }
        System.out.println("Number of files in the dataset : " + hadoopRemoteSize.get());

        /**
         * Step 4: close the FileSystem
         */
        fileSystem.close();

        LocalDateTime endTime = LocalDateTime.now();
        System.out.println("End time : " + endTime.format(DATE_TIME_FORMATTER));

    }


}
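Run against the cluster above with the 1,000-file dataset, the program should print output along these lines (timestamps illustrative):

Start time : 2024-10-31 08:00:00
Number of files in the dataset : 1000
End time : 2024-10-31 08:01:12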
File type: .txt files
Data structure: plain text
For example, viewing the dataset file Text1000.txt from the master node:
[root@node01 ~]# hdfs dfs -cat /user/root/input/example/sports01/Text1000.txt
A Night of Validation
Rodman's ceremony celebrates his, Bad Boys' greatness
by George Blaha

Lou Capozzola/NBAE/Getty Images
When you think about all the people who've played for the Pistons in the 54 years since they've been around in Detroit and realize that only seven men have been honored to have their names hanging above the court at The Palace and that five of them were directly involved with the Bad Boys, you realize what a special group that was.

Listing of the 1,000 uploaded files:

(screenshot: the 1,000 uploaded files)
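The same count can be confirmed from the shell; the second column of hdfs dfs -count is the number of files and should read 1000:

[root@node01 ~]# hdfs dfs -count /user/root/input/example/sports01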

#Problems

Problem 1:


hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100
Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild
#Solution
[root@node01 ~]# hadoop classpath
#Copy the output into the configuration below
[root@node01 hadoop]# vim yarn-site.xml
  <property>
    <name>yarn.application.classpath</name>
    <value>/data/software/workspace/hadoop-3.4.0/etc/hadoop:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/mapreduce/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/*</value>
  </property>
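After saving yarn-site.xml on all nodes, restart YARN on node01 so the new classpath takes effect:

stop-yarn.sh
start-yarn.sh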

Problem 2:

Running start-all.sh warned that 127.0.0.1 was being added to the list of known hosts;
hdfs namenode -format printed "SHUTDOWN_MSG: Shutting down NameNode at node01/127.0.0.1";
hdfs dfsadmin -report printed the internal IP "172.16.1.11";
some of the ports shown by jps were unreachable via the external IP and could only be reached via the internal IP, e.g.:
curl http://10.0.0.11:8088/cluster
curl: (7) Failed connect to 10.0.0.11:8088; Connection refused

#Solution
In /etc/hosts, map the hostnames exactly as below; never map them to the internal IPs, and never append them to the 127.0.0.1 localhost line:
10.0.0.11 node01
10.0.0.12 node02
10.0.0.13 node03
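A quick check on each node that the names resolve to the external IPs:

getent hosts node01 node02 node03    #each line should show the 10.0.0.x address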

Title: Hadoop 3.4.0 Fully Distributed Cluster Mode
Author: yazong
URL: https://blog.llyweb.com/articles/2024/11/03/1730632847512.html