Built on a personal Windows 10 host with VMware 16; the VM network mode is NAT.
Note: the following is the common configuration shared by all three nodes.
#Basic information
Fully distributed cluster mode:
3 nodes: 1 master, 2 workers.
#Always use the externally reachable (NAT) IP, never the internal IP
Master node: 10.0.0.11 node01
Worker node: 10.0.0.12 node02
Worker node: 10.0.0.13 node03
/etc/hosts:
10.0.0.11 node01
10.0.0.12 node02
10.0.0.13 node03
Operating system:
CentOS Linux release 7.9.2009 (Core)
Linux <hostname> 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
JDK version (JDK 8, with a build released as close as possible to the Hadoop release date):
(https://www.oracle.com/cn/java/technologies/downloads/#java8)
java version "1.8.0_431"
Hadoop version:
(https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz)
Hadoop 3.4.0
Hardware per node:
RAM: 2 GB; CPU: 2 cores; disk: 20 GB.
cat /etc/profile
#java
JAVA_HOME=/data/software/install/jdk
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export CLASSPATH
#hadoop
HADOOP_HOME=/data/software/install/hadoop
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_HOME
export PATH
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
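After saving /etc/profile on each node, reload it and sanity-check that both tools resolve (a quick check not shown in the original transcript; the extra version output lines are trimmed here):
[root@localhost ~]# source /etc/profile
[root@localhost ~]# java -version
java version "1.8.0_431"
[root@localhost ~]# hadoop version
Hadoop 3.4.0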
#SSH configuration
[root@localhost system]# chmod +x /data/scripts/system/createssh.sh
[root@localhost system]# /bin/sh /data/scripts/system/createssh.sh
[root@localhost system]# ll ~/.ssh/id_rsa.pub
-rw-r--r-- 1 root root 396 Oct 31 02:53 /root/.ssh/id_rsa.pub
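The content of createssh.sh is not shown above; a minimal sketch of what such a script might contain, assuming it only generates the RSA key pair non-interactively (hypothetical, not the original script):
#!/bin/sh
# Generate an RSA key pair with an empty passphrase if none exists yet
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa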
[root@localhost system]# vim /data/scripts/system/fen.sh
#!/bin/sh
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node01
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node02
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node03
[root@localhost system]# /bin/sh /data/scripts/system/fen.sh
[root@localhost system]# cat ~/.ssh/id_rsa.pub
[root@localhost system]# cat ~/.ssh/authorized_keys
[root@localhost ~]# ll -ld ~/.ssh
drwx------ 2 root root 4096 Oct 31 02:58 /root/.ssh
[root@localhost ~]# ll ~/.ssh/authorized_keys
-rw-r--r-- 1 root root 1188 Oct 31 02:58 /root/.ssh/authorized_keys
[root@localhost ~]# chmod 700 ~/.ssh
[root@localhost ~]# chmod 644 ~/.ssh/authorized_keys
[root@localhost system]# ssh root@node01
[root@localhost system]# ssh root@node02
[root@localhost system]# ssh root@node03
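Each of the logins above should now succeed without a password prompt. To check all three nodes in one pass (a convenience loop, not part of the original transcript):
[root@localhost ~]# for h in node01 node02 node03; do ssh root@$h hostname; done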
#HDFS cluster configuration
hadoop-env.sh:
export JAVA_HOME=/data/software/install/jdk
workers:
node02
node03
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/temp/hadoop</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data/temp/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data/temp/hdfs/datanode</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/data/software/install/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/data/software/install/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/data/software/install/hadoop</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
#value obtained from the output of hadoop classpath
<property>
<name>yarn.application.classpath</name>
<value>
/data/software/workspace/hadoop-3.4.0/etc/hadoop:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/mapreduce/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/*
</value>
</property>
</configuration>
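All three nodes need identical copies of these files. One way to push them from node01, assuming the same installation path on every node (a sketch, not from the original transcript):
[root@node01 ~]# for h in node02 node03; do scp /data/software/install/hadoop/etc/hadoop/{hadoop-env.sh,workers,core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml} root@$h:/data/software/install/hadoop/etc/hadoop/; done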
#Starting HDFS
Format HDFS on the master node node01 (the tail of the output should show the hostname resolving to the NAT IP):
hdfs namenode -format
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node01/10.0.0.11
************************************************************/
Start the cluster from the Hadoop installation directory on the master node node01:
sbin/start-all.sh
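If startup succeeded, jps should show roughly the following daemons (PIDs will vary): NameNode, SecondaryNameNode, and ResourceManager on node01; DataNode and NodeManager on node02 and node03. For example:
[root@node01 ~]# jps
[root@node01 ~]# ssh root@node02 jps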
#Testing the cluster
[root@localhost ~]# cat ~/text
hello
world
[root@localhost ~]# hdfs dfs -put ~/text /input
[root@localhost ~]# hdfs dfs -ls /input
-rw-r--r-- 2 root supergroup 12 2024-10-31 04:29 /input
[root@localhost sbin]# hdfs dfsadmin -report
Configured Capacity: 38420357120 (35.78 GB)
Present Capacity: 25072865280 (23.35 GB)
DFS Remaining: 25072816128 (23.35 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: 10.0.0.12:9866 (node02)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 19210178560 (17.89 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5674422272 (5.28 GB)
DFS Remaining: 12536283136 (11.68 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.26%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Thu Oct 31 06:46:39 CST 2024
Last Block Report: Thu Oct 31 06:41:00 CST 2024
Num of Blocks: 0
Name: 10.0.0.13:9866 (node03)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 19210178560 (17.89 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5674172416 (5.28 GB)
DFS Remaining: 12536532992 (11.68 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.26%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Thu Oct 31 06:46:39 CST 2024
Last Block Report: Thu Oct 31 06:41:00 CST 2024
Num of Blocks: 0
[root@node01 ~]# hadoop fs -mkdir -p /user/root/input
[root@node01 ~]# echo "Hello Hadoop" > input.txt
[root@node01 ~]# hadoop fs -put input.txt /user/root/input/
[root@node01 ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/root/input /user/root/output
2024-10-31 07:28:29,233 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node01/10.0.0.11:8032
2024-10-31 07:28:29,637 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1730330555537_0002
2024-10-31 07:28:29,880 INFO input.FileInputFormat: Total input files to process : 1
2024-10-31 07:28:29,942 INFO mapreduce.JobSubmitter: number of splits:1
2024-10-31 07:28:30,053 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1730330555537_0002
2024-10-31 07:28:30,054 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-10-31 07:28:30,182 INFO conf.Configuration: resource-types.xml not found
2024-10-31 07:28:30,183 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-10-31 07:28:30,232 INFO impl.YarnClientImpl: Submitted application application_1730330555537_0002
2024-10-31 07:28:30,263 INFO mapreduce.Job: The url to track the job: http://node01:8088/proxy/application_1730330555537_0002/
2024-10-31 07:28:30,263 INFO mapreduce.Job: Running job: job_1730330555537_0002
2024-10-31 07:28:36,345 INFO mapreduce.Job: Job job_1730330555537_0002 running in uber mode : false
2024-10-31 07:28:36,347 INFO mapreduce.Job: map 0% reduce 0%
2024-10-31 07:28:40,449 INFO mapreduce.Job: map 100% reduce 0%
2024-10-31 07:28:45,485 INFO mapreduce.Job: map 100% reduce 100%
2024-10-31 07:28:45,497 INFO mapreduce.Job: Job job_1730330555537_0002 completed successfully
2024-10-31 07:28:45,568 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=31
FILE: Number of bytes written=619369
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=117
HDFS: Number of bytes written=17
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=1926
Total time spent by all reduces in occupied slots (ms)=1973
Total time spent by all map tasks (ms)=1926
Total time spent by all reduce tasks (ms)=1973
Total vcore-milliseconds taken by all map tasks=1926
Total vcore-milliseconds taken by all reduce tasks=1973
Total megabyte-milliseconds taken by all map tasks=1972224
Total megabyte-milliseconds taken by all reduce tasks=2020352
Map-Reduce Framework
Map input records=1
Map output records=2
Map output bytes=21
Map output materialized bytes=31
Input split bytes=104
Combine input records=2
Combine output records=2
Reduce input groups=2
Reduce shuffle bytes=31
Reduce input records=2
Reduce output records=2
Spilled Records=4
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=85
CPU time spent (ms)=850
Physical memory (bytes) snapshot=574197760
Virtual memory (bytes) snapshot=5532917760
Total committed heap usage (bytes)=413663232
Peak Map Physical memory (bytes)=324878336
Peak Map Virtual memory (bytes)=2765701120
Peak Reduce Physical memory (bytes)=249319424
Peak Reduce Virtual memory (bytes)=2767216640
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=13
File Output Format Counters
Bytes Written=17
[root@node01 ~]# hadoop dfs -ls /user/root/output/
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.
Found 2 items
-rw-r--r-- 2 root supergroup 0 2024-10-31 07:28 /user/root/output/_SUCCESS
-rw-r--r-- 2 root supergroup 17 2024-10-31 07:28 /user/root/output/part-r-00000
[root@node01 ~]# hadoop dfs -cat /user/root/output/part-r-00000
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.
Hadoop 1
Hello 1
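Note that re-running the job with the same output path fails, because MapReduce refuses to write into an existing output directory; remove it first:
[root@node01 ~]# hdfs dfs -rm -r /user/root/output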
#Web console addresses and ports
Console UIs:
http://10.0.0.11:8088/cluster/nodes
http://10.0.0.11:9870/dfshealth.html#tab-datanode
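A quick reachability check for both UIs from the Windows host or any machine on the NAT network (expects HTTP 200; not part of the original transcript):
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.11:8088/cluster/nodes
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.11:9870/dfshealth.html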
#Dataset
Source:
https://archive.ics.uci.edu/dataset/450/sports+articles+for+objectivity+analysis
Data size: 1,000 .txt text files (per the dataset page and the downloaded archive).
Contents: 1,000 sports articles intended for machine learning (objectivity analysis).
Actual total size of all files: about 3.8 MB.
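One way to fetch and unpack the archive on a Linux box; the direct download URL below is an assumption based on the UCI repository's usual layout and may change:
wget "https://archive.ics.uci.edu/static/public/450/sports+articles+for+objectivity+analysis.zip"
unzip "sports+articles+for+objectivity+analysis.zip" -d sports_articles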
#Source program
Dependency configured in the Maven project's pom.xml for the Java program:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.4.0</version>
</dependency>
Build and run environment for the program below:
hosts mappings configured on the Windows machine:
10.0.0.11 node01
10.0.0.12 node02
10.0.0.13 node03
IDE: IntelliJ IDEA 2023.1.3 (idea64-2023.1.3.exe)
JDK: java version "1.8.0_431" (matching the version on the cluster servers)
Maven: Apache Maven 3.3.9
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.io.File;
import java.net.URI;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class HadoopDemo {

    // Timestamp format for the progress output
    public static final DateTimeFormatter DATE_TIME_FORMATTER
            = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    // HDFS connection URL
    public static final String HADOOP_CONNECT_URL = "hdfs://node01";
    // HDFS user to connect as
    public static final String HADOOP_CONNECT_USER = "root";
    // Remote HDFS directory to create and upload into
    public static final String HADOOP_REMOTE_PATH = "/user/root/input/example/sports01";
    // Local directory holding the dataset files to upload
    public static final String HADOOP_UPLOAD_LOCAL_PATH
            = "D:\\sports+articles+for+objectivity+analysis\\Raw data";

    public static void main(String[] args) throws Exception {
        LocalDateTime beginTime = LocalDateTime.now();
        System.out.println("Start time : " + beginTime.format(DATE_TIME_FORMATTER));
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI(HADOOP_CONNECT_URL), configuration, HADOOP_CONNECT_USER);

        // Step 1: create the remote HDFS directory, replacing it if it already exists
        Path hadoopRemotePath = new Path(HADOOP_REMOTE_PATH);
        if (fileSystem.exists(hadoopRemotePath)) {
            fileSystem.delete(hadoopRemotePath, true);
        }
        fileSystem.mkdirs(hadoopRemotePath);

        // Step 2: upload every file in the local directory to the remote directory
        File[] localListFiles = new File(HADOOP_UPLOAD_LOCAL_PATH).listFiles();
        if (localListFiles == null) {
            throw new IllegalStateException("Not a readable directory: " + HADOOP_UPLOAD_LOCAL_PATH);
        }
        for (File localListFile : localListFiles) {
            fileSystem.copyFromLocalFile(false, true, new Path(localListFile.getPath()), hadoopRemotePath);
        }

        // Step 3: walk the remote directory and count the files in the dataset
        RemoteIterator<LocatedFileStatus> hadoopRemoteIterator = fileSystem.listFiles(hadoopRemotePath, true);
        int hadoopRemoteSize = 0;
        while (hadoopRemoteIterator.hasNext()) {
            hadoopRemoteIterator.next();
            hadoopRemoteSize++;
        }
        System.out.println("Number of files in the dataset : " + hadoopRemoteSize);

        // Step 4: close the FileSystem and report the finish time
        fileSystem.close();
        LocalDateTime endTime = LocalDateTime.now();
        System.out.println("End time : " + endTime.format(DATE_TIME_FORMATTER));
    }
}
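After a run, the count printed by the program can be cross-checked from the cluster side; the second column of the output is the file count and should read 1000:
[root@node01 ~]# hdfs dfs -count /user/root/input/example/sports01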
File type: .txt
Data structure: plain text
For example, viewing the dataset file Text1000.txt from the master node:
[root@node01 ~]# hdfs dfs -cat /user/root/input/example/sports01/Text1000.txt
A Night of Validation
Rodman's ceremony celebrates his, Bad Boys' greatness
by George Blaha
Lou Capozzola/NBAE/Getty Images
When you think about all the people who've played for the Pistons in the 54 years since they've been around in Detroit and realize that only seven men have been honored to have their names hanging above the court at The Palace and that five of them were directly involved with the Bad Boys, you realize what a special group that was.
The full listing of the 1,000 uploaded files can be viewed with hdfs dfs -ls /user/root/input/example/sports01.
#Problems
Problem 1:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100
Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild
#Solution
[root@node01 ~]# hadoop classpath
#copy the output into the yarn-site.xml configuration below
[root@node01 hadoop]# vim yarn-site.xml
<property>
<name>yarn.application.classpath</name>
<value>
/data/software/workspace/hadoop-3.4.0/etc/hadoop:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/common/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/hdfs/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/mapreduce/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/lib/*:/data/software/workspace/hadoop-3.4.0/share/hadoop/yarn/*
</value>
</property>
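Restart YARN so the NodeManagers pick up the new classpath, then re-run the job that failed:
[root@node01 hadoop]# stop-yarn.sh && start-yarn.sh
[root@node01 hadoop]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100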
Problem 2:
Running start-all.sh warns about adding 127.0.0.1 to the list of known hosts;
hdfs namenode -format prints "SHUTDOWN_MSG: Shutting down NameNode at node01/127.0.0.1";
hdfs dfsadmin -report prints the internal IP "172.16.1.11";
some of the ports shown by jps cannot be reached via the external IP and only respond on the internal IP, e.g. curl http://10.0.0.11:8088/cluster
curl: (7) Failed connect to 10.0.0.11:8088; Connection refused.
#Solution
In /etc/hosts, map the node names only to the external IPs as below; never map them to the internal IPs, and never add a 127.0.0.1 localhost entry for them:
10.0.0.11 node01
10.0.0.12 node02
10.0.0.13 node03
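After fixing /etc/hosts, restart the cluster and confirm the daemons are listening on the NAT address rather than on loopback (a sanity check, not from the original transcript; 8020 is the default fs.defaultFS RPC port when none is given):
[root@node01 ~]# ss -tlnp | grep -E '8088|9870|8020'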