
1. Introduction to HDFS

HDFS (Hadoop Distributed File System) is the core sub-project of Hadoop. It stores and manages massive amounts of data in a distributed fashion and is a cornerstone of big data development.

HDFS is a typical master/slave distributed system. An HDFS cluster consists of one metadata node (NameNode) and a number of data nodes (DataNodes).

For example, we can think of the NameNode as a warehouse manager who keeps track of the goods in the warehouse, and the DataNodes as the warehouses that actually store the goods; the goods are what we call data.

HDFS command line operations

The command line interface is as follows:

$ bin/hadoop fs -<command> <file path>

or

$ bin/hdfs dfs -<command> <file path>

ls

Use the ls command to view directories and files in the HDFS system.

$ hadoop fs -ls /

Operation demonstration:

[root@centos01 ~]# hadoop fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2021-07-10 08:58 /input
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp

Recursively list all directories and files in the root directory of the HDFS file system:

[root@centos01 ~]# hadoop fs -ls -R /
drwxr-xr-x   - hadoop supergroup          0 2021-07-10 08:58 /input
-rw-r--r--   2 hadoop supergroup         83 2021-07-10 08:58 /input/wc.txt
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging/hadoop
drwx------   - hadoop supergroup          0 2021-07-10 08:49 /tmp/hadoop-yarn/staging/hadoop/.staging

put

Use the put command to upload local files to the HDFS system. For example, to upload the local file a.txt to the /input directory of the HDFS file system, the command is as follows:

$ hadoop fs -put a.txt /input/

get

Use the get command to download files from the HDFS file system to the local machine. Note that if a file with the same name already exists locally, the download will fail with a message that the file already exists.

$ hadoop fs -get /input/a.txt a.txt 

Download a folder to the local machine:

$ hadoop fs -get /input/ ./

Common commands

List the files under HDFS
$ hadoop dfs -ls
Recursively list all files and folders under the HDFS / path
$ hadoop dfs -ls -R /
Create the directory /input
$ hadoop dfs -mkdir /input
List the files in the HDFS folder named input
$ hadoop dfs -ls input
Upload test.txt to HDFS
$ hadoop fs -put /home/binguner/Desktop/test.txt /input
Save the test.txt file from HDFS to the local desktop folder
$ hadoop dfs -get /input/test.txt /home/binguner/Desktop
Delete the test.txt file on HDFS
$ hadoop dfs -rmr /input/test.txt
View the contents of the input folder on HDFS
$ hadoop fs -cat input/*
Enter safe mode
$ hadoop dfsadmin -safemode enter
Leave safe mode
$ hadoop dfsadmin -safemode leave
Report basic HDFS statistics
$ hadoop dfsadmin -report

2. Java API operations

1. Create the project

Add the Hadoop Java API dependency in the pom.xml file:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.8.2</version>
</dependency>

Create a new class com/homay/hadoopstudy/FileSystemCat.java:

package com.homay.hadoopstudy;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;

/**
 * @author: kaiyi
 * @Date 2021/7/12 0:25
 */
public class FileSystemCat {

    public static void main(String[] args) throws Exception{

        Configuration conf = new Configuration();
        conf.set("fs.defalut.name", "hdfs://192.168.222.10:9000");
        FileSystem fs = FileSystem.get(conf);

        // Open an input stream for the HDFS file
        InputStream in = fs.open(new Path("hdfs:/input/wc.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false);

        // Close the input stream
        IOUtils.closeStream(in);


    }
}
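
The class above reads a file from HDFS and prints it to standard output. For the opposite direction, here is a minimal sketch of uploading a local file with the same FileSystem API (the Java counterpart of the put command shown earlier); the file name a.txt and the target directory /input are illustrative assumptions:

package com.homay.hadoopstudy;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemPut {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.222.10:9000");
        FileSystem fs = FileSystem.get(conf);

        // Upload the local file a.txt into the HDFS /input directory
        // (assumed names; roughly equivalent to: hadoop fs -put a.txt /input/)
        fs.copyFromLocalFile(new Path("a.txt"), new Path("/input/"));

        fs.close();
    }
}

copyFromLocalFile keeps the local copy; moveFromLocalFile would delete it after the upload.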

View Hadoop files:

[hadoop@centos01 sbin]$ hadoop dfs -ls -R /
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

drwxr-xr-x   - hadoop supergroup          0 2021-07-10 08:58 /input
-rw-r--r--   2 hadoop supergroup         83 2021-07-10 08:58 /input/wc.txt
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging/hadoop
drwx------   - hadoop supergroup          0 2021-07-10 08:49 /tmp/hadoop-yarn/staging/hadoop/.staging

Run

Running the file reports the following error:
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset

The remote connection from the local machine to the Hadoop cluster fails, with the following log:

22:27:56.703 [main] DEBUG org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
    at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)

The log is clear: HADOOP_HOME and hadoop.home.dir are not set. What are these two items? They point to the Hadoop installation directory configured in the local environment variables. Does that mean you have to download a Windows build of Hadoop to set them? No: if you are remotely connecting to a Hadoop cluster running on Linux, you do not need to download and install the Windows version of Hadoop at all.

However, when connecting to a Hadoop cluster from a local Windows machine, you do need to configure a few Hadoop-related items locally, including hadoop.dll and winutils.exe.

winutils: Since Hadoop is written mainly for Linux, winutils.exe is used to emulate the Linux directory environment. This helper program is required whenever Hadoop runs on Windows or a remote Hadoop cluster is called from Windows. winutils is a set of Windows binaries, built for different Hadoop versions on Windows VMs, used to run and test Hadoop-related applications on Windows.

Solution

Once the cause is understood, download the winutils build that matches the version of the installed Hadoop cluster.

Download: https://github.com/steveloughran/winutils

Note: If the exact version is not available, choose the closest one. For example, if the cluster runs 2.8.5, the 2.8.3 files can be downloaded and used.

Set the environment variable %HADOOP_HOME% to point to the directory above the bin directory that contains winutils.exe, that is:

  1. Add the system variable HADOOP_HOME.
  2. Copy the bin folder from the downloaded 2.8.3 directory into the location that HADOOP_HOME points to.
  3. Restart IDEA and run again; the problem is solved.

Note: you do not need to download and install the Windows version of Hadoop; you only need to place winutils.exe under the bin directory that %HADOOP_HOME% points to.
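
If changing the system environment variables is inconvenient, hadoop.home.dir can also be set as a JVM system property before any Hadoop class is used, since the Shell class shown in the stack trace checks it. A minimal sketch, assuming winutils.exe has been placed under C:\hadoop\bin (the path is an assumption):

package com.homay.hadoopstudy;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HadoopHomeExample {

    public static void main(String[] args) throws Exception {
        // Assumed path: the directory that contains bin\winutils.exe;
        // must be set before any Hadoop class is loaded
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());
    }
}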

After restarting, the above problem was solved, but a new error appeared: Wrong FS: hdfs:/input/wc.txt, expected: file:///

Detailed error information:

23:51:26.466 [main] DEBUG org.apache.hadoop.fs.FileSystem - FS for file is class org.apache.hadoop.fs.LocalFileSystem
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/input/wc.txt, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730)

Solution:

Copy the cluster's core-site.xml and hdfs-site.xml into the current project.

1)hdfs-site.xml

2)core-site.xml

3)mapred-site.xml

The three files above are the configuration XML files from the Hadoop installation in your Linux environment. Put the core-site.xml and hdfs-site.xml from the Hadoop cluster into the project's src/main/resources directory.
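
If placing the files on the classpath is not convenient, a similar effect can be obtained by adding them to the Configuration explicitly, or by setting fs.defaultFS by hand as in FileSystemCat. A minimal sketch; the resource paths below are assumptions and should point to wherever the cluster's files were copied:

package com.homay.hadoopstudy;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemFromClusterConf {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed paths: the cluster configuration files copied into the project
        conf.addResource(new Path("src/main/resources/core-site.xml"));
        conf.addResource(new Path("src/main/resources/hdfs-site.xml"));

        FileSystem fs = FileSystem.get(conf);
        // Should now print hdfs://..., not file:///
        System.out.println(fs.getUri());
    }
}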


Then run the file again; the Java call to the Hadoop API succeeds ^_^

