Setup for a cluster

Add User

    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop
    sudo usermod -a -G sudo hadoop   # add hadoop to the sudoers group
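
A quick sanity check that the account and group memberships came out as
intended (the numeric ids below are illustrative; yours may differ):

    id hadoop
    # e.g. uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop),27(sudo)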

Set env

    export JAVA_HOME=jdk_path (e.g. /usr/lib/jvm/java-6-sun)
    export PATH=${JAVA_HOME}/bin:${JAVA_HOME}/jre/bin:${PATH}
    export HADOOP_HOME=hadoop_root (e.g. /usr/local/hadoop)
    export PATH=$PATH:$HADOOP_HOME/bin
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"
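
These exports and aliases only last for the current shell. One way to make
them permanent is to append them to the hadoop user's ~/.bashrc; the paths
below are the example paths from above, so adjust them to your installation:

    echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
    echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
    echo 'export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}' >> ~/.bashrc
    source ~/.bashrc
    java -version && hadoop version   # both commands should now resolve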

Hadoop config

  • hadoop-env.sh
    vi etc/hadoop/hadoop-env.sh
    change JAVA_HOME
    export JAVA_HOME=jdk_path (e.g. /usr/lib/jvm/java-6-sun)
  • yarn-env.sh
    vi etc/hadoop/yarn-env.sh
    change JAVA_HOME
    export JAVA_HOME=jdk_path (e.g. /usr/lib/jvm/java-6-sun)
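
Editing both files by hand works, but the change can also be scripted. The
sed one-liner below assumes each file already contains an uncommented
"export JAVA_HOME" line; if yours is commented out (yarn-env.sh often ships
that way), append the line instead:

    cd $HADOOP_HOME
    sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-sun|' \
        etc/hadoop/hadoop-env.sh etc/hadoop/yarn-env.sh
    grep '^export JAVA_HOME' etc/hadoop/hadoop-env.sh etc/hadoop/yarn-env.sh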

Configuring all machines

  • configure all machines
    su hadoop
    ssh-keygen -t rsa -P ""
    cd ~/.ssh
    cat id_rsa.pub >> authorized_keys
  • modify hosts for all machines
    192.168.202.92  master(hostname)
    192.168.202.13  slave

Attention: 1. master/slave must be hostnames, not IP addresses, because
MapReduce addresses nodes by hostname; 2. remove any other bindings for the
master/slave hostnames from /etc/hosts.

    127.0.0.1    localhost
    #127.0.1.1  sh030  (attention: because of this binding, the slave cannot connect to the master via sh030:54310)
    192.168.202.92  sh030
    192.168.202.13  zxx-desktop
    192.168.0.62    jack-desktop
  • copy master id_rsa.pub to slave authorized_keys
    cat id_rsa.pub | ssh hadoop@slave "cat >> /home/hadoop/.ssh/authorized_keys"
  • configure master only
    cat etc/hadoop/masters
    master (hostname)
    cat etc/hadoop/slaves (used only by helper scripts such as sbin/start-dfs.sh)
    master
    slave

Attention: the master/slave names here must match the hostnames in the hosts
file exactly; a quick verification sketch follows.
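
Before continuing, it is worth verifying both prerequisites at once: that
the names resolve consistently and that passwordless SSH works in the
direction the start scripts need (master to each slave). A minimal check,
run on the master; master/slave stand for the actual hostnames from
/etc/hosts:

    # resolution must come from the 192.168.x.x entries, not 127.0.1.1
    getent hosts master slave
    # each should print the remote hostname without asking for a password
    ssh hadoop@slave hostname
    ssh hadoop@master hostname   # the master must also reach itself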

etc/hadoop/*-site.xml for all machines

  • core-site.xml
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/hadoop</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
  • hdfs-site.xml
    <property>
      <name>dfs.replication</name>
      <value>2</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>  

    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>testHadoop-162:50090</value>
    </property>

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/hdfs/name</value>
    </property>

    <property>
      <name>dfs.namenode.checkpoint.dir</name>
      <value>file:///data/hdfs/checkpoint</value>
    </property>

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///data/hdfs/data</value>
    </property>

    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>

    <property>
      <name>dfs.support.broken.append</name>
      <value>true</value>
    </property>
  • yarn-site.xml
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>sh030:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>sh030:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>sh030:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>sh030:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>sh030:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
  • mapred-site.xml
    <configuration> 
        <property> 
        <name>mapreduce.framework.name</name> 
        <value>yarn</value> 
        </property> 
        <property> 
        <name>mapreduce.jobhistory.address</name> 
        <value>master-hadoop:10020</value> 
        </property> 
        <property> 
        <name>mapreduce.jobhistory.webapp.address</name> 
        <value>master-hadoop:19888</value> 
        </property> 
    </configuration>
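
Once the four files are distributed to every machine, hdfs getconf gives a
quick confirmation that Hadoop actually picked the values up instead of
silently falling back to defaults:

    hdfs getconf -confKey fs.defaultFS      # expect hdfs://master:54310
    hdfs getconf -confKey dfs.replication   # expect 2
    hdfs getconf -namenodes                 # expect the master hostname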

Formatting the HDFS filesystem

    bin/hdfs namenode -format   # "bin/hadoop namenode -format" still works but is deprecated in 2.x

Attention: when moving from a single-node setup to a cluster, first delete
everything under /data/hadoop; otherwise the slave DataNodes cannot start,
because their stored clusterID no longer matches the freshly formatted
NameNode.

    rm -rf /data/hadoop/*
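
Since the DataNode directories live on each machine, the cleanup has to run
on every node, not just the master. A small loop over the hostnames does it;
the paths match the *-site.xml values above, so adjust them if yours differ:

    for h in master slave; do
        ssh hadoop@$h 'rm -rf /data/hadoop/* /data/hdfs/*'
    done
    # then reformat once, on the master only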

Launch HDFS

    ./sbin/start-dfs.sh
    # launches the NameNode, SecondaryNameNode, and DataNode daemons (the master also runs a DataNode)

MapReduce

    ./sbin/start-yarn.sh
    jps on Master
    29252 DataNode (the master also runs as a slave)
    29940 NodeManager (the master also runs as a slave)
    29051 NameNode
    29732 ResourceManager
    29515 SecondaryNameNode

    jps on Slave
    27858 DataNode
    28116 NodeManager
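
jps only proves the JVMs are up; whether the slave actually registered with
the master is better checked from the daemons themselves. On the master:

    hdfs dfsadmin -report   # the summary should show two live DataNodes
    yarn node -list         # should list two NodeManagers in RUNNING state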

Stopping the cluster

  • MapReduce
    ./sbin/stop-yarn.sh 
  • HDFS
    ./sbin/stop-dfs.sh

Test

  • put data
    hadoop fs -mkdir /testdata
    hadoop fs -put -f ./*.txt /testdata
  • mapreduce
    hadoop jar ./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.3.0-sources.jar org.apache.hadoop.examples.WordCount /testdata /testdata-output
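
If the job finishes, the counts land in part files under the output
directory (note that /testdata-output must not exist beforehand, or the job
fails with FileAlreadyExistsException):

    hadoop fs -ls /testdata-output               # _SUCCESS marker plus part files
    hadoop fs -cat /testdata-output/part-r-00000 | head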
