Hadoop overview
Hadoop history
- Hadoop originated in Nutch. Nutch was designed as a web crawling engine, but as the volume of crawled data grew, it ran into serious performance and scaling problems. In 2003 and 2004, Google published two papers that pointed to a solution: one described GFS, the predecessor of HDFS, used to store massive numbers of web pages; the other described the distributed computing framework MapReduce. Guided by these papers, the founder of Nutch spent two years implementing HDFS and MapReduce, which were then split out of Nutch as an independent project, Hadoop. In 2008, Hadoop became a top-level Apache project.
- Early on, "Hadoop" did not refer only to the distributed open-source software everyone is familiar with today, but to a whole big-data ecosystem that included many other projects. Around 2010, HBase, Hive, ZooKeeper, and others were split out of the Hadoop project and became top-level Apache projects. In 2011, Hadoop 2.0 was released with a major architectural update: the Yarn framework was introduced to focus on resource management, which streamlined the responsibilities of MapReduce. As a general resource scheduling and management module, Yarn also supports other programming models, the most famous of which is Spark.
- Because Hadoop version management is complex, and deploying, installing, and configuring a cluster requires writing a large number of configuration files and distributing them to every node, the process is error-prone and inefficient. Many companies therefore repackage the base Hadoop into commercial distributions. Today there are many Hadoop distributions, including those from Huawei, Intel, and Cloudera (CDH).
The composition of Hadoop (2.0)
HDFS
The composition of HDFS
NameNode: the manager of the entire HDFS cluster. Its responsibilities:
- Manage the HDFS namespace
- Manage the replica placement policy
- Manage the mapping from data blocks to their locations on DataNodes
- Interact with the client to handle its read and write requests
DataNode: stores the actual file data.
- Store the actual data block
- Perform data block read and write operations
- After a DataNode starts, it registers with the NameNode and then reports all of its block information to the NameNode every 6 hours.
- The DataNode sends a heartbeat every 3 seconds; the heartbeat response carries commands from the NameNode to the DataNode, such as copying a block to another machine or deleting a block.
- If the NameNode does not receive a heartbeat from a DataNode for more than 10 minutes and 30 seconds (with default settings), the node is considered unavailable.
Client
- The client library that HDFS provides to developers; it encapsulates calls to HDFS operations.
- Responsible for file splitting: when a file is uploaded to HDFS, the Client interacts with the NameNode and DataNodes and splits the file into blocks for upload.
Secondary NameNode: assists the NameNode and shares part of its workload.
- Periodically merges the Fsimage and Edits files and pushes the result back to the NameNode.
- Assists in recovering the NameNode.
- It is not a hot standby of the NameNode: when the NameNode goes down, it cannot immediately take over and provide service.
HDFS read and write process
HDFS upload file
- The client asks the NameNode for permission to upload a file; the NameNode performs compliance checks, creates the corresponding directory metadata, and returns whether the upload is allowed.
- The client splits the file and asks the NameNode which DataNode servers the first block should be uploaded to. The NameNode returns 3 DataNodes: dn1, dn2, and dn3.
- The client requests dn1 to receive the first block; upon receiving the request, dn1 calls dn2 and dn2 calls dn3, establishing the communication pipeline.
- The client starts uploading the first block to dn1 (the data is first read from disk into a local memory cache), Packet by Packet. When dn1 receives a Packet, it passes it on to dn2, and dn2 passes it on to dn3.
- After a block has been transferred, the client asks the NameNode again where to upload the next block, repeating steps 1~4 (a client-side sketch of the upload API follows these steps).
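The steps above describe what happens inside HDFS; from the developer's point of view the whole pipeline is hidden behind the FileSystem client. A minimal upload sketch, assuming an illustrative NameNode address (hdfs://localhost:9000), user name, and file paths that are not part of these notes:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode; the address and user are placeholders.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf, "hadoop");
        // The client splits the file into blocks and streams them through the
        // DataNode pipeline chosen by the NameNode, as described in the steps above.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/hadoop/input.txt"));
        fs.close();
    }
}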
NameNode node selection strategy-node distance calculation
- If the Client is inside the Hadoop cluster, the NameNode selects the DataNode closest to the client to receive the data. Node distance: the sum of the distances from the two nodes to their nearest common ancestor.
- If the Client is outside the cluster, the NameNode randomly selects a node on some rack for the first replica. The second replica goes to a random node on a different rack, and the third replica to a random node on the same rack as the second replica.
HDFS read process
- The client requests the NameNode to download the file, and the NameNode returns the address of the DataNode where the file block is located.
- The client selects the nearest DataNode server based on node distance and starts reading data; the DataNode transmits the data in Packets.
- After the client receives a Packet, it verifies its checksum; once verification passes, the Packet is written to the target file and the next Packet is requested. The whole process is serial (because the I/O itself is the slowest part).
- While reading, if an error occurs when communicating with a DataNode, the client tries the next DataNode that holds the block. Failed DataNodes are recorded by the client and not contacted again (a client-side read sketch follows).
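From the client's perspective a read is just an input stream; DataNode selection and checksum verification happen inside the stream implementation. A minimal read sketch under the same illustrative address and path assumptions as the upload example:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf, "hadoop");
        // open() asks the NameNode for the block locations; the returned stream
        // reads Packets from the nearest DataNode and verifies their checksums.
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}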
SecondaryNameNode
The NameNode manages the file system metadata on a single machine, which easily leads to single-point read/write performance problems and metadata safety problems.
The SecondaryNameNode helps the NameNode with the read/write performance problem: NameNode metadata is kept both in memory and on disk.
- Fsimage file: A permanent checkpoint of the metadata of the HDFS file system, which contains the serialization information of all directories and file inodes of the HDFS file system.
- Edits file: records all update operations of the HDFS file system. Every write operation performed by a file system client is first recorded in the Edits file. The Edits file is append-only, which is very efficient. Whenever metadata is added or updated, the in-memory metadata is modified and the operation is appended to Edits.
- When the NameNode starts, it reads the Fsimage file into memory and replays the update operations in Edits, ensuring that the in-memory metadata is up to date and in sync.
- Appending to Edits for a long time makes the file too large, reduces efficiency, and makes metadata recovery after a power failure take too long. Therefore the 2NN is dedicated to merging FsImage and Edits.
SecondaryNameNode working mechanism
When the NameNode starts for the first time, it creates the FsImage and Edits files; on subsequent starts, it loads the existing FsImage and Edits files directly.
- fsimage_0000000000000000002: the latest FsImage file.
- edits_inprogress_0000000000000000003: the Edits file currently being written.
- seen_txid: a plain text file that records the trailing number of the latest edits_inprogress file.
Secondary NameNode checkpoint workflow
- 2NN asks NN if it needs CheckPoint.
- The NN rolls the current edits file: it copies the existing edits file and the latest fsimage file to the 2NN, updates the number in seen_txid, and starts a new edits file.
- The 2NN loads the edit log and the image file into memory and merges them into a new image file, fsimage.chkpoint, which is then copied back to the NN. The NN renames fsimage.chkpoint to fsimage, completing one checkpoint.
HDFS summary
Advantages
1. High fault tolerance: data is automatically saved as multiple replicas, and fault tolerance is improved by adding replicas. If a replica is lost, it can be recovered automatically.
2. Suitable for big data: data volumes can reach the GB, TB, or even PB scale, and the number of files can exceed one million.
3. Can be built on inexpensive machines: reliability is improved through the multi-replica mechanism.
Disadvantages
1. Not suitable for low-latency data access.
2. Cannot efficiently store large numbers of small files: many small files consume a large amount of NameNode memory for directory and block metadata, and the seek time for a small file can exceed its read time, which violates HDFS's design goals.
3. No support for concurrent writes or random file modification; only appends are supported.
DataNode and NameNode source code guide
Preparation before reading the code: Hadoop RPC framework guide
RPC protocol
public interface MyInterface {
    // Hadoop RPC requires the protocol interface to declare a version number.
    long versionID = 1;

    boolean demo();
}
RPC provider
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;

public class MyHadoopServer implements MyInterface {
    @Override
    public boolean demo() {
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Build and start an RPC server exposing MyInterface on localhost:8888.
        Server server = new RPC.Builder(new Configuration())
                .setBindAddress("localhost")
                .setPort(8888)
                .setProtocol(MyInterface.class)
                .setInstance(new MyHadoopServer())
                .build();
        server.start();
    }
}
RPC consumer
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class MyHadoopClient {
    public static void main(String[] args) throws Exception {
        // Obtain a proxy for MyInterface served at localhost:8888 and invoke it.
        MyInterface client = RPC.getProxy(
                MyInterface.class,
                MyInterface.versionID,
                new InetSocketAddress("localhost", 8888),
                new Configuration());
        client.demo();
        RPC.stopProxy(client);
    }
}
NameNode startup source code
- Start the HTTP service on port 9870 (the NameNode web UI)
- Load the image file and the edit log
- Initialize the NN's RPC server, which receives RPC requests from DataNodes
- The NN starts resource checking
The NN checks for heartbeat timeouts (a thread is started to determine whether any DataNode has timed out)
- HDFS default DataNode dead-node timeout: timeout = 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval = 2 × 5 min + 10 × 3 s = 10 min 30 s. A DataNode whose heartbeat has not been received within this time is considered dead (see the sketch below).
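A small standalone sketch (not the NameNode source itself) showing how that timeout follows from the two configuration keys; the defaults used are the standard 5-minute recheck interval and 3-second heartbeat:

import org.apache.hadoop.conf.Configuration;

public class HeartbeatTimeoutDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.namenode.heartbeat.recheck-interval is in milliseconds (default 5 min);
        // dfs.heartbeat.interval is in seconds (default 3 s).
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 5 * 60 * 1000L);
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3L);
        long timeoutMs = 2 * recheckMs + 10 * 1000L * heartbeatSec;
        // 2 * 300000 + 10 * 3000 = 630000 ms = 10 min 30 s
        System.out.println("DataNode considered dead after " + timeoutMs / 1000 + " s");
    }
}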
DataNode startup source code
work process
Source code diagram
MapReduce
MapReduce example
- Requirement: there is a 300 MB text file; count the total number of occurrences of each word. The results for words starting with a-p must go into one result file, and those starting with q-z into another.
Implementation:
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split it into words
        String[] words = line.split(" ");
        // 3. Write out each word
        for (String word : words) {
            // Wrap the word as the output key
            outK.set(word);
            // Emit (word, 1)
            context.write(outK, outV);
        }
    }
}
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Accumulate the counts for this key
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        // Write out (word, total)
        context.write(key, outV);
    }
}
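A driver class is still needed to wire the Mapper and Reducer into a Job. The following is a minimal sketch; the class name and input/output paths are illustrative, not part of the original notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Illustrative HDFS paths.
        FileInputFormat.setInputPaths(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}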
The slicing rule determines the number of MapTasks.
- MapReduce abstracts data input as an InputFormat; the commonly used FileInputFormat is a concrete implementation for reading files.
- FileInputFormat's default slicing rule logically slices the input file, with every 128 MB forming one input split; each split is handed to one MapTask for parallel processing.
If there are too many splits, too many MapTasks are started and resources are wasted; if there are too few, the Map phase becomes slow (a tuning sketch follows).
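If the default slicing produces too many or too few MapTasks, the split size can be tuned through FileInputFormat; internally the split size is max(minSize, min(maxSize, blockSize)). A sketch, assuming the WordCount Job from the driver above (the concrete sizes are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    // Call this from the driver after the Job has been created.
    public static void tune(Job job) {
        // Raising the minimum split size above the block size yields fewer, larger
        // splits (fewer MapTasks); lowering the maximum split size yields more.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB
    }
}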
The number of ReduceTasks needs to be manually specified.
- In the Map phase, each output record is assigned to a partition by the partitioning algorithm.
- If the number of ReduceTasks is 0, there is no Reduce phase and the number of output files equals the number of MapTasks.
- If the number of ReduceTasks is 1, all output goes into a single file.
- If the number of ReduceTasks (when greater than 1) is smaller than the number of partitions produced by the partitioner, some data has no ReduceTask to consume it and an exception is thrown.
- If the number of ReduceTasks is greater than the number of partitions, some ReduceTasks will be idle (a partitioner sketch for the two-result-file requirement follows).
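For the example requirement above (words starting with a-p in one result file, q-z in another), one way to do it is a custom Partitioner plus two ReduceTasks. This is a sketch building on the WordCount classes above, not code from the original notes:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words starting with a-p to partition 0 and everything else to partition 1.
public class LetterRangePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'p') ? 0 : 1;
    }
}

// In the driver:
//   job.setPartitionerClass(LetterRangePartitioner.class);
//   job.setNumReduceTasks(2); // must be at least the number of partitions returned above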
The detailed workflow of MapReduce.
Map stage
- Read stage: the RecordReader reads data from the InputFormat split and parses it into key/value pairs.
- Map stage: the user-defined Mapper.map method is executed, converting the input key/value into output key/value.
- Collect stage: the output key/value is received, the partitioning algorithm is called, and the data is written into the corresponding partition of the circular memory buffer.
- Spill stage: when usage of the memory buffer exceeds a threshold, the spill thread is triggered. The thread first quick-sorts the data in memory and then spills it to disk.
- Combine stage: after all data has been processed, the MapTask merges all temporary spill files once, ensuring that only one data file is produced in the end.
Reduce phase
- Copy phase: each ReduceTask copies the data of its partition from all MapTasks; each ReduceTask handles one partition and they do not affect each other. If the copied data exceeds a threshold it is spilled to disk, otherwise it stays in memory.
- Merge stage: the ReduceTask merges all the data of its partition into one large file.
- Sort stage: the merged file is merge-sorted. Since each MapTask already guarantees order within its partition, the ReduceTask only needs one merge sort over all the data.
- Reduce phase: Execute the user's reduce method and write the result to HDFS.
Advantages and disadvantages of MapReduce
Advantages
- Simple to program: the framework provides a high degree of encapsulation.
- Strong scalability: machines can be added quickly to expand computing capacity.
- High fault tolerance: when a node goes down, its tasks are automatically transferred to another node without manual intervention.
- Suitable for offline processing of very large (PB-level) data sets.
Disadvantages
- Not good at real-time calculations.
- Not good at streaming computation: MapReduce's input data set is static.
- Heavy I/O: the output of every MapReduce job is written to disk, which causes a large amount of disk I/O.
MapTask source code guide
The MapTask.run method is the entry point of MapTask.
- Read the configuration and initialize the MapTask, generating the jobId.
- Determine which API is in use, choose runNewMapper or runOldMapper accordingly, and execute it.
- After MapTask execution finishes, do some cleanup work.
runNewMapper
- Instantiate the default InputFormat and compute the splits. Instantiate the output object according to the configured number of ReduceTasks, and instantiate the input object.
mapper.run(mapperContext) is executed
- Loop over each key/value pair and execute the user's map logic.
- Inside the map method, collector.collect is called to write data into the circular buffer.
output.close(mapperContext) is executed and eventually calls the close method of MapTask
The flush method of MapTask is called
- sortAndSpill sorts the in-memory data and spills it to files, one temporary file per partition, sorted within each partition.
- mergeParts merge-sorts the multiple temporary files into one file.
- Finally the close method of MapTask is called, which is an empty method.
Guide to the ReduceTask source code: its entry point is the run method of ReduceTask.
- First, initialize the state machines for the copy, sort, and reduce phases.
- initialize() sets the output format to TextOutputFormat.
- The shuffleConsumerPlugin.init(shuffleContext) method is executed, initializing inMemoryMerger and onDiskMerger.
shuffleConsumerPlugin.run();
- Create Fetchers to fetch the data; after fetching completes, switch the state to sort.
- The finalMerge in merger.close() is executed, merging the in-memory data with the on-disk data.
Switch the state to reduce: runNewReducer executes the user-defined Reduce method.
The user-defined reduce method is executed and calls context.write to output data, which eventually calls the write method of TextOutputFormat:
- Write the key first
- Then write the separator (a tab by default) and the value
- Finally write a newline
Yarn
Yarn composition
ResourceManager (RM): Manager of global resources
- It consists of two parts: a pluggable Scheduler and the ApplicationManager.
- Scheduler: a pure scheduler; it is not responsible for monitoring applications.
- ApplicationManager: mainly responsible for accepting job submissions, allocating the first Container for an application to run its ApplicationMaster, monitoring the ApplicationMaster, and restarting its Container when it fails.
NodeManager (NM)
- Receives requests from the ResourceManager and assigns Containers to tasks of an application.
- Exchanges information with the ResourceManager to keep the whole cluster running smoothly: the ResourceManager tracks the health of the cluster by collecting reports from each NodeManager, while each NodeManager monitors its own health.
- Manage the life cycle of each Container
- Manage logs on each node
- Execute some additional services applied on Yarn, such as the shuffle process of MapReduce
Container
- It is the computing unit of the Yarn framework and executes the application's tasks.
- It is a set of allocated system resources: memory, CPU, disk, network, etc.
- Each application starts from its ApplicationMaster, which itself runs in a Container (the 0th). Once started, the ApplicationMaster negotiates more Containers with the ResourceManager according to the tasks' requirements; Containers can be dynamically released and requested while the application runs.
ApplicationMaster(AM)
- ApplicationMaster is responsible for negotiating the appropriate container with the scheduler, tracking the status of the application, and monitoring their progress
- Each application has its own ApplicationMaster, responsible for negotiating resources (container) with ResourceManager and NodeManager to work together to perform and monitor tasks
- When an ApplicationMaster starts, it periodically sends heartbeat reports to the ResourceManager to confirm its health and report the resources it needs.
Yarn execution process
- The client program submits the application to the ResourceManager and requests an ApplicationMaster instance, and the ResourceManager gives an applicationID in the response
- ResourceManager finds a NodeManager that can run a Container, and starts the ApplicationMaster instance in this Container
- ApplicationMaster registers with ResourceManager. After registration, the client can query ResourceManager to obtain detailed information of its ApplicationMaster.
- During normal operation, the ApplicationMaster sends resource-request messages to the ResourceManager according to the resource-request protocol. The ResourceManager allocates Container resources to the ApplicationMaster as best it can according to its scheduling policy and returns them as the response to the resource request.
- The ApplicationMaster starts a Container by sending container-launch-specification information to the NodeManager.
- The application code runs in the started Containers and reports its progress and status to the ApplicationMaster through an application-specific protocol. As the job runs, the ApplicationMaster sends heartbeats and progress information to the ResourceManager; in these heartbeats it can also request and release Containers.
- During the running of the application, the client submitting the application actively communicates with the ApplicationMaster to obtain information such as the running status and progress update of the application. The communication protocol is also an application-specific protocol.
Yarn scheduler and scheduling algorithm
Yarn Scheduler strategy
FIFO: all Applications are put into one queue, and resources are allocated to each app first by job priority and then by arrival time.
- Advantages: simple, no configuration required
- Disadvantages: not suitable for shared clusters
Capacity Scheduler: used when multiple applications run in a cluster; the goal is to maximize throughput and cluster utilization.
- The CapacityScheduler allows the resources of the whole cluster to be divided into several parts, with each organization using one part; that is, each organization gets a dedicated queue, and each organization's queue can be further divided into a hierarchical structure (Hierarchical Queues) so that different user groups inside the organization can use it. Each queue specifies the range of resources it may use.
- Inside each queue, Applications are scheduled in FIFO order. When a queue has free resources, its spare capacity can be shared with other queues.
Yarn source code
refer to
- https://www.cnblogs.com/dan2/p/13735888.html
- https://www.cnblogs.com/dan2/p/13741322.html
- https://www.cnblogs.com/dan2/p/13753696.html
- https://www.bilibili.com/video/BV1Qp4y1n7EN?from=search&seid=16365103794060146585&spm_id_from=333.337.0.0