Abstract: Hadoop Streaming uses the MapReduce framework to let you write applications that process massive amounts of data.

This article is shared from the Huawei Cloud Community article "Hadoop Streaming: Writing a Hadoop MapReduce Program in Python", by Donglian Lin.

With the rise of digital media and the Internet of Things, the amount of digital data generated every day has grown exponentially. This situation creates challenges for the next generation of tools and technologies that must store and manipulate this data. This is where Hadoop Streaming comes in! Data generated worldwide has grown sharply every year since 2013, and IDC estimates that by 2025 the amount of data created annually will reach 180 zettabytes!

According to IBM, nearly 2.5 quintillion bytes of data are created every day, and 90% of the world's data was created in the past two years alone! Storing such a huge amount of data is a challenging task. Hadoop can process large volumes of structured and unstructured data more efficiently than a traditional enterprise data warehouse, storing these huge data sets across distributed clusters of computers. Hadoop Streaming uses the MapReduce framework, which can be used to write applications that process massive amounts of data.

Since the MapReduce framework is based on Java, you may wonder how a developer with no Java experience can work with it. With Hadoop Streaming, developers can write mapper/reducer applications in their favorite language without needing much Java knowledge, rather than switching to new tools or technologies such as Pig and Hive.

What is Hadoop Streaming?

Hadoop Streaming is a utility that ships with the Hadoop distribution and can be used to execute programs for big data analysis. Hadoop Streaming jobs can be written in languages such as Python, Java, PHP, Scala, Perl, UNIX shell, and more. The utility allows us to use any executable file or script as the mapper and/or the reducer to create and run Map/Reduce jobs. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Parameter description:

  • -input: HDFS directory or file supplying input to the mapper
  • -output: HDFS directory where the reducer's output is written
  • -mapper: executable or script to run as the mapper
  • -reducer: executable or script to run as the reducer

In this example, /bin/cat serves as an identity mapper and /bin/wc as the reducer, so the job simply counts the lines, words, and bytes of its input.

Python MapReduce code:

mapper.py
#!/usr/bin/python3
import sys

# Word Count example
# input comes from standard input (STDIN)
for line in sys.stdin:
    line = line.strip()   # remove leading and trailing whitespace
    words = line.split()  # split the line into a list of words
    for word in words:
        # write the result to standard output (STDOUT)
        print('%s\t%s' % (word, 1))  # emit the pair (word, 1)
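For example, given the input line "lion deer lion", the mapper emits one tab-separated (word, 1) pair per occurrence, with no aggregation; duplicates are combined later by the reducer, after the shuffle-and-sort phase:

lion    1
deer    1
lion    1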


reducer.py

#!/usr/bin/python3
import sys

# This reducer relies on Hadoop (or sort) delivering the mapper output
# sorted by key, so that identical words arrive on consecutive lines.
current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently skip this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the last word, if any
if current_word:
    print('%s\t%s' % (current_word, current_count))
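For example, given this already-sorted, tab-separated input on STDIN:

deer    1
deer    1
lion    1

the reducer prints:

deer    2
lion    1

Note that the reducer never sees all keys at once; it only compares each line's key with the previous one, which is why the sorted order matters.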


Run:

  • Create a file with the following content and name it word.txt.

Cat mouse lion deer tiger lion elephant lion deer

  • Copy the mapper.py and reducer.py scripts into the same folder as word.txt.
  • Open the terminal and navigate to the directory where the files are located (ls lists the files in a directory; cd changes the current directory).
  • View the contents of the file.

Command: cat file_name

Contents of mapper.py

Command: cat mapper.py

Contents of reducer.py

Command: cat reducer.py

We can run the mapper and reducer on local files (for example, word.txt). To run Map and Reduce on the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming jar. So before we run the scripts on HDFS, let's run them locally to make sure they work properly.

Run the mapper

Command: cat word.txt | python3 mapper.py

Run the reducer

Command: cat word.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py

Here sort -k1,1 stands in for Hadoop's shuffle-and-sort phase: it groups identical words onto consecutive lines, which is exactly what the reducer expects.

We can see that the mapper and reducer are working as expected, so we should not face any further issues here.
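The same local test can also be driven from Python itself. Below is a minimal sketch (an addition, not part of the original walkthrough) that assumes mapper.py, reducer.py, and word.txt sit in the current directory; it mimics Hadoop's sort step in between:

#!/usr/bin/python3
import subprocess

# run the mapper on word.txt
with open('word.txt', 'rb') as f:
    map_out = subprocess.run(
        ['python3', 'mapper.py'], stdin=f,
        capture_output=True, check=True).stdout

# sort by key, as Hadoop does between the map and reduce phases
shuffled = b'\n'.join(sorted(map_out.splitlines())) + b'\n'

# feed the sorted pairs to the reducer and print its output
red_out = subprocess.run(
    ['python3', 'reducer.py'], input=shuffled,
    capture_output=True, check=True).stdout
print(red_out.decode(), end='')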

Run Python code on Hadoop

Before we run the MapReduce job on Hadoop, copy the local data (word.txt) to HDFS.

Example: hdfs dfs -put source_directory hadoop_destination_directory

Command: hdfs dfs -put /home/edureka/MapReduce/word.txt /user/edureka
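If you want to confirm that the file landed in HDFS, you can list the target directory:

Command: hdfs dfs -ls /user/edureka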

Copy the path of the jar file

The path of the Hadoop Streaming jar depends on your Hadoop version; for version 2.2.X it is:

/usr/lib/hadoop-2.2.X/share/hadoop/tools/lib/hadoop-streaming-2.2.X.jar

So, find the Hadoop Streaming jar on your terminal and copy the path.

Command:

ls /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar
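If your Hadoop version differs, find can locate the jar for you (assuming the usual /usr/lib layout):

Command: find /usr/lib/ -name "hadoop-streaming*.jar"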

Run MapReduce job

Command:

hadoop jar /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -file /home/edureka/mapper.py -mapper mapper.py -file /home/edureka/reducer.py -reducer reducer.py -input /user/edureka/word.txt -output /user/edureka/Wordcount

The -file options ship the local scripts to the cluster nodes so that every map and reduce task can execute them.

Hadoop provides a basic web interface with statistics and information. While the Hadoop cluster is running, open http://localhost:50070 in a browser. This is the Hadoop web interface.

Now browse the file system and find the generated Wordcount directory to see the output.

We can use this command to see the output on the terminal:

Command: hadoop fs -cat /user/edureka/Wordcount/part-00000
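For the sample word.txt used here, the output should look like the following (Hadoop sorts keys bytewise by default, so the capitalized "Cat" comes first):

Cat     1
deer    2
elephant        1
lion    3
mouse   1
tiger   1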

You have now learned how to use Hadoop Streaming to execute MapReduce programs written in Python!
