Abstract: Hadoop Streaming uses the MapReduce framework to let you write applications that process massive amounts of data.

This article is shared from the Huawei Cloud Community article "Hadoop Streaming: Writing a Hadoop MapReduce Program in Python", by Donglian Lin.

With the rise of digital media and the Internet of Things, the amount of digital data generated every day has grown exponentially. This situation creates challenges for the next generation of tools and technologies that must store and manipulate this data. This is where Hadoop Streaming comes in! Data generated worldwide has grown sharply every year since 2013, and IDC estimates that by 2025 the amount of data created annually will reach 180 zettabytes!

According to IBM, nearly 2.5 quintillion bytes of data are created every day, and 90% of the world's data was created in the past two years alone! Storing such a huge amount of data is a challenging task. Hadoop can process large volumes of structured and unstructured data more efficiently than a traditional enterprise data warehouse, storing these huge data sets across distributed clusters of computers. Hadoop Streaming uses the MapReduce framework, which can be used to write applications that process massive amounts of data.

Since the MapReduce framework is based on Java, you may wonder how a developer with no Java experience can work with it. With Hadoop Streaming, developers can write mapper/reducer applications in their favorite language without needing much Java knowledge, rather than switching to new tools or technologies such as Pig and Hive.

What is Hadoop Streaming?

Hadoop Streaming is a utility that ships with the Hadoop distribution and can be used to execute programs for big data analysis. Hadoop Streaming jobs can be written in languages such as Python, Java, PHP, Scala, Perl, UNIX shell, and more. The utility allows us to use any executable file or script as the mapper and/or the reducer to create and run Map/Reduce jobs. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Parameter description:

  • -input: HDFS directory or file supplying input to the mapper
  • -output: HDFS directory where the reducer's output is written
  • -mapper: executable or script to run as the mapper
  • -reducer: executable or script to run as the reducer

In this example, /bin/cat serves as an identity mapper and /bin/wc as the reducer, so the job simply counts the lines, words, and bytes of its input.

Python MapReduce code:

mapper.py
#!/usr/bin/python3
import sys

# Word Count example
# input comes from standard input (STDIN)
for line in sys.stdin:
    line = line.strip()   # remove leading and trailing whitespace
    words = line.split()  # split the line into a list of words
    for word in words:
        # write the result to standard output (STDOUT)
        print('%s\t%s' % (word, 1))  # emit the pair (word, 1)
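For example, given the input line "lion deer lion", the mapper emits one tab-separated (word, 1) pair per occurrence, with no aggregation; duplicates are combined later by the reducer, after the shuffle-and-sort phase:

lion    1
deer    1
lion    1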


reducer.py

#!/usr/bin/python3
import sys

# This reducer relies on Hadoop (or sort) delivering the mapper output
# sorted by key, so that identical words arrive on consecutive lines.
current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently skip this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the last word, if any
if current_word:
    print('%s\t%s' % (current_word, current_count))
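For example, given this already-sorted, tab-separated input on STDIN:

deer    1
deer    1
lion    1

the reducer prints:

deer    2
lion    1

Note that the reducer never sees all keys at once; it only compares each line's key with the previous one, which is why the sorted order matters.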


Run:

  • Create a file with the following content and name it word.txt.

Cat mouse lion deer tiger lion elephant lion deer

  • Copy the mapper.py and reducer.py scripts into the same folder as word.txt.
  • Open the terminal and navigate to the directory where the files are located (ls lists the files in a directory; cd changes the current directory).
  • View the contents of the file.

Command: cat file_name

Contents of mapper.py

Command: cat mapper.py

Contents of reducer.py

Command: cat reducer.py

We can run the mapper and reducer on local files (for example, word.txt). To run Map and Reduce on the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming jar. So before we run the scripts on HDFS, let's run them locally to make sure they work properly.

Run the mapper

Command: cat word.txt | python3 mapper.py

Run the reducer

Command: cat word.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py

Here sort -k1,1 stands in for Hadoop's shuffle-and-sort phase: it groups identical words onto consecutive lines, which is exactly what the reducer expects.

We can see that the mapper and reducer are working as expected, so we should not face any further issues here.
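The same local test can also be driven from Python itself. Below is a minimal sketch (an addition, not part of the original walkthrough) that assumes mapper.py, reducer.py, and word.txt sit in the current directory; it mimics Hadoop's sort step in between:

#!/usr/bin/python3
import subprocess

# run the mapper on word.txt
with open('word.txt', 'rb') as f:
    map_out = subprocess.run(
        ['python3', 'mapper.py'], stdin=f,
        capture_output=True, check=True).stdout

# sort by key, as Hadoop does between the map and reduce phases
shuffled = b'\n'.join(sorted(map_out.splitlines())) + b'\n'

# feed the sorted pairs to the reducer and print its output
red_out = subprocess.run(
    ['python3', 'reducer.py'], input=shuffled,
    capture_output=True, check=True).stdout
print(red_out.decode(), end='')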

Run Python code on Hadoop

Before we run the MapReduce job on Hadoop, copy the local data (word.txt) to HDFS.

Example: hdfs dfs -put source_directory hadoop_destination_directory

Command: hdfs dfs -put /home/edureka/MapReduce/word.txt /user/edureka
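If you want to confirm that the file landed in HDFS, you can list the target directory:

Command: hdfs dfs -ls /user/edureka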

Copy the path of the jar file

The path of the Hadoop Streaming jar depends on your Hadoop version; for version 2.2.X it is:

/usr/lib/hadoop-2.2.X/share/hadoop/tools/lib/hadoop-streaming-2.2.X.jar

So, find the Hadoop Streaming jar on your terminal and copy the path.

Command:

ls /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar
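If your Hadoop version differs, find can locate the jar for you (assuming the usual /usr/lib layout):

Command: find /usr/lib/ -name "hadoop-streaming*.jar"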

Run MapReduce job

Command:

hadoop jar /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -file /home/edureka/mapper.py -mapper mapper.py -file /home/edureka/reducer.py -reducer reducer.py -input /user/edureka/word.txt -output /user/edureka/Wordcount

The -file options ship the local scripts to the cluster nodes so that every map and reduce task can execute them.

Hadoop provides a basic web interface with statistics and information. While the Hadoop cluster is running, open http://localhost:50070 in a browser. This is the Hadoop web interface.

Now browse the file system and find the generated Wordcount directory to see the output.

We can use this command to see the output on the terminal:

Command: hadoop fs -cat /user/edureka/Wordcount/part-00000
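For the sample word.txt used here, the output should look like the following (Hadoop sorts keys bytewise by default, so the capitalized "Cat" comes first):

Cat     1
deer    2
elephant        1
lion    3
mouse   1
tiger   1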

You have now learned how to use Hadoop Streaming to execute MapReduce programs written in Python!
