Abstract: In a reduce side join, the sorting and shuffling stage generates heavy network I/O traffic, because the values for the same key are gathered together on the reduce side.
This article is shared from the Huawei Cloud Community article "MapReduce Example: Reduce Side Join in Hadoop MapReduce", by Donglian Lin.
In this blog, we will use a MapReduce example to explain how to perform a reduce side join in Hadoop MapReduce. Here, I assume that you are already familiar with the MapReduce framework and know how to write basic MapReduce programs. The topics discussed in this blog are as follows:
• What is a join?
• Joins in MapReduce
• What is a reduce side join?
• MapReduce example of a reduce side join
• Conclusion
What is a join?
The join operation is used to combine two or more database tables based on foreign keys. Typically, companies maintain separate tables for customers and for transaction records in their database. And, many times, these companies need the data in these separate tables to generate analytic reports. Therefore, they perform a join operation on these separate tables using a common column (foreign key), such as customer ID, to generate a combined table. Then, they analyze this combined table to obtain the desired analytic reports.
Joins in MapReduce
Just like SQL join, we can also join different data sets in MapReduce. There are two types of join operations in MapReduce:
• Map side join: As the name implies, the join operation is performed in the map phase itself. Therefore, in a map side join, the mapper performs the join, and the input to each map must be partitioned and sorted on the join key.
• Reduce side join: As the name implies, the join is performed on the reduce side, and the reducer is responsible for carrying out the join operation. It is comparatively simpler and easier to implement than a map side join, because the sorting and shuffling phase sends the values with the same key to the same reducer; by default, the data is already organized for us.
Now, let us learn more about reduce side join.
What is a reduce side join?
As mentioned earlier, reduce side join is the process of performing join operations in the reducer phase. Basically, the reduce side join occurs in the following way:
• The mapper reads the input data that is to be combined based on the common column or join key.
• The mapper processes the input and adds a tag to each record to distinguish inputs belonging to different sources, data sets, or databases.
• The mapper outputs intermediate key-value pairs, where the key is simply the join key.
• After the sorting and shuffling phase, a key and its list of values is generated for the reducer.
• Now, the reducer joins the values present in the list with the key to give the final output.
MapReduce example of a reduce side join
Suppose I have two separate data sets of a sports complex:
• cust_details: It contains the detailed information of the customer.
• transaction_details: contains customer transaction records.
Using these two data sets, I want to know the lifetime value of each customer. In doing so, I will need the following:
• The person's name and how often the person visits.
• The total amount he/she spent on purchasing the equipment.
These two data sets share a common customer ID column, and that is the join key on which we will perform the reduce side join operation. The complete project, containing the source code and input files for this MapReduce example, is available for download.
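Roughly, the records in the two input files are comma-separated and laid out as follows (the field positions are what the mappers below rely on; the sample values are only illustrative, and the columns marked "..." are not used in this example):

cust_details:        <customer ID>,<customer name>,...                  e.g. 4000001,kristina,...
transaction_details: <...>,<...>,<customer ID>,<transaction amount>,... e.g. ...,...,4000001,40.33,...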
When importing the above reduce side join MapReduce sample project into Eclipse, please keep the following in mind:
• The input files are located in the input_files directory of the project. Load these into your HDFS.
• Don't forget to add the Hadoop reference JARs (present in the reduce side join project's lib directory) to the build path according to your system or VM.
Now, let us understand what happens inside the map and reduce phases of this reduce side join MapReduce example:
1. Map stage:
I will set up a separate mapper for each of the two data sets, one for the cust_details input and the other for the transaction_details input.
cust_details mapper:
public static class CustsMapper extends Mapper<Object, Text, Text, Text>
{
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // Each input line is one customer record, e.g. "4000001,kristina,..."
        String record = value.toString();
        String[] parts = record.split(",");
        // Emit: customer ID -> "cust <customer name>"
        context.write(new Text(parts[0]), new Text("cust " + parts[1]));
    }
}
o I will read the input one tuple at a time.
o Then, I will tokenize each word in that tuple and fetch the customer ID along with the name of the person.
o The customer ID will be the key of the key-value pair that my mapper will eventually generate.
o I will also add a tag "cust" to indicate that the input tuple is of the cust_details type.
o Therefore, my cust_details mapper will generate the following intermediate key-value pairs:
key-value pair: [customer ID, cust customer-name]
For example: [4000001, cust kristina], [4000002, cust paige], etc.
transaction_details mapper:
public static class TxnsMapper extends Mapper<Object, Text, Text, Text>
{
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // Each input line is one transaction record; the customer ID is the third field and the amount is the fourth.
        String record = value.toString();
        String[] parts = record.split(",");
        // Emit: customer ID -> "tnxn <amount>"
        context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
    }
}
• Just like the mapper for cust_details, I will follow similar steps here. However, there will be some differences:
o I will get the value of the amount instead of the name of the person.
o In this case, we will use "tnxn" as the label.
• Therefore, the customer ID will again be the key of the key-value pair that the mapper finally generates.
• Finally, the output of the transaction_details mapper will be in the following format:
key-value pair: [Customer ID, tnxn amount]
Examples: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.
2. Sorting and shuffling stage
The sorting and shuffling phase will generate an array list of values corresponding to each key. In other words, it puts together all the values corresponding to each unique key in the intermediate key-value pair. The output of the sorting and shuffling phase will be in the following format:
key – list of values:
• {customer ID1 – [(cust name1), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
• {customer ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
• ……
example:
• {4000001 – [(cust kristina), (tnxn 40.33), (tnxn 47.05),…]};
• {4000002 – [(cust paige), (tnxn 198.44), (tnxn 5.58),…]};
• ……
Now, the framework will call the reduce() method (reduce(Text key, Iterable<Text> values, Context context)) for each unique join key (cust ID) and the corresponding list of values. The reducer will then perform the join operation on the values present in that list to calculate the desired output. Therefore, the reduce() method is invoked once for each unique customer ID.
Now let us understand how the reducer performs the join operation in this MapReduce example.
3. Reducer stage
If you remember, the main goal of performing this reduce side join operation was to find out how many times a particular customer visited the sports complex and the total amount that customer spent on different sports. Therefore, my final output should be in the following format:
Key-value pair: [customer name] (key) – [number of visits, total amount spent] (value)
reducer code:
public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        String name = "";
        double total = 0.0;
        int count = 0;
        for (Text t : values)
        {
            // Each value is either "cust <name>" or "tnxn <amount>"; the tag tells us which data set it came from.
            String parts[] = t.toString().split(" ");
            if (parts[0].equals("tnxn"))
            {
                // One more visit, and add this transaction's amount to the running total.
                count++;
                total += Float.parseFloat(parts[1]);
            }
            else if (parts[0].equals("cust"))
            {
                // Remember the customer's name; it becomes the output key.
                name = parts[1];
            }
        }
        // Output: customer name -> "<number of visits> <total amount>"
        String str = String.format("%d %f", count, total);
        context.write(new Text(name), new Text(str));
    }
}
Therefore, the following steps will be taken in each reducer to achieve the desired output:
• In each reducer, I will have a list of keys and values, where the key is just the customer ID. The list of values will have inputs from two data sets, namely the amount from transaction_details and the name from cust_details.
• Now, I will iterate over the values that exist in the list of values in the reducer.
• Then, I will split each value and check, based on its tag, whether it came from transaction_details or cust_details.
• If it is of the transaction_details type, I will perform the following steps:
o I will increment the counter value by one to calculate how often this person visits.
o I will add the amount to a running total to calculate the total amount spent by this person.
• On the other hand, if the value is of type cust_details, I will store it in a string variable. Later, I will specify the name as the key in my output key-value pair.
• Finally, I will write output key-value pairs in the output folder of my HDFS.
Therefore, the final output that my reducer will generate is as follows:
kristina 8 651.05
paige 6 706.97
…
Moreover, the entire process we did above is called Reduce Side Join in MapReduce.
Source code:
The complete source code for the above reduce side join MapReduce example is as follows:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ReduceJoin {
    public static class CustsMapper extends Mapper<Object, Text, Text, Text>
    {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String record = value.toString();
            String[] parts = record.split(",");
            context.write(new Text(parts[0]), new Text("cust " + parts[1]));
        }
    }

    public static class TxnsMapper extends Mapper<Object, Text, Text, Text>
    {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String record = value.toString();
            String[] parts = record.split(",");
            context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
        }
    }

    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text>
    {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            String name = "";
            double total = 0.0;
            int count = 0;
            for (Text t : values)
            {
                String parts[] = t.toString().split(" ");
                if (parts[0].equals("tnxn"))
                {
                    count++;
                    total += Float.parseFloat(parts[1]);
                }
                else if (parts[0].equals("cust"))
                {
                    name = parts[1];
                }
            }
            String str = String.format("%d %f", count, total);
            context.write(new Text(name), new Text(str));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustsMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TxnsMapper.class);
        Path outputPath = new Path(args[2]);
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
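Before running the job, the program needs to be compiled and packaged into a JAR (named reducejoin.jar here to match the command below). A minimal sketch, assuming a working Hadoop client installation is on the PATH:

mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes ReduceJoin.java
jar cf reducejoin.jar -C classes .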
Run this program
Finally, the command to run the above reduce side join MapReduce sample program is as follows:
hadoop jar reducejoin.jar ReduceJoin /sample/input/cust_details /sample/input/transaction_details /sample/output
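Once the job finishes successfully, the joined result can be inspected directly from HDFS (assuming the default reducer output file name part-r-00000):

hdfs dfs -cat /sample/output/part-r-00000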
Conclusion:
In a reduce side join, the sorting and shuffling phase generates heavy network I/O traffic, because all the values for the same key have to be gathered together. Therefore, if you have a large number of different data sets with millions of values per key, you will most likely run into an OutOfMemory exception, that is, your RAM fills up and overflows. In my opinion, the advantages of using a reduce side join are:
- This is easy to implement because we utilize the built-in sorting and shuffling algorithm in the MapReduce framework, which combines the values of the same key and sends them to the same reducer.
- In reduce side join, your input does not need to follow any strict format, so you can also perform join operations on unstructured data.
Generally speaking, people prefer Apache Hive, which is part of the Hadoop ecosystem, to perform join operations. So, if you come from a SQL background, you don't need to worry about writing MapReduce Java code to perform a join; you can use Hive as an alternative.