
ABAP is an enterprise application programming language. Its 740 release, shipped in 2013, introduced many new syntax constructs and keywords:

One of the highlights is the newly introduced REDUCE keyword. Its function resembles the Reduce operation in the Map-Reduce programming model widely used for parallel computing over large data sets, and the name can be read literally as "reduction".

What is the Map-Reduce idea?

Map-Reduce is a programming model and associated implementation for generating and processing large-scale datasets using parallel distributed algorithms on clusters.

A Map-Reduce program consists of a Map procedure and a Reduce method. The Map procedure performs filtering and sorting, such as sorting students into queues by first name, with one queue per name.

The Reduce method is responsible for performing aggregation operations, such as counting the number of students.
The Map-Reduce system orchestrates distributed servers to run the various tasks in parallel, manages all communication and data transfer between the parts of the system, and provides data redundancy for fault tolerance.

The figure below shows the working steps of the Map-Reduce framework counting word occurrences in a massive input data set (say, larger than 1 TB): Splitting, Mapping, Shuffling, and Reducing produce the final output.
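On a single machine, the same word-count idea can be sketched with ABAP 740 grouping syntax (illustrative only; the sample words are made up, and the real framework distributes these steps across a cluster):

```abap
" Sketch: word count via grouping - a single-machine analogue of the
" Splitting/Mapping/Shuffling/Reducing steps described above.
DATA(lt_words) = VALUE string_table(
  ( `deer` ) ( `bear` ) ( `river` ) ( `deer` ) ( `bear` ) ( `deer` ) ).

LOOP AT lt_words INTO DATA(lv_word)
     GROUP BY ( word = lv_word size = GROUP SIZE )
     REFERENCE INTO DATA(group_ref).
  " One pass per distinct word; size holds its number of occurrences.
  WRITE: / |{ group_ref->word }: { group_ref->size }|.
ENDLOOP.
```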

The Map-Reduce programming model has been widely used in tools and frameworks in the field of big data processing, such as Hadoop.

A Practical Application of Map-Reduce in CRM System

Let's look at an actual task from the author's work: on a CRM test system, count, for each distinct combination of the columns OBTYP (Object Type) and STSMA (Status Schema), the number of rows in the database table CRM_JSTO. Rows sharing the same OBTYP/STSMA combination correspond to the repeated words in the figure above.

The following figure shows some rows of the database table CRM_JSTO in the system:

The following figure shows the author's final statistical result:

The database table on the test system contains more than 550,000 rows. In 90,279 of them, only OBTYP is maintained, with the value TGP, while STSMA is empty.

In second place is the combination of COH and CRMLEAD, which appears 78,722 times.

How was the above result obtained?

Anyone with even a little ABAP development experience will immediately write code like the following:
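A minimal sketch of such a query (the result column names and the sort order are assumptions, not the author's exact code):

```abap
" Sketch: aggregate directly at the database layer (code pushdown).
SELECT obtyp, stsma, COUNT(*) AS cnt
  FROM crm_jsto
  GROUP BY obtyp, stsma
  ORDER BY cnt DESCENDING
  INTO TABLE @DATA(lt_result).
```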

Use SELECT COUNT to perform the statistics directly at the database layer. This is also the practice SAP recommends, the so-called code pushdown guideline: operations that the HANA database can perform should be pushed down to it as much as possible, to take full advantage of HANA's computing power. As long as the database can execute the calculation logic, avoid placing it in the Netweaver ABAP application layer.

However, we also need to be aware of the limitations of this approach. SAP's CTO once famously said:

There is no future with ABAP alone
There is no future in SAP without ABAP

In the future, ABAP will move toward openness and interconnection. Back to the requirement itself: suppose the input data to be analyzed does not come from an ABAP database table, but from an HTTP request or from an IDoc sent by a third-party system. Then the SELECT COUNT of Open SQL is no longer available, and the problem can only be solved in the ABAP application layer.

Two solutions for accomplishing this requirement using the ABAP programming language are described below.

The first, more traditional way is implemented in the method get_result_traditional_way:

ABAP's LOOP AT ... GROUP BY keyword combination seems tailor-made for this requirement: specify obtyp and stsma as the GROUP BY columns, and LOOP AT automatically groups the rows of the internal table by those two columns. The number of rows in each group is computed automatically via the GROUP SIZE keyword. Each group's obtyp and stsma values, together with its row count, are stored in the variable group_ref specified by REFERENCE INTO. All the ABAP developer has to do is copy these results into the output internal table.
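A hedged sketch of what this method might look like (the parameter names it_raw_input and rt_result, and the exact body, are assumptions; ty_status_result with fields obtyp, stsma, and count follows the type used in the REDUCE listing of this article):

```abap
" Sketch of the traditional implementation using LOOP AT ... GROUP BY.
METHOD get_result_traditional_way.
  DATA ls_result TYPE ty_status_result.
  LOOP AT it_raw_input INTO DATA(ls_status)
       GROUP BY ( obtyp = ls_status-obtyp
                  stsma = ls_status-stsma
                  size  = GROUP SIZE )   " row count per group, computed by the kernel
       ASCENDING
       REFERENCE INTO DATA(group_ref).
    " Copy the group key and its size into the output table.
    ls_result-obtyp = group_ref->obtyp.
    ls_result-stsma = group_ref->stsma.
    ls_result-count = group_ref->size.
    APPEND ls_result TO rt_result.
  ENDLOOP.
ENDMETHOD.
```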

The second method, as described in the title of this article, uses the newly introduced REDUCE keyword in ABAP 740:

REPORT zreduce1.

DATA: lt_status TYPE TABLE OF crm_jsto.

SELECT * INTO TABLE lt_status FROM crm_jsto.

DATA(lo_tool) = NEW zcl_status_calc_tool( ).

lo_tool = REDUCE #( INIT o = lo_tool
                         local_item = VALUE zcl_status_calc_tool=>ty_status_result( )
                    FOR GROUPS <group_key> OF <wa> IN lt_status
                    GROUP BY ( obtyp = <wa>-obtyp stsma = <wa>-stsma )
                    ASCENDING
                    NEXT local_item = VALUE #( obtyp = <group_key>-obtyp
                                               stsma = <group_key>-stsma
                                               count = REDUCE i( INIT sum = 0
                                                                 FOR m IN GROUP <group_key>
                                                                 NEXT sum = sum + 1 ) )
                         o = o->add_result( local_item ) ).

DATA(ls_result) = lo_tool->get_result( ).

The above code may look obscure at first glance, but on closer reading it turns out to use the same grouping strategy as the LOOP AT ... GROUP BY of method 1: the rows are grouped by obtyp and stsma, and each subgroup is identified by the variable <group_key>. The inner REDUCE expression then counts the entries of each group by plain accumulation. Reducing a large input set into smaller subsets according to the conditions specified by GROUP BY, and then computing over each subset separately: this is exactly the processing idea that the REDUCE keyword conveys to the ABAP developer.

To summarize and compare the three implementations: when the data source to be analyzed is an ABAP database table, the Open SQL approach must be preferred, so that the calculation logic runs at the database layer and yields the best performance.

When the data source is not an ABAP database table and the grouped statistic is a simple counting operation (COUNT), prefer LOOP AT ... GROUP BY ... GROUP SIZE, so that the counting is performed in the ABAP kernel via GROUP SIZE, which gives better performance.

When the data source is not an ABAP database table and the grouped statistic involves custom logic, use the third solution introduced in this article, REDUCE, and write the custom statistical logic after the NEXT keyword.
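For example, if the per-group statistic were a sum over a numeric field rather than a plain row count, only the expression after NEXT would change (the field amount is hypothetical, for illustration only):

```abap
" Hypothetical variant: accumulate a numeric field per group
" instead of counting rows; only the expression after NEXT differs.
DATA(total) = REDUCE i( INIT sum = 0
                        FOR m IN GROUP <group_key>
                        NEXT sum = sum + m-amount ).
```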

Performance evaluation of three solutions

I wrote a simple report for performance evaluation:

DATA: lt_status TYPE zcl_status_calc_tool=>tt_raw_input.

SELECT * INTO TABLE lt_status FROM crm_jsto.

DATA(lo_tool) = NEW zcl_status_calc_tool( ).

zcl_abap_benchmark_tool=>start_timer( ).
DATA(lt_result1) = lo_tool->get_result_traditional_way( lt_status ).
zcl_abap_benchmark_tool=>stop_timer( ).

zcl_abap_benchmark_tool=>start_timer( ).
lo_tool = REDUCE #( INIT o = lo_tool
                         local_item = VALUE zcl_status_calc_tool=>ty_status_result( )
                    FOR GROUPS <group_key> OF <wa> IN lt_status
                    GROUP BY ( obtyp = <wa>-obtyp stsma = <wa>-stsma )
                    ASCENDING
                    NEXT local_item = VALUE #( obtyp = <group_key>-obtyp
                                               stsma = <group_key>-stsma
                                               count = REDUCE i( INIT sum = 0
                                                                 FOR m IN GROUP <group_key>
                                                                 NEXT sum = sum + 1 ) )
                         o = o->add_result( local_item ) ).

DATA(lt_result2) = lo_tool->get_result( ).
zcl_abap_benchmark_tool=>stop_timer( ).

ASSERT lt_result1 = lt_result2.

The test data is as follows:

The performance of the three solutions decreases in that order, while their range of applicability and their flexibility increase in the same order.

On the author's ABAP test server, the LOOP AT ... GROUP BY ... GROUP SIZE solution processes the 550,000 records in 0.3 seconds, while REDUCE takes 0.8 seconds; the two are within the same order of magnitude.

Summary

Map-Reduce is a programming model and associated implementation for generating and processing large-scale datasets with parallel, distributed algorithms on clusters. The ABAP programming language supports the Reduce operation on large data sets at the language level. This article shared a practical case from the author's work of applying the Map-Reduce idea to a large data set, and compared it with two more traditional solutions. With performance not inferior to the traditional solutions, the Map-Reduce-based solution is more widely applicable and more extensible. I hope this article inspires your own use of ABAP on similar problems; thank you for reading.

