Welcome to the DataWorks function practice series, which analyzes the pain points you meet when implementing business requirements and helps you use DataWorks features more efficiently. Previous installments covered the key points of running tasks on DataWorks. The most recent one, on parameter passing, introduced the assignment node, a special node that passes parameters from an upstream node to downstream nodes. Combined with other nodes, the assignment node lets a task read and process data in a loop or by traversal. This installment shows how to implement loop and traversal tasks on DataWorks.
Feature recommendation: loop node and traversal node
When developing data tasks, you may encounter scenarios in which a task has to run in a loop or traverse a data set. DataWorks provides two types of special nodes for these scenarios.

| Comparison item | Loop node (do-while node) | Traversal node (for-each node) |
|---|---|---|
| Application scenario | Reads the object set one item at a time and checks after each read whether the loop condition is met. If it is met, the loop continues; if not, the loop exits. The number of loops depends on the result of each check and is not fixed. | Reads (traverses) the object set one item at a time. The number of loops is known in advance. |
| Node usage | You can rearrange the business process inside the do-while node, put the logic that must run in a loop inside the node, and edit the end node to control whether to exit the loop. You can also combine it with an assignment node to loop through the result set passed by the assignment node. | You can use the for-each node to loop through the result set passed by the assignment node. You can also rearrange the internal business process of the for-each node. |

Loop nodes (do-while nodes) and traversal nodes (for-each nodes) are usually used together with assignment nodes: the assignment node passes the output of the upstream node to the downstream node, which then loops over or traverses that output.
Loop nodes and traversal nodes also differ from ordinary nodes in that they contain internal nodes. Take the do-while node as an example: when you create one, three internal nodes are usually created automatically, and you can rearrange the internal business process and edit the content of the internal nodes.
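
To make the difference between the two node types concrete, here is a minimal sketch in plain Python. It is only a local model of the control flow, not DataWorks code: `run_do_while`, `run_for_each`, and the sample data are hypothetical names chosen for illustration.

```python
def run_do_while(body, should_continue):
    """Run body once, then repeat for as long as should_continue says so."""
    loop_times = 0
    while True:
        loop_times += 1
        body(loop_times)
        # The exit check happens after the body, so the total count is
        # only known at run time.
        if not should_continue(loop_times):
            return loop_times


def run_for_each(body, data_set):
    """Run body once per element; the count equals len(data_set)."""
    for offset, current in enumerate(data_set):
        body(offset, current)
    return len(data_set)


seen = []
# do-while: the condition decides after each loop whether to go on.
do_while_count = run_do_while(lambda n: seen.append(("do-while", n)),
                              lambda n: n < 3)
# for-each: two elements, so exactly two traversals.
for_each_count = run_for_each(lambda i, v: seen.append(("for-each", v)),
                              ["a", "b"])
print(do_while_count, for_each_count)
```

In this model the do-while count comes from the condition, while the for-each count comes from the size of the data set.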
# Part1: Loop node (do-while node)
## 1.1 Node composition
The do-while node of DataWorks is a special node that contains internal nodes. When you create a do-while node, three internal nodes are created automatically: the start node (loop start node), the sql node (loop task node), and the end node (loop end judgment node). These internal nodes are organized into an internal process that runs the task in a loop.
As shown in the figure:
* start node: the start node of the internal process. It does not carry any task code.
* sql node: DataWorks creates a SQL-type internal task node for you by default. You can also delete the default sql node and customize the node that runs the internal loop task.
  * If your loop task is a SQL task, double-click the default sql node to open its code development page and write the loop task code.
  * If your loop task is more complicated, create other task nodes in the internal process and rebuild the process to fit your needs. The internal process of a loop task is usually built by combining assignment nodes, branch nodes, and merge nodes. For a typical scenario, see "Typical application: used in conjunction with assignment node".
  * Note: when you customize the loop task node, you can delete the dependencies between internal nodes and reorganize the internal business process, but the start node and end node must remain the first and last nodes of the do-while node's internal process.
* end node
  * The end node is the loop judgment node of the do-while node and controls how many times the node loops. It is essentially an assignment node that outputs one of two strings, true or false, meaning "continue to the next loop" and "stop looping" respectively.
  * The end node supports three languages for the loop judgment code: ODPS SQL, Shell, and Python (Python 2). The do-while node also provides built-in variables to simplify end node development. For the built-in variables, see "Built-in variables" and "Variable value case". For sample code in each language, see "Case 1: end node code example".
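
The internal process described above can be pictured as three plain functions called in order. The sketch below is a local Python model under that assumption. It is not meant to be pasted into a DataWorks node, and `max_loops` is a hypothetical stopping rule standing in for whatever condition your end node evaluates.

```python
def start_node():
    # The start node carries no task code; it only marks where a loop begins.
    return None


def task_node(loop_times, log):
    # Stand-in for the default sql node, i.e. the work done in each loop.
    log.append("task ran in loop %d" % loop_times)


def end_node(loop_times, max_loops):
    # Like the end node in the article, report the decision as a string:
    # "true" means continue with the next loop, "false" means stop.
    return "true" if loop_times < max_loops else "false"


def run_internal_process(max_loops):
    log = []
    loop_times = 0
    while True:
        loop_times += 1
        start_node()
        task_node(loop_times, log)
        if end_node(loop_times, max_loops) != "true":
            break
    return log


log = run_internal_process(3)
print(log)
```

Because the end node runs after the task node, the task always runs at least once, which is what makes this a do-while rather than a while loop.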
## 1.2 Restrictions and precautions
* Loop support
  * Only DataWorks Standard Edition and above support do-while nodes.
  * A do-while node supports up to 128 loops. If the end node lets the loop run more than 128 times, an error is reported.
* Internal nodes
  * When you customize the loop task node, you can delete the dependencies between internal nodes and reorganize the internal business process, but the start node and end node must remain the first and last nodes of the do-while node's internal process.
  * When you use branch nodes for logical judgment or result traversal inside a do-while node, you must also use a merge node.
  * The end node inside a do-while node does not support comments in its code.
* Testing and running
  * When DataWorks is in standard mode, you cannot test-run a do-while node directly in DataStudio. To verify the results of a do-while node, publish the task that contains it to the Operation Center and run it there. If the do-while node uses a value passed by an assignment node, run the assignment node and the loop node together when testing in the Operation Center.
  * To view the execution log of a do-while node in the Operation Center, right-click the instance and click View Internal Nodes to see the execution logs of the internal nodes.
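
The 128-loop limit is easy to hit by accident when the end node's condition never becomes false. The following local Python sketch models a guard of that kind. The constant comes from this article, but the exact moment at which DataWorks raises its error is an assumption here, and `run_with_limit` and `LoopLimitExceeded` are hypothetical names.

```python
MAX_LOOPS = 128  # limit stated in this article


class LoopLimitExceeded(RuntimeError):
    """Raised by this model when a loop would run a 129th time."""


def run_with_limit(end_node):
    """Count loops until end_node returns something other than "true"."""
    loop_times = 0
    while True:
        loop_times += 1
        if loop_times > MAX_LOOPS:
            raise LoopLimitExceeded("more than %d loops requested" % MAX_LOOPS)
        if end_node(loop_times) != "true":
            return loop_times


# A condition that stops after five loops stays well inside the limit.
print(run_with_limit(lambda n: "true" if n < 5 else "false"))
```

Testing the end condition locally against such a guard before publishing can save a round trip through the Operation Center.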
## 1.3 Typical application: used in conjunction with assignment node
The do-while node is often used in conjunction with the assignment node, as shown in the figure below.
When used in conjunction with the assignment node:
* Use the output of the assignment node as the input of the do-while node, and configure the upstream and downstream dependency between the do-while node and the assignment node. For other configuration considerations, refer to the documentation on using the node in conjunction with the assignment node.
* Inside the do-while node, built-in variables give you the current loop count, the assigned parameter values, and other loop variables. For details, see "Built-in variables".
## 1.4 Built-in variables
The do-while node of DataWorks runs its loop through internal nodes. In each loop, you can use built-in variables to get the current loop count and offset.

| Built-in variable | Meaning | Value |
|---|---|---|
| `${dag.loopTimes}` | Current loop count | 1 in the first loop, 2 in the second, 3 in the third, ..., n in the nth. |
| `${dag.offset}` | Offset | 0 in the first loop, 1 in the second, 2 in the third, ..., n-1 in the nth. |

If you use the do-while node together with an assignment node, you can also obtain the assigned parameter values and loop variables as follows.
Note: in the following examples, input is the name of a custom node input parameter defined on the do-while node. Replace it with your actual parameter name.

| Built-in variable | Meaning |
|---|---|
| `${dag.input}` | The data set passed by the upstream assignment node. |
| `${dag.input[${dag.offset}]}` | The data row for the current loop, read inside the loop node. |
| `${dag.input.length}` | The length of the data set, read inside the loop node. |

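
One way to build intuition for these variables is to treat them as placeholders that are replaced with text before each loop runs. The sketch below models that idea locally in Python. The article does not describe how DataWorks performs the substitution, so `render` is a hypothetical helper and the ordering of replacements is an assumption.

```python
import re


def render(template, loop_times, input_rows):
    """Fill in the built-in variables for one loop of a do-while node."""
    offset = loop_times - 1
    # Resolve the simple counters first, so the nested form
    # ${dag.input[${dag.offset}]} becomes ${dag.input[<number>]}.
    text = template.replace("${dag.loopTimes}", str(loop_times))
    text = text.replace("${dag.offset}", str(offset))
    text = text.replace("${dag.input.length}", str(len(input_rows)))
    # Then look up each indexed element of the input data set.
    text = re.sub(r"\$\{dag\.input\[(\d+)\]\}",
                  lambda m: str(input_rows[int(m.group(1))]),
                  text)
    return text


template = ("loop ${dag.loopTimes} offset ${dag.offset} "
            "row ${dag.input[${dag.offset}]} of ${dag.input.length}")
rendered = render(template, 2, ["a", "b", "c"])
print(rendered)
```

The offset is always one less than the loop count, which is why `${dag.input[${dag.offset}]}` selects the element that belongs to the current loop.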
## 1.5 Variable value case
* Case 1
The upstream assignment node is a Shell node whose last output is 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01. The variables then take the following values:

| Built-in variable | Value in the 1st loop | Value in the 2nd loop |
|---|---|---|
| `${dag.input}` | 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01 | 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01 |
| `${dag.input[${dag.offset}]}` | 2021-03-28 | 2021-03-29 |
| `${dag.input.length}` | 5 | 5 |
| `${dag.loopTimes}` | 1 | 2 |
| `${dag.offset}` | 0 | 1 |

* Case 2
The upstream assignment node is an ODPS SQL node, and its last select statement returns two rows:
+----------------------------------------------+
| uid | region | age_range | zodiac |
+----------------------------------------------+
| 0016359810821 | Hubei | 30 ~ 40 years old | Cancer |
| 0016359814159 | unknown | 30 ~ 40 years old | Cancer |
+----------------------------------------------+
At this time, the values of each variable are as follows:

| Built-in variable | Value in the 1st loop | Value in the 2nd loop |
|---|---|---|
| `${dag.input}` | The full two-row result set shown above | The full two-row result set shown above |
| `${dag.input[${dag.offset}]}` | 0016359810821, Hubei, 30 ~ 40 years old, Cancer | 0016359814159, unknown, 30 ~ 40 years old, Cancer |
| `${dag.input.length}` (the length of the data set is the number of rows in the two-dimensional array; the assignment node outputs 2 rows) | 2 | 2 |
| `${dag.input[0][1]}` (the value in the first row and first column of the two-dimensional array) | 0016359810821 | 0016359810821 |
| `${dag.loopTimes}` | 1 | 2 |
| `${dag.offset}` | 0 | 1 |

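
For a two-dimensional result such as the one in Case 2, the length is the number of rows, not the number of cells. The local Python sketch below models that with a list of lists. The rows are copied from the query result in this article, while `current_row` and the comma-joined format are assumptions made for illustration.

```python
# Rows taken from the Case 2 query result in this article.
dag_input = [
    ["0016359810821", "Hubei", "30 ~ 40 years old", "Cancer"],
    ["0016359814159", "unknown", "30 ~ 40 years old", "Cancer"],
]


def current_row(loop_times):
    """Model of ${dag.input[${dag.offset}]}: the row for this loop."""
    offset = loop_times - 1
    # Joining the cells with commas is an assumption about presentation.
    return ",".join(dag_input[offset])


# Model of ${dag.input.length}: counts rows, so 2 here rather than 8.
input_length = len(dag_input)

print(input_length)
print(current_row(1))
print(current_row(2))
```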
# Part2: Traversal node (for-each node)
## 2.1 Node composition
The for-each node of DataWorks is a special node that contains internal nodes. When you create a for-each node, three internal nodes are created automatically: the start node (loop start node), the sql node (loop task node), and the end node (loop end node). These internal nodes are organized into an internal process that traverses the output of the upstream assignment node in a loop.
As shown in the figure:
* sql node: DataWorks creates a SQL-type internal task node for you by default. You can also delete the default sql node and customize the node that runs the internal traversal task.
  * If your traversal task is a SQL task, double-click the default sql node to open its code development page and write the task code.
  * If your traversal task is more complicated, create other task nodes in the internal process and rebuild the process to fit your needs.
  * Note: when you customize the loop task node, you can delete the dependencies between internal nodes and reorganize the internal business process, but the start node and end node must remain the first and last nodes of the for-each node's internal process.
* start node and end node: these mark the start and end of each traversal in the internal process and do not carry any task code.
  * Note: the end node of a for-each node does not control the number of traversals. That number is determined by the actual output of the upstream assignment node.
## 2.2 Usage restrictions and precautions
* Upstream and downstream dependencies
  * The for-each node traverses the value passed by the assignment node, so the assignment node must be the upstream node of the for-each node, and the for-each node must depend on the assignment node.
* Loop support
  * Only DataWorks Standard Edition and above support for-each nodes.
  * A for-each node supports up to 128 loops. If the number exceeds 128, an error is reported at run time. The actual number of traversals is determined by the output of the upstream assignment node.
    * For a one-dimensional array output, the number of traversals equals the number of elements. For example, when the assignment language is Shell or Python (Python 2) and the output is the one-dimensional array 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01, the for-each node loops 5 times to complete the traversal.
    * For a two-dimensional array output, the number of traversals equals the number of rows. For example, when the assignment language is ODPS SQL, the output is a two-dimensional array:
+----------------------------------------------+
| uid | region | age_range | zodiac |
+----------------------------------------------+
| 0016359810821 | Hubei | 30 ~ 40 years old | Cancer |
| 0016359814159 | unknown | 30 ~ 40 years old | Cancer |
+----------------------------------------------+
Then the for-each node will loop twice to complete the traversal.
* Internal nodes
  * You can delete the dependencies between the internal nodes of the for-each node and reorganize the internal business process, but the start node and end node must remain the first and last nodes of the for-each node's internal process.
  * When you use branch nodes for logical judgment or result traversal inside a for-each node, you must also use a merge node.
* Testing and running
  * When DataWorks is in standard mode, you cannot test-run a for-each node directly in DataStudio. To verify the results of a for-each node, publish the task that contains it to the Operation Center and run it there.
  * To view the execution log of a for-each node in the Operation Center, right-click the instance and click View Internal Nodes to see the execution logs of the internal nodes.
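
The rule for how many times a for-each node loops can be written down in a few lines. The sketch below is a local Python model of the rule as stated in this article: elements for a one-dimensional output, rows for a two-dimensional output, and at most 128. `traversal_count` is a hypothetical helper, and raising `ValueError` merely stands in for the run-time error DataWorks reports.

```python
MAX_TRAVERSALS = 128  # limit stated in this article


def traversal_count(assignment_output):
    """Return how many times a for-each node would loop over the output."""
    # len() counts elements of a 1-D list and rows of a 2-D list.
    count = len(assignment_output)
    if count > MAX_TRAVERSALS:
        raise ValueError("%d traversals exceed the limit of %d"
                         % (count, MAX_TRAVERSALS))
    return count


# One-dimensional output, as produced by a Shell or Python assignment node.
one_dimensional = "2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01".split(",")
# Two-dimensional output, as produced by an ODPS SQL assignment node.
two_dimensional = [
    ["0016359810821", "Hubei", "30 ~ 40 years old", "Cancer"],
    ["0016359814159", "unknown", "30 ~ 40 years old", "Cancer"],
]

print(traversal_count(one_dimensional), traversal_count(two_dimensional))
```

If the upstream query can return more than 128 rows, a check like this suggests limiting or batching the result before it reaches the for-each node.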
## 2.3 Typical Application
The for-each node of DataWorks is mainly used in traversal scenarios and must be used together with an assignment node. The assignment node is the upstream node of the for-each node: once its output is passed to the for-each node, the for-each node loops over that output one item at a time.
## 2.4 Built-in variables
Each time the for-each node of DataWorks loops through the output of the assignment node, built-in variables give you the current loop count and offset.

| Built-in variable | Meaning | Comparison with a for loop |
|---|---|---|
| `${dag.loopDataArray}` | Gets the data set of the assignment node | Equivalent to `data=[]` in the for loop code |
| `${dag.foreach.current}` | Gets the current traversal value | In the for loop `for(int i=0;i<data.length;i++) { print(data[i]); }`, `data[i]` is equivalent to `${dag.foreach.current}` |
| `${dag.offset}` | Current offset (the offset of each traversal relative to the first) | In the same for loop, `i` is equivalent to `${dag.offset}` |
| `${dag.loopTimes}` | Gets the current number of traversals | - |

If you know the structure of the output table, you can also use the following variables to read individual values.

| Other variable | Meaning |
|---|---|
| `${dag.foreach.current[n]}` | When the output of the upstream assignment node is a two-dimensional array, gets the data in a given column of the current row on each traversal. |
| `${dag.loopDataArray[i][j]}` | When the output of the upstream assignment node is a two-dimensional array, gets the data at row i, column j of the data set. |
| `${dag.foreach.current[n]}` | When the output of the upstream assignment node is a one-dimensional array, gets a specific item. |

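
The for loop comparison in the table above translates directly into Python. The sketch below is a local model only: the dictionary keys simply reuse the variable names from the table as labels, and `for_each_variables` is a hypothetical helper.

```python
def for_each_variables(loop_data_array):
    """List what each built-in variable would hold in every traversal."""
    snapshots = []
    for i in range(len(loop_data_array)):
        snapshots.append({
            "dag.loopDataArray": loop_data_array,       # the whole data set
            "dag.foreach.current": loop_data_array[i],  # data[i] in the for loop
            "dag.offset": i,                            # i in the for loop
            "dag.loopTimes": i + 1,                     # 1-based traversal count
        })
    return snapshots


snapshots = for_each_variables(["2021-03-28", "2021-03-29"])
for snapshot in snapshots:
    print(snapshot["dag.loopTimes"], snapshot["dag.offset"],
          snapshot["dag.foreach.current"])
```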
## 2.5 Examples of built-in variable values
* Case 1
The upstream assignment node is a Shell node whose last output is 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01. The variables then take the following values:
Note: the output is a one-dimensional array with 5 elements (elements are separated by commas), so the for-each node traverses 5 times in total.

| Built-in variable | Value in the 1st traversal | Value in the 2nd traversal |
|---|---|---|
| `${dag.loopDataArray}` | 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01 | 2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01 |
| `${dag.foreach.current}` | 2021-03-28 | 2021-03-29 |
| `${dag.offset}` | 0 | 1 |
| `${dag.loopTimes}` | 1 | 2 |
| `${dag.foreach.current[3]}` | 2021-03-30 | 2021-03-30 |

* Case 2
The upstream assignment node is an ODPS SQL node, and its last select statement returns two rows:
+----------------------------------------------+
| uid | region | age_range | zodiac |
+----------------------------------------------+
| 0016359810821 | Hubei | 30 ~ 40 years old | Cancer |
| 0016359814159 | unknown | 30 ~ 40 years old | Cancer |
+----------------------------------------------+
The variables then take the following values:
Note: the output is a two-dimensional array with 2 rows, so the for-each node traverses 2 times in total.

| Built-in variable | Value in the 1st traversal | Value in the 2nd traversal |
|---|---|---|
| `${dag.loopDataArray}` | The full two-row result set shown above | The full two-row result set shown above |
| `${dag.foreach.current}` | 0016359810821, Hubei, 30 ~ 40 years old, Cancer | 0016359814159, unknown, 30 ~ 40 years old, Cancer |
| `${dag.offset}` | 0 | 1 |
| `${dag.loopTimes}` | 1 | 2 |
| `${dag.foreach.current[0]}` | 0016359810821 | 0016359814159 |
| `${dag.loopDataArray[1][0]}` | 0016359814159 | 0016359814159 |

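
The column and cell lookups in Case 2 can be checked with a list of lists. The sketch below is a local Python model that uses the rows from this article. It assumes zero-based indexes for both rows and columns, which matches the values shown for `${dag.foreach.current[0]}` and `${dag.loopDataArray[1][0]}`, and `traverse` is a hypothetical helper.

```python
# Rows taken from the Case 2 query result in this article.
loop_data_array = [
    ["0016359810821", "Hubei", "30 ~ 40 years old", "Cancer"],
    ["0016359814159", "unknown", "30 ~ 40 years old", "Cancer"],
]


def traverse(rows):
    """Collect the per-traversal values for a two-dimensional data set."""
    results = []
    for offset, current in enumerate(rows):
        results.append({
            "offset": offset,
            "loopTimes": offset + 1,
            "first_column": current[0],  # model of ${dag.foreach.current[0]}
        })
    return results


values = traverse(loop_data_array)
# Model of ${dag.loopDataArray[1][0]}: second row, first column,
# the same in every traversal because it does not depend on the offset.
fixed_cell = loop_data_array[1][0]
print(values[0]["first_column"], values[1]["first_column"], fixed_cell)
```

The current-row lookup changes with every traversal, whereas an absolute lookup into the data set stays the same throughout.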