Author: Xu Ziyan, VP of R&D at PingCode
This article is the fourth, and in all likelihood the last, in the PingCode Flow series. Since the official launch of PingCode Flow in May last year, I have had the sequence and general content of the four articles in mind: from introducing the current state, pain points, and solutions of R&D automation, to showing how to use PingCode Flow to achieve it. This final article shows, from a purely technical point of view, how PingCode Flow works internally: how we ensure that 99% of rules complete within 1 second under the pressure of nearly 4,000 rule executions per day, while supporting complex flow logic such as sequential, parallel, judgment, and loop execution.

At the same time, we hope to share how we analyzed and reasoned about the problem, and how we finally arrived at the current architecture. Therefore, this article does not simply present the final design; it explains why we designed it this way, the trade-offs involved, and the refactoring done along the way.


What is the essence of PingCode Flow

As we mentioned in the previous articles, PingCode Flow is an R&D automation tool. "Automation" here means completing a series of operations according to a predefined process after an event occurs. In essence, then, PingCode Flow is a TAP (Trigger Action Platform) system: a trigger and multiple actions form an ordered execution rule, which is then executed step by step.

image.png

Therefore, the technical architecture of PingCode Flow is designed to ensure that such a process runs smoothly.

Data Structures: How to Define a Rule

Once the core purpose of the product has been identified, the first thing to do is to clarify how the data is defined. According to the above diagram, users may define multiple rules in a team, and each rule contains a trigger and multiple subsequent actions. Based on this simple requirement, the data structure of the rules can be defined as follows.

image.png

Thus, a rule contains a trigger and the actions it contains. The sequence number of the actions determines the order in which they are executed. This design seems to basically meet the current product needs.
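A minimal sketch of this initial, purely sequential schema might look like the following. All field names here are illustrative assumptions, not PingCode Flow's actual schema:

```typescript
// Hypothetical sketch of the initial, purely sequential rule schema.
interface TriggerEntity {
    name: string;       // globally unique trigger name
    properties: object; // trigger configuration
}

interface ActionEntity {
    name: string;       // globally unique action name
    sequence: number;   // execution order within the rule
    properties: object;
}

interface RuleEntity {
    _id: string;
    team_id: string;
    trigger: TriggerEntity;
    actions: ActionEntity[]; // executed in ascending `sequence`
}

// Executing such a rule amounts to sorting by sequence and running in order.
function executionOrder(rule: RuleEntity): string[] {
    return [...rule.actions]
        .sort((a, b) => a.sequence - b.sequence)
        .map(a => a.name);
}
```

This works as long as every rule is a single straight line of actions.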
But the current PingCode Flow supports not only the single-line sequential flow above, but also flows containing conditions, parallelism, judgment, and loops. More than that, these need to combine freely: a judgment inside a parallel block, a loop inside a judgment, a parallel block inside a loop... Through such almost unlimited combinations, a rule can express almost any flow.

image.png

Therefore, the simple data structure above cannot meet these requirements at all. How to design a rule structure that supports all these scenarios was the first problem our PingCode Flow team faced.
If we think in terms of "a rule is a set of actions", it is difficult to design a reasonably general data structure, because the actions within a rule are determined by the user and it is impossible to enumerate all possible structures. But we can look at the problem differently: instead of treating a rule as an ordered list of a trigger and actions, we define it as a linked list. A rule is then

  • the trigger and its next action
  • a collection of actions, each knowing its next action

Suppose we further merge "trigger" and "action" into a single concept, "step". Then a rule is

  • the first step
  • for each step, the next step

In this way, our definitions of rules and steps are unified:

image.png

That is, a rule does not care what steps it contains or their order; it only cares what the first step is. And each step only cares what the next step is.
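This linked-list definition can be sketched as follows. The field names `first_step_id` and `next_step_id` are assumptions for illustration (though `next_step_id` matches the entity code shown later in this article):

```typescript
// Illustrative linked-list schema; field names are assumed.
interface Rule {
    _id: string;
    first_step_id: string; // the only thing a rule knows
}

interface Step {
    _id: string;
    name: string;                     // globally unique step name
    next_step_id: string | undefined; // undefined ends the flow
}

// Executing a purely sequential rule is following the chain of next_step_id.
function walk(rule: Rule, steps: Map<string, Step>): string[] {
    const visited: string[] = [];
    let id: string | undefined = rule.first_step_id;
    while (id !== undefined) {
        const step = steps.get(id);
        if (!step) break;
        visited.push(step.name);
        id = step.next_step_id;
    }
    return visited;
}
```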
For complex flows such as parallel, loop, and judgment, we only need to extend the data structure of the corresponding step to achieve arbitrary combinations. For example, for "parallel", the data structure looks like this.

    image.png

The first step ID of each branch inside "parallel" is stored in an array, indicating the first step each branch executes. "Parallel" itself does not care what the flow inside each branch looks like; it only cares what the next step is once all branches have finished.
Based on this structure, for the complex rule above

image.png

our data looks roughly like this. Step 1 simply sets its next step to Step 2.
image.png

Step 2 is a parallel step whose branches start at Step 3 and Step 4; its next step, executed once all branches have finished, is Step 10. So its data looks like this.
image.png

Step 4 is a loop step. The first step of the loop body is Step 6, and when the loop completes, the current branch ends. So its data looks like this.
image.png
The next step ID is empty, indicating the end of the current branch.
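The example steps above might be stored as documents like the following. `parallel_next_step_ids` and `next_step_id` match the entity code shown later in this article; `loop_first_step_id` and the `type` field are hypothetical names used only for illustration:

```typescript
// Assumed document shapes for the example rule above.
const step1 = {
    _id: "step1",
    type: "action",
    next_step_id: "step2" // Step 1 only knows its successor
};

const step2 = {
    _id: "step2",
    type: "parallel",
    parallel_next_step_ids: ["step3", "step4"], // first step of each branch
    next_step_id: "step10" // runs once every branch has finished
};

const step4 = {
    _id: "step4",
    type: "loop",
    loop_first_step_id: "step6", // first step of the loop body
    next_step_id: undefined      // empty: this branch ends after the loop
};
```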

Execution logic: how to support various types of steps

Once the data structure is determined, the execution of the steps within a rule follows naturally. As with the structure, when a rule is fired (let's not yet consider how), the engine first finds the ID of the first step. Readers familiar with PingCode Flow know that the system has many preset actions, such as setting the assignee of a work item, creating a page, or changing the state of a test case. So how are these actions executed?
First, every action has a globally unique name. When the rule reaches a step, we look up the step's action name by the step ID, then locate the corresponding execution logic in the code by that name.

image.png
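A minimal sketch of such a name-based lookup is shown below. This is a hypothetical registry for illustration, not the actual implementation behind the `@action` decorator shown next:

```typescript
// Hypothetical name-to-implementation registry. The real system presumably
// populates something similar via the @action decorator.
type ExecuteFn = (context: object) => Promise<string | undefined>;

const actionRegistry = new Map<string, ExecuteFn>();

function registerAction(connector: string, name: string, fn: ExecuteFn): void {
    // The full "connector:action" name must be globally unique.
    actionRegistry.set(`${connector}:${name}`, fn);
}

function resolveAction(connector: string, name: string): ExecuteFn {
    const fn = actionRegistry.get(`${connector}:${name}`);
    if (!fn) throw new Error(`Unknown action: ${connector}:${name}`);
    return fn;
}
```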

For example, for the action "Set Work Item Assignee", the connector name is project and the action name is set_assignee . The code looks roughly like this.

 @action({
    name: "set_assignee",
    displayName: "设置工作项负责人",
    description: "设置当前一个或多个工作项的负责人。",
    isEnabled: Is.yes,
    allowCreation: Is.yes,
    allowDeletion: Is.yes
})
export class AgileActionSetWorkItemsAssignee extends AgileWorkItemsAction<AgileActionSetAssigneeDynamicPropertiesSchema, AgileActionSetAssigneeDirectives, AgileActionSetAssigneeRuleStepEntity> {
    constructor() {
        super(AgileActionSetAssigneeRuleStepEntity, undefined, /* ... */);
    }

    protected onGetDirectivesMetadata(): DirectivesMetadata<Omit<AgileActionSetAssigneeDirectives, keyof AgileWorkItemsActionDirectives>> {
        /* ... */
    };

    protected onGetDynamicPropertiesMetadata(): PropertiesMetadata<Omit<AgileActionSetAssigneeDynamicPropertiesSchema, keyof AgileWorkItemsDynamicPropertiesSchema>> {
        /* ... */
    };

    protected async onExecute(context: ExecuteContextWrapper<AgileActionSetAssigneeDynamicPropertiesSchema, AgileActionSetAssigneeDirectives, AgileActionSetAssigneeRuleStepEntity>): Promise<RuleStepResult<AgileActionSetAssigneeDynamicPropertiesSchema>> {
        /* ... */
    }
}

The key piece is onExecute , which is called when this step is executed. When the operation completes, it returns the next step ID saved in the database, and the rule execution engine then invokes the subsequent step. This is the simplest kind of action step: it is called by the system, performs a specific operation, and returns the ID of the next step.
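The engine-side contract described above, where each step returns the ID of the next step and undefined ends the flow, can be sketched as a driving loop. All names here are illustrative, not the actual engine code:

```typescript
// Minimal sketch of the rule engine's driving loop.
interface StepResult {
    nextStepId: string | undefined; // undefined stops the flow
}

type StepExecutor = () => Promise<StepResult>;

async function runRule(
    firstStepId: string,
    executors: Map<string, StepExecutor>
): Promise<string[]> {
    const executed: string[] = [];
    let id: string | undefined = firstStepId;
    while (id !== undefined) {
        const exec = executors.get(id);
        if (!exec) throw new Error(`Missing executor for step ${id}`);
        executed.push(id);
        // Each step decides where the flow goes next.
        const result = await exec();
        id = result.nextStepId;
    }
    return executed;
}
```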
In addition to ordinary actions, PingCode Flow also supports complex flow control such as condition, parallel, judgment, and loop. Like the actions just mentioned, these are implemented by overriding onExecute . Taking "condition" as an example: it should continue to subsequent steps when the predicate is true, and stop the current branch when it is false. Its onExecute looks like this.

 export abstract class Condition<D extends Directives, T extends RuleStepsConditionEntity<D>> extends Element<EmptyPropertiesSchema, D, T> {

    constructor(ruleStepCtor: new (...args: any) => T, contracts: ElementContract[]) {
        /* ... */
    }

    protected abstract predicate(context: ExecuteContextWrapper<EmptyPropertiesSchema, D, T>): Promise<boolean>;

    protected async onExecute(context: ExecuteContextWrapper<EmptyPropertiesSchema, D, T>): Promise<RuleStepResult<EmptyPropertiesSchema>> {
        if (await this.predicate(context)) {
            return {
                properties: undefined,
                nextStepId: context.getRuleStepEntity().next_step_id
            };
        }
        else {
            return {
                properties: undefined,
                nextStepId: undefined
            };
        }
    }

    public getDynamicPropertiesMetadata(): PropertiesMetadata<EmptyPropertiesSchema> {
        return {};
    }

}

We define an abstract method predicate for derived classes to implement the specific judgment logic. The onExecute method calls this predicate . If the result is true , it returns the next step ID defined in the database and the rule continues to execute; if false , it returns undefined , indicating there are no subsequent steps, and execution ends here.
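Stripped of the real generics and entity types, the pattern reduces to a self-contained sketch like this (class names are invented for illustration):

```typescript
// Simplified illustration of the condition pattern: the base class owns
// the flow control, derived classes only supply predicate().
abstract class MiniCondition {
    constructor(private nextStepId: string | undefined) {}

    protected abstract predicate(): Promise<boolean>;

    async onExecute(): Promise<string | undefined> {
        // true → continue to the stored next step; false → end this branch
        return (await this.predicate()) ? this.nextStepId : undefined;
    }
}

class AlwaysTrue extends MiniCondition {
    protected async predicate(): Promise<boolean> { return true; }
}

class AlwaysFalse extends MiniCondition {
    protected async predicate(): Promise<boolean> { return false; }
}
```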
Step types such as "judgment", "parallel", and "loop" may contain very complex flows, but the same data structure and execution process decouple them as well, so that each step only needs to focus on its own work.
Taking "parallel" as an example, we know that its data structure contains

  • The first step ID of each branch
  • The next step ID after all branches end

Therefore, the execution logic of the "parallel" step is to start the first step of each branch simultaneously, wait until all branches have finished, and then return the ID of the next step.

     @control({
      name: "parallel",
      displayName: "并行(Parallel)",
      description: "并行执行步骤。",
      isEnabled: Is.yes,
      allowCreation: Is.yes,
      allowDeletion: Is.yes
    })
    export class ControlParallel extends ControlAction<EmptyPropertiesSchema, RuleStepsControlParallelEntity> {
      constructor() {
          /* ... */
      }
    
      public getDynamicPropertiesMetadata(): PropertiesMetadata<EmptyPropertiesSchema> {
          /* ... */
      }
    
      protected async onExecute(context: ExecuteContextWrapper<EmptyPropertiesSchema, EmptyDirectives, RuleStepsControlParallelEntity>): Promise<RuleStepResult<EmptyPropertiesSchema>> {
          const entity = context.getRuleStepEntity();
          const contexts = await Promise.all(_.map(entity.parallel_next_step_ids, id => new Promise<ExecuteContext>((resolve, reject) => {
              const ctx = context.getRawContext(true);
              Executor.create().execute(entity._id, id, ctx)
                  .then(() => {
                      return resolve(ctx);
                  })
                  .catch(error => {
                      return reject(error);
                  });
          })));
          context
              .mergeProperties(contexts)
              .mergeTargets(contexts, false);
          return {
              properties: undefined,
              nextStepId: entity.next_step_id
          };
      }
    
    }

Note that in the onExecute method, we map the array of branch step IDs defined in the database ( parallel_next_step_ids ) to asynchronous operations ( Executor.create().execute ), each running in its own context. We then wait for all branches to finish with await Promise.all , and return the ID of the next step. In this way, "parallel" itself does not care about the execution logic within each branch, and when the rule executes inside a branch, the steps are completely unaware that they are inside a "parallel" branch.

Module splitting: how rules are scheduled

We have now covered how the data of rules and steps is stored, and how the steps in a rule are executed. But how is a rule triggered? At present, PingCode Flow supports three kinds of rules: automatic, instant (triggered manually), and scheduled (triggered on a timer). Automatic rules can be further divided into the following three startup scenarios:

  • Started by other PingCode sub-products
  • Started by a third-party product call
  • Started by a custom webhook call
    image.png

As can be seen from the figure above, a rule does not need to care how it was triggered. It only needs to know that at some moment, a certain rule needs to be executed. Therefore, we split the rule execution part into a separate module, the "Flow Engine", whose responsibility is very simple: "start a given rule".
The modules responsible for receiving rule start requests share one general responsibility: to notify the Flow Engine to start rules according to their own needs. The five trigger modules in the figure above have the following responsibilities:

    image.png

Through this split, the execution of rules and the triggering of rules are completely isolated. In the early stages of PingCode Flow development, we only supported rules triggered by other PingCode sub-products. As the product's functionality grew, we successively added integrations with third-party products (GitHub, GitLab, Jenkins, etc.), instant rules (manual triggering), and scheduled rules (time-based triggering). These new triggering methods did not affect the existing modules at all, which protected product quality to the greatest extent.

Deployment method: let all nodes support horizontal scaling

The core demands on an enterprise-grade SaaS product are data security and service stability. Stability requires, on the one hand, that our output (that is, the code) be of high quality, and on the other hand, that every link in the service support horizontal scaling, so that growth in request and execution volume does not cause performance or stability problems. The single-responsibility module division gives us better options when designing the PingCode Flow deployment, and makes the stability requirements easier to meet.
Specifically, the five receiving modules and the rule execution module (Flow Engine) introduced earlier are stateless in their own business logic, so each of them can scale horizontally and independently.

image.png

The arrows in the figure above indicate calls initiated by each trigger module, starting rules in the Flow Engine on demand. Our original design was to use the RPC facility of our base framework: when an event occurs, for example a user changes the status of a work item, the trigger module "PingCode Sub-Product" would synchronously call the Flow Engine interface via RPC (an HTTP or TCP request) to activate the corresponding rules.
But PingCode Flow differs from other PingCode sub-products. Its execution frequency and duration depend on customer-defined rules and are driven by operations and events inside PingCode and in various external systems. Once running, the request and execution volumes can be very large. The backend Flow Engine therefore needs to be elastic enough to execute each rule smoothly while buffering bursts of operations over short periods. For this reason, direct RPC was ultimately rejected in the architecture review meeting.
Since the architectural goal is to protect the backend Engine module during peak periods, we decided to place a message queue between the calling layer and the actual execution layer.
image.png



Through the message queue, all rule execution requests are queued and then read and processed by multiple Flow Engine instances listening on the queue. The advantages are threefold. First, if the execution volume spikes over a short period, requests are buffered in the queue and do not overwhelm the Flow Engine. Second, callers and executors interact purely through data, so the two are fully decoupled. Finally, when load fluctuates, we can attach new Flow Engine instances to the queue to scale out, without any extra configuration of IPs, ports, request forwarding, or load balancing.
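The queue contract can be illustrated with a toy in-memory queue. A real deployment would use a message broker; this sketch (all names invented) only shows how producers and consumers are decoupled by interacting purely through messages:

```typescript
// Toy sketch of queue-based decoupling: trigger modules enqueue
// "start rule" messages; any number of engine workers consume them.
interface StartRuleMessage {
    ruleId: string;
    payload: object; // event data that fired the rule
}

class InMemoryQueue {
    private messages: StartRuleMessage[] = [];

    enqueue(msg: StartRuleMessage): void {
        this.messages.push(msg); // producers only know the queue
    }

    dequeue(): StartRuleMessage | undefined {
        return this.messages.shift(); // consumers only know the queue
    }

    get size(): number {
        return this.messages.length;
    }
}

// A worker drains whatever is buffered; adding workers needs no
// knowledge of who produced the messages.
function drain(queue: InMemoryQueue, handle: (m: StartRuleMessage) => void): number {
    let count = 0;
    for (let m = queue.dequeue(); m !== undefined; m = queue.dequeue()) {
        handle(m);
        count++;
    }
    return count;
}
```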
Finally, the overall architecture of our PingCode Flow is shown below.

image.png

Written at the end: Reasonable architecture is constantly evolving

When communicating with our PingCode customers, a question we are often asked is: how does an R&D team arrive at a good architectural design, one that meets future expansion needs while avoiding over-engineering? My personal view is that there is no such thing as a good design, only a reasonable design. A reasonable design does not come from an architect's imagination; it is gradually discovered from existing business requirements and foreseeable scenarios.
Especially in agile development, each iteration delivers user stories that reflect customer value. Architecture design is therefore not done overnight; it requires continuous thinking, design, practice, feedback, and revision in every iteration, finally arriving at the most reasonable answer for the present.


PingCode研发中心
111 声望22 粉丝

引用和评论

0 条评论