1. Architecture of the Worker Receiving Master RPC Requests

Worker阐述.png

For how the Worker exposes its services over Netty and how the Master calls these interfaces through JDK dynamic proxies, please refer to the DolphinScheduler alert module walkthrough; it will not be repeated here.

In short:
org.apache.dolphinscheduler.extract.worker.ITaskInstanceOperator

@RpcService
public interface ITaskInstanceOperator {

    @RpcMethod
    TaskInstanceDispatchResponse dispatchTask(TaskInstanceDispatchRequest taskInstanceDispatchRequest);

    @RpcMethod
    TaskInstanceKillResponse killTask(TaskInstanceKillRequest taskInstanceKillRequest);

    @RpcMethod
    TaskInstancePauseResponse pauseTask(TaskInstancePauseRequest taskPauseRequest);

    @RpcMethod
    UpdateWorkflowHostResponse updateWorkflowInstanceHost(UpdateWorkflowHostRequest updateWorkflowHostRequest);
}

For interfaces annotated with @RpcService and methods annotated with @RpcMethod, the Worker registers the corresponding Netty handlers and the Master builds JDK dynamic proxies to invoke them.
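
To make the proxy side a little more concrete, here is a minimal, hypothetical sketch of how a Master-side client could turn such an @RpcService interface into remote calls with a JDK dynamic proxy. The RpcTransport interface and sendAndReceive call are stand-ins invented for illustration; they are not the real classes in dolphinscheduler-extract.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Illustration only: a JDK dynamic proxy that forwards every interface call
// to a remote Worker. "RpcTransport" / "sendAndReceive" are hypothetical
// stand-ins for the real Netty client.
public class RpcClientProxySketch {

    interface RpcTransport {
        Object sendAndReceive(String workerHost, Method method, Object[] args);
    }

    @SuppressWarnings("unchecked")
    static <T> T createProxy(Class<T> rpcInterface, String workerHost, RpcTransport transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Each @RpcMethod has a matching Netty handler registered on the Worker side;
            // the proxy just serializes the call and ships it over the wire.
            return transport.sendAndReceive(workerHost, method, args);
        };
        return (T) Proxy.newProxyInstance(
                rpcInterface.getClassLoader(), new Class<?>[]{rpcInterface}, handler);
    }
}

With such a proxy, the Master would call a plain Java interface, e.g. `ITaskInstanceOperator operator = createProxy(ITaskInstanceOperator.class, workerAddress, transport); operator.dispatchTask(request);`, while the transport delivers the request to the Worker's Netty handler.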

2. Dispatching a Task (TaskInstanceDispatchOperationFunction)

TaskInstanceDispatchOperationFunction流程.png

2.1 WorkerConfig

WorkerConfig: it simply reads the configuration entries prefixed with `worker` from the Worker module's application.yaml.
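
As a rough illustration (not the complete real class), WorkerConfig can be pictured as a Spring Boot @ConfigurationProperties bean bound to the `worker` prefix; the concrete fields and defaults below are assumptions picked for the example.

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Simplified sketch: binds every "worker.*" entry of application.yaml onto a bean.
// Field list and defaults are illustrative, not the full real WorkerConfig.
@Configuration
@ConfigurationProperties(prefix = "worker")
public class WorkerConfigSketch {

    private int listenPort = 1234;   // worker.listen-port
    private int execThreads = 100;   // worker.exec-threads, sizes the executor thread pool
    private String workerAddress;    // resolved host:port, written into the TaskExecutionContext

    public int getListenPort() { return listenPort; }
    public void setListenPort(int listenPort) { this.listenPort = listenPort; }
    public int getExecThreads() { return execThreads; }
    public void setExecThreads(int execThreads) { this.execThreads = execThreads; }
    public String getWorkerAddress() { return workerAddress; }
    public void setWorkerAddress(String workerAddress) { this.workerAddress = workerAddress; }
}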

2.2 WorkerTaskExecutorFactoryBuilder

WorkerTaskExecutorFactoryBuilder: the builder for task executor factories. It wraps DefaultWorkerTaskExecutorFactory (the default worker task executor factory), which in turn creates DefaultWorkerTaskExecutor. DefaultWorkerTaskExecutor extends WorkerTaskExecutor, and WorkerTaskExecutor itself is a Thread. Fun, isn't it?
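
The nesting described above (builder -> factory -> executor, where the executor is a Thread) can be pictured with this stripped-down sketch; the names mirror the real classes but the bodies are placeholders and the real constructors take many more collaborators (message sender, plugin manager, and so on).

// Stripped-down sketch of the builder -> factory -> executor chain.
class WorkerTaskExecutorChainSketch {

    static abstract class WorkerTaskExecutor extends Thread {
        // run() drives the lifecycle: initializeTask -> beforeExecute -> executeTask -> afterExecute
    }

    static class DefaultWorkerTaskExecutor extends WorkerTaskExecutor {
        @Override
        public void run() { /* lifecycle omitted */ }
    }

    interface WorkerTaskExecutorFactory<T extends WorkerTaskExecutor> {
        T createWorkerTaskExecutor();
    }

    static class DefaultWorkerTaskExecutorFactory implements WorkerTaskExecutorFactory<DefaultWorkerTaskExecutor> {
        @Override
        public DefaultWorkerTaskExecutor createWorkerTaskExecutor() {
            return new DefaultWorkerTaskExecutor();
        }
    }

    static class WorkerTaskExecutorFactoryBuilder {
        WorkerTaskExecutorFactory<? extends WorkerTaskExecutor> createWorkerTaskExecutorFactory(Object taskExecutionContext) {
            return new DefaultWorkerTaskExecutorFactory();
        }
    }
}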

2.3 WorkerTaskExecutorThreadPool

WorkerTaskExecutorThreadPool: essentially just a wrapper around a fixed-size thread pool.
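
A minimal sketch of what "a wrapper around a fixed thread pool" means here, assuming the pool size comes from the worker's exec-threads setting. The key point is that submitWorkerTaskExecutor returns false when the pool cannot take the task, which is exactly the branch that the operate() method below maps to the "WorkerManagerThread is full" response. The real class also keeps a bounded wait queue and tracks waiting/running executors.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Minimal sketch: a fixed-size pool whose submit result tells the caller
// whether the executor was accepted.
class WorkerTaskExecutorThreadPoolSketch {

    private final ExecutorService threadPool;

    WorkerTaskExecutorThreadPoolSketch(int execThreads) {
        this.threadPool = Executors.newFixedThreadPool(execThreads);
    }

    boolean submitWorkerTaskExecutor(Runnable workerTaskExecutor) {
        try {
            threadPool.execute(workerTaskExecutor);
            return true;
        } catch (RejectedExecutionException e) {
            // pool saturated -> operate() answers "WorkerManagerThread is full"
            return false;
        }
    }
}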

2.4 Starting from the operator

public TaskInstanceDispatchResponse operate(TaskInstanceDispatchRequest taskInstanceDispatchRequest) {
    log.info("Receive TaskInstanceDispatchRequest: {}", taskInstanceDispatchRequest);
    // TODO Task execution context
    TaskExecutionContext taskExecutionContext = taskInstanceDispatchRequest.getTaskExecutionContext();
    try {
        // TODO Set the worker address
        taskExecutionContext.setHost(workerConfig.getWorkerAddress());
        // TODO Set the task log path
        taskExecutionContext.setLogPath(LogUtils.getTaskInstanceLogFullPath(taskExecutionContext));

        // TODO Put the workflow instance id and task instance id into the MDC; it seems they are only put here and never read back
        LogUtils.setWorkflowAndTaskInstanceIDMDC(
                taskExecutionContext.getProcessInstanceId(),
                taskExecutionContext.getTaskInstanceId());

        // check server status, if server is not running, return failed to reject this task
        if (!ServerLifeCycleManager.isRunning()) {
            log.error("server is not running. reject task: {}", taskExecutionContext.getProcessInstanceId());
            return TaskInstanceDispatchResponse.failed(taskExecutionContext.getTaskInstanceId(),
                    "server is not running");
        }

        TaskMetrics.incrTaskTypeExecuteCount(taskExecutionContext.getTaskType());

        // TODO Create a WorkerTaskExecutor via WorkerTaskExecutorFactoryBuilder
        WorkerTaskExecutor workerTaskExecutor = workerTaskExecutorFactoryBuilder
                .createWorkerTaskExecutorFactory(taskExecutionContext)
                .createWorkerTaskExecutor();
        // todo: hold the workerTaskExecutor
        // TODO Submit the task directly
        if (!workerTaskExecutorThreadPool.submitWorkerTaskExecutor(workerTaskExecutor)) {
            log.info("Submit task: {} to wait queue failed", taskExecutionContext.getTaskName());
            return TaskInstanceDispatchResponse.failed(taskExecutionContext.getTaskInstanceId(),
                    "WorkerManagerThread is full");
        } else {
            log.info("Submit task: {} to wait queue success", taskExecutionContext.getTaskName());
            return TaskInstanceDispatchResponse.success(taskExecutionContext.getTaskInstanceId());
        }
    } finally {
        LogUtils.removeWorkflowAndTaskInstanceIdMDC();
    }
}

Parsing LogUtils.getTaskInstanceLogFullPath(taskExecutionContext)
org.apache.dolphinscheduler.plugin.task.api.utils.LogUtils#getTaskInstanceLogFullPath : gets the full path of the task instance log

/**
     * Get task instance log full path.
     *
     * @param taskExecutionContext task execution context.
     * @return task instance log full path.
     */
    public static String getTaskInstanceLogFullPath(TaskExecutionContext taskExecutionContext) {
        return getTaskInstanceLogFullPath(
                DateUtils.timeStampToDate(taskExecutionContext.getFirstSubmitTime()),
                taskExecutionContext.getProcessDefineCode(),
                taskExecutionContext.getProcessDefineVersion(),
                taskExecutionContext.getProcessInstanceId(),
                taskExecutionContext.getTaskInstanceId());
    }

org.apache.dolphinscheduler.plugin.task.api.utils.LogUtils#getTaskInstanceLogFullPath : assembles the full path of the task instance log

/**
     * todo: Remove the submitTime parameter?
     * The task instance log full path, the path is like: {log.base}/{taskSubmitTime}/{workflowDefinitionCode}/{workflowDefinitionVersion}/{workflowInstanceId}/{taskInstanceId}.log
     *
     * @param taskFirstSubmitTime       task first submit time
     * @param workflowDefinitionCode    workflow definition code
     * @param workflowDefinitionVersion workflow definition version
     * @param workflowInstanceId        workflow instance id
     * @param taskInstanceId            task instance id.
     * @return task instance log full path.
     */
    public static String getTaskInstanceLogFullPath(Date taskFirstSubmitTime,
                                                    Long workflowDefinitionCode,
                                                    int workflowDefinitionVersion,
                                                    int workflowInstanceId,
                                                    int taskInstanceId) {
        if (TASK_INSTANCE_LOG_BASE_PATH == null) {
            throw new IllegalArgumentException(
                    "Cannot find the task instance log base path, please check your logback.xml file");
        }
        final String taskLogFileName = Paths.get(
                String.valueOf(workflowDefinitionCode),
                String.valueOf(workflowDefinitionVersion),
                String.valueOf(workflowInstanceId),
                String.format("%s.log", taskInstanceId)).toString();
        return TASK_INSTANCE_LOG_BASE_PATH
                .resolve(DateUtils.format(taskFirstSubmitTime, DateConstants.YYYYMMDD, null))
                .resolve(taskLogFileName)
                .toString();
    }

org.apache.dolphinscheduler.plugin.task.api.utils.LogUtils#getTaskInstanceLogBasePath : reads the configuration in logback-spring.xml to obtain the base path for task instance logs, which in practice is the logs/ directory under the installation root.

/**
     * Get task instance log base absolute path, this is defined in logback.xml
     *
     * @return
     */
    public static Path getTaskInstanceLogBasePath() {
        return Optional.of(LoggerFactory.getILoggerFactory())
                .map(e -> (AppenderAttachable<ILoggingEvent>) (e.getLogger("ROOT")))
                .map(e -> (SiftingAppender) (e.getAppender("TASKLOGFILE")))
                .map(e -> ((TaskLogDiscriminator) (e.getDiscriminator())))
                .map(TaskLogDiscriminator::getLogBase)
                .map(e -> Paths.get(e).toAbsolutePath())
                .orElse(null);
    }

The worker's logback-spring.xml:

<configuration scan="true" scanPeriod="120 seconds">
  <property name="log.base" value="logs"/>
  ...
  <appender name="TASKLOGFILE" class="ch.qos.logback.classic.sift.SiftingAppender">
          <filter class="org.apache.dolphinscheduler.plugin.task.api.log.TaskLogFilter"/>
          <Discriminator class="org.apache.dolphinscheduler.plugin.task.api.log.TaskLogDiscriminator">
              <key>taskInstanceLogFullPath</key>
              <logBase>${log.base}</logBase>
          </Discriminator>
          <sift>
              <appender name="FILE-${taskInstanceLogFullPath}" class="ch.qos.logback.core.FileAppender">
                  <file>${taskInstanceLogFullPath}</file>
                  <encoder>
                      <pattern>
                          [%level] %date{yyyy-MM-dd HH:mm:ss.SSS Z} - %message%n
                      </pattern>
                      <charset>UTF-8</charset>
                  </encoder>
                  <append>true</append>
              </appender>
          </sift>
      </appender>
  ...
  <root level="INFO">
      <appender-ref ref="STDOUT"/>
      <appender-ref ref="TASKLOGFILE"/>
  </root>

</configuration>

The final path looks like:

/opt/dolphinscheduler/worker-server/logs/20240615/13929490938784/1/1815/1202.log

Example reference: https://segmentfault.com/a/1190000044967912

2.5 DefaultWorkerTaskExecutor explained

org.apache.dolphinscheduler.server.worker.runner.operator.TaskInstanceDispatchOperationFunction#operate

...
// TODO Create a WorkerTaskExecutor via WorkerTaskExecutorFactoryBuilder
            WorkerTaskExecutor workerTaskExecutor = workerTaskExecutorFactoryBuilder
                    .createWorkerTaskExecutorFactory(taskExecutionContext)
                    .createWorkerTaskExecutor();
            // todo: hold the workerTaskExecutor
            // TODO Submit the task directly
            if (!workerTaskExecutorThreadPool.submitWorkerTaskExecutor(workerTaskExecutor)) {
                log.info("Submit task: {} to wait queue failed", taskExecutionContext.getTaskName());
                return TaskInstanceDispatchResponse.failed(taskExecutionContext.getTaskInstanceId(),
                        "WorkerManagerThread is full");
            } else {
                log.info("Submit task: {} to wait queue success", taskExecutionContext.getTaskName());
                return TaskInstanceDispatchResponse.success(taskExecutionContext.getTaskInstanceId());
            }
...

The task is submitted directly via workerTaskExecutorThreadPool.submitWorkerTaskExecutor(workerTaskExecutor).

WorkerTaskExecutor is a Thread, and since it is a thread, we should look at its run() method:

public void run() {
        try {
            // TODO Put the workflow instance id and task instance id into the MDC, which works much like a ThreadLocal
            LogUtils.setWorkflowAndTaskInstanceIDMDC(
                    taskExecutionContext.getProcessInstanceId(),
                    taskExecutionContext.getTaskInstanceId());

            // TODO Put the task log path into the MDC
            LogUtils.setTaskInstanceLogFullPathMDC(taskExecutionContext.getLogPath());

            // TODO Print the task header
            TaskInstanceLogHeader.printInitializeTaskContextHeader();

            // TODO Initialize the task: sets the task start time and the taskAppId (workflow instance id + task instance id)
            initializeTask();

            // TODO DRY_RUN means a dry run: the status is simply set to success
            if (DRY_RUN_FLAG_YES == taskExecutionContext.getDryRun()) {
                taskExecutionContext.setCurrentExecutionStatus(TaskExecutionStatus.SUCCESS);
                taskExecutionContext.setEndTime(System.currentTimeMillis());
                WorkerTaskExecutorHolder.remove(taskExecutionContext.getTaskInstanceId());
                // TODO Send the result to the Master via the worker message sender
                workerMessageSender.sendMessageWithRetry(taskExecutionContext,
                        ITaskInstanceExecutionEvent.TaskInstanceExecutionEventType.FINISH);
                log.info(
                        "The current execute mode is dry run, will stop the subsequent process and set the taskInstance status to success");
                return;
            }
            // TODO Print the task plugin header
            TaskInstanceLogHeader.printLoadTaskInstancePluginHeader();

            // TODO Pre-execution work
            beforeExecute();

            // TODO Callback
            TaskCallBack taskCallBack = TaskCallbackImpl.builder()
                    .workerMessageSender(workerMessageSender)
                    .taskExecutionContext(taskExecutionContext)
                    .build();

            TaskInstanceLogHeader.printExecuteTaskHeader();
            // TODO Execute
            executeTask(taskCallBack);

            TaskInstanceLogHeader.printFinalizeTaskHeader();

            // TODO Post-execution work
            afterExecute();

            closeLogAppender();
        } catch (Throwable ex) {
            log.error("Task execute failed, due to meet an exception", ex);
            afterThrowing(ex);
            closeLogAppender();
        } finally {
            LogUtils.removeWorkflowAndTaskInstanceIdMDC();
            LogUtils.removeTaskInstanceLogFullPathMDC();
        }
    }

Key points:

2.5.1 Dry run

If it is a dry run, the task succeeds immediately without being executed.

// TODO DRY_RUN means a dry run: the status is simply set to success
            if (DRY_RUN_FLAG_YES == taskExecutionContext.getDryRun()) {
                taskExecutionContext.setCurrentExecutionStatus(TaskExecutionStatus.SUCCESS);
                taskExecutionContext.setEndTime(System.currentTimeMillis());
                WorkerTaskExecutorHolder.remove(taskExecutionContext.getTaskInstanceId());
                // TODO Send the result to the Master via the worker message sender
                workerMessageSender.sendMessageWithRetry(taskExecutionContext,
                        ITaskInstanceExecutionEvent.TaskInstanceExecutionEventType.FINISH);
                log.info(
                        "The current execute mode is dry run, will stop the subsequent process and set the taskInstance status to success");
                return;
            }

2.5.2 beforeExecute()

Preparation work before execution, such as reporting to the Master that the task is running, creating the tenant (a Linux user), creating the working directory, downloading the required resource files, and initializing the task.

protected void beforeExecute() {
        // TODO First set the status to RUNNING
        taskExecutionContext.setCurrentExecutionStatus(TaskExecutionStatus.RUNNING_EXECUTION);
        // TODO Send a message to the Master telling it this task is running
        workerMessageSender.sendMessageWithRetry(taskExecutionContext,
                ITaskInstanceExecutionEvent.TaskInstanceExecutionEventType.RUNNING);
        log.info("Send task status {} master: {}", TaskExecutionStatus.RUNNING_EXECUTION.name(),
                taskExecutionContext.getWorkflowInstanceHost());

        // In most of case the origin tenant is the same as the current tenant
        // Except `default` tenant. The originTenant is used to download the resources
        // TODO Tenant information
        String originTenant = taskExecutionContext.getTenantCode();
        String tenant = TaskExecutionContextUtils.getOrCreateTenant(workerConfig, taskExecutionContext);
        taskExecutionContext.setTenantCode(tenant);
        log.info("TenantCode: {} check successfully", taskExecutionContext.getTenantCode());

        // TODO Create the working directory
        TaskExecutionContextUtils.createTaskInstanceWorkingDirectory(taskExecutionContext);
        log.info("WorkflowInstanceExecDir: {} check successfully", taskExecutionContext.getExecutePath());

        TaskChannel taskChannel =
                Optional.ofNullable(taskPluginManager.getTaskChannelMap().get(taskExecutionContext.getTaskType()))
                        .orElseThrow(() -> new TaskPluginException(taskExecutionContext.getTaskType()
                                + " task plugin not found, please check the task type is correct."));

        log.info("Create TaskChannel: {} successfully", taskChannel.getClass().getName());

        // TODO Download resources
        ResourceContext resourceContext = TaskExecutionContextUtils.downloadResourcesIfNeeded(originTenant, taskChannel,
                storageOperate, taskExecutionContext);

        taskExecutionContext.setResourceContext(resourceContext);
        log.info("Download resources successfully: \n{}", taskExecutionContext.getResourceContext());

        TaskFilesTransferUtils.downloadUpstreamFiles(taskExecutionContext, storageOperate);
        log.info("Download upstream files: {} successfully",
                TaskFilesTransferUtils.getFileLocalParams(taskExecutionContext, Direct.IN));

        // TODO Create the task
        task = taskChannel.createTask(taskExecutionContext);
        log.info("Task plugin instance: {} create successfully", taskExecutionContext.getTaskType());

        // todo: remove the init method, this should initialize in constructor method
        // TODO Initialize the task
        task.init();
        log.info("Success initialized task plugin instance successfully");

        task.getParameters().setVarPool(taskExecutionContext.getVarPool());
        log.info("Set taskVarPool: {} successfully", taskExecutionContext.getVarPool());
    }

1. Log printing

log.info("Send task status {} master: {}", TaskExecutionStatus.RUNNING_EXECUTION.name(),
                taskExecutionContext.getWorkflowInstanceHost());
What should be printed here is taskExecutionContext.getWorkflowInstanceHost(), not taskExecutionContext.getHost(). When reporting to the Master, printing the Worker's own host is pointless (the Worker obviously knows it); what is useful is which Master node this Worker is reporting the task status to.

2. Creating the tenant
org.apache.dolphinscheduler.server.worker.utils.TaskExecutionContextUtils#getOrCreateTenant

public static String getOrCreateTenant(WorkerConfig workerConfig, TaskExecutionContext taskExecutionContext) {
        try {
            TenantConfig tenantConfig = workerConfig.getTenantConfig();

            String tenantCode = taskExecutionContext.getTenantCode();
            if (TenantConstants.DEFAULT_TENANT_CODE.equals(tenantCode) && tenantConfig.isDefaultTenantEnabled()) {
                log.info("Current tenant is default tenant, will use bootstrap user: {} to execute the task",
                        TenantConstants.BOOTSTRAPT_SYSTEM_USER);
                return TenantConstants.BOOTSTRAPT_SYSTEM_USER;
            }
            boolean osUserExistFlag;
            // if Using distributed is true and Currently supported systems are linux,Should not let it
            // automatically
            // create tenants,so TenantAutoCreate has no effect
            if (tenantConfig.isDistributedTenantEnabled() && SystemUtils.IS_OS_LINUX) {
                // use the id command to judge in linux
                osUserExistFlag = OSUtils.existTenantCodeInLinux(tenantCode);
            } else if (OSUtils.isSudoEnable() && tenantConfig.isAutoCreateTenantEnabled()) {
                // if not exists this user, then create
                // TODO By default this branch is taken: the user is created via `sudo useradd -g %s %s`
                OSUtils.createUserIfAbsent(tenantCode);
                osUserExistFlag = OSUtils.getUserList().contains(tenantCode);
            } else {
                osUserExistFlag = OSUtils.getUserList().contains(tenantCode);
            }
            if (!osUserExistFlag) {
                throw new TaskException(String.format("TenantCode: %s doesn't exist", tenantCode));
            }
            return tenantCode;
        } catch (TaskException ex) {
            throw ex;
        } catch (Exception ex) {
            throw new TaskException(
                    String.format("TenantCode: %s doesn't exist", taskExecutionContext.getTenantCode()));
        }
    }

3. TaskChannel
TaskPluginManager completes SPI registration via Google's @AutoService when the Master starts, see: https://segmentfault.com/a/1190000044973137

TaskPluginManager initialization when the Master starts
org.apache.dolphinscheduler.server.master.MasterServer#run

@PostConstruct
    public void run() throws SchedulerException {
        ......

        // install task plugin
        // TODO SPI registration is done via Google's @AutoService
        this.taskPluginManager.loadPlugin();

       ......
    }

org.apache.dolphinscheduler.plugin.task.api.TaskPluginManager#loadPlugin

public void loadPlugin() {
        if (!loadedFlag.compareAndSet(false, true)) {
            log.warn("The task plugin has already been loaded");
            return;
        }

        // TODO Loaded via SPI at instantiation time
        PrioritySPIFactory<TaskChannelFactory> prioritySPIFactory = new PrioritySPIFactory<>(TaskChannelFactory.class);
        for (Map.Entry<String, TaskChannelFactory> entry : prioritySPIFactory.getSPIMap().entrySet()) {
            String factoryName = entry.getKey();
            TaskChannelFactory factory = entry.getValue();

            log.info("Registering task plugin: {} - {}", factoryName, factory.getClass().getSimpleName());

            taskChannelFactoryMap.put(factoryName, factory);
            taskChannelMap.put(factoryName, factory.create());

            log.info("Registered task plugin: {} - {}", factoryName, factory.getClass().getSimpleName());
        }

    }

The core logic boils down to the following.

The TaskChannelFactory interface:

public interface TaskChannelFactory extends UiChannelFactory, PrioritySPI {

    TaskChannel create();

    default SPIIdentify getIdentify() {
        return SPIIdentify.builder().name(getName()).build();
    }
}

Every task plugin implements the TaskChannelFactory interface and is annotated with @AutoService.
Take ShellTaskChannelFactory as an example:

@AutoService(TaskChannelFactory.class)
public class ShellTaskChannelFactory implements TaskChannelFactory {

    @Override
    public TaskChannel create() {
        return new ShellTaskChannel();
    }

    @Override
    public String getName() {
        return "SHELL";
    }

    @Override
    public List<PluginParams> getParams() {
        List<PluginParams> paramsList = new ArrayList<>();

        InputParam nodeName = InputParam.newBuilder("name", "$t('Node name')")
                .addValidate(Validate.newBuilder()
                        .setRequired(true)
                        .build())
                .build();

        RadioParam runFlag = RadioParam.newBuilder("runFlag", "RUN_FLAG")
                .addParamsOptions(new ParamsOptions("NORMAL", "NORMAL", false))
                .addParamsOptions(new ParamsOptions("FORBIDDEN", "FORBIDDEN", false))
                .build();

        paramsList.add(nodeName);
        paramsList.add(runFlag);
        return paramsList;
    }
}

Here a ShellTaskChannel, i.e. a TaskChannel, is created.
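
Putting the pieces together, once the factories are registered the Worker can look up the channel by task type and ask it for a task instance. This is a hedged usage sketch that reuses only the calls already shown above (getTaskChannelMap, createTask, init, the TaskPluginException string constructor); it is not a copy of the real code path.

// Hedged usage sketch, reusing only calls shown in this article.
AbstractTask createShellTask(TaskPluginManager taskPluginManager, TaskExecutionContext taskExecutionContext) {
    TaskChannel shellChannel = taskPluginManager.getTaskChannelMap().get("SHELL");
    if (shellChannel == null) {
        throw new TaskPluginException("SHELL task plugin not found, please check the task type is correct.");
    }
    AbstractTask task = shellChannel.createTask(taskExecutionContext); // same call as in beforeExecute()
    task.init();                                                       // same init() as in beforeExecute()
    return task;
}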

4. Downloading the required resources
org.apache.dolphinscheduler.server.worker.utils.TaskExecutionContextUtils#downloadResourcesIfNeeded

public static ResourceContext downloadResourcesIfNeeded(String tenant,
                                                            TaskChannel taskChannel,
                                                            StorageOperate storageOperate,
                                                            TaskExecutionContext taskExecutionContext) {
        AbstractParameters abstractParameters = taskChannel.parseParameters(
                ParametersNode.builder()
                        .taskType(taskExecutionContext.getTaskType())
                        .taskParams(taskExecutionContext.getTaskParams())
                        .build());

        // TODO For an SQL task this list is just an empty ArrayList<>(), so the code below is skipped
        List<ResourceInfo> resourceFilesList = abstractParameters.getResourceFilesList();
        if (CollectionUtils.isEmpty(resourceFilesList)) {
            log.debug("There is no resource file need to download");
            return new ResourceContext();
        }

        ResourceContext resourceContext = new ResourceContext();
        String taskWorkingDirectory = taskExecutionContext.getExecutePath();

        for (ResourceInfo resourceInfo : resourceFilesList) {
            // TODO The path in storage, e.g. the file path on HDFS
            String resourceAbsolutePathInStorage = resourceInfo.getResourceName();
            // TODO The file name
            String resourceRelativePath = storageOperate.getResourceFileName(tenant, resourceAbsolutePathInStorage);
            // TODO The absolute local path
            String resourceAbsolutePathInLocal = Paths.get(taskWorkingDirectory, resourceRelativePath).toString();
            File file = new File(resourceAbsolutePathInLocal);
            if (!file.exists()) {
                try {
                    long resourceDownloadStartTime = System.currentTimeMillis();
                    // TODO Download the resource
                    storageOperate.download(resourceAbsolutePathInStorage, resourceAbsolutePathInLocal, true);
                    log.debug("Download resource file {} under: {} successfully", resourceAbsolutePathInStorage,
                            resourceAbsolutePathInLocal);
                    FileUtils.setFileTo755(file);
                    WorkerServerMetrics
                            .recordWorkerResourceDownloadTime(System.currentTimeMillis() - resourceDownloadStartTime);
                    WorkerServerMetrics
                            .recordWorkerResourceDownloadSize(Files.size(Paths.get(resourceAbsolutePathInLocal)));
                    WorkerServerMetrics.incWorkerResourceDownloadSuccessCount();
                } catch (Exception ex) {
                    WorkerServerMetrics.incWorkerResourceDownloadFailureCount();
                    throw new TaskException(
                            String.format("Download resource file: %s error", resourceAbsolutePathInStorage), ex);
                }
            }

            // TODO Build the resourceContext
            ResourceContext.ResourceItem resourceItem = ResourceContext.ResourceItem.builder()
                    .resourceAbsolutePathInStorage(resourceAbsolutePathInStorage)
                    .resourceRelativePath(resourceRelativePath)
                    .resourceAbsolutePathInLocal(resourceAbsolutePathInLocal)
                    .build();
            resourceContext.addResourceItem(resourceItem);
        }
        return resourceContext;
    }

5. Downloading upstream files (passing files between upstream and downstream tasks)
An example:
image.png

upTask :
image.png

downTask :
image.png

Core logic: passing files between upstream and downstream tasks is actually quite simple. For the upstream task it means generating the file locally and uploading it to the resource center (e.g. HDFS); the downstream task then downloads the specified file from the resource center based on `upstreamTaskName.outputVariable`.

The downloadUpstreamFiles logic in downTask:
org.apache.dolphinscheduler.server.worker.utils.TaskFilesTransferUtils#downloadUpstreamFiles

public static void downloadUpstreamFiles(TaskExecutionContext taskExecutionContext, StorageOperate storageOperate) {
        // TODO The var pool passed down from upstream tasks
        List<Property> varPools = getVarPools(taskExecutionContext);

        // get map of varPools for quick search
        Map<String, Property> varPoolsMap = varPools.stream().collect(Collectors.toMap(Property::getProp, x -> x));

        // get "IN FILE" parameters
        // TODO Check whether localParams contains FILE parameters with direction IN
        List<Property> localParamsProperty = getFileLocalParams(taskExecutionContext, Direct.IN);

        // TODO In most cases the method ends here
        if (localParamsProperty.isEmpty()) {
            return;
        }

        String executePath = taskExecutionContext.getExecutePath();
        // data path to download packaged data
        // TODO Temporary directory for downloads
        String downloadTmpPath = String.format("%s/%s", executePath, DOWNLOAD_TMP);

        log.info("Download upstream files...");
        for (Property property : localParamsProperty) {
            // TODO Look up the corresponding upstream entry here
            /**
             * varPoolsMap looks like:
             * {"prop":"upTask.file-data","direct":"IN","type":"FILE","value":"DATA_TRANSFER/20240624/13978983404960/2_1893/upTask_1320_text.txt"}
             * {"prop":"upTask.dir-data","direct":"IN","type":"FILE","value":"DATA_TRANSFER/20240624/13978983404960/2_1893/upTask_1320_data_ds_pack.zip"}
             *
             * localParamsProperty looks like:
             * {"prop":"input_dir","direct":"IN","type":"FILE","value":"upTask.dir-data"}
             */
            // TODO So inVarPool is not null here
            Property inVarPool = varPoolsMap.get(property.getValue());
            if (inVarPool == null) {
                log.error("{} not in  {}", property.getValue(), varPoolsMap.keySet());
                throw new TaskException(String.format("Can not find upstream file using %s, please check the key",
                        property.getValue()));
            }

            String resourcePath = inVarPool.getValue();
            // TODO Build the local target path
            // TODO Note: for example in a script `cat input_dir/test1/text.txt`, the `input_dir` part comes from this concatenated download path
            String targetPath = String.format("%s/%s", executePath, property.getProp());

            String downloadPath;
            // If the data is packaged, download it to a special directory (DOWNLOAD_TMP) and unpack it to the
            // targetPath
            // TODO Check whether the data is a zip package
            boolean isPack = resourcePath.endsWith(PACK_SUFFIX);
            if (isPack) {
                downloadPath = String.format("%s/%s", downloadTmpPath, new File(resourcePath).getName());
            } else {
                downloadPath = targetPath;
            }

            try {
                // TODO Path in the resource center
                String resourceWholePath =
                        storageOperate.getResourceFullName(taskExecutionContext.getTenantCode(), resourcePath);
                log.info("{} --- Remote:{} to Local:{}", property, resourceWholePath, downloadPath);
                // TODO Download to the local path
                storageOperate.download(resourceWholePath, downloadPath, true);
            } catch (IOException ex) {
                throw new TaskException("Download file from storage error", ex);
            }

            // unpack if the data is packaged
            if (isPack) {
                File downloadFile = new File(downloadPath);
                log.info("Unpack {} to {}", downloadPath, targetPath);
                // TODO If it is a zip, unpack the archive from the local temp directory into the target path
                ZipUtil.unpack(downloadFile, new File(targetPath));
            }
        }

        // delete DownloadTmp Folder if DownloadTmpPath exists
        try {
            // TODO Delete the files under the temp directory
            org.apache.commons.io.FileUtils.deleteDirectory(new File(downloadTmpPath));
        } catch (IOException e) {
            log.error("Delete DownloadTmpPath {} failed, this will not affect the task status", downloadTmpPath, e);
        }
    }

6. Creating and initializing the task
This is simply step 3 continued: once the TaskChannel has been created, createTask is called to return an AbstractTask, and then its init() method is invoked.

......
// TODO Create the task
task = taskChannel.createTask(taskExecutionContext);
log.info("Task plugin instance: {} create successfully", taskExecutionContext.getTaskType());

// todo: remove the init method, this should initialize in constructor method
// TODO Initialize the task
task.init();
log.info("Success initialized task plugin instance successfully");
......

7. Setting the var pool on AbstractParameters

// TODO Set the var pool on the task
// TODO In most cases taskExecutionContext.getVarPool() is null here
task.getParameters().setVarPool(taskExecutionContext.getVarPool());
log.info("Set taskVarPool: {} successfully", taskExecutionContext.getVarPool());

Note: by default taskExecutionContext.getVarPool() is empty, unless an upstream task produced OUT variables.

2.5.3 Task execution

// TODO Callback: importantly, workerMessageSender and taskExecutionContext are handed to TaskCallBack via its builder,
// TODO so taskExecutionContext still carries everything set earlier
TaskCallBack taskCallBack = TaskCallbackImpl.builder()
        .workerMessageSender(workerMessageSender)
        .taskExecutionContext(taskExecutionContext)
        .build();
.......

// TODO Execute
executeTask(taskCallBack);

executeTask(taskCallBack) is the core: it encapsulates the actual execution logic of the Worker task. The TaskCallBack argument is used to report the task status back to the Master.

Let's go through the logic of executeTask(taskCallBack) in detail:

public void executeTask(TaskCallBack taskCallBack) throws TaskException {
    if (task == null) {
        throw new IllegalArgumentException("The task plugin instance is not initialized");
    }

    // TODO The actual task handling happens here
    task.handle(taskCallBack);
}

The task here is an AbstractTask, created in beforeExecute via taskChannel.createTask. It is the abstract parent class of all tasks (ShellTask is used as the example below; other task types are similar).

org.apache.dolphinscheduler.plugin.task.shell.ShellTask#handle

public void handle(TaskCallBack taskCallBack) throws TaskException {
    try {

        IShellInterceptorBuilder<?, ?> shellActuatorBuilder = ShellInterceptorBuilderFactory.newBuilder()
                .properties(ParameterUtils.convert(taskExecutionContext.getPrepareParamsMap())) // TODO variable substitution happens here
                .appendScript(shellParameters.getRawScript());

        // TODO Run the shell
        TaskResponse commandExecuteResult = shellCommandExecutor.run(shellActuatorBuilder, taskCallBack);
        // TODO Execution result: the exit status code
        setExitStatusCode(commandExecuteResult.getExitStatusCode());
        // TODO Set the process id
        setProcessId(commandExecuteResult.getProcessId());
        // TODO shellCommandExecutor.getTaskOutputParams() returns e.g. output -> 123
        shellParameters.dealOutParam(shellCommandExecutor.getTaskOutputParams());
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        log.error("The current Shell task has been interrupted", e);
        setExitStatusCode(EXIT_CODE_FAILURE);
        throw new TaskException("The current Shell task has been interrupted", e);
    } catch (Exception e) {
        log.error("shell task error", e);
        setExitStatusCode(EXIT_CODE_FAILURE);
        throw new TaskException("Execute shell task error", e);
    }
}

org.apache.dolphinscheduler.plugin.task.api.shell.ShellInterceptorBuilderFactory
By default BashShellInterceptorBuilder is used.

public class ShellInterceptorBuilderFactory {
    private final static String INTERCEPTOR_TYPE = PropertyUtils.getString("shell.interceptor.type", "bash");

    @SuppressWarnings("unchecked")
    public static IShellInterceptorBuilder newBuilder() {
        // TODO This is the default branch
        if (INTERCEPTOR_TYPE.equalsIgnoreCase("bash")) {
            return new BashShellInterceptorBuilder();
        }
        if (INTERCEPTOR_TYPE.equalsIgnoreCase("sh")) {
            return new ShShellInterceptorBuilder();
        }
        if (INTERCEPTOR_TYPE.equalsIgnoreCase("cmd")) {
            return new CmdShellInterceptorBuilder();
        }
        throw new IllegalArgumentException("not support shell type: " + INTERCEPTOR_TYPE);
    }
}

.properties(ParameterUtils.convert(taskExecutionContext.getPrepareParamsMap()))  

This puts taskExecutionContext.getPrepareParamsMap() into BaseShellInterceptorBuilder's propertyMap (note: taskExecutionContext.getPrepareParamsMap() is assembled on the Master side, see https://segmentfault.com/a/1190000044982236).


.appendScript(shellParameters.getRawScript())

This appends the raw script to BaseShellInterceptorBuilder's scripts.
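
To make the two builder calls concrete, here is a tiny, self-contained sketch of the idea (not the real BaseShellInterceptorBuilder): the prepared parameters fill a property map, and `${...}` placeholders in the raw script are replaced from that map when the script body is generated. The class name and replacement logic are simplifications made for illustration.

import java.util.HashMap;
import java.util.Map;

// Illustration only: mimics how properties() + appendScript() could combine into the
// generated shell body. The real builder also writes env files, k8s config, etc.
class ShellBuilderSketch {
    private final Map<String, String> propertyMap = new HashMap<>();
    private final StringBuilder scripts = new StringBuilder();

    ShellBuilderSketch properties(Map<String, String> prepareParams) {
        propertyMap.putAll(prepareParams);
        return this;
    }

    ShellBuilderSketch appendScript(String rawScript) {
        scripts.append(rawScript).append('\n');
        return this;
    }

    String generateScriptBody() {
        String body = scripts.toString();
        for (Map.Entry<String, String> e : propertyMap.entrySet()) {
            body = body.replace("${" + e.getKey() + "}", e.getValue());
        }
        return body;
    }

    public static void main(String[] args) {
        Map<String, String> prepareParams = new HashMap<>();
        prepareParams.put("name", "journey");
        String script = new ShellBuilderSketch()
                .properties(prepareParams)
                .appendScript("echo ${name}")
                .generateScriptBody();
        System.out.println(script); // echo journey
    }
}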


org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor#run

public TaskResponse run(IShellInterceptorBuilder iShellInterceptorBuilder,
                            TaskCallBack taskCallBack) throws Exception {
    TaskResponse result = new TaskResponse();
    // todo: we need to use state like JDK Thread to make sure the killed task should not be executed
    iShellInterceptorBuilder = iShellInterceptorBuilder
            // TODO Set the execution path
            .shellDirectory(taskRequest.getExecutePath())
            // TODO Set the shell script name
            .shellName(taskRequest.getTaskAppId());

    // Set system env
    // TODO Default system env files are set here, e.g. /etc/profile
    if (CollectionUtils.isNotEmpty(ShellUtils.ENV_SOURCE_LIST)) {
        // TODO Add the env files configured in ENV_SOURCE_LIST to systemEnvs
        ShellUtils.ENV_SOURCE_LIST.forEach(iShellInterceptorBuilder::appendSystemEnv);
    }

    // Set custom env
    // TODO Set the custom env
    if (StringUtils.isNotBlank(taskRequest.getEnvironmentConfig())) {
        // TODO Append to customEnvScripts
        iShellInterceptorBuilder.appendCustomEnvScript(taskRequest.getEnvironmentConfig());
    }

    // Set k8s config (This is only work in Linux)
    if (taskRequest.getK8sTaskExecutionContext() != null) {
        iShellInterceptorBuilder.k8sConfigYaml(taskRequest.getK8sTaskExecutionContext().getConfigYaml());
    }

    // Set sudo (This is only work in Linux)
    // TODO Enable sudo mode
    iShellInterceptorBuilder.sudoMode(OSUtils.isSudoEnable());

    // Set tenant (This is only work in Linux)
    // TODO Set the tenant (run user)
    iShellInterceptorBuilder.runUser(taskRequest.getTenantCode());

    // Set CPU Quota (This is only work in Linux)
    if (taskRequest.getCpuQuota() != null) {
        iShellInterceptorBuilder.cpuQuota(taskRequest.getCpuQuota());
    }

    // Set memory Quota (This is only work in Linux)
    if (taskRequest.getMemoryMax() != null) {
        iShellInterceptorBuilder.memoryQuota(taskRequest.getMemoryMax());
    }

    // TODO This is the key step
    IShellInterceptor iShellInterceptor = iShellInterceptorBuilder.build();
    // TODO Execute via ProcessBuilder; supports sudo mode and bash mode
    process = iShellInterceptor.execute();

    // parse process output
    // TODO Parse the process output here
    parseProcessOutput(this.process);

    // collect pod log
    collectPodLogIfNeeded();

    int processId = getProcessId(this.process);

    result.setProcessId(processId);

    // cache processId
    taskRequest.setProcessId(processId);

    // print process id
    log.info("process start, process id is: {}", processId);

    // if timeout occurs, exit directly
    long remainTime = getRemainTime();

    // update pid before waiting for the run to finish
    if (null != taskCallBack) {
        // TODO Update the task instance info
        taskCallBack.updateTaskInstanceInfo(processId);
    }

    // waiting for the run to finish
    boolean status = this.process.waitFor(remainTime, TimeUnit.SECONDS);

    TaskExecutionStatus kubernetesStatus =
            ProcessUtils.getApplicationStatus(taskRequest.getK8sTaskExecutionContext(), taskRequest.getTaskAppId());

    if (taskOutputFuture != null) {
        try {
            // Wait the task log process finished.
            taskOutputFuture.get();
        } catch (ExecutionException e) {
            log.error("Handle task log error", e);
        }
    }

    if (podLogOutputFuture != null) {
        try {
            // Wait kubernetes pod log collection finished
            podLogOutputFuture.get();
            // delete pod after successful execution and log collection
            ProcessUtils.cancelApplication(taskRequest);
        } catch (ExecutionException e) {
            log.error("Handle pod log error", e);
        }
    }

    // if SHELL task exit
    if (status && kubernetesStatus.isSuccess()) {

        // SHELL task state
        result.setExitStatusCode(this.process.exitValue());

    } else {
        log.error("process has failure, the task timeout configuration value is:{}, ready to kill ...",
                taskRequest.getTaskTimeout());
        result.setExitStatusCode(EXIT_CODE_FAILURE);
        cancelApplication();
    }
    int exitCode = this.process.exitValue();
    String exitLogMessage = EXIT_CODE_KILL == exitCode ? "process has killed." : "process has exited.";
    log.info("{} execute path:{}, processId:{} ,exitStatusCode:{} ,processWaitForStatus:{} ,processExitValue:{}",
            exitLogMessage, taskRequest.getExecutePath(), processId, result.getExitStatusCode(), status, exitCode);
    return result;
}

Setting the default environment variables:

// Set system env
// TODO Default system env files are set here, e.g. /etc/profile
if (CollectionUtils.isNotEmpty(ShellUtils.ENV_SOURCE_LIST)) {
    // TODO Add the env files configured in ENV_SOURCE_LIST to systemEnvs
    ShellUtils.ENV_SOURCE_LIST.forEach(iShellInterceptorBuilder::appendSystemEnv);
}

org.apache.dolphinscheduler.plugin.task.api.utils.ShellUtils

public List<String> ENV_SOURCE_LIST = Arrays.stream(
            Optional.ofNullable(PropertyUtils.getString("shell.env_source_list")).map(s -> s.split(","))
                    .orElse(new String[0]))
            .map(String::trim)
            .filter(StringUtils::isNotBlank)
            .collect(Collectors.toList());

It reads common.properties, where the default environment variables can be configured:

# The default env list will be load by Shell task, e.g. /etc/profile,~/.bash_profile
# Empty by default; for example it could be
shell.env_source_list=/etc/profile

// TODO This is the key step
IShellInterceptor iShellInterceptor = iShellInterceptorBuilder.build();

org.apache.dolphinscheduler.plugin.task.api.shell.bash.BashShellInterceptorBuilder#build

public BashShellInterceptor build() throws FileOperateException, IOException {
    // TODO This is where the shell script is generated and written to the target directory
    generateShellScript();
    // TODO Assemble the bootstrap command
    List<String> bootstrapCommand = generateBootstrapCommand();
    // TODO Instantiate BashShellInterceptor
    return new BashShellInterceptor(bootstrapCommand, shellDirectory);
}

org.apache.dolphinscheduler.plugin.task.api.shell.BaseLinuxShellInterceptorBuilder#bootstrapCommandInSudoMode
Note: this method covers two cases. If resource limits are enabled it goes through bootstrapCommandInResourceLimitMode (see https://segmentfault.com/a/1190000044974900; there is actually a big BUG lurking here (I only fixed ShellTask): for other shell-wrapped task types such as MR, Spark, Flink and so on, resource-limit mode is problematic, because CPU and memory quotas cannot be set for them in the UI). Otherwise it runs `sudo -u <tenant> -i /opt/xx.sh`.

private List<String> bootstrapCommandInSudoMode() {
    // TODO If task.resource.limit.state is false, this branch is skipped and no CPU/memory limits are applied
    if (PropertyUtils.getBoolean(AbstractCommandExecutorConstants.TASK_RESOURCE_LIMIT_STATE, false)) {
        return bootstrapCommandInResourceLimitMode();
    }
    List<String> bootstrapCommand = new ArrayList<>();
    bootstrapCommand.add("sudo");
    if (StringUtils.isNotBlank(runUser)) {
        bootstrapCommand.add("-u");
        bootstrapCommand.add(runUser);
    }
    bootstrapCommand.add("-i");
    bootstrapCommand.add(shellAbsolutePath().toString());
    return bootstrapCommand;
}

// TODO Execute via ProcessBuilder; supports sudo mode and bash mode
process = iShellInterceptor.execute();

org.apache.dolphinscheduler.plugin.task.api.shell.BaseShellInterceptor#execute

public Process execute() throws IOException {
    // init process builder
    ProcessBuilder processBuilder = new ProcessBuilder();
    // setting up a working directory
    // TODO Set the working directory so that the script can load e.g. jar files relative to it during execution
    processBuilder.directory(new File(workingDirectory));
    // merge error information to standard output stream
    processBuilder.redirectErrorStream(true);
    processBuilder.command(executeCommands);
    log.info("Executing shell command : {}", String.join(" ", executeCommands));
    return processBuilder.start();
}

In other words, the task is launched with a ProcessBuilder.
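
As a standalone illustration of the same pattern (assuming a Linux box where the tenant user and the generated script already exist; paths and the tenant name are invented for the example), launching the script in sudo mode boils down to roughly this:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Standalone illustration of the sudo-mode launch described above.
public class SudoLaunchSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // bootstrap command as built by bootstrapCommandInSudoMode(): sudo -u <tenant> -i <script>
        List<String> bootstrapCommand = Arrays.asList("sudo", "-u", "tenant_code", "-i", "/tmp/exec/1893_1321.sh");

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.directory(new File("/tmp/exec")); // working directory = task execute path
        processBuilder.redirectErrorStream(true);        // merge stderr into stdout, like the real executor
        processBuilder.command(bootstrapCommand);

        Process process = processBuilder.start();
        int exitCode = process.waitFor();                // the real code uses waitFor(remainTime, SECONDS)
        System.out.println("exit code: " + exitCode);    // 0 success, 137 killed, anything else failure
    }
}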


org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor#parseProcessOutput

// TODO Parse the output
private void parseProcessOutput(Process process) {
    // todo: remove this this thread pool.
    ExecutorService getOutputLogService = ThreadUtils
            .newSingleDaemonScheduledExecutorService("ResolveOutputLog-thread-" + taskRequest.getTaskName());
    getOutputLogService.execute(() -> {
        TaskOutputParameterParser taskOutputParameterParser = new TaskOutputParameterParser();
        // TODO Read from process.getInputStream() here
        try (BufferedReader inReader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            // TODO Set the task log path in the MDC here
            LogUtils.setTaskInstanceLogFullPathMDC(taskRequest.getLogPath());
            String line;
            while ((line = inReader.readLine()) != null) {
                // TODO Log buffer
                logBuffer.add(line);
                // TODO Parse taskOutputParams here, e.g. echo '${setValue(output=1)}' — i.e. the literal string ${setValue(output=1)}
                taskOutputParameterParser.appendParseLog(line);
            }
            processLogOutputIsSuccess = true;
        } catch (Exception e) {
            log.error("Parse var pool error", e);
            processLogOutputIsSuccess = true;
        } finally {
            // TODO taskInstanceLogFullPath is removed from the MDC here
            LogUtils.removeTaskInstanceLogFullPathMDC();
        }
        taskOutputParams = taskOutputParameterParser.getTaskOutputParams();
    });

    getOutputLogService.shutdown();

    ExecutorService parseProcessOutputExecutorService = ThreadUtils
            .newSingleDaemonScheduledExecutorService("TaskInstanceLogOutput-thread-" + taskRequest.getTaskName());
    taskOutputFuture = parseProcessOutputExecutorService.submit(() -> {
        try {
            LogUtils.setTaskInstanceLogFullPathMDC(taskRequest.getLogPath());
            // TODO For non-pod (non-k8s) tasks the loop relies on the processLogOutputIsSuccess flag set above, which signals that the task has finished
            while (logBuffer.size() > 1 || !processLogOutputIsSuccess || !podLogOutputIsFinished) {
                if (logBuffer.size() > 1) {
                    logHandler.accept(logBuffer);
                    logBuffer.clear();
                    logBuffer.add(EMPTY_STRING);
                } else {
                    // TODO If there is no log output, wait 1s by default
                    Thread.sleep(TaskConstants.DEFAULT_LOG_FLUSH_INTERVAL);
                }
            }
        } catch (Exception e) {
            log.error("Output task log error", e);
        } finally {
            LogUtils.removeTaskInstanceLogFullPathMDC();
        }
    });
    parseProcessOutputExecutorService.shutdown();
}

Two core pieces of logic here:
1. Printing the output log

protected LinkedBlockingQueue<String> logBuffer;

public AbstractCommandExecutor(Consumer<LinkedBlockingQueue<String>> logHandler,
                                   TaskExecutionContext taskRequest) {
        this.logHandler = logHandler;
        this.taskRequest = taskRequest;
        this.logBuffer = new LinkedBlockingQueue<>();
        this.logBuffer.add(EMPTY_STRING);

        if (this.taskRequest != null) {
            // set logBufferEnable=true if the task uses logHandler and logBuffer to buffer log messages
            this.taskRequest.setLogBufferEnable(true);
        }
    }

Logs are buffered temporarily in logBuffer and consumed by the parseProcessOutputExecutorService thread.

The producer side of the log:

while ((line = inReader.readLine()) != null) {
    // TODO Log buffer
    logBuffer.add(line);
    // TODO Parse taskOutputParams here, e.g. echo '${setValue(output=1)}' — i.e. the literal string ${setValue(output=1)}
    taskOutputParameterParser.appendParseLog(line);
}

The consumer side of the log:

this.shellCommandExecutor = new ShellCommandExecutor(this::logHandle, taskExecutionContext);

public void logHandle(LinkedBlockingQueue<String> logs) {

    StringJoiner joiner = new StringJoiner("\n\t");
    while (!logs.isEmpty()) {
        joiner.add(logs.poll());
    }
    log.info(" -> {}", joiner);
}

while (logBuffer.size() > 1 || !processLogOutputIsSuccess || !podLogOutputIsFinished) {
    if (logBuffer.size() > 1) {
        logHandler.accept(logBuffer);
        logBuffer.clear();
        logBuffer.add(EMPTY_STRING);
    } else {
        // TODO If there is no log output, wait 1s by default
        Thread.sleep(TaskConstants.DEFAULT_LOG_FLUSH_INTERVAL);
    }
}

That is why, when viewing shell task logs, the output lines all start with `->`, for example:

[INFO] 2024-06-24 09:35:44.678 +0800 -  -> 
    .
    ├── 1893_1321.sh
    └── input_dir
        ├── test1
        │   └── text.txt
        └── test2
            └── text.txt
    
    3 directories, 3 files
    test1 message
    test2 message

2. Parsing the var pool

while ((line = inReader.readLine()) != null) {
    // TODO Log buffer
    logBuffer.add(line);
    // TODO Parse taskOutputParams here, e.g. echo '${setValue(output=1)}' — i.e. the literal string ${setValue(output=1)}
    taskOutputParameterParser.appendParseLog(line);
}

org.apache.dolphinscheduler.plugin.task.api.parser.TaskOutputParameterParser#appendParseLog

public void appendParseLog(String logLine) {
    if (logLine == null) {
        return;
    }

    // TODO On the first call this branch is not taken
    if (currentTaskOutputParam != null) {
        if (currentTaskOutputParam.size() > maxOneParameterRows
                || currentTaskOutputParamLength > maxOneParameterLength) {
            log.warn(
                    "The output param expression '{}' is too long, the max rows is {}, max length is {}, will skip this param",
                    String.join("\n", currentTaskOutputParam), maxOneParameterLength, maxOneParameterRows);
            currentTaskOutputParam = null;
            currentTaskOutputParamLength = 0;
            return;
        }
        // continue to parse the rest of line
        int i = logLine.indexOf(")}");
        if (i == -1) {
            // the end of var pool not found
            currentTaskOutputParam.add(logLine);
            currentTaskOutputParamLength += logLine.length();
        } else {
            // the end of var pool found
            currentTaskOutputParam.add(logLine.substring(0, i + 2));
            Pair<String, String> keyValue = parseOutputParam(String.join("\n", currentTaskOutputParam));
            if (keyValue.getKey() != null && keyValue.getValue() != null) {
                // TODO Once parsed, put it into taskOutputParams
                taskOutputParams.put(keyValue.getKey(), keyValue.getValue());
            }
            currentTaskOutputParam = null;
            currentTaskOutputParamLength = 0;
            // continue to parse the rest of line
            if (i + 2 != logLine.length()) {
                appendParseLog(logLine.substring(i + 2));
            }
        }
        return;
    }

    int indexOfVarPoolBegin = logLine.indexOf("${setValue(");
    if (indexOfVarPoolBegin == -1) {
        indexOfVarPoolBegin = logLine.indexOf("#{setValue(");
    }
    if (indexOfVarPoolBegin == -1) {
        return;
    }
    currentTaskOutputParam = new ArrayList<>();
    appendParseLog(logLine.substring(indexOfVarPoolBegin));
}

Once parsing is done, the key/value pair is put into taskOutputParams.
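
A quick usage sketch of the parser, feeding it a couple of log lines the way the output-reading loop above does. It only uses the constructor and methods already shown (new TaskOutputParameterParser(), appendParseLog, getTaskOutputParams); the sample lines are made up.

// Usage sketch: the same calls the output-reading loop makes, on two sample log lines.
TaskOutputParameterParser parser = new TaskOutputParameterParser();
parser.appendParseLog("some normal shell output");   // ignored: no ${setValue(...)} marker
parser.appendParseLog("${setValue(output=123)}");     // parsed into a key/value pair

Map<String, String> taskOutputParams = parser.getTaskOutputParams();
// taskOutputParams now contains {output=123}, later merged into the var pool by dealOutParam()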


Updating the pid (reporting to the Master)

// update pid before waiting for the run to finish
if (null != taskCallBack) {
    // TODO Update the task instance info
    taskCallBack.updateTaskInstanceInfo(processId);
}

For an optimization of this, see: https://segmentfault.com/a/1190000044988696


Timeout handling

long remainTime = getRemainTime();

private long getRemainTime() {
    long usedTime = (System.currentTimeMillis() - taskRequest.getStartTime()) / 1000;
    long remainTime = taskRequest.getTaskTimeout() - usedTime;

    if (remainTime < 0) {
        throw new RuntimeException("task execution time out");
    }

    return remainTime;
}

......
// waiting for the run to finish
// TODO This is a timed wait; if no timeout were set it would wait indefinitely
boolean status = this.process.waitFor(remainTime, TimeUnit.SECONDS);

// TODO Set the exit code
// if SHELL task exit
if (status && kubernetesStatus.isSuccess()) {

    // SHELL task state
    result.setExitStatusCode(this.process.exitValue());

} else {
    log.error("process has failure, the task timeout configuration value is:{}, ready to kill ...",
            taskRequest.getTaskTimeout());
    result.setExitStatusCode(EXIT_CODE_FAILURE);
    cancelApplication();
}
int exitCode = this.process.exitValue();
String exitLogMessage = EXIT_CODE_KILL == exitCode ? "process has killed." : "process has exited.";
log.info("{} execute path:{}, processId:{} ,exitStatusCode:{} ,processWaitForStatus:{} ,processExitValue:{}",
        exitLogMessage, taskRequest.getExecutePath(), processId, result.getExitStatusCode(), status, exitCode);

// TODO Execution result: the exit status code
setExitStatusCode(commandExecuteResult.getExitStatusCode());
// TODO Set the process id
setProcessId(commandExecuteResult.getProcessId());
// TODO shellCommandExecutor.getTaskOutputParams() returns e.g. output -> 123
shellParameters.dealOutParam(shellCommandExecutor.getTaskOutputParams());

org.apache.dolphinscheduler.plugin.task.api.parameters.AbstractParameters#dealOutParam

public void dealOutParam(Map<String, String> taskOutputParams) {
    // TODO If localParams is empty, any output set by the task is ignored
    if (CollectionUtils.isEmpty(localParams)) {
        return;
    }

    // TODO Filter out the OUT parameters from localParams
    List<Property> outProperty = getOutProperty(localParams);
    if (CollectionUtils.isEmpty(outProperty)) {
        return;
    }


    // TODO If taskOutputParams is empty, the OUT parameters are put into the var pool as-is
    if (MapUtils.isEmpty(taskOutputParams)) {
        outProperty.forEach(this::addPropertyToValPool);
        return;
    }

    // TODO Find the keys shared by outProperty and taskOutputParams, replace the value with the one from taskOutputParams,
    // TODO and finally put the entries into the var pool
    for (Property info : outProperty) {
        String propValue = taskOutputParams.get(info.getProp());
        if (StringUtils.isNotEmpty(propValue)) {
            info.setValue(propValue);
            addPropertyToValPool(info);
        } else {
            log.warn("Cannot find the output parameter {} in the task output parameters", info.getProp());
        }
    }
}

In short: for every key that appears both in outProperty and taskOutputParams, the value is replaced with the one from taskOutputParams and the entry is added to the TaskInstance var pool, ready to be reported to the Master.
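
A small worked example of that merge, using plain maps instead of the real Property objects; treat it as an illustration of the data flow only, with invented values.

import java.util.HashMap;
import java.util.Map;

// Illustration of the dealOutParam() merge using plain maps instead of Property objects.
public class DealOutParamSketch {
    public static void main(String[] args) {
        // OUT parameters declared in the task's localParams (value initially empty)
        Map<String, String> outProperties = new HashMap<>();
        outProperties.put("output", "");

        // values parsed from the shell log, e.g. echo '${setValue(output=123)}'
        Map<String, String> taskOutputParams = new HashMap<>();
        taskOutputParams.put("output", "123");

        // merge: for a shared key, take the parsed value and hand it to the var pool
        Map<String, String> varPool = new HashMap<>();
        outProperties.forEach((prop, oldValue) -> {
            String parsed = taskOutputParams.get(prop);
            if (parsed != null && !parsed.isEmpty()) {
                varPool.put(prop, parsed);
            }
        });
        System.out.println(varPool); // {output=123}, reported to the Master with the TaskInstance
    }
}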

2.5.4 After task execution (cleanup)

protected void afterExecute() throws TaskException {
    if (task == null) {
        throw new TaskException("The current task instance is null");
    }
    // TODO Send an alert if needed, via a JDK-dynamic-proxy RPC call to the alert module's AlertBootstrapService
    sendAlertIfNeeded();

    // TODO Send the result
    sendTaskResult();

    WorkerTaskExecutorHolder.remove(taskExecutionContext.getTaskInstanceId());

    // TODO common.properties development.state defaults to false. If set to true,
    // TODO development mode is enabled and the scripts/jar files generated by DolphinScheduler are not cleaned up
    log.info("Remove the current task execute context from worker cache");
    clearTaskExecPathIfNeeded();
}

Sending the result

protected void sendTaskResult() {
    taskExecutionContext.setCurrentExecutionStatus(task.getExitStatus());
    taskExecutionContext.setProcessId(task.getProcessId());
    taskExecutionContext.setAppIds(task.getAppIds());

    // TODO Serialize the var pool here so it can be sent
    taskExecutionContext.setVarPool(JSONUtils.toJsonString(task.getParameters().getVarPool()));
    taskExecutionContext.setEndTime(System.currentTimeMillis());

    // upload out files and modify the "OUT FILE" property in VarPool
    // TODO Upload the output files and rewrite the OUT FILE properties in the var pool
    TaskFilesTransferUtils.uploadOutputFiles(taskExecutionContext, storageOperate);
    log.info("Upload output files: {} successfully",
            TaskFilesTransferUtils.getFileLocalParams(taskExecutionContext, Direct.OUT));

    // TODO Send the task result
    workerMessageSender.sendMessageWithRetry(taskExecutionContext,
            ITaskInstanceExecutionEvent.TaskInstanceExecutionEventType.FINISH);
    log.info("Send task execute status: {} to master : {}", taskExecutionContext.getCurrentExecutionStatus().name(),
            taskExecutionContext.getWorkflowInstanceHost());
}

A quick aside on shell exit codes:
[root@node opt]# vim test.sh 
[root@node opt]# sh test.sh 
me is journey
[root@node opt]# echo $?
0
[root@node opt]# vim test.sh 
[root@node opt]# sh test.sh 
test.sh: line 2: echo1: command not found
[root@node opt]# echo $?
127
[root@node opt]# vim test.sh 
[root@node opt]# sh test.sh 
me is 10.253.26.85
Killed
[root@node opt]# echo $?
137

Summary: a SHELL task that exits normally returns code 0; a killed one returns 137; anything else is treated as a failure.


The task status-code mapping logic:

taskExecutionContext.setCurrentExecutionStatus(task.getExitStatus());

org.apache.dolphinscheduler.plugin.task.api.AbstractTask#getExitStatus
// If the exit code is 0 the task succeeded; 137 means KILL; anything else is a failure. The exit status itself is set during executeTask
public TaskExecutionStatus getExitStatus() {
    switch (getExitStatusCode()) {
        case TaskConstants.EXIT_CODE_SUCCESS:
            return TaskExecutionStatus.SUCCESS;
        case TaskConstants.EXIT_CODE_KILL:
            return TaskExecutionStatus.KILL;
        default:
            return TaskExecutionStatus.FAILURE;
    }
}

Uploading the output files to the resource center:
org.apache.dolphinscheduler.server.worker.utils.TaskFilesTransferUtils#uploadOutputFiles

public static void uploadOutputFiles(TaskExecutionContext taskExecutionContext,
                                         StorageOperate storageOperate) throws TaskException {
    List<Property> varPools = getVarPools(taskExecutionContext);
    // get map of varPools for quick search
    Map<String, Property> varPoolsMap = varPools.stream().collect(Collectors.toMap(Property::getProp, x -> x));

    // get OUTPUT FILE parameters
    List<Property> localParamsProperty = getFileLocalParams(taskExecutionContext, Direct.OUT);

    if (localParamsProperty.isEmpty()) {
        return;
    }

    log.info("Upload output files ...");
    for (Property property : localParamsProperty) {
        // get local file path
        String path = String.format("%s/%s", taskExecutionContext.getExecutePath(), property.getValue());

        // TODO packIfDir and crc: if the path is a directory, zip it and generate a crc; if it is a file, just generate a crc for it
        String srcPath = packIfDir(path);

        // get crc file path
        String srcCRCPath = srcPath + CRC_SUFFIX;
        try {
            FileUtils.writeContent2File(FileUtils.getFileChecksum(path), srcCRCPath);
        } catch (IOException ex) {
            throw new TaskException(ex.getMessage(), ex);
        }

        // get remote file path
        // TODO DATA_TRANSFER/DATE/ProcessDefineCode/ProcessDefineVersion_ProcessInstanceID/TaskName_TaskInstanceID_FileName
        String resourcePath = getResourcePath(taskExecutionContext, new File(srcPath).getName());
        String resourceCRCPath = resourcePath + CRC_SUFFIX;
        try {
            // upload file to storage
            // TODO Taking HDFS as an example
            // TODO HDFS root path /tenantCode/resources/DATA_TRANSFER/DATE/ProcessDefineCode/ProcessDefineVersion_ProcessInstanceID/TaskName_TaskInstanceID_FileName
            String resourceWholePath =
                    storageOperate.getResourceFullName(taskExecutionContext.getTenantCode(), resourcePath);
            String resourceCRCWholePath =
                    storageOperate.getResourceFullName(taskExecutionContext.getTenantCode(), resourceCRCPath);
            log.info("{} --- Local:{} to Remote:{}", property, srcPath, resourceWholePath);
            storageOperate.upload(taskExecutionContext.getTenantCode(), srcPath, resourceWholePath, false, true);
            log.info("{} --- Local:{} to Remote:{}", "CRC file", srcCRCPath, resourceCRCWholePath);
            storageOperate.upload(taskExecutionContext.getTenantCode(), srcCRCPath, resourceCRCWholePath, false,
                    true);
        } catch (IOException ex) {
            throw new TaskException("Upload file to storage error", ex);
        }

        // update varPool
        Property oriProperty;
        // if the property is not in varPool, add it
        if (varPoolsMap.containsKey(property.getProp())) { // in theory this branch is never taken
            oriProperty = varPoolsMap.get(property.getProp());
        } else {
            oriProperty = new Property(property.getProp(), Direct.OUT, DataType.FILE, property.getValue());
            // TODO Add to the var pool
            varPools.add(oriProperty);
        }

        // TODO Set the prop name to taskName.propertyName
        oriProperty.setProp(String.format("%s.%s", taskExecutionContext.getTaskName(), oriProperty.getProp()));
        // TODO This is crucial: the resource's relative path is stored as the value of the var pool entry
        oriProperty.setValue(resourcePath);
    }

    // TODO Write the FILE entries back into the var pool
    taskExecutionContext.setVarPool(JSONUtils.toJsonString(varPools));
}

Sending the task result:

workerMessageSender.sendMessageWithRetry(taskExecutionContext,
                ITaskInstanceExecutionEvent.TaskInstanceExecutionEventType.FINISH);

3. What the WorkerMessageSender Component Does

WorkerMessageSender诉说.png
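
Since the figure above carries the details, the following is only a rough, hypothetical sketch of the pattern the method names imply: sendMessageWithRetry hands the event to the Master and also registers it for retry, and the retry entry is removed once the event is acknowledged (the kill flow below calls messageRetryRunner.removeRetryMessages for the same reason). The class and its internals are assumptions, not the real implementation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the retry pattern suggested by sendMessageWithRetry / removeRetryMessages;
// the real WorkerMessageSender and MessageRetryRunner differ.
class WorkerMessageSenderSketch {

    private final Map<Integer, Object> pendingEvents = new ConcurrentHashMap<>();

    void sendMessageWithRetry(int taskInstanceId, Object event) {
        pendingEvents.put(taskInstanceId, event); // remember it so a background runner can resend
        trySend(event);                           // first attempt right away
    }

    void onAck(int taskInstanceId) {
        pendingEvents.remove(taskInstanceId);     // acknowledged by the Master, stop retrying
    }

    void retryLoop() {
        // periodically resend everything still pending
        pendingEvents.values().forEach(this::trySend);
    }

    private void trySend(Object event) { /* RPC call to the Master, omitted */ }
}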

4. Kill Task Logic

org.apache.dolphinscheduler.server.worker.runner.operator.TaskInstanceKillOperationFunction#operate

public TaskInstanceKillResponse operate(TaskInstanceKillRequest taskInstanceKillRequest) {
    log.info("Receive TaskInstanceKillRequest: {}", taskInstanceKillRequest);

    // TODO the task instance id to kill
    int taskInstanceId = taskInstanceKillRequest.getTaskInstanceId();
    try {
        LogUtils.setTaskInstanceIdMDC(taskInstanceId);
        // TODO look up the WorkerTaskExecutor for this task instance
        WorkerTaskExecutor workerTaskExecutor = WorkerTaskExecutorHolder.get(taskInstanceId);
        if (workerTaskExecutor == null) {
            log.error("Cannot find WorkerTaskExecutor for taskInstance: {}", taskInstanceId);
            return TaskInstanceKillResponse.fail("Cannot find WorkerTaskExecutor");
        }

        // TODO the task execution context
        TaskExecutionContext taskExecutionContext = workerTaskExecutor.getTaskExecutionContext();

        LogUtils.setTaskInstanceLogFullPathMDC(taskExecutionContext.getLogPath());

        // TODO kill the local process and the YARN/K8s application here
        boolean result = doKill(taskExecutionContext);

        // TODO Process.destroy() is a method of Java's Process class that destroys the subprocess associated with that Process object
        this.cancelApplication(workerTaskExecutor);

        int processId = taskExecutionContext.getProcessId();
        // TODO if processId is 0, simply mark the task as KILL; the KILL state is then reported when the worker sends its status update
        // TODO note that in this case the task is not necessarily really killed; this only keeps the state inside DS consistent
        if (processId == 0) {
            workerManager.killTaskBeforeExecuteByInstanceId(taskInstanceId);
            taskExecutionContext.setCurrentExecutionStatus(TaskExecutionStatus.KILL);
            // todo: the task might be executed, but the processId is 0
            WorkerTaskExecutorHolder.remove(taskInstanceId);
            log.info("The task has not been executed and has been cancelled, task id:{}", taskInstanceId);
            return TaskInstanceKillResponse.success(taskExecutionContext);
        }

        // TODO if the kill succeeded, the WorkerTaskExecutor will notice that the task was killed and report it when it sends its FINISH result
        taskExecutionContext
                .setCurrentExecutionStatus(result ? TaskExecutionStatus.SUCCESS : TaskExecutionStatus.FAILURE);

        WorkerTaskExecutorHolder.remove(taskExecutionContext.getTaskInstanceId());
        // TODO remove any pending retry messages for this task instance
        messageRetryRunner.removeRetryMessages(taskExecutionContext.getTaskInstanceId());
        return TaskInstanceKillResponse.success(taskExecutionContext);
    } finally {
        LogUtils.removeTaskInstanceIdMDC();
        LogUtils.removeTaskInstanceLogFullPathMDC();
    }
}

Killing the local process and the YARN application:

// TODO kill the local process and the YARN/K8s application here
boolean result = doKill(taskExecutionContext);

org.apache.dolphinscheduler.server.worker.runner.operator.TaskInstanceKillOperationFunction#doKill

private boolean doKill(TaskExecutionContext taskExecutionContext) {
    // kill system process
    // TODO kill the process (and its children) associated with the shell task
    boolean processFlag = killProcess(taskExecutionContext.getTenantCode(), taskExecutionContext.getProcessId());

    // TODO kill yarn or k8s application
    try {
        ProcessUtils.cancelApplication(taskExecutionContext);
    } catch (TaskException e) {
        return false;
    }
    return processFlag;
}

org.apache.dolphinscheduler.server.worker.runner.operator.TaskInstanceKillOperationFunction#killProcess

Killing the process and its child processes. Note: this differs slightly from the upstream code; on exception only a warning is logged, because the worker does not always have permission to kill every process.

protected boolean killProcess(String tenantCode, Integer processId) {
    // todo: directly interrupt the process
    if (processId == null || processId.equals(0)) {
        return true;
    }

    try {
        String pidsStr = ProcessUtils.getPidsStr(processId);
        if (!Strings.isNullOrEmpty(pidsStr)) {
            String cmd = String.format("kill -9 %s", pidsStr);
            cmd = OSUtils.getSudoCmd(tenantCode, cmd);
            log.info("process id:{}, cmd:{}", processId, cmd);
            OSUtils.exeCmd(cmd);
        }
    } catch (Exception e) {
        log.warn("kill task error", e);
    }
    return true;
}
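
For illustration, with hypothetical pid values (the exact sudo wrapping depends on OSUtils.getSudoCmd and whether sudo is enabled):

// hypothetical pid list returned by ProcessUtils.getPidsStr(processId): the shell process and its children
String pidsStr = "4321 4322 4323";
String cmd = String.format("kill -9 %s", pidsStr); // "kill -9 4321 4322 4323"
// after OSUtils.getSudoCmd, roughly: "sudo -u <tenantCode> kill -9 4321 4322 4323"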

Killing the application on YARN:
org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils#cancelApplication

public static void cancelApplication(TaskExecutionContext taskExecutionContext) {
    try {
        // TODO k8s
        if (Objects.nonNull(taskExecutionContext.getK8sTaskExecutionContext())) {
            if (!TASK_TYPE_SET_K8S.contains(taskExecutionContext.getTaskType())) {
                // Set empty container name for Spark on K8S task
                applicationManagerMap.get(ResourceManagerType.KUBERNETES)
                        .killApplication(new KubernetesApplicationManagerContext(
                                taskExecutionContext.getK8sTaskExecutionContext(),
                                taskExecutionContext.getTaskAppId(), ""));
            }
        } else {
            // TODO YARN
            String host = taskExecutionContext.getHost();
            String executePath = taskExecutionContext.getExecutePath();
            String tenantCode = taskExecutionContext.getTenantCode();
            List<String> appIds;
            // TODO the failover path takes this branch (appIds were already persisted)
            if (StringUtils.isNotEmpty(taskExecutionContext.getAppIds())) {
                // is failover
                appIds = Arrays.asList(taskExecutionContext.getAppIds().split(COMMA));
            } else {
                String logPath = taskExecutionContext.getLogPath();
                String appInfoPath = taskExecutionContext.getAppInfoPath();
                if (logPath == null || appInfoPath == null || executePath == null || tenantCode == null) {
                    log.error(
                            "Kill yarn job error, the input params is illegal, host: {}, logPath: {}, appInfoPath: {}, executePath: {}, tenantCode: {}",
                            host, logPath, appInfoPath, executePath, tenantCode);
                    throw new TaskException("Cancel application failed!");
                }

                log.info("Get appIds from worker {}, taskLogPath: {}", host, logPath);
                // TODO parse the appIds out of the task log with a regular expression
                appIds = LogUtils.getAppIds(logPath, appInfoPath,
                        PropertyUtils.getString(APPID_COLLECT, DEFAULT_COLLECT_WAY));
                taskExecutionContext.setAppIds(String.join(TaskConstants.COMMA, appIds));
            }

            // TODO reaching here means no appIds were found
            if (CollectionUtils.isEmpty(appIds)) {
                log.info("The appId is empty");
                return;
            }

            ApplicationManager applicationManager = applicationManagerMap.get(ResourceManagerType.YARN);
            applicationManager.killApplication(new YarnApplicationManagerContext(executePath, tenantCode, appIds));
        }
    } catch (Exception e) {
        log.error("Cancel application failed.", e);
    }
}

The appIds are parsed from the task log with a regular expression; by default the "log" way is used rather than "aop", see: https://segmentfault.com/a/1190000044965606

appIds = LogUtils.getAppIds(logPath, appInfoPath,
                            PropertyUtils.getString(APPID_COLLECT, DEFAULT_COLLECT_WAY));

public List<String> getAppIds(String logPath, String appInfoPath, String fetchWay) {
    if (!StringUtils.isEmpty(fetchWay) && fetchWay.equals("aop")) {
        log.info("Start finding appId in {}, fetch way: {} ", appInfoPath, fetchWay);
        // TODO "aop" way: read the appIds from the app-info file written by the AOP interception
        return getAppIdsFromAppInfoFile(appInfoPath);
    } else {
        log.info("Start finding appId in {}, fetch way: {} ", logPath, fetchWay);
        // TODO "log" way: regex-match the appIds from the task log file
        return getAppIdsFromLogFile(logPath);
    }
}
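
YARN application ids follow the pattern application_<clusterTimestamp>_<sequence>. Below is a simplified, self-contained sketch of the log-parsing ("log") way, using a hypothetical log path; the real logic lives in getAppIdsFromLogFile:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppIdLogParseDemo {

    // YARN application ids look like application_<clusterTimestamp>_<sequence>
    private static final Pattern YARN_APP_ID = Pattern.compile("application_\\d+_\\d+");

    public static void main(String[] args) throws IOException {
        // hypothetical task log path for illustration
        String logPath = "/tmp/task_instance_1001.log";

        Set<String> appIds = new LinkedHashSet<>();
        for (String line : Files.readAllLines(Paths.get(logPath))) {
            Matcher matcher = YARN_APP_ID.matcher(line);
            while (matcher.find()) {
                appIds.add(matcher.group());
            }
        }
        System.out.println("appIds found in log: " + appIds);
    }
}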

Actually killing the application on YARN:

applicationManager.killApplication(new YarnApplicationManagerContext(executePath, tenantCode, appIds));

org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager#killApplication

public boolean killApplication(ApplicationManagerContext applicationManagerContext) throws TaskException {
    YarnApplicationManagerContext yarnApplicationManagerContext =
            (YarnApplicationManagerContext) applicationManagerContext;
    String executePath = yarnApplicationManagerContext.getExecutePath();
    String tenantCode = yarnApplicationManagerContext.getTenantCode();
    List<String> appIds = yarnApplicationManagerContext.getAppIds();

    try {
        String commandFile = String.format("%s/%s.kill", executePath, String.join(Constants.UNDERLINE, appIds));
        String cmd = getKerberosInitCommand() + "yarn application -kill " + String.join(Constants.SPACE, appIds);
        execYarnKillCommand(tenantCode, commandFile, cmd);
    } catch (Exception e) {
         log.warn("Kill yarn application {} failed", appIds, e);
    }
    return true;
}

Note that execYarnKillCommand runs "yarn application -kill", and the yarn command may not be on the default PATH, which is why the environment files from ENV_SOURCE_LIST are sourced in the generated script.

private void execYarnKillCommand(String tenantCode, String commandFile,
                                     String cmd) throws Exception {
    StringBuilder sb = new StringBuilder();
    sb.append("#!/bin/sh\n");
    sb.append("BASEDIR=$(cd `dirname $0`; pwd)\n");
    sb.append("cd $BASEDIR\n");

    // TODO ENV_SOURCE_LIST holds the configured environment files, e.g. /etc/profile
    if (CollectionUtils.isNotEmpty(ShellUtils.ENV_SOURCE_LIST)) {
        // TODO source every environment file configured in ENV_SOURCE_LIST in the generated script
        ShellUtils.ENV_SOURCE_LIST.forEach(env -> sb.append("source " + env + "\n"));
    }

    sb.append("\n\n");
    sb.append(cmd);

    File f = new File(commandFile);

    if (!f.exists()) {
        org.apache.commons.io.FileUtils.writeStringToFile(new File(commandFile), sb.toString(),
                StandardCharsets.UTF_8);
    }

    String runCmd = String.format("%s %s", Constants.SH, commandFile);
    runCmd = org.apache.dolphinscheduler.common.utils.OSUtils.getSudoCmd(tenantCode, runCmd);
    log.info("kill cmd:{}", runCmd);
    org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(runCmd);
}
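
For illustration, assuming ENV_SOURCE_LIST contains only /etc/profile and a single hypothetical appId application_1111_0001, the generated <appIds>.kill file would look roughly like this:

#!/bin/sh
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
source /etc/profile

yarn application -kill application_1111_0001

The script is then executed via "sh <executePath>/<appIds>.kill", wrapped with sudo for the tenant user by OSUtils.getSudoCmd when sudo is enabled.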

Summary: if the task is killed successfully, the WorkerTaskExecutor notices it and reports FINISH with the KILL state. If the task has not started executing yet (PID = 0), the task context state is set to KILL, the WorkerTaskExecutor is removed from the thread pool, and its entry is removed from WorkerTaskExecutorHolder.

5、Pause

public class TaskInstancePauseOperationFunction
        implements
            ITaskInstanceOperationFunction<TaskInstancePauseRequest, TaskInstancePauseResponse> {

    @Override
    public TaskInstancePauseResponse operate(TaskInstancePauseRequest taskInstancePauseRequest) {
        try {
            LogUtils.setTaskInstanceIdMDC(taskInstancePauseRequest.getTaskInstanceId());
            log.info("Receive TaskInstancePauseRequest: {}", taskInstancePauseRequest);
            log.warn("TaskInstancePauseOperationFunction is not support for worker task yet!");
            return TaskInstancePauseResponse.success();
        } finally {
            LogUtils.removeTaskInstanceIdMDC();
        }
    }
}

Key point:
For the worker, pause does essentially nothing, and there is not much it could do: can a running task really be paused? Engines like MR, Spark and Flink cannot be paused unless the engine itself provides such control. The real pause logic is to send a notification to the workflow instance telling it to pause, so that the task following the currently running one is not started. There are cases where pausing is impossible: the workflow has only one task, the running task is the last one, or the task finishes so quickly that by the time the pause arrives the workflow is already moving on and there are no downstream tasks left. In all of these cases the workflow simply runs to success.

6、Update workflow instance host

This is part of fault tolerance and will be covered in detail in the fault-tolerance chapter.

If you found this interesting, please like and follow. Thanks!

