Light up ⭐️ Star · Light up the road to open source
GitHub: https://github.com/apache/dolphinscheduler
Version release 2022/8/10
On August 10, 2022, after continuous verification through 3.0.0-alpha, 3.0.0-beta-1, and 3.0.0-beta-2, Apache DolphinScheduler finally ushered in its long-awaited third major version!
The official 3.0.0 release brings the most significant changes of any version to date, adding many new functions and features and aiming to deliver a new experience and more value to users.
The main functions, features, optimizations, and bug fixes of the official 3.0.0 release are roughly the same as those described in the earlier 3.0.0-alpha announcement, summarized by the four keywords: faster, stronger, more modern, and easier to maintain.
This article supplements the new functions and optimizations added since that version.
Keywords: faster, stronger, more modern, easier to maintain
The keywords of 3.0.0 remain unchanged, and we believe you will experience "faster, stronger, more modern, and easier to maintain" for yourself in actual use.
- Faster: the refactored UI not only improves user response speed by dozens of times, but also speeds up developer builds by hundreds of times;
- Stronger: brings many exciting new features, such as data quality assurance, custom time zones, support for many new task types, and multiple alert plugins;
- More modern: beyond being faster, the new UI is more modern in its page layout and icon styles;
- Easier to maintain: splitting the backend services better fits the trend toward containerization and microservices, and clarifies the responsibilities of each service, making maintenance easier.
New functions and new features
The new functions and features described in detail earlier include:
The biggest change in 3.0.0 is the new UI: switching languages no longer reloads the page, and a dark theme has been added. The new UI is built on Vue3, TSX, and Vite. Compared with the old UI, it is not only more modern but also friendlier to operate, and the front end is more robust: problems in the code are caught at compile time and interface parameters are type-checked, making the front-end code sturdier.
In addition, the new architecture and technology stack not only make Apache DolphinScheduler respond dozens of times faster for users, but also make compiling and starting the UI locally hundreds of times faster for developers, greatly shortening the time needed to debug and package code.
New UI experience:
Local startup time-consuming comparison
Project management page
Workflow Definition Page
Shell task page
MySQL data source page
AWS Support
As the Apache DolphinScheduler user base grows, it has attracted many overseas users. However, research into overseas business scenarios found two points that hurt the user experience: time zone handling, and the lack of support for overseas cloud vendors, especially AWS. In this version, we decided to support the most important AWS components, covering the Amazon EMR and Amazon Redshift task types, and extending the resource center to support Amazon S3 storage.
- For Amazon EMR, we created a new task type that provides the Run Job Flow functionality, allowing users to submit multi-step jobs to Amazon EMR and specify the amount of resources to use.
Details can be found here: https://dolphinscheduler.apache.org/zh-cn/docs/latest/user_doc/guide/task/emr.html
Amazon EMR task definition
- For Amazon Redshift, we extended the SQL task type to support Amazon Redshift data sources: users can now select a Redshift data source in a SQL task to run Amazon Redshift tasks.
Amazon Redshift Support
- For Amazon S3, we extended the resource center so that, in addition to local and HDFS storage, it now supports Amazon S3 as the resource center's storage. See `resource.storage.type` in the details: https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/guide/resource.html
We will support more AWS tasks in the future, so stay tuned.
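As a hedged sketch of what enabling S3 might look like in `common.properties` (the key names are as we recall from the resource center documentation; the endpoint and credentials are placeholders, so check the linked docs before use):

```properties
# common.properties -- illustrative fragment, values are placeholders
resource.storage.type=S3
fs.defaultFS=s3a://dolphinscheduler
fs.s3a.endpoint=http://s3.example.com
fs.s3a.access.key=<your-access-key>
fs.s3a.secret.key=<your-secret-key>
```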
Service splitting
The new UI is the biggest front-end change in 3.0.0; the biggest back-end change is the splitting of services. Given the growing popularity of containers and microservices, the Apache DolphinScheduler developers made a major decision: split the backend services. By function, the services are now divided into the following parts:
- master-server: master service
- worker-server: worker service
- api-server: API service
- alert-server: alert service
- standalone-server: standalone service, used to quickly try out DolphinScheduler
- ui: UI resource
- bin: quick-start scripts, mainly the scripts that start each service
- tools: utility scripts, mainly database creation and upgrade scripts
All services can be started or stopped via `bin/dolphinscheduler-daemon.sh`.
Data Quality Assurance
In this version, the data quality feature that the community has been awaiting since 2.0.0 finally lands. It addresses data quality issues such as verifying the number of records synchronized from a source, and alerting when the weekly or monthly average of one or more tables fluctuates beyond a threshold. Previous versions of Apache DolphinScheduler solved the problem of running tasks in a specific order at a specific time, but offered no general way to measure the quality of the resulting data, so users had to pay extra development costs.
Now, 3.0.0 supports data quality natively: users can set up data quality monitoring through simple configuration and verify the accuracy of results while keeping the workflow running.
Task groups
Task groups are mainly used to control the concurrency of task instances and to define priorities within a group. When creating a task definition, the user can assign the task to a task group and configure the task's priority within that group. Once a task belongs to a task group, it will run only when all its upstream tasks have succeeded and the number of tasks currently running in the group is below the group's resource pool size. If the pool is full, the task waits until the next check. When multiple tasks in the same group are queued at the same time, the task with the higher priority runs first.
See the link for details: https://dolphinscheduler.apache.org/zh-cn/docs/3.0.0/user_doc/guide/resource.html
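The admission rule above — run only while the group's running count is below the pool size, highest priority first — can be sketched as a toy Python model (illustrative only; this is not DolphinScheduler's implementation, and the class and method names are invented):

```python
import heapq

class TaskGroup:
    """Toy model of the task-group rule described above (illustrative only)."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.running = set()
        self.waiting = []  # min-heap of (priority, name); lower number = higher priority

    def submit(self, name, priority):
        heapq.heappush(self.waiting, (priority, name))

    def check(self):
        """Admit waiting tasks while the group has free slots, highest priority first."""
        started = []
        while self.waiting and len(self.running) < self.pool_size:
            _, name = heapq.heappop(self.waiting)
            self.running.add(name)
            started.append(name)
        return started

    def finish(self, name):
        self.running.discard(name)

group = TaskGroup(pool_size=2)
group.submit("etl_low", priority=5)
group.submit("etl_high", priority=1)
group.submit("etl_mid", priority=3)
print(group.check())   # ['etl_high', 'etl_mid'] -- pool full, 'etl_low' keeps waiting
group.finish("etl_high")
print(group.check())   # ['etl_low']
```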
Custom time zones
Before version 3.0.0, Apache DolphinScheduler defaulted to the UTC+8 time zone, but as the user base grew, overseas users and those running cross-time-zone businesses were often troubled by this in daily use. With time zone switching in 3.0.0, the problem is easily solved, meeting the needs of overseas users and business partners. For example, if a business spans UTC+8 and UTC-5 and wants to share a single DolphinScheduler cluster, it can create multiple users, each with their own local time zone; the times displayed on the corresponding DolphinScheduler objects then switch to that user's local time, which better matches local developers' habits.
See the link for details: https://dolphinscheduler.apache.org/zh-cn/docs/3.0.0/user_doc/guide/howto/general-setting.html
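The time zone behavior above can be illustrated with plain Python (this is not DolphinScheduler code, just standard-library time zone arithmetic showing how one stored instant renders for a UTC+8 user and a UTC-5 user):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One scheduled instant, stored once (e.g. in UTC)
run_time = datetime(2022, 8, 10, 0, 0, tzinfo=timezone.utc)

# A user in UTC+8 and a user in UTC-5 each see their own local wall-clock time.
# Note the POSIX sign convention: Etc/GMT-8 means UTC+8, Etc/GMT+5 means UTC-5.
print(run_time.astimezone(ZoneInfo("Etc/GMT-8")).strftime("%Y-%m-%d %H:%M"))  # 2022-08-10 08:00
print(run_time.astimezone(ZoneInfo("Etc/GMT+5")).strftime("%Y-%m-%d %H:%M"))  # 2022-08-09 19:00
```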
Task Definition List
Before Apache DolphinScheduler 3.0.0, to operate on a task a user first had to find the workflow it belonged to and locate the task inside that workflow before editing it. As the number of workflows grows, or a single workflow accumulates many tasks, finding the right task becomes very painful, which runs against the ease of use Apache DolphinScheduler pursues. Therefore, 3.0.0 adds a task definition page that lets users quickly locate tasks by name and modify tasks in batches.
See the link for details: https://dolphinscheduler.apache.org/zh-cn/docs/3.0.0/user_doc/guide/project/task-instance.html
New alert type support
In 3.0.0, the alert types have also been expanded: we added support for Telegram and WebexTeams alerts.
Python API new features
In 3.0.0, the biggest change to the Python API is that the PythonGatewayServer has been integrated into the api-server service and renamed PythonGatewayService. It now starts by default together with the api-server; if you do not want to start it, set `python-gateway.enabled` to `false` in `application.yaml`.
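As a hedged sketch, the toggle might look like this in the api-server's `application.yaml` (only the `python-gateway.enabled` key comes from the release notes; the surrounding layout is assumed Spring-style YAML):

```yaml
# application.yaml (api-server) -- illustrative fragment
python-gateway:
  enabled: false
```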
Additionally, the Python API adds CLI and configuration modules. The configuration module lets users change the Python API's defaults, such as the default workflow user or worker group; values can be changed through environment variables, direct file edits, or dynamically from Python.
```shell
# environment variables
export PYDS_JAVA_GATEWAY_ADDRESS="192.168.1.1"
export PYDS_WORKFLOW_USER="custom-user"

# file change: edit ~/pydolphinscheduler/config.yaml directly

# CLI
pydolphinscheduler config --set java_gateway.address 192.168.1.1
pydolphinscheduler config --set java_gateway.address 192.168.1.1 --set java_gateway.port 25334
```
Currently, the CLI has only two subcommands, `version` and `config`, used to check the current version and to query or modify configuration values. More functionality will be introduced in the future to make operating DolphinScheduler from the command line easier.
```shell
# version
pydolphinscheduler version
# 3.0.0

# config
pydolphinscheduler config --get java_gateway.address --get java_gateway.port
# The output looks like:
# java_gateway.address = 127.0.0.1
# java_gateway.port = 25333
pydolphinscheduler config --set java_gateway.address 192.168.1.1 --set java_gateway.port 25334
```
It is also worth noting that the Python API now supports creating and uploading resource center files for easier resource management, supports giving different workflows in the same project different names, and adds integration tests to make testing more convenient.
Function and feature updates not announced in earlier releases
Support for Flink task types
In this release, we extended the Flink task type to support running Flink SQL tasks, which are submitted via `sql-client.sh`. Previous versions only supported submitting tasks through the Flink CLI, which required the resource center: the user uploaded a resource file, then referenced it on the task definition page, which is unfriendly to versioning and user transparency. As Flink SQL gradually becomes mainstream among Flink users, and writing SQL directly on the editing page is more transparent, we adopted the Flink SQL support contributed by the community; from 3.0.0 onward, users can use Flink tasks more conveniently.
For more details, see: [Flink SQL Client](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sqlclient/)
Corresponding PR: https://github.com/apache/dolphinscheduler/pull/9840
Added Zeppelin task type
In this release, we added the Zeppelin task type for creating and executing Zeppelin tasks. When a worker executes such a task, it triggers a Zeppelin notebook paragraph through the Zeppelin Client API.
Corresponding PR: https://github.com/apache/dolphinscheduler/pull/9810
Bash parameter passing
The new version also adds passing parameters through Bash. If you want to export parameters to downstream tasks using Bash variables instead of constant values, you can do so with `setValue` plus a Bash variable, which is more flexible and lets you set variables from dynamically obtained local or HTTP resources. Syntax like the following can be used:

```shell
lines_num=$(wget https://raw.githubusercontent.com/apache/dolphinscheduler/dev/README.md -q -O - | wc -l | xargs)
echo "#{setValue(set_val_var=${lines_num})}"
```
Allow users to upload files without a suffix
Previously, the resource center could only upload files with a suffix. Since 3.0.0, files without a suffix can be uploaded as well.
Other functional enhancements
In addition to the new functions above, version 3.0.0 also includes many smaller enhancements, such as refactoring the task plugin and data source plugin modules to make extension easier, restoring support for Spark SQL, and making the E2E tests fully compatible with the new UI.
Major optimizations
- Task back-end plugin optimization: new plugins only need to modify the plugin's own modules
- Validate the start and end time when creating or submitting a cron under a workflow
- Dependent tasks can select the global project when adding dependencies
- AlertSender optimization and graceful-shutdown optimization, similar to MasterServer
- Add a slot condition to database queries to reduce the number of returned records
- Thinner dist package by migrating python gateway to apiserver
- [python] Migrate pythonGatewayServer to api server
- [python] add missing config and connect remote server documentation
- [Master/Worker] Change task ack to run callback
- [Master] Add task event thread pool
Major bug fixes
- Fix the problem that creating tenants with S3a Minio fails
- Fix the problem of busy text files
- Fix duplicate authorized projects being generated when authorizing a project
- Fix an issue where starting the server fails due to not being able to connect to postgresql
- Fix the problem that the message shows that the data source plugin "Spark" cannot be found
- Fix the problem that the built-in parameters of the commands generated by MapReduce are in the wrong position
- Solve the problem that the queue is invalid in ProcessDefinition after changing the parameter user
- Resolve processes that use dependent components cannot be migrated between test and production environments
- Resolved an issue with resource file deletion conditions
- Fixed an issue that affected the original node data when editing the form of a copied node
- Resolved an issue where Worker resources were exhausted and caused downtime
- Fixed an issue where certain types of alerts would not display item names
- Fix problems with each 3.0.0 deployment method
- Fix the page error reported when the task group is empty
- Fix the treemap view depth error
- Unclear alert information: the error message was unclear when the alert group was empty or when batch workflow deletion failed, and the tenant error message was wrong
- Parameter validation problems: parameter validation in the data source center, inconsistent passwords when changing a password, and validation of the alert script before an alert is sent
- Python API: the release state could not be set, and local parameters with values failed validation
- Fix token queries not following the time zone
- Fix HTTPS and HTTP string recognition issues
- Fix alert server health monitoring failure problem
- Fix condition task branch failure problem
- Fix the problem that the docker image does not support multiple platforms
- Fix the problem that the database cannot be written correctly when the workflow with task group priority is created
- Fix master task invalidation
- Fix serial-wait tasks not running
- Time zone problems: fix the scheduling time zone error and add time zone support to logs
- Fix failures when re-running or pausing workflow instances
- Resource Center instantiation failure problem
- Fix the problem of dividing line in email alert template
- Fix data initialization problem in Standalone mode
- Fixed the page display problem when the monitoring center DB does not exist
- Fix the issue of invalid creation workflow parameters
- Fixed the abnormal problem of zookeeper port during K8S deployment
- Fix the problem that the service fails to start in Standalone mode
- Fix LDAP login failure problem
- Python api: fix the problem that the task component names of different workflows under the same project do not support the same name
- Python api: fix SQL task component SQL type error problem
- Fix the abnormal problem of resource file renaming form
- Fix the problem of getting the executable time of the workflow according to the timing settings
- Upgraded module dependencies such as Logback and Log4j
- Fix task failure issue
- Fixed HDFS NPE issue
- Fix the problem of master deadlock caused by task group exception
- Fixed a number of stability issues
Documentation changes
- Correct deployment documentation
- Repaired and updated documentation: the WebexTeams Chinese docs; local parameter and global parameter docs; the Kubernetes FAQ; Spark notes; DataX usage; removed the Flink API docs; fixed open-api errors and wrong data quality docs; added a doc on switching databases in standalone mode; added a doc on checking Yarn run status in Shell tasks; updated system screenshots; parameter passing, global parameter, and parameter priority docs; the alert component wizard; Telegram and DingTalk alert docs; the alert FAQ; the Shell component doc; the Switch task component doc; resource center configuration details; and the workflow backfill doc
- Corrected some development documents: clarified the supported operating systems, fixed the development environment setup doc, and added a doc on building your own Docker image
Release note
GitHub: https://github.com/apache/dolphinscheduler/releases/tag/3.0.0
Download: https://dolphinscheduler.apache.org/en-us/download/download.html
Thanks to contributors
Aaron Lin, Amy0104, Assert, BaoLiang, Benedict Jin, BenjaminWenqiYu, Brennan Fox, Dannila, Desperado2, Devosend, DingPengfei, DuChaoJiaYou, Edward Yang, Eric Gao, Frank Chen, GaoTianDuo, HanayoZz, HeChuan, HomminLee, Hua Jiang, Hwting, Ivan0626, Jeff Zhan, Jiajie Zhong, JieguangZhou, Jiezhi.G, JinYong Li, J·Y, Kerwin, Kevin.Shin, KingsleyY, Kirs, KyoYang, LinKai, LiuBodong, LongJGun, Luke Yan, Lyle Shaw, Manhua, Martin Huang, Maxwell, Molin Wang, Mr.An, OS, PJ Fanning, Paul Zhang, QuakeWang, ReonYu, SbloodyS, Sheldon, Shiwen Cheng, ShuiMuNianHuaLP, ShuoTiann, SongTao Zhuang, Stalary, Sunny Lei, Tom, Town, Tq, WangJPLeo, Wenjun Ruan, X&Z , XiaochenNan, Yanbin Lin, Yao WANG, Yiming Guo, Zonglei Dong, aCodingAddict, aaronlinv, aiwenmo, caishunfeng, calvin, calvinit, cheney, chouc, chuxing, czeming, devosend, exmy, gaojun2048, guodong, guoshupei, hjli, hstdream, huangxiaohai , janeHe13, jegger, jiachuan.zhu, jon-qj, juzimao, kezhenxu94, labbomb, leiwingqueen, lgcareer, lhjzmn, lidongdai, lifeng, lilyzhou, litiliu, liubo1990, liudi1184, longtb, lvshaokang, lyq, mans2singh, mask, mazhong, mgduoduo , m yangle1120, naziD, nobolity, ououtt, ouyangyewei, pinkhello, qianli2022, qinchaofeng, rickchengx, rockfang, ronyang1985, seagle, shuai hou, simsicon, sneh-wha, songjianet, sparklezzz, springmonster, sq-q, syyangs799, uh001, wangbowen, wangqiang ,wangxj3,wangyang,wangyizhi,wind,worry,wqxs,xiangzihao,xiaodi wang,xiaoguaiguai,xuhhui,yangyunxi,yc322,yihong,yimaixinchen,youzipi,zchong,zekai-li,zhang,zhangxinruu,zhanqian,zhuxt2015,zixi0825,zwZjut, Tian Chou, Xiao Zhang, Hong Shu, Zhang Junjie, Xu Xu, Shi Shi, Wang Yang, Wang Qiang, Bai Sui, Autumn, Luo Mingtao, Ah Fu Chris, Chen Jiaming, Chen Shuang, Feixia is picturesque
Participate and contribute
With the rapid rise of open source in China, the Apache DolphinScheduler community is flourishing. To make scheduling more usable and easier to use, we sincerely welcome partners who love open source to join the community, contribute to the rise of open source in China, and help local open source go global.
There are many ways to participate in the DolphinScheduler community, including:
Contributing your first PR (documentation or code). We hope the first PR is simple: it is a way to get familiar with the submission process and community collaboration, and to feel the friendliness of the community.
The community has put together the following list of issues for newbies: https://github.com/apache/dolphinscheduler/issues/5689
List of non-novice issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22
How to contribute link: https://dolphinscheduler.apache.org/en-us/community/development/contribute.html
Come on! The DolphinScheduler open source community needs your participation to help open source in China rise; even a single small tile adds to a huge combined force.
Participating in open source lets you learn from experts up close and improve your skills quickly. If you want to contribute, we have a contributor seed incubation group (Leonard-ds) that will guide you hands-on (contributors of any level are welcome, and every question gets an answer; the key is the willingness to contribute).
When adding the assistant, please note that you want to participate in contributing.
Come on, the open source community is looking forward to your participation!
Recommended activities
In August, the Apache DolphinScheduler community and the Apache Kylin community will jointly hold a Meetup themed "Building the Big Data Foundation and Helping Enterprises' Digital Transformation"! We are fortunate to have invited senior big data engineers and developers from companies such as Yili, T3 Travel, Beluga Open Source, and the Apache Kylin community to discuss topics such as data analysis engines, data scheduling, digital transformation, and development practices in the two open source projects.