Preface
- Testing initially used MySQL 8 as the database; as of 2021-05-13 the problem encountered with it was still unresolved in airflow 2.0.2, so the setup was switched to PostgreSQL 12.
- airflow is a DAG (directed acyclic graph) based task management system; roughly speaking, it is an advanced version of crontab.
- Airflow solves the task dependency problem that crontab cannot solve.
- airflow basic architecture
- airflow + celery architecture
Environment and components
- Ubuntu 20.04
- Python 3.8 (Anaconda3-2020.11-Linux-x86_64)
- PostgreSQL 12.6
- apache-airflow 2.0.2
- celery 4.4.7
Cluster planning
- node0: webserver, scheduler, flower
- node1 / node2: celery worker
Installation steps
Create an account (node0/node1/node2)
```bash
sudo useradd airflow -m -s /bin/bash
sudo passwd airflow
```
Switch account (node0/node1/node2)
```bash
su airflow
```
Configure Anaconda environment variables (node0/node1/node2)
```bash
# /home/airflow/.bashrc
export PATH=/home/airflow/anaconda3/bin:$PATH
```
Upgrade pip (node0/node1/node2)
```bash
pip install pip --upgrade -i https://mirrors.aliyun.com/pypi/simple/
```
Configure pip domestic mirror (node0/node1/node2)
```bash
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
```
Install airflow (node0/node1/node2); extras reference: https://airflow.apache.org/docs/apache-airflow/2.0.2/extra-packages-ref.html
```bash
# full bundle (master)
pip3 install "apache-airflow[all]~=2.0.2"
# OR install selectively
pip3 install "apache-airflow[async,postgres,mongo,redis,rabbitmq,celery,dask]~=2.0.2"
```
Add PATH environment variable for airflow (node0/node1/node2)
```bash
# Append the following to the end of /home/airflow/.bashrc:
export PATH=/home/airflow/.local/bin:$PATH
```
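For the new PATH to take effect in the current session, the shell configuration can be reloaded, for example:

```bash
# reload the shell configuration so the updated PATH is picked up
source ~/.bashrc
```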
Check the airflow version and create the home directory of airflow (node0/node1/node2)
```bash
# the home directory defaults to ~/airflow
airflow version
```
Set Ubuntu system time zone (node0/node1/node2)
```bash
timedatectl set-timezone Asia/Shanghai
```
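Running `timedatectl` without arguments afterwards shows the active time zone, so the change can be confirmed, for example:

```bash
# show the current system time and time zone (should report Asia/Shanghai)
timedatectl
```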
Modify the time zone in airflow (/home/airflow/airflow/airflow.cfg) (node0/node1/node2)
```ini
[core]
# change to system or Asia/Shanghai
default_timezone = system
```
- At this point, the installation is complete
PostgreSQL configuration
Create database
```sql
CREATE DATABASE airflow_db;
```
Create user
```sql
CREATE USER airflow_user WITH PASSWORD 'airflow_pass';
GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;
```
Modify PostgreSQL connection (/home/airflow/airflow/airflow.cfg) (node0/node1/node2)
```ini
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@192.168.x.y/airflow
```
Initialize the database table (node0)
```bash
airflow db init
```
- Check whether the database is initialized successfully
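One way to verify, for example, is to run `airflow db check` on node0 and to list the tables with psql against the database referenced by `sql_alchemy_conn`:

```bash
# verify that airflow can reach the configured metadata database
airflow db check

# list the tables created by the initialization (dag, task_instance, ab_user, ab_role, ...),
# using the connection details from sql_alchemy_conn
psql -h 192.168.x.y -U airflow -d airflow -c '\dt'
```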
WEB UI login
Create an administrator user (node0)
```bash
# role table: ab_role
# user table: ab_user
# create a user with the Admin role
airflow users create \
    --lastname user \
    --firstname admin \
    --username admin \
    --email walkerqt@foxmail.com \
    --role Admin \
    --password admin123
# create a user with the Viewer role
airflow users create \
    --lastname user \
    --firstname view \
    --username view \
    --email walkerqt@163.com \
    --role Viewer \
    --password view123
```
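The accounts just created can be listed back, for example:

```bash
# list all Web UI accounts and their roles
airflow users list
```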
Start webserver (node0)
```bash
airflow webserver -p 8080
```
- Log in with the created account in the browser
Configure CeleryExecutor
- Official documentation: https://docs.celeryproject.org/en/v4.4.7/userguide/configuration.html
- Create an airflow account on RabbitMQ and assign a virtual host
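A minimal sketch of that RabbitMQ step with rabbitmqctl, assuming the user `airflow`, password `mq_pwd` and virtual host `vhost_airflow` that appear in the broker_url below:

```bash
# create the RabbitMQ user and virtual host used by airflow,
# then grant the user full permissions on that virtual host
rabbitmqctl add_user airflow mq_pwd
rabbitmqctl add_vhost vhost_airflow
rabbitmqctl set_permissions -p vhost_airflow airflow ".*" ".*" ".*"
```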
Modify the configuration file (/home/airflow/airflow/airflow.cfg) (node0/node1/node2)
```ini
[core]
executor = CeleryExecutor

[celery]
broker_url = amqp://airflow:mq_pwd@192.168.y.z:5672/vhost_airflow
result_backend = db+postgresql://airflow:airflow@192.168.x.y/airflow
```
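To sanity-check that the new executor is actually picked up from airflow.cfg (assuming the `airflow config get-value` subcommand of the 2.0.x CLI):

```bash
# print the executor airflow resolves from the configuration
airflow config get-value core executor
# expected output: CeleryExecutor
```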
Test Case
Create a test script (/home/airflow/airflow/dags/send_msg.py) on node0/node1/node2 that sends the local IP to enterprise WeChat.
```python
# encoding: utf-8
# author: qbit
# date: 2021-05-13
# summary: send/dispatch tasks to worker nodes
import os
import time
import json
import psutil
import requests
from datetime import timedelta
from airflow.utils.dates import days_ago
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator


def GetLocalIPByPrefix(prefix):
    r"""
    With multiple NICs, return the IP address matching a prefix.
    Tested on Windows and Linux with Python 3.6.x, psutil 5.4.x.
    Works for both IPv4 and IPv6 addresses.
    Note: if several IPs share the same prefix, only one of them is returned.
    """
    localIP = ''
    dic = psutil.net_if_addrs()
    for adapter in dic:
        snicList = dic[adapter]
        for snic in snicList:
            if not snic.family.name.startswith('AF_INET'):
                continue
            ip = snic.address
            if ip.startswith(prefix):
                localIP = ip
    return localIP


def send_msg(msg='default msg', **context):
    r""" Send a message to enterprise WeChat """
    print(context)
    run_id = context['run_id']
    nowTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    message = '%s\n%s\n%s_%d\n%s' % (
        run_id, nowTime, GetLocalIPByPrefix('192.168.'), os.getpid(), msg)
    print(message)
    '''
    sending code (involves account credentials, omitted here)
    '''


default_args = {
    'owner': 'qbit',
    # depends_on_past: whether this run depends on the previous one.
    # If True, the previous DAG run must have succeeded before this one can run.
    'depends_on_past': False
}
with DAG(dag_id='send_msg',
         default_args=default_args,
         start_date=days_ago(1),
         schedule_interval=timedelta(seconds=60),
         # catchup: whether to backfill runs from start_date up to now
         catchup=False,
         tags=['qbit']
         ) as dag:

    first = PythonOperator(
        task_id='send_msg_1',
        python_callable=send_msg,
        op_kwargs={'msg': '111'},
        provide_context=True,
        dag=dag,
    )

    second = PythonOperator(
        task_id='send_msg_2',
        python_callable=send_msg,
        op_kwargs={'msg': '222'},
        provide_context=True,
        dag=dag,
    )

    third = PythonOperator(
        task_id='send_msg_3',
        python_callable=send_msg,
        op_kwargs={'msg': '333'},
        provide_context=True,
        dag=dag,
    )

    [third, first] >> second
```
View dag information (node0)
```bash
# print all DAGs that are currently active
$ airflow dags list

# print all tasks in the 'send_msg' DAG
$ airflow tasks list send_msg
[2021-05-13 16:00:47,123] {dagbag.py:451} INFO - Filling up the DagBag from /home/airflow/airflow/dags
send_msg_1
send_msg_2
send_msg_3

# print the task hierarchy of the 'send_msg' DAG
$ airflow tasks list send_msg --tree
```
Test a single task (node0)
```bash
airflow tasks test send_msg send_msg_1 20210513
```
Test a single dag (node0)
```bash
airflow dags test send_msg 20210513
```
Cluster test
```bash
# node0
airflow webserver -p 8080
airflow scheduler
airflow celery flower  # default port 5555

# node1/node2
airflow celery worker
# start the worker with an explicit hostname
airflow celery worker --celery-hostname node1
```
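If the services should keep running after the terminal is closed, the same subcommands can be started in the background (a sketch assuming the `-D/--daemon` flag of these subcommands):

```bash
# run the components in daemon mode on node0
airflow webserver -p 8080 -D
airflow scheduler -D

# run the worker in daemon mode on node1/node2
airflow celery worker -D
```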
References
- Related reading: Airflow 2.0.0 + Celery cluster building
- Compare CeleryExecutor and DaskExecutor in airflow: A Gentle Introduction To Understand Airflow Executor
This article is from qbit snap