# Oozie Overview
[TOC]
Scheduling frameworks: Linux crontab, Azkaban, Oozie, Zeus
## Introduction
Oozie is a workflow scheduling system:

- Workflows are scheduled as DAGs (directed acyclic graphs)
- Scalable: Oozie launches each action as a MapReduce job that consists of a single map task, with no reduce phase
- Reliable: failed tasks can be retried
- Integrates other Hadoop ecosystem jobs, such as MapReduce, Pig, Hive, Sqoop, and Spark
## Main components
- Tomcat (servlets handle requests and render jobs in the web UI)
- Database (stores job definitions and state)
- Bundle, Coordinator, and Workflow engines
## Architecture

### Three service modules
- Oozie V3: a server-based Bundle engine: wraps multiple coordinators so that a group of coordinator jobs can be started, stopped, suspended, killed, and restarted together
- Oozie V2: a server-based Coordinator engine: runs multiple workflows; structure: start -> workflows -> end
- Oozie V1: a server-based Workflow engine; structure: start -> mr -> pig -> fork -> mr/hive -> join -> end
### Workflow

### Coordinator
## Pitfalls encountered

Error: `E0505: App definition [hdfs://localhost:8020/tmp/oozie-app/coordinator/] does not exist`

This error message is misleading: it turned out the directory was fine; the real problem was that the coordinator.xml file was misnamed.
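A quick way to rule the path in or out (adjust the URI to your cluster):

```bash
# The directory must contain a file named exactly coordinator.xml
hdfs dfs -ls hdfs://localhost:8020/tmp/oozie-app/coordinator/
```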
## Preparation: unify the time zone

Using China Standard Time (GMT+0800) is recommended. On the server, run `date -R`. If the output looks like the following, the server is already on GMT+0800:

```
Sat, 30 Sep 2017 10:26:58 +0800
```

If not, set the time zone, typically to Beijing/Shanghai:

```bash
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
```
Next, edit oozie-site.xml and add this property if it is not already present:

```xml
<property>
    <name>oozie.processing.timezone</name>
    <value>GMT+0800</value>
</property>
```

Then restart Oozie so the change takes effect and the web console also displays times correctly.
## Examples
### Spark Action

#### Workflow: Spark on YARN

File layout:
```
├── ooziespark
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml
```
workflow.xml
```xml
<?xml version="1.0" encoding="utf-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="SparkWordCount">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputdir}"/>
            </prepare>
            <master>${master}</master>
            <name>Spark-Wordcount</name>
            <class>WordCount</class>
            <jar>${nameNode}/user/LJK/ooziespark/lib/spark-1.6.2-1.0-SNAPSHOT.jar</jar>
            <spark-opts>--driver-memory 512M --executor-memory 512M</spark-opts>
            <arg>${inputdir}</arg>
            <arg>${outputdir}</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```
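Before uploading, the definition can be sanity-checked against the Oozie XML schema with the CLI (run on the local copy of the file):

```bash
oozie validate workflow.xml
```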
job.properties
```properties
nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
oozie.wf.application.path=/user/LJK/ooziespark
#oozie.coord.application.path=${nameNode}/user/LJK/ooziespark
#start=2017-09-28T17:00+0800
#end=2017-09-30T17:00+0800
#workflowAppUri=${nameNode}/user/LJK/ooziespark/
```
Package the program and copy the jar into the app's lib directory. The test source code is as follows:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // .setJars(List("/Users/LJK/Documents/code/github/study-spark1.6.2/target/spark-1.6.2-1.0-SNAPSHOT.jar"))
      // .set("spark.yarn.historyServer.address", "rm:18080")
      // .set("spark.eventLog.enabled", "true")
      // .set("spark.eventLog.dir", "hdfs://nn1:8020/spark-history")
      .set("spark.testing.memory", "1073741824")
    val sc = new SparkContext(conf)

    // Classic word count: split on spaces, count each word, write results out
    val rdd = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    rdd.saveAsTextFile(args(1))
    sc.stop()
  }
}
```
Upload the directory to HDFS: `hdfs dfs -put ooziespark /user/LJK/`

Note that job.properties does not need to be uploaded, because the CLI reads the local copy rather than the one on HDFS.
Start the job:

```bash
oozie job -oozie http://rm:11000/oozie -config /usr/local/share/applications/ooziespark/job.properties -run
```

or, in the shorter form:

```bash
oozie job -config /usr/local/share/applications/ooziespark/job.properties -run
```

The short form requires the environment variable `OOZIE_URL`, which is used as the default value for the `-oozie` option; see `oozie help` for details.
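For example, setting it once in the shell profile:

```bash
# With OOZIE_URL set, the -oozie option can be omitted from every command
export OOZIE_URL=http://rm:11000/oozie
```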
You can then watch the job's execution in the Oozie web console.
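The same information is available from the CLI; a sketch with a placeholder job ID (the real ID is printed at submission time):

```bash
oozie job -info 0000001-170930123456789-oozie-oozi-W   # per-action status
oozie job -log 0000001-170930123456789-oozie-oozi-W    # aggregated job log
```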
#### Coordinator: Spark on YARN

A simple schedule: run WordCount every five minutes.

File layout:
```
├── ooziecoor
│   ├── coordinator.xml
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml
```
coordinator.xml
```xml
<coordinator-app name="cron-coord" frequency="${coord:minutes(5)}" start="${start}" end="${end}"
                 timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```
Modify the earlier job.properties to:
```properties
nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
#oozie.wf.application.path=/user/LJK/ooziespark
oozie.coord.application.path=${nameNode}/user/LJK/ooziecoor
start=2017-09-30T09:30+0800
end=2017-09-30T17:00+0800
workflowAppUri=${nameNode}/user/LJK/ooziecoor
```
The earlier workflow.xml can be reused as-is without changing the jar path, but to keep each app self-contained it is cleaner to point the jar path at this app's own lib directory.
Upload the directory to HDFS and run: `oozie job -config /usr/local/share/applications/ooziecoor/job.properties -run`

You can view the job in the web console.
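From the CLI, coordinator jobs can be listed and inspected as well (the job ID below is a placeholder):

```bash
oozie jobs -jobtype coordinator
oozie job -info 0000002-170930123456789-oozie-oozi-C
```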
#### Bundle: Spark on YARN

File layout:
```
├── ooziebundle
│   ├── bundle.xml
│   ├── coordinator.xml
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml
```
Add bundle.xml:
```xml
<bundle-app name='bundle-app' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns='uri:oozie:bundle:0.1'>
    <coordinator name='coord-1'>
        <app-path>${nameNode}/user/LJK/ooziebundle/coordinator.xml</app-path>
        <configuration>
            <property>
                <name>start</name>
                <value>${start}</value>
            </property>
            <property>
                <name>end</name>
                <value>${end}</value>
            </property>
        </configuration>
    </coordinator>
</bundle-app>
```
Modify job.properties:
```properties
nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
#oozie.wf.application.path=/user/LJK/ooziespark
#oozie.coord.application.path=${nameNode}/user/LJK/ooziecoor
oozie.bundle.application.path=${nameNode}/user/LJK/ooziebundle
start=2017-09-30T09:30+0800
end=2017-09-30T17:00+0800
workflowAppUri=${nameNode}/user/LJK/ooziebundle
```
Upload to HDFS and run: `oozie job -config /usr/local/share/applications/ooziebundle/job.properties -run`

View the job in the web console.
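One of the points of a bundle is lifecycle control over the whole group; with the bundle job ID (a placeholder below), it can be suspended, resumed, or killed in one shot:

```bash
oozie job -suspend 0000003-170930123456789-oozie-oozi-B
oozie job -resume 0000003-170930123456789-oozie-oozi-B
oozie job -kill 0000003-170930123456789-oozie-oozi-B
```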
### Java Action

File layout (lib holds many individual jars rather than one fat jar, so its contents are omitted; building a single jar is also an option):
```
javaExample/
├── job.properties
├── lib
└── workflow.xml
```
Note: if you use the Spring Boot framework, add an exclusion to the pom as below, otherwise a jar conflict will make Oozie fail:
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
    <exclusions>
        <exclusion>
            <artifactId>spring-boot-starter-logging</artifactId>
            <groupId>org.springframework.boot</groupId>
        </exclusion>
    </exclusions>
</dependency>
```
workflow.xml
```xml
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="java-2d81"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="java-2d81">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.sharing.App</main-class>
            <arg>hello</arg>
            <arg>springboot</arg>
        </java>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
```
job.properties
```properties
oozie.use.system.libpath=false
queueName=default
jobTracker=rm.ambari:8050
nameNode=hdfs://nn1.ambari:8020
oozie.wf.application.path=${nameNode}/user/LJK/javaExample
```
Java program source:
```java
package com.sharing;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class App {
    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
        // args come from the <arg> elements in workflow.xml
        System.out.println(args[0] + " " + args[1]);
    }
}
```
### Shell Action

File layout:
```
shell
├── job.properties
└── workflow.xml
```
workflow.xml
```xml
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-2504"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-2504">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>hello shell</argument>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
```
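An aside on `<capture-output/>`: Oozie only captures the action's stdout when it is emitted in Java properties format, so a script that wants to hand values to later actions would print key=value lines, roughly like this (the key name here is hypothetical):

```bash
#!/bin/sh
# Each key=value line on stdout becomes readable downstream via
# ${wf:actionData('shell-2504')['greeting']}
echo "greeting=hello shell"
```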
job.properties
```properties
hue-id-w=50057
jobTracker=rm.ambari:8050
mapreduce.job.user.name=admin
nameNode=hdfs://nn1.ambari:8020
oozie.use.system.libpath=True
oozie.wf.application.path=hdfs://nn1.ambari:8020/user/LJK/shell
user.name=admin
```
### Hive Action

File layout:
```
hiveExample/
├── hive-site.xml
├── input
│   └── inputdata
├── job.properties
├── output
├── script.q
└── workflow.xml
```
Hive script: write a Hive script (the file name is up to you). Contents of script.q:
```sql
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;
```
workflow.xml
```xml
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-bfbc"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive-bfbc" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/LJK/hiveExample/output"/>
                <mkdir path="${nameNode}/user/LJK/hiveExample/output"/>
            </prepare>
            <job-xml>/user/LJK/hiveExample/hive-site.xml</job-xml>
            <script>/user/LJK/hiveExample/script.q</script>
            <param>INPUT=/user/LJK/hiveExample/input</param>
            <param>OUTPUT=/user/LJK/hiveExample/output</param>
        </hive>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
```
job.properties
```properties
hue-id-w=50059
jobTracker=rm.ambari:8050
mapreduce.job.user.name=admin
nameNode=hdfs://nn1.ambari:8020
oozie.use.system.libpath=True
oozie.wf.application.path=hdfs://nn1.ambari:8020/user/LJK/hiveExample
user.name=admin
```
A data file (any name) must be placed under hdfs://nn1.ambari:8020/user/LJK/hiveExample/input. Contents of inputdata:
```
1
2
3
4
6
7
8
9
```
After the job succeeds, the output directory contains a file 000000_0 whose contents match inputdata.
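To verify the result from the command line (paths taken from the example above):

```bash
hdfs dfs -ls /user/LJK/hiveExample/output
hdfs dfs -cat /user/LJK/hiveExample/output/000000_0
```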
### Hive2 Action

Basically the same as the Hive action; only workflow.xml needs to change:
```xml
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="hive2-8f27"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive2-8f27" cred="hive2">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/LJK/hiveExample/output"/>
                <mkdir path="${nameNode}/user/LJK/hiveExample/output"/>
            </prepare>
            <job-xml>/user/LJK/hiveExample/hive-site.xml</job-xml>
            <jdbc-url>jdbc:hive2://rm.ambari:10000/default</jdbc-url>
            <script>/user/LJK/hiveExample/script.q</script>
            <param>INPUT=/user/LJK/hiveExample/input</param>
            <param>OUTPUT=/user/LJK/hiveExample/output</param>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
```
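As an optional sanity check before running the workflow, the JDBC URL used above can be tested directly with Beeline (assuming HiveServer2 is reachable from your shell):

```bash
beeline -u jdbc:hive2://rm.ambari:10000/default -e "SHOW TABLES;"
```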