
Oozie Overview

[TOC]

Scheduling frameworks: Linux Crontab, Azkaban, Oozie, Zeus

A comparison of three task scheduling systems

Introduction

Oozie is a workflow scheduler system.

  • Workflows are scheduled as DAGs (directed acyclic graphs)
  • Scalable: an Oozie job itself runs as a MapReduce job, but map-only, with no reduce phase
  • Reliable: failed tasks can be retried
  • Integrates other jobs in the Hadoop ecosystem, such as MapReduce, Pig, Hive, Sqoop, and Spark

Main components

  • Tomcat (a servlet handles requests and displays jobs in the web UI)
  • A database (stores the jobs)
  • The Bundle, Coordinator, and Workflow engines

Architecture diagram

(image: Oozie architecture diagram)

Three server engines

  • Oozie V3: a server based Bundle engine: wraps multiple coordinators, so a group of coordinator jobs can be started, stopped, suspended, killed, and restarted together
  • Oozie V2: a server based Coordinator engine: runs multiple workflows on a schedule; structure: start -> workflows -> end
  • Oozie V1: a server based workflow engine; structure: start -> mr -> pig -> fork -> mr/hive -> join -> end

workflow

(images: workflow structure diagrams)

coordinator

A pitfall I ran into:

Reported error:
Error: E0505 : E0505: App definition [hdfs://localhost:8020/tmp/oozie-app/coordinator/] does not exist
This error message is quite misleading: the directory was actually correct; the real problem was that the coordinator definition file was misnamed (it must be coordinator.xml).
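A quick way to catch this is to list the application directory and confirm the file name (a minimal check, using the path from the error message above):

# the coordinator definition must be named exactly coordinator.xml
hdfs dfs -ls hdfs://localhost:8020/tmp/oozie-app/coordinator/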


Preparation: unify time zones

I recommend using UTC+8 (GMT+0800).

On the server, run date -R. If it prints something like the following, the machine is already on UTC+8:
Sat, 30 Sep 2017 10:26:58 +0800
If not, set the time zone, usually to Beijing or Shanghai:
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Next, edit oozie-site.xml; if the following property is missing, add it:

    <property>
        <name>oozie.processing.timezone</name>
        <value>GMT+0800</value>
    </property>
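For the change to take effect, restart the Oozie server. A sketch for a plain Apache Oozie installation (on an Ambari-managed cluster, restart Oozie from the Ambari UI instead):

# run from the Oozie installation directory
bin/oozied.sh stop
bin/oozied.sh start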

This also makes the web UI display times correctly:

(screenshot: Oozie web UI showing GMT+0800 times)

Examples

Spark Action

workflow spark on yarn

For the Spark-on-YARN workflow, see the official documentation.

Directory layout

├── ooziespark
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml

workflow.xml

<?xml version="1.0" encoding="utf-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="SparkWordCount">  
  <start to="spark-node"/>  
  <action name="spark-node"> 
    <spark xmlns="uri:oozie:spark-action:0.1">  
      <job-tracker>${jobTracker}</job-tracker>  
      <name-node>${nameNode}</name-node>  
      <prepare> 
        <delete path="${outputdir}"/>
      </prepare>  
      <master>${master}</master>  
      <name>Spark-Wordcount</name>  
      <class>WordCount</class>  
      <jar>${nameNode}/user/LJK/ooziespark/lib/spark-1.6.2-1.0-SNAPSHOT.jar</jar>  
      <spark-opts>--driver-memory 512M --executor-memory 512M</spark-opts>  
      <arg>${inputdir}</arg>  
      <arg>${outputdir}</arg> 
    </spark>  
    <ok to="end"/>  
    <error to="fail"/> 
  </action>  
  <kill name="fail"> 
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> 
  </kill>  
  <end name="end"/> 
</workflow-app>
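Before uploading, you can check the definition against Oozie's XML schemas with the standard CLI validate command:

# validate the workflow definition
oozie validate workflow.xml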

job.properties

nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
oozie.wf.application.path=/user/LJK/ooziespark
#oozie.coord.application.path=${nameNode}/user/LJK/ooziespark
#start=2017-09-28T17:00+0800
#end=2017-09-30T17:00+0800
#workflowAppUri=${nameNode}/user/LJK/ooziespark/

Package the program and copy the jar into the app's lib directory. The test source code is below:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
//      .setJars(List("/Users/LJK/Documents/code/github/study-spark1.6.2/target/spark-1.6.2-1.0-SNAPSHOT.jar"))
//      .set("spark.yarn.historyServer.address", "rm:18080")
//      .set("spark.eventLog.enabled", "true")
//      .set("spark.eventLog.dir", "hdfs://nn1:8020/spark-history")
      .set("spark.testing.memory", "1073741824")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    rdd.saveAsTextFile(args(1))
    sc.stop()
  }
}

Upload the directory to HDFS: hdfs dfs -put ooziespark /user/LJK/
Note: job.properties does not have to be uploaded to HDFS, because the CLI reads the local copy, not the one on HDFS.

Start the job with the oozie CLI:
oozie job -oozie http://rm:11000/oozie -config /usr/local/share/applications/ooziespark/job.properties -run
or
oozie job -config /usr/local/share/applications/ooziespark/job.properties -run
The short form works only if you configure the environment variable OOZIE_URL, which "is used as default value for the '-oozie' option"; see oozie help for details.
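For example, a minimal sketch using the server URL from the long form above:

# set once per shell session so -oozie can be omitted
export OOZIE_URL=http://rm:11000/oozie
oozie job -config /usr/local/share/applications/ooziespark/job.properties -run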

You can watch the job run in the Oozie web UI:

(screenshot: the workflow job in the Oozie web UI)

Coordinator spark on yarn

(images: coordinator overview diagrams)

A simple schedule: run WordCount every five minutes.

Directory layout

├── ooziecoor
│   ├── coordinator.xml
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml

coordinator.xml

<coordinator-app name="cron-coord" frequency="${coord:minutes(5)}" start="${start}" end="${end}" timezone="GMT+0800"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

Modify the previous job.properties to:

nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
#oozie.wf.application.path=/user/LJK/ooziespark
oozie.coord.application.path=${nameNode}/user/LJK/ooziecoor
start=2017-09-30T09:30+0800
end=2017-09-30T17:00+0800
workflowAppUri=${nameNode}/user/LJK/ooziecoor

You can keep the previous workflow.xml and leave the jar location unchanged, but it is tidier for each application to carry its own copy, so simply point the jar path at this app's lib directory (here /user/LJK/ooziecoor/lib).

Upload to HDFS and run:
oozie job -config /usr/local/share/applications/ooziecoor/job.properties -run
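The -run command prints the coordinator job id; you can then check its status and materialized actions from the CLI (a sketch; <coord-job-id> is a placeholder for that id):

# show coordinator status and its scheduled actions
oozie job -info <coord-job-id>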

You can view the job in the web UI:

(screenshots: the coordinator job and its actions in the Oozie web UI)

bundle spark on yarn

Directory layout

├── ooziebundle
│   ├── bundle.xml
│   ├── coordinator.xml
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml

Add bundle.xml:

<bundle-app name='bundle-app' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns='uri:oozie:bundle:0.1'>
    <coordinator name='coord-1'>
        <app-path>${nameNode}/user/LJK/ooziebundle/coordinator.xml</app-path>
        <configuration>
            <property>
                <name>start</name>
                <value>${start}</value>
            </property>
            <property>
                <name>end</name>
                <value>${end}</value>
            </property>
        </configuration>
    </coordinator>
</bundle-app>

Modify job.properties:

nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
#oozie.wf.application.path=/user/LJK/ooziespark
#oozie.coord.application.path=${nameNode}/user/LJK/ooziecoor
oozie.bundle.application.path=${nameNode}/user/LJK/ooziebundle
start=2017-09-30T09:30+0800
end=2017-09-30T17:00+0800
workflowAppUri=${nameNode}/user/LJK/ooziebundle

Upload to HDFS and run:
oozie job -config /usr/local/share/applications/ooziebundle/job.properties -run
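Since the bundle wraps its coordinators, the whole group can be managed with a single command (a sketch; <bundle-job-id> is a placeholder for the id printed by -run):

# suspend, resume, or kill every coordinator in the bundle at once
oozie job -suspend <bundle-job-id>
oozie job -resume <bundle-job-id>
oozie job -kill <bundle-job-id>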

View the job in the web UI:

(screenshot: the bundle job in the Oozie web UI)


Java Action

Directory layout. The lib directory's contents are not listed because the dependencies are shipped as individual jars rather than one fat jar; you can choose to build a single jar instead. A sketch for populating lib follows the tree below.

javaExample/
├── job.properties
├── lib
└── workflow.xml
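One way to populate lib, assuming a standard Maven project (mvn dependency:copy-dependencies is a stock Maven goal; the paths are illustrative):

# build the application jar and gather its dependency jars locally
mvn package dependency:copy-dependencies
# ship the application jar plus dependencies to the Oozie app's lib directory
hdfs dfs -put target/*.jar target/dependency/*.jar /user/LJK/javaExample/lib/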

Note:
If you use the Spring Boot framework, you need to add an exclusion in the pom, otherwise conflicting logging jars make Oozie report an error:

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter</artifactId>
  <exclusions>
      <exclusion>
          <artifactId>spring-boot-starter-logging</artifactId>
          <groupId>org.springframework.boot</groupId>
      </exclusion>
  </exclusions>
</dependency>

workflow.xml

<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="java-2d81"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="java-2d81">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.sharing.App</main-class>
            <arg>hello</arg>
            <arg>springboot</arg>
        </java>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

job.properties

oozie.use.system.libpath=false
queueName=default
jobTracker=rm.ambari:8050
nameNode=hdfs://nn1.ambari:8020
oozie.wf.application.path=${nameNode}/user/LJK/javaExample

Java source code:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class App {

    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
        // the two <arg> values from workflow.xml arrive here: prints "hello springboot"
        System.out.println(args[0] + " " + args[1]);
    }
}

Shell Action

Directory layout

shell
├── job.properties
└── workflow.xml

workflow.xml

<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-2504"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-2504">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>hello shell</argument>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
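A note on <capture-output/>: it captures the command's stdout in key=value (Java properties) format, which downstream nodes can read via the wf:actionData EL function. A minimal sketch (the greeting key is hypothetical):

# emit a key=value pair so capture-output can pick it up
echo "greeting=hello shell"
# a later node could then reference ${wf:actionData('shell-2504')['greeting']}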

job.properties

hue-id-w=50057
jobTracker=rm.ambari:8050
mapreduce.job.user.name=admin
nameNode=hdfs://nn1.ambari:8020
oozie.use.system.libpath=True
oozie.wf.application.path=hdfs://nn1.ambari:8020/user/LJK/shell
user.name=admin

Hive Action

Directory layout

hiveExample/
├── hive-site.xml
├── input
│   └── inputdata
├── job.properties
├── output
├── script.q
└── workflow.xml

Hive script: write a Hive script (the file name is up to you).
Contents of script.q:

DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;

workflow.xml

<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-bfbc"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive-bfbc" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/LJK/hiveExample/output"/>
                <mkdir path="${nameNode}/user/LJK/hiveExample/output"/>
            </prepare>
            <job-xml>/user/LJK/hiveExample/hive-site.xml</job-xml>
            <script>/user/LJK/hiveExample/script.q</script>
            <param>INPUT=/user/LJK/hiveExample/input</param>
            <param>OUTPUT=/user/LJK/hiveExample/output</param>
        </hive>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

job.properties

hue-id-w=50059
jobTracker=rm.ambari:8050
mapreduce.job.user.name=admin
nameNode=hdfs://nn1.ambari:8020
oozie.use.system.libpath=True
oozie.wf.application.path=hdfs://nn1.ambari:8020/user/LJK/hiveExample
user.name=admin

A data file must be placed under hdfs://nn1.ambari:8020/user/LJK/hiveExample/input (the file name is up to you).
Contents of inputdata:

1
2
3
4
6
7
8
9

After a successful run, the output directory contains a file 000000_0 whose contents match inputdata.

Hive2 Action

It is essentially the same as the Hive action; only workflow.xml needs to change:

<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="hive2-8f27"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive2-8f27" cred="hive2">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/LJK/hiveExample/output"/>
                <mkdir path="${nameNode}/user/LJK/hiveExample/output"/>
            </prepare>
            <job-xml>/user/LJK/hiveExample/hive-site.xml</job-xml>
            <jdbc-url>jdbc:hive2://rm.ambari:10000/default</jdbc-url>
            <script>/user/LJK/hiveExample/script.q</script>
            <param>INPUT=/user/LJK/hiveExample/input</param>
            <param>OUTPUT=/user/LJK/hiveExample/output</param>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
