
Environment

JDK    1.8.0 
Hadoop 2.6.0
Scala  2.11.8  
Spark  2.1.2
Oozie  4.1
Hue    3.9 

yarn local mode

  • Enter the Workspace

workspace.png

  • Enter the lib directory and upload the jar and configuration files

lib.png

  • Drag in a Spark Program action

drag_spark.png

  • Select the lib directory from the previous step

select.png

  • Enter the jar name and click Add to confirm

add.png

  • Fill in the main class of the job and configure its options

main_opt.png

  • Click the gear icon to view the other options

enter.png


see.png

  • Save the configuration

save.png

  • Submit and run

submit.png

yarn cluster mode

  • Enter the Workspace

yarn_cluster

  • Enter the lib directory and upload the jar and configuration files

upload

  • Drag in a Spark Program action

drag_spark

  • Fill in anything for Files (it will be removed shortly); for Jar name, enter the full HDFS path
hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570773494.4/lib/DataWarehouse-1.0-SNAPSHOT.jar

add.png

  • Fill in the main class of the job, click the minus sign to remove FILES, and configure the options (an illustrative spark.properties is sketched after this step list)

image.png

hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570773494.4/lib/DataWarehouse-1.0-SNAPSHOT.jar
dw.user.qhy.wc.WordCount
--properties-file spark.properties
  • Click the gear icon to view the other options

check

  • Change client to cluster

rename

  • Save the configuration

save

  • Submit and run

submit
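
The --properties-file spark.properties option above refers to a Spark properties file uploaded to the lib directory together with the jar. Its contents depend entirely on the job; the following is only an illustrative sketch, and every key and value in it is a placeholder rather than part of the original setup.

spark.executor.instances=2
spark.executor.memory=2g
spark.executor.cores=2
spark.serializer=org.apache.spark.serializer.JavaSerializer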

Oozie HTTP API

The workflow.xml below is stored in the HDFS directory given by oozie.wf.application.path; the job is then started by POSTing a configuration document to Oozie's REST API, as in the Python example further down.

<workflow-app name="data_warehouse.test" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-2d66"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-2d66">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>data_warehouse.workflow</name>
            <class>dw.update.JobStream</class>
            <jar>hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24/lib/DataWarehouse-1.0.jar</jar>
            <spark-opts>--properties-file spark.properties</spark-opts>
            <arg>${CustomArgs}</arg>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
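
For reference, the HDFS layout this workflow assumes would look roughly as follows (assuming spark.properties was uploaded to the lib directory along with the jar, as in the earlier steps):

hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24/
├── workflow.xml
└── lib/
    ├── DataWarehouse-1.0.jar
    └── spark.properties
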
  • Example code (Python)
import requests
import json
import time
from pprint import pprint

HEADER = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Content-Type':
    'application/xml;charset=UTF-8',
}
# The directory given by oozie.wf.application.path holds workflow.xml
XML = '''
<configuration>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24</value>
  </property>
  <property>
    <name>oozie.use.system.libpath</name>
    <value>True</value>
  </property>
  <property>
    <name>user.name</name>
    <value>walker</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>rm1</value>
  </property>
  <property>
    <name>mapreduce.job.user.name</name>
    <value>walker</value>
  </property>
  <property>
    <name>nameNode</name>
    <value>hdfs://localcluster</value>
  </property>
  <property>
    <name>CustomArgs</name>
    <value>%s</value>
  </property>
</configuration>
'''

# Arbitrary job arguments; JSON-encoded below and passed to the workflow via ${CustomArgs}
CustomArgs = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Content-Type':
    'application/xml;charset=UTF-8',
    'hello':
    'world'
}

XML = XML % json.dumps(CustomArgs)
# Submit the job
r = requests.post('http://oozie.walker:11000/oozie/v1/jobs?action=start', data=XML, headers=HEADER)
print(r.text)   # {"id":"0000034-191012235641226-oozie-vipc-W"}
# Extract the job ID
jobid = json.loads(r.text)['id']
# Poll the job status until the workflow reaches a terminal state
url = 'http://oozie.walker:11000/oozie/v1/job/%s?show=info&timezone=GMT' % jobid
while True:
    time.sleep(5)
    r = requests.get(url)
    info = json.loads(r.text)
    pprint(info['status'])    # e.g. PREP, RUNNING, SUCCEEDED, KILLED, FAILED
    if info['status'] in ('SUCCEEDED', 'KILLED', 'FAILED'):
        break
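
Besides submitting and polling, the Oozie v1 API also accepts job control actions (start, suspend, resume, kill) via PUT on the same job URL. A minimal sketch that reuses the jobid and HEADER from the example above:

# Kill the running workflow
r = requests.put('http://oozie.walker:11000/oozie/v1/job/%s?action=kill' % jobid, headers=HEADER)
print(r.status_code)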

FAQ

  • An error like the following is reported (Attempt to add ... multiple times to the distributed cache)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, 
Attempt to add (hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570758098.65/lib/DataWarehouse-1.0-SNAPSHOT.jar) multiple times to the distributed cache.

See the fix described in this article: java.lang.IllegalArgumentException: Attempt to add (custom-jar-with-spark-code.jar) multiple times to the distributed cache


  • An error like the following is reported (kryo)
java.io.IOException: java.lang.NullPointerException
java.io.EOFException
com.esotericsoftware.kryo.KryoException

This may be caused by improper use of the Kryo serializer. The simplest fix is to change

spark.serializer=org.apache.spark.serializer.KryoSerializer  

back to the default

spark.serializer=org.apache.spark.serializer.JavaSerializer

For more detail, see the solution in this article: Serialization in Spark 2 (JavaSerializer/KryoSerializer)

This article is from walker snapshot
