Environment
JDK 1.8.0
Hadoop 2.6.0
Scala 2.11.8
Spark 2.1.2
Oozie 4.1
Hue 3.9
yarn local mode
- Open the Workspace
- Enter the lib directory and upload the jar and configuration files
- Drag a Spark Program action into the workflow
- Select the lib directory from the previous step
- Enter the jar name and click add to confirm
- Enter the main class of your job and configure its arguments
- Click the gear icon to review the other settings
- Save the configuration
- Submit and run
yarn cluster mode
- Open the Workspace
- Enter the lib directory and upload the jar and configuration files
- Drag a Spark Program action into the workflow
- Fill in anything for Files (it will be removed in a later step); for Jar name, enter the full HDFS path:
hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570773494.4/lib/DataWarehouse-1.0-SNAPSHOT.jar
- Enter the main class of your job, click the minus sign to remove the FILES entry, and configure the arguments (a sample spark.properties is sketched after this list):
hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570773494.4/lib/DataWarehouse-1.0-SNAPSHOT.jar
dw.user.qhy.wc.WordCount
--properties-file spark.properties
- Click the gear icon to review the other settings
- Change client to cluster
- Save the configuration
- Submit and run
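- About spark.properties: the --properties-file argument points at the spark.properties uploaded to the lib directory, which is a plain Spark properties file of key=value lines. A minimal sketch (keys and values here are illustrative only, not taken from the original setup; the spark.serializer line is the one discussed in the FAQ below):
spark.serializer=org.apache.spark.serializer.JavaSerializer
spark.executor.memory=2g
spark.executor.cores=2
spark.yarn.queue=default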
Oozie HTTP API
- Oozie WebServicesAPI official documentation
- workflow.xml
<workflow-app name="data_warehouse.test" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-2d66"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-2d66">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>data_warehouse.workflow</name>
            <class>dw.update.JobStream</class>
            <jar>hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24/lib/DataWarehouse-1.0.jar</jar>
            <spark-opts>--properties-file spark.properties</spark-opts>
            <arg>${CustomArgs}</arg>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
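- Pushing workflow.xml to HDFS (optional): the REST submission below assumes workflow.xml already exists under oozie.wf.application.path (here, the Hue workspace directory). If you maintain workflow.xml outside Hue, one way to upload it is WebHDFS; a minimal Python sketch, where the NameNode host and the default Hadoop 2.x port 50070 are assumptions, and since localcluster is an HA nameservice the request must go to the active NameNode (or an HttpFS gateway):
import requests

NAMENODE = 'http://namenode.walker:50070'  # hypothetical active NameNode
PATH = '/user/hue/oozie/workspaces/hue-oozie-1578979482.24/workflow.xml'

with open('workflow.xml', 'rb') as f:
    data = f.read()

# Step 1: the NameNode answers CREATE with a 307 redirect to a DataNode
url = '%s/webhdfs/v1%s?op=CREATE&overwrite=true&user.name=walker' % (NAMENODE, PATH)
r = requests.put(url, allow_redirects=False)

# Step 2: write the file body to the DataNode location from the redirect
r = requests.put(r.headers['Location'], data=data)
print(r.status_code)  # 201 on success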
- Sample code (Python)
import requests
import json
import time
from pprint import pprint

HEADER = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Content-Type':
    'application/xml;charset=UTF-8',
}

# oozie.wf.application.path is the HDFS directory that holds workflow.xml
XML = '''
<configuration>
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24</value>
    </property>
    <property>
        <name>oozie.use.system.libpath</name>
        <value>True</value>
    </property>
    <property>
        <name>user.name</name>
        <value>walker</value>
    </property>
    <property>
        <name>jobTracker</name>
        <value>rm1</value>
    </property>
    <property>
        <name>mapreduce.job.user.name</name>
        <value>walker</value>
    </property>
    <property>
        <name>nameNode</name>
        <value>hdfs://localcluster</value>
    </property>
    <property>
        <name>CustomArgs</name>
        <value>%s</value>
    </property>
</configuration>
'''

CustomArgs = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Content-Type':
    'application/xml;charset=UTF-8',
    'hello':
    'world'
}
XML = XML % json.dumps(CustomArgs)

# Submit the job
r = requests.post('http://oozie.walker:11000/oozie/v1/jobs?action=start', data=XML, headers=HEADER)
print(r.text)  # {"id":"0000034-191012235641226-oozie-vipc-W"}

# Extract the job ID
jobid = json.loads(r.text)['id']

# Poll the run status
url = 'http://oozie.walker:11000/oozie/v1/job/%s?show=info&timezone=GMT' % jobid
while True:
    time.sleep(5)
    r = requests.get(url)
    pprint(r.text)  # print the run status
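- The loop above only prints the raw response and never exits. A minimal sketch that could replace it (assuming the standard status field in the job-info JSON returned by the v1 API; it continues the script above, so requests, time and jobid are already defined):
TERMINAL_STATES = {'SUCCEEDED', 'KILLED', 'FAILED'}

def wait_for(jobid, interval=5):
    # Poll the job-info endpoint until the workflow reaches a terminal state
    info_url = 'http://oozie.walker:11000/oozie/v1/job/%s?show=info&timezone=GMT' % jobid
    while True:
        time.sleep(interval)
        status = requests.get(info_url).json().get('status')  # PREP, RUNNING, SUCCEEDED, KILLED, FAILED, ...
        print(status)
        if status in TERMINAL_STATES:
            return status

print(wait_for(jobid))

# A running workflow can be stopped with the kill action:
# requests.put('http://oozie.walker:11000/oozie/v1/job/%s?action=kill' % jobid)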
FAQ
- If you hit an error like the following (Attempt to add ... multiple times to the distributed cache):
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception,
Attempt to add (hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570758098.65/lib/DataWarehouse-1.0-SNAPSHOT.jar) multiple times to the distributed cache.
See this article for a workaround: java.lang.IllegalArgumentException: Attempt to add (custom-jar-with-spark-code.jar) multiple times to the distributed cache
- If you hit errors like the following (kryo):
java.io.IOException: java.lang.NullPointerException
java.io.EOFException
com.esotericsoftware.kryo.KryoException
This is most likely caused by improper use of the Kryo serializer; the simplest fix is to change
spark.serializer=org.apache.spark.serializer.KryoSerializer
back to the default
spark.serializer=org.apache.spark.serializer.JavaSerializer
For more background, see this article's solution: Spark2 serialization (JavaSerializer/KryoSerializer)
This post is from walker snapshot.