Why does calling a third-party Python library inside an RDD fail in PySpark?

New here, please bear with me.

Problem description

Hi, when I run pyspark on our production cluster and call jieba for word segmentation, the import itself succeeds, but calling the segmentation function inside an RDD raises "No module named jieba". The same code works fine on my local VM.

Environment background and what I have tried

I tried switching to installing jieba as root.

Relevant code

>>> import jieba
>>> [x for x in jieba.cut('这是一段测试文本')]
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.448 seconds.
Prefix dict has been built succesfully.
[u'\u8fd9\u662f', u'\u4e00\u6bb5', u'\u6d4b\u8bd5', u'\u6587\u672c']
# The plain jieba call above segments the text successfully.
>>> cut = name.map(lambda x: [y for y in jieba.cut(x)])
>>> cut.count()
# The code above runs without errors on my local VM, but raises the error below when run from the production bastion host.

What result did you expect? What error message did you actually see?

18/07/13 10:16:17 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 1.0 (TID 16, hadoop13, executor 17): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/cloudpickle.py", line 664, in subimport
    __import__(name)
ImportError: ('No module named jieba', <function subimport at 0x27a9488>, ('jieba',))

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
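
A quick way to confirm that the import only fails on the executors (and not on the driver) is to attempt the import inside a map and report the result per host. This is a minimal sketch, assuming the pyspark shell's SparkContext sc; the try_import helper and the partition count are illustrative only:

>>> import socket
>>> def try_import(_):
...     # runs on an executor: return the host name and whether jieba can be imported there
...     try:
...         import jieba
...         return (socket.gethostname(), True)
...     except ImportError:
...         return (socket.gethostname(), False)
...
>>> sc.parallelize(range(100), 20).map(try_import).distinct().collect()

Hosts that come back with False are the ones where jieba is missing.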
1 answer

jieba has to be installed on every machine in the Spark cluster. The top-level import succeeds because it runs on the driver, but the function passed to map is shipped to the executors, and each executor's Python process must be able to import jieba on its own node.
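
If installing jieba on every node is impractical (for example, no root access on the workers), a common workaround is to ship the package with the job instead. The sketch below is only an illustration under assumptions: it assumes jieba has been zipped from a local site-packages directory (e.g. zip -r jieba.zip jieba), the input path is hypothetical, and whether jieba can load its dictionary out of a zip depends on the jieba version.

from pyspark import SparkContext

sc = SparkContext(appName="jieba-cut")        # or reuse the shell's existing sc
sc.addPyFile("jieba.zip")                     # distributed to every executor and added to its sys.path

def cut_line(line):
    import jieba                              # import inside the function so it resolves on the executor
    return [w for w in jieba.cut(line)]

name = sc.textFile("hdfs:///path/to/input.txt")   # hypothetical input path
print(name.map(cut_line).take(5))

The same zip can also be passed at submit time with spark-submit --py-files jieba.zip, or jieba can simply be installed with pip on every node, as the answer says.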
