Week 41 of 2018: Setting Up and Configuring Spark SQL

Setting up Spark

Download spark-2.3.2

wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

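Download the Spark build packaged for Hadoop 2.7 (the -bin-hadoop2.7 tarball); otherwise you have to add many dependencies to the Spark directory yourself.

Modify the configuration

Copy $HADOOP_HOME/etc/hadoop/core-site.xml to $SPARK_HOME/conf
Copy $HADOOP_HOME/etc/hadoop/hdfs-site.xml to $SPARK_HOME/conf
Copy $HIVE_HOME/conf/hive-site.xml to $SPARK_HOME/conf

Edit $SPARK_HOME/conf/hive-site.xml and change the port that Spark SQL listens on to 10002, so that it does not conflict with the HiveServer2 port of the existing Hive installation.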

<property>
    <name>hive.server2.thrift.port</name>
    <value>10002</value>
    <description>Port number of HiveServer2 Thrift interface when hive.server2.transport.mode is 'binary'.</description>
</property>
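The same port change can be scripted instead of made by hand. The following is a minimal sketch, not part of the original setup: the helper name set_hadoop_property and the sample XML are made up for illustration, and it assumes the file uses the usual <configuration>/<property> layout. Note that Python's ElementTree drops comments and the XML declaration when it re-serializes, so for a one-off change editing by hand is probably still simpler.

```python
import xml.etree.ElementTree as ET


def set_hadoop_property(xml_text, name, value):
    """Return xml_text with the <property> called `name` set to `value`.

    The property is appended if it does not exist yet.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No matching property was found, so add a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample; a real hive-site.xml has many more properties.
sample = """<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
</configuration>"""

updated = set_hadoop_property(sample, "hive.server2.thrift.port", "10002")
print(updated)
```

To apply it to the real file, read $SPARK_HOME/conf/hive-site.xml into a string, pass it through the function, and write the result back after keeping a backup.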

Create the startup script and run it

Create the file startThriftServer.sh in the $SPARK_HOME directory

vim startThriftServer.sh
Add the following content:
#!/bin/bash

./sbin/start-thriftserver.sh \
  --master yarn

Run the script

chmod +x ./startThriftServer.sh
./startThriftServer.sh

Test the startup

From the $SPARK_HOME directory, connect with beeline:

[jevoncode@s1 spark-2.3.2-bin-hadoop2.7]$ ./bin/beeline 
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10002/hive_data
Connecting to jdbc:hive2://localhost:10002/hive_data
Enter username for jdbc:hive2://localhost:10002/hive_data: jevoncode
Enter password for jdbc:hive2://localhost:10002/hive_data: ***************
2018-10-14 11:15:24 INFO  Utils:310 - Supplied authorities: localhost:10002
2018-10-14 11:15:24 INFO  Utils:397 - Resolved authority: localhost:10002
2018-10-14 11:15:24 INFO  HiveConnection:203 - Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10002/hive_data
Connected to: Spark SQL (version 2.3.2)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10002/hive_data>

You can now run SQL statements.
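If beeline cannot connect, a quick first check is whether anything is listening on the Thrift port at all. Here is a small sketch using only the Python standard library; the function name is_port_open is made up, and localhost and port 10002 are taken from the beeline session above. A successful TCP connection only shows that some process accepts connections on that port, not that it is the Spark Thrift Server.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # 10002 is the port configured in hive-site.xml; localhost is an assumption.
    print("Thrift port reachable:", is_port_open("localhost", 10002))
```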

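Spark dynamic resource allocation

After setting up Spark, SQL statements ran slowly. The Spark web UI showed only two executors running, even though the YARN cluster has 7 servers, and Zabbix showed low resource utilization.

To reach the web UI, open the YARN UI and click the ApplicationMaster link of the Thrift JDBC/ODBC Server application.

The fix is to enable Spark's dynamic resource allocation, configured as follows:

1. Configure $SPARK_HOME/conf/spark-defaults.conf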

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true

2. Configure $HADOOP_HOME/etc/hadoop/yarn-site.xml; this must be done on every NodeManager

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

3. Copy $SPARK_HOME/yarn/spark-2.3.2-yarn-shuffle.jar to $HADOOP_HOME/share/hadoop/yarn/

4. Restart every NodeManager

5. Running SQL now shows many executors at work
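Since the settings in steps 1 and 2 live in two different files, and yarn-site.xml has to be right on every NodeManager, a small consistency check can save a restart cycle. The sketch below is only an illustration: all function names are made up, and the parser assumes spark-defaults.conf lines have the form "key value" separated by whitespace (the file format also allows other separators, which this does not handle). It does not check step 3 (the shuffle jar) or step 4 (the restart).

```python
import xml.etree.ElementTree as ET


def parse_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict, assuming 'key value' lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def parse_hadoop_xml(text):
    """Parse a Hadoop-style *-site.xml into a dict of name -> value."""
    root = ET.fromstring(text)
    return {
        (p.findtext("name") or "").strip(): (p.findtext("value") or "").strip()
        for p in root.findall("property")
    }


def dynamic_allocation_problems(spark_conf, yarn_conf):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for key in ("spark.dynamicAllocation.enabled", "spark.shuffle.service.enabled"):
        if spark_conf.get(key, "").lower() != "true":
            problems.append(f"{key} is not true in spark-defaults.conf")
    aux = yarn_conf.get("yarn.nodemanager.aux-services", "")
    if "spark_shuffle" not in [s.strip() for s in aux.split(",")]:
        problems.append("spark_shuffle missing from yarn.nodemanager.aux-services")
    cls_key = "yarn.nodemanager.aux-services.spark_shuffle.class"
    if yarn_conf.get(cls_key) != "org.apache.spark.network.yarn.YarnShuffleService":
        problems.append(f"{cls_key} is not set to YarnShuffleService")
    return problems


# Sample inputs mirroring the configuration shown in steps 1 and 2.
spark_text = """
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
"""
yarn_text = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
</configuration>"""

problems = dynamic_allocation_problems(
    parse_spark_defaults(spark_text), parse_hadoop_xml(yarn_text)
)
print(problems or "configuration looks consistent")
```

In practice the two strings would be read from $SPARK_HOME/conf/spark-defaults.conf and from the yarn-site.xml of each NodeManager host.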

Troubleshooting

1. Shuffle configuration problem

2018-10-14 10:24:05 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2018-10-14 10:24:20 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2018-10-14 10:24:35 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

The web UI showed no error message at all, and no status was visible either.

The error finally turned up in the YARN application log:

2018-10-14 10:20:38 ERROR YarnAllocator:91 - Failed to launch executor 23 on container container_e69_1538148198468_17372_01_000024
org.apache.spark.SparkException: Exception while starting container container_e69_1538148198468_17372_01_000024 on host jevoncode.com
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:125)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:534)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:205)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:122)
    ... 5 more

This is a shuffle configuration problem: the error above occurs because yarn-site.xml neither lists spark_shuffle in the aux-services nor specifies the spark_shuffle class.
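The useful line in that stack trace is the "Caused by:" one, and it is easy to miss in a long application log. As a rough sketch (the function name root_causes and the regular expression are my own, not anything Spark or YARN provides), the following pulls every "Caused by:" line out of a log dump, such as one saved from yarn logs -applicationId with the application's ID.

```python
import re

# Matches lines such as "Caused by: some.pkg.SomeException: message".
CAUSED_BY = re.compile(r"^Caused by: (?P<cls>[\w.$]+)(?:: (?P<msg>.*))?$")


def root_causes(log_text):
    """Return (exception_class, message) for every 'Caused by:' line."""
    causes = []
    for line in log_text.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            causes.append((match.group("cls"), match.group("msg") or ""))
    return causes


# Shortened excerpt of the stack trace shown above.
sample_log = """\
org.apache.spark.SparkException: Exception while starting container container_e69_1538148198468_17372_01_000024 on host jevoncode.com
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:125)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
"""

for cls, msg in root_causes(sample_log):
    print(cls, "->", msg)
```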

2. HADOOP_CONF_DIR configuration problem
Add the following to ~/.bashrc:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
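Spark on YARN reads the cluster's client-side configuration from the directory that HADOOP_CONF_DIR (or YARN_CONF_DIR) points to, so it helps to confirm the variable is really visible in the shell that starts the Thrift Server. The check below is a hypothetical helper of my own; the list of required files is an assumption and can be adjusted.

```python
import os


def hadoop_conf_dir_problems(env=None, required=("core-site.xml", "yarn-site.xml")):
    """Return a list of problems with HADOOP_CONF_DIR; empty means it looks usable."""
    env = os.environ if env is None else env
    conf_dir = env.get("HADOOP_CONF_DIR")
    if not conf_dir:
        return ["HADOOP_CONF_DIR is not set"]
    if not os.path.isdir(conf_dir):
        return [f"HADOOP_CONF_DIR={conf_dir} is not a directory"]
    # Report every expected file that is missing from the directory.
    return [
        f"{name} not found in {conf_dir}"
        for name in required
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]


print(hadoop_conf_dir_problems() or "HADOOP_CONF_DIR looks usable")
```

Run it from the same shell session that launches startThriftServer.sh, since a variable exported in ~/.bashrc only takes effect in shells started afterwards (or after sourcing the file).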
