• Tech stack
Elasticsearch 7.17
Spark 3.5
AWS EMR 7.3
  • Error log snippet
24/10/24 02:45:56 ERROR NetworkClient: Node [192.168.83.87:9200] failed (java.net.BindException: Address already in use); selected next node [192.168.83.232:9200]
24/10/24 02:45:56 ERROR Executor: Exception in task 7506.0 in stage 1.0 (TID 17506)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.83.232:9200, 192.168.83.87:9200, 192.168.83.26:9200]] 
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:160) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:441) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:437) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:397) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:401) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:177) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.request.GetAliasesRequestBuilder.execute(GetAliasesRequestBuilder.java:68) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:623) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:71) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.spark.rdd.EsSpark$.$anonfun$doSaveToEs$1(EsSpark.scala:108) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.elasticsearch.spark.rdd.EsSpark$.$anonfun$doSaveToEs$1$adapted(EsSpark.scala:108) ~[DataAnalysis_EMR_Spark3-1.0.jar:?]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:174) ~[spark-core_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:152) ~[spark-core_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:632) ~[spark-core_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) ~[spark-common-utils_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) ~[spark-common-utils_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:96) ~[spark-core_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:635) [spark-core_2.12-3.5.1-amzn-1.jar:3.5.1-amzn-1]
  • The three addresses below belong to the coordinating nodes
[192.168.83.232:9200, 192.168.83.87:9200, 192.168.83.26:9200] 
  • Adjusting the following write parameters in the Spark code did not help
# ES
es.nodes.wan.only = true 
es.nodes.discovery = false
es.nodes.client.only = true
# Spark
spark.driver.bindAddress = 0.0.0.0
spark.driver.port = 0
spark.executor.bindAddress = 0.0.0.0
spark.executor.port = 0
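For context, the ES-side parameters above would typically be wired into a Spark job through `SparkConf` before writing with elasticsearch-spark. The sketch below is illustrative only: the node list comes from the error log, but the index name, app name, and input path are placeholders. Note that per the ES-Hadoop docs, `es.nodes.client.only` cannot be combined with `es.nodes.wan.only`, so they would have been tried separately.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrame

// Sketch only: how the write parameters above are usually configured.
val conf = new SparkConf()
  .setAppName("es-write") // placeholder app name
  .set("es.nodes", "192.168.83.232,192.168.83.87,192.168.83.26")
  .set("es.port", "9200")
  .set("es.nodes.wan.only", "true")   // talk only to the listed nodes
  .set("es.nodes.discovery", "false") // do not discover data nodes
  // alternative (mutually exclusive with wan.only):
  // .set("es.nodes.client.only", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
spark.read.parquet("s3://bucket/input") // placeholder input path
  .saveToEs("my-index")                 // placeholder index name
```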
  • The final diagnosis: several indices had too many data files, and when switching between indices there were not enough source ports left to bind. Reducing the number of data files with Spark repartition resolved the problem.
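The fix described above would look roughly like this (a sketch; the partition count, paths, and index name are assumed, not taken from the original job):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrame

// Hypothetical sketch of the repartition fix: collapse the many small
// data files into fewer partitions before writing, so each executor
// opens far fewer ES connections (and consumes fewer source ports).
val spark = SparkSession.builder().getOrCreate()
spark.read.parquet("s3://bucket/many-small-files") // placeholder input
  .repartition(64)      // assumed target; tune to cluster size
  .saveToEs("my-index") // placeholder index name
```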
  • This seems odd: it looks as if each data file held on to a port, rather than each thread holding one.
  • Another possibility is that the client binds the local port before it connects (bind, then connect); see 一次Commons-HttpClient的BindException排查 for a similar investigation.
  • Neither explanation feels entirely convincing.
  • The error message is also quite misleading: binding the source port failed, yet the client reacted by switching to a different destination address.
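The failure mode in the log can be reproduced with nothing but the JDK: when a client socket explicitly binds a local (source) port that is already taken, the attempt dies in `bind()` itself with `java.net.BindException: Address already in use`, before any connection to the destination is made. A minimal stdlib-only demo:

```scala
import java.net.{BindException, InetAddress, InetSocketAddress, ServerSocket, Socket}

object BindDemo {
  /** Try to bind a client socket to a specific local port on loopback. */
  def tryBindLocalPort(port: Int): Either[String, Socket] = {
    val s = new Socket()
    try {
      s.bind(new InetSocketAddress(InetAddress.getLoopbackAddress, port))
      Right(s)
    } catch {
      case e: BindException =>
        s.close()
        Left(e.getMessage) // e.g. "Address already in use"
    }
  }

  def main(args: Array[String]): Unit = {
    // Occupy an ephemeral port with a listener, then try to bind a
    // second socket to the same port: this triggers the BindException.
    val server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress)
    try println(tryBindLocalPort(server.getLocalPort)) // prints a Left(...)
    finally server.close()
  }
}
```

This matches the log's shape: the `BindException` is a purely local, source-side failure, which is why retrying against a different destination node cannot help once the local port range is exhausted.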
This article is from qbit snap
