Problem description
After installing a Hadoop cluster, the YARN console shows localhost as both the node ID and the node address:
[hadoop@master sbin]$ yarn node -list
20/12/17 12:21:19 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.8.42:18040
Total Nodes:1
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
localhost:43141 RUNNING localhost:8042 0
When a job is submitted, the YARN log also reports the node as 127.0.0.1 and uses that IP as the node address, so the connection is bound to fail:
2020-12-17 00:53:30,721 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1607916354082_0008_01_000001, AllocationRequestId: 0, Version: 0, NodeId: localhost:43141, NodeHttpAddress: localhost:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 127.0.0.1:35845 }, ExecutionType: GUARANTEED, ] for AM appattempt_1607916354082_0008_000001
2020-12-17 00:56:30,801 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1607916354082_0008_000001. Got exception: java.net.ConnectException: Call From master/172.16.8.42 to localhost:43141 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:827)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
at org.apache.hadoop.ipc.Client.call(Client.java:1495)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
Cause
In the Hadoop source code, the node information is built as follows:
private NodeId buildNodeId(InetSocketAddress connectAddress, String hostOverride) {
    if (hostOverride != null) {
        connectAddress = NetUtils.getConnectAddress(
            new InetSocketAddress(hostOverride, connectAddress.getPort()));
    }
    return NodeId.newInstance(
        connectAddress.getAddress().getCanonicalHostName(),
        connectAddress.getPort());
}
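The host name comes from connectAddress.getAddress().getCanonicalHostName(). A host name can also be obtained with getHostName, so what is the difference? getCanonicalHostName returns the fully qualified domain name, while getHostName returns the host name. For example, the host name may be definesys while the domain name configured in DNS is definesys.com; getCanonicalHostName goes through name resolution to get the fully qualified name. In this case getAddress actually returned 127.0.0.1, and the hosts file contained this entry: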
127.0.0.1 localhost localhost.localdomain
so the address was resolved to localhost.
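To see how the two calls behave on a given machine, a small standalone program can print both values for the loopback address. This is only a sketch: the class name HostNameLookup and the method describe are made up for illustration, and the output depends entirely on the local resolver configuration (hosts file, DNS), so no expected output is shown.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Hadoop.
class HostNameLookup {
    // Returns { getHostName(), getCanonicalHostName() } for an IP literal.
    static String[] describe(String ipLiteral) {
        try {
            // For an IP literal, getByName only validates the address format.
            InetAddress addr = InetAddress.getByName(ipLiteral);
            // Both calls may trigger a reverse lookup through the system resolver.
            return new String[] { addr.getHostName(), addr.getCanonicalHostName() };
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] names = describe("127.0.0.1");
        System.out.println("getHostName()          = " + names[0]);
        System.out.println("getCanonicalHostName() = " + names[1]);
    }
}
```

On a machine whose hosts file maps 127.0.0.1 to localhost first, one would expect the canonical name to come back as localhost, which matches the NodeId seen in the log above.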
Solution
Hadoop's recommended troubleshooting steps read:
- If the error message says the remote service is on "127.0.0.1" or "localhost" that means the configuration file is telling the client that the service is on the local server. If your client is trying to talk to a remote system, then your configuration is broken.
- Check that there isn't an entry for your hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts (Ubuntu is notorious for this).
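In other words, the advice is to remove the 127.0.0.1 and 127.0.1.1 entries from the hosts file. After removing them, the node registered correctly and the problem was solved.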
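The hosts-file check described above can be automated before starting the NodeManager. The following is a minimal sketch under a few assumptions: the class HostsLoopbackCheck and the method findLoopbackMappings are hypothetical names, the file lives at /etc/hosts, and only the two loopback addresses named in the Hadoop advice are checked.

```java
import java.net.InetAddress;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
class HostsLoopbackCheck {
    // Returns the hosts-file lines (comments stripped) that map
    // `hostname` to 127.0.0.1 or 127.0.1.1.
    static List<String> findLoopbackMappings(List<String> hostsLines, String hostname) {
        List<String> hits = new ArrayList<>();
        for (String raw : hostsLines) {
            // Drop trailing comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            String ip = fields[0];
            if (!ip.equals("127.0.0.1") && !ip.equals("127.0.1.1")) {
                continue;
            }
            // The remaining fields are the names bound to this address.
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].equalsIgnoreCase(hostname)) {
                    hits.add(line);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Use the first argument as the host name, else ask the JVM.
        String hostname = args.length > 0
                ? args[0]
                : InetAddress.getLocalHost().getHostName();
        List<String> lines = Files.readAllLines(Paths.get("/etc/hosts"));
        for (String hit : findLoopbackMappings(lines, hostname)) {
            System.out.println("Loopback mapping for " + hostname + ": " + hit);
        }
    }
}
```

Running it with the node's host name as the argument (for example master) lists any offending lines; an empty result means the hosts file does not bind that name to a loopback address.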