How to handle UNHEALTHY nodes on YARN

Published: 2021-12-29 14:45:42  Source: Yisu Cloud (億速云)  Author: 小新  Category: Cloud Computing

This article shows how to diagnose and fix UNHEALTHY nodes on YARN.

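1. The error

The cluster consists of my three virtual machines: hadoop001, hadoop002, and hadoop003.

The page served on port 23188 shows Unhealthy Nodes, and the number of active nodes is not what it should be.

The command line confirms this: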

$ yarn node -list -all
Total Nodes:4
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 hadoop001:34354              UNHEALTHY   hadoop001:23999                                  0
 hadoop002:60027                RUNNING   hadoop002:23999                                  0
 hadoop001:50623              UNHEALTHY   hadoop001:23999                                  0
 hadoop003:39700              UNHEALTHY   hadoop003:23999                                  0
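
On a larger cluster it can help to filter this listing automatically. The following is a minimal Python sketch, not part of the original troubleshooting session, that parses text in the format shown above and reports which hosts are UNHEALTHY. The names `NodeRow`, `parse_node_list`, and `unhealthy_hosts` are hypothetical, and the column layout is assumed to match the output above.

```python
import re
from typing import List, NamedTuple


class NodeRow(NamedTuple):
    node_id: str
    state: str
    http_address: str
    containers: int


# Matches data rows such as " hadoop001:34354   UNHEALTHY   hadoop001:23999   0".
# The "Total Nodes" line and the column header do not match and are skipped.
ROW_RE = re.compile(r"^\s*(\S+:\d+)\s+([A-Z_]+)\s+(\S+)\s+(\d+)\s*$")


def parse_node_list(output: str) -> List[NodeRow]:
    """Parse the text printed by `yarn node -list -all` (layout assumed)."""
    rows = []
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append(NodeRow(m.group(1), m.group(2), m.group(3), int(m.group(4))))
    return rows


def unhealthy_hosts(rows: List[NodeRow]) -> List[str]:
    """Return the sorted host names that have at least one UNHEALTHY entry."""
    return sorted({r.node_id.split(":")[0] for r in rows if r.state == "UNHEALTHY"})


# Sample input copied from the listing in this article.
SAMPLE = """Total Nodes:4
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 hadoop001:34354              UNHEALTHY   hadoop001:23999                                  0
 hadoop002:60027                RUNNING   hadoop002:23999                                  0
 hadoop001:50623              UNHEALTHY   hadoop001:23999                                  0
 hadoop003:39700              UNHEALTHY   hadoop003:23999                                  0
"""

if __name__ == "__main__":
    print(unhealthy_hosts(parse_node_list(SAMPLE)))
```

In practice the input text would come from running the command itself, for example through `subprocess.run`, rather than from a pasted sample.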

2. Checking the logs

The ResourceManager log shows:

2016-09-10 12:02:05,953 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node hadoop002:60027 cluster capacity: <memory:4096, vCores:4>
2016-09-10 12:02:05,990 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node hadoop001:50623 reported UNHEALTHY with details: 1/1 local-dirs are bad: /data/disk1/data/yarn/local; 1/1 log-dirs are bad: /opt/beh/logs/yarn/userlog
2016-09-10 12:02:05,991 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop001:50623 Node Transitioned from RUNNING to UNHEALTHY
2016-09-10 12:02:05,993 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node hadoop001:50623 cluster capacity: <memory:2048, vCores:2>
2016-09-10 12:02:06,378 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved hadoop003 to /default-rack
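
The key line is the "reported UNHEALTHY with details" record, which names the directories the NodeManager considers bad. As a rough sketch (the function name `bad_dirs_by_node` is hypothetical), the Python below pulls those directories out of ResourceManager log text. It assumes each log record sits on a single line, so records that a terminal wrapped must be rejoined first, and it assumes that multiple directories, if present, are separated by commas.

```python
import re
from typing import Dict, List

# "Node <id> reported UNHEALTHY with details: <details>"
REPORT_RE = re.compile(r"Node (\S+) reported UNHEALTHY with details: (.*)$")
# "<bad>/<total> local-dirs are bad: <paths>" (same shape for log-dirs)
DIRS_RE = re.compile(r"\d+/\d+ (local-dirs|log-dirs) are bad: ([^;]+)")


def bad_dirs_by_node(log_text: str) -> Dict[str, Dict[str, List[str]]]:
    """Map each node id to its bad local-dirs and log-dirs."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for line in log_text.splitlines():
        m = REPORT_RE.search(line)
        if not m:
            continue
        node, details = m.group(1), m.group(2)
        entry = result.setdefault(node, {})
        for kind, paths in DIRS_RE.findall(details):
            # Assumption: several directories would be comma-separated.
            entry[kind] = [p.strip() for p in paths.split(",") if p.strip()]
    return result


# The record from this article, on one line.
SAMPLE_LOG = (
    "2016-09-10 12:02:05,990 INFO "
    "org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: "
    "Node hadoop001:50623 reported UNHEALTHY with details: "
    "1/1 local-dirs are bad: /data/disk1/data/yarn/local; "
    "1/1 log-dirs are bad: /opt/beh/logs/yarn/userlog"
)

if __name__ == "__main__":
    for node, dirs in bad_dirs_by_node(SAMPLE_LOG).items():
        print(node, dirs)
```

The directories it reports are the ones to inspect on the affected host.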

The NodeManager log shows the following (拒絕連接 in the exception is the localized message for "Connection refused"):

2016-09-10 12:02:02,869 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2016-09-10 12:02:02,905 INFO org.mortbay.log: Extract jar:file:/opt/beh/core/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.6.0-cdh6.4.4.jar!/webapps/node to /tmp/Jetty_0_0_0_0_23999_node____tgfx6h/webapp
2016-09-10 12:02:03,242 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:23999
2016-09-10 12:02:03,242 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /node started at 23999
2016-09-10 12:02:03,735 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2016-09-10 12:02:03,775 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2016-09-10 12:02:03,783 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2016-09-10 12:02:03,822 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2016-09-10 12:02:03,824 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking registerNodeManager of class ResourceTrackerPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 2138ms.
java.net.ConnectException: Call From hadoop002/192.168.30.22 to hadoop002:23125 failed on connection exception: java.net.ConnectException: 拒絕連接; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy27.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy28.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:191)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:264)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:463)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:509)
Caused by: java.net.ConnectException: 拒絕連接
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1438)
        ... 19 more
2016-09-10 12:02:05,965 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2016-09-10 12:02:05,996 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -1513537506
2016-09-10 12:02:05,998 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id 701920721
2016-09-10 12:02:05,999 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as hadoop002:60027 with total resource of <memory:2048, vCores:2>
2016-09-10 12:02:05,999 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests

3. Analyzing the error

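By default the NodeManager checks its local disks (local-dirs and log-dirs) every two minutes to determine which directories are usable. A directory is marked bad not only when it cannot be read or written but also when the disk holding it is filled beyond a utilization threshold (90% by default). In older Hadoop releases a directory judged bad stayed bad until the NodeManager was restarted, even if the disk had recovered; later releases re-check bad directories, which matches the recovery seen in this article. When the share of good directories falls below a configured minimum, the node is marked UNHEALTHY and no more tasks are assigned to it.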
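
The check can be approximated outside YARN to see which directories are close to the limit. This Python sketch is only an illustration of the rule, not the NodeManager's actual code. The two constants are, to my knowledge, the defaults of `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage` and `yarn.nodemanager.disk-health-checker.min-healthy-disks`; verify them against your own yarn-site.xml. The function names are hypothetical.

```python
import shutil
from typing import Callable, List, Tuple

# Assumed defaults; check yarn-site.xml / yarn-default.xml for your release.
MAX_UTILIZATION_PERCENT = 90.0
MIN_HEALTHY_FRACTION = 0.25


def utilization_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is not available."""
    usage = shutil.disk_usage(path)
    # usage.free is the space available to the current user.
    return 100.0 * (usage.total - usage.free) / usage.total


def check_dirs(
    dirs: List[str],
    get_utilization: Callable[[str], float] = utilization_percent,
    max_percent: float = MAX_UTILIZATION_PERCENT,
    min_fraction: float = MIN_HEALTHY_FRACTION,
) -> Tuple[bool, List[str]]:
    """Return (healthy, bad_dirs) for one list of configured directories."""
    bad = [d for d in dirs if get_utilization(d) > max_percent]
    good = len(dirs) - len(bad)
    healthy = bool(dirs) and good / len(dirs) >= min_fraction
    return healthy, bad


if __name__ == "__main__":
    # Paths taken from the log in this article; adjust to your configuration.
    for name, dirs in [
        ("local-dirs", ["/data/disk1/data/yarn/local"]),
        ("log-dirs", ["/opt/beh/logs/yarn/userlog"]),
    ]:
        try:
            print(name, check_dirs(dirs))
        except OSError as exc:
            print(name, "could not be checked:", exc)
```

With a single configured directory per list, as on these virtual machines, one nearly full disk is enough to make the whole node unhealthy, because zero good directories out of one is below any positive minimum.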

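Checking disk usage on my virtual machines showed that the disks on hadoop001 and hadoop003 were almost full. I deleted files that were no longer needed to free up space, and the UNHEALTHY nodes returned to normal right away: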
 

$ yarn node -list -all
Total Nodes:4
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 hadoop001:34354                RUNNING   hadoop001:23999                                  0
 hadoop002:60027                RUNNING   hadoop002:23999                                  0
 hadoop003:39700                RUNNING   hadoop003:23999                                  0
 hadoop001:50623                   LOST   hadoop001:23999                                  0

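Why are there two entries for hadoop001? I had changed a configuration file and restarted once, so the host appears twice. The Node-Id includes a port, and the two entries differ only in that port (34354 versus 50623). One entry is in the LOST state and the other is RUNNING normally. This does not affect use, and the listing returns to normal after YARN is restarted.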
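
To tell such leftover entries apart from hosts that are really down, the listing can be grouped by host. The sketch below is a hypothetical helper, assuming the same column layout as the output above. It reports LOST entries only for hosts that also have a RUNNING entry.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Captures host, port, and state from rows like
# " hadoop001:50623   LOST   hadoop001:23999   0".
ROW_RE = re.compile(r"^\s*(\S+):(\d+)\s+([A-Z_]+)\s+\S+\s+\d+\s*$")


def entries_by_host(output: str) -> Dict[str, List[Tuple[int, str]]]:
    """Group (port, state) pairs from the node listing by host name."""
    hosts: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if m:
            hosts[m.group(1)].append((int(m.group(2)), m.group(3)))
    return dict(hosts)


def stale_entries(output: str) -> List[str]:
    """Node-Ids in LOST state on hosts that also have a RUNNING entry."""
    stale = []
    for host, entries in entries_by_host(output).items():
        states = {state for _, state in entries}
        if "RUNNING" in states:
            stale += [f"{host}:{port}" for port, state in entries if state == "LOST"]
    return stale


# Sample input copied from the listing in this article.
SAMPLE = """Total Nodes:4
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 hadoop001:34354                RUNNING   hadoop001:23999                                  0
 hadoop002:60027                RUNNING   hadoop002:23999                                  0
 hadoop003:39700                RUNNING   hadoop003:23999                                  0
 hadoop001:50623                   LOST   hadoop001:23999                                  0
"""

if __name__ == "__main__":
    print(stale_entries(SAMPLE))
```

A LOST entry on a host with no RUNNING entry is not reported by this helper, since in that case the NodeManager itself may be down and needs a closer look.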

