application master 持續(xù)org.apache.hadoop.ipc.Client: Retrying connect to server

發(fā)布時間：2020-07-08 01:03:53 來源：網(wǎng)絡(luò) 閱讀：2714 作者：zouqingyun 欄目：大數(shù)據(jù)

一、問題現(xiàn)象

某一個nodemanager退出后，導(dǎo)致 application master中出現(xiàn)大量的如下日志，并且持續(xù)很長時間，application master才成功退出。

2016-06-24 09:32:35,596 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:35,596 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:35,597 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #5] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #6] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #2] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,596 INFO [ContainerLauncher #4] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,597 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,597 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
.......

2016-06-24 12:57:52,328 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:57:53,339 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:04,357 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:05,367 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:06,378 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:07,392 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:08,399 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:09,408 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:10,417 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:11,425 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:12,434 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

1）dchadoop206上的nodemanager退出后（由于重啟），導(dǎo)致application master持續(xù)的去連接之前nodemanager上的container。顯然這些container是已經(jīng)連接不上了。

2）最終經(jīng)過非常長的時間大概3-4小時后，連接不上的異常才拋出，application master正常結(jié)束。

二、問題分析

這個問題主要涉及hadoop的rpc機制。首先看下面兩個配置參數(shù)

 #定義client連接到nodemanager的最大超時時間,不是單次連接，而是經(jīng)過多少時間連接不上nodemanager，則認(rèn)為操作失敗
 <property>
        <name>yarn.client.nodemanager-connect.max-wait-ms</name>
        <value>15*60*1000</value>
 </property>
 # 定義每次嘗試去連接nodemanager的時間間隔
 <property>
        <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
        <value>10*1000</value>
 </property>

根據(jù)這兩個參數(shù)的定義，ApplicationMaster經(jīng)過15分鐘仍然連不上nodemanager的container，會取消try connect。但觀察的情況是Application Master 需要等大約30分鐘，才取消try connect。主要原因在于hadoop 的rpc機制如下，首先ApplicationMaster 會根據(jù)上面的兩個參數(shù)，構(gòu)造一個RetryUpToMaximumCountWithFixedSleep的重連策略，這個重連策略會通過以下方式計算

MaximumCount：yarn.client.nodemanager-connect.max-wait-ms/yarn.client.nodemanager-connect.retry-interval-ms=90次

而每次的RPC請求中，Client也有自己的重連策略，就是類似這樣的東東:

2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
由兩個rpc參數(shù)控制，ipc.client.connect.max.retries=10和ipc.client.connect.retry.interval=1000ms控制

所以最終ApplicationMaster 放棄try connect的等待時間是：90*(10+10)=1800s

三、解決辦法

1）在提交map-reduce/hive sql/hive server2等客戶端機器修改yarn-site.xml的以下參數(shù)
2）hadoop命令行中通過-D設(shè)置該參數(shù)

 <property>
        <name>yarn.client.nodemanager-connect.max-wait-ms</name>
        <value>180000</value>
 </property>

這樣總的等待時間就是6分鐘。

注意事項

這個修改是不需要做任何重啟yarn組件操作的，是一個客戶端相關(guān)的操作！

向AI問一下細節(jié)

application master 持續(xù)org.apache.hadoop.ipc.Client: Retrying connect to server

一、問題現(xiàn)象

三、解決辦法

注意事項

猜你喜歡

最新資訊

相關(guān)推薦

相關(guān)標(biāo)簽

三、解決辦法