Flink on YARN: analyzing the "check if the requested resources are available in the YARN cluster" problem

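Background

When our real-time computing platform submitted a Flink job to YARN through YarnClient, the submission hung and the client kept printing the following log line: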

(YarnClusterDescriptor.java:1036)- Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster

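My first thought was a resource allocation problem on YARN, so I opened the YARN console and found that Apps Pending under Cluster Metrics was 1. Why would a newly submitted job be stuck in the pending state?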

1. Check CPU and memory first

[Screenshot: cluster CPU and memory usage in the YARN console]

CPU and memory are both plentiful, so nothing looks wrong here.
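The numbers shown in the console can also be read from the ResourceManager REST API (GET /ws/v1/cluster/metrics). The sketch below only parses a response body. The sample values are invented for illustration, `<rm-host>` and port 8088 are assumptions about the deployment, and the helper name `summarize_metrics` is hypothetical.

```python
import json

# Illustrative body for GET http://<rm-host>:8088/ws/v1/cluster/metrics.
# The values are made up for this example; they are not the article's data.
SAMPLE = """
{"clusterMetrics": {"appsPending": 1, "appsRunning": 15,
                    "totalMB": 147456, "allocatedMB": 56320,
                    "totalVirtualCores": 96, "allocatedVirtualCores": 24}}
"""


def summarize_metrics(body: str) -> dict:
    """Return pending/running app counts and memory/vcore utilisation."""
    m = json.loads(body)["clusterMetrics"]
    return {
        "apps_pending": m["appsPending"],
        "apps_running": m["appsRunning"],
        "memory_used_pct": round(100 * m["allocatedMB"] / m["totalMB"], 1),
        "vcores_used_pct": round(
            100 * m["allocatedVirtualCores"] / m["totalVirtualCores"], 1
        ),
    }


if __name__ == "__main__":
    print(summarize_metrics(SAMPLE))
```

A non-zero apps_pending together with low utilisation, as in this case, suggests that the limit being hit is something other than raw CPU or memory.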

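2. Check which scheduler is in use

The scheduler type used by the cluster is shown below:

[Screenshot: scheduler type in the YARN console]

The cluster uses the Capacity Scheduler. It is designed to let multiple tenants share a large cluster safely, allocating resources promptly within each tenant's capacity limits. It is built around queues: jobs are submitted to a queue, each queue is given a share of the resources, and hierarchical queues, access control, user limits, reservations and similar settings are supported. Tuning the resource shares does take some experience. On the other hand, it avoids the problem seen with the FIFO Scheduler, where one large job monopolizes the resources and leaves other jobs pending indefinitely.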

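3. Check the queue

[Screenshot: default queue details in the YARN console]

The figure shows that Configured Minimum User Limit Percent is 100%. Because the cluster is still fairly small, queues have not been split by tenant and everything runs in the default queue. Used capacity is only 38.2%, and the queue can hold up to 10000 applications while the actual number is far lower, so nothing looks wrong here either.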

4. Check the ResourceManager log

The log contains the following:

2020-11-26 19:33:46,669 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 317
2020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 317 submitted by user root
2020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1593338489799_0317
2020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    IP=x.x.x.x    OPERATION=Submit Application Request    TARGET=ClientRMService    RESULT=SUCCESS    APPID=application_1593338489799_0317
2020-11-26 19:33:48,874 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1593338489799_0317 State change from NEW to NEW_SAVING on event=START
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1593338489799_0317
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1593338489799_0317 State change from NEW_SAVING to SUBMITTED on event=APP_NEW_SAVED
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application added - appId: application_1593338489799_0317 user: root leaf-queue of parent: root #applications: 16
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Accepted application application_1593338489799_0317 from user: root, in queue: default
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1593338489799_0317 State change from SUBMITTED to ACCEPTED on event=APP_ACCEPTED
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1593338489799_0317_000001
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1593338489799_0317_000001 State change from NEW to SUBMITTED
2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2020-11-26 19:33:48,877 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1593338489799_0317 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@6c0d5b4d, leaf-queue: default #user-pending-applications: 1 #user-active-applications: 15 #queue-pending-applications: 1 #queue-active-applications: 15

The log shows the full flow of an application being admitted by YARN, except that this one ended up in the pending list. The lines relevant to our problem are:

2020-11-26 19:33:48,875 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2020-11-26 19:33:48,877 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1593338489799_0317 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@6c0d5b4d, leaf-queue: default #user-pending-applications: 1 #user-active-applications: 15 #queue-pending-applications: 1 #queue-active-applications: 15
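When scanning a long ResourceManager log, the "#name: value" counters at the end of the LeafQueue line are the quickest indicator of how many applications are active and pending. As a rough sketch, they could be pulled out with a regular expression. The helper name `queue_counters` is hypothetical, and the pattern assumes the counter format seen in the line above.

```python
import re

# The LeafQueue line from the log above, copied verbatim.
LINE = (
    "2020-11-26 19:33:48,877 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.LeafQueue: Application added - appId: "
    "application_1593338489799_0317 user: org.apache.hadoop.yarn.server."
    "resourcemanager.scheduler.capacity.LeafQueue$User@6c0d5b4d, "
    "leaf-queue: default #user-pending-applications: 1 "
    "#user-active-applications: 15 #queue-pending-applications: 1 "
    "#queue-active-applications: 15"
)

# Matches counters written as "#some-name: 123".
COUNTER = re.compile(r"#([a-z-]+): (\d+)")


def queue_counters(line: str) -> dict:
    """Extract every '#name: value' counter from a log line as integers."""
    return {name: int(value) for name, value in COUNTER.findall(line)}


if __name__ == "__main__":
    for name, value in queue_counters(LINE).items():
        print(f"{name} = {value}")
```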

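Indeed, the problem is "not starting application as amIfStarted exceeds amLimit". What causes it? See the explanation on Stack Overflow [1]:

[Screenshot: the Stack Overflow answer]

So what does yarn.scheduler.capacity.maximum-am-resource-percent actually mean? It is the upper limit on the fraction of cluster resources that may be used to run application masters, which in practice limits the number of concurrently running applications. Its default value is 0.1.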

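The cluster currently runs about 15 jobs, each with a JobManager of roughly 1 GB (in Flink on YARN the JobManager is the job's application master), about 15 GB in total. Total cluster memory is 144 GB, so the application master limit is 144 * 0.1 = 14.4 GB, and 15 > 14.4. As a result the new JobManager cannot be started and the application stays pending.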
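The arithmetic above can be written out as a small check. This is a simplified model that uses only the figures from this article (144 GB cluster, about 1 GB per JobManager); the real scheduler computes the limit per queue and per user and rounds to container sizes, so treat the function names and the model itself as illustrative assumptions.

```python
def am_limit_gb(cluster_memory_gb: float, max_am_resource_percent: float) -> float:
    """Memory that application masters may use in total (simplified model)."""
    return cluster_memory_gb * max_am_resource_percent


def can_start_am(
    used_am_gb: float,
    new_am_gb: float,
    cluster_memory_gb: float,
    max_am_resource_percent: float,
) -> bool:
    """True if one more application master still fits under the AM limit."""
    limit = am_limit_gb(cluster_memory_gb, max_am_resource_percent)
    return used_am_gb + new_am_gb <= limit


if __name__ == "__main__":
    # 15 running JobManagers of about 1 GB each, one more being submitted.
    for percent in (0.1, 0.5):
        ok = can_start_am(15, 1, 144, percent)
        print(f"maximum-am-resource-percent={percent}: can start = {ok}")
```

With the default of 0.1 the new application master does not fit, while with 0.5 the limit rises to 72 GB and it does.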

5. Fix and verify

Change yarn.scheduler.capacity.maximum-am-resource-percent in capacity-scheduler.xml as follows:

    <property>
        <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
        <value>0.5</value>
    </property>
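Before refreshing the queues it can be worth confirming that the file really contains the intended value. The sketch below reads a property from Hadoop-style configuration XML using only the standard library. The embedded sample is a minimal stand-in for capacity-scheduler.xml, and `get_property` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for capacity-scheduler.xml (standard Hadoop layout:
# <configuration> containing <property> elements with <name> and <value>).
SAMPLE_XML = """<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.5</value>
  </property>
</configuration>"""


def get_property(xml_text: str, name: str):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


if __name__ == "__main__":
    key = "yarn.scheduler.capacity.maximum-am-resource-percent"
    print(key, "=", get_property(SAMPLE_XML, key))
```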

Except for removing queues, changes to capacity-scheduler.xml can be applied at runtime without a restart. The refresh command is:

yarn rmadmin -refreshQueues

After running the command, the ResourceManager log shows output like this:

2020-11-27 09:37:56,340 INFO org.apache.hadoop.conf.Configuration: found resource capacity-scheduler.xml at file:/work/hadoop-2.7.4/etc/hadoop/capacity-scheduler.xml
2020-11-27 09:37:56,356 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Re-initializing queues...
---------------------------------------------------------------------------
2020-11-27 09:37:56,371 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Initialized queue mappings, override: false
2020-11-27 09:37:56,372 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    IP=x.x.x.x    OPERATION=refreshQueues    TARGET=AdminService    RESULT=SUCCESS

The log confirms that the refresh succeeded. Resubmitting the job from the platform worked, and the problem was solved.

YARN queue configuration

Resource Allocation

Property | Description
yarn.scheduler.capacity.<queue-path>.capacity | Queue capacity in percentage (%) as a float (e.g. 12.5) OR as absolute resource queue minimum capacity. The sum of capacities for all queues, at each level, must be equal to 100. However if absolute resource is configured, sum of absolute resources of child queues could be less than its parent absolute resource capacity. Applications in the queue may consume more resources than the queue's capacity if there are free resources, providing elasticity.
yarn.scheduler.capacity.<queue-path>.maximum-capacity | Maximum queue capacity in percentage (%) as a float OR as absolute resource queue maximum capacity. This limits the elasticity for applications in the queue. 1) Value is between 0 and 100. 2) Admin needs to make sure absolute maximum capacity >= absolute capacity for each queue. Also, setting this value to -1 sets maximum capacity to 100%.
yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent | Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value. The former (the minimum value) is set to this property value and the latter (the maximum value) depends on the number of users who have submitted applications. For example, suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queue's resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as an integer.
yarn.scheduler.capacity.<queue-path>.user-limit-factor | The multiple of the queue capacity which can be configured to allow a single user to acquire more resources. By default this is set to 1 which ensures that a single user can never take more than the queue's configured capacity irrespective of how idle the cluster is. Value is specified as a float.
yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb | The per queue maximum limit of memory to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration yarn.scheduler.maximum-allocation-mb. This value must be smaller than or equal to the cluster maximum.
yarn.scheduler.capacity.<queue-path>.maximum-allocation-vcores | The per queue maximum limit of virtual cores to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration yarn.scheduler.maximum-allocation-vcores. This value must be smaller than or equal to the cluster maximum.
yarn.scheduler.capacity.<queue-path>.user-settings.<user-name>.weight | This floating point value is used when calculating the user limit resource values for users in a queue. This value will weight each user more or less than the other users in the queue. For example, if user A should receive 50% more resources in a queue than users B and C, this property will be set to 1.5 for user A. Users B and C will default to 1.0.
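The minimum-user-limit-percent example in the table (a value of 25 giving 50%, 33% and 25% for two, three and four or more users) follows one simple rule: each user's cap is an equal share of the queue, but never less than the configured minimum. A minimal sketch of that rule follows. It deliberately ignores user-limit-factor and per-user weights, and the function name is hypothetical.

```python
def max_user_share_percent(active_users: int, minimum_user_limit_percent: int) -> float:
    """Upper bound on one user's share of the queue, in percent.

    Equal split among the users with applications in the queue, floored at
    minimum-user-limit-percent (simplified: no user-limit-factor or weights).
    """
    if active_users < 1:
        raise ValueError("active_users must be at least 1")
    return max(100.0 / active_users, float(minimum_user_limit_percent))


if __name__ == "__main__":
    for users in (2, 3, 4, 10):
        share = max_user_share_percent(users, 25)
        print(f"{users} users -> at most {share:.0f}% each")
```

With the default of 100, the function returns 100 for any number of users, which matches the statement that no user limit is imposed.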

Resource Allocation using Absolute Resources configuration

CapacityScheduler supports configuration of absolute resources instead of providing queue capacity in percentage. As mentioned in the configuration section above for yarn.scheduler.capacity.<queue-path>.capacity and yarn.scheduler.capacity.<queue-path>.maximum-capacity, an administrator can specify an absolute resource value like [memory=10240,vcores=12]. This is a valid configuration which indicates 10 GB of memory and 12 vcores.
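To make the bracket notation concrete, here is a small parser for values such as [memory=10240,vcores=12]. It is only an illustration of the format shown above: it assumes plain integer values (memory in MB, as in the example) and is not how YARN itself parses the setting.

```python
def parse_absolute_resource(text: str) -> dict:
    """Parse '[memory=10240,vcores=12]' into {'memory': 10240, 'vcores': 12}."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"expected a value wrapped in [ ]: {text!r}")
    resources = {}
    for part in inner[1:-1].split(","):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        resources[key.strip()] = int(value)
    return resources


if __name__ == "__main__":
    res = parse_absolute_resource("[memory=10240,vcores=12]")
    print(f"{res['memory'] / 1024:.0f} GB memory, {res['vcores']} vcores")
```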

Running and Pending Application Limits

The CapacityScheduler supports the following parameters to control the running and pending applications:

Property | Description
yarn.scheduler.capacity.maximum-applications / yarn.scheduler.capacity.<queue-path>.maximum-applications | Maximum number of applications in the system which can be concurrently active, both running and pending. Limits on each queue are directly proportional to their queue capacities and user limits. This is a hard limit and any applications submitted when this limit is reached will be rejected. Default is 10000. This can be set for all queues with yarn.scheduler.capacity.maximum-applications and can also be overridden on a per queue basis by setting yarn.scheduler.capacity.<queue-path>.maximum-applications. Integer value expected.
yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent | Maximum percent of resources in the cluster which can be used to run application masters; controls the number of concurrent active applications. Limits on each queue are directly proportional to their queue capacities and user limits. Specified as a float, i.e. 0.5 = 50%. Default is 10%. This can be set for all queues with yarn.scheduler.capacity.maximum-am-resource-percent and can also be overridden on a per queue basis by setting yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.

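More configuration

See: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

To refresh Hadoop configuration without restarting the cluster, you can try the following:

1. Refresh the HDFS configuration

Run on both NameNode nodes (taking a three-node cluster as an example):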

hdfs dfsadmin -fs hdfs://node1:9000 -refreshSuperUserGroupsConfiguration
hdfs dfsadmin -fs hdfs://node2:9000 -refreshSuperUserGroupsConfiguration

2. Refresh the YARN configuration

Run on both NameNode nodes (taking a three-node cluster as an example):

yarn rmadmin -fs hdfs://node1:9000 -refreshSuperUserGroupsConfiguration
yarn rmadmin -fs hdfs://node2:9000 -refreshSuperUserGroupsConfiguration

Reference links

https://stackoverflow.com/questions/33465300/why-does-yarn-job-not-transition-to-running-state
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
https://stackoverflow.com/questions/29917540/capacity-scheduler
https://cloud.tencent.com/developer/article/1357111
https://cloud.tencent.com/developer/article/1194501

References

[1] Stack Overflow: https://stackoverflow.com/questions/33465300/why-does-yarn-job-not-transition-to-running-state
