This article explains what NominatedPods are in the Kubernetes Scheduler.
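When the PodPriority feature gate is enabled and the cluster runs short of resources, the scheduler preempts lower-priority Pods (the victims) on behalf of a pending Pod (the preemptor). The preemptor then re-enters the scheduling queue and is scheduled again once the victims have terminated gracefully.
Between the moment resources are preempted for the preemptor and the moment the preemptor is actually scheduled again, the scheduler needs to know that those resources are already claimed, so that it takes this into account when scheduling other, lower-priority Pods. For this reason the preemption phase sets pod.Status.NominatedNodeName on the preemptor, recording that preemption happened on that node and that the preemptor expects to be scheduled there.
The PriorityQueue caches the NominatedPods of each node. These are Pods that have been nominated for the node and expect to be scheduled on it, but have not yet been successfully scheduled there.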
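As a rough illustration of the shape of that cache, the sketch below groups pending Pods by their nominated node. The `pod` struct and `indexByNominatedNode` are hypothetical names invented for this example; the real scheduler stores `*v1.Pod` values in a `nominatedPods` map keyed by node name and maintains it incrementally rather than rebuilding it.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for v1.Pod.
type pod struct {
	Name              string
	NominatedNodeName string // mirrors pod.Status.NominatedNodeName
}

// indexByNominatedNode groups pending pods by the node that nominated them.
// Pods without a nomination are skipped, so they never appear in the index.
func indexByNominatedNode(pending []pod) map[string][]pod {
	index := make(map[string][]pod)
	for _, p := range pending {
		if p.NominatedNodeName == "" {
			continue
		}
		index[p.NominatedNodeName] = append(index[p.NominatedNodeName], p)
	}
	return index
}

func main() {
	index := indexByNominatedNode([]pod{
		{Name: "web", NominatedNodeName: "node-a"},
		{Name: "batch"}, // never preempted anything, so no nomination
		{Name: "db", NominatedNodeName: "node-a"},
	})
	// Number of nominated pods waiting for node-a and for node-b.
	fmt.Println(len(index["node-a"]), len(index["node-b"]))
}
```

Looking up a node that has no nominations simply yields an empty slice, which is how the scheduler can cheaply tell that a node needs no special handling.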
Let's focus on what the scheduler does when it preempts.
func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
	...
	node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
	...
	var nodeName = ""
	if node != nil {
		nodeName = node.Name
		err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
		if err != nil {
			glog.Errorf("Error in preemption process. Cannot update pod %v annotations: %v", preemptor.Name, err)
			return "", err
		}
		...
	}
	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of the annotation.
	for _, p := range nominatedPodsToClear {
		rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
		if rErr != nil {
			glog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
	return nodeName, err
}
The function invokes ScheduleAlgorithm.Preempt to perform the preemption, which returns the node where preemption takes place, the victims, and nominatedPodsToClear.
func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
	...
	candidateNode := pickOneNodeForPreemption(nodeToVictims)
	if candidateNode == nil {
		return nil, nil, nil, err
	}
	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
	if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
	}
	return nil, nil, nil, fmt.Errorf(
		"preemption failed: the target node %s has been deleted from scheduler cache",
		candidateNode.Name)
}
func (g *genericScheduler) getLowerPriorityNominatedPods(pod *v1.Pod, nodeName string) []*v1.Pod {
	pods := g.schedulingQueue.WaitingPodsForNode(nodeName)
	if len(pods) == 0 {
		return nil
	}
	var lowerPriorityPods []*v1.Pod
	podPriority := util.GetPodPriority(pod)
	for _, p := range pods {
		if util.GetPodPriority(p) < podPriority {
			lowerPriorityPods = append(lowerPriorityPods, p)
		}
	}
	return lowerPriorityPods
}
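node: the best node on which to preempt;
victims: the Pods to delete in order to free resources for the preemptor;
nominatedPodsToClear: the Pods whose .Status.NominatedNodeName is about to be cleared. These are Pods held in the PriorityQueue's nominatedPods cache for that node whose priority is lower than the preemptor's, which means they are no longer a good fit for the node they picked when they preempted earlier.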
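To make the selection of nominatedPodsToClear concrete, here is a small worked sketch with made-up priorities. The `pod` struct and `lowerPriorityNominated` are hypothetical simplifications of `getLowerPriorityNominatedPods`. The point to notice is the strict `<` comparison: a nominated Pod with the same priority as the preemptor keeps its nomination.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for v1.Pod carrying only what the filter needs.
type pod struct {
	Name     string
	Priority int32
}

// lowerPriorityNominated returns the nominated pods whose priority is strictly
// lower than the preemptor's. Only these lose their nomination.
func lowerPriorityNominated(preemptor pod, nominated []pod) []pod {
	var out []pod
	for _, p := range nominated {
		if p.Priority < preemptor.Priority {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	nominated := []pod{
		{Name: "low", Priority: 10},
		{Name: "equal", Priority: 100},
		{Name: "high", Priority: 1000},
	}
	preemptor := pod{Name: "preemptor", Priority: 100}
	for _, p := range lowerPriorityNominated(preemptor, nominated) {
		fmt.Println(p.Name) // only "low" is selected
	}
}
```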
If preemption succeeds (node is non-nil), podPreemptor.SetNominatedNodeName is called to set the preemptor's .Status.NominatedNodeName to that node's name, indicating that the preemptor expects to land on this node.
func (p *podPreemptor) SetNominatedNodeName(pod *v1.Pod, nominatedNodeName string) error {
	podCopy := pod.DeepCopy()
	podCopy.Status.NominatedNodeName = nominatedNodeName
	_, err := p.Client.CoreV1().Pods(pod.Namespace).UpdateStatus(podCopy)
	return err
}
Whether or not preemption succeeds (node is nil or not), nominatedPodsToClear may be non-empty. Every Pod in it is therefore visited, and podPreemptor.RemoveNominatedNodeName is called to reset its .Status.NominatedNodeName to the empty string.
func (p *podPreemptor) RemoveNominatedNodeName(pod *v1.Pod) error {
	if len(pod.Status.NominatedNodeName) == 0 {
		return nil
	}
	return p.SetNominatedNodeName(pod, "")
}
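One consequence of the early return in RemoveNominatedNodeName is that clearing an already-empty nomination costs no API call. The sketch below demonstrates this with a counting fake. `statusUpdater` and `countingUpdater` are hypothetical types made up for the example and are not part of client-go.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for v1.Pod.
type pod struct {
	Name              string
	NominatedNodeName string
}

// statusUpdater is a hypothetical abstraction over the API call that
// persists a pod's status.
type statusUpdater interface {
	UpdateStatus(p pod) error
}

// countingUpdater records how many status updates were sent.
type countingUpdater struct{ calls int }

func (c *countingUpdater) UpdateStatus(p pod) error {
	c.calls++
	return nil
}

// setNominatedNodeName works on a copy (p is passed by value), so the
// caller's object is left untouched, much like DeepCopy in the real code.
func setNominatedNodeName(u statusUpdater, p pod, nodeName string) error {
	p.NominatedNodeName = nodeName
	return u.UpdateStatus(p)
}

// removeNominatedNodeName skips the update when there is nothing to clear.
func removeNominatedNodeName(u statusUpdater, p pod) error {
	if p.NominatedNodeName == "" {
		return nil
	}
	return setNominatedNodeName(u, p, "")
}

func main() {
	u := &countingUpdater{}
	_ = removeNominatedNodeName(u, pod{Name: "a"})                              // no-op
	_ = removeNominatedNodeName(u, pod{Name: "b", NominatedNodeName: "node-a"}) // one update
	fmt.Println(u.calls)
}
```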
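After a successful preemption, the preemptor is added back to the unschedulable sub-queue of the PriorityQueue, where it waits for a condition that triggers scheduling again. For a deeper look at this part, see my blog post "In-Depth Analysis of the Kubernetes Scheduler's Priority Queue". When the preemptor is scheduled again, each node is run through the predicate logic in podFitsOnNode.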
func podFitsOnNode(
	pod *v1.Pod,
	meta algorithm.PredicateMetadata,
	info *schedulercache.NodeInfo,
	predicateFuncs map[string]algorithm.FitPredicate,
	ecache *EquivalenceCache,
	queue SchedulingQueue,
	alwaysCheckAllPredicates bool,
	equivCacheInfo *equivalenceClassInfo,
) (bool, []algorithm.PredicateFailureReason, error) {
	var (
		eCacheAvailable  bool
		failedPredicates []algorithm.PredicateFailureReason
	)
	predicateResults := make(map[string]HostPredicate)
	podsAdded := false
	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			// Author's note: at first glance !podsAdded looks inverted, but it is correct.
			// With no nominated pods added, a second pass would only repeat the first.
			break
		}
		// Bypass eCache if node has any nominated pods.
		// TODO(bsalamat): consider using eCache and adding proper eCache invalidations
		// when pods are nominated or their nominations change.
		eCacheAvailable = equivCacheInfo != nil && !podsAdded
		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []algorithm.PredicateFailureReason
				err     error
			)
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				func() {
					var invalid bool
					if eCacheAvailable {
						...
					}
					if !eCacheAvailable || invalid {
						// we need to execute predicate functions since equivalence cache does not work
						fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
						if err != nil {
							return
						}
						...
					}
				}()
				...
			}
		}
	}
	return len(failedPredicates) == 0, failedPredicates, nil
}
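The predicates may be run up to two times:
In the first pass, addNominatedPods is called. It iterates over the nominatedPods that the PriorityQueue holds for this node and adds those whose priority is greater than or equal to that of the Pod being scheduled to a copy of the SchedulerCache's NodeInfo, updating the PredicateMetadata accordingly. Scheduling this Pod therefore has to account for those higher-priority nominatedPods, for example by subtracting their resource requests from what the node has left. The normal predicate logic then runs.
In the second pass, if any predicate failed in the first pass, or podsAdded is false (addNominatedPods returns false when the node's nominatedPods cache is empty), the loop breaks immediately and no predicate is actually evaluated.
In the second pass, if every predicate succeeded in the first pass and podsAdded is true, the predicates really are evaluated a second time, this time without the nominatedPods. The first pass could have succeeded only because of them, for example through inter-pod affinity, even though they are not actually running on the node yet.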
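The two-pass behaviour described above can be sketched with two toy predicates. Everything here is a simplified assumption made for illustration: the `pod` and `node` structs, the CPU-only resource check, and the label-based `satisfiesAffinity` are hypothetical, and the sketch returns on the first failure instead of collecting failure reasons as podFitsOnNode does.

```go
package main

import "fmt"

// pod and node are hypothetical, heavily simplified stand-ins.
type pod struct {
	Name       string
	Priority   int32
	CPU        int64
	Labels     map[string]bool
	NeedsLabel string // toy affinity: some pod on the node must carry this label
}

type node struct {
	CPUCapacity int64
	Pods        []pod
}

type predicate func(p pod, n node) bool

func fitsResources(p pod, n node) bool {
	used := int64(0)
	for _, q := range n.Pods {
		used += q.CPU
	}
	return used+p.CPU <= n.CPUCapacity
}

func satisfiesAffinity(p pod, n node) bool {
	if p.NeedsLabel == "" {
		return true
	}
	for _, q := range n.Pods {
		if q.Labels[p.NeedsLabel] {
			return true
		}
	}
	return false
}

// withNominated returns a copy of n that also holds the nominated pods of
// equal or higher priority. The bool reports whether any nominated pods exist.
func withNominated(p pod, n node, nominated []pod) (bool, node) {
	if len(nominated) == 0 {
		return false, n
	}
	view := node{CPUCapacity: n.CPUCapacity, Pods: append([]pod(nil), n.Pods...)}
	for _, q := range nominated {
		if q.Priority >= p.Priority {
			view.Pods = append(view.Pods, q)
		}
	}
	return true, view
}

// podFits runs the predicates with nominated pods first, then without them.
func podFits(p pod, n node, nominated []pod, preds []predicate) bool {
	added := false
	for pass := 0; pass < 2; pass++ {
		view := n
		if pass == 0 {
			added, view = withNominated(p, n, nominated)
		} else if !added {
			break // nothing was added, so pass two would repeat pass one
		}
		for _, pred := range preds {
			if !pred(p, view) {
				return false
			}
		}
	}
	return true
}

func main() {
	n := node{CPUCapacity: 4, Pods: []pod{{Name: "running", CPU: 1}}}
	nominated := []pod{{Name: "nominated", Priority: 100, CPU: 2, Labels: map[string]bool{"cache": true}}}
	preds := []predicate{fitsResources, satisfiesAffinity}

	fmt.Println(podFits(pod{Name: "too-big", CPU: 2}, n, nominated, preds))
	fmt.Println(podFits(pod{Name: "needs-cache", CPU: 1, NeedsLabel: "cache"}, n, nominated, preds))
	fmt.Println(podFits(pod{Name: "small", CPU: 1}, n, nominated, preds))
}
```

In this toy setup, "too-big" is rejected in the first pass because the nominated Pod's CPU request is counted, "needs-cache" passes the first pass but is rejected in the second because the Pod it depends on is not running yet, and "small" passes both.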
Below is the code of addNominatedPods, which builds a temporary schedulercache.NodeInfo and algorithm.PredicateMetadata for the individual predicate functions to work on.
// addNominatedPods adds pods with equal or greater priority which are nominated
// to run on the node given in nodeInfo to meta and nodeInfo. It returns 1) whether
// any pod was found, 2) augmented meta data, 3) augmented nodeInfo.
func addNominatedPods(podPriority int32, meta algorithm.PredicateMetadata,
	nodeInfo *schedulercache.NodeInfo, queue SchedulingQueue) (bool, algorithm.PredicateMetadata,
	*schedulercache.NodeInfo) {
	if queue == nil || nodeInfo == nil || nodeInfo.Node() == nil {
		// This may happen only in tests.
		return false, meta, nodeInfo
	}
	nominatedPods := queue.WaitingPodsForNode(nodeInfo.Node().Name)
	if nominatedPods == nil || len(nominatedPods) == 0 {
		return false, meta, nodeInfo
	}
	var metaOut algorithm.PredicateMetadata
	if meta != nil {
		metaOut = meta.ShallowCopy()
	}
	nodeInfoOut := nodeInfo.Clone()
	for _, p := range nominatedPods {
		if util.GetPodPriority(p) >= podPriority {
			nodeInfoOut.AddPod(p)
			if metaOut != nil {
				metaOut.AddPod(p, nodeInfoOut)
			}
		}
	}
	return true, metaOut, nodeInfoOut
}
// WaitingPodsForNode returns pods that are nominated to run on the given node,
// but they are waiting for other pods to be removed from the node before they
// can be actually scheduled.
func (p *PriorityQueue) WaitingPodsForNode(nodeName string) []*v1.Pod {
	p.lock.RLock()
	defer p.lock.RUnlock()
	if list, ok := p.nominatedPods[nodeName]; ok {
		return list
	}
	return nil
}
The logic of addNominatedPods is as follows:
Call WaitingPodsForNode to fetch the nominatedPods cached in the PriorityQueue for this node. If there are none, return podsAdded as false and stop.
Copy the PredicateMetadata and NodeInfo objects, then iterate over nominatedPods and add each nominated Pod whose priority is not lower than that of the Pod being scheduled to the copied NodeInfo, updating the copied PredicateMetadata as well. These copies are what the predicate functions eventually receive. When the iteration finishes, return podsAdded as true together with the copied PredicateMetadata and NodeInfo.
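"In-Depth Analysis of the Kubernetes Scheduler's Priority Queue" covers how the EventHandlers registered on the scheduler's podInformer, nodeInformer, serviceInformer, pvcInformer and so on operate on the PriorityQueue. The operations that touch NominatedPods are listed below.
When a Pod is added to the PriorityQueue's active queue, addNominatedPodIfNeeded is called to add it to the nominatedPods cache, unless a Pod with the same UID is already there. If the Pod is unexpectedly found in the unschedulable queue, it is first removed from both the nominatedPods cache and the unschedulable queue.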
// Add adds a pod to the active queue. It should be called only when a new pod
// is added so there is no chance the pod is already in either queue.
func (p *PriorityQueue) Add(pod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()
	err := p.activeQ.Add(pod)
	if err != nil {
		glog.Errorf("Error adding pod %v to the scheduling queue: %v", pod.Name, err)
	} else {
		if p.unschedulableQ.get(pod) != nil {
			glog.Errorf("Error: pod %v is already in the unschedulable queue.", pod.Name)
			p.deleteNominatedPodIfExists(pod)
			p.unschedulableQ.delete(pod)
		}
		p.addNominatedPodIfNeeded(pod)
		p.cond.Broadcast()
	}
	return err
}
func (p *PriorityQueue) addNominatedPodIfNeeded(pod *v1.Pod) {
	nnn := NominatedNodeName(pod)
	if len(nnn) > 0 {
		for _, np := range p.nominatedPods[nnn] {
			if np.UID == pod.UID {
				glog.Errorf("Pod %v/%v already exists in the nominated map!", pod.Namespace, pod.Name)
				return
			}
		}
		p.nominatedPods[nnn] = append(p.nominatedPods[nnn], pod)
	}
}
When a Pod is added to the PriorityQueue's unschedulableQ, addNominatedPodIfNeeded is likewise called to add it to the nominatedPods cache.
func (p *PriorityQueue) AddUnschedulableIfNotPresent(pod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()
	if p.unschedulableQ.get(pod) != nil {
		return fmt.Errorf("pod is already present in unschedulableQ")
	}
	if _, exists, _ := p.activeQ.Get(pod); exists {
		return fmt.Errorf("pod is already present in the activeQ")
	}
	if !p.receivedMoveRequest && isPodUnschedulable(pod) {
		p.unschedulableQ.addOrUpdate(pod)
		p.addNominatedPodIfNeeded(pod)
		return nil
	}
	err := p.activeQ.Add(pod)
	if err == nil {
		p.addNominatedPodIfNeeded(pod)
		p.cond.Broadcast()
	}
	return err
}
Note that a Pod is added to the nominatedPods cache only if its .Status.NominatedNodeName is non-empty.
When a Pod in the PriorityQueue is updated, updateNominatedPod is called as well to refresh the nominatedPods cache.
// Update updates a pod in the active queue if present. Otherwise, it removes
// the item from the unschedulable queue and adds the updated one to the active
// queue.
func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()
	// If the pod is already in the active queue, just update it there.
	if _, exists, _ := p.activeQ.Get(newPod); exists {
		p.updateNominatedPod(oldPod, newPod)
		err := p.activeQ.Update(newPod)
		return err
	}
	// If the pod is in the unschedulable queue, updating it may make it schedulable.
	if usPod := p.unschedulableQ.get(newPod); usPod != nil {
		p.updateNominatedPod(oldPod, newPod)
		if isPodUpdated(oldPod, newPod) {
			p.unschedulableQ.delete(usPod)
			err := p.activeQ.Add(newPod)
			if err == nil {
				p.cond.Broadcast()
			}
			return err
		}
		p.unschedulableQ.addOrUpdate(newPod)
		return nil
	}
	// If pod is not in any of the two queue, we put it in the active queue.
	err := p.activeQ.Add(newPod)
	if err == nil {
		p.addNominatedPodIfNeeded(newPod)
		p.cond.Broadcast()
	}
	return err
}
updateNominatedPod refreshes the PriorityQueue's nominatedPods cache by deleting oldPod first and then adding newPod.
// updateNominatedPod updates a pod in the nominatedPods.
func (p *PriorityQueue) updateNominatedPod(oldPod, newPod *v1.Pod) {
	// Even if the nominated node name of the Pod is not changed, we must delete and add it again
	// to ensure that its pointer is updated.
	p.deleteNominatedPodIfExists(oldPod)
	p.addNominatedPodIfNeeded(newPod)
}
Before a Pod is deleted from the PriorityQueue, deleteNominatedPodIfExists is called first to remove it from the nominatedPods cache.
// Delete deletes the item from either of the two queues. It assumes the pod is
// only in one queue.
func (p *PriorityQueue) Delete(pod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.deleteNominatedPodIfExists(pod)
	err := p.activeQ.Delete(pod)
	if err != nil { // The item was probably not found in the activeQ.
		p.unschedulableQ.delete(pod)
	}
	return nil
}
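deleteNominatedPodIfExists first checks whether the Pod's .Status.NominatedNodeName is empty:
If it is empty, nothing is done and the function returns.
If it is not empty, the function walks the nominatedPods cached for that node. A Pod with a matching UID means the Pod is present in nominatedPods, so it is removed from the cache. If the nominated node has no nominatedPods left after the removal, the node's entry is deleted from the map altogether.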
func (p *PriorityQueue) deleteNominatedPodIfExists(pod *v1.Pod) {
	nnn := NominatedNodeName(pod)
	if len(nnn) > 0 {
		for i, np := range p.nominatedPods[nnn] {
			if np.UID == pod.UID {
				p.nominatedPods[nnn] = append(p.nominatedPods[nnn][:i], p.nominatedPods[nnn][i+1:]...)
				if len(p.nominatedPods[nnn]) == 0 {
					delete(p.nominatedPods, nnn)
				}
				break
			}
		}
	}
}
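The add, update and delete paths above can be exercised together in a small standalone sketch. The `pod` struct and the `nominatedCache` type are hypothetical simplifications (no locking, no logging). The sketch is only meant to show why an update deletes and re-adds the entry: the cache stores pointers, so the entry must point at the new object even when the nominated node is unchanged.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for v1.Pod.
type pod struct {
	UID               string
	NominatedNodeName string
	Priority          int32
}

// nominatedCache maps a node name to the pods nominated for that node.
type nominatedCache map[string][]*pod

// addIfNeeded adds p only if it carries a nomination and is not cached yet.
func (c nominatedCache) addIfNeeded(p *pod) {
	nnn := p.NominatedNodeName
	if nnn == "" {
		return
	}
	for _, np := range c[nnn] {
		if np.UID == p.UID {
			return
		}
	}
	c[nnn] = append(c[nnn], p)
}

// deleteIfExists removes p by UID and drops the node entry once it is empty.
func (c nominatedCache) deleteIfExists(p *pod) {
	nnn := p.NominatedNodeName
	for i, np := range c[nnn] {
		if np.UID == p.UID {
			c[nnn] = append(c[nnn][:i], c[nnn][i+1:]...)
			if len(c[nnn]) == 0 {
				delete(c, nnn)
			}
			return
		}
	}
}

// update replaces the cached pointer, which also handles a changed nomination.
func (c nominatedCache) update(oldPod, newPod *pod) {
	c.deleteIfExists(oldPod)
	c.addIfNeeded(newPod)
}

func main() {
	cache := nominatedCache{}
	old := &pod{UID: "u1", NominatedNodeName: "node-a", Priority: 10}
	cache.addIfNeeded(old)

	updated := &pod{UID: "u1", NominatedNodeName: "node-a", Priority: 20}
	cache.update(old, updated)
	fmt.Println(cache["node-a"][0] == updated) // the cache now holds the new pointer

	cache.deleteIfExists(updated)
	_, exists := cache["node-a"]
	fmt.Println(exists) // the empty node entry was removed from the map
}
```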