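How to Understand cgroups in Docker

Published: 2021-10-19 18:32:23 | Source: 億速云 | Views: 166 | Author: 柒染 | Category: Big Data

This article explains how to understand cgroups in Docker. It is shared here as a reference and should leave you with a working understanding of the topic.

Understanding Docker mainly comes down to a few areas: namespaces, cgroups, union file systems, the runtime (runC), and networking. We will spend some time on each in turn.

Namespaces provide isolation. cgroups provide resource limits. Union file systems handle layered storage and management of images. runC is the runtime; it follows the OCI interface and is generally based on libcontainer. Networking covers Docker's single-host networking and its multi-host communication modes.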

Introduction to cgroups

What are cgroups?

Cgroup is short for control group. It is a Linux kernel feature that limits and isolates the system resources used by a group of processes, in other words resource QoS. The resources involved are mainly CPU, memory, block I/O, and network bandwidth. Cgroups entered the mainline kernel in 2.6.24, and all major distributions now enable them by default.
Cgroups provide four main functions:

Resource limitation: cgroups can cap the total resources used by a process group. For example, you can set an upper bound on the memory an application may use at run time; once that quota is exceeded, an OOM (Out of Memory) condition is raised.
Prioritization: by controlling how many CPU time slices and how much disk I/O bandwidth are allocated, cgroups effectively control the priority at which processes run.
Accounting: cgroups can measure resource usage, such as CPU time and memory consumption, which makes them well suited to billing.
Control: cgroups can suspend and resume a process group.

The three components of cgroups

cgroup (control group). A cgroup is a mechanism for managing processes in groups. One cgroup contains a set of processes, and Linux subsystem parameters can be configured on it, which ties a set of processes to a set of subsystem settings.
subsystem. A subsystem is a module that controls one kind of resource. Subsystems are covered in detail below.
hierarchy. A hierarchy arranges a set of cgroups into a tree; one such tree is one hierarchy. The tree structure is what gives cgroups inheritance. For example, suppose cgroup1 limits the CPU usage of a group of scheduled jobs, and one of those jobs, which periodically dumps logs, also needs a disk I/O limit. To avoid affecting the other processes, you can create cgroup2 as a child of cgroup1 and limit disk I/O there. cgroup2 then inherits the CPU limit from cgroup1 and adds its own disk I/O limit, without affecting the other processes in cgroup1.
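
The inheritance described above can be modeled in a few lines. This is only an in-memory sketch of the idea, not a kernel or runc API: the `Cgroup` type, the `EffectiveLimit` method, and the resource names `cpu_percent` and `io_mbps` are all hypothetical. It assumes that a child is bound by the tightest limit set anywhere on its path to the root.

```go
package main

import "fmt"

// Cgroup is a hypothetical in-memory model of one node in a hierarchy.
// Limits maps a resource name to its limit; a missing key means the
// limit is not set on this node.
type Cgroup struct {
	Name   string
	Parent *Cgroup
	Limits map[string]int64
}

// EffectiveLimit walks from the node up to the root and returns the
// tightest limit found for the resource. ok is false when neither the
// node nor any ancestor sets one.
func (c *Cgroup) EffectiveLimit(resource string) (limit int64, ok bool) {
	for node := c; node != nil; node = node.Parent {
		v, set := node.Limits[resource]
		if !set {
			continue
		}
		if !ok || v < limit {
			limit, ok = v, true
		}
	}
	return limit, ok
}

func main() {
	cgroup1 := &Cgroup{
		Name:   "cgroup1",
		Limits: map[string]int64{"cpu_percent": 50},
	}
	cgroup2 := &Cgroup{
		Name:   "cgroup2",
		Parent: cgroup1,
		Limits: map[string]int64{"io_mbps": 10},
	}

	cpu, _ := cgroup2.EffectiveLimit("cpu_percent") // inherited from cgroup1
	io, _ := cgroup2.EffectiveLimit("io_mbps")      // set on cgroup2 itself
	_, parentHasIO := cgroup1.EffectiveLimit("io_mbps")
	fmt.Println(cpu, io, parentHasIO) // prints "50 10 false"
}
```

The child sees both limits, while the parent is unaffected by the child's I/O limit, which matches the cgroup1/cgroup2 example.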

cgroups subsystems

The subsystems implemented by cgroups, and what each one does:

devices: controls access to devices.
cpuset: assigns specific CPUs and memory nodes.
cpu: controls CPU utilization.
cpuacct: accounts for CPU usage.
memory: caps memory usage.
freezer: freezes (suspends) the processes in a cgroup.
net_cls: works with tc (traffic control) to limit network bandwidth.
net_prio: sets the priority of a process's network traffic.
hugetlb: limits HugeTLB usage.
perf_event: lets the perf tool do performance monitoring per cgroup.

Each subsystem's directory contains more detailed settings. For example:
cpu:
Besides limiting how much CPU is used, cgroups can pin tasks to specific CPUs so that they run only on those CPUs. This is what the cpuset subsystem does. In addition to CPUs, it can also bind tasks to memory nodes.
Before adding a task to a cpuset's tasks file, you must set the cpuset.cpus and cpuset.mems parameters.

cpuset.cpus: the CPUs that tasks in the cgroup may use, as a comma-separated list in which a hyphen (-) denotes a range. For example, 0-2,7 means CPUs 0, 1, 2, and 7.
cpuset.mems: the memory nodes that tasks in the cgroup may use, in the same format as cpuset.cpus.

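memory:

memory.limit_in_bytes: hard limit on memory usage. The value accepts the suffixes k, m, and g; -1 means no limit.
memory.soft_limit_in_bytes: soft limit, meaningful only when it is set lower than the hard limit. Same format as above. When memory is tight across the system, the memory available to tasks is held to the soft limit so that not too many processes are starved of memory. As this shows, adding memory limits does not mean there is no contention for the resource.
memory.memsw.limit_in_bytes: limit on the sum of memory and swap usage. Same format as above.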
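
The value format for these limits can be sketched as a small parser. The function name `parseMemoryLimit` is hypothetical; the sketch assumes the suffixes are binary multiples (k = 1024, m = 1024², g = 1024³) and does not guard against integer overflow.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryLimit converts values such as "512m", "1g", "4096" or "-1"
// into a byte count. unlimited is true for "-1".
func parseMemoryLimit(s string) (bytes int64, unlimited bool, err error) {
	s = strings.ToLower(strings.TrimSpace(s))
	if s == "-1" {
		return 0, true, nil
	}
	multiplier := int64(1)
	switch {
	case strings.HasSuffix(s, "k"):
		multiplier = 1 << 10
	case strings.HasSuffix(s, "m"):
		multiplier = 1 << 20
	case strings.HasSuffix(s, "g"):
		multiplier = 1 << 30
	}
	if multiplier != 1 {
		s = s[:len(s)-1] // drop the unit suffix
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n < 0 {
		return 0, false, fmt.Errorf("invalid memory limit %q", s)
	}
	return n * multiplier, false, nil
}

func main() {
	for _, v := range []string{"512m", "1g", "-1"} {
		b, unlimited, err := parseMemoryLimit(v)
		fmt.Println(v, b, unlimited, err)
	}
}
```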

The monitoring and accounting parameters deserve a separate mention, for example the ones that cAdvisor collects.

memory.usage_in_bytes: reports the total memory currently used by the processes in the cgroup, in bytes.
memory.max_usage_in_bytes: reports the peak memory usage of the processes in the cgroup.

How Docker uses cgroups

Create a container

# Run a container that will spawn 300 processes.
docker run cirocosta/stress pid  -n 300
Starting to spawn 300 blocking children
[1] Waiting for SIGINT

# Open another window and see that we have 300
# PIDS
docker stats
CONTAINER      …   MEM USAGE / LIMIT          PIDS
a730051832     …   21.02MiB / 1.951GiB     300

Verify whether Docker has placed this container in any cgroups

# Let's get the ID of the container. Docker uses that ID
# to name things on the host, so we can probably use it to
# find the cgroup created for the container
# under the parent docker cgroup
docker ps
CONTAINER ID        IMAGE               COMMAND       
a730051832e7        cirocosta/stress    "pid -n 300"  

 # Having the prefix in hands, let's search for it under the
 # mountpoint for cgroups in our system
 find  /sys/fs/cgroup/ -name "a730051832e7*"
 
/sys/fs/cgroup/cpu,cpuacct/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/cpuset/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/devices/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/pids/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/freezer/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/perf_event/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/blkio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/memory/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/net_cls,net_prio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/hugetlb/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/systemd/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959

# There they are! Docker creates a control group with the name
# being the exact ID of the container under all the subsystems.

# What can we discover from this inspection? We can look at the
# subsystem that we want to place constraints on (PIDs), for instance:

 tree /sys/fs/cgroup/pids/docker/a7300518327d...
/sys/fs/cgroup/pids/docker/a73005183...
├── cgroup.clone_children
├── cgroup.procs
├── notify_on_release
├── pids.current
├── pids.events
├── pids.max
└── tasks

# Which means that, if we want to know how many PIDs are in use right
# now, we can look at 'pids.current'; to know the limit, 'pids.max'; and
# to know which processes have been assigned to this control group,
# 'tasks'. Let's do it:
cat /sys/fs/cgroup/pids/docker/a730...c660a75a959/tasks 
5329
5371
5372
5373
5374
5375
5376
5377
(...)
# continues until the 300th entry - as we have 300 processes in this container

# 300 pids
cat /sys/fs/cgroup/pids/docker/a730051832e7d7764...9/pids.current
300

# no max set
cat /sys/fs/cgroup/pids/docker/a730051832e7d77.../pids.max 
max
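
The two files read above can be combined into a simple headroom check. This is a sketch under the assumption that pids.current always holds a number and pids.max holds either a number or the literal string max, as in the transcript; the function name `pidsHeadroom` is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pidsHeadroom takes the raw contents of pids.current and pids.max and
// reports how many more tasks the cgroup may create. unlimited is true
// when pids.max contains the literal "max".
func pidsHeadroom(current, maxField string) (remaining int64, unlimited bool, err error) {
	cur, err := strconv.ParseInt(strings.TrimSpace(current), 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("pids.current: %w", err)
	}
	maxField = strings.TrimSpace(maxField)
	if maxField == "max" {
		return 0, true, nil
	}
	limit, err := strconv.ParseInt(maxField, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("pids.max: %w", err)
	}
	if cur >= limit {
		return 0, false, nil
	}
	return limit - cur, false, nil
}

func main() {
	// Values taken from the transcript: 300 tasks, no limit set.
	remaining, unlimited, err := pidsHeadroom("300\n", "max\n")
	fmt.Println(remaining, unlimited, err) // prints "0 true <nil>"
}
```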

PS

When installing k8s, you will often run into the following error:
```
create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs"
is different from docker cgroup driver: "systemd"
```
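The message is self-explanatory: Docker and the kubelet are configured with different cgroup drivers. Docker supports two drivers, systemd and cgroupfs, and the runc code makes the distinction concrete.
cgroups can only limit CPU usage; they cannot guarantee it. In other words, --cpuset-cpus makes a container run on the specified CPUs or cores but does not give it exclusive use of them, and --cpu-shares is a relative weight that takes effect only when CPU is scarce. When there is enough CPU, every container gets as much as it needs; when there is not, CPU time is divided among the containers according to their weights.
For memory, cgroups can cap how much a container may use. The -m option sets that cap.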
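
Because --cpu-shares is relative, its effect under contention is just a proportion. The following worked example assumes every container is competing for CPU at the same time; the container names `web` and `batch` and the function `cpuFractions` are hypothetical.

```go
package main

import "fmt"

// cpuFractions returns the fraction of CPU time each container would
// receive if all of them were competing at once, given their shares.
func cpuFractions(shares map[string]uint64) map[string]float64 {
	var total uint64
	for _, s := range shares {
		total += s
	}
	fractions := make(map[string]float64, len(shares))
	if total == 0 {
		return fractions
	}
	for name, s := range shares {
		fractions[name] = float64(s) / float64(total)
	}
	return fractions
}

func main() {
	f := cpuFractions(map[string]uint64{"web": 1024, "batch": 512})
	fmt.Printf("web=%.2f batch=%.2f\n", f["web"], f["batch"]) // prints "web=0.67 batch=0.33"
}
```

When the host has idle CPU these fractions do not apply: either container may use more than its proportional share.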

Code walkthrough

The cgroups code in runc is worth reading in full; only an outline is given here.
A container is created by the factory calling its create method. On the cgroups side, the factory provides a way to construct a new CgroupsManager according to the cgroup driver setting in the configuration, with two implementations, systemd and cgroupfs:

// SystemdCgroups is an options func to configure a LinuxFactory to return
// containers that use systemd to create and manage cgroups.
func SystemdCgroups(l *LinuxFactory) error {
    l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
        return &systemd.Manager{
            Cgroups: config,
            Paths:   paths,
        }
    }
    return nil
}

// Cgroupfs is an options func to configure a LinuxFactory to return containers
// that use the native cgroups filesystem implementation to create and manage
// cgroups.
func Cgroupfs(l *LinuxFactory) error {
    l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
        return &fs.Manager{
            Cgroups: config,
            Paths:   paths,
        }
    }
    return nil
}
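
The two functions above follow Go's functional-options pattern: each option overwrites a constructor stored on the factory. Below is a self-contained sketch of that pattern. The types `Manager`, `Factory`, and `Option` and the functions `WithSystemd`, `WithCgroupfs`, and `NewFactory` are simplified stand-ins invented for illustration, not the runc API, and defaulting to cgroupfs is this sketch's own choice.

```go
package main

import "fmt"

// Manager is a deliberately tiny stand-in for a cgroup manager interface.
type Manager interface {
	Driver() string
}

type systemdManager struct{}

func (systemdManager) Driver() string { return "systemd" }

type fsManager struct{}

func (fsManager) Driver() string { return "cgroupfs" }

// Factory stores a constructor rather than a manager, so the driver is
// chosen once and a fresh manager is built per container.
type Factory struct {
	NewManager func() Manager
}

// Option configures a Factory.
type Option func(*Factory) error

// WithSystemd selects the systemd-backed manager.
func WithSystemd(f *Factory) error {
	f.NewManager = func() Manager { return systemdManager{} }
	return nil
}

// WithCgroupfs selects the filesystem-backed manager.
func WithCgroupfs(f *Factory) error {
	f.NewManager = func() Manager { return fsManager{} }
	return nil
}

// NewFactory applies a default driver, then any caller-supplied options.
func NewFactory(opts ...Option) (*Factory, error) {
	f := &Factory{}
	if err := WithCgroupfs(f); err != nil {
		return nil, err
	}
	for _, opt := range opts {
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := NewFactory(WithSystemd)
	if err != nil {
		panic(err)
	}
	fmt.Println(f.NewManager().Driver()) // prints "systemd"
}
```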

The cgroup manager is abstracted as an interface:

type Manager interface {
    // Applies cgroup configuration to the process with the specified pid
    Apply(pid int) error

    // Returns the PIDs inside the cgroup set
    GetPids() ([]int, error)

    // Returns the PIDs inside the cgroup set & all sub-cgroups
    GetAllPids() ([]int, error)

    // Returns statistics for the cgroup set
    GetStats() (*Stats, error)

    // Toggles the freezer cgroup according with specified state
    Freeze(state configs.FreezerState) error

    // Destroys the cgroup set
    Destroy() error

    // The option func SystemdCgroups() and Cgroupfs() require following attributes:
    //     Paths   map[string]string
    //     Cgroups *configs.Cgroup
    // Paths maps cgroup subsystem to path at which it is mounted.
    // Cgroups specifies specific cgroup settings for the various subsystems

    // Returns cgroup paths to save in a state file and to be able to
    // restore the object later.
    GetPaths() map[string]string

    // Sets the cgroup as configured.
    Set(container *configs.Config) error
}

While a container is being created, the methods of this interface are called. For example,
in container_linux.go:

func (c *linuxContainer) Set(config configs.Config) error {
    c.m.Lock()
    defer c.m.Unlock()
    status, err := c.currentStatus()
    if err != nil {
        return err
    }
    ...
    if err := c.cgroupManager.Set(&config); err != nil {
        // Set configs back
        if err2 := c.cgroupManager.Set(c.config); err2 != nil {
            logrus.Warnf("Setting back cgroup configs failed due to error: %v, your state.json and actual configs might be inconsistent.", err2)
        }
        return err
    }
...
}
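
The error path in Set is an apply-then-roll-back pattern: if the new configuration fails partway, the old one is re-applied and the original error is returned. The sketch below isolates that pattern. `Config`, `fakeManager`, and `update` are hypothetical, and the fake manager is written to fail after a partial write so that the rollback has something to undo.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, minimal resource configuration.
type Config struct {
	MemoryLimit int64
	PidsLimit   int64
}

// fakeManager simulates writing two control files in sequence; the
// second write fails for a negative PidsLimit, leaving a partial update.
type fakeManager struct {
	applied Config
}

func (m *fakeManager) Set(cfg *Config) error {
	m.applied.MemoryLimit = cfg.MemoryLimit // first write succeeds
	if cfg.PidsLimit < 0 {
		return errors.New("invalid pids limit") // second write fails
	}
	m.applied.PidsLimit = cfg.PidsLimit
	return nil
}

// update applies next. On failure it re-applies previous and returns
// the original error; a failed rollback is only reported.
func update(m *fakeManager, previous, next *Config) error {
	if err := m.Set(next); err != nil {
		if err2 := m.Set(previous); err2 != nil {
			fmt.Printf("rollback failed: %v\n", err2)
		}
		return err
	}
	return nil
}

func main() {
	previous := &Config{MemoryLimit: 100, PidsLimit: 10}
	m := &fakeManager{applied: *previous}

	err := update(m, previous, &Config{MemoryLimit: 200, PidsLimit: -1})
	fmt.Println(err, m.applied == *previous) // prints "invalid pids limit true"
}
```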

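Next, let's focus on the fs implementation.

[Figure: source files in the fs package]
In fs, essentially every subsystem has its own source file.

We will look at memory.go, which implements the memory subsystem; the other subsystems are similar.
The key method: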

func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
    path, err := d.path("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    } else if path == "" {
        return nil
    }
    if memoryAssigned(d.config) {
        if _, err := os.Stat(path); os.IsNotExist(err) {
            if err := os.MkdirAll(path, 0755); err != nil {
                return err
            }
            // Only enable kernel memory accounting when this cgroup
            // is created by libcontainer, otherwise we might get
            // an error when people use `cgroupsPath` to join an existing
            // cgroup whose kernel memory is not initialized.
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
        }
    }
    defer func() {
        if err != nil {
            os.RemoveAll(path)
        }
    }()

    // We need to join memory cgroup after set memory limits, because
    // kmem.limit_in_bytes can only be set when the cgroup is empty.
    _, err = d.join("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    }
    return nil
}

d.path("memory") looks up the cgroup's memory path:

func (raw *cgroupData) path(subsystem string) (string, error) {
    mnt, err := cgroups.FindCgroupMountpoint(subsystem)
    // If we didn't mount the subsystem, there is no point we make the path.
    if err != nil {
        return "", err
    }

    // If the cgroup name/path is absolute do not look relative to the cgroup of the init process.
    if filepath.IsAbs(raw.innerPath) {
        // Sometimes subsystems can be mounted together as 'cpu,cpuacct'.
        return filepath.Join(raw.root, filepath.Base(mnt), raw.innerPath), nil
    }

    // Use GetOwnCgroupPath instead of GetInitCgroupPath, because the creating
    // process could be in a container and share the pid namespace with the host,
    // and /proc/1/cgroup could point to a whole other world of cgroups.
    parentPath, err := cgroups.GetOwnCgroupPath(subsystem)
    if err != nil {
        return "", err
    }

    return filepath.Join(parentPath, raw.innerPath), nil
}
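
The branching in path can be shown without touching a real cgroup mount. In this sketch `resolveCgroupPath` is a hypothetical pure function: the mount point and the calling process's own cgroup directory, which the real code looks up at run time, are passed in as plain strings, and the example paths are illustrative only.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveCgroupPath mirrors the two cases in path():
//   - an absolute innerPath is placed directly under the subsystem's
//     directory inside root;
//   - a relative innerPath is appended to ownPath, the directory of the
//     calling process's own cgroup for that subsystem.
func resolveCgroupPath(root, mountpoint, ownPath, innerPath string) string {
	if filepath.IsAbs(innerPath) {
		// Base keeps combined mounts such as "cpu,cpuacct" intact.
		return filepath.Join(root, filepath.Base(mountpoint), innerPath)
	}
	return filepath.Join(ownPath, innerPath)
}

func main() {
	root := "/sys/fs/cgroup"
	mnt := "/sys/fs/cgroup/memory"
	own := "/sys/fs/cgroup/memory/user.slice"

	fmt.Println(resolveCgroupPath(root, mnt, own, "/docker/abc"))
	// prints "/sys/fs/cgroup/memory/docker/abc"
	fmt.Println(resolveCgroupPath(root, mnt, own, "docker/abc"))
	// prints "/sys/fs/cgroup/memory/user.slice/docker/abc"
}
```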

d.join("memory") writes the pid into the memory path:

func (raw *cgroupData) join(subsystem string) (string, error) {
    path, err := raw.path(subsystem)
    if err != nil {
        return "", err
    }
    if err := os.MkdirAll(path, 0755); err != nil {
        return "", err
    }
    if err := cgroups.WriteCgroupProc(path, raw.pid); err != nil {
        return "", err
    }
    return path, nil
}

That concludes this look at how to understand cgroups in Docker. Hopefully it has been useful; if you found it helpful, feel free to share it.
