How to Analyze the NVIDIA k8s-device-plugin Source Code

Published: 2021-12-15 19:02:46  Source: Yisu Cloud  Reads: 340  Author: 柒染  Column: Cloud Computing

This article analyzes the NVIDIA/k8s-device-plugin source code in detail, for readers who want a straightforward way into how the plugin works.

k8s-device-plugin internal architecture diagram

The article "How Kubernetes Uses NVIDIA GPUs Through Device Plugins" analyzed in depth how NVIDIA/k8s-device-plugin works. For convenience, its internal architecture diagram is reproduced here:

The PreStartContainer and GetDevicePluginOptions interfaces can be ignored in NVIDIA/k8s-device-plugin; they are effectively empty implementations. We focus on the implementations of ListAndWatch and Allocate.

Startup

Everything starts from the main function. The core code is as follows:

func main() {
	log.Println("Loading NVML")
	if err := nvml.Init(); err != nil {
		select {}
	}
    ...
	log.Println("Fetching devices.")
	if len(getDevices()) == 0 {
		select {}
	}

	log.Println("Starting FS watcher.")
	watcher, err := newFSWatcher(pluginapi.DevicePluginPath)
	if err != nil {
		os.Exit(1)
	}
    ...
	log.Println("Starting OS watcher.")
	sigs := newOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

	restart := true
	var devicePlugin *NvidiaDevicePlugin

L:
	for {
		if restart {
			if devicePlugin != nil {
				devicePlugin.Stop()
			}

			devicePlugin = NewNvidiaDevicePlugin()
			if err := devicePlugin.Serve(); err != nil {
				...
			} else {
				restart = false
			}
		}

		select {
		case event := <-watcher.Events:
			if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
				restart = true
			}

		case err := <-watcher.Errors:
			log.Printf("inotify: %s", err)

		case s := <-sigs:
			switch s {
			case syscall.SIGHUP:
				restart = true
			default:
				devicePlugin.Stop()
				break L
			}
		}
	}
}

The code needs little further explanation; the flow is summarized in the logic diagram below:

[Figure: startup flow logic diagram]
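
The restart condition in the select loop above can be isolated as a small predicate. The following is a minimal sketch, not the plugin's own code: Op, Event, and needsRestart are hypothetical stand-ins for the fsnotify types, and the kubelet.sock path is an assumed value for pluginapi.KubeletSocket.

```go
package main

import "fmt"

// Op is a hypothetical stand-in for a file-operation bitmask such as fsnotify.Op.
type Op uint32

const (
	Create Op = 1 << iota
	Write
	Remove
)

// Event is a hypothetical stand-in for a file-system notification.
type Event struct {
	Name string
	Op   Op
}

// kubeletSocket is an assumed value for the path of kubelet's registration socket.
const kubeletSocket = "/var/lib/kubelet/device-plugins/kubelet.sock"

// needsRestart reports whether an event means kubelet.sock was (re)created,
// in which case the plugin has to restart and register with kubelet again.
func needsRestart(ev Event) bool {
	return ev.Name == kubeletSocket && ev.Op&Create == Create
}

func main() {
	events := []Event{
		{Name: kubeletSocket, Op: Create},
		{Name: kubeletSocket, Op: Remove},
		{Name: "/var/lib/kubelet/device-plugins/nvidia.sock", Op: Create},
	}
	for _, ev := range events {
		fmt.Printf("%s op=%d restart=%v\n", ev.Name, ev.Op, needsRestart(ev))
	}
}
```

Because Op is a bitmask, the check masks out the Create bit instead of comparing for equality, so an event that combines Create with other operations still triggers a restart.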

Serve

During k8s-device-plugin startup, devicePlugin.Serve calls Start to bring up the gRPC server that exposes the plugin's service, and then registers the plugin with kubelet.

// Serve starts the gRPC server and register the device plugin to Kubelet
func (m *NvidiaDevicePlugin) Serve() error {
	err := m.Start()
	if err != nil {
		log.Printf("Could not start device plugin: %s", err)
		return err
	}
	log.Println("Starting to serve on", m.socket)

	err = m.Register(pluginapi.KubeletSocket, resourceName)
	if err != nil {
		log.Printf("Could not register device plugin: %s", err)
		m.Stop()
		return err
	}
	log.Println("Registered device plugin with Kubelet")

	return nil
}

Start

The code for Start is as follows:

// Start starts the gRPC server of the device plugin
func (m *NvidiaDevicePlugin) Start() error {
	err := m.cleanup()
	if err != nil {
		return err
	}

	sock, err := net.Listen("unix", m.socket)
	if err != nil {
		return err
	}

	m.server = grpc.NewServer([]grpc.ServerOption{}...)
	pluginapi.RegisterDevicePluginServer(m.server, m)

	go m.server.Serve(sock)

	// Wait for server to start by launching a blocking connexion
	conn, err := dial(m.socket, 5*time.Second)
	if err != nil {
		return err
	}
	conn.Close()

	go m.healthcheck()

	return nil
}

The deeper call relationships are not covered here; the implementation logic of Start is shown in the diagram below:

[Figure: Start implementation logic diagram]

The Start flow is responsible for creating the nvidia.sock file.
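
The listen-then-dial step in Start can be tried out with the standard library alone. This is a minimal sketch under stated assumptions: it uses plain net.Listen and net.DialTimeout rather than gRPC, and listenAndProbe and socketCreated are hypothetical helper names that do not appear in the plugin.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// listenAndProbe creates a Unix socket at sockPath and then dials it once,
// so the caller knows the listener is reachable before moving on.
func listenAndProbe(sockPath string, timeout time.Duration) (net.Listener, error) {
	// Remove a stale socket file left over from a previous run.
	if err := os.Remove(sockPath); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	ln, err := net.Listen("unix", sockPath)
	if err != nil {
		return nil, err
	}
	conn, err := net.DialTimeout("unix", sockPath, timeout)
	if err != nil {
		ln.Close()
		return nil, err
	}
	conn.Close()
	return ln, nil
}

// socketCreated runs listenAndProbe in a temporary directory and reports
// whether the socket file exists afterwards.
func socketCreated() (bool, error) {
	dir, err := os.MkdirTemp("", "dp")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	sockPath := filepath.Join(dir, "nvidia.sock")
	ln, err := listenAndProbe(sockPath, 5*time.Second)
	if err != nil {
		return false, err
	}
	defer ln.Close()

	_, err = os.Stat(sockPath)
	return err == nil, nil
}

func main() {
	ok, err := socketCreated()
	fmt.Println("socket created:", ok, "error:", err)
}
```

The socket file appears as soon as net.Listen returns, which is why Start is the step that creates nvidia.sock.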

The healthcheck part deserves special mention:

healthcheck starts a goroutine that monitors the health of the managed devices. Once it finds an unhealthy device, it sends that device to the NvidiaDevicePlugin health channel. The device plugin's ListAndWatch reads these unhealthy devices from the health channel and notifies kubelet to update.
Only the nvmlEventTypeXidCriticalError event is monitored. Once this event is observed for a device, that device is considered unhealthy. For details on nvmlEventTypeXidCriticalError, see NVIDIA's NVML API documentation.
healthcheck can be turned off by setting the environment variable DP_DISABLE_HEALTHCHECKS to "all" inside the NVIDIA device plugin Pod. If the variable is unset or set to any other value, healthcheck is started; the default deployment leaves it unset.
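
The environment-variable rule described above can be expressed as a single predicate. This is a minimal sketch that follows the rule as stated in this article (only the exact value "all" disables the check); healthChecksDisabled and the injected getenv parameter are hypothetical names, and the plugin's real parsing may differ.

```go
package main

import (
	"fmt"
	"os"
)

const envDisableHealthChecks = "DP_DISABLE_HEALTHCHECKS"

// healthChecksDisabled reports whether health checking is turned off.
// getenv is injected so the rule can be exercised without touching the
// real process environment.
func healthChecksDisabled(getenv func(string) string) bool {
	return getenv(envDisableHealthChecks) == "all"
}

func main() {
	if healthChecksDisabled(os.Getenv) {
		fmt.Println("health checks disabled")
		return
	}
	fmt.Println("health checks enabled")
}
```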

Register

After Start, the plugin enters the Register flow. Its code is as follows:

// Register registers the device plugin for the given resourceName with Kubelet.
func (m *NvidiaDevicePlugin) Register(kubeletEndpoint, resourceName string) error {
	conn, err := dial(kubeletEndpoint, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	reqt := &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     path.Base(m.socket),
		ResourceName: resourceName,
	}

	_, err = client.Register(context.Background(), reqt)
	if err != nil {
		return err
	}
	return nil
}

The implementation flow of Register is shown in the diagram below:

[Figure: Register implementation flow diagram]

The registered Resource Name is nvidia.com/gpu.
The registered Version is v1beta1.
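
To make the three registration fields concrete, here is a minimal sketch that builds them without any gRPC dependency. registerRequest is a local stand-in for pluginapi.RegisterRequest, and the socket path passed in main is an assumed example value.

```go
package main

import (
	"fmt"
	"path"
)

// registerRequest is a local stand-in for pluginapi.RegisterRequest,
// limited to the three fields shown in the article.
type registerRequest struct {
	Version      string
	Endpoint     string
	ResourceName string
}

// newRegisterRequest fills the request the same way Register does:
// kubelet receives only the socket's file name, not its full path.
func newRegisterRequest(socketPath, resourceName string) registerRequest {
	return registerRequest{
		Version:      "v1beta1",
		Endpoint:     path.Base(socketPath),
		ResourceName: resourceName,
	}
}

func main() {
	// The socket path below is an assumed example value.
	req := newRegisterRequest("/var/lib/kubelet/device-plugins/nvidia.sock", "nvidia.com/gpu")
	fmt.Printf("%+v\n", req)
}
```

Sending only path.Base of the socket works because kubelet and the plugin share the same device-plugins directory, so the file name is enough for kubelet to find the endpoint.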

Stop

The code for Stop is as follows:

// Stop stops the gRPC server
func (m *NvidiaDevicePlugin) Stop() error {
	if m.server == nil {
		return nil
	}

	m.server.Stop()
	m.server = nil
	close(m.stop)

	return m.cleanup()
}

The implementation flow of Stop is shown in the diagram below:

[Figure: Stop implementation flow diagram]

The Stop flow is responsible for stopping the gRPC server and deleting nvidia.sock.
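
The article does not show the body of cleanup, so the following is only a sketch of how removing the socket file could look. It assumes that a missing file should not count as an error, so that Start and Stop can both call it safely; cleanupTwice is a hypothetical helper added for demonstration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cleanup removes the socket file and treats "file does not exist" as
// success, so it can be called both before starting and after stopping.
func cleanup(sockPath string) error {
	if err := os.Remove(sockPath); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

// cleanupTwice creates a placeholder file, cleans it up twice, and reports
// whether the file is gone. The second call exercises the "already removed" path.
func cleanupTwice() (bool, error) {
	dir, err := os.MkdirTemp("", "dp")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	sockPath := filepath.Join(dir, "nvidia.sock")
	if err := os.WriteFile(sockPath, nil, 0o600); err != nil {
		return false, err
	}
	if err := cleanup(sockPath); err != nil {
		return false, err
	}
	if err := cleanup(sockPath); err != nil {
		return false, err
	}
	_, statErr := os.Stat(sockPath)
	return os.IsNotExist(statErr), nil
}

func main() {
	gone, err := cleanupTwice()
	fmt.Println("socket removed:", gone, "error:", err)
}
```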

ListAndWatch

ListAndWatch first sends the current device list to kubelet and then watches the health channel. When a GPU becomes unhealthy, it sends the complete GPU list (IDs and health status) to kubelet again so that kubelet can update.

// ListAndWatch lists devices and update that list according to the health status
func (m *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	s.Send(&pluginapi.ListAndWatchResponse{Devices: m.devs})

	for {
		select {
		case <-m.stop:
			return nil
		case d := <-m.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			d.Health = pluginapi.Unhealthy
			s.Send(&pluginapi.ListAndWatchResponse{Devices: m.devs})
		}
	}
}

The implementation flow of ListAndWatch is shown in the diagram below:

[Figure: ListAndWatch implementation flow diagram]
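
The send-once-then-resend pattern can be reproduced without gRPC. In this sketch, Device, the send callback, and demoSnapshots are hypothetical stand-ins for the pluginapi types and the gRPC stream. Unlike the excerpt above, the sketch returns send errors instead of ignoring them.

```go
package main

import "fmt"

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a local stand-in for pluginapi.Device.
type Device struct {
	ID     string
	Health string
}

// listAndWatch sends the full device list once, then resends it every time a
// device arrives on the health channel. It returns when stop is closed.
func listAndWatch(devs []*Device, health <-chan *Device, stop <-chan struct{}, send func([]*Device) error) error {
	if err := send(devs); err != nil {
		return err
	}
	for {
		select {
		case <-stop:
			return nil
		case d := <-health:
			d.Health = Unhealthy
			if err := send(devs); err != nil {
				return err
			}
		}
	}
}

// demoSnapshots runs the loop with one pre-queued unhealthy device and
// records what each send would have delivered to kubelet.
func demoSnapshots() [][]string {
	devs := []*Device{{ID: "GPU-0", Health: Healthy}, {ID: "GPU-1", Health: Healthy}}
	health := make(chan *Device, 1)
	stop := make(chan struct{})
	health <- devs[1] // report GPU-1 before the loop starts

	var snapshots [][]string
	send := func(list []*Device) error {
		states := make([]string, 0, len(list))
		for _, d := range list {
			states = append(states, d.ID+"="+d.Health)
		}
		snapshots = append(snapshots, states)
		if len(snapshots) == 2 {
			close(stop) // end the demo once the update has been sent
		}
		return nil
	}
	if err := listAndWatch(devs, health, stop, send); err != nil {
		return nil
	}
	return snapshots
}

func main() {
	for i, s := range demoSnapshots() {
		fmt.Println("send", i, s)
	}
}
```

Each update carries the whole list rather than a delta, and, as the FIXME in the original code notes, nothing in this loop ever moves a device back to Healthy.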

Allocate

Allocate handles kubelet's requests to allocate GPUs for a container. The request structs are defined as follows:

// - Allocate is expected to be called during pod creation since allocation
//   failures for any container would result in pod startup failure.
// - Allocate allows kubelet to exposes additional artifacts in a pod's
//   environment as directed by the plugin.
// - Allocate allows Device Plugin to run device specific operations on
//   the Devices requested
type AllocateRequest struct {
	ContainerRequests []*ContainerAllocateRequest `protobuf:"bytes,1,rep,name=container_requests,json=containerRequests" json:"container_requests,omitempty"`
}

type ContainerAllocateRequest struct {
	DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
}

The response structs for the device plugin's Allocate are defined as follows:

// AllocateResponse includes the artifacts that needs to be injected into
// a container for accessing 'deviceIDs' that were mentioned as part of
// 'AllocateRequest'.
// Failure Handling:
// if Kubelet sends an allocation request for dev1 and dev2.
// Allocation on dev1 succeeds but allocation on dev2 fails.
// The Device plugin should send a ListAndWatch update and fail the
// Allocation request
type AllocateResponse struct {
	ContainerResponses []*ContainerAllocateResponse `protobuf:"bytes,1,rep,name=container_responses,json=containerResponses" json:"container_responses,omitempty"`
}

type ContainerAllocateResponse struct {
	// List of environment variable to be set in the container to access one of more devices.
	Envs map[string]string `protobuf:"bytes,1,rep,name=envs" json:"envs,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	// Mounts for the container.
	Mounts []*Mount `protobuf:"bytes,2,rep,name=mounts" json:"mounts,omitempty"`
	// Devices for the container.
	Devices []*DeviceSpec `protobuf:"bytes,3,rep,name=devices" json:"devices,omitempty"`
	// Container annotations to pass to the container runtime
	Annotations map[string]string `protobuf:"bytes,4,rep,name=annotations" json:"annotations,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
}

The implementation of Allocate is as follows:

// Allocate which return list of devices.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	devs := m.devs
	responses := pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		response := pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				"NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
			},
		}

		for _, id := range req.DevicesIDs {
			if !deviceExists(devs, id) {
				return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
			}
		}

		responses.ContainerResponses = append(responses.ContainerResponses, &response)
	}

	return &responses, nil
}

Its implementation logic is shown in the diagram below:

[Figure: Allocate implementation logic diagram]

Allocate iterates over ContainerRequests and puts each request's DevicesIDs into the NVIDIA_VISIBLE_DEVICES entry of ContainerAllocateResponse.Envs, in the format "${ID_1},${ID_2},...".
Apart from that, it does not populate Mounts, Devices, or Annotations.
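
The per-container step of Allocate reduces to validating the requested IDs and joining them into one environment variable. The following is a minimal sketch with a hypothetical helper, visibleDevicesEnv, and a plain map standing in for the plugin's device list; the GPU IDs are made-up example values.

```go
package main

import (
	"fmt"
	"strings"
)

// visibleDevicesEnv validates the requested device IDs against the known set
// and returns the environment map for one container.
func visibleDevicesEnv(known map[string]bool, requested []string) (map[string]string, error) {
	for _, id := range requested {
		if !known[id] {
			return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
		}
	}
	return map[string]string{
		"NVIDIA_VISIBLE_DEVICES": strings.Join(requested, ","),
	}, nil
}

func main() {
	// The device IDs below are made-up example values.
	known := map[string]bool{"GPU-aaa": true, "GPU-bbb": true}

	envs, err := visibleDevicesEnv(known, []string{"GPU-aaa", "GPU-bbb"})
	fmt.Println(envs, err)

	_, err = visibleDevicesEnv(known, []string{"GPU-zzz"})
	fmt.Println(err)
}
```

The plugin itself only hands back an environment variable; the container runtime is what later reads NVIDIA_VISIBLE_DEVICES and exposes the listed GPUs to the container.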

That concludes this analysis of the NVIDIA/k8s-device-plugin source code.
