[Blog 502] How the Nvidia k8s GPU plugin works

How the Nvidia k8s GPU plugin works

The workflow for managing and scheduling Nvidia GPU devices in Kubernetes has two parts:

1. How to use a GPU inside a container
2. How Kubernetes schedules GPUs

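1. How to use a GPU inside a container

For an application in a container to operate a GPU, two goals must be met:

1. The GPU devices are visible inside the container
2. The application running in the container can drive the GPU through the Nvidia driver

See the previous post: How Nvidia docker works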

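2. How Kubernetes schedules GPUs: the Nvidia plugin

To manage and schedule GPUs in Kubernetes, Nvidia provides a Device Plugin for Nvidia GPUs. Its main functions are:

1. It implements the ListAndWatch interface, which reports the GPUs on the node and their health status to the kubelet
2. It implements the Allocate interface, which handles GPU allocation

Source walkthrough of ListAndWatch and Allocate in the Nvidia k8s plugin: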

// ListAndWatch lists devices and update that list according to the health status
func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})

    for {
        select {
        case <-plugin.stop:
            return nil
        case d := <-plugin.health:
            // A device reported a health problem; mark it as unhealthy
            // FIXME: there is no way to recover from the Unhealthy state.
            d.Health = pluginapi.Unhealthy
            log.Printf("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
            // Resend the updated device list
            s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})
        }
    }
}

// Allocate assigns GPUs: it returns the NVIDIA_VISIBLE_DEVICES environment variable to attach to each container
func (plugin *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    responses := pluginapi.AllocateResponse{}
    // Allocate devices for each container request
    for _, req := range reqs.ContainerRequests {
 
        if plugin.config.Sharing.TimeSlicing.FailRequestsGreaterThanOne && rm.AnnotatedIDs(req.DevicesIDs).AnyHasAnnotations() {
            if len(req.DevicesIDs) > 1 {
                return nil, fmt.Errorf("request for '%v: %v' too large: maximum request size for shared resources is 1", plugin.rm.Resource(), len(req.DevicesIDs))
            }
        }
        // Verify that every requested device ID is one this plugin manages, i.e. one it registered
        for _, id := range req.DevicesIDs {
            if !plugin.rm.Devices().Contains(id) {
                return nil, fmt.Errorf("invalid allocation request for '%s': unknown device: %s", plugin.rm.Resource(), id)
            }
        }

        response := pluginapi.ContainerAllocateResponse{}
        // Convert the device IDs used at registration into concrete GPU IDs
        ids := req.DevicesIDs
        deviceIDs := plugin.deviceIDsFromAnnotatedDeviceIDs(ids)
        // Store the allocated devices in Envs; the container runtime later injects them into the container as environment variables
        if *plugin.config.Flags.Plugin.DeviceListStrategy == spec.DeviceListStrategyEnvvar {
            response.Envs = plugin.apiEnvs(plugin.deviceListEnvvar, deviceIDs)
        }
        if *plugin.config.Flags.Plugin.DeviceListStrategy == spec.DeviceListStrategyVolumeMounts {
            response.Envs = plugin.apiEnvs(plugin.deviceListEnvvar, []string{deviceListAsVolumeMountsContainerPathRoot})
            response.Mounts = plugin.apiMounts(deviceIDs)
        }
        if *plugin.config.Flags.Plugin.PassDeviceSpecs {
            response.Devices = plugin.apiDeviceSpecs(*plugin.config.Flags.NvidiaDriverRoot, ids)
        }
        if *plugin.config.Flags.GDSEnabled {
            response.Envs["NVIDIA_GDS"] = "enabled"
        }
        if *plugin.config.Flags.MOFEDEnabled {
            response.Envs["NVIDIA_MOFED"] = "enabled"
        }

        responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }

    return &responses, nil
}
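
To make the environment-variable path of Allocate concrete, here is a minimal, self-contained sketch. It does not use the plugin's real types: knownDevices, allocateEnv, and the sample IDs are hypothetical names introduced for illustration, and joining the IDs with commas is an assumption about the value's format (it matches the comma-separated list form that NVIDIA_VISIBLE_DEVICES accepts).

```go
package main

import (
	"fmt"
	"strings"
)

// knownDevices is a hypothetical stand-in for the set of device IDs the
// plugin advertised through ListAndWatch.
type knownDevices map[string]bool

// allocateEnv mimics the envvar strategy of Allocate for one container:
// it rejects IDs the plugin never advertised, then returns the environment
// variables that would be handed back to the kubelet.
// Assumption: the IDs are joined with commas.
func allocateEnv(known knownDevices, requested []string) (map[string]string, error) {
	for _, id := range requested {
		if !known[id] {
			return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
		}
	}
	return map[string]string{
		"NVIDIA_VISIBLE_DEVICES": strings.Join(requested, ","),
	}, nil
}

func main() {
	// Sample IDs are made up for the example.
	known := knownDevices{"GPU-aaaa": true, "GPU-bbbb": true}

	envs, err := allocateEnv(known, []string{"GPU-aaaa", "GPU-bbbb"})
	fmt.Println(envs["NVIDIA_VISIBLE_DEVICES"], err)

	// A device the plugin does not manage is rejected.
	_, err = allocateEnv(known, []string{"GPU-cccc"})
	fmt.Println(err)
}
```

The sketch leaves out the volume-mount strategy and the device specs; it only shows the validate-then-build-Envs shape of the function above.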

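The whole process by which Kubernetes schedules a GPU is as follows:

1. The GPU Device plugin is deployed on the GPU node and, through the ListAndWatch interface,
   reports the node's GPU information and the corresponding DeviceIDs.

2. When a GPU Pod that requests nvidia.com/gpu is created, the scheduler takes the free GPU devices into account
   and schedules the Pod onto a node with enough GPU devices.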

3. When the kubelet on the node starts the Pod, it calls the Allocate interface of each
   Device plugin named in the Pod's requests. Because the container requested a GPU, the kubelet
   picks suitable devices from the Device information it received earlier through ListAndWatch
   and calls the Allocate interface of the GPU DevicePlugin with their DeviceIDs as arguments.

4. The GPU DevicePlugin receives the call, converts the DeviceIDs into the NVIDIA_VISIBLE_DEVICES
   environment variable, and returns it to the kubelet.

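5. On receiving the response, the kubelet injects the returned environment variables into the container and starts the container.

6. When the container starts, gpu-container-runtime calls gpu-containers-runtime-hook.
   Nvidia's gpu-container-runtime uses the container's NVIDIA_VISIBLE_DEVICES environment variable
   to decide whether this is a GPU container and which GPU devices it may use.
   If the NVIDIA_VISIBLE_DEVICES environment variable is absent,
   the container is started the ordinary docker way.

7. gpu-containers-runtime-hook converts the container's NVIDIA_VISIBLE_DEVICES environment variable
   into a --devices argument and calls nvidia-container-cli prestart.
   Based on --devices, nvidia-container-cli maps the GPU devices into the container
   and also maps the host's Nvidia Driver Lib .so files into the container.
   The container can then call the host's Nvidia Driver through these .so files.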
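
The decision described in steps 6 and 7 can be sketched as a small parser. This is an illustration only, not the hook's actual code: gpuRequest and parseVisibleDevices are hypothetical names, and the handling of the special values (unset, empty, or void meaning "not a GPU container", and none meaning "no GPUs exposed") follows the values documented for NVIDIA_VISIBLE_DEVICES.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuRequest is a hypothetical summary of what the hook would do for a container.
type gpuRequest struct {
	IsGPUContainer bool     // false: start like an ordinary container
	Devices        []string // "all" or GPU indexes/UUIDs; empty when "none" was given
}

// parseVisibleDevices interprets NVIDIA_VISIBLE_DEVICES from the container's
// environment. Assumed semantics, taken from the documented values:
//   - unset, "" or "void": not a GPU container
//   - "none":              GPU container, but no GPU devices exposed
//   - anything else:       comma-separated list of devices (or "all")
func parseVisibleDevices(env map[string]string) gpuRequest {
	value, ok := env["NVIDIA_VISIBLE_DEVICES"]
	if !ok || value == "" || value == "void" {
		return gpuRequest{}
	}
	if value == "none" {
		return gpuRequest{IsGPUContainer: true}
	}
	return gpuRequest{IsGPUContainer: true, Devices: strings.Split(value, ",")}
}

func main() {
	// No variable at all: the container starts the ordinary way.
	fmt.Printf("%+v\n", parseVisibleDevices(map[string]string{}))

	// The value the device plugin returned from Allocate (sample IDs are made up).
	env := map[string]string{"NVIDIA_VISIBLE_DEVICES": "GPU-aaaa,GPU-bbbb"}
	fmt.Printf("%+v\n", parseVisibleDevices(env))
}
```

Mapping the devices and driver libraries into the container is left out; the sketch stops at the point where the list of devices is known.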

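Summary of the mechanism

1. The device plugin starts its own service at (/var/lib/kubelet/device-plugins/sock.sock).

2. The device plugin sends a registration request to (/var/lib/kubelet/device-plugins/kubelet.sock), containing its resource name and the address of its own service, /var/lib/kubelet/device-plugins/sock.sock.

3. On receiving the request, the device manager allocates a new endpoint, which connects to and communicates with the device plugin through the plugin's ListAndWatch.

4. When the device plugin's ListAndWatch reports a change, the corresponding endpoint notices it and, through a callback, tells the device manager to update its resources and the corresponding device information (healthyDevices and unhealthyDevices).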
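
Steps 3 and 4 above can be modelled with plain Go channels. The sketch below is a deliberately reduced model, not the kubelet's implementation: device, deviceManager, runEndpoint, and the sample IDs are all hypothetical, and a channel stands in for the ListAndWatch gRPC stream.

```go
package main

import "fmt"

// device is a simplified stand-in for a device entry reported by a plugin.
type device struct {
	ID      string
	Healthy bool
}

// deviceManager is a hypothetical, much reduced model of the kubelet side:
// for each resource name it records which device IDs are healthy or unhealthy.
type deviceManager struct {
	healthy   map[string]map[string]bool
	unhealthy map[string]map[string]bool
}

func newDeviceManager() *deviceManager {
	return &deviceManager{
		healthy:   map[string]map[string]bool{},
		unhealthy: map[string]map[string]bool{},
	}
}

// update plays the role of the callback: it replaces the recorded state of
// one resource with the latest device list.
func (m *deviceManager) update(resource string, devs []device) {
	h, u := map[string]bool{}, map[string]bool{}
	for _, d := range devs {
		if d.Healthy {
			h[d.ID] = true
		} else {
			u[d.ID] = true
		}
	}
	m.healthy[resource] = h
	m.unhealthy[resource] = u
}

// runEndpoint consumes device lists from stream (standing in for the
// ListAndWatch stream) and forwards each one to callback until the
// stream is closed.
func runEndpoint(resource string, stream <-chan []device, callback func(string, []device)) {
	for devs := range stream {
		callback(resource, devs)
	}
}

func main() {
	m := newDeviceManager()
	stream := make(chan []device, 2)

	// Initial report, then a second report after one GPU turned unhealthy.
	stream <- []device{{ID: "GPU-aaaa", Healthy: true}, {ID: "GPU-bbbb", Healthy: true}}
	stream <- []device{{ID: "GPU-aaaa", Healthy: true}, {ID: "GPU-bbbb", Healthy: false}}
	close(stream)

	runEndpoint("nvidia.com/gpu", stream, m.update)
	fmt.Println("healthy:", len(m.healthy["nvidia.com/gpu"]), "unhealthy:", len(m.unhealthy["nvidia.com/gpu"]))
}
```

Each message replaces the full device list rather than sending a delta, which mirrors how the ListAndWatch code earlier in this post resends plugin.apiDevices() after a health change.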

Flowchart: [image not included]
