Kubernetes Study Notes: A Deep Dive into emptyDir

Published 2022-12-30 · Category: Kubernetes

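emptyDir is a form of local storage that is used quite often in Kubernetes. This post looks in detail at how emptyDir can be backed by shared memory, and walks through the relevant source code along the way, in the hope of giving a clearer picture of how emptyDir works.

Shared directory

Let's start with the ordinary way of using emptyDir, which most readers will already know. A simple example follows.

Use case: a Pod runs two containers, nginx and busybox. One Volume is declared and mounted into a directory in each container. The nginx container writes its logs to the Volume, and the busybox container runs a command that reads the log content and prints it to the console. Sharing a path between containers like this is probably the most common use of emptyDir.

The yaml is as follows: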

apiVersion: v1
kind: Pod
metadata:
  name: volume-emptydir
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - containerPort: 80
    volumeMounts:
    - name: logs-volume
      mountPath: /var/log/nginx
  - name: busybox
    image: busybox:1.30
    command: ["/bin/sh","-c","tail -f /logs/access.log"]
    volumeMounts:
    - name: logs-volume
      mountPath: /logs
  volumes:
  - name: logs-volume
    emptyDir: {}

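An emptyDir starts out as an empty directory and exists for the whole lifetime of the Pod. When the Pod is removed from its node, for whatever reason, the data in the emptyDir is deleted with it; a container crash alone does not remove the Pod, so the data survives container restarts. An emptyDir can be mounted at any path in any container of the Pod, and it is shared by all containers in the Pod.

For this reason it is only suitable for data that is not important.

Question 1: This kind of sharing is really just a directory, but where does the directory live, and how is it related to the node?

Under kubelet's working directory (controlled by the root-dir parameter, /var/lib/kubelet by default), a directory is created for each Pod that uses emptyDir: {}, in the form /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/. All data placed in the emptyDir ends up under this path on the node. By default this path sits on the node's root file system, so it is worth monitoring how emptyDir: {} is used across the cluster; otherwise a Pod that writes a large amount of data can fill up the node's root file system and cause Pods to be Evicted.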
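As a rough illustration of the monitoring idea, the sketch below builds the host path of an emptyDir volume and sums the file sizes under it. The helper names emptyDirPath and dirSize are hypothetical, the pod UID is a placeholder, and the sketch assumes that each volume gets a subdirectory named after the volume under kubernetes.io~empty-dir. It reports apparent file sizes, not allocated disk blocks.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// emptyDirPath builds the host path of an emptyDir volume.
// Hypothetical helper: it assumes the layout
// {rootDir}/pods/{podUID}/volumes/kubernetes.io~empty-dir/{volumeName}.
func emptyDirPath(rootDir, podUID, volumeName string) string {
	return filepath.Join(rootDir, "pods", podUID, "volumes", "kubernetes.io~empty-dir", volumeName)
}

// dirSize sums the apparent sizes of all regular files under root.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	// "<pod-uid>" is a placeholder; substitute a real pod UID on the node.
	p := emptyDirPath("/var/lib/kubelet", "<pod-uid>", "logs-volume")
	size, err := dirSize(p)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot measure", p+":", err)
		return
	}
	fmt.Printf("%s uses %d bytes\n", p, size)
}
```

A real setup would more likely rely on existing node or kubelet metrics; the point here is only that the data is an ordinary directory on the node and can be measured like one.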

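For a real case where improper use of emptyDir: {} caused a cluster incident, see "skywalking-agent使用emptyDir导致磁盘空间不足 - 掘金" in the references.

To sum up: emptyDir: {} consumes the node's storage resources.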

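Shared memory

First, the difference between shared memory and memory:

Memory generally means the node's physical memory, which is what we mean when we say how much memory a machine has.

Shared memory is a form of inter-process communication: multiple processes can access the same region of memory.

emptyDir can use memory as its storage medium, and this is the main topic of this post. Here is the yaml: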

apiVersion: v1
kind: Pod
metadata:
  name: volume-emptydir
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - containerPort: 80
    resources:
      limits:
        cpu: "1"
        memory: 1024Mi
      requests:
        cpu: "1"
        memory: 1024Mi
    volumeMounts:
    - name: logs-volume
      mountPath: /var/log/nginx
  volumes:
  - name: logs-volume
    emptyDir:
      medium: Memory
      sizeLimit: 512Mi

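PS: the yaml above is only an example. In practice nobody would spend expensive memory on storing logs.

One thing to stress: because the storage is memory, the data is also lost when the Pod or the node restarts.

Here medium=Memory is set (when medium is omitted, the volume behaves as in the first example), and sizeLimit is set to 512Mi.

This leads to the second question:

Question 2: Whose memory does this Memory volume consume? Is it part of the memory limit in the Pod's resources, or does it take extra memory from the node? And what difference does setting or not setting sizeLimit make?

The short answer first: the memory used by a medium=Memory volume cannot exceed the sum of the memory limits of all containers in the Pod, and within that bound its size is capped by sizeLimit.

Linux shared memory

Before going further we need to talk about shared memory on Linux (swap is not considered here). Linux systems commonly have file systems of type tmpfs. tmpfs is a virtual-memory file system, and its best-known instance is /dev/shm. It is often used as memory-backed storage to speed up reads and writes. Declaring a block of shared memory on Linux is simple: the following command mounts /tmp on shared memory with a size of 20m.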

mount -t tmpfs -o size=20m tmpfs /tmp

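One property of tmpfs matters a lot here: even with size=20m, memory is accounted according to what is actually used. The size only sets the maximum, which is 20m in this case.

How medium=Memory is implemented

Back to question 2. When Kubernetes handles an emptyDir with medium=Memory, it uses tmpfs, as the code shows: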

kubernetes/pkg/volume/emptydir/empty_dir.go

// other code omitted
switch {
case ed.medium == v1.StorageMediumDefault:
    err = ed.setupDir(dir)
case ed.medium == v1.StorageMediumMemory:
    err = ed.setupTmpfs(dir)
case v1helper.IsHugePageMedium(ed.medium): // this case is not discussed here
    err = ed.setupHugepages(dir)
default:
    err = fmt.Errorf("unknown storage medium %q", ed.medium)
}
// other code omitted

// setupTmpfs creates a tmpfs mount at the specified directory.
func (ed *emptyDir) setupTmpfs(dir string) error {
    if ed.mounter == nil {
        return fmt.Errorf("memory storage requested, but mounter is nil")
    }
    if err := ed.setupDir(dir); err != nil {
        return err
    }
    // Make SetUp idempotent.
    medium, isMnt, _, err := ed.mountDetector.GetMountMedium(dir, ed.medium)
    if err != nil {
        return err
    }
    // If the directory is a mountpoint with medium memory, there is no
    // work to do since we are already in the desired state.
    if isMnt && medium == v1.StorageMediumMemory {
        return nil
    }

    var options []string
    // Linux system default is 50% of capacity.
    if ed.sizeLimit != nil && ed.sizeLimit.Value() > 0 {
        options = []string{fmt.Sprintf("size=%d", ed.sizeLimit.Value())}
    }

    klog.V(3).Infof("pod %v: mounting tmpfs for volume %v", ed.pod.UID, ed.volName)
    return ed.mounter.MountSensitiveWithoutSystemd("tmpfs", dir, "tmpfs", options, nil)
}

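The key is the last statement: it ends up as a tmpfs mount, which creates a block of shared memory and mounts it into the Pod. If sizeLimit is set and greater than 0, it is passed as the size= mount option; otherwise no size option is passed and the Linux default (50% of capacity, per the comment in the code) applies.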

Does it affect scheduling?

Now look at the following yaml:

# other fields omitted
  resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
  volumes:
  - name: logs-volume
    emptyDir:
      medium: Memory
      sizeLimit: 512Mi

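The Pod both requests resources and sets emptyDir.medium. So does the Pod really consume 1024Mi, or 1024Mi + 512Mi? And how does emptyDir.sizeLimit affect the scheduler?

The answer is 1024Mi, because the memory used by the medium: Memory volume is included in the memory value set by the limit. This is easy to understand with an analogy: when we say a node has 4Gi of memory, any shared memory on that node can be at most 4Gi as well (again, swap is not considered).

The source code defines the valid range of sizeLimit: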

kubernetes/pkg/volume/emptydir/empty_dir.go

func calculateEmptyDirMemorySize(nodeAllocatableMemory *resource.Quantity, spec *volume.Spec, pod *v1.Pod) *resource.Quantity {
    // if feature is disabled, continue the default behavior of linux host default
    sizeLimit := &resource.Quantity{}
    if !utilfeature.DefaultFeatureGate.Enabled(features.SizeMemoryBackedVolumes) {
        return sizeLimit
    }

    // size limit defaults to node allocatable (pods can't consume more memory than all pods)
    sizeLimit = nodeAllocatableMemory
    zero := resource.MustParse("0")

    // determine pod resource allocation
    // we use the same function for pod cgroup assignment to maintain consistent behavior
    // NOTE: this could be nil on systems that do not support pod memory containment (i.e. windows)
    podResourceConfig := cm.ResourceConfigForPod(pod, false, uint64(100000), false)
    if podResourceConfig != nil && podResourceConfig.Memory != nil {
        podMemoryLimit := resource.NewQuantity(*(podResourceConfig.Memory), resource.BinarySI)
        // ensure 0 < value < size
        if podMemoryLimit.Cmp(zero) > 0 && podMemoryLimit.Cmp(*sizeLimit) < 1 {
            sizeLimit = podMemoryLimit
        }
    }

    // volume local size is  used if and only if less than what pod could consume
    if spec.Volume.EmptyDir.SizeLimit != nil {
        volumeSizeLimit := spec.Volume.EmptyDir.SizeLimit
        // ensure 0 < value < size
        if volumeSizeLimit.Cmp(zero) > 0 && volumeSizeLimit.Cmp(*sizeLimit) < 1 {
            sizeLimit = volumeSizeLimit
        }
    }
    return sizeLimit
}

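The code above contains one very important comment:

// volume local size is used if and only if less than what pod could consume

In other words, the volume size that is used cannot exceed the total resources the Pod is able to consume.

With the SizeMemoryBackedVolumes feature gate enabled (when it is disabled, the function returns an empty quantity and the Linux host default applies), the code establishes the following rules:

When the Pod has no memory limit, the shm size is the node's Allocatable Memory.
When the Pod has a memory limit but the medium: Memory emptyDir has no sizeLimit, the shm size is the Pod's memory limit.
When the Pod's medium: Memory emptyDir has a sizeLimit, the shm size is sizeLimit. Note that if the value exceeds the Pod's memory limit, it is automatically set to the Pod's memory limit.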

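For this reason, medium: Memory does not affect scheduling. The scheduler still schedules based on the resources in the Pod's requests, and the cgroup is configured according to the resources in limits.

Question 3: Is medium: Memory subject to cgroup limits?

Yes. This memory is part of resources.limits in the first place. In addition, when the usage of the directory backing the medium: Memory volume exceeds sizeLimit, the Pod is Evicted by kubelet. This does not happen immediately; it happens after kubelet's periodic eviction check, which by default takes roughly 1-2 minutes.

To sum up: an emptyDir with medium: Memory consumes the node's memory resources, and that usage counts against the Pod's memory limit.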

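Recommendations

For scenarios where a Pod needs shared memory, the recommendations are:

Always configure memory limits for the Pod.
Mount an emptyDir with medium: Memory at /dev/shm in the Pod (other paths work too).
Set sizeLimit to 50% of limits.
Restrict the use of emptyDir and monitor it properly, to avoid a cascading failure across the cluster.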

References:

Setting up the shared memory of a kubernetes Pod - SoByte

skywalking-agent使用emptyDir导致磁盘空间不足 - 掘金

https://zhuanlan.zhihu.com/p/38359775

k8s pod 配置shareMemory | 云里雾里