Oh My Memory

So there I was, happily scaling up a Kubernetes cluster, watching pod counts climb into the hundreds. Everything seemed fine until I noticed something peculiar - our nodes were running out of allocatable memory way faster than they should have been. The workloads themselves weren’t particularly hungry, so where was all that memory going?

Turns out, every single one of those pods was mounting an EFS volume. Same data, same filesystem, just… a lot of mounts. And the EFS CSI driver was absolutely chomping through memory to maintain all those mount points.

The Problem: Death by a Thousand Mounts

After some digging, I discovered what was happening under the hood. The EFS CSI driver keeps metadata and a view of the system’s mount tree in memory for each individual mount it manages. When you’ve got a handful of pods, this is negligible. But scale that up to 200, 300, or more pods? You’re suddenly looking at gigabytes of memory consumed just for mount bookkeeping.
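You can see the hit directly. Assuming a default driver install in kube-system with the stock efs-csi-node labels (adjust for your setup), these two checks - driver pod memory and per-node mount count - tell the story:

# Memory usage of the EFS CSI node pods (needs metrics-server)
kubectl top pods -n kube-system -l app=efs-csi-node

# On a node: count the NFSv4 mounts backing EFS volumes
mount | grep -c nfs4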

Here’s what the typical EFS mounting pattern looks like in Kubernetes:

flowchart TB
    subgraph AWS["AWS Cloud"]
        EFS["EFS Filesystem<br/>fs-xxxxxxxx"]
    end

    subgraph Node1["Kubernetes Node 1"]
        subgraph Driver1["EFS CSI Driver"]
            M1["Mount 1 metadata"]
            M2["Mount 2 metadata"]
            M3["Mount N metadata"]
        end
        Pod1["Pod A"] --> M1
        Pod2["Pod B"] --> M2
        Pod3["Pod C...N"] --> M3
        M1 --> EFS
        M2 --> EFS
        M3 --> EFS
    end

    style Driver1 fill:#ff6b6b,stroke:#c92a2a
    style M1 fill:#ffa8a8
    style M2 fill:#ffa8a8
    style M3 fill:#ffa8a8

Each pod creates its own mount, and the CSI driver dutifully tracks each one. Memory usage grows linearly with pod count: as a rough back-of-envelope, if each mount’s bookkeeping costs on the order of 20-30MB, a node packing in a hundred such pods gives up 2-3GB to mounts alone. Not ideal when you’re trying to keep infrastructure lean.

The Solution: One Mount to Rule Them All

After some research and experimentation, I landed on an elegant workaround: use a DaemonSet to mount EFS once per node, then have pods access the data through a hostPath volume. This way, you get one mount per node instead of one per pod.

Here’s the new architecture:

flowchart TB
    subgraph AWS["AWS Cloud"]
        EFS["EFS Filesystem<br/>fs-xxxxxxxx"]
    end

    subgraph Node1["Kubernetes Node 1"]
        subgraph DS1["DaemonSet Pod"]
            Mount1["Single EFS Mount"]
        end
        HP1["/var/lib/efs<br/>(hostPath)"]
        Mount1 --> HP1
        Pod1["Pod A"] --> HP1
        Pod2["Pod B"] --> HP1
        Pod3["Pod C...N"] --> HP1
        Mount1 --> EFS
    end

    subgraph Node2["Kubernetes Node 2"]
        subgraph DS2["DaemonSet Pod"]
            Mount2["Single EFS Mount"]
        end
        HP2["/var/lib/efs<br/>(hostPath)"]
        Mount2 --> HP2
        Pod4["Pod X"] --> HP2
        Pod5["Pod Y"] --> HP2
        Pod6["Pod Z...N"] --> HP2
        Mount2 --> EFS
    end

    style DS1 fill:#51cf66,stroke:#2f9e44
    style DS2 fill:#51cf66,stroke:#2f9e44
    style HP1 fill:#8ce99a
    style HP2 fill:#8ce99a

The DaemonSet runs on every node, mounts EFS to a hostPath directory, and uses bidirectional mount propagation to make that mount available to all pods on the node. Simple, but surprisingly effective.

The Implementation

Here’s the DaemonSet configuration that made it all work:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-mounter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: efs-mounter
  template:
    metadata:
      labels:
        app: efs-mounter
    spec:
      serviceAccountName: default
      hostPID: true
      tolerations:
        - operator: Exists
      containers:
        - name: efs-mounter
          image: amazonlinux:2023
          securityContext:
            privileged: true
          command:
            - /bin/bash
            - -c
            - |
              set -euo pipefail

              echo "Installing amazon-efs-utils..."
              yum -y install amazon-efs-utils

              echo "Preparing mount points..."
              mkdir -p /mnt/efs/app-config
              mkdir -p /mnt/efs/app-data

              echo "Mount command available:"
              which mount.efs || { echo "mount.efs not found!"; exit 1; }

              echo "Mounting EFS filesystems..."
              retry_mount() {
                  fs_id=$1
                  target=$2
                  for i in 1 2 3 4 5; do
                      echo "Mounting $fs_id to $target (attempt $i)..."
                      if mount -t efs "$fs_id:/" "$target"; then
                          echo "Mounted $fs_id successfully."
                          return 0
                      fi
                      echo "Mount failed. Retrying in 5s..."
                      sleep 5
                  done
                  echo "ERROR: Failed to mount $fs_id after 5 attempts"
                  exit 1
              }

              retry_mount "${EFS1_ID}" "/mnt/efs/app-config"
              retry_mount "${EFS2_ID}" "/mnt/efs/app-data"

              echo "All mounts completed. Keeping container alive..."
              sleep infinity
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/bash
                  - -c
                  - |
                    echo "Unmounting EFS filesystems..."
                    umount /mnt/efs/app-config || true
                    umount /mnt/efs/app-data || true
                    echo "Unmount complete."
          env:
            - name: EFS1_ID
              value: "fs-xxxxxxxx"
            - name: EFS2_ID
              value: "fs-yyyyyyyy"
          volumeMounts:
            - name: efs-mount
              mountPath: /mnt/efs
              mountPropagation: Bidirectional
      volumes:
        - name: efs-mount
          hostPath:
            path: /var/lib/efs
            type: DirectoryOrCreate

A few things worth noting about this configuration:

  • Privileged mode is required because we’re performing mount operations at the host level
  • mountPropagation: Bidirectional is the secret sauce - it allows mounts made inside the container to be visible on the host and vice versa
  • Tolerations are set to allow the DaemonSet to run on all nodes, including tainted ones
  • preStop lifecycle hook ensures clean unmounting when pods are terminated
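One quick sanity check after rolling this out: on the node itself (or from a debug shell with host access), the EFS mount should now be visible at the hostPath location. Something like this, using the paths from the manifest above:

# On the node: confirm the mount propagated to the host path
findmnt -T /var/lib/efs/app-config
# A healthy result is an nfs4 entry pointing at your EFS mount target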

How Pods Consume the Mount

Once the DaemonSet is running, your application pods can access EFS through a simple hostPath volume:

# In pod.spec:
volumes:
  - name: app-config
    hostPath:
      path: /var/lib/efs/app-config
  - name: app-data
    hostPath:
      path: /var/lib/efs/app-data
# In each container that needs the data:
volumeMounts:
  - name: app-config
    mountPath: /config
  - name: app-data
    mountPath: /data

No more EFS CSI driver involvement per pod - just straightforward directory access. Each volume maps to a subdirectory under the hostPath where the DaemonSet mounted EFS.
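One subtlety worth calling out: a hostPath volume captures whatever is at that path when the container starts. If the DaemonSet pod restarts and remounts EFS while an application pod is already running, that pod can be left staring at an empty directory. Setting mountPropagation: HostToContainer on the consumer’s volumeMounts lets new host mounts flow into the running container - a minimal sketch, and one to test in your own environment:

volumeMounts:
  - name: app-config
    mountPath: /config
    mountPropagation: HostToContainer
  - name: app-data
    mountPath: /data
    mountPropagation: HostToContainer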

You can mount as many directories as you need - just add the corresponding mkdir and retry_mount calls to the DaemonSet script, and update your pod specs accordingly. That said, keep in mind that if you go overboard with the number of mounts or don’t structure things thoughtfully, you might find yourself back in performance bottleneck territory. The goal is consolidation, not just moving the problem around.
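Concretely, a third filesystem would look something like this - the EFS3_ID variable and app-cache path are placeholders for illustration, not part of the setup above:

# In the DaemonSet script, alongside the existing mounts:
mkdir -p /mnt/efs/app-cache
retry_mount "${EFS3_ID}" "/mnt/efs/app-cache"
# (plus a matching env entry, and an umount in the preStop hook)

# In the consuming pod spec:
volumes:
  - name: app-cache
    hostPath:
      path: /var/lib/efs/app-cache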

The Results

After rolling this out, the memory savings were immediate and substantial. We recovered roughly 2-3GB of allocatable memory per node. For a cluster running lean on resources, that’s the difference between needing to scale out or not.

Caveats and Considerations

This approach isn’t without trade-offs:

  • Write-heavy workloads: I haven’t battle-tested this with high-throughput concurrent writes. For read-heavy or moderate-write scenarios, it works beautifully.
  • Security posture: Running privileged containers in kube-system is a trade-off. Make sure your security policies account for this.
  • Failure domain: If the DaemonSet pod dies, all pods on that node lose access to EFS until it recovers. The retry logic and Kubernetes’ self-healing help here, but it’s something to consider - see the probe sketch after this list.
  • Mount propagation quirks: Not all Kubernetes distributions handle mount propagation identically. Test thoroughly in your environment.
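On that failure-domain point: a liveness probe on the mounter container can catch a mount that has silently gone away and trigger a restart (and therefore a remount). A minimal sketch, assuming the mount paths from the DaemonSet above:

livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - mountpoint -q /mnt/efs/app-config && mountpoint -q /mnt/efs/app-data
  initialDelaySeconds: 60
  periodSeconds: 30

Pair this with the HostToContainer propagation note from earlier so already-running pods pick up the fresh mount after a restart.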

Wrapping Up

Sometimes the most elegant solutions come from stepping back and questioning the default approach. The EFS CSI driver is fantastic for many use cases, but at scale, the per-pod mounting overhead can become a real problem. By consolidating to a single mount per node, we traded a small amount of complexity for significant resource savings.

If you’re running into similar memory pressure with EFS at scale, give this DaemonSet approach a try. Your nodes (and your cloud bill) might thank you.