Oh My Memory
So there I was, happily scaling up a Kubernetes cluster, watching pod counts climb into the hundreds. Everything seemed fine until I noticed something peculiar - our nodes were running out of allocatable memory way faster than they should have been. The workloads themselves weren’t particularly hungry, so where was all that memory going?
Turns out, every single one of those pods was mounting an EFS volume. Same data, same filesystem, just… a lot of mounts. And the EFS CSI driver was absolutely chomping through memory to maintain all those mount points.
The Problem: Death by a Thousand Mounts
After some digging, I discovered what was happening under the hood. The EFS CSI driver keeps metadata and mount-tracking state in memory for every individual mount it manages. When you’ve got a handful of pods, this is negligible. But scale that up to 200, 300, or more pods? You’re suddenly looking at gigabytes of memory consumed just for mount bookkeeping.
Here’s what the typical EFS mounting pattern looks like in Kubernetes:
flowchart TB
    subgraph AWS["AWS Cloud"]
        EFS["EFS Filesystem<br/>fs-xxxxxxxx"]
    end
    subgraph Node1["Kubernetes Node 1"]
        subgraph Driver1["EFS CSI Driver"]
            M1["Mount 1 metadata"]
            M2["Mount 2 metadata"]
            M3["Mount N metadata"]
        end
        Pod1["Pod A"] --> M1
        Pod2["Pod B"] --> M2
        Pod3["Pod C...N"] --> M3
        M1 --> EFS
        M2 --> EFS
        M3 --> EFS
    end
    style Driver1 fill:#ff6b6b,stroke:#c92a2a
    style M1 fill:#ffa8a8
    style M2 fill:#ffa8a8
    style M3 fill:#ffa8a8
Each pod creates its own mount, and the CSI driver dutifully tracks each one. Memory usage grows linearly with pod count. Not ideal when you’re trying to keep infrastructure lean.
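For reference, this per-pod pattern typically comes from the standard static-provisioning setup for the EFS CSI driver: a PersistentVolume backed by the driver, claimed by every workload that needs the data. The manifest below is a rough sketch of that setup; the names, storage size, and filesystem ID are placeholders.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi                 # EFS is elastic; the API just requires a value
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com      # the driver handles a separate mount for each pod using this PV
    volumeHandle: fs-xxxxxxxx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi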
The Solution: One Mount to Rule Them All
After some research and experimentation, I landed on an elegant workaround: use a DaemonSet to mount EFS once per node, then have pods access the data through a hostPath volume. This way, you get one mount per node instead of one per pod.
Here’s the new architecture:
flowchart TB
    subgraph AWS["AWS Cloud"]
        EFS["EFS Filesystem<br/>fs-xxxxxxxx"]
    end
    subgraph Node1["Kubernetes Node 1"]
        subgraph DS1["DaemonSet Pod"]
            Mount1["Single EFS Mount"]
        end
        HP1["/var/lib/efs<br/>(hostPath)"]
        Mount1 --> HP1
        Pod1["Pod A"] --> HP1
        Pod2["Pod B"] --> HP1
        Pod3["Pod C...N"] --> HP1
        Mount1 --> EFS
    end
    subgraph Node2["Kubernetes Node 2"]
        subgraph DS2["DaemonSet Pod"]
            Mount2["Single EFS Mount"]
        end
        HP2["/var/lib/efs<br/>(hostPath)"]
        Mount2 --> HP2
        Pod4["Pod X"] --> HP2
        Pod5["Pod Y"] --> HP2
        Pod6["Pod Z...N"] --> HP2
        Mount2 --> EFS
    end
    style DS1 fill:#51cf66,stroke:#2f9e44
    style DS2 fill:#51cf66,stroke:#2f9e44
    style HP1 fill:#8ce99a
    style HP2 fill:#8ce99a
The DaemonSet runs on every node, mounts EFS to a hostPath directory, and uses bidirectional mount propagation to make that mount available to all pods on the node. Simple, but surprisingly effective.
The Implementation
Here’s the DaemonSet configuration that made it all work:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-mounter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: efs-mounter
  template:
    metadata:
      labels:
        app: efs-mounter
    spec:
      serviceAccountName: default
      hostPID: true
      tolerations:
        - operator: Exists
      containers:
        - name: efs-mounter
          image: amazonlinux:2023
          securityContext:
            privileged: true
          command:
            - /bin/bash
            - -c
            - |
              set -euo pipefail
              echo "Installing amazon-efs-utils..."
              yum -y install amazon-efs-utils
              echo "Preparing mount points..."
              mkdir -p /mnt/efs/app-config
              mkdir -p /mnt/efs/app-data
              echo "Mount command available:"
              which mount.efs || { echo "mount.efs not found!"; exit 1; }
              echo "Mounting EFS filesystems..."
              retry_mount() {
                fs_id=$1
                target=$2
                for i in 1 2 3 4 5; do
                  echo "Mounting $fs_id to $target (attempt $i)..."
                  if mount -t efs "$fs_id:/" "$target"; then
                    echo "Mounted $fs_id successfully."
                    return 0
                  fi
                  echo "Mount failed. Retrying in 5s..."
                  sleep 5
                done
                echo "ERROR: Failed to mount $fs_id after 5 attempts"
                exit 1
              }
              retry_mount "${EFS1_ID}" "/mnt/efs/app-config"
              retry_mount "${EFS2_ID}" "/mnt/efs/app-data"
              echo "All mounts completed. Keeping container alive..."
              sleep infinity
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/bash
                  - -c
                  - |
                    echo "Unmounting EFS filesystems..."
                    umount /mnt/efs/app-config || true
                    umount /mnt/efs/app-data || true
                    echo "Unmount complete."
          env:
            - name: EFS1_ID
              value: "fs-xxxxxxxx"
            - name: EFS2_ID
              value: "fs-yyyyyyyy"
          volumeMounts:
            - name: efs-mount
              mountPath: /mnt/efs
              mountPropagation: Bidirectional
      volumes:
        - name: efs-mount
          hostPath:
            path: /var/lib/efs
            type: DirectoryOrCreate
A few things worth noting about this configuration:
- Privileged mode is required because we’re performing mount operations at the host level
- mountPropagation: Bidirectional is the secret sauce - it allows mounts made inside the container to be visible on the host and vice versa
- Tolerations are set to allow the DaemonSet to run on all nodes, including tainted ones
- The preStop lifecycle hook ensures clean unmounting when pods are terminated
How Pods Consume the Mount
Once the DaemonSet is running, your application pods can access EFS through a simple hostPath volume:
volumes:
  - name: app-config
    hostPath:
      path: /var/lib/efs/app-config
  - name: app-data
    hostPath:
      path: /var/lib/efs/app-data

volumeMounts:
  - name: app-config
    mountPath: /config
  - name: app-data
    mountPath: /data
No more EFS CSI driver involvement per pod - just straightforward directory access. Each volume maps to a subdirectory under the hostPath where the DaemonSet mounted EFS.
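If it helps to see those fragments in context, here’s a minimal consuming pod sketch - the name, image, and command are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # placeholder name
spec:
  containers:
    - name: app
      image: busybox:latest    # placeholder image
      command: ["sh", "-c", "ls /config /data && sleep 3600"]
      volumeMounts:
        - name: app-config
          mountPath: /config
        - name: app-data
          mountPath: /data
  volumes:
    - name: app-config
      hostPath:
        path: /var/lib/efs/app-config   # populated by the DaemonSet's EFS mount
    - name: app-data
      hostPath:
        path: /var/lib/efs/app-data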
You can mount as many directories as you need - just add the corresponding mkdir and retry_mount calls to the DaemonSet script, and update your pod specs accordingly. That said, keep in mind that if you go overboard with the number of mounts or don’t structure things thoughtfully, you might find yourself back in performance bottleneck territory. The goal is consolidation, not just moving the problem around.
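As a sketch, adding a hypothetical third directory (say, app-logs backed by a made-up fs-zzzzzzzz filesystem exposed through an EFS3_ID variable) would only touch a few spots:

# DaemonSet container: add the new filesystem ID...
env:
  - name: EFS3_ID
    value: "fs-zzzzzzzz"       # hypothetical third filesystem
# ...plus two lines in the startup script:
#   mkdir -p /mnt/efs/app-logs
#   retry_mount "${EFS3_ID}" "/mnt/efs/app-logs"

# Consuming pods: one more hostPath volume and mount.
volumes:
  - name: app-logs
    hostPath:
      path: /var/lib/efs/app-logs
volumeMounts:
  - name: app-logs
    mountPath: /logs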
The Results
After rolling this out, the memory savings were immediate and substantial. We recovered roughly 2-3GB of allocatable memory per node. For a cluster running lean on resources, that’s the difference between needing to scale out or not.
Caveats and Considerations
This approach isn’t without trade-offs:
- Write-heavy workloads: I haven’t battle-tested this with high-throughput concurrent writes. For read-heavy or moderate-write scenarios, it works beautifully.
- Security posture: Running privileged containers in kube-system is a trade-off. Make sure your security policies account for this.
- Failure domain: If the DaemonSet pod dies, all pods on that node lose access to EFS until it recovers. The retry logic and Kubernetes’ self-healing help here, but it’s something to consider.
- Mount propagation quirks: Not all Kubernetes distributions handle mount propagation identically. Test thoroughly in your environment.
Wrapping Up
Sometimes the most elegant solutions come from stepping back and questioning the default approach. The EFS CSI driver is fantastic for many use cases, but at scale, the per-pod mounting overhead can become a real problem. By consolidating to a single mount per node, we traded a small amount of complexity for significant resource savings.
If you’re running into similar memory pressure with EFS at scale, give this DaemonSet approach a try. Your nodes (and your cloud bill) might thank you.