Fresh Swap Features for Linux Users in Kubernetes 1.32
Swap is a fundamental and an invaluable Linux feature. It offers numerous benefits, such as effectively increasing a node’s memory by swapping out unused data, shielding nodes from system-level memory spikes, preventing Pods from crashing when they hit their memory limits, and much more. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes.
The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Later, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many new improvements. In the following Kubernetes releases more improvements were made, paving the way to GA in the near future.
Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.
In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This provided Linux users the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and only partially worked on limited environments.
In version 1.28 swap support on Linux nodes was promoted to Beta.
The Beta version was a drastic leap forward.
Not only did it fix a large amount of bugs and made swap work in a stable way,
but it also brought cgroup v2 support, introduced a wide variety of tests
which include complex scenarios such as node-level pressure, and more.
It also brought many exciting new capabilities such as the LimitedSwap
behavior
which sets an auto-calculated swap limit to containers, OpenMetrics instrumentation
support (through the /metrics/resource
endpoint) and Summary API for
VerticalPodAutoscalers (through the /stats/summary
endpoint), and more.
Today we are working on more improvements, paving the way for GA.
Currently, the focus is especially towards ensuring node stability,
enhanced debug abilities, addressing user feedback,
polishing the feature and making it stable.
For example, in order to increase stability, containers in high-priority pods
cannot access swap which ensures the memory they need is ready to use.
In addition, the UnlimitedSwap
behavior was removed since it might compromise
the node's health.
Secret content protection against swapping has also been introduced
(see relevant security-risk section for more info).
To conclude, compared to previous releases, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. That said, the NodeSwap feature introduces basic swap support, and this is just the beginning. In the near future, additional features are planned to enhance swap functionality in various ways, such as improving evictions, extending the API, increasing customizability, and more!
How do I use it?
In order for the kubelet to initialize on a swap-enabled node, the failSwapOn
field must be set to false
on kubelet's configuration setting, or the deprecated
--fail-swap-on
command line flag must be deactivated.
It is possible to configure the memorySwap.swapBehavior
option to define the
manner in which a node utilizes swap memory.
For instance,
# this fragment goes into the kubelet's configuration file
memorySwap:
swapBehavior: LimitedSwap
The currently available configuration options for swapBehavior
are:
NoSwap
(default): Kubernetes workloads cannot use swap. However, processes outside of Kubernetes' scope, like system daemons (such as kubelet itself!) can utilize swap. This behavior is beneficial for protecting the node from system-level memory spikes, but it does not safeguard the workloads themselves from such spikes.LimitedSwap
: Kubernetes workloads can utilize swap memory, but with certain limitations. The amount of swap available to a Pod is determined automatically, based on the proportion of the memory requested relative to the node's total memory. Only non-high-priority Pods under the Burstable Quality of Service (QoS) tier are permitted to use swap. For more details, see the section below.
If configuration for memorySwap
is not specified,
by default the kubelet will apply the same behaviour as the NoSwap
setting.
On Linux nodes, Kubernetes only supports running with swap enabled for hosts that use cgroup v2. On cgroup v1 systems, all Kubernetes workloads are not allowed to use swap memory.
Install a swap-enabled cluster with kubeadm
Before you begin
It is required for this demo that the kubeadm tool be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap.
Create a swap file and turn swap on
I'll demonstrate creating 4GiB of swap, both in the encrypted and unencrypted case.
Setting up unencrypted swap
An unencrypted swap file can be set up as follows.
# Allocate storage and restrict access
fallocate --length 4GiB /swapfile
chmod 600 /swapfile
# Format the swap space
mkswap /swapfile
# Activate the swap space for paging
swapon /swapfile
Setting up encrypted swap
An encrypted swap file can be set up as follows.
Bear in mind that this example uses the cryptsetup
binary (which is available
on most Linux distributions).
# Allocate storage and restrict access
fallocate --length 4GiB /swapfile
chmod 600 /swapfile
# Create an encrypted device backed by the allocated storage
cryptsetup --type plain --cipher aes-xts-plain64 --key-size 256 -d /dev/urandom open /swapfile cryptswap
# Format the swap space
mkswap /dev/mapper/cryptswap
# Activate the swap space for paging
swapon /dev/mapper/cryptswap
Verify that swap is enabled
Swap can be verified to be enabled with both swapon -s
command or the free
command
> swapon -s
Filename Type Size Used Priority
/dev/dm-0 partition 4194300 0 -2
> free -h
total used free shared buff/cache available
Mem: 3.8Gi 1.3Gi 249Mi 25Mi 2.5Gi 2.5Gi
Swap: 4.0Gi 0B 4.0Gi
Enable swap on boot
After setting up swap, to start the swap file at boot time,
you either set up a systemd unit to activate (encrypted) swap, or you
add a line similar to /swapfile swap swap defaults 0 0
into /etc/fstab
.
Set up a Kubernetes cluster that uses swap-enabled nodes
To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml
for the swap enabled cluster.
---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
swapBehavior: LimitedSwap
Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml
.
During init, there is a warning that swap is enabled on the node and in case the kubelet
failSwapOn
is set to true. We plan to remove this warning in a future release.
How is the swap limit being determined with LimitedSwap?
The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap with automatic configuration of limitations.
With LimitedSwap
, Pods that do not fall under the Burstable QoS classification (i.e.
BestEffort
/Guaranteed
QoS Pods) are prohibited from utilizing swap memory.
BestEffort
QoS Pods exhibit unpredictable memory consumption patterns and lack
information regarding their memory usage, making it difficult to determine a safe
allocation of swap memory.
Conversely, Guaranteed
QoS Pods are typically employed for applications that rely on the
precise allocation of resources specified by the workload, with memory being immediately available.
To maintain the aforementioned security and node health guarantees,
these Pods are not permitted to use swap memory when LimitedSwap
is in effect.
In addition, high-priority pods are not permitted to use swap in order to ensure the memory
they consume always residents on disk, hence ready to use.
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
nodeTotalMemory
: The total amount of physical memory available on the node.totalPodsSwapAvailable
: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).containerMemoryRequest
: The container's memory request.
Swap limitation is configured as:
(containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable
In other words, the amount of swap that a container is able to use is proportionate to its memory request, the node's total physical memory and the total amount of swap memory on the node that is available for use by Pods.
It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.
How does it work?
There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, the kubelet is able to be configured so that:
- It can start with swap on.
- It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.
Swap configuration on a node is exposed to a cluster admin via the
memorySwap
in the KubeletConfiguration.
As a cluster administrator, you can specify the node's behaviour in the
presence of swap memory by setting memorySwap.swapBehavior
.
The kubelet employs the CRI
(container runtime interface) API, and directs the container runtime to
configure specific cgroup v2 parameters (such as memory.swap.max
) in a manner that will
enable the desired swap configuration for a container. For runtimes that use control groups,
the container runtime is then responsible for writing these settings to the container-level cgroup.
How can I monitor swap?
Node and container level metric statistics
Kubelet now collects node and container level metric statistics,
which can be accessed at the /metrics/resource
(which is used mainly by monitoring
tools like Prometheus) and /stats/summary
(which is used mainly by Autoscalers) kubelet HTTP endpoints.
This allows clients who can directly interrogate the kubelet to
monitor swap usage and remaining swap memory when using LimitedSwap
.
Additionally, a machine_swap_bytes
metric has been added to cadvisor to show
the total physical swap capacity of the machine.
See this page for more info.
Node Feature Discovery (NFD)
Node Feature Discovery is a Kubernetes addon for detecting hardware features and configuration. It can be utilized to discover which nodes are provisioned with swap.
As an example, to figure out which nodes are provisioned with swap, use the following command:
kubectl get nodes -o jsonpath='{range .items[?(@.metadata.labels.feature\.node\.kubernetes\.io/memory-swap)]}{.metadata.name}{"\t"}{.metadata.labels.feature\.node\.kubernetes\.io/memory-swap}{"\n"}{end}'
This will result in an output similar to:
k8s-worker1: true
k8s-worker2: true
k8s-worker3: false
In this example, swap is provisioned on nodes k8s-worker1
and k8s-worker2
, but not on k8s-worker3
.
Caveats
Having swap available on a system reduces predictability. While swap can enhance performance by making more RAM available, swapping data back to memory is a heavy operation, sometimes slower by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure. Enabling swap increases the risk of noisy neighbors, where Pods that frequently use their RAM may cause other Pods to swap. In addition, since swap allows for greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, and due to unexpected packing configurations, the scheduler currently does not account for swap memory usage. This heightens the risk of noisy neighbors.
The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe. As swap might cause IO pressure, it is recommended to give a higher IO latency priority to system critical daemons. See the relevant section in the recommended practices section below.
Memory-backed volumes
On Linux nodes, memory-backed volumes (such as secret
volume mounts, or emptyDir
with medium: Memory
)
are implemented with a tmpfs
filesystem.
The contents of such volumes should remain in memory at all times, hence should
not be swapped to disk.
To ensure the contents of such volumes remain in memory, the noswap
tmpfs option
is being used.
The Linux kernel officially supports the noswap
option from version 6.3 (more info
can be found in Linux Kernel Version Requirements).
However, the different distributions often choose to backport this mount option to older
Linux versions as well.
In order to verify whether the node supports the noswap
option, the kubelet will do the following:
- If the kernel's version is above 6.3 then the
noswap
option will be assumed to be supported. - Otherwise, kubelet would try to mount a dummy tmpfs with the
noswap
option at startup. If kubelet fails with an error indicating of an unknown option,noswap
will be assumed to not be supported, hence will not be used. A kubelet log entry will be emitted to warn the user about memory-backed volumes might swap to disk. If kubelet succeeds, the dummy tmpfs will be deleted and thenoswap
option will be used.- If the
noswap
option is not supported, kubelet will emit a warning log entry, then continue its execution.
- If the
It is deeply encouraged to encrypt the swap space. See the section above with an example for setting unencrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator's responsibility to provision encrypted swap to mitigate this risk.
Good practice for using swap in a Kubernetes cluster
Disable swap for system-critical daemons
During the testing phase and based on user feedback, it was observed that the performance
of system-critical daemons and services might degrade.
This implies that system daemons, including the kubelet, could operate slower than usual.
If this issue is encountered, it is advisable to configure the cgroup of the system slice
to prevent swapping (i.e., set memory.swap.max=0
).
Protect system-critical daemons for I/O latency
Swap can increase the I/O load on a node. When memory pressure causes the kernel to rapidly swap pages in and out, system-critical daemons and services that rely on I/O operations may experience performance degradation.
To mitigate this, it is recommended for systemd users to prioritize the system slice in terms of I/O latency.
For non-systemd users,
setting up a dedicated cgroup for system daemons and processes and prioritizing I/O latency in the same way is advised.
This can be achieved by setting io.latency
for the system slice,
thereby granting it higher I/O priority.
See cgroup's documentation for more info.
Swap and control plane nodes
The Kubernetes project recommends running control plane nodes without any swap space configured. The control plane primarily hosts Guaranteed QoS Pods, so swap can generally be disabled. The main concern is that swapping critical services on the control plane could negatively impact performance.
Use of a dedicated disk for swap
It is recommended to use a separate, encrypted disk for the swap partition. If swap resides on a partition or the root filesystem, workloads may interfere with system processes that need to write to disk. When they share the same disk, processes can overwhelm swap, disrupting the I/O of kubelet, container runtime, and systemd, which would impact other workloads. Since swap space is located on a disk, it is crucial to ensure the disk is fast enough for the intended use cases. Alternatively, one can configure I/O priorities between different mapped areas of a single backing device.
Looking ahead
As you can see, the swap feature was dramatically improved lately, paving the way for a feature GA. However, this is just the beginning. It's a foundational implementation marking the beginning of enhanced swap functionality.
In the near future, additional features are planned to further improve swap capabilities, including better eviction mechanisms, extended API support, increased customizability, better debug abilities and more!
How can I learn more?
You can review the current documentation for using swap with Kubernetes.
For more information, please see KEP-2400 and its design proposal.
How do I get involved?
Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap.
Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub) if you'd like to help or ask further questions.