KVM tuning
AI usage¶
This document adheres to the AI contribution policy found here. If you find any errors in the instructions, please let us know.
Introduction¶
This guide covers advanced performance tuning for KVM/libvirt virtual machines on Rocky Linux. It addresses common production issues including NUMA memory misallocation and vCPU scheduling contention.
The topics covered include:
- NUMA topology awareness and memory placement.
- The dangers of vm.min_free_kbytes on NUMA systems.
- vCPU pinning with libvirt domain XML.
- CPU isolation with isolcpus for dedicated VM workloads.
- tuned profiles for virtualization hosts.
- NUMA-aware VM placement with virsh.
Prerequisites¶
- A machine running Rocky Linux 8, 9, or 10 with KVM/libvirt installed and configured. See the Setting Up libvirt on Rocky Linux guide for initial setup.
- Root or sudo access on the hypervisor host.
- Familiarity with virsh and libvirt domain XML.
Understanding your NUMA topology¶
Before making any tuning changes, you must understand the NUMA (Non-Uniform Memory Access) topology of your host. On multi-socket systems, each CPU socket has its own local memory. Accessing remote memory on another socket incurs higher latency.
Install numactl if it is not already present:
sudo dnf install -y numactl
Display your NUMA topology:
numactl --hardware
Example output on a dual-socket system:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 65536 MB
node 0 free: 42000 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 45000 MB
You can also view the CPU-to-NUMA mapping:
lscpu --extended
Record which CPU cores belong to which NUMA node. You will need this information for vCPU pinning and memory placement in the sections that follow.
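For a compact summary, the following one-liner groups host CPUs by NUMA node using lscpu's parseable output. It is only a convenience sketch; the numactl output above already carries the same information:
# Group host CPUs by the NUMA node they belong to.
# lscpu --parse prints "CPU,NODE" pairs; header lines start with '#'.
lscpu --parse=CPU,NODE | grep -v '^#' | \
    awk -F, '{cpus[$2] = cpus[$2] " " $1} END {for (n in cpus) print "node " n ":" cpus[n]}'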
The dangers of vm.min_free_kbytes on NUMA systems¶
The vm.min_free_kbytes kernel parameter controls the minimum amount of memory the kernel keeps free for emergency allocations [1]. On NUMA systems, this reservation is distributed across lowmem zones on all NUMA nodes, not applied as a single system-wide pool.
Setting vm.min_free_kbytes too high can cause immediate OOM kills
On a NUMA system, the kernel distributes the vm.min_free_kbytes reservation proportionally across each zone. If you set it to 10 GB on a system with two NUMA nodes, each node's Normal zone may be required to keep approximately 5 GB free. If one NUMA node has most of its memory committed to virtual machines, it cannot meet its watermark requirement. The kernel marks the zone as all_unreclaimable and triggers the OOM killer, even when other NUMA nodes have plenty of free memory.
How the watermark distribution works¶
The kernel calculates per-zone minimum watermarks from vm.min_free_kbytes. You can view the current watermarks:
cat /proc/zoneinfo | grep -E "Node|min|low|high|managed"
Example output showing the per-zone distribution:
Node 0, zone Normal
min 131072
low 163840
high 196608
managed 16777216
Node 1, zone Normal
min 131072
low 163840
high 196608
managed 16777216
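The min, low, and high values in /proc/zoneinfo are counted in pages (4 KB base pages on x86_64). A small awk sketch, assuming that page size, converts each zone's min watermark to megabytes so you can compare it directly with vm.min_free_kbytes:
# Print each zone's min watermark in MB (assumes 4 KB base pages).
awk '/^Node/ {zone = $0} /^ *min/ {printf "%s: min %.0f MB\n", zone, $2 * 4 / 1024}' /proc/zoneinfo
In the example output above, 131072 pages correspond to 512 MB of reserved memory per Normal zone.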
Safe guidelines for KVM hosts¶
On KVM hosts where most physical memory is committed to virtual machines, follow these guidelines:
- Keep vm.min_free_kbytes at a conservative value. A reasonable starting point is a few hundred megabytes; avoid values larger than 1-2% of total physical memory.
- Check the current value:
sysctl vm.min_free_kbytes
- To set a safe value (example for a 128 GB host):
sudo sysctl -w vm.min_free_kbytes=524288
- Make it persistent:
echo "vm.min_free_kbytes = 524288" | sudo tee /etc/sysctl.d/99-kvm-min-free.conf
sudo sysctl --system
Always check NUMA zone free memory before changing this value
Run numactl --hardware and verify that every NUMA node has significantly more free memory than its share of the proposed vm.min_free_kbytes value. If NUMA node 0 has only 2 GB free and you set vm.min_free_kbytes to 4 GB, node 0's roughly 2 GB share of that reservation consumes all of its free memory, and the node immediately enters a reclaim spiral.
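The following sketch automates that check. The proposed value is an assumption you should adjust, and it simply compares each node's free memory against an even split of the reservation:
# Proposed vm.min_free_kbytes in kB (assumed value; adjust to your plan).
PROPOSED_KB=524288
# Number of NUMA nodes reported by numactl.
NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
# Compare each node's free memory against an even split of the reservation.
numactl --hardware | awk -v share=$(( PROPOSED_KB / NODES / 1024 )) \
    '/free:/ {print "node " $2 ": " $4 " MB free (per-node share of reservation ~ " share " MB)"}'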
vCPU pinning with libvirt¶
By default, the KVM scheduler can migrate guest vCPUs across any host CPU core. This causes cache thrashing, cross-NUMA memory access, and unpredictable latency. vCPU pinning assigns each guest vCPU to a specific host CPU, eliminating these issues.
Identifying available host CPUs¶
First, determine which host CPUs are available and their NUMA node membership:
virsh capabilities | grep -A 20 "topology"
Or use lscpu:
lscpu --extended
Configuring vCPU pinning in domain XML¶
Edit your VM domain XML with virsh edit:
virsh edit your-vm-name
Add a <cputune> section [2]. Each <vcpupin> directive maps a guest vCPU to a specific host CPU:
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
<emulatorpin cpuset='0-3'/>
</cputune>
Terminology
- vcpupin: pins a guest vCPU to specific host CPU cores.
- emulatorpin: pins the QEMU emulator threads (which handle I/O, timers, and device emulation) to specific host CPUs.
The importance of emulatorpin¶
The <emulatorpin> directive is critical but often overlooked. Without it, QEMU emulator threads are free to run on any host CPU, including CPUs you have isolated for guest vCPUs. This causes contention that can deschedule guest vCPUs, leading to clock drift and application timeouts.
Always configure emulatorpin
Without emulatorpin, QEMU emulator threads can run on your isolated vCPU cores, causing the hypervisor to deschedule guest CPUs. Applications sensitive to timing (such as F5 BIG-IP, database clusters, and real-time workloads) will experience failures when vCPUs lose scheduling time.
Pin emulator threads to your housekeeping CPUs (the cores not assigned to any guest vCPU):
<emulatorpin cpuset='0-3'/>
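You can confirm and adjust the emulator thread affinity on a running guest with virsh. The VM name and CPU list below are placeholders:
# Query the current emulator thread affinity.
virsh emulatorpin your-vm-name
# Repin the emulator threads to the housekeeping cores, live and in the persistent XML.
virsh emulatorpin your-vm-name 0-3 --live --config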
Pinning rules¶
Follow these steps for effective vCPU pinning:
- Pin all vCPUs to the same NUMA node as the VM memory allocation.
- Use dedicated 1:1 pinning, where each guest vCPU maps to exactly one host CPU. Avoid floating pools where multiple vCPUs share a cpuset.
- Reserve housekeeping CPUs (typically cores 0-3) for the host kernel, IRQ handling, QEMU emulator threads, and system daemons.
- Never overlap housekeeping cores with guest vCPU pins. Allowing guest vCPU pools to include housekeeping cores causes scheduling contention.
Complete vCPU pinning example¶
This example shows a 12-vCPU VM pinned to NUMA node 0 with emulator threads on housekeeping cores:
<vcpu placement='static'>12</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
<vcpupin vcpu='4' cpuset='8'/>
<vcpupin vcpu='5' cpuset='9'/>
<vcpupin vcpu='6' cpuset='10'/>
<vcpupin vcpu='7' cpuset='11'/>
<vcpupin vcpu='8' cpuset='12'/>
<vcpupin vcpu='9' cpuset='13'/>
<vcpupin vcpu='10' cpuset='14'/>
<vcpupin vcpu='11' cpuset='15'/>
<emulatorpin cpuset='0-3'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
Verify the pinning after starting the VM:
virsh vcpuinfo your-vm-name
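virsh can also query and adjust pinning on a running guest without editing the XML. A short sketch, with the VM name and CPU numbers as placeholders:
# Show the full vCPU-to-host-CPU affinity map.
virsh vcpupin your-vm-name
# Repin vCPU 0 to host CPU 4 on the live guest and persist it in the domain XML.
virsh vcpupin your-vm-name 0 4 --live --config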
CPU isolation with isolcpus¶
vCPU pinning alone does not prevent the host kernel scheduler from placing other processes on those CPUs. To fully dedicate CPUs to virtual machines, use kernel CPU isolation.
Kernel command line parameters¶
Add the following to the kernel command line to isolate the non-housekeeping CPUs (CPUs 4-63 in this example) for exclusive VM use:
sudo grubby --args="isolcpus=4-63 nohz_full=4-63 rcu_nocbs=4-63" --update-kernel=ALL
Reboot to apply:
sudo reboot
These parameters work together [3]:
- isolcpus=4-63: removes CPUs 4-63 from the general scheduler. Only explicitly pinned tasks run on them.
- nohz_full=4-63: disables the scheduler tick on isolated CPUs when only one task is running, reducing jitter [4].
- rcu_nocbs=4-63: offloads RCU (Read-Copy-Update) callback processing from isolated CPUs to dedicated kthreads, which can then be affined to housekeeping CPUs.
Verifying CPU isolation¶
After rebooting, confirm that isolation is active:
cat /sys/devices/system/cpu/isolated
Expected output:
4-63
You can further verify with a stress test. Run stress-ng and confirm that processes only land on housekeeping CPUs:
taskset -c 0-63 stress-ng --cpu 8 --timeout 30s &
watch -n1 'ps -eo pid,psr,comm | grep stress'
The PSR column should show only CPUs 0-3 (the non-isolated housekeeping cores). If stress processes appear on isolated CPUs, the isolation configuration requires troubleshooting.
tuned profiles for virtualization hosts¶
The tuned daemon provides system tuning profiles that automate kernel parameter and CPU governor settings. Rocky Linux includes profiles specifically designed for virtualization hosts.
Installing the virtualization profiles¶
sudo dnf install -y tuned-profiles-cpu-partitioning
Available profiles¶
View all available profiles:
tuned-adm list
The relevant profiles for KVM hosts are:
- virtual-host: basic tuning for virtualization hosts. Enables transparent hugepages and adjusts kernel scheduling parameters.
- cpu-partitioning: advanced profile that combines CPU isolation with optimized scheduling. Best for latency-sensitive VM workloads.
Configuring cpu-partitioning¶
The cpu-partitioning profile requires configuration before activation. Edit the variables file:
sudo vi /etc/tuned/cpu-partitioning-variables.conf
Set the isolated and non-balanced cores to match your CPU isolation plan:
isolated_cores=4-63
no_balance_cores=4-63
Activate the profile:
sudo tuned-adm profile cpu-partitioning
Verify the active profile:
tuned-adm active
Expected output:
Current active profile: cpu-partitioning
After activating the profile, reboot to ensure all kernel parameters are fully applied:
sudo reboot
Verifying tuned settings¶
After reboot, confirm that the profile has applied the expected kernel parameters:
cat /proc/cmdline | tr ' ' '\n' | grep -E "isolcpus|nohz_full|rcu_nocbs"
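tuned can also check its own work. The tuned-adm verify command compares the current system settings against the active profile and reports any mismatches:
sudo tuned-adm verify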
NUMA-aware VM placement with virsh¶
For optimal performance, a VM's memory and vCPUs must be on the same NUMA node. Accessing memory on a remote NUMA node adds significant latency.
Configuring NUMA memory binding¶
In the VM domain XML, use the <numatune> element to bind VM memory to a specific NUMA node [2]:
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
The strict mode ensures that all memory allocations for this VM come from NUMA node 0. If the node does not have sufficient free memory, allocations fail rather than silently falling back to remote memory; in practice this can invoke the OOM killer on that node instead of cleanly preventing VM startup, so ensure the target NUMA node has sufficient free memory before starting the VM.
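Before starting a strictly bound VM, confirm that the target node can hold the guest's full memory allocation. Either of these quick checks works; the node number is an example:
# Free memory on NUMA node 0 as reported by the kernel.
numactl --hardware | grep 'node 0 free'
# Free memory on NUMA cell 0 as reported by libvirt.
virsh freecell --cellno 0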
Aligning vCPU pins with NUMA memory¶
Ensure that the CPUs you pin vCPUs to belong to the same NUMA node as the memory binding. Using the numactl --hardware output from earlier, if NUMA node 0 contains CPUs 0-15, pin your vCPUs to cores within that range (excluding housekeeping cores):
<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
<vcpupin vcpu='4' cpuset='8'/>
<vcpupin vcpu='5' cpuset='9'/>
<vcpupin vcpu='6' cpuset='10'/>
<vcpupin vcpu='7' cpuset='11'/>
<emulatorpin cpuset='0-3'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
Verifying NUMA placement at runtime¶
After starting the VM, verify its NUMA memory usage:
virsh numatune your-vm-name
Check which NUMA nodes the VM process is using:
numastat -c qemu-kvm
Rebalancing VMs across NUMA nodes¶
If one NUMA node is overcommitted while another is underutilized, redistribute VMs between nodes. Shut down the VM, update its <numatune> and <cputune> to target the other NUMA node, then restart it:
virsh shutdown your-vm-name
virsh edit your-vm-name
# Change nodeset and vcpupin to target NUMA node 1 CPUs
virsh start your-vm-name
Hugepages for KVM¶
Hugepages reduce TLB (Translation Lookaside Buffer) misses by using larger memory pages (2 MB or 1 GB instead of the default 4 KB). This significantly improves memory access performance for virtual machines.
Configuring 2 MB hugepages¶
Calculate the number of hugepages needed. For a VM with 16 GB of memory:
16384 MB / 2 MB = 8192 hugepages
Reserve hugepages:
echo "vm.nr_hugepages = 8192" | sudo tee /etc/sysctl.d/99-hugepages.conf
sudo sysctl --system
Verify the allocation:
grep HugePages /proc/meminfo
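Once the reservation succeeds, you should see output along these lines (the counts shown are from the 8192-page example and will differ if pages are already in use):
HugePages_Total:    8192
HugePages_Free:     8192
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB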
Configuring hugepages in the VM domain XML¶
Add the <memoryBacking> element to use hugepages [2]:
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB'/>
</hugepages>
<locked/>
</memoryBacking>
The locked element
The <locked/> element prevents the host from swapping the VM memory. This is recommended for performance but requires sufficient physical memory. If using <locked/>, ensure you have enough hugepages allocated for all VMs.
NUMA-aware hugepage allocation¶
For NUMA systems, allocate hugepages per node to ensure they are available where needed:
echo 4096 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 4096 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
Match the hugepage allocation to the VM placement on each NUMA node.
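To confirm that the per-node split took effect, read the counters back from sysfs. A small sketch:
# Reserved and free 2 MB hugepages on each NUMA node.
for node in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
    echo "$node: total=$(cat $node/nr_hugepages) free=$(cat $node/free_hugepages)"
done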
Putting it all together¶
A fully tuned VM configuration combines all the elements covered in this guide [5]. Here is a complete example for a performance-critical VM on a dual-socket Rocky Linux 9 host:
<domain type='kvm'>
<name>production-vm</name>
<memory unit='GiB'>16</memory>
<vcpu placement='static'>8</vcpu>
<cpu mode='host-passthrough'>
<topology sockets='1' dies='1' cores='8' threads='1'/>
<numa>
<cell id='0' cpus='0-7' memory='16' unit='GiB'/>
</numa>
</cpu>
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
<vcpupin vcpu='4' cpuset='8'/>
<vcpupin vcpu='5' cpuset='9'/>
<vcpupin vcpu='6' cpuset='10'/>
<vcpupin vcpu='7' cpuset='11'/>
<emulatorpin cpuset='0-3'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB'/>
</hugepages>
<locked/>
</memoryBacking>
<!-- remaining domain configuration (disk, network, etc.) -->
</domain>
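If you assemble the configuration as a standalone XML file first, you can sanity-check it against the libvirt schema before defining the domain. The file path below is a placeholder, and virt-xml-validate ships with libvirt:
virt-xml-validate /tmp/production-vm.xml domain
After defining and starting the VM, the virsh vcpuinfo, virsh numatune, and numastat checks from the earlier sections confirm that the pinning, memory binding, and hugepage backing all landed where intended.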
Conclusion¶
Performance tuning KVM/libvirt on Rocky Linux requires a holistic approach that addresses CPU scheduling, memory placement, and hugepage configuration together. The key principles are:
- Understand your NUMA topology before making any changes.
- Pin vCPUs to specific cores on the same NUMA node as the VM memory.
- Always configure emulatorpin to prevent QEMU threads from contending with guest vCPUs.
- Isolate CPUs with isolcpus, nohz_full, and rcu_nocbs for latency-sensitive workloads.
- Use the cpu-partitioning tuned profile for consistent system-wide tuning.
- Be cautious with vm.min_free_kbytes on NUMA systems to avoid OOM conditions.
References¶
- "Documentation for /proc/sys/vm/" by the Linux kernel documentation project https://docs.kernel.org/admin-guide/sysctl/vm.html
- "Domain XML format" by the libvirt project https://libvirt.org/formatdomain.html
- "The kernel's command-line parameters" by the Linux kernel documentation project https://docs.kernel.org/admin-guide/kernel-parameters.html
- "NO_HZ: Reducing scheduling-clock ticks" by the Linux kernel documentation project https://docs.kernel.org/timers/no_hz.html
- "KVM real time guest configuration" by the libvirt project https://libvirt.org/kbase/kvm-realtime.html
Author: Howard Van Der Wal