XFS recovery

AI usage¶

This document adheres to the AI contribution policy found here. If you find any errors in the instructions, please let us know.

Introduction¶

Rocky Linux uses XFS as its default file system for all partitions except /boot (which uses ext4). XFS is a high-performance, journal-based file system that handles large files and high I/O workloads well¹, but like any file system, it can encounter issues ranging from kernel-level memory leaks to metadata corruption that prevents booting.

This guide covers:

Monitoring XFS-related kernel slab memory usage.
Diagnosing slab memory leaks caused by known kernel bugs.
Mitigating memory issues related to Transparent Huge Pages (THP).
Recovering a system that will not boot using rd.break and xfs_repair.
Understanding when and how to use xfs_repair -L.
Avoiding data loss during RAID controller recovery.

The procedures in this guide apply to Rocky Linux 8, Rocky Linux 9, and Rocky Linux 10. Where versions differ, the guide notes the distinction.

Prerequisites¶

Before working through this guide, ensure you have:

Root or sudo access to the Rocky Linux system.
Basic familiarity with LVM (Logical Volume Manager) concepts.
Console access (physical, IPMI, or iDRAC) for rd.break recovery procedures.
A current backup of critical data before running any repair operations.

Install the xfsprogs package if it is not already present:

dnf install xfsprogs

Viewing `XFS` file system information¶

Use xfs_info to display the configuration of a mounted XFS file system². Pass the mount point as the argument:

xfs_info /

Standard Rocky Linux installations

The xfs_info / command works on standard Rocky Linux installations where / is formatted as XFS. Some cloud provider images may use ext4 for the root file system. Verify with df -Th / before running xfs_info.

This shows the block size, inode size, log size, and other structural details. Record this output as a baseline before troubleshooting.

To list all mounted XFS file systems:

mount -t xfs

Monitoring `XFS` slab memory usage¶

The Linux kernel uses slab memory managers to handle internal objects, including those used by XFS³. Monitoring slab usage helps identify memory leaks and abnormal growth patterns.

Viewing slab allocations with `slabtop`¶

The slabtop command displays real-time slab statistics. Sort by cache size to see the largest consumers:

slabtop -s c

Key XFS-related slab objects to monitor:

xfs_inode - cached XFS inode structures.
xfs_buf - XFS buffer cache entries.
xfs_ili - XFS inode log items.
xfs_trans - XFS transaction structures.

Other kernel slab objects that grow alongside XFS under memory pressure:

dentry - directory entry cache.
inode_cache - VFS inode cache.
radix_tree_node - tree nodes used for page cache indexing.

Checking `/proc/slabinfo` directly¶

For scripting or automated monitoring, read /proc/slabinfo directly:

grep -E 'xfs_inode|xfs_buf|dentry|inode_cache' /proc/slabinfo

Each line shows the object name, active objects, total objects, object size, and other details.

Monitoring memory fields in `/proc/meminfo`¶

Three fields in /proc/meminfo track slab memory⁴:

grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo

Slab - total memory used by the slab manager.
SReclaimable - slab memory that the kernel can reclaim under pressure.
SUnreclaim - slab memory that cannot be reclaimed.

An increasing SUnreclaim value over days or weeks indicates a potential memory leak.

Establishing a baseline¶

Record slab values shortly after a clean boot to establish a normal baseline:

date && grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo && slabtop -o -s c | head -20

Compare against this baseline during routine monitoring. Normal growth depends on workload, but doubling of slab values without a corresponding increase in application activity warrants investigation.

Diagnosing `XFS` slab memory leaks¶

Slab memory leaks in the kernel manifest as a steady increase in slab allocations over days or weeks that do not decrease when workloads are reduced.

Identifying the growth pattern¶

Signs of a kernel slab memory leak:

SUnreclaim in /proc/meminfo grows steadily over days without returning to baseline.
Specific slab objects (such as numa_policy, xfs_inode, or pid) grow by 10x or more compared to a freshly booted system.
Total memory usage climbs to 80-90% of physical RAM despite stable application workloads.
Swap usage increases as the kernel consumes available memory.

Checking for runaway `kworker` threads¶

Kernel worker threads (kworker) handle deferred kernel operations. XFS uses kernel work queues for journal writes and metadata operations, so an abnormally high kworker count can indicate a kernel I/O issue:

ps -eLf | grep -c kworker

A healthy system typically has dozens to low hundreds of kworker threads depending on CPU count and I/O load. Counts of 500 or more, combined with high slab growth, indicate a kernel-level problem.

Mitigating `XFS` memory issues with `THP`¶

Transparent Huge Pages (THP) can interact poorly with the kernel memory compaction system, triggering slab memory leaks on affected kernel versions⁵. Disabling THP is the recommended workaround when a kernel update is not immediately possible.

Checking the current `THP` state¶

cat /sys/kernel/mm/transparent_hugepage/enabled

The output shows three options with the active setting in brackets. For example, always [madvise] never means madvise is active.

Disabling `THP` immediately¶

To disable THP on a running system without a reboot:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Verify the change:

cat /sys/kernel/mm/transparent_hugepage/enabled

The output should show always madvise [never].

Making the `THP` change persistent¶

To ensure THP remains disabled across reboots, add a kernel boot parameter⁹. Edit the GRUB configuration:

grubby --update-kernel=ALL --args="transparent_hugepage=never"

Verify the parameter was added:

grubby --info=ALL | grep args

The transparent_hugepage=never parameter should appear in the kernel arguments.

Performance considerations

Disabling THP may reduce performance for applications that benefit from large memory pages, such as databases and in-memory caches. Monitor application performance after disabling THP and re-enable it if your workload benefits from it.

Recovering a system that will not boot with `rd.break`¶

When an XFS file system suffers metadata corruption, the system may fail to mount its root file system during boot. Errors such as Metadata has LSN ahead of current LSN in the boot output indicate that the XFS journal (log) contains sequence numbers that are inconsistent with the file system metadata. The rd.break kernel parameter interrupts the boot process before the root file system is mounted⁶.

Console access required

The rd.break recovery procedure requires interactive access to the GRUB boot menu. This means physical console access, or remote console access through IPMI, iDRAC, iLO, or a similar baseboard management controller. You cannot perform this procedure over SSH.

When to use `rd.break`¶

Use rd.break when:

The system drops to a dracut emergency shell during boot.
Boot messages show XFS mount failures with LSN errors.
The root file system (/sysroot) fails to mount.
Standard rescue mode (systemd.unit=rescue.target) also fails because it depends on mounting root.

Step-by-step recovery procedure¶

Step 1 - Access the GRUB boot menu. When the GRUB menu appears during boot, press e to edit the default boot entry.

Step 2 - Add rd.break to the kernel command line. Find the line that begins with linux (or linuxefi) and append rd.break to the end of that line.

Step 3 - Boot with the modified parameters. Press Ctrl+X to boot. The system will stop in the initramfs environment before mounting the root file system. You will see a switch_root:/root# prompt.

Step 4 - Activate LVM volumes. If the system uses LVM (which is the default Rocky Linux layout)⁷, activate all volume groups:

lvm vgchange -ay

List the available logical volumes to identify which ones need repair:

lvm lvs

Step 5 - Run xfs_repair on each XFS logical volume. Start without the -L flag to assess the damage:

xfs_repair /dev/mapper/vg00-lv_root

If xfs_repair completes but reports Maximum metadata LSN is ahead of log or shows extensive CRC errors and metadata corruption, the file system needs log zeroing. Proceed with the -L flag as described in the next section.

Run xfs_repair -L on every XFS logical volume that reported LSN or metadata errors:

xfs_repair -L /dev/mapper/vg00-lv_root
xfs_repair -L /dev/mapper/vg00-lv_home
xfs_repair -L /dev/mapper/vg00-lv_var
xfs_repair -L /dev/mapper/vg00-lv_tmp
xfs_repair -L /dev/mapper/vg00-lv_opt
xfs_repair -L /dev/mapper/vg00-lv_var_log

Skip swap volumes

Do not run xfs_repair on swap volumes. Swap uses a different format and is not an XFS file system. Running xfs_repair on a swap volume will produce errors or damage the swap signature.

Step 6 - Reboot the system. After repairing all XFS volumes:

reboot -f

The system should now boot normally. After booting, verify file system health by checking dmesg for XFS messages:

dmesg | grep -i xfs

Understanding `xfs_repair -L`¶

The xfs_repair command checks and repairs XFS file system metadata⁸. The -L flag has a specific and significant purpose that you should understand before using it.

What `-L` does¶

The -L flag tells xfs_repair to zero (clear) the file system journal log. The XFS journal records pending metadata operations. When -L is used, any operations recorded in the journal that have not yet been written to the file system are permanently discarded.

When to use `-L`¶

Use xfs_repair -L when:

The file system cannot be mounted due to journal corruption (LSN errors).
Running xfs_repair without -L reports Maximum metadata LSN is ahead of log and formats the log.
Running xfs_repair without -L shows extensive CRC errors and metadata corruption.
The system will not boot and standard repair alone did not restore the ability to mount.

When NOT to use `-L`¶

Do not use xfs_repair -L when:

The file system mounts normally - use xfs_repair without -L instead.
You have not tried xfs_repair without -L first.
The issue is performance-related rather than corruption-related.

Always try standard repair first¶

Run xfs_repair without any flags first:

xfs_repair /dev/mapper/vg00-lv_root

If the journal can be replayed cleanly, this preserves all pending writes. Only escalate to -L when standard repair reports LSN mismatches or extensive metadata CRC errors. Note that xfs_repair without -L may still complete its phases but output a Maximum metadata LSN is ahead of log message - this indicates that -L is needed for a full repair.

Potential data loss

Using xfs_repair -L discards all pending journal entries. This may result in recent file changes being lost. Files that were being written when the system crashed may be incomplete or missing. Always back up data before running xfs_repair -L if possible.

Post-repair verification¶

After running xfs_repair (with or without -L), mount the file system and verify:

mount /dev/mapper/vg00-lv_root /mnt
ls -la /mnt
df -h /mnt
umount /mnt

Check the kernel log for XFS messages after mounting:

dmesg | grep -i xfs

Clean mount messages without errors confirm the repair was successful.

RAID controller recovery considerations¶

When a server with hardware RAID fails to boot, the RAID controller configuration is a critical factor. Incorrect RAID recovery actions can cause permanent data loss.

Physical servers only

This section applies to physical servers with hardware RAID controllers such as Dell PERC, Broadcom MegaRAID, or HPE Smart Array. Virtual machines and cloud instances do not use hardware RAID.

Understanding foreign RAID configurations¶

A "foreign configuration" occurs when a RAID controller detects physical disks that contain RAID metadata from a different controller or a previous configuration. This commonly happens after:

Cable or disk removal and replacement during hardware maintenance
Controller replacement
Moving disks between servers

The RAID controller presents two options:

Import - reads the existing RAID metadata and reconstructs the virtual disk. All data is preserved.
Clear - removes the RAID metadata and treats the disks as new. The virtual disk mapping is destroyed and all data on the array becomes inaccessible.

Import preserves data, clear destroys data¶

Clearing a foreign configuration is irreversible

Clearing a foreign RAID configuration destroys the RAID metadata that maps physical disks to virtual disks. Without this mapping, the controller can no longer present a coherent virtual disk to the operating system. Partition tables, LVM metadata, and file system data remain on the raw disk sectors but become inaccessible through normal means. The data is effectively unrecoverable. Always import the foreign configuration unless you are intentionally rebuilding the array from scratch.

When presented with a foreign configuration:

Always attempt import first. Importing reads the existing metadata and reconstructs the array as it was before the disruption.
Verify the array state after import. Check that virtual disks show as "Online" or "Optimal" in the RAID management interface (iDRAC, BIOS RAID utility, or storcli/perccli).
Only clear if import fails and you accept data loss. Some scenarios where import is not possible include physical disk failure or metadata corruption. In these cases, clearing is the only option, but it requires a full OS install.

Verifying disk state after RAID recovery¶

After recovering from a RAID issue, verify that the disks contain valid data before attempting to boot:

Check for a valid partition table:

fdisk -l /dev/sda

A valid disk shows a Disklabel type line (such as gpt or dos) and one or more partition entries. A disk with no Disklabel and no partitions has been wiped.

Check for LVM physical volumes:

lvm pvscan

This should list physical volumes. If the boot disk is not listed, its LVM metadata has been destroyed.

Check the first sector for data:

dd if=/dev/sda bs=512 count=1 | hexdump -C | head -5

A valid disk contains non-zero data in the first sector (partition table, boot code). A disk that shows all zeros has been wiped.

Prevention¶

To reduce the risk of RAID-related data loss:

Document the RAID configuration (virtual disk layout, RAID level, disk slot assignments) before any hardware maintenance.
Take screenshots of the RAID management interface before and after hardware work.
Ensure hardware maintenance personnel understand the difference between importing and clearing foreign configurations.
Keep current backups of all boot disks.

Quick reference¶

Command summary¶

Task	Command
View `XFS` file system info	`xfs_info /`
List mounted `XFS` file systems	`mount -t xfs`
Monitor slab allocations	`slabtop -s c`
Check slab memory in `/proc/meminfo`	`grep -E 'Slab\\|SReclaimable\\|SUnreclaim' /proc/meminfo`
Search for `XFS` slab objects	`grep -E 'xfs_inode\\|xfs_buf' /proc/slabinfo`
Count `kworker` threads	`ps -eLf \\| grep -c kworker`
Check `THP` status	`cat /sys/kernel/mm/transparent_hugepage/enabled`
Disable `THP` (runtime)	`echo never > /sys/kernel/mm/transparent_hugepage/enabled`
Disable `THP` (persistent)	`grubby --update-kernel=ALL --args="transparent_hugepage=never"`
Activate `LVM` volume groups	`lvm vgchange -ay`
Repair `XFS` (standard)	`xfs_repair /dev/mapper/<vg>-<lv>`
Repair `XFS` (zero log)	`xfs_repair -L /dev/mapper/<vg>-<lv>`
Check kernel version	`uname -r`
Check `XFS` messages in kernel log	`dmesg \\| grep -i xfs`

What to do in case of boot failures¶

System drops to dracut emergency shell or fails to mount /sysroot:
- Access GRUB, add rd.break, boot to initramfs shell
- Run lvm vgchange -ay to activate volumes
- Run xfs_repair /dev/mapper/<vg>-<lv> on the root volume
- If standard repair fails, run xfs_repair -L /dev/mapper/<vg>-<lv>
- Repeat for all XFS logical volumes in the volume group
- Reboot with reboot -f
System shows "No boot device available":
- Check RAID controller for foreign configuration
- If foreign configuration exists, import it (do not clear)
- If no foreign configuration and disks show as empty, data has been lost - reinstall the OS
- Verify disk state with fdisk -l, lvm pvscan, and dd | hexdump
System boots but shows high memory usage with no application cause:
- Check slab memory with slabtop -s c and /proc/meminfo
- Count kworker threads with ps -eLf | grep -c kworker
- If slab objects show 10x+ growth and kworker count exceeds 500, check kernel version
- Disable THP as an immediate workaround
- Update to a kernel version containing the fix

References¶

"XFS Administration" by the Linux Kernel Project https://docs.kernel.org/admin-guide/xfs.html
"xfs_info(8) man page" by the Linux man-pages Project https://man7.org/linux/man-pages/man8/xfs_info.8.html
"Short Users Guide for SLUB" by the Linux Kernel Project https://docs.kernel.org/mm/slab.html
"The /proc File System" by the Linux Kernel Project https://www.kernel.org/doc/html/latest/filesystems/proc.html
"Transparent Huge Page Support" by the Linux Kernel Project https://docs.kernel.org/admin-guide/mm/transhuge.html
"dracut.cmdline(7) man page" by the Linux man-pages Project https://man7.org/linux/man-pages/man7/dracut.cmdline.7.html
"LVM2 Resource Page" by the LVM2 Project https://www.sourceware.org/lvm2/
"xfs_repair(8) man page" by the Linux man-pages Project https://man7.org/linux/man-pages/man8/xfs_repair.8.html
"GNU GRUB Manual" by the GNU Project https://www.gnu.org/software/grub/manual/grub/grub.html

Author: Howard Van Der Wal

XFS recovery

AI usage¶

Introduction¶

Prerequisites¶

Viewing XFS file system information¶

Monitoring XFS slab memory usage¶

Viewing slab allocations with slabtop¶

Checking /proc/slabinfo directly¶

Monitoring memory fields in /proc/meminfo¶