# alinux-system_diagnosis_and_troubleshooting

Part of **ALINUX**

# Alibaba Cloud Linux System Diagnosis and Troubleshooting Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| OverlayFS dentry leak causing system crash | Kernel Panic due to dentry reference count leak in OverlayFS | High | Upgrade kernel or install kernel-hotfix-4375449 |
| Ftrace filter enabling causes buffer overflow crash | `general protection fault: 0000 [#1] SMP NOPTI` after enabling Ftrace filter | High | Update kernel or install kernel-hotfix-5692820 |
| NULL pointer dereference during memory compaction | `general protection fault` in `free_one_page`/`compact_zone_order` during interrupt handling | High | Upgrade kernel or install kernel-hotfix-5000697 |
| EROFS mount triggers kernel NULL pointer dereference | System crash when mounting EROFS loop device | High | Install kernel-hotfix-18359162 or upgrade kernel |
| printk deadlock causing system hang | Call trace showing `console_unlock` and `printk` under spinlock | High | Set `kernel.printk="4 4 1 7"` via sysctl |
| vmcore file not generated after kernel panic | No crash dump in `/var/crash/` despite kdump being configured | Medium | Disable `crash_kexec_post_notifiers` via grubby or sysfs |

## Problem Details

### Problem 1: OverlayFS dentry leak causing system crash

**Symptoms**
- Error message: `Kernel Panic`
- Behavior: System becomes unresponsive and reboots unexpectedly
- Context: Occurs on Alibaba Cloud Linux 2 instances using OverlayFS under concurrent workloads

**Root Cause**
A race condition in the OverlayFS implementation causes dentry reference counts to be incremented without proper locking, leading to reference count leakage. When the reference count wraps around or becomes inconsistent, the kernel triggers a panic.

**Solution**
1. Upgrade the kernel to a fixed version:
   ```bash
   sudo yum update kernel
   sudo reboot
   ```
2. Alternatively, install the specific hotfix for your current kernel:
   ```bash
   sudo yum install -y kernel-hotfix-4375449-`uname -r | awk -F"-" '{print $NF}'`
   ```

**Verification**
After reboot, monitor system stability under load. Confirm the kernel version includes the fix by checking release notes or absence of recurrence.

### Problem 2: Ftrace filter enabling causes buffer overflow crash

**Symptoms**
- Error message: `general protection fault: 0000 [#1] SMP NOPTI`
- Behavior: Immediate system crash after enabling Ftrace filtering
- Context: Occurs on Alibaba Cloud Linux 2 with kernel versions prior to `4.19.91-23.al7`

**Root Cause**
Enabling Ftrace filters triggers a miscalculation in buffer size allocation within the kernel, resulting in a buffer overflow and subsequent memory corruption.

**Solution**
1. Update the kernel to a patched version:
   ```bash
   sudo yum update kernel
   sudo reboot
   ```
2. Or apply the targeted hotfix:
   ```bash
   sudo yum install -y kernel-hotfix-5692820-`uname -r | awk -F"-" '{print $NF}'`
   ```

**Verification**
After applying the fix, enable Ftrace filters again and confirm the system remains stable. Check `dmesg` for any related errors.

### Problem 3: NULL pointer dereference during memory compaction

**Symptoms**
- Error message: `general protection fault`
- Behavior: System crash during memory-intensive operations
- Context: Affects Alibaba Cloud Linux 2.1903 LTS with kernel `4.19.91-21.al7.x86_64` or earlier

**Root Cause**
During memory compaction, an interrupt occurs before the Capture Control structure is fully initialized. The interrupt handler attempts to access this uninitialized (NULL) pointer when releasing memory pages, causing a crash.

**Solution**
1. Upgrade the kernel:
   ```bash
   sudo yum update kernel
   sudo reboot
   ```
2. For systems that cannot immediately upgrade, install the hotfix:
   ```bash
   sudo yum install -y kernel-hotfix-5000697-`uname -r | awk -F"-" '{print $NF}'`
   ```

**Verification**
After reboot, run memory stress tests (e.g., using `stress-ng`) and monitor for crashes. Check `dmesg` output for absence of `general protection fault`.

### Problem 4: EROFS mount triggers kernel NULL pointer dereference

**Symptoms**
- Error message: `NULL pointer dereference`
- Behavior: System crash when mounting an EROFS filesystem (e.g., via loop device)
- Context: Occurs on Alibaba Cloud Linux 3 with kernel `5.10.134-15.al8`

**Root Cause**
The `__erofs_bread()` function fails to handle generic block device mounts correctly, leading to a NULL pointer dereference when accessing device-specific structures.

**Solution**
Install the provided hotfix:
```bash
sudo yum install -y kernel-hotfix-18359162-5.10.134-15
```
Alternatively, upgrade to a newer kernel version that includes the fix.

**Verification**
Test EROFS mounting:
```bash
sudo yum install -y erofs-utils
mkdir -p test mnt
mkfs.erofs foo.erofs test
sudo mount -t erofs -o loop foo.erofs mnt
```
Confirm successful mount and no kernel oops in `dmesg`.

### Problem 5: printk deadlock causing system hang

**Symptoms**
- Call trace includes: `native_queued_spin_lock_slowpath`, `console_unlock`, `printk`
- Behavior: System becomes completely unresponsive during kernel warning messages
- Context: More frequent in Alibaba Cloud Linux 3 with kernel `5.10.134-16.3`

**Root Cause**
When the kernel holds a workqueue or runqueue spinlock and calls `printk`, the console subsystem invokes DRM drivers that attempt to acquire the same lock, creating a deadlock.

**Solution**
Reduce console log level to prevent warning messages from triggering the deadlock:
```bash
sysctl -w kernel.printk="4 4 1 7"
echo 'kernel.printk = 4 4 1 7' >> /etc/sysctl.conf
```

**Verification**
After applying the setting, induce a non-critical kernel warning (if possible in test environment) and confirm the system does not hang. Note that warnings will still appear in `dmesg` but not on the console.

### Problem 6: vmcore file not generated after kernel panic

**Symptoms**
- No crash dump files in `/var/crash/<timestamp>/`
- `sudo kdumpctl status` shows `kdump: Kdump is operational`
- Context: Alibaba Cloud Linux 3 or Anolis OS 8.x with `crash_kexec_post_notifiers=Y`

**Root Cause**
When `crash_kexec_post_notifiers` is enabled (`Y`), panic notifiers may interfere with the kdump crash kernel jump, preventing vmcore generation.

**Solution**
1. Temporarily disable the parameter:
   ```bash
   sudo sh -c 'echo N > /sys/module/kernel/parameters/crash_kexec_post_notifiers'
   ```
2. Make the change persistent across reboots:
   ```bash
   sudo grubby --update-kernel="/boot/vmlinuz-$(uname -r)" --args="crash_kexec_post_notifiers=N"
   sudo reboot
   ```

**Verification**
After reboot, verify the parameter is set:
```bash
cat /sys/module/kernel/parameters/crash_kexec_post_notifiers
```
Expected output: `N`. Then, if a crash occurs, check `/var/crash/` for vmcore files.

## FAQ

**Q: How do I check if kdump is properly configured and running?**  
A: Run `sudo kdumpctl status`. If it returns "kdump: Kdump is operational", the service is ready to capture vmcore on crash. If not, start it with `sudo kdumpctl start`.

**Q: What permissions are required to apply kernel hotfixes or update the kernel?**  
A: Root or sudo privileges are required to install packages via `yum` and to reboot the system. Ensure you have administrative access to the ECS instance.

**Q: How can I enable debug logging for kernel issues?**  
A: Increase kernel log verbosity by setting `kernel.printk` to higher values (e.g., `"7 4 1 7"`), but be cautious as high verbosity may trigger issues like printk deadlocks. Use `dmesg -T` to view timestamped logs.

**Q: What are common causes of kernel panics in Alibaba Cloud Linux?**  
A: Common causes include known kernel bugs (e.g., OverlayFS dentry leaks, Ftrace buffer overflows), hardware issues (rare in virtualized environments), NULL pointer dereferences, and deadlocks in kernel subsystems like printk or memory management.

**Q: How do I roll back a failed kernel update?**  
A: Alibaba Cloud Linux retains previous kernels in GRUB. Reboot the instance and select the previous kernel version from the GRUB menu. Then, remove the problematic kernel package using `sudo yum remove kernel-<version>`.