# alinux-instance

# Alibaba Cloud Linux Instance Management Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|------|------|---------|------------|
| Kernel crash due to OverlayFS dentry leak | `Kernel panic - not syncing` or system hang | High | Upgrade kernel or install hotfix |
| YUM repository connection failure in classic network | `repo_invalid` or 404 errors on `yum update` | Medium | Update `/etc/yum.repos.d/aliyun-base.repo` with correct mirrors |
| System time desynchronization after reboot | Time differs by 8 hours from NTP server | Medium | Enable `CONFIG_RTC_HCTOSYS=y` or add `hwclock --hctosys` to `rc.local` |
| Pod deletion failure in containers | Processes stuck in D state, Pod won't terminate | High | Upgrade kernel or install specific hotfix |
| ext4 resize fails with "Device or resource busy" | `resize2fs` returns error during expansion | Medium | Ensure filesystem is unmounted or use `growpart` first |
| DNF segmentation fault | `Segmentation fault` when running `dnf` commands | Medium | Upgrade SysAK package to latest version |
| unbound service timeout on startup | `Job for unbound.service failed because a timeout was exceeded` | Low | Set `DISABLE_UNBOUND_ANCHOR=yes` in `/etc/sysconfig/unbound` |

## Problem Details

### Problem 1: Kernel Crash Due to OverlayFS dentry Leak

**Symptoms**
- Error message: `Kernel panic - not syncing: softlockup: hung tasks`
- Behavior: System becomes unresponsive or reboots unexpectedly
- Context: Occurs under high I/O load with OverlayFS in Alibaba Cloud Linux 2

**Root Cause**
A race condition in OverlayFS causes dentry reference counts to leak, leading to memory exhaustion and eventual kernel panic. This affects kernel versions up to `4.19.91-22.2.al7`.

**Solution**
1. Check current kernel version:
   ```bash
   uname -r
   ```
2. If using an affected version, upgrade the kernel:
   ```bash
   sudo yum update kernel
   sudo reboot
   ```
3. Alternatively, install a targeted hotfix without reboot (if available for your kernel):
   ```bash
   sudo yum install -y "kernel-hotfix-4375449-$(uname -r | awk -F- '{print $NF}')"
   ```

**Verification**
- After reboot, confirm new kernel is active with `uname -r`
- Monitor system stability under I/O load
- Check `dmesg` for absence of overlay-related errors
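
The version check in step 1 can be scripted. The following is a minimal sketch, assuming GNU `sort -V` is available; the function names are hypothetical, and the threshold is the affected version quoted in the root cause above.

```bash
# Succeeds when version $1 sorts at or below version $2 (GNU sort -V ordering).
kernel_at_or_below() {
    local current="$1" threshold="$2"
    [ "$(printf '%s\n%s\n' "$current" "$threshold" | sort -V | head -n 1)" = "$current" ]
}

# Prints "affected" or "not affected" for a kernel release string.
# Defaults to the running kernel when no argument is given.
check_overlayfs_kernel() {
    local current="${1:-$(uname -r)}"
    # Strip a trailing architecture suffix before comparing.
    current="${current%.x86_64}"
    current="${current%.aarch64}"
    if kernel_at_or_below "$current" "4.19.91-22.2.al7"; then
        echo "affected"
    else
        echo "not affected"
    fi
}
```

Calling `check_overlayfs_kernel` with no argument evaluates the running kernel; passing a release string evaluates that string instead.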

### Problem 2: YUM Repository Connection Failure in Classic Network

**Symptoms**
- Error message: `repo_invalid` or HTTP 404 when running `yum update`
- Behavior: Package installation or updates fail consistently
- Context: Affects Alibaba Cloud Linux 2 instances in classic network (non-VPC)

**Root Cause**
Default YUM repository URLs in `/etc/yum.repos.d/aliyun-base.repo` point to endpoints unreachable from classic network due to network architecture restrictions.

**Solution**
1. Back up the existing repo file:
   ```bash
   sudo cp /etc/yum.repos.d/aliyun-base.repo /etc/yum.repos.d/aliyun-base.repo.bak
   ```
2. Replace content with updated configuration:
   ```bash
   sudo tee /etc/yum.repos.d/aliyun-base.repo << 'EOF'
[base]
name=AliYun-$releasever - Base - mirrors.aliyun.com
baseurl=http://mirrors.aliyun.com/alinux/$releasever/os/$basearch/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/alinux/RPM-GPG-KEY-ALIYUN

#released updates
[updates]
name=AliYun-$releasever - Updates - mirrors.aliyun.com
baseurl=http://mirrors.aliyun.com/alinux/$releasever/updates/$basearch/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/alinux/RPM-GPG-KEY-ALIYUN

#additional packages that may be useful
[extras]
name=AliYun-$releasever - Extras - mirrors.aliyun.com
baseurl=http://mirrors.aliyun.com/alinux/$releasever/extras/$basearch/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/alinux/RPM-GPG-KEY-ALIYUN

# plus packages provided by Aliyun Linux dev team
[plus]
name=AliYun-$releasever - Plus - mirrors.aliyun.com
baseurl=http://mirrors.aliyun.com/alinux/$releasever/plus/$basearch/
gpgcheck=1
gpgkey=http://mirrors.aliyun.com/alinux/RPM-GPG-KEY-ALIYUN
EOF
   ```
3. Clean and refresh YUM cache:
   ```bash
   sudo yum clean all
   sudo yum makecache
   ```

**Verification**
- Run `sudo yum update` — should complete without 404 errors
- Confirm packages can be installed successfully
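
To see which URLs yum will actually request, the `baseurl` entries can be expanded by hand. This is a rough sketch: the function name is hypothetical, and the `releasever` and `basearch` values in the usage comment are illustrative and should be replaced with the values for your instance.

```bash
# Prints each baseurl from a .repo file with $releasever and $basearch expanded.
expand_baseurls() {
    local repo_file="$1" releasever="$2" basearch="$3"
    grep -E '^baseurl=' "$repo_file" \
        | sed -e 's/^baseurl=//' \
              -e "s|[\$]releasever|${releasever}|g" \
              -e "s|[\$]basearch|${basearch}|g"
}

# Example usage (needs network access; 2.1903 and x86_64 are assumed values):
#   expand_baseurls /etc/yum.repos.d/aliyun-base.repo 2.1903 x86_64 |
#   while read -r url; do
#       curl -fsS -o /dev/null --max-time 10 "${url}repodata/repomd.xml" \
#           && echo "OK   $url" || echo "FAIL $url"
#   done
```

Probing `repodata/repomd.xml` under each URL shows which repository is unreachable without running a full `yum update`.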

### Problem 3: System Time Desynchronization After Reboot

**Symptoms**
- Error message: `TimeSyncError` (8-hour difference from NTP server)
- Behavior: System clock resets to incorrect time after restart
- Context: Affects Alibaba Cloud Linux 2 with kernel ≤ `4.19.24-10.al7.x86_64`

**Root Cause**
Early kernel versions are built without RTC-to-system time synchronization at boot (`CONFIG_RTC_HCTOSYS` not enabled), so the system clock is not set from the hardware clock on startup and comes up with the wrong time.

**Solution**
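**Option A: Kernel configuration (permanent fix)**
1. Inspect the build configuration of the running kernel:
   ```bash
   grep RTC_HCTOSYS "/boot/config-$(uname -r)"
   ```
2. A kernel that sets the system clock from the RTC at boot reports these lines:
   ```text
   CONFIG_RTC_HCTOSYS=y
   CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
   ```
3. If they are missing, note that `/boot/config-*` only records how the kernel was built, so editing the file does not change kernel behavior. Upgrade to a kernel newer than `4.19.24-10.al7.x86_64` with `sudo yum update kernel`, then reboot to apply.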

**Option B: Startup script workaround**
1. Add time sync command to rc.local:
   ```bash
   echo "hwclock --hctosys" | sudo tee -a /etc/rc.d/rc.local > /dev/null
   sudo chmod +x /etc/rc.d/rc.local
   ```
2. Reboot:
   ```bash
   sudo reboot
   ```

**Verification**
- After reboot, run `timedatectl status` — system time should match hardware clock
- Compare with NTP server using `ntpdate -q pool.ntp.org`
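
When comparing clocks, it can help to express the difference in whole hours so that the 8-hour symptom is easy to spot. A minimal sketch follows; the function name is hypothetical, and both arguments are assumed to be Unix epoch seconds. Where the reference value comes from (a trusted host or an NTP query) is left open.

```bash
# Prints the offset of the system clock from a reference time, rounded to whole hours.
# $1 = system time in epoch seconds, $2 = reference time in epoch seconds.
clock_offset_hours() {
    local system_epoch="$1" reference_epoch="$2"
    local diff=$(( system_epoch - reference_epoch ))
    # Round to the nearest hour, treating negative offsets symmetrically.
    if [ "$diff" -ge 0 ]; then
        echo $(( (diff + 1800) / 3600 ))
    else
        echo $(( -((-diff + 1800) / 3600) ))
    fi
}

# Example usage (reference_epoch is a value you obtained elsewhere):
#   clock_offset_hours "$(date +%s)" "$reference_epoch"
```

A result of `8` or `-8` matches the symptom described above; `0` suggests the clock is within half an hour of the reference.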

### Problem 4: Pod Deletion Failure in Containers

**Symptoms**
- Error message: none specific; affected processes appear in state `D` (uninterruptible sleep) in the process list
- Behavior: Kubernetes Pods remain in "Terminating" state indefinitely
- Context: Occurs on Alibaba Cloud Linux 2 with kernel ≤ `4.19.91-24.1.al7.x86_64`

**Root Cause**
A kernel bug causes processes to enter uninterruptible sleep (D state) during cgroup cleanup, preventing Pod termination.

**Solution**
1. Check kernel version:
   ```bash
   uname -r
   ```
2. Upgrade kernel if outdated:
   ```bash
   sudo yum update kernel
   sudo reboot
   ```
3. Or install hotfix for immediate resolution (no reboot needed for some versions):
   ```bash
   sudo yum install -y "kernel-hotfix-3915544-$(uname -r | awk -F- '{print $NF}')"
   ```

**Verification**
- Attempt to delete problematic Pod — should complete within seconds
- Confirm no processes are in D state: `ps -eo pid,stat,comm | awk '$2 ~ /^D/'` should print nothing
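
The D-state check can be wrapped in a small filter for repeated use. This is a sketch that assumes input in the column order produced by `ps -eo pid,stat,comm`; the function name is hypothetical.

```bash
# Reads `ps -eo pid,stat,comm` style output on stdin and prints "PID COMMAND"
# for every process whose STAT column starts with D (uninterruptible sleep).
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Example usage on a live system:
#   ps -eo pid,stat,comm | filter_d_state
```

Matching on the start of the STAT column also catches variants such as `Ds` or `D+`, which a search for a bare `D` surrounded by spaces would miss.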

### Problem 5: ext4 Filesystem Resize Fails

**Symptoms**
- Error message: `Device or resource busy` or `The filesystem is already ... Nothing to do!`
- Behavior: `resize2fs` fails to expand filesystem after disk resize
- Context: Common after increasing cloud disk size

**Root Cause**
Filesystem is mounted during resize attempt, or partition wasn't expanded before filesystem resize.

**Solution**
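1. Unmount the filesystem if possible (ext4 can also be grown while mounted, so skip this step for a filesystem that cannot be unmounted, such as `/`):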
   ```bash
   sudo umount /mount/point
   ```
2. Expand partition first using `growpart`:
   ```bash
   sudo growpart /dev/vda 1  # Adjust device and partition number
   ```
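3. Then resize the filesystem (when it is unmounted, `resize2fs` first requires a forced check with `e2fsck -f`):
   ```bash
   sudo e2fsck -f /dev/vda1   # only for an unmounted filesystem
   sudo resize2fs /dev/vda1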
   ```
4. Remount:
   ```bash
   sudo mount /dev/vda1 /mount/point
   ```

**Verification**
- Run `df -h` — filesystem size should reflect new disk capacity
- Check for errors in `dmesg | tail`
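
When it is unclear which step was skipped, comparing the disk, partition, and filesystem sizes usually settles it. The sketch below assumes all three sizes are given in bytes; the function name and the 64 MiB slack allowed for partition-table and alignment overhead are assumptions, not values from this guide.

```bash
# Prints the next step needed, given disk, partition, and filesystem sizes in bytes.
diagnose_resize() {
    local disk_bytes="$1" part_bytes="$2" fs_bytes="$3"
    local slack=$(( 64 * 1024 * 1024 ))   # assumed tolerance for overhead
    if [ $(( disk_bytes - part_bytes )) -gt "$slack" ]; then
        echo "run growpart"
    elif [ $(( part_bytes - fs_bytes )) -gt "$slack" ]; then
        echo "run resize2fs"
    else
        echo "nothing to do"
    fi
}

# Example usage (device names are placeholders):
#   disk=$(lsblk -bdno SIZE /dev/vda)
#   part=$(lsblk -bdno SIZE /dev/vda1)
#   fs=...   # block count times block size, as reported by `dumpe2fs -h /dev/vda1`
#   diagnose_resize "$disk" "$part" "$fs"
```

If the output is `run growpart`, the "Nothing to do!" message from `resize2fs` is expected, because the partition has not yet been extended to cover the new disk space.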

### Problem 6: DNF Segmentation Fault

**Symptoms**
- Error message: `Segmentation fault` when running any `dnf` command
- Behavior: Package management completely broken
- Context: Caused by SysAK 2.2.0 conflict with libyaml

**Root Cause**
SysAK 2.2.0 includes a conflicting libyaml library that interferes with DNF's memory management.

**Solution**
1. Update SysAK to patched version:
   ```bash
   sudo yum update -y sysak
   ```
2. Verify version:
   ```bash
   rpm -q sysak
   ```

**Verification**
- Run `sudo dnf list installed` — should return package list without crashing
- Confirm SysAK version ≥ 2.2.1
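
The version comparison can be automated rather than read by eye. A minimal sketch, assuming GNU `sort -V`; the function name is hypothetical.

```bash
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example usage on a live system:
#   installed=$(rpm -q --qf '%{VERSION}' sysak)
#   version_ge "$installed" 2.2.1 && echo "patched" || echo "still affected"
```

Version-aware sorting matters here because a plain string comparison would rank `2.10.0` below `2.2.1`.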

### Problem 7: unbound Service Timeout on Startup

**Symptoms**
- Error message: `Job for unbound.service failed because a timeout was exceeded`
- Behavior: DNS resolver fails to start in isolated environments
- Context: Occurs in VPCs without public internet access

**Root Cause**
unbound tries to fetch DNSSEC root trust anchors from public servers during startup, which fails in private networks.

**Solution**
1. Disable anchor update in configuration:
   ```bash
   sudo tee -a /etc/sysconfig/unbound << 'EOF'
DISABLE_UNBOUND_ANCHOR=yes
EOF
   ```
2. Start service:
   ```bash
   sudo systemctl start unbound
   ```

**Verification**
- Check service status: `sudo systemctl status unbound` — should show "active (running)"
- Test DNS resolution: `dig @127.0.0.1 example.com`
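
Because `tee -a` appends a new line on every run, repeating step 1 leaves duplicate entries in the file. The following sketch shows one idempotent alternative; the function name is hypothetical, it relies on GNU `sed -i`, and it must run with write access to the target file (for example, from a root shell).

```bash
# Sets KEY=VALUE in a sysconfig-style file without duplicating the line.
set_sysconfig_key() {
    local file="$1" key="$2" value="$3"
    if grep -qE "^${key}=" "$file" 2>/dev/null; then
        # Replace the existing assignment in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Append when the key is not present yet (creates the file if missing).
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage as root:
#   set_sysconfig_key /etc/sysconfig/unbound DISABLE_UNBOUND_ANCHOR yes
```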

## FAQ

**Q: How do I check if my kernel has known stability issues?**
A: Compare your kernel version (`uname -r`) against documented affected versions in Alibaba Cloud Linux release notes. Use `sudo yum check-update kernel` to see available updates. For critical systems, consider enabling Kernel Live Patching to apply fixes without reboot.

**Q: What permissions are required to modify GRUB parameters in Alibaba Cloud Linux 3?**
A: Root privileges are required. Additionally, since Alibaba Cloud Linux 3 uses BLS (Boot Loader Specification) by default, you must either use `grubby` to modify kernel arguments, regenerate BLS entries with `kernel-install`, or disable BLS in `/etc/default/grub` before running `grub2-mkconfig`.

**Q: How can I verify if Transparent Huge Pages (THP) is causing performance issues?**
A: Check THP status with `cat /sys/kernel/mm/transparent_hugepage/enabled`. If set to `[always]`, it may cause latency spikes. Monitor for `soft lockup` messages in `dmesg`. To mitigate, set to `madvise` mode: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`.
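
The sysfs file lists every mode and marks the active one with brackets, so scripts need to extract the bracketed word. A small sketch, with a hypothetical function name:

```bash
# Reads a THP status line such as "always [madvise] never" on stdin
# and prints the active (bracketed) mode.
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Example usage on a live system:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
```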

**Q: Why does my NFS performance degrade after upgrading to Alibaba Cloud Linux 3?**
A: Kernel 5.4+ changed the default `read_ahead_kb` value from 15,360 KB to 128 KB. Restore performance by setting it back via udev rules (`/etc/udev/rules.d/99-nfs.rules`) or in `/etc/nfs.conf` under `[nfsrahead]` section (requires nfs-utils ≥ 2.3.3-57.0.1.al8.1).
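
To inspect or temporarily change the value for a single mount before making it persistent, the read-ahead setting can be reached through the mount's backing device info (BDI) entry in sysfs. This is a sketch under assumptions: the function name is hypothetical, `/mnt/nfs` is a placeholder mount point, and a value written this way does not survive a remount or reboot.

```bash
# Maps a "major:minor" device ID to its BDI read-ahead file in sysfs.
bdi_readahead_path() {
    printf '/sys/class/bdi/%s/read_ahead_kb\n' "$1"
}

# Example usage (non-persistent; /mnt/nfs is a placeholder):
#   dev_id=$(mountpoint -d /mnt/nfs)
#   cat "$(bdi_readahead_path "$dev_id")"
#   echo 15360 | sudo tee "$(bdi_readahead_path "$dev_id")"
```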

**Q: How do I recover from a "Structure needs cleaning" ext4 error?**
A: First unmount the filesystem (`sudo umount /dev/xxx`), then run a filesystem check: `sudo fsck.ext4 -y /dev/xxx`. Back up data before attempting repairs, because this error indicates potential filesystem corruption.
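
Running `fsck` on a mounted filesystem can make corruption worse, so a guard before the repair step is worthwhile. A minimal sketch follows; the function name and the optional second argument are hypothetical. It only matches the device name exactly as it appears in the mounts table, so a filesystem mounted under a different name for the same device would not be detected.

```bash
# Succeeds when the device appears in the first column of a mounts table.
# $2 defaults to /proc/mounts and can be overridden for testing.
device_is_mounted() {
    local device="$1" mounts_file="${2:-/proc/mounts}"
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$mounts_file"
}

# Example usage (/dev/xxx is the placeholder used above):
#   if device_is_mounted /dev/xxx; then
#       echo "still mounted; unmount before running fsck" >&2
#   else
#       sudo fsck.ext4 -y /dev/xxx
#   fi
```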