# alinux-monitoring

Part of **ALINUX**

# Alibaba Cloud Linux System Monitoring and Logging Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| Disk space discrepancy between `df` and `du` | `df` shows high usage (e.g., 90%) while `du` shows low usage (e.g., 30%) | High | Identify and release deleted but open files or unmount overlay mounts |
| Deleted files still consuming disk space | Writes fail with `ENOSPC` although `du` suggests there is free space | High | Use `lsof` to find processes holding deleted files, then restart or stop them |
| Subdirectory mounted over existing path | `du` underreports usage in parent directory | Medium | Detect overlay mounts with `mount` and unmount conflicting paths |

## Problem Details

### Problem 1: Disk Space Discrepancy Between `df` and `du`

**Symptoms**
- Error message: `No space left on device` (`ENOSPC`) even though `du -sh /path` reports low usage
- Behavior: `df -h` reports high disk usage (e.g., 95% used), but `du -sh /mount/point` shows significantly less usage
- Context: Typically occurs after large log or temporary files are deleted while still open by a running process, or when a filesystem is mounted over a non-empty directory

**Root Cause**
- **Deleted but open files**: When a file is deleted while a process still holds an open file descriptor, the disk blocks remain allocated until the process closes the file or terminates. `df` counts these blocks because it queries the filesystem, but `du` walks directory entries and cannot see a file that no longer has one.
- **Overlay mounts** (a filesystem mounted over a non-empty directory; unrelated to OverlayFS): If a new filesystem is mounted over a directory that already contains data, `du` only sees the contents of the mounted filesystem at that path, while `df` for the parent filesystem still counts the hidden data.
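
The first cause can be reproduced safely in a scratch directory. The sketch below is illustrative only: it assumes Linux with `/proc` mounted and GNU coreutils, and the function name `demo_deleted_open`, the file name `demo.log`, and descriptor number 9 are arbitrary choices.

```bash
#!/usr/bin/env bash
# Show that a deleted file stays allocated while a descriptor is open.
demo_deleted_open() {
    local workdir demo_file
    workdir=$(mktemp -d) || return 1
    demo_file="$workdir/demo.log"

    # Create a 1 MiB file and open it read-only on descriptor 9.
    head -c 1048576 /dev/zero > "$demo_file"
    exec 9< "$demo_file"

    # Remove the directory entry; the inode survives because fd 9 is open.
    rm "$demo_file"

    # du walks directory entries, so it can no longer find the file.
    echo "entries left in directory: $(ls -A "$workdir" | wc -l)"

    # The kernel marks the descriptor target as deleted ...
    echo "fd 9 -> $(readlink /proc/self/fd/9)"
    # ... but the data is still allocated and reachable through it.
    echo "bytes still held: $(stat -L -c %s /proc/self/fd/9)"

    # Closing the descriptor lets the filesystem free the blocks.
    exec 9<&-
    rmdir "$workdir"
}

demo_deleted_open
```

The directory is empty after `rm`, yet the descriptor still resolves to a file of the original size. That is the space `df` counts and `du` misses.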

**Solution**
1. Check for deleted files still held open:
   ```bash
   sudo lsof | grep deleted
   ```
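2. If the output shows entries like `/var/log/app.log (deleted)`, note the PID. Where possible, restart the owning service so that it closes its file descriptors cleanly.
3. If a restart is not an option, terminate the process. Send the default `SIGTERM` first and fall back to `kill -9` only if the process does not exit:
   ```bash
   sudo kill <PID>
   ```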
4. Check for overlay mounts in the affected directory:
   ```bash
   sudo mount | grep "/your/mount/point"
   ```
5. If a sub-mount (e.g., `/mnt/vdb/tmp`) exists within the directory, confirm that nothing is using it, then unmount it so that the data hidden underneath becomes visible and can be moved or removed:
   ```bash
   sudo umount /mnt/vdb/tmp
   ```
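
A plain `grep` on the `mount` output also matches sibling paths such as `/mnt/vdbx`. The sketch below is one hedged alternative: it reads `/proc/self/mounts` and matches on the path prefix only. The function name `list_submounts` and its optional second argument are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# list_submounts DIR [MOUNTS_FILE]
# Print "MOUNTPOINT SOURCE FSTYPE" for every mount located strictly below DIR.
# MOUNTS_FILE defaults to /proc/self/mounts; the second argument exists only
# so the function can be tried against a saved copy of that file.
# Note: mount points containing spaces appear escaped (\040) in that file.
list_submounts() {
    local dir=${1%/} mounts=${2:-/proc/self/mounts}
    awk -v prefix="$dir/" '
        # Field 2 is the mount point; keep it when it starts with "DIR/".
        index($2, prefix) == 1 && $2 != prefix { print $2, $1, $3 }
    ' "$mounts"
}
```

For example, `list_submounts /mnt/vdb` prints one line per filesystem mounted below `/mnt/vdb` and nothing when there are none.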

**Verification**
- Re-run both commands to confirm alignment:
  ```bash
  df -h /mount/point
  du -sh /mount/point
  ```
- Expected result: Values should now be consistent (within expected metadata overhead).
- Confirm `ENOSPC` errors no longer occur during file operations.
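
To compare the two figures in bytes rather than by eye, something like the following can be used. It is a sketch that assumes GNU coreutils (`df --output` and `du -B1`); `usage_gap` is a hypothetical helper name. The result is only meaningful when the argument is the mount point of the filesystem being checked.

```bash
#!/usr/bin/env bash
# usage_gap PATH
# Print used bytes according to df and du for PATH, plus the difference.
usage_gap() {
    local path=$1 df_used du_used
    # df: bytes the filesystem containing PATH reports as used.
    df_used=$(df -B1 --output=used "$path" | tail -n 1 | tr -d ' ')
    # du: bytes reachable by walking PATH without crossing into other
    # filesystems (-x); unreadable entries are skipped silently.
    du_used=$(du -sx -B1 "$path" 2>/dev/null | cut -f1)
    [ -n "$df_used" ] && [ -n "$du_used" ] || return 1
    printf 'df_used=%s du_used=%s gap=%s\n' \
        "$df_used" "$du_used" "$((df_used - du_used))"
}
```

A small positive gap is normal because of filesystem metadata. A gap of several gigabytes points back to one of the two root causes above.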

### Problem 2: Deleted Files Still Consuming Disk Space

**Symptoms**
- Error message: `ENOSPC` (No space left on device)
- Behavior: Applications fail to write files even though `du` suggests sufficient space
- Context: Occurs after log rotation or manual deletion of large files without restarting the owning process

**Root Cause**
- The Linux kernel retains the disk blocks of a file for as long as at least one process holds an open file descriptor to it, even after the last directory entry has been removed. This space is invisible to `du` but counted by `df`.

**Solution**
1. List all open deleted files:
   ```bash
   sudo lsof +L1
   ```
   (This shows files with link count < 1, i.e., deleted)
2. Identify the process and file:
   ```text
   COMMAND   PID USER   FD   TYPE DEVICE  SIZE/OFF NLINK    NODE NAME
   nginx    1234 root    6w   REG  253,0 500000000     0 1234567 /var/log/nginx/access.log (deleted)
   ```
3. Either restart the service gracefully:
   ```bash
   sudo systemctl restart nginx
   ```
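   or, if a restart is not possible, terminate the process (try the default `SIGTERM` first and use `kill -9` only if the process does not exit):
   ```bash
   sudo kill 1234
   ```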
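
On hosts where `lsof` is not installed, the same information can be read from `/proc`. The sketch below is a hedged approximation, not a replacement for `lsof`: `deleted_open_files` is a hypothetical helper, it reports only processes the caller is allowed to inspect, and a file opened through several descriptors is listed once per descriptor.

```bash
#!/usr/bin/env bash
# deleted_open_files [PID...]
# Print "PID FD BYTES PATH" for each regular file that is unlinked but still
# open. With no arguments, every process visible in /proc is scanned.
deleted_open_files() {
    local pids=("$@") entry pid fd target size
    if [ "${#pids[@]}" -eq 0 ]; then
        for entry in /proc/[0-9]*; do pids+=("${entry#/proc/}"); done
    fi
    for pid in "${pids[@]}"; do
        for fd in "/proc/$pid/fd"/*; do
            # Unreadable or vanished descriptors are skipped silently.
            target=$(readlink "$fd" 2>/dev/null) || continue
            case $target in
                /memfd:*) continue ;;      # anonymous memory, not disk
                /*' (deleted)') ;;         # unlinked file with a real path
                *) continue ;;
            esac
            # Only regular files hold data blocks.
            [ -f "$fd" ] || continue
            size=$(stat -L -c %s "$fd" 2>/dev/null) || continue
            printf '%s %s %s %s\n' "$pid" "${fd##*/}" "$size" "$target"
        done
    done
}
```

Run as root, `deleted_open_files | sort -k3,3nr | head` lists the largest holders first. Entries under memory-backed paths such as `/dev/shm` consume RAM rather than disk.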

**Verification**
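- Run:
  ```bash
  sudo lsof +L1
  ```
- Expected output: the file identified earlier no longer appears. Small unlinked temporary files held by other processes may still be listed and are normally harmless.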
- Confirm disk space is freed via `df -h`.

### Problem 3: Subdirectory Mounted Over Existing Path

**Symptoms**
- `du -sh /data` reports low usage (e.g., 5 GB)
- `df -h /data` shows high usage (e.g., 80 GB used on 100 GB device)
- Behavior: Files written to `/data/subdir` before the mount appear to vanish or are inaccessible from the parent view

**Root Cause**
- A filesystem was mounted directly onto a subdirectory (e.g., `/data/tmp`) that previously contained data. The original contents are hidden but still consume space on the underlying filesystem. At that path `du` descends into the mounted filesystem and never reaches the obscured data.

**Solution**
1. List all mounts under the parent directory:
   ```bash
   sudo mount | grep "^/.* on /data/"
   ```
2. Identify unexpected sub-mounts (e.g., `/dev/vdb1 on /data/tmp type ext4`).
3. Back up any data on the mounted filesystem if needed.
4. Unmount the overlay:
   ```bash
   sudo umount /data/tmp
   ```
5. Optionally, move the original hidden data elsewhere or delete it if obsolete.
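
The hidden data can also be measured without unmounting the sub-mount, by bind-mounting the parent directory to a scratch location. A plain `mount --bind` does not carry sub-mounts along. The following is a minimal sketch that needs root to run for real: `inspect_hidden`, the `DRY_RUN` switch, and the default view directory `/mnt/underlying_view` are hypothetical names chosen for this example.

```bash
#!/usr/bin/env bash
# inspect_hidden PARENT SUBMOUNT [VIEW_DIR]
# Measure data hidden underneath SUBMOUNT by bind-mounting PARENT elsewhere.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
inspect_hidden() {
    local parent=${1%/} submount=$2 view=${3:-/mnt/underlying_view}
    local rel=${submount#"$parent"/}   # e.g. /data/tmp -> tmp
    local run=()
    [ "${DRY_RUN:-1}" = 1 ] && run=(echo)

    "${run[@]}" mkdir -p "$view"
    # --bind (unlike --rbind) leaves sub-mounts behind, so the directory
    # underneath SUBMOUNT becomes visible inside VIEW_DIR.
    "${run[@]}" mount --bind "$parent" "$view"
    "${run[@]}" du -sh "$view/$rel"
    "${run[@]}" umount "$view"
}
```

Calling `inspect_hidden /data /data/tmp` prints the four commands for review. `sudo` with `DRY_RUN=0` executes them, provided the function is defined in the root shell.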

**Verification**
- After unmounting, run:
  ```bash
  du -sh /data
  df -h /data
  ```
- The `du` output should now reflect total usage closer to `df`.
- Hidden files previously masked by the mount will reappear.

## FAQ

**Q: How do I check if deleted files are still using disk space?**  
A: Run `sudo lsof +L1` or `sudo lsof | grep deleted`. Any output indicates files that have been unlinked but are still open and consuming space.

**Q: Why does `df` show more used space than `du`?**  
A: This usually happens due to (1) deleted files still held open by processes, or (2) a filesystem mounted over a directory that already contained data. Both cause `df` (which queries the filesystem) and `du` (which walks the directory tree) to report different values.

**Q: How can I safely release space from deleted open files without killing processes?**  
A: The least disruptive fix in this guide is a graceful service restart (e.g., `systemctl restart servicename`), which closes file descriptors cleanly but does briefly stop the process. Avoid `kill -9` unless absolutely necessary, as it may cause data loss or instability.
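
When the process cannot be stopped at all, a commonly used alternative is to truncate the unlinked file through its `/proc/<PID>/fd/<FD>` entry, which frees the blocks while the descriptor stays open. The demonstration below is a sketch under stated assumptions: it runs entirely inside the current shell, and `demo_truncate_deleted`, `app.log`, and descriptor 8 are arbitrary. On a real system the target would be the PID and FD reported by `lsof`, and the file contents are lost.

```bash
#!/usr/bin/env bash
# Show that truncating through /proc releases the data of a deleted file
# while the descriptor stays open.
demo_truncate_deleted() {
    local workdir
    workdir=$(mktemp -d) || return 1

    # Open a log on descriptor 8 in append mode, write 64 KiB, then delete it.
    exec 8>> "$workdir/app.log"
    head -c 65536 /dev/zero >&8
    rm "$workdir/app.log"
    echo "before: $(stat -L -c %s /proc/self/fd/8) bytes"

    # Truncate the unlinked file in place. For another process this would be
    # ": > /proc/<PID>/fd/<FD>", run as root.
    : > /proc/self/fd/8
    echo "after: $(stat -L -c %s /proc/self/fd/8) bytes"

    exec 8>&-
    rmdir "$workdir"
}

demo_truncate_deleted
```

This works best for files opened in append mode, as most log files are. A process that writes at a fixed offset keeps writing at its old position afterwards, which leaves a sparse file with a large apparent size.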

**Q: What permissions are required to run these diagnostic commands?**  
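A: `lsof` and `mount` can be run by any user, but without `root` privileges or `sudo` access `lsof` typically reports only the caller's own processes, so results may be incomplete. Unmounting filesystems and signalling processes owned by other users require `root`.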

**Q: Does this issue occur on all filesystem types (ext4, XFS, etc.)?**  
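A: Yes. Keeping an unlinked file alive while it is open is handled by the Linux VFS layer and applies to all local filesystems (ext4, XFS, Btrfs, etc.). NFS behaves differently: when a file is deleted on a client while still open there, the client renames it to a hidden `.nfsXXXX` file until it is closed, so the space stays in use but remains visible to `du`.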