# alinux-gpu

Part of **ALINUX**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Deploy AI models for inference or training](../../intent/alinux-deploy-model/SKILL.md). If you're unsure which path to take, check the routing skill first.

# Alibaba Cloud Linux AI and GPU Workloads Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|----------|------------------|
| GPU Driver Fails to Load After Kernel Update | `modprobe: FATAL: Module nvidia not found` or GPU not detected after reboot | High | Rebuild NVIDIA driver using DKMS or manually with kernel-devel packages |
| Container Cannot Access GPU Devices | `nvidia-smi` fails inside container; "GPU Access Denied" error | High | Upgrade systemd to `systemd-239-68.0.2.al8.1` or later and reboot |
| GPU Performance Degradation or Runtime Errors | `GPU_TIMEOUT`, `OUT_OF_MEMORY`, or `DRIVER_ERROR` in logs | Medium | Monitor metrics, update drivers, adjust workload parameters |

## Problem Details

### Problem 1: GPU Driver Fails to Load After Kernel Update

**Symptoms**
- Error message: `modprobe: FATAL: Module nvidia not found`
- Behavior: `nvidia-smi` returns "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver"
- Context: Occurs after updating the kernel on Alibaba Cloud Linux (e.g., via `yum update`)

**Root Cause**
- The NVIDIA GPU driver was compiled against an older kernel version.
- Due to kABI (Kernel Application Binary Interface) incompatibility, the pre-built kernel object (`.ko`) file cannot be loaded into the new kernel.
- DKMS (Dynamic Kernel Module Support) may not have automatically rebuilt the module if not properly configured.

**Solution**
1. Ensure the `kernel-devel` and `kernel-core` packages for the current kernel are installed:
   ```bash
   sudo yum install -y kernel-devel-$(uname -r) kernel-core-$(uname -r)
   ```
2. Verify that DKMS is installed:
   ```bash
   sudo yum install -y dkms
   ```
3. Rebuild the NVIDIA driver using DKMS:
   ```bash
   sudo dkms autoinstall
   ```
4. If DKMS is not used, manually reinstall the NVIDIA driver from the official package, which will trigger a rebuild against the current kernel.
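
The DKMS steps above can be collected into one helper. This is a minimal sketch, assuming the driver is already registered with DKMS; `rebuild_nvidia_module` and the `RUN` switch are hypothetical names introduced for illustration, and the final `modprobe` is an added load check that is not one of the original steps. By default the function only prints the commands so they can be reviewed first.

```bash
#!/usr/bin/env bash

# Prints the rebuild steps for a kernel release; set RUN=1 to execute them.
# Hypothetical helper, not part of DKMS or the NVIDIA driver package.
rebuild_nvidia_module() {
    local kver="${1:-$(uname -r)}"   # default: the running kernel
    local runner="echo"              # dry run unless RUN=1
    if [ "${RUN:-0}" = "1" ]; then
        runner=""
    fi

    # Build prerequisites for the target kernel, plus DKMS itself.
    $runner sudo yum install -y "kernel-devel-${kver}" "kernel-core-${kver}" dkms
    # Rebuild every DKMS-registered module.
    $runner sudo dkms autoinstall
    # Added check (assumption): confirm the rebuilt module loads.
    $runner sudo modprobe nvidia
}
```

Calling `rebuild_nvidia_module` with no arguments prints the three commands for the running kernel; `RUN=1 rebuild_nvidia_module` runs them.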

**Verification**
- Reboot the instance:
  ```bash
  sudo reboot
  ```
- After reboot, run:
  ```bash
  nvidia-smi
  ```
- Expected output: A table showing GPU details (name, temperature, utilization, etc.) without errors.

### Problem 2: Container Cannot Access GPU Devices

**Symptoms**
- Error message: `GPU Access Denied` (observed when running `nvidia-smi` inside a container)
- Behavior: Containers start successfully but cannot use GPU resources; `nvidia-smi` fails with permission or device access errors
- Context: Occurs on Alibaba Cloud Linux 3 systems with systemd versions below `systemd-239-68.0.2.al8.1`, especially after running `systemctl daemon-reload`

**Root Cause**
- Older versions of systemd do not properly apply device cgroup permissions required for GPU access in containers.
- The `FullDelegationDeviceCGroup` setting is not honored correctly, causing the container runtime (e.g., runc) to lose access to `/dev/nvidia*` devices.

**Solution**
1. Upgrade systemd to a compatible version:
   ```bash
   sudo yum upgrade systemd
   ```
   Confirm with `y` when prompted.
2. Reboot the ECS instance to ensure all services and cgroups are reinitialized:
   ```bash
   sudo reboot
   ```

**Verification**
- After reboot, check the systemd version:
  ```bash
  rpm -qa systemd
  ```
  Ensure the reported version is `systemd-239-68.0.2.al8.1` or later.
- Launch a GPU-enabled container and run:
  ```bash
  nvidia-smi
  ```
- Expected output: Successful display of GPU information inside the container.
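
The systemd version check above can be scripted instead of read by eye. This is a minimal sketch, assuming GNU `sort -V` ordering is close enough to RPM's own version comparison for this purpose; `version_ge` is a hypothetical helper name, not something shipped with systemd or rpm.

```bash
#!/usr/bin/env bash

REQUIRED="239-68.0.2.al8.1"   # minimum version-release named in this guide

# True when version $1 sorts at or after version $2 (GNU sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only run the live check on systems where rpm knows about systemd.
if command -v rpm >/dev/null 2>&1 && rpm -q systemd >/dev/null 2>&1; then
    installed="$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' systemd)"
    if version_ge "$installed" "$REQUIRED"; then
        echo "systemd ${installed}: OK"
    else
        echo "systemd ${installed}: older than ${REQUIRED}, upgrade needed"
    fi
fi
```

For an authoritative comparison, RPM's own ordering rules apply; the `sort -V` shortcut is only an approximation that holds for typical version strings.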

### Problem 3: GPU Performance Degradation or Runtime Errors

**Symptoms**
- Error messages:
  - `GPU_TIMEOUT`: GPU task exceeded expected execution time
  - `OUT_OF_MEMORY`: Unable to allocate required GPU memory
  - `DRIVER_ERROR`: General GPU driver malfunction during computation
- Behavior: Training jobs hang, inference latency spikes, or applications crash unexpectedly
- Context: During AI/ML workloads, especially with large models or high batch sizes

**Root Cause**
- `GPU_TIMEOUT`: Often caused by driver hangs due to bugs or thermal throttling.
- `OUT_OF_MEMORY`: Workload exceeds available GPU memory (VRAM).
- `DRIVER_ERROR`: Outdated, incompatible, or corrupted NVIDIA driver.

**Solution**
1. **For `OUT_OF_MEMORY`**:
   - Reduce model batch size or use gradient checkpointing.
   - Monitor memory usage before scaling:
     ```bash
     nvidia-smi --query-gpu=memory.used,memory.total --format=csv
     ```
2. **For `GPU_TIMEOUT` or `DRIVER_ERROR`**:
   - Update to the latest stable NVIDIA driver supported by Alibaba Cloud Linux.
   - Check system logs for hardware issues:
     ```bash
     dmesg | grep -i nvidia
     ```
   - Ensure adequate cooling and power supply; monitor temperature:
     ```bash
     nvidia-smi --query-gpu=temperature.gpu --format=csv
     ```
3. Enable persistence mode for stability (optional):
   ```bash
   sudo nvidia-smi -pm 1
   ```
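
The memory query above can be turned into a simple headroom check before scaling a workload. This is a minimal sketch; `gpus_above_mem_threshold` is a hypothetical helper that reads `index, used, total` lines (values in MiB) from standard input, which is the shape `nvidia-smi` produces with `--format=csv,noheader,nounits`.

```bash
#!/usr/bin/env bash

# Reads "index, used_mib, total_mib" lines on stdin and prints every GPU
# whose memory usage is at or above the given percentage threshold.
gpus_above_mem_threshold() {
    local threshold_pct="$1"
    awk -F', *' -v limit="$threshold_pct" '
        $3 > 0 {
            pct = ($2 / $3) * 100
            if (pct >= limit) printf "GPU %s: %.0f%% memory used\n", $1, pct
        }'
}

# Example with live data (requires an NVIDIA GPU and driver):
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#       --format=csv,noheader,nounits | gpus_above_mem_threshold 90
```

Keeping the parsing separate from the `nvidia-smi` call makes the check easy to try against saved output from another machine.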

**Verification**
- Run a representative workload and monitor metrics:
  ```bash
  watch -n 1 nvidia-smi
  ```
- Expected behavior: Stable utilization, no error messages, memory usage within limits, and temperature below throttling thresholds (typically < 85°C).
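
The temperature criterion above can be checked with a pass/fail helper suitable for a monitoring script. This is a sketch under stated assumptions: `check_gpu_temps` is a hypothetical name, and the 85 °C default simply mirrors the typical figure quoted above rather than a documented limit for any specific GPU model.

```bash
#!/usr/bin/env bash

# Reads "index, temperature_c" lines on stdin. Prints each GPU at or above
# the limit and returns non-zero if any were found.
check_gpu_temps() {
    local limit_c="${1:-85}"
    awk -F', *' -v limit="$limit_c" '
        $2 + 0 >= limit {
            printf "GPU %s: %d C (limit %d C)\n", $1, $2, limit
            hot = 1
        }
        END { exit (hot ? 1 : 0) }'
}

# Example with live data (requires an NVIDIA GPU and driver):
#   nvidia-smi --query-gpu=index,temperature.gpu \
#       --format=csv,noheader,nounits | check_gpu_temps 85
```

A non-zero exit status lets the check be combined with `||` or used in a cron job that alerts only when a GPU runs hot.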

## FAQ

**Q: How do I check if my GPU driver is compatible with the current kernel?**  
A: Run `modinfo /lib/modules/$(uname -r)/extra/nvidia.ko` (the exact path depends on how the driver was installed). If the file doesn’t exist or `modprobe nvidia` fails, the driver is not built for the current kernel. Rebuild using DKMS as described in Problem 1.
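
Because the module path varies between installation methods, searching the kernel's module tree is more robust than testing one fixed path. This is a minimal sketch, assuming GNU `find`; `nvidia_module_present` is a hypothetical helper, and the second argument exists only so the function can be pointed at a different module root.

```bash
#!/usr/bin/env bash

# Succeeds if an nvidia kernel module file exists for the given kernel release.
# Matches nvidia.ko as well as compressed variants such as nvidia.ko.xz.
nvidia_module_present() {
    local kver="${1:-$(uname -r)}"     # default: the running kernel
    local root="${2:-/lib/modules}"    # module root, overridable
    [ -n "$(find "${root}/${kver}" -name 'nvidia.ko*' -print -quit 2>/dev/null)" ]
}

if nvidia_module_present; then
    echo "nvidia module found for $(uname -r)"
else
    echo "no nvidia module for $(uname -r); rebuild as described in Problem 1"
fi
```

A found file only shows that a module was built for this kernel; `modprobe nvidia` remains the definitive test that it loads.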

**Q: What permissions are required for containers to access GPUs?**  
A: The container runtime must have access to `/dev/nvidia*` devices and the NVIDIA driver libraries. On Alibaba Cloud Linux 3, this also requires systemd ≥ 239-68.0.2.al8.1 to correctly delegate device cgroups. No additional user permissions are needed beyond standard container execution rights.

**Q: How can I enable detailed GPU diagnostics?**  
A: Use `nvidia-smi` for real-time monitoring. For deeper profiling, use NVIDIA tools such as `dcgmi` (Data Center GPU Manager) or Nsight Systems (`nsys`). The `nvidia-fs` and `nvidia-peermem` kernel modules are needed only for GPUDirect Storage and GPUDirect RDMA respectively; ordinary multi-GPU or NVLink workloads do not require them.

**Q: Why does my GPU show 0% utilization even when running a workload?**  
A: This may indicate CPU bottlenecks, inefficient data loading, or framework-level issues (e.g., TensorFlow/PyTorch not using GPU). Verify with `nvidia-smi` while the process is active, and check that tensors are explicitly placed on the GPU (e.g., `.cuda()` in PyTorch).

**Q: Can I avoid rebuilding the GPU driver after every kernel update?**  
A: Yes. Ensure DKMS is installed and the NVIDIA driver is registered with DKMS. When properly configured, DKMS automatically rebuilds the driver module whenever a new kernel is installed, provided the matching `kernel-devel` package is also present, so no manual intervention is needed.