# alinux-cluster

Part of **ALINUX**

# Alibaba Cloud Linux Cluster Management Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| Unresponsive Cluster Node | Node appears offline in monitoring; `kubectl get nodes` shows `NotReady` status | High | Use node diagnostics to run health checks and identify underlying resource or service failures |
| High CPU or Memory Usage on Node | Performance degradation; alerts triggered for resource saturation | Medium | Enable performance analysis via node diagnostics to profile processes and identify resource hogs |
| Missing or Incomplete Diagnostic Logs | Log analysis fails or returns empty results | Low | Verify logging agent status and ensure proper permissions for log collection |

## Problem Details

### Problem 1: Unresponsive Cluster Node

**Symptoms**
- Error message: `Node condition Ready is false`
- Behavior: Pod scheduling fails; applications experience downtime
- Context: Often occurs after system updates, kernel panics, or network partition events

**Root Cause**
- The kubelet service may have crashed or become unresponsive
- Underlying issues include disk pressure, memory exhaustion, or network interface failure
- System services required for node operation (e.g., container runtime) may be stopped

**Solution**
1. Access the Alibaba Cloud Console and navigate to **Container Service > Clusters > [Your Cluster] > Nodes**
2. Select the affected node and click **Diagnose** to launch the built-in node diagnostics tool
3. Run the **Health Check** diagnostic module to assess core node components
4. If kubelet is down, SSH into the node and restart it:
   ```bash
   sudo systemctl restart kubelet
   ```
5. Check system logs for critical errors:
   ```bash
   sudo journalctl -u kubelet -n 100 --no-pager
   ```
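
Steps 4 and 5 can be combined into a small helper that restarts a unit only when it is not running and then prints its recent journal entries. This is a minimal sketch, assuming a systemd-managed node. The function names `needs_restart` and `recover_unit` are illustrative, and `containerd` as the runtime unit name is an assumption that may not match your node.

```bash
#!/usr/bin/env bash
# Sketch: restart a systemd unit only if it is not running, then show recent logs.
# needs_restart and recover_unit are hypothetical helper names.

# Return 0 (true) when the state printed by `systemctl is-active` calls for a restart.
needs_restart() {
  case "$1" in
    active|activating|reloading) return 1 ;;
    *) return 0 ;;   # inactive, failed, deactivating, unknown, or empty
  esac
}

# Check one unit, restart it if needed, and print its last 100 journal lines.
recover_unit() {
  local unit="$1" state
  state="$(systemctl is-active "$unit" 2>/dev/null)"
  if needs_restart "$state"; then
    echo "$unit is '${state:-unknown}', restarting"
    sudo systemctl restart "$unit"
  else
    echo "$unit is '$state', no restart needed"
  fi
  sudo journalctl -u "$unit" -n 100 --no-pager
}

# Example (run on the affected node):
#   recover_unit kubelet
#   recover_unit containerd   # assumption: containerd is the container runtime
```

Restarting only when the unit is not active avoids interrupting a kubelet that is running but reporting `NotReady` for another reason, such as disk pressure.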

**Verification**
- Run `kubectl get nodes` and confirm the node status changes to `Ready`
- Observe that new pods are scheduled successfully on the node
- A rerun of the **Health Check** diagnostic reports all checks passing
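
The first check can be scripted so that any node still not ready is listed by name. The sketch below assumes `kubectl` is configured for the cluster. `not_ready_nodes` is an illustrative helper that only parses the text output of `kubectl get nodes --no-headers`.

```bash
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not reported.
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Example:
#   kubectl get nodes --no-headers | not_ready_nodes
#
# To block until a specific node recovers (or the timeout expires):
#   kubectl wait --for=condition=Ready node/<node-name> --timeout=120s
```

Empty output means every node reports `Ready`.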

### Problem 2: High CPU or Memory Usage on Node

**Symptoms**
- Monitoring dashboards show sustained CPU >90% or memory usage near capacity
- Application latency increases; some pods may be evicted
- `top` or `htop` shows unexpected process activity

**Root Cause**
- A user workload or system daemon is consuming excessive resources
- Memory leaks in long-running applications
- Misconfigured resource limits allowing containers to over-consume

**Solution**
1. In the Alibaba Cloud Console, navigate to **Container Service > Clusters > [Your Cluster] > Nodes**, select the target node, and start **Node Diagnostics**
2. Select **Performance Analysis** and specify a duration (e.g., 60 seconds)
3. After analysis completes, review the generated flame graph or process list to identify top consumers
4. For runaway processes, either:
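   - Terminate non-critical processes, sending `SIGTERM` first and escalating to `SIGKILL` only if the process does not exit:
     ```bash
     sudo kill <PID>      # SIGTERM: lets the process shut down cleanly
     sudo kill -9 <PID>   # SIGKILL: last resort if it is still running
     ```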
   - Or adjust pod resource requests/limits in your Kubernetes manifests
5. Consider enabling the Vertical Pod Autoscaler (VPA) if workload demand varies over time
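
When the console report is not at hand, the same triage can be approximated on the node itself. The following is a rough sketch, not a replacement for the Performance Analysis module. `cpu_hogs` and `stop_process` are hypothetical helper names, and the 50% threshold and 10-second grace period are arbitrary defaults.

```bash
#!/usr/bin/env bash
# Print processes at or above a CPU threshold (percent, default 50).
# Reads `ps -eo pid,pcpu,pmem,comm --no-headers` output on stdin.
cpu_hogs() {
  local threshold="${1:-50}"
  awk -v t="$threshold" 'NF >= 4 && $2 + 0 >= t + 0 { print $0 }'
}

# Ask a process to exit with SIGTERM and escalate to SIGKILL only if it is
# still present after the grace period (seconds, default 10).
stop_process() {
  local pid="$1" grace="${2:-10}" waited=0
  # Fails if the process is already gone or the signal is not permitted.
  kill -TERM "$pid" 2>/dev/null || return 1
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
    sleep 1
    waited=$((waited + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"
  fi
}

# Example (run as root on the node):
#   ps -eo pid,pcpu,pmem,comm --no-headers | cpu_hogs 90
#   stop_process <PID> 15
```

Giving the process a grace period lets it flush buffers and release locks, which an immediate `SIGKILL` would prevent.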

**Verification**
- Resource usage metrics return to baseline levels (<70% CPU, <80% memory)
- No further pod evictions due to resource pressure
- Performance analysis report shows balanced process activity
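
The baseline thresholds above can be checked from the command line. This sketch assumes a metrics source such as metrics-server is installed so that `kubectl top nodes` returns data. `over_baseline` is an illustrative helper that flags nodes at or above the given CPU and memory percentages.

```bash
#!/usr/bin/env bash
# Print nodes at or above the CPU or memory percentage limits
# (defaults: 70 for CPU, 80 for memory, matching the baseline above).
# Reads `kubectl top nodes --no-headers` output on stdin, whose columns are:
#   NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
# Non-numeric values such as "<unknown>" evaluate to 0 and are not flagged.
over_baseline() {
  local cpu_max="${1:-70}" mem_max="${2:-80}"
  awk -v c="$cpu_max" -v m="$mem_max" '
    NF >= 5 {
      cpu = $3; mem = $5
      sub(/%$/, "", cpu); sub(/%$/, "", mem)
      if (cpu + 0 >= c + 0 || mem + 0 >= m + 0) print $1, $3, $5
    }'
}

# Example:
#   kubectl top nodes --no-headers | over_baseline
#   kubectl top nodes --no-headers | over_baseline 85 90
```

Empty output indicates that every reporting node is below both limits.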

### Problem 3: Missing or Incomplete Diagnostic Logs

**Symptoms**
- Log analysis in node diagnostics returns "No logs found" or truncated data
- Critical system events are not visible in the diagnostic report
- Manual log inspection shows gaps in `/var/log/messages` or journal

**Root Cause**
- The logging agent (e.g., Logtail or journald forwarder) is not running
- Insufficient permissions prevent log collection from sensitive directories
- Disk space exhaustion forced premature log rotation or truncated log files

**Solution**
1. Confirm the logging agent is active (substitute the unit name your installation uses; Logtail is commonly registered as `ilogtaild`):
   ```bash
   sudo systemctl status aliyun-logtail
   ```
2. If inactive, start and enable it:
   ```bash
   sudo systemctl start aliyun-logtail
   sudo systemctl enable aliyun-logtail
   ```
3. Verify disk space availability:
   ```bash
   df -h /var/log
   ```
4. Ensure the RAM role attached to the node's ECS instance grants the permissions required for log access
5. Re-run node diagnostics with **Log Analysis** enabled after agent recovery
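
The disk space check in step 3 is easier to automate when the usage figure is extracted as a number. The sketch below parses the POSIX-format output of `df -P`. `disk_use_pct` is an illustrative name and the 90% warning level is an assumed threshold, not a documented limit.

```bash
#!/usr/bin/env bash
# Print the usage percentage (digits only) of the filesystem described by
# `df -P <path>` output on stdin. Line 1 is the header, line 2 the filesystem.
disk_use_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Example:
#   pct="$(df -P /var/log | disk_use_pct)"
#   if [ "$pct" -ge 90 ]; then
#     echo "filesystem holding /var/log is ${pct}% full"
#   fi
#
# Space used by the systemd journal itself:
#   journalctl --disk-usage
```

Using `df -P` instead of `df -h` keeps each filesystem on a single line, so the column position is stable even for long device names.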

**Verification**
- Log analysis module returns recent entries from system and application logs
- Diagnostic report includes kernel messages, kubelet logs, and container runtime events
- Continuous log streaming resumes in the console monitoring view

## FAQ

**Q: How do I check if a cluster node is healthy?**  
A: Use the built-in node diagnostics feature in the Alibaba Cloud Console. Navigate to your cluster’s node list, select a node, and run the Health Check diagnostic. Alternatively, use `kubectl describe node <node-name>` to inspect conditions like Ready, MemoryPressure, and DiskPressure.
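
For a quick command-line check, the node conditions can be reduced to only those in an unhealthy state. This is a minimal sketch that assumes `kubectl` access to the cluster. `unhealthy_conditions` is an illustrative helper name.

```bash
#!/usr/bin/env bash
# Print node conditions that are in an unhealthy state.
# Reads "Type=Status" lines on stdin. Ready is healthy only when True;
# every other condition (MemoryPressure, DiskPressure, ...) is healthy
# only when False. A status of Unknown is always reported.
unhealthy_conditions() {
  awk -F= '
    $1 == "Ready" && $2 != "True" { print $1 "=" $2; next }
    NF == 2 && $1 != "Ready" && $2 != "False" { print $1 "=" $2 }'
}

# Example:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

No output means the node reports `Ready=True` and no pressure conditions.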

**Q: What permissions are required to run node diagnostics?**  
A: The user must have the `cs:DescribeClusterNode` and `cs:RunNodeDiagnostics` permissions in their RAM policy. The node’s instance role must also allow log collection and system introspection via the appropriate service-linked roles.

**Q: How do I enable debug logging for kubelet on Alibaba Cloud Linux?**  
A: Edit the kubelet configuration file (typically `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`) and append `--v=4` to the kubelet arguments on the non-empty `ExecStart` line. Then reload and restart:
```bash
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```
View the more verbose output with `journalctl -u kubelet`.
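
Before editing the file by hand, the change can be previewed with a filter that prints the modified unit text without writing anything. This is a sketch under the assumption that the drop-in uses the usual kubeadm layout, with an empty `ExecStart=` reset line followed by the real command. `add_kubelet_verbosity` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Append --v=<level> (default 4) to every non-empty ExecStart line that does
# not already set a verbosity flag. Reads unit text on stdin and writes the
# result to stdout; the file on disk is not modified.
add_kubelet_verbosity() {
  local level="${1:-4}"
  awk -v lvl="$level" '
    /^ExecStart=.+/ && !/--v=/ { print $0 " --v=" lvl; next }
    { print }'
}

# Example (preview only):
#   add_kubelet_verbosity 4 < /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
#
# Follow the kubelet log after the restart:
#   sudo journalctl -u kubelet -f
```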

**Q: Can I automate node diagnostics using APIs?**  
A: Yes. The Alibaba Cloud Container Service API supports triggering node diagnostics programmatically via the `RunNodeDiagnostics` action. Refer to the official API documentation for request parameters and authentication details.

**Q: Are there any costs associated with using node diagnostics?**  
A: No. The node diagnostics feature is provided at no additional charge. There are no usage quotas or billing implications for running health checks, performance analysis, or log analysis on cluster nodes.