# alinux-cluster

Part of **ALINUX**

# Alibaba Cloud Linux Cluster Management Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| Unresponsive Cluster Node | Node appears offline in monitoring; `kubectl get nodes` shows `NotReady` status | High | Use node diagnostics to run health checks and identify underlying resource or service failures |
| High CPU or Memory Usage on Node | Performance degradation; alerts triggered for resource saturation | Medium | Enable performance analysis via node diagnostics to profile processes and identify resource hogs |
| Missing or Incomplete Diagnostic Logs | Log analysis fails or returns empty results | Low | Verify logging agent status and ensure proper permissions for log collection |

## Problem Details

### Problem 1: Unresponsive Cluster Node

**Symptoms**

- Error message: `Node condition Ready is false`
- Behavior: Pod scheduling fails; applications experience downtime
- Context: Occurs after system updates, kernel panics, or network partition events

**Root Cause**

- The kubelet service may have crashed or become unresponsive
- Underlying issues include disk pressure, memory exhaustion, or network interface failure
- System services required for node operation (e.g., container runtime) may be stopped

**Solution**

1. Access the Alibaba Cloud Console and navigate to **Container Service > Clusters > [Your Cluster] > Nodes**
2. Select the affected node and click **Diagnose** to launch the built-in node diagnostics tool
3. Run the **Health Check** diagnostic module to assess core node components
4. If kubelet is down, SSH into the node and restart it:
   ```bash
   sudo systemctl restart kubelet
   ```
5. Check system logs for critical errors:
   ```bash
   sudo journalctl -u kubelet -n 100 --no-pager
   ```

**Verification**

- Run `kubectl get nodes` and confirm the node status changes to `Ready`
- Observe that new pods are scheduled successfully on the node
- The diagnostic report shows all health checks passing

### Problem 2: High CPU or Memory Usage on Node

**Symptoms**

- Monitoring dashboards show sustained CPU >90% or memory usage near capacity
- Application latency increases; some pods may be evicted
- `top` or `htop` shows unexpected process activity

**Root Cause**

- A user workload or system daemon is consuming excessive resources
- Memory leaks in long-running applications
- Misconfigured resource limits allowing containers to over-consume

**Solution**

1. In the Alibaba Cloud Console, go to the target node and initiate **Node Diagnostics**
2. Select **Performance Analysis** and specify a duration (e.g., 60 seconds)
3. After analysis completes, review the generated flame graph or process list to identify top consumers
4. For runaway processes, either:
   - Terminate non-critical processes, replacing `<PID>` with the process ID identified in the analysis:
     ```bash
     sudo kill -9 <PID>
     ```
   - Or adjust pod resource requests/limits in your Kubernetes manifests
5. Consider enabling vertical pod autoscaling if workloads have variable demand

**Verification**

- Resource usage metrics return to baseline levels (<70% CPU, <80% memory)
- No further pod evictions due to resource pressure
- Performance analysis report shows balanced process activity

### Problem 3: Missing or Incomplete Diagnostic Logs

**Symptoms**

- Log analysis in node diagnostics returns "No logs found" or truncated data
- Critical system events are not visible in the diagnostic report
- Manual log inspection shows gaps in `/var/log/messages` or the journal

**Root Cause**

- The logging agent (e.g., Logtail or journald forwarder) is not running
- Insufficient permissions prevent log collection from sensitive directories
- Disk space exhaustion caused log rotation or truncation

**Solution**

1. Confirm the logging agent is active:
   ```bash
   sudo systemctl status aliyun-logtail
   ```
2. If inactive, start and enable it:
   ```bash
   sudo systemctl start aliyun-logtail
   sudo systemctl enable aliyun-logtail
   ```
3. Verify disk space availability:
   ```bash
   df -h /var/log
   ```
4. Ensure the node role has the necessary permissions for log access (via the RAM policy attached to the instance)
5. Re-run node diagnostics with **Log Analysis** enabled after agent recovery

**Verification**

- Log analysis module returns recent entries from system and application logs
- Diagnostic report includes kernel messages, kubelet logs, and container runtime events
- Continuous log streaming resumes in the console monitoring view

## FAQ

**Q: How do I check if a cluster node is healthy?**

A: Use the built-in node diagnostics feature in the Alibaba Cloud Console. Navigate to your cluster’s node list, select a node, and run the Health Check diagnostic. Alternatively, use `kubectl describe node <node-name>` to inspect conditions like Ready, MemoryPressure, and DiskPressure.

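
For a quick command-line pass over every node before opening the console, the status column of `kubectl get nodes` can be filtered. The following is a minimal sketch, not part of any Alibaba Cloud tooling: `not_ready_nodes` is a hypothetical helper name, and it assumes the standard `kubectl get nodes --no-headers` column layout (node name first, status second). Cordoned nodes reported as `Ready,SchedulingDisabled` are treated as ready.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not an Alibaba Cloud or kubectl feature):
# reads the output of `kubectl get nodes --no-headers` on stdin and prints the
# name of every node whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Example usage against a live cluster (requires kubectl and cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Any node name printed by the helper is a candidate for the Health Check diagnostic or a closer look with `kubectl describe node`.
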
**Q: What permissions are required to run node diagnostics?**

A: The user must have the `cs:DescribeClusterNode` and `cs:RunNodeDiagnostics` permissions in their RAM policy. The node’s instance role must also allow log collection and system introspection via the appropriate service-linked roles.

**Q: How do I enable debug logging for kubelet on Alibaba Cloud Linux?**

A: Edit the kubelet configuration file (typically `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`) and add `--v=4` to the `ExecStart` line. Then reload and restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```

Logs will appear in `journalctl -u kubelet`.

**Q: Can I automate node diagnostics using APIs?**

A: Yes. The Alibaba Cloud Container Service API supports triggering node diagnostics programmatically via the `RunNodeDiagnostics` action. Refer to the official API documentation for request parameters and authentication details.

**Q: Are there any costs associated with using node diagnostics?**

A: No. The node diagnostics feature is provided at no additional charge. There are no usage quotas or billing implications for running health checks, performance analysis, or log analysis on cluster nodes.