# rds-monitoring

# ApsaraDB RDS Monitoring and Alerts Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| High CPU Utilization | CPU usage consistently >80% in CloudMonitor dashboard | High | Identify and optimize slow queries or reduce concurrent connections |
| High Memory Usage | Memory usage near instance limit; swap activity observed | Medium | Tune database buffer settings or scale instance vertically |
| High Disk Usage | Disk usage >90%; write operations fail or slow | High | Clean up unnecessary data or expand storage capacity |
| High IOPS Consumption | IOPS sustained at or near provisioned limit | Medium | Optimize query patterns or upgrade to an instance type with higher IOPS |

## Problem Details

### Problem 1: High CPU Utilization

**Symptoms**
- Error message: `CPU utilization exceeds 80% for more than 5 minutes`
- Behavior: Queries become slow or time out; application latency increases
- Context: Occurs during peak traffic or after deployment of inefficient queries

**Root Cause**
- Long-running or unoptimized SQL queries consume excessive CPU cycles
- High concurrency without connection pooling overwhelms the database engine
- Missing or stale indexes force full table scans

**Solution**
1. Access the **CloudMonitor** console and navigate to your RDS instance’s **Performance Monitoring** tab.
2. Use the **Slow Query Log** feature to identify top CPU-consuming queries:
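   ```sql
   -- Enable the slow query log if it is not already enabled (MySQL/MariaDB).
   -- On a managed instance, SET GLOBAL may be restricted; if so, change these
   -- parameters in the RDS console instead.
   SET GLOBAL slow_query_log = 'ON';
   SET GLOBAL long_query_time = 1;  -- log statements that run longer than 1 second
   ```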
3. Analyze slow queries using `EXPLAIN` and add appropriate indexes:
   ```sql
   EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
   ```
4. Reduce concurrency by implementing application-level connection pooling.
5. If the workload is legitimate, consider scaling to a higher-spec instance via the RDS console: **Instance Details > Change Specifications**.
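
As a rough illustration of the connection pooling in step 4, the sketch below caps the number of open connections at a fixed size and makes callers wait for a free one. It is a minimal sketch, not a production pool: `ConnectionPool` and `fake_connect` are hypothetical names, and `fake_connect` stands in for whatever driver call your application uses to open a database connection.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Bounded pool: exactly `size` connections are opened and then reused."""

    def __init__(self, connect, size):
        # `connect` is a caller-supplied zero-argument factory (hypothetical here).
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Block until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)


# Stand-in factory so the sketch runs without a database driver.
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=2)
for _ in range(10):
    with pool.connection() as conn:
        pass  # run queries on `conn` here

print(len(opened))  # 2: ten units of work shared two connections
```

In practice most drivers and frameworks ship their own pool; the point of the sketch is only that the database sees a bounded number of sessions regardless of application concurrency.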

**Verification**
- After optimization, CPU utilization should drop below 70% during similar load periods.
- Confirm via CloudMonitor that no new slow queries appear in the last 15 minutes.
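
To check which statements still dominate after a change, an exported slow query log can be summarized offline. The sketch below assumes the standard MySQL slow log text format (`# Query_time: ... Rows_examined: ...` header lines followed by the statement) and single-line statements; `summarize_slow_log` and the sample entries are illustrative, and the digit masking is deliberately crude.

```python
import re
from collections import defaultdict

# Matches the per-entry header line of the MySQL slow query log.
HEADER = re.compile(
    r"^# Query_time: (?P<qt>[\d.]+)\s+Lock_time: [\d.]+"
    r"\s+Rows_sent: \d+\s+Rows_examined: (?P<rows>\d+)"
)


def summarize_slow_log(text):
    """Total Query_time and Rows_examined per statement shape, slowest first."""
    totals = defaultdict(lambda: [0.0, 0])
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = (float(match["qt"]), int(match["rows"]))
        elif current and not line.startswith(("#", "SET timestamp", "use ")):
            # Crude normalization: replace every digit run with "?" so that
            # statements differing only in numeric literals group together.
            shape = re.sub(r"\d+", "?", line.strip().rstrip(";"))
            totals[shape][0] += current[0]
            totals[shape][1] += current[1]
            current = None
    return sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)


# Made-up sample entries in slow log format.
SAMPLE = """\
# Time: 2024-05-01T10:00:00.000000Z
# User@Host: app[app] @  [10.0.0.5]  Id:    42
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 100000
SET timestamp=1714557600;
SELECT * FROM orders WHERE customer_id = 12345;
# Time: 2024-05-01T10:01:00.000000Z
# User@Host: app[app] @  [10.0.0.5]  Id:    43
# Query_time: 1.500000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 90000
SET timestamp=1714557660;
SELECT * FROM orders WHERE customer_id = 67890;
# Time: 2024-05-01T10:02:00.000000Z
# User@Host: app[app] @  [10.0.0.6]  Id:    44
# Query_time: 1.200000  Lock_time: 0.000000 Rows_sent: 10  Rows_examined: 10
SET timestamp=1714557720;
SELECT id FROM customers LIMIT 10;
"""

summary = summarize_slow_log(SAMPLE)
for shape, (seconds, rows) in summary:
    print(f"{seconds:6.2f}s  {rows:>8} rows examined  {shape}")
```

A high `Rows_examined` relative to rows returned is a common hint that the statement is scanning rather than using an index, which is what the `EXPLAIN` step is meant to confirm.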

### Problem 2: High Memory Usage

**Symptoms**
- Error message: `Memory usage exceeds 90%`
- Behavior: Increased response times; possible OOM (Out-of-Memory) kills in logs
- Context: Common after enabling large result sets or increasing buffer pool size beyond capacity

**Root Cause**
- Database buffer/cache settings (e.g., `innodb_buffer_pool_size`) exceed available memory
- Large in-memory temporary tables and sort buffers consume per-session memory before they spill to disk
- Memory leak in application logic causing repeated large fetches

**Solution**
1. Check current memory allocation in parameter settings:
   - For MySQL: Review `innodb_buffer_pool_size`, `key_buffer_size`
   - For PostgreSQL: Review `shared_buffers`, `work_mem`
2. Adjust parameters via **Parameter Templates** in RDS console:
   - Set `innodb_buffer_pool_size` to ≤ 75% of total instance memory
3. Avoid `SELECT *` on wide tables; paginate large result sets in application code.
4. If memory pressure persists, upgrade to an instance type with more RAM through **Change Specifications**.
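
The 75% guideline from step 2 can be turned into a concrete value with a small calculation. The sketch below is an assumption-laden helper, not an official sizing formula: `max_buffer_pool_bytes` is a hypothetical name, and the 128 MiB rounding unit assumes the default `innodb_buffer_pool_chunk_size` with a single buffer pool instance.

```python
MIB = 1024 * 1024


def max_buffer_pool_bytes(instance_memory_bytes, ratio=0.75, chunk_bytes=128 * MIB):
    """Largest multiple of `chunk_bytes` that is <= ratio * instance memory."""
    cap = int(instance_memory_bytes * ratio)
    return (cap // chunk_bytes) * chunk_bytes


# Example: an instance with 8 GiB of memory.
size = max_buffer_pool_bytes(8 * 1024 * MIB)
print(size // MIB, "MiB")  # 6144 MiB
```

Rounding down rather than up keeps the setting under the cap; the remaining memory is left for per-connection buffers and the operating system.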

**Verification**
- Monitor **Memory Usage %** metric in CloudMonitor; should stabilize below 85%
- Confirm that the error log contains no `Out of memory` or `Killed process` entries. For a log file you can read directly (the path below is an example):
  ```bash
  grep -i "killed process" /var/log/mysqld.log
  ```

### Problem 3: High Disk Usage

**Symptoms**
- Error message: `Disk usage exceeds 90%`
- Behavior: Write operations fail with `ERROR 3 (HY000): Error writing file`; backups may fail
- Context: Accumulation of binary logs, audit logs, or unused tables over time

**Solution**
1. Discard old binary logs (MySQL):
   ```sql
   PURGE BINARY LOGS BEFORE '2023-01-01 00:00:00';
   ```
2. Delete unnecessary tables or archive historical data:
   ```sql
   DROP TABLE IF EXISTS old_logs_2020;
   ```
3. Enable automatic log cleanup via RDS parameter:
   - Set `binlog_expire_logs_seconds` to 604800 (7 days) for MySQL 8.0+
4. If cleanup is insufficient, expand disk capacity in RDS console: **Instance Details > Change Specifications > Storage**
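
To decide between cleanup and expanding storage, it can help to estimate how soon the 90% mark will be reached at the current growth rate. The sketch below uses a simple linear projection between the first and last reading; `days_until_threshold` and the sample readings are hypothetical, and real growth is rarely perfectly linear.

```python
def days_until_threshold(samples, capacity_gb, threshold=0.90):
    """Estimate days until disk usage reaches `threshold` of capacity.

    `samples` is a list of (day, used_gb) pairs, oldest first. Growth is
    taken as the average daily change between the first and last sample.
    Returns 0.0 if already at the threshold, None if usage is not growing.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("need at least two samples on different days")
    limit_gb = capacity_gb * threshold
    if last_used >= limit_gb:
        return 0.0
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (limit_gb - last_used) / growth_per_day


# Hypothetical readings: 60 GB used a week ago, 67 GB today, on a 100 GB disk.
remaining = days_until_threshold([(0, 60.0), (7, 67.0)], capacity_gb=100)
print(f"about {remaining:.0f} days until 90% usage")
```

If the estimate is short even after purging logs and dropping unused tables, expanding storage is likely the safer option.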

**Verification**
- Disk usage metric in CloudMonitor drops below 80%
- New writes succeed without `disk full` errors

### Problem 4: High IOPS Consumption

**Symptoms**
- Error message: `IOPS usage reaches provisioned limit`
- Behavior: Latency spikes; throughput plateaus despite increased load
- Context: Heavy random read/write workloads (e.g., analytics on transactional DB)

**Root Cause**
- Workload exceeds baseline IOPS capacity of current instance class
- Inefficient queries cause excessive disk seeks (e.g., full scans without indexes)
- Small I/O block sizes increase IOPS count unnecessarily

**Solution**
1. Identify I/O-intensive queries using performance insights or slow logs.
2. Add covering indexes so that matching queries can be answered from the index alone, avoiding extra row lookups on disk:
   ```sql
   CREATE INDEX idx_customer_order ON orders(customer_id, order_date);
   ```
3. Batch small writes into larger transactions to reduce I/O operations.
4. Upgrade to an instance type with higher baseline or burst IOPS (e.g., from general-purpose to dedicated instance).
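
The batching idea in step 3 can be sketched as follows. The example uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere; the `events` table, the `insert_batched` helper, and the batch size of 500 are illustrative assumptions, and the same pattern (many rows per commit) would apply with a MySQL or PostgreSQL driver.

```python
import sqlite3


def insert_batched(conn, rows, batch_size=500):
    """Insert rows in chunks, committing once per chunk instead of once per row.

    Returns the number of commits performed.
    """
    commits = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        with conn:  # commits on success, rolls back if the chunk fails
            conn.executemany(
                "INSERT INTO events (id, payload) VALUES (?, ?)", chunk
            )
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(i, f"event-{i}") for i in range(1200)]
commits = insert_batched(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(commits, count)  # 3 commits for 1200 rows
```

Each commit typically forces a log flush, so grouping rows this way reduces the number of flushes; very large batches hold locks longer, so the batch size is a trade-off to tune.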

**Verification**
- IOPS metric stabilizes below 90% of provisioned limit
- Average I/O latency decreases in CloudMonitor charts

## FAQ

**Q: How do I check real-time resource usage for my RDS instance?**  
A: Go to the **CloudMonitor** console, select your RDS instance, and view the **Performance Monitoring** dashboard. Metrics include CPU, memory, and disk usage percentages as well as IOPS, refreshed every minute.

**Q: Why am I not receiving alert notifications when thresholds are breached?**  
A: Verify that (1) an alert rule exists in **CloudMonitor > Alert Rules**, (2) the contact group has valid notification channels (email/SMS), and (3) the metric has stayed beyond the threshold for the full evaluation duration (e.g., 5 minutes).
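
Point (3) is a common source of confusion: a brief spike does not trigger a rule that requires a sustained breach. The sketch below models that idea with one-minute samples and a required run of consecutive breaches. It is a simplified model with a hypothetical `breaches_for` helper, and the exact evaluation semantics in CloudMonitor may differ.

```python
def breaches_for(samples, threshold, periods):
    """True if `periods` consecutive samples all exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= periods:
            return True
    return False


# One-minute CPU readings (percent); the rule is ">80% for 5 minutes".
sustained = [85, 90, 70, 88, 92, 95, 91, 89]
spiky = [85, 90, 95, 70, 85, 90]

print(breaches_for(sustained, threshold=80, periods=5))  # True
print(breaches_for(spiky, threshold=80, periods=5))      # False
```

In the second series the metric is above 80% most of the time, yet the single dip to 70% resets the run, so no notification would be expected under this model.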

**Q: Can I customize which metrics trigger alerts?**  
A: Yes. In the CloudMonitor console, create a custom alert rule by selecting your RDS instance, choosing a metric (e.g., `CPUUtilization`), setting a threshold (e.g., >80%), and specifying evaluation period and notification targets.

**Q: What is the difference between basic and enhanced monitoring for RDS?**  
A: Basic monitoring provides 1-minute granularity for core metrics. Enhanced monitoring (if enabled) offers OS-level metrics like disk queue depth and network throughput at 1-second granularity, but requires additional permissions and may incur costs.

**Q: How can I reduce false alerts during maintenance windows?**  
A: Use **Alert Rule Maintenance Windows** in CloudMonitor to temporarily disable notifications during scheduled activities like batch jobs or schema migrations.