# oceanbase-sqlperf

Part of **OCEANBASE**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Optimize a slow-running SQL query](../../intent/oceanbase-optimize-query/SKILL.md). If you're unsure which path to take, check the routing skill first.

# OceanBase SQL Optimization and Performance Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| Slow SQL Query Execution | Queries consistently take longer than expected; high `EXECUTE_TIME` in `gv$sql_audit` | High | Analyze execution plan with `EXPLAIN`, verify index usage, and check data statistics accuracy |
| Sudden SQL Response Time (RT) Jitter | Intermittent spikes in query latency; high `QUEUE_TIME` or `GET_PLAN_TIME` | High | Query `gv$sql_audit` during jitter window, disable `ob_enable_sql_audit` temporarily, check for follower-to-leader switchovers |
| Suboptimal Execution Plan Selection | Query uses full table scan despite available index; high `COST` or low `EST. ROWS` accuracy | Medium | Update table statistics via major compaction, use optimizer hints, or create covering indexes |
| Excessive Hard Parsing Due to Literal Values | High CPU usage; repeated similar queries with different literals | Medium | Rewrite queries using bind variables to enable soft parsing |
| OBProxy Routing to Non-Leader Replica | Increased latency due to cross-node forwarding; elevated `RETRY_CNT` | Medium | Verify OBProxy configuration and ensure it routes to leader replicas |

## Problem Details

### Problem 1: Slow SQL Query Execution

**Symptoms**
- Error message: Not applicable (performance degradation, not a crash)
- Behavior: Queries run significantly slower than baseline; observed via application logs or monitoring dashboards
- Context: Occurs during normal operation without recent schema or workload changes

**Root Cause**
- Inefficient execution plan (e.g., missing index, poor join order)
- Outdated or inaccurate table/column statistics leading to incorrect cost estimation
- High logical I/O due to excessive block reads (`DISK_READS`, `BLOCK_CACHE_HIT`)

**Solution**
1. Identify the slow query using the `gv$sql_audit` view:
   ```sql
   SELECT /*+ READ_CONSISTENCY(WEAK) */ 
     query_sql, 
     avg_rt, 
     execute_time, 
     queue_time, 
     get_plan_time,
     retry_cnt,
     is_hit_plan
   FROM gv$sql_audit
   WHERE request_time BETWEEN '2024-01-01 10:00:00' AND '2024-01-01 10:05:00'
   ORDER BY avg_rt DESC
   LIMIT 10;
   ```
2. Obtain the execution plan using `EXPLAIN`:
   ```sql
   EXPLAIN FORMAT = 'EXTENDED' <your_slow_query>;
   ```
3. Check if indexes are used. If not, verify:
   - Index exists on filter/join columns
   - Statistics are up to date (trigger a major compaction if needed)
4. If statistics are stale, initiate a major compaction:
   ```sql
   ALTER SYSTEM MAJOR FREEZE;
   ```

**Verification**
- Re-run the query and confirm reduced `avg_rt` in `gv$sql_audit`
- Execution plan should show index access (e.g., `INDEX RANGE SCAN`) instead of `TABLE SCAN`

### Problem 2: Sudden SQL Response Time (RT) Jitter

**Symptoms**
- Error message: Not applicable
- Behavior: Query latency suddenly spikes from milliseconds to seconds intermittently
- Context: Often coincides with system maintenance, failover events, or audit logging overhead

**Root Cause**
- High `QUEUE_TIME`: Query waits in scheduler queue due to resource contention
- High `GET_PLAN_TIME`: Plan generation delay due to hard parse or cache miss
- Follower-to-leader replica switchovers causing retries (`RETRY_CNT > 0`)
- Overhead from `sql_audit` logging during high-load periods

**Solution**
1. Confirm jitter timing and isolate top queries:
   ```sql
   SELECT query_sql, avg_rt, queue_time, get_plan_time, retry_cnt, is_hit_plan
   FROM gv$sql_audit
   WHERE request_time BETWEEN '<start_of_jitter>' AND '<end_of_jitter>'
   ORDER BY avg_rt DESC
   LIMIT 5;
   ```
2. Temporarily disable SQL audit to reduce overhead:
   ```sql
   ALTER SYSTEM SET ob_enable_sql_audit = 0;
   ```
   > **Note**: Re-enable after diagnosis: `ALTER SYSTEM SET ob_enable_sql_audit = 1;`
3. Check for replica role changes by examining observer logs for "leader revoke" or "election" events
4. If `RETRY_CNT > 0`, verify OBProxy is routing to leader replicas (see Problem 5)

**Verification**
- After disabling `sql_audit`, monitor RT—spikes should diminish if audit was the cause
- Post-fix, `QUEUE_TIME` and `GET_PLAN_TIME` should return to baseline (< 10% of total RT)

### Problem 3: Suboptimal Execution Plan Selection

**Symptoms**
- Error message: Not applicable
- Behavior: Query performs full table scan despite selective filters; `EST. ROWS` in `EXPLAIN` vastly differs from actual rows
- Context: Common after large data loads or before first major compaction

**Root Cause**
- Optimizer relies on outdated statistics, leading to incorrect cost model decisions
- Missing covering index forces table lookups after index scan
- Join order or algorithm chosen inefficiently due to skewed data distribution estimates

**Solution**
1. Run `EXPLAIN` and compare `EST. ROWS` with actual row counts:
   ```sql
   EXPLAIN FORMAT = 'EXTENDED' SELECT * FROM orders WHERE customer_id = 123;
   ```
2. If statistics are inaccurate, trigger a major compaction to refresh them:
   ```sql
   ALTER SYSTEM MAJOR FREEZE;
   ```
3. Consider creating a covering index if query selects specific columns:
   ```sql
   CREATE INDEX idx_orders_cust_status ON orders(customer_id, status) INCLUDE (order_date, amount);
   ```
4. Use optimizer hints as a last resort (e.g., `/*+ USE_INDEX(orders idx_orders_cust_status) */`)

**Verification**
- Post-compaction, `EST. ROWS` should align with actual results
- Execution plan shows `INDEX ONLY SCAN` or reduced `COST` value

### Problem 4: Excessive Hard Parsing Due to Literal Values

**Symptoms**
- Error message: Not applicable
- Behavior: High CPU usage on observer nodes; many similar queries with different literal values in `gv$sql_audit`
- Context: Applications concatenating user input directly into SQL strings

**Root Cause**
- Each unique literal value causes a hard parse, consuming CPU and plan cache space
- Prevents reuse of parsed plans (soft parsing), reducing scalability

**Solution**
1. Identify non-parameterized queries in `gv$sql_audit`:
   ```sql
   SELECT query_sql, COUNT(*) 
   FROM gv$sql_audit 
   WHERE request_time > NOW() - INTERVAL '1' HOUR
   GROUP BY query_sql
   HAVING COUNT(*) = 1  -- likely literals
   ORDER BY COUNT(*) ASC
   LIMIT 20;
   ```
2. Modify application code to use bind variables:
   ```python
   # Instead of: cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
   cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
   ```
3. Enable plan cache monitoring to confirm improvement:
   ```sql
   SELECT hit_count, total_count FROM gv$plan_cache_stat;
   ```

**Verification**
- Plan cache hit ratio (`hit_count / total_count`) increases (> 90%)
- CPU usage drops during peak query load

### Problem 5: OBProxy Routing to Non-Leader Replica

**Symptoms**
- Error message: Not applicable
- Behavior: Elevated `RETRY_CNT` in `gv$sql_audit`; increased latency for write-heavy or strongly consistent reads
- Context: Multi-replica OceanBase cluster with OBProxy in front

**Root Cause**
- OBProxy misconfigured or stale metadata causes routing to follower replicas
- Follower cannot serve strong-consistency requests, triggering retry to leader

**Solution**
1. Check `RETRY_CNT` in `gv$sql_audit`:
   ```sql
   SELECT query_sql, retry_cnt, svr_ip 
   FROM gv$sql_audit 
   WHERE retry_cnt > 0 
   ORDER BY request_time DESC 
   LIMIT 10;
   ```
2. Verify OBProxy is using up-to-date location cache:
   - Ensure `obproxy_config_server_url` points to correct config server
   - Restart OBProxy if metadata is stale
3. Confirm read consistency level—use `READ_CONSISTENCY(WEAK)` for non-critical reads to allow follower reads

**Verification**
- `RETRY_CNT` drops to 0 for new queries
- Latency stabilizes across all queries

## FAQ

**Q: How do I check if my SQL query is using an index?**  
A: Use `EXPLAIN FORMAT = 'EXTENDED' <your_query>`. Look for operators like `INDEX RANGE SCAN` or `INDEX ONLY SCAN`. Avoid `TABLE SCAN` unless the table is small.

**Q: What permissions are required to query `gv$sql_audit`?**  
A: You need administrative privileges (typically the `SYS` tenant or a user with `SELECT` on system views). Regular application users cannot access audit tables by default.

**Q: How do I enable debug logging for SQL execution?**  
A: OceanBase does not use traditional debug logs for SQL tuning. Instead, rely on `gv$sql_audit`, `EXPLAIN`, and observer trace logs (accessible via `ob_admin` tools) for deep diagnostics.

**Q: Why is my query slow even though the execution plan looks good?**  
A: Check `QUEUE_TIME` and `GET_PLAN_TIME` in `gv$sql_audit`. High values indicate system-level contention (CPU, memory) or plan generation overhead—not query logic issues.

**Q: How often should I run major compaction for statistics?**  
A: Major compaction refreshes statistics automatically. Schedule it after large data imports or daily during off-peak hours using `ALTER SYSTEM MAJOR FREEZE`. Avoid frequent compactions in high-write workloads.