Debug slow AI job querying database

An AI training or inference job on PAI reads its data from OceanBase (or RDS) and runs slowly. The developer monitors the PAI job to identify resource bottlenecks, finds that slow database queries are the cause, and then optimizes those SQL queries in OceanBase.

Products involved

Scenario

When a PAI training or inference job shows high io_wait or prolonged GPU idle time, the bottleneck typically stems from data ingestion rather than compute. This workflow correlates PAI resource telemetry with OceanBase query diagnostics to isolate slow SQL, apply targeted optimizations, and restore pipeline throughput.
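As a rough illustration of the triage described above, the sketch below labels a set of io_wait samples using the two thresholds this guide works with (above 40% flags a job, below 15% is the target after optimization). The function name, the label strings, and the idea of averaging samples are assumptions made for illustration; they are not part of PAI.

```python
from statistics import mean

# Thresholds taken from this guide: above 40% io_wait flags a job,
# below 15% is the target once the queries are optimized.
IO_WAIT_FLAG_PCT = 40.0
IO_WAIT_TARGET_PCT = 15.0


def classify_io_wait(samples):
    """Label io_wait percentages sampled during the data-loading phase.

    Hypothetical helper: averages the samples and compares the mean
    against the guide's two thresholds.
    """
    if not samples:
        raise ValueError("need at least one io_wait sample")
    avg = mean(samples)
    if avg > IO_WAIT_FLAG_PCT:
        return "data-bound"
    if avg < IO_WAIT_TARGET_PCT:
        return "healthy"
    return "borderline"


if __name__ == "__main__":
    print(classify_io_wait([55.0, 62.0, 48.0]))
```

A job labeled "data-bound" is a candidate for the query-level investigation in the steps that follow; "borderline" jobs may need more samples before drawing a conclusion.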

Integration steps

  1. Retrieve PAI job metrics: Use the PAI CLI to pull compute and I/O stats for your TrainingJobId.

     ```bash
     pai-cli training-job describe --job-id <TrainingJobId> --metrics cpu,gpu,io_wait
     ```

     Flag jobs where io_wait > 40% during the data-loading epoch.

  2. Extract the offending SQL: Fetch job logs via the PAI API endpoint /api/v1/jobs/{TrainingJobId}/logs. Filter for JDBC execution traces and copy the exact query string.

  3. Analyze execution plan in OceanBase: Connect to your cluster and run:

     ```sql
     EXPLAIN SELECT * FROM training_data
     WHERE feature_ts BETWEEN '2024-01-01' AND '2024-01-02';
     ```

     Look for table_scan or high cost in the plan output.

  4. Verify runtime performance: Query OceanBase’s audit view to confirm whether the statement ran a full table scan and how long it took:

     ```sql
     SELECT query_sql, elapsed_time, table_scan
     FROM oceanbase.GV$OB_SQL_AUDIT
     WHERE query_sql LIKE '%training_data%';
     ```

  5. Apply optimization: Add an index whose leading column matches the filter predicate, then clear stale plans. The index covers the query only if the job selects just the indexed columns instead of SELECT *; otherwise each matching row still requires a lookup in the base table.

     ```sql
     CREATE INDEX idx_feature_ts ON training_data(feature_ts, label_id);
     ALTER SYSTEM FLUSH PLAN CACHE;
     ```

  6. Validate: Restart the PAI job. Re-run step 1 and confirm io_wait drops below 15% and GPU utilization stabilizes.
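The effect of the plan-analysis and index steps above can be rehearsed locally before touching a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for OceanBase, so the plan wording and the optimizer differ from what OceanBase prints; only the table, column, and index names come from this guide, and the extra `id` and `payload` columns are assumptions added so that `SELECT *` returns more than the indexed columns.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the textual 'detail' column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


# In-memory stand-in for the training_data table (id and payload are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_data ("
    "id INTEGER PRIMARY KEY, feature_ts TEXT, label_id INTEGER, payload TEXT)"
)

range_filter = "WHERE feature_ts BETWEEN '2024-01-01' AND '2024-01-02'"
select_all = "SELECT * FROM training_data " + range_filter
select_indexed = "SELECT feature_ts, label_id FROM training_data " + range_filter

# Plan before the index exists: a scan of the whole table.
before = plan_details(conn, select_all)

conn.execute("CREATE INDEX idx_feature_ts ON training_data(feature_ts, label_id)")

# Same query after the index: a range search, but not a covering one.
after = plan_details(conn, select_all)

# Selecting only indexed columns lets the index answer the query by itself.
covered = plan_details(conn, select_indexed)

for label, plan in (("before", before), ("after", after), ("covered", covered)):
    print(label, plan)
```

The three plans show the distinction that matters for the optimization step: the index removes the full scan for both queries, but only the query restricted to `feature_ts` and `label_id` is served from the index alone.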

Architecture

PAI orchestrates compute workloads and pulls training batches via JDBC/ODBC. OceanBase executes SQL queries, manages indexes, and streams result sets back to PAI containers. Telemetry comes from both sides: PAI emits node-level metrics to CloudMonitor, while OceanBase exposes query execution plans and audit trails. The integration bridges PAI’s I/O telemetry with OceanBase’s SQL diagnostics to isolate data-fetch latency.
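One way to see data-fetch latency from inside the job itself, complementing the node-level metrics, is to time how long the training loop waits for each batch. The sketch below is a minimal, framework-agnostic wrapper; `timed_batches`, the `stats` dictionary keys, and the injectable `clock` parameter are hypothetical names chosen for illustration, and the batch source can be any iterable that pulls rows from the database.

```python
import time


def timed_batches(batches, stats, clock=time.perf_counter):
    """Yield items from `batches`, accumulating the time spent waiting for each.

    `stats` is a plain dict updated in place with 'fetch_seconds' and 'batches'.
    Time spent by the caller processing a batch is not counted.
    """
    iterator = iter(batches)
    while True:
        start = clock()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        stats["fetch_seconds"] = stats.get("fetch_seconds", 0.0) + (clock() - start)
        stats["batches"] = stats.get("batches", 0) + 1
        yield batch


if __name__ == "__main__":
    def slow_source():
        # Stand-in for a database-backed loader: each batch takes ~10 ms to arrive.
        for i in range(3):
            time.sleep(0.01)
            yield [i]

    stats = {}
    for batch in timed_batches(slow_source(), stats):
        pass  # the training step would run here
    print(stats)
```

Comparing `fetch_seconds` with the wall-clock duration of the epoch gives a rough share of time spent waiting on data; a large share points back to the query-side diagnostics rather than to compute.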

Prerequisites

Common pitfalls

Typical questions