# pai-monitor

Part of **PAI**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Monitor and debug AI jobs](../../intent/pai-monitor-jobs/SKILL.md). If you're unsure which path to take, check the routing skill first.

# Platform for AI (PAI) Monitoring & Observability

## Capabilities Overview

| Sub-capability | Calling Mode | Description |
|----------------|--------------|-------------|
| Get User Metrics | Synchronous | Retrieves metric data for a user, including resource group details, user monitoring data, and usage statistics. Supports pagination, sorting, and filtering by time step. |
| Get Metric Data | Synchronous | Retrieves metric data with timestamp and value information used in PAI monitoring systems. |
| Get Monitoring Data | Synchronous | Retrieves general monitoring data represented as timestamp-value pairs from the system. |
| Query System Logs | Synchronous | Queries system logs with parameters for time range, node ID, and pagination. Returns paginated log entries with metadata. |
| Manage Diagnostic Task | Synchronous | Creates diagnostic tasks and lists or queries their results for cluster and node analysis. |
| Manage Resource Log Configurations | Synchronous | Enables, disables, or queries log shipper configurations for resource groups using Log Service. |
| Get XTrace Token | Synchronous | Retrieves authentication token and endpoints required for uploading trace data to the XTrace service. |
| Retrieve Event Details | Synchronous | Gets detailed information about specific events in PAI, including trigger source, type, and content. |

## API Calling Patterns

### Authentication
The primary authentication method is **Bearer Token**.

- Include the header: `Authorization: Bearer <your_api_key>`
- Store your API key in the environment variable: `DASHSCOPE_API_KEY`
- Example: `export DASHSCOPE_API_KEY=sk-xxxxxx`

While other authentication mechanisms may exist in Alibaba Cloud, all PAI Monitoring & Observability APIs documented here use Bearer Token via the `Authorization` header.
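
The header construction above can be sketched in Python as follows. The helper name `build_auth_headers` and its `env` argument are illustrative assumptions, not part of any PAI SDK; the only details taken from this document are the header format and the `DASHSCOPE_API_KEY` variable name.

```python
import os


def build_auth_headers(env=None):
    """Return request headers carrying the Bearer token described above."""
    # `env` is a hypothetical override so the sketch can run without a real key.
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending "Bearer " with an empty token.
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# Usage with an explicit mapping and the placeholder key from the example above.
headers = build_auth_headers({"DASHSCOPE_API_KEY": "sk-xxxxxx"})
print(headers["Authorization"])  # prints: Bearer sk-xxxxxx
```

Failing before the request is sent makes a missing key easier to tell apart from a rejected one.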

### Service Endpoint
APIs are served from two base domains, one for China regions and one for international regions:

- **China regions**: `https://api.aliyun.com/api/{service}/{version}/{operation}`
- **International regions**: `https://api.alibabacloud.com/api/{service}/{version}/{operation}`

Common regions include:
- `cn-hangzhou`
- `cn-shanghai`
- `cn-beijing`

Note: Some APIs (e.g., GetUserViewMetrics) use a single global endpoint (`api.aliyun.com`), while others (e.g., ListSyslogs, CreateDiagnosticTask) have separate China/international endpoints.
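
As a rough illustration of the URL template above, the sketch below fills in `{service}`, `{version}`, and `{operation}` for either base domain. The function name `build_endpoint` and the `site` labels (`"china"`, `"international"`) are assumptions made for this example; which domain a given API actually accepts still has to be checked per API, as the note above says.

```python
BASE_DOMAINS = {
    "china": "https://api.aliyun.com",
    "international": "https://api.alibabacloud.com",
}


def build_endpoint(service, version, operation, site="china"):
    """Fill the {service}/{version}/{operation} template for one base domain."""
    if site not in BASE_DOMAINS:
        raise ValueError(f"site must be one of {sorted(BASE_DOMAINS)}")
    return f"{BASE_DOMAINS[site]}/api/{service}/{version}/{operation}"


print(build_endpoint("eflo-controller", "2022-12-15", "ListSyslogs"))
# prints: https://api.aliyun.com/api/eflo-controller/2022-12-15/ListSyslogs
```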

### Synchronous API Pattern
All functions in this domain follow a **Synchronous** calling pattern:

1. Send an HTTP request (GET, POST, or DELETE) to the appropriate endpoint
2. Include required parameters as query string (GET/DELETE) or JSON body (POST)
3. Provide authentication via the `Authorization: Bearer` header
4. Receive a complete JSON response immediately
5. Parse the response for success data or error codes

No polling, streaming, or async task handling is required—responses are returned in a single round trip.
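
The five steps above could be sketched with only the Python standard library, as below. `build_request` and `send` are hypothetical helper names, and the sketch assumes that GET/DELETE parameters are plain query-string pairs and POST parameters form a flat JSON object.

```python
import json
import urllib.parse
import urllib.request


def build_request(method, url, params, api_key):
    """Steps 1-3: prepare (but do not send) an authenticated request."""
    method = method.upper()
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if method in ("GET", "DELETE"):
        # GET/DELETE carry parameters in the query string.
        if params:
            url = f"{url}?{urllib.parse.urlencode(params)}"
    elif method == "POST":
        # POST carries parameters as a JSON body.
        data = json.dumps(params).encode("utf-8")
        headers["Content-Type"] = "application/json"
    else:
        raise ValueError(f"unsupported method: {method}")
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request, timeout=10):
    """Steps 4-5: one round trip, then parse the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx/5xx responses, so error codes would be handled in an `except` block around `send`.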

## Parameter Reference

### Get User Metrics

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| ResourceGroupID | string | Yes | — | — | Resource group ID. Each resource group has a globally unique ID. |
| WorkspaceId | string | No | — | — | Workspace ID. |
| PageNumber | string | Yes | — | — | Page number of the current page. |
| PageSize | string | Yes | — | — | Number of items per page. |
| SortBy | string | No | — | — | Field to sort by. |
| Order | string | No | — | one of: asc, desc | Sort order. Valid values: `asc` for ascending and `desc` for descending. |
| UserId | string | No | — | — | The ID of the user's Alibaba Cloud account. |
| TimeStep | string | No | 5m | supported units: h, m, s | Time step. Default: `5m`. Supported units: `h` (hours), `m` (minutes), `s` (seconds). |
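
A small client-side check of these parameters might look like the sketch below. The function name is hypothetical, and the `TimeStep` pattern assumes a single positive integer followed by one unit (as in the default `5m`); compound values such as `1h30m` are not covered because the table does not say whether they are accepted.

```python
import re

# Assumed shape: a positive integer followed by h, m, or s (e.g. "5m").
TIME_STEP_PATTERN = re.compile(r"^[1-9]\d*[hms]$")


def build_user_metrics_params(resource_group_id, page_number, page_size,
                              order=None, time_step=None):
    """Build and validate query parameters for Get User Metrics."""
    if not resource_group_id:
        raise ValueError("ResourceGroupID is required")
    params = {
        "ResourceGroupID": resource_group_id,
        # The table types both paging fields as strings.
        "PageNumber": str(page_number),
        "PageSize": str(page_size),
    }
    if order is not None:
        if order not in ("asc", "desc"):
            raise ValueError("Order must be 'asc' or 'desc'")
        params["Order"] = order
    if time_step is not None:
        if not TIME_STEP_PATTERN.match(time_step):
            raise ValueError("TimeStep must look like '5m', '1h', or '30s'")
        params["TimeStep"] = time_step
    return params


print(build_user_metrics_params("rgf0zhfqn1d4ity2", 1, 10, order="desc", time_step="5m"))
```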

### Query System Logs

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| ToTime | string | Yes | — | — | The end time (UNIX timestamp). |
| FromTime | string | Yes | — | — | The start time (UNIX timestamp). |
| NodeId | string | Yes | — | — | The node ID. |
| Query | string | No | — | — | The query condition. |
| Reverse | boolean | No | — | — | Specifies whether to sort results by time in descending order. |
| NextToken | string | No | — | — | Token for the next page of results. |
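
Paging with `NextToken` can be sketched as a generator. This is a minimal sketch under two assumptions that the table does not confirm: each response carries a `NextToken` field when more results exist, and that field is absent or empty on the last page. `fetch_page` is a hypothetical callable that sends one request body and returns the parsed JSON response; the `Entries` field in the stand-in below is likewise invented for the demonstration.

```python
def iter_syslog_pages(fetch_page, node_id, from_time, to_time,
                      query=None, max_pages=100):
    """Yield response pages until no NextToken is returned."""
    body = {
        "NodeId": node_id,
        # The table types both timestamps as strings (UNIX seconds).
        "FromTime": str(from_time),
        "ToTime": str(to_time),
    }
    if query:
        body["Query"] = query
    for _ in range(max_pages):  # hard cap guards against a token that never ends
        page = fetch_page(dict(body))
        yield page
        token = page.get("NextToken")
        if not token:
            return
        body["NextToken"] = token


def fake_fetch(body):
    """Stand-in for a real request: serves two pages."""
    if "NextToken" not in body:
        return {"Entries": ["first"], "NextToken": "t1"}
    return {"Entries": ["second"]}


pages = list(iter_syslog_pages(fake_fetch, "e01-cn-9lb36u4s601", 1687363300, 1687363400))
print(len(pages))  # prints: 2
```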

### Manage Diagnostic Task

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| ClusterId | string | No | — | — | The cluster ID. |
| NodeIds | array | No | — | — | The IDs of the nodes. |
| DiagnosticType | string | No | — | — | The diagnostics type. |
| StartTime | string | No | — | timestamp format, accurate to the minute | Start time in seconds (UNIX timestamp). |
| EndTime | string | No | — | timestamp format, accurate to the minute | End time in seconds (UNIX timestamp). |
| DiagnosticId | string | No | — | — | The diagnostic task ID (for Describe/List operations). |
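
Because `StartTime` and `EndTime` are UNIX timestamps in seconds that must be accurate to the minute, one way to prepare them is to truncate to the whole minute before sending. The helper below is an illustrative sketch; its name is an assumption, and it truncates rather than rounds, which is one of several reasonable readings of "accurate to the minute".

```python
from datetime import datetime, timezone


def to_minute_timestamp(moment):
    """Return a UNIX timestamp in seconds, truncated to the whole minute."""
    if moment.tzinfo is None:
        # Naive datetimes are ambiguous; require an explicit timezone.
        raise ValueError("moment must be timezone-aware")
    seconds = int(moment.timestamp())
    return str(seconds - seconds % 60)


start = to_minute_timestamp(datetime(2023, 6, 21, 16, 1, 40, tzinfo=timezone.utc))
print(start)  # prints: 1687363260
```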

### Manage Resource Log Configurations

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| ClusterId | string | Yes | — | — | The ID of the region (e.g., `cn-shanghai`). |
| ResourceId | string | Yes | — | — | The ID of the resource group. |
| ProjectName | string | Yes | — | — | The Log Service project name. |
| LogStore | string | Yes | — | — | The Logstore name in Log Service. |

## Code Examples

### Get User Metrics - Python - All Regions

```python
import os

import requests

url = "https://api.aliyun.com/api/PaiStudio/2022-01-12/GetUserViewMetrics"
params = {
    "ResourceGroupID": "rgf0zhfqn1d4ity2",
    "PageNumber": "1",
    "PageSize": "10",
    "SortBy": "GmtModified",
    "Order": "desc",  # valid values: asc, desc
}
headers = {
    # Read the key from the environment; Python does not expand "$VAR" in strings.
    "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
    "Content-Type": "application/json",
}

response = requests.get(url, params=params, headers=headers)
print(response.json())
```

### Query System Logs - Bash - China Region

```bash
curl -X POST 'https://api.aliyun.com/api/eflo-controller/2022-12-15/ListSyslogs' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "FromTime": "1687363300",
  "ToTime": "1687363400",
  "NodeId": "e01-cn-9lb36u4s601"
}'
```

### Enable Log Shipper - Bash - All Regions

```bash
curl -X POST https://api.aliyun.com/api/eas/2021-07-01/CreateResourceLog \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "ClusterId": "cn-shanghai",
  "ResourceId": "eas-r-asdasdasd****",
  "body": {
    "ProjectName": "eas-r-asdasdasd-sls",
    "LogStore": "access_log"
  }
}'
```

### Get XTrace Token - Bash - All Regions

```bash
curl -X GET 'https://api.alibabacloud.com/api/PaiLLMTrace/2024-03-11/GetXtraceToken' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json'
```

### Disable Log Shipper - Bash - All Regions

```bash
curl -X DELETE \
  'https://api.aliyun.com/api/eas/2021-07-01/DeleteResourceLog?ClusterId=cn-shanghai&ResourceId=eas-r-asdas****' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY"
```

### Describe Resource Log Configuration - Bash - All Regions

```bash
curl -X GET 'https://api.aliyun.com/api/eas/2021-07-01/DescribeResourceLog?ClusterId=cn-shanghai&ResourceId=eas-r-asdas****' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json'
```

## Response Format

The following example shows a response from Get User Metrics (`GetUserViewMetrics`).

```json
{
  "ResourceGroupId": "rgf0zhfqn1d4ity2",
  "UserMetrics": [
    {
      "ResourceGroupId": "rg17tmvwiokh****",
      "TotalCPU": 1000,
      "CPUUsageRate": "59",
      "GPUUsageRate": "10",
      "TotalMemory": 10240,
      "TotalGPU": 1,
      "MemoryUsageRate": "20",
      "RequestCPU": 100,
      "RequestGPU": 10,
      "RequestMemory": 102400000,
      "NetworkInputRate": "1",
      "NetworkOutputRate": "1",
      "DiskReadRate": "22",
      "DiskWriteRate": "22",
      "JobType": "PyTorch",
      "UserId": "16111111****",
      "CPUNodeNumber": 2,
      "GPUNodeNumber": 1,
      "CpuJobNames": [
        "dlcxxxxx"
      ],
      "GpuJobNames": [
        "dlcyyyyy"
      ],
      "NodeNames": [
        "lrnxxxxxx"
      ],
      "CpuNodeNames": [
        "ecixxxxxx"
      ],
      "GpuNodeNames": [
        "lrnxxxxxxx"
      ]
    }
  ],
  "Summary": {
    "ResourceGroupId": "rg17tmvwiokh****",
    "TotalCPU": 1000,
    "CPUUsageRate": "59",
    "GPUUsageRate": "10",
    "TotalMemory": 10240,
    "TotalGPU": 1,
    "MemoryUsageRate": "20",
    "RequestCPU": 100,
    "RequestGPU": 10,
    "RequestMemory": 102400000,
    "NetworkInputRate": "1",
    "NetworkOutputRate": "1",
    "DiskReadRate": "22",
    "DiskWriteRate": "22",
    "JobType": "PyTorch",
    "UserId": "16111111****",
    "CPUNodeNumber": 2,
    "GPUNodeNumber": 1,
    "CpuJobNames": [
      "dlcxxxxx"
    ],
    "GpuJobNames": [
      "dlcyyyyy"
    ],
    "NodeNames": [
      "lrnxxxxxx"
    ],
    "CpuNodeNames": [
      "ecixxxxxx"
    ],
    "GpuNodeNames": [
      "lrnxxxxxxx"
    ]
  },
  "Total": 2
}
```

**Key Fields**:
- `ResourceGroupId` — Identifier of the resource group being monitored
- `UserMetrics[].CPUUsageRate` — CPU utilization percentage as a string
- `UserMetrics[].GPUUsageRate` — GPU utilization percentage as a string
- `UserMetrics[].MemoryUsageRate` — Memory utilization percentage as a string
- `UserMetrics[].JobType` — Framework type (e.g., PyTorch, TensorFlow)
- `UserMetrics[].CpuJobNames` / `GpuJobNames` — Names of jobs running on CPU/GPU
- `Summary` — Aggregated metrics across all users in the resource group
- `Total` — Total number of user records matching the query
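
Since the usage rates arrive as strings, a caller will usually convert them before comparing or plotting. The sketch below shows one way to flatten `UserMetrics` into rows with numeric rates; the function names and the output keys (`CPU`, `GPU`, `Memory`, `Jobs`) are illustrative choices, and it assumes every rate string parses as a plain number, as in the sample response.

```python
def _to_percent(value):
    """Convert a usage-rate string such as "59" to a float, or None if absent."""
    return None if value is None else float(value)


def summarize_user_metrics(response):
    """Return one flat row per entry in UserMetrics."""
    rows = []
    for entry in response.get("UserMetrics", []):
        rows.append({
            "UserId": entry.get("UserId"),
            "JobType": entry.get("JobType"),
            "CPU": _to_percent(entry.get("CPUUsageRate")),
            "GPU": _to_percent(entry.get("GPUUsageRate")),
            "Memory": _to_percent(entry.get("MemoryUsageRate")),
            "Jobs": entry.get("CpuJobNames", []) + entry.get("GpuJobNames", []),
        })
    return rows


sample = {
    "UserMetrics": [{
        "UserId": "16111111****",
        "JobType": "PyTorch",
        "CPUUsageRate": "59",
        "GPUUsageRate": "10",
        "MemoryUsageRate": "20",
        "CpuJobNames": ["dlcxxxxx"],
        "GpuJobNames": ["dlcyyyyy"],
    }],
    "Total": 2,
}
print(summarize_user_metrics(sample)[0]["CPU"])  # prints: 59.0
```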

## Error Handling

| Error Code | Description | Recommended Action |
|-------------------|----------------------------|----------------------------------------|
| 400 | Bad Request. The request parameters are invalid or missing required fields. | Validate all required parameters and ensure correct formatting (e.g., timestamps, enums). |
| 401 | Unauthorized. The API key or authentication token is invalid or missing. | Check that `DASHSCOPE_API_KEY` is set correctly and the token is valid. |
| 403 | Forbidden. The user does not have permission to access the requested resource. | Verify RAM policies grant access to the specified resource group or cluster. |
| 404 | Not Found. The specified ResourceGroupID or WorkspaceId does not exist. | Confirm the resource ID exists and is spelled correctly. |
| 429 | Too Many Requests. The request rate exceeds the allowed limit. Wait before retrying. | Implement exponential backoff; reduce request frequency. |
| 500 | Internal Server Error. An unexpected error occurred on the server side. | Retry with backoff; contact support if persistent. |
| 503 | Service Unavailable. The service is temporarily unavailable due to maintenance or overload. | Wait and retry later. |
| InvalidParameter | One or more parameters are invalid. Check that required parameters are provided and values are within valid ranges. | Review parameter constraints (e.g., time formats, enum values). |
| UnauthorizedOperation | The caller does not have sufficient permissions to perform this operation. Ensure the API key has proper access rights. | Assign appropriate RAM permissions for diagnostic or log operations. |
| Throttling | Request rate exceeds the allowed limit. Reduce the frequency of requests or implement exponential backoff. | Add delay between requests; use pagination to reduce load. |
| NoPermission | The caller does not have sufficient permissions to access the Xtrace service. Ensure the RAM user or role has the required policy attached. | Attach `AliyunARMSFullAccess` or equivalent tracing policy to the RAM role. |
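
One way to act on this table in client code is to separate codes worth retrying from codes that need a fix on the caller's side. The grouping below is a reading of the "Recommended Action" column, not an official classification, and the names `RETRYABLE_CODES` and `is_retryable` are assumptions for this sketch.

```python
# Codes whose recommended action is to wait, back off, or retry.
RETRYABLE_CODES = {"429", "500", "503", "Throttling"}


def is_retryable(code):
    """Return True when the table above recommends retrying for this code."""
    # HTTP statuses may arrive as ints and service codes as strings.
    return str(code) in RETRYABLE_CODES


for code in (400, 429, "Throttling", "NoPermission"):
    print(code, is_retryable(code))
```

Everything outside the retryable set (invalid parameters, missing permissions, unknown resources) will fail the same way on a retry, so those cases are better surfaced to the caller immediately.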

### Rate Limits & Retry
- **GetUserViewMetrics**: 100 QPS per user
- **DeleteResourceLog**: 10 QPS per account
- **DescribeResourceLog**: 100 QPS per account
- **Diagnostic APIs**: 100 QPS

When encountering `429` or `Throttling` errors:
- Use exponential backoff (e.g., wait 1s, 2s, 4s, 8s...)
- Respect the `Retry-After` header if present
- For paginated responses, process pages sequentially rather than in parallel
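
The retry guidance above might be implemented along these lines. This is a sketch under stated assumptions: `call` is a hypothetical callable returning a `(status, headers, body)` tuple with `body` as a dict, the service error code is assumed to appear in a `Code` field of the body, and `Retry-After` is assumed to carry a number of seconds rather than an HTTP date.

```python
import time


def call_with_backoff(call, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call` while it reports throttling, then return (status, body)."""
    for attempt in range(max_retries + 1):
        status, headers, body = call()
        throttled = status == 429 or body.get("Code") == "Throttling"
        if not throttled:
            return status, body
        if attempt == max_retries:
            break
        delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, 8s, ...
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            try:
                # Prefer the server's hint when it parses as seconds.
                delay = float(retry_after)
            except ValueError:
                pass  # keep the exponential delay
        sleep(delay)
    raise RuntimeError("request still throttled after retries")
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without real waiting.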

## Environment Requirements

- Set the environment variable: `export DASHSCOPE_API_KEY=your_api_key_here`
- For Python: `pip install requests` (used in examples)
- No specific SDK is required—all APIs are RESTful and work with standard HTTP clients
- Ensure your Alibaba Cloud account has appropriate RAM permissions for PAI Monitoring & Observability actions

## FAQ

Q: How do I authenticate API requests?
A: Use a Bearer Token in the `Authorization` header: `Authorization: Bearer $DASHSCOPE_API_KEY`. Set your API key in the `DASHSCOPE_API_KEY` environment variable.

Q: Why am I getting a 403 Forbidden error when querying logs or metrics?
A: Your Alibaba Cloud RAM user or role lacks permissions for the target resource group or cluster. Attach policies like `AliyunPAIFullAccess` or custom policies granting `pai:Get*`, `pai:List*`, and `pai:Describe*` actions.

Q: What time format should I use for log queries?
A: Use UNIX timestamps in seconds (e.g., `1687363300`) for `FromTime` and `ToTime` in system log queries. For diagnostic tasks, timestamps must be accurate to the minute.

Q: Can I get real-time streaming metrics from PAI?
A: No—all monitoring APIs in this domain are synchronous and return point-in-time snapshots. Poll periodically (respecting rate limits) for near-real-time data.

Q: Is the XTrace token API free to use?
A: Yes—the `GetXtraceToken` API is free with no usage limits, though you still need valid authentication and permissions.

## Pricing & Billing

### Billing Model
All APIs use **per-request billing**, meaning each successful or failed API call counts as one billable request, regardless of response size.

### Price Reference

| Tier / Specification | Input Price | Output Price |
|----------------------|-------------|--------------|
| GetUserViewMetrics (standard) | 0.001 / | 0.001 / |
| ListSyslogs (default) | 0.0001 / | 0.0001 / |
| CreateDiagnosticTask (standard) | 0.001 / | 0.001 / |
| DeleteResourceLog (standard) | 0.001 / | 0.001 / |
| DescribeResourceLog (standard) | 0.0001 / | 0.0001 / |

### Free Tier
- **GetUserViewMetrics**: 1000 free calls per month
- **ListSyslogs**: 1000 free requests per month
- **CreateDiagnosticTask**: 100 free requests per month
- **DescribeDiagnosticResult**: 1000 free calls per month
- **DeleteResourceLog**: 100 free calls per month
- **DescribeResourceLog**: 1000 free calls per month
- **GetXtraceToken**: Free with no usage limits

### Usage Limits
- **GetUserViewMetrics**: 100 QPS, max 1000 records per request
- **ListSyslogs**: Max 100 log entries per request
- **Diagnostic APIs**: 100 QPS
- **DeleteResourceLog**: 10 QPS

### Billing Notes
- Billing occurs on every API call attempt, even if it fails (e.g., 4xx/5xx errors)
- Free tier quotas reset monthly
- Prices shown are in Chinese Yuan 
- Exceeding free tier results in standard per-request charges
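
To estimate how many calls in a month fall outside the free tier, the quotas listed above can be put into a lookup table. The sketch below only counts billable requests; it does not compute a cost. The names `FREE_CALLS_PER_MONTH` and `billable_calls` are illustrative.

```python
# Monthly free quotas copied from the Free Tier list above.
FREE_CALLS_PER_MONTH = {
    "GetUserViewMetrics": 1000,
    "ListSyslogs": 1000,
    "CreateDiagnosticTask": 100,
    "DescribeDiagnosticResult": 1000,
    "DeleteResourceLog": 100,
    "DescribeResourceLog": 1000,
}


def billable_calls(operation, calls_this_month):
    """Return how many of this month's calls exceed the free quota."""
    if operation == "GetXtraceToken":
        return 0  # listed as free with no usage limits
    free = FREE_CALLS_PER_MONTH[operation]  # KeyError for unlisted operations
    return max(0, calls_this_month - free)


print(billable_calls("CreateDiagnosticTask", 250))  # prints: 150
```

Failed calls count toward the total as well, per the billing notes, so `calls_this_month` should include every attempt rather than only successful ones.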