# pai-training-job

Part of **PAI**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for the following routing skills. If you're unsure which path to take, check the corresponding routing skill:

> - [Monitor and debug AI jobs](../../intent/pai-monitor-jobs/SKILL.md)
> - [Train a machine learning model](../../intent/pai-train-model/SKILL.md)

# Platform for AI (PAI) Training Job Management

## Capabilities Overview

| Sub-capability | Calling Mode | Description |
|----------------|--------------|-------------|
| Manage Training Job Labels | Synchronous | Update and delete labels associated with training jobs. |
| Get Training Job Error Information | Synchronous | Retrieve detailed error information for failed training jobs. |
| Get Training Job Logs | Synchronous | Retrieve logs from training job execution. |
| List Training Job Events | Synchronous | List events associated with training jobs and their instances. |
| Manage Training Jobs | Synchronous | Create and list training jobs. |
| List Training Jobs | Synchronous | Retrieve a list of all training jobs in the system. |
| Configure Job Settings | Synchronous | Set up general configuration parameters for PAI jobs. |
| Manage Job Templates | Synchronous | Create, list, get, update, delete, and set default versions of job templates. |
| Manage Job Sanity Checks | Synchronous | Retrieve and list sanity check results for jobs. |
| Get Training Job Output Models | Synchronous | Retrieve models produced by completed training jobs. |
| Define Hyperparameters | Synchronous | Define hyperparameters for model training configurations. |
| Define Training Metrics | Synchronous | Define custom metrics to track during model training. |
| Configure TensorBoard Data Source | Synchronous | Set up the data source configuration for TensorBoard during model training. |
| Get Training Job Metrics | Synchronous | Retrieve latest and historical metrics from training jobs. |
| Get Job Metrics | Synchronous | Retrieve performance and monitoring metrics for jobs and pods. |

## API Calling Mode

### Authentication
Use Bearer Token authentication as the primary method.

- Header format: `Authorization: Bearer <your_api_key>`
- Environment variable: `DASHSCOPE_API_KEY`
- Some endpoints may not require authentication, but Bearer Token is recommended for most operations.
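
The header format above can be assembled from the environment variable along these lines. This is a minimal sketch: the helper name `auth_headers` is hypothetical, and only the header format and the variable name come from this section.

```python
import os


def auth_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Build request headers from the API key stored in an environment variable."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early instead of sending a request with an empty token.
        raise RuntimeError(f"Set {env_var} before calling the API")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Reading the key at call time keeps it out of source files and lets each environment supply its own credentials.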

### Service Endpoint
APIs use region-specific endpoints with the following pattern:

`https://api.aliyun.com/api/{service}/{version}/{operation}` (for China regions)  
`https://api.alibabacloud.com/api/{service}/{version}/{operation}` (for international regions)

Common regions include:
- `cn-hangzhou`
- `cn-shanghai`
- `cn-beijing`

Service names include `PaiStudio`, `pai-dlc`, and `PaiPlugin` depending on the specific API.
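
As a sketch of how the URL pattern expands, the helper below joins the base URL with a service, version, and operation. The function name and the `china`/`international` keys are illustrative labels, not part of the API.

```python
# Base URLs taken from the endpoint patterns above; the dictionary keys are
# hypothetical labels chosen for this example.
BASE_URLS = {
    "china": "https://api.aliyun.com/api",
    "international": "https://api.alibabacloud.com/api",
}


def build_url(service: str, version: str, operation: str, site: str = "china") -> str:
    """Expand {service}/{version}/{operation} against the chosen base URL."""
    return f"{BASE_URLS[site]}/{service}/{version}/{operation}"
```

For example, `build_url("PaiStudio", "2022-01-12", "ListTrainingJobs")` yields the URL used in the List Training Jobs example in this document.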

### Synchronous API Pattern
All APIs in this domain follow a synchronous calling pattern:

1. Send an HTTP request (GET, POST, PUT, or DELETE) to the appropriate endpoint
2. Include required parameters in the URL query string, path, or request body
3. Add the `Authorization: Bearer <your_api_key>` header
4. Receive an immediate JSON response with the requested data or operation result
5. Handle any errors based on HTTP status codes and error messages in the response

For operations that retrieve large datasets (like logs or metrics), use pagination parameters (`PageNumber`, `PageSize`) to manage response size.
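
The five steps above could be sketched with the Python standard library as follows. The function names `prepare_call` and `send` are hypothetical, and the split into "prepare" and "send" is a design choice of this example, made so that the request can be inspected before it goes out.

```python
import json
import urllib.parse
import urllib.request


def prepare_call(method, url, api_key, query=None, body=None):
    """Steps 1-3: assemble (but do not send) an authenticated request."""
    if query:
        url = f"{url}?{urllib.parse.urlencode(query)}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {api_key}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request, timeout=30):
    """Steps 4-5: send the request and decode the JSON response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx status codes,
    which the caller can catch to handle errors.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would then look like `send(prepare_call("GET", url, api_key, query={"PageNumber": 1, "PageSize": 10}))`.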

## Parameter Reference

### Manage Training Job Labels

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| TrainingJobId | string | true | | | The ID of the training task. |
| Keys | string | true | | | The keys of the labels. |
| Labels | array<object> | false | | | The list of labels. |
| Key | string | false | | | The tag key. |
| Value | string | false | | | The tag value. |

### Get Training Job Error Information

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| TrainingJobId | string | true | | | Training task ID. |

### Get Training Job Logs

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| TrainingJobId | string | true | | | The ID of the training task. |
| InstanceId | string | false | | | The instance ID. |
| PageNumber | integer | false | | | The page number. |
| PageSize | integer | false | | | The page size. |
| StartTime | string | false | | ISO 8601 format | The start UTC time in ISO 8601 format. If empty, the task start time is used. |
| EndTime | string | false | | ISO 8601 format | The end UTC time in ISO 8601 format. If empty, the current time is used. |
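
One way to produce the `StartTime` and `EndTime` values is to format timezone-aware datetimes as ISO 8601 UTC strings. The sketch below assumes the `Z`-suffixed form used in this document's examples; `iso8601_utc` and `log_query` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def iso8601_utc(moment: datetime) -> str:
    """Format a timezone-aware datetime as ISO 8601 in UTC, e.g. 2020-11-08T16:00:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def log_query(training_job_id: str, end: datetime, hours: int = 1, page_size: int = 100) -> dict:
    """Build query parameters for a log window that ends at `end`."""
    start = end - timedelta(hours=hours)
    return {
        "TrainingJobId": training_job_id,
        "StartTime": iso8601_utc(start),
        "EndTime": iso8601_utc(end),
        "PageNumber": 1,
        "PageSize": page_size,
    }
```

Passing an aware datetime matters here: `astimezone` interprets a naive datetime as local time, which would silently shift the window.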

### List Training Job Events

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| TrainingJobId | string | true | | | Training task ID. |
| PageNumber | integer | false | 1 | | Page number (default is 1). |
| PageSize | integer | false | 100 | | Page size (default is 100). |
| StartTime | string | false | | ISO 8601 format | Start time in UTC, in ISO 8601 format. If empty, the task start time is used. |
| EndTime | string | false | | ISO 8601 format | End time in UTC, in ISO 8601 format. If empty, the current time is used. |

### Manage Training Jobs

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| Algorithm | string | false | | | ID of the associated algorithm. |
| CampaignId | string | false | | | ID of the associated operational activity. |
| DataPath | string | false | | | Training data path. |
| Name | string | false | | | Training job name. |
| Remark | string | false | | | Remarks. |
| UserConfig | string | false | | | User configuration used to set parameters such as start_date and end_date to define the time range of modeling data. |
| Status | integer | false | | one of: 0, 1, 2, 3, 4 | Filter by training job status. Valid values: 0: In queue, 1: Submitted, 2: Running, 3: Succeeded, 4: Failed. |
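
The integer `Status` values are easier to read in client code when given names. The enum below mirrors the table above; the class name and the `is_finished` helper are illustrative only.

```python
from enum import IntEnum


class TrainingJobStatus(IntEnum):
    """Names for the Status filter values listed in the parameter table."""
    IN_QUEUE = 0
    SUBMITTED = 1
    RUNNING = 2
    SUCCEEDED = 3
    FAILED = 4


def is_finished(status: int) -> bool:
    """Return True when the job has reached a terminal state."""
    return TrainingJobStatus(status) in (
        TrainingJobStatus.SUCCEEDED,
        TrainingJobStatus.FAILED,
    )
```

Because `IntEnum` members compare equal to plain integers, `TrainingJobStatus.FAILED` can be passed directly as the `Status` query value.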

### Configure Job Settings

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| BusinessUserId | string | false | | | User ID associated with the job. |
| Caller | string | false | | | Caller. |
| Tags | object | false | | | Custom tags. |
| PipelineId | string | false | | | Workflow ID. |
| EnableTideResource | boolean | false | | | Enable the job to use tide resources. |
| EnableErrorMonitoringInAIMaster | boolean | false | | | Enable job fault tolerance monitoring. |
| ErrorMonitoringArgs | string | false | | | Specify configuration parameters for fault tolerance monitoring, such as whether to enable log hang-based detection. |
| EnableRDMA | boolean | false | | | Enable the job to use RDMA. |
| EnableOssAppend | boolean | false | | | Enable OSS append writes. |
| OversoldType | string | false | | | How the job uses oversold resources: not accepted, accepted, or only accepted. |
| AdvancedSettings | object | false | | | Additional advanced parameter settings. |
| Driver | string | false | | | NVIDIA driver configuration. |
| EnableSanityCheck | boolean | false | | | Enable computing power health check for the job. |
| SanityCheckArgs | string | false | | | Configuration parameters for computing power health check. |
| JobReservedMinutes | integer | false | | | Duration in minutes to retain the job after completion. |
| JobReservedPolicy | string | false | | | Policy for retaining the job after completion. |

### Manage Job Templates

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| TemplateName | string | true | | | The name of the job template. |
| Description | string | false | | | The description of the job template. |
| WorkspaceId | string | true | | | The ID of the workspace that contains the job template. |
| Metadata | object | false | | | User-defined key-value metadata. |
| Content | string | true | | | The configuration of the job template, which must be a JSON string containing the job configuration parameters. |
| Constraints | object | false | | | The field constraint rules. The key is a JSONPath expression, and the value is a constraint type: locked (cannot be overridden), overridable (can be overridden), or required (must be specified). |
| TemplateId | string | true | | | The template ID. |
| Version | integer/string | false | | one of: 'all', or a valid integer version number | The version to retrieve. If omitted, the default version is returned. Specify 'all' to retrieve all versions. |
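
Because `Content` must be a JSON string rather than a nested object, it is safer to serialize the job configuration programmatically than to escape it by hand. The following sketch assumes the three constraint types described above; the function name `template_payload` is hypothetical.

```python
import json

# Constraint types named in the Constraints parameter description.
CONSTRAINT_TYPES = {"locked", "overridable", "required"}


def template_payload(name, workspace_id, job_config, constraints=None, description=""):
    """Build a request body whose Content field is a JSON-encoded string."""
    constraints = constraints or {}
    invalid = {v for v in constraints.values() if v not in CONSTRAINT_TYPES}
    if invalid:
        raise ValueError(f"Unknown constraint types: {sorted(invalid)}")
    return {
        "TemplateName": name,
        "Description": description,
        "WorkspaceId": workspace_id,
        # json.dumps produces the escaped string the Content parameter expects.
        "Content": json.dumps(job_config),
        "Constraints": constraints,
    }
```

Serializing the whole payload with `json.dumps(template_payload(...))` then yields a body in the same shape as the Create Job Template example, with the inner quotes escaped automatically.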

### Get Training Job Metrics

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| TrainingJobId | string | true | | | The ID of the training task. |
| Names | string | false | | | The names of the metrics, as used when retrieving the latest metrics (for example, `Names=loss`). |
| Name | string | false | | | The name of the metric. |
| PageNumber | integer | false | | | The page number. |
| PageSize | integer | false | | | The number of items per page. |
| StartTime | string | false | | ISO 8601 format | The start time in UTC, in ISO 8601 format. If you omit this parameter, the task start time is used. |
| EndTime | string | false | | ISO 8601 format | The end time in UTC, in ISO 8601 format. If you omit this parameter, the current time is used. |
| Order | string | false | | one of: ASC, DESC | The sort order of returned metrics. Valid values: ASC or DESC. |
| MetricType | string | true | | one of: GpuCoreUsage, GpuMemoryUsage, CpuCoreUsage, MemoryUsage, NetworkInputRate, NetworkOutputRate, DiskReadRate, DiskWriteRate | Metric type. |
| TimeStep | string | false | | one of: 1h, 30m, 5m, 10s | Time interval. |
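
Since `MetricType` and `TimeStep` accept only the listed values, a client can reject typos before sending a request. This sketch uses `JobId` as the identifier, following the Get Job Metrics example in this document; the function name is hypothetical.

```python
METRIC_TYPES = frozenset({
    "GpuCoreUsage", "GpuMemoryUsage", "CpuCoreUsage", "MemoryUsage",
    "NetworkInputRate", "NetworkOutputRate", "DiskReadRate", "DiskWriteRate",
})
TIME_STEPS = frozenset({"1h", "30m", "5m", "10s"})


def job_metrics_query(job_id, metric_type, start_time=None, end_time=None, time_step=None):
    """Validate enumerated parameters and return the query dictionary."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"Unsupported MetricType: {metric_type!r}")
    if time_step is not None and time_step not in TIME_STEPS:
        raise ValueError(f"Unsupported TimeStep: {time_step!r}")
    query = {"JobId": job_id, "MetricType": metric_type}
    optional = {"StartTime": start_time, "EndTime": end_time, "TimeStep": time_step}
    # Omit optional parameters that were not supplied.
    query.update({key: value for key, value in optional.items() if value is not None})
    return query
```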

### Define Hyperparameters

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| DefaultValue | string | false | | | The default value of the hyperparameter. |
| Type | string | false | | | The type of the hyperparameter. |
| Description | string | false | | | The description of the hyperparameter. |
| Required | boolean | false | | | Specifies whether the parameter is required. |
| Name | string | false | | | The name of the parameter. |
| Range | object | false | | | The value range of the parameter. |
| DisplayName | string | false | | | The display name of the parameter. |

### Define Training Metrics

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| Description | string | false | | | The description of the metric. |
| Regex | string | true | | | The regular expression to collect metrics from logs. |
| Name | string | true | | | The name of the metric. |

## Code Examples

### List Training Jobs - Python - All Regions

```python
import requests

url = "https://api.aliyun.com/api/PaiStudio/2022-01-12/ListTrainingJobs"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
params = {
    "WorkspaceId": "12345",
    "PageNumber": 1,
    "PageSize": 10
}

response = requests.get(url, headers=headers, params=params)
print(response.json())
```

### Get Training Job Logs - Bash - All Regions

```bash
curl -X GET \
  'https://api.aliyun.com/api/PaiStudio/2022-01-12/ListTrainingJobLogs?TrainingJobId=train129f212o89d&InstanceId=train129f212o89d-master-0&StartTime=2020-11-08T16:00:00Z&EndTime=2020-11-09T16:00:00Z&PageSize=100' \
  -H 'Authorization: Bearer YOUR_API_KEY'
```

### Create Job Template - Curl - All Regions

```bash
curl -X POST https://api.aliyun.com/api/pai-dlc/2020-12-03/CreateJobTemplate \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "TemplateName": "job-template-example-1778047****",
  "Description": "Template description",
  "WorkspaceId": "15****05",
  "Metadata": {},
  "Content": "{\"WorkspaceId\":\"15****05\",\"JobType\":\"PyTorchJob\",\"UserCommand\":\"echo hello\",\"JobSpecs\":[{\"Type\":\"Worker\",\"PodCount\":1,\"Image\":\"dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai/pytorch:2.8.0-gpu-py313-cu129-ubuntu22.04-3995b779-1764361782\",\"EcsSpec\":\"ecs.gn7i-c8g1.2xlarge\"}],\"ResourceType\":\"ECS\",\"_ResourcePaymentType\":\"PostPaid\",\"CredentialConfig\":{\"EnableCredentialInject\":false},\"Accessibility\":\"PRIVATE\",\"Settings\":{\"JobReservedMinutes\":0,\"Tags\":{}}}",
  "Constraints": {"JobSpecs[0].Image":"locked","UserCommand":"locked","JobType":"locked"}
}'
```

### Delete Job Template - Python - All Regions

```python
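import os

import requests


def delete_job_template(template_id, api_key):
    """Delete a job template by ID and return the decoded JSON response."""
    url = "https://api.aliyun.com/api/pai-dlc/2020-12-03/DeleteJobTemplate"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Pass TemplateId through params so that requests URL-encodes it.
    params = {"TemplateId": template_id}
    response = requests.delete(url, headers=headers, params=params, timeout=30)
    return response.json()


# Example usage: read the key from the environment instead of hard-coding it.
result = delete_job_template("tplwk80096dw****", os.environ["DASHSCOPE_API_KEY"])
print(result)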
```

### Get Training Job Latest Metrics - Bash - All Regions

```bash
curl -X GET 'https://api.aliyun.com/api/PaiStudio/2022-01-12/GetTrainingJobLatestMetrics?TrainingJobId=train129f212o89d&Names=loss' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY"
```

### Update Training Job Labels - Curl - All Regions

```bash
curl -X POST https://api.aliyun.com/api/PaiStudio/2022-01-12/UpdateTrainingJobLabels/train76rcaupa2cz/labels \
-H "Authorization: Bearer <your-access-token>" \
-H "Content-Type: application/json" \
-d '{"Labels":[{"Key":"RootModelID","Value":"model-ad8cv770kl"}]}'
```

### Get Job Sanity Check Result - Curl - All Regions

```bash
curl -X GET 'https://api.aliyun.com/api/pai-dlc/2020-12-03/GetJobSanityCheckResult?JobId=dlcl5qxoxxxxx5iq&SanityCheckNumber=1&SanityCheckPhase=DeviceCheck&Token=eyJhbG******zI1NiIsInR5cCI6IkpXVCJ9.eyJle****jE3MDk1Mzk0NDIsImlhdCI6MTcwODkzNDY0MiwidXNlcl9pZCI6IjE3NTgwNTQxNjI0Mzg2NTUiLCJ0YXJnZXRfaWQiOiJkbGM1OGh1a2xyYzZwdGMyIiwidGFyZ2V0X3R5cGUiOiJqb2IifQ.GNL7jo6****mgKKv0QeGIYgvBufSU-PH_EQttX****' \
-H 'Authorization: Bearer <your-api-key>'
```

### Get Job Metrics - Bash - All Regions

```bash
curl -X GET \
  'https://api.aliyun.com/api/pai-dlc/2020-12-03/GetJobMetrics?JobId=dlc-20210126170216-*******&MetricType=GpuMemoryUsage&StartTime=2020-11-08T16:00:00Z&EndTime=2020-11-09T16:00:00Z&TimeStep=5m' \
  -H 'Authorization: Bearer eyXXXX-XXXX.XXXXX'
```

## Response Format

```json
{
  "RequestId": "473469C7-AA6F-4DC5-B3DB-A3DC0DE3C83E",
  "ErrorInfo": {
    "Code": "200",
    "Message": "success",
    "AdditionalInfo": "additional info"
  }
}
```

**Key Fields**:
- `RequestId` — Unique identifier for the API request
- `ErrorInfo.Code` — Error code indicating success or failure
- `ErrorInfo.Message` — Human-readable error message
- `ErrorInfo.AdditionalInfo` — Additional error details if available
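
A client might inspect `ErrorInfo` after decoding the response, as sketched below. Treating the code `"200"` (or a missing `ErrorInfo`) as success is an assumption based on the sample above, and `ApiError` and `check_response` are hypothetical names.

```python
class ApiError(Exception):
    """Raised when a decoded response carries a non-success ErrorInfo.Code."""

    def __init__(self, code, message, request_id=None):
        super().__init__(f"[{code}] {message} (RequestId={request_id})")
        self.code = code
        self.request_id = request_id


def check_response(payload: dict) -> dict:
    """Return the payload unchanged, or raise ApiError if it reports a failure."""
    info = payload.get("ErrorInfo") or {}
    code = str(info.get("Code", ""))
    # Assumption: "200" means success, as in the sample response.
    if code and code != "200":
        raise ApiError(code, info.get("Message", ""), payload.get("RequestId"))
    return payload
```

Keeping `RequestId` on the exception makes it easy to include in logs or support requests.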

## Error Handling

| Error Code | Description | Recommended Action |
|-------------------|---------------------------|----------------------------------------|
| 400 | Bad Request - The request parameters are invalid or required fields are missing. | Verify all required parameters are provided and formatted correctly. |
| 401 | Unauthorized - Authentication failed or the API key is invalid. | Check your API key and ensure it has the necessary permissions. |
| 403 | Forbidden - The user does not have permission to access the requested resource. | Verify your account has the required permissions for this operation. |
| 404 | Not Found - The specified training job ID does not exist. | Confirm the training job ID is correct and exists in your workspace. |
| 429 | Too Many Requests - The request rate exceeds the allowed limit. | Implement rate limiting in your client or wait before retrying. |
| 500 | Internal Server Error - An unexpected error occurred on the server side. | Retry the request after a short delay. If the issue persists, contact support. |
| 503 | Service Unavailable - The service is temporarily unavailable due to maintenance or overload. | Wait and retry the request after a short period. |

### Rate Limits & Retry
- Most APIs have a rate limit of 100 QPS (queries per second) per account
- Some APIs have lower limits (e.g., 10 QPS)
- Implement exponential backoff for retries when encountering 429 errors
- Respect the Retry-After header if provided in error responses
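
The retry guidance above could be implemented roughly as follows. This is a sketch under stated assumptions: `send` is a caller-supplied function returning `(status_code, headers, body)`, only the seconds form of `Retry-After` is handled, and retrying on 500 and 503 follows the recommended actions in the error table.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 503}


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or attempts run out.

    Returns (status_code, body) from the last attempt.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            return status, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None and str(retry_after).isdigit():
            # The server stated how long to wait; respect it.
            delay = float(retry_after)
        else:
            # Exponential backoff with a small random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        sleep(delay)
```

The injectable `sleep` argument exists so the backoff schedule can be tested without actually waiting.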

## Environment Requirements

- Set the environment variable: `export DASHSCOPE_API_KEY=your_api_key`
- For Python, install the requests library: `pip install requests`
- Ensure your system clock is synchronized for proper authentication

## FAQ

Q: How do I authenticate my API requests?
A: Use the Bearer Token authentication method by including the header `Authorization: Bearer <your_api_key>` in your requests. Store your API key securely in the environment variable `DASHSCOPE_API_KEY`.

Q: What should I do if I receive a 404 error when accessing a training job?
A: Verify that the TrainingJobId is correct and that the job exists in your workspace. Also check that you have the necessary permissions to access the job.

Q: How can I monitor the progress of my training job?
A: Use the GetTrainingJobLatestMetrics API to retrieve real-time metrics like loss values. You can also use ListTrainingJobLogs to view the training logs and ListTrainingJobEvents to see important events in the job lifecycle.

Q: What's the difference between ListTrainingJobs and the other job listing APIs?
A: ListTrainingJobs provides a comprehensive view of all training jobs with detailed configuration information, while other APIs like ListTrainingJobEvents or ListTrainingJobLogs focus on specific aspects of job monitoring and debugging.

Q: How do I handle pagination when retrieving large datasets like logs or metrics?
A: Use the PageNumber and PageSize parameters to control pagination. Start with PageNumber=1 and PageSize=100 (or another reasonable value), then increment PageNumber until you've retrieved all data (when TotalCount <= PageNumber * PageSize).
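
The pagination loop described in this answer might look like the sketch below. `fetch_page` is a caller-supplied function that performs one API call, and the `Items` key is a placeholder: the actual list field name depends on the API being called.

```python
def iter_pages(fetch_page, page_size=100):
    """Yield items page by page until TotalCount <= PageNumber * PageSize.

    fetch_page(page_number, page_size) must return a dict with "TotalCount"
    and a list under "Items" (a hypothetical key name used for illustration).
    """
    page_number = 1
    while True:
        page = fetch_page(page_number, page_size)
        yield from page.get("Items", [])
        if page.get("TotalCount", 0) <= page_number * page_size:
            break
        page_number += 1
```

Writing it as a generator lets the caller stop early, for example after finding the first matching log line, without fetching the remaining pages.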

## Pricing & Billing

### Billing Model
Billing is per request: each API call counts as one request, regardless of whether it succeeds or fails.

### Price Reference

| Tier | Input Price | Output Price |
|------|-------------|--------------|
| standard | 0.001 / | 0.001 / |
| default | 0.001 / | 0.001 / |

### Free Tier
Monthly free quota of 1,000 requests for most APIs.

### Usage Limits
- Rate limits range from 10 to 100 QPS per account depending on the API
- Some APIs have additional constraints like maximum request size (e.g., 10KB)

### Billing Notes
- Each API call counts as one request, including failed requests
- Free tier resets monthly
- Pricing may vary slightly between different API endpoints