# pai-instance_management

# Platform for AI (PAI) Instance Management Console Guide

## Operations Overview

| Operation | Console Entry | Prerequisites | Description |
|----------|---------------|---------------|-------------|
| Train Pose Model | Console > Machine Learning Platform for AI (PAI) > Machine Learning Designer > Components > Video Algorithm > Offline Training > pose detection | - An activated Object Storage Service (OSS) bucket<br>- Machine Learning Studio authorized to access OSS | Trains top-down human pose estimation models using HRNet or Lite-HRNet backbones with COCO-formatted data |
| Configure Resource Policies | Console > PAI > Workspace Settings > Resource Policies | - Workspace administrator privileges<br>- Understanding of RAM roles and permissions<br>- Knowledge of DLC and DSW modules | Sets compute quotas, job priorities, auto-shutdown rules, and cost controls for DLC jobs and DSW instances |

## Operation Steps

### Train Pose Model

**Navigation**: Console > Machine Learning Platform for AI (PAI) > Machine Learning Designer > Components > Video Algorithm > Offline Training > pose detection

**Prerequisites**:
- An activated Object Storage Service (OSS) bucket
- Machine Learning Studio authorized to access OSS. For details, see "Activate OSS" and "Grant the permissions that are required to use Machine Learning Designer".

1. Label your data using **iTAG**
   - Element: **iTAG** (link) — located in the documentation or data preparation section
   - Notes: Annotations must follow the COCO keypoint format

2. Add five **Read File Data** components from the component library
   - Element: **Read File Data** (panel) — found in the component library panel
   - Notes: These will provide training images, training annotations, evaluation images, evaluation annotations, and dataset info

3. Set the **OSS Data Path** parameter for each **Read File Data** component
   - Element: **OSS Data Path** (text_input) — in each component’s configuration panel
   - Notes: Assign paths in this order: training images, training annotations, evaluation images, evaluation annotations, dataset info file

4. Connect all five **Read File Data** components to the **pose detection** component on the canvas
   - Element: **pose detection** (panel) — placed on the workflow canvas
   - Notes: Each connection must go to the matching input port (for example, training data to the training port)

5. Configure the required parameters of the **pose detection** component, as listed in the parameter table that follows these steps
   - Element: **pose detection** (panel) — open its configuration panel
   - Notes: A new configuration panel appears on the right side of the screen

6. Connect the output of **pose detection** to the **image prediction** component
   - Element: **image prediction** (panel) — placed on the canvas
   - Notes: This enables downstream inference testing

7. Set parameters on the **image prediction** component
   - Element: **image prediction** (panel) — in its configuration panel
   - Notes: Settings must match the pose detection configuration (for example, the same OSS model path)
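
Step 1 requires annotations in the COCO keypoint format. As a rough sanity check before uploading files to OSS, the sketch below verifies that each annotation's `keypoints` array holds `3 * num_keypoints` values, since the public COCO format stores each keypoint as an (x, y, visibility) triplet. The function name `check_coco_keypoints` and the sample data are hypothetical and are not part of PAI or iTAG.

```python
def check_coco_keypoints(coco, num_keypoints):
    """Return a list of problems found in a COCO keypoint annotation dict."""
    problems = []
    for key in ("images", "annotations", "categories"):
        if key not in coco:
            problems.append(f"missing top-level key: {key}")

    image_ids = {image["id"] for image in coco.get("images", [])}
    expected = 3 * num_keypoints  # one (x, y, visibility) triplet per keypoint

    for ann in coco.get("annotations", []):
        keypoints = ann.get("keypoints", [])
        if len(keypoints) != expected:
            problems.append(
                f"annotation {ann.get('id')}: expected {expected} "
                f"keypoint values, got {len(keypoints)}"
            )
        if ann.get("image_id") not in image_ids:
            problems.append(
                f"annotation {ann.get('id')}: unknown image_id {ann.get('image_id')}"
            )
    return problems


# Hypothetical two-keypoint dataset, used only for illustration.
sample = {
    "images": [{"id": 1, "file_name": "0001.jpg", "width": 192, "height": 256}],
    "categories": [
        {"id": 1, "name": "person", "keypoints": ["head", "neck"], "skeleton": [[1, 2]]}
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,
            "category_id": 1,
            "bbox": [10, 20, 80, 160],
            "num_keypoints": 2,
            "keypoints": [50, 40, 2, 50, 70, 2],
        }
    ],
}

print(check_coco_keypoints(sample, num_keypoints=2))
```

The `num_keypoints` argument here should equal the value entered for the **num keypoints** parameter of the pose detection component.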

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| model type | dropdown | Yes | TopDown | Algorithm type. Only TopDown is supported. |
| oss dir to save model | text_input | No | — | OSS directory for saving the trained model. Example: oss://examplebucket/output_dir/ckpt/ |
| oss data path to training | text_input | No | — | OSS path to training images. Required if the training input port is not connected. |
| oss annotation path for training data | text_input | No | — | OSS path to the training annotation file. Required if the training annotation port is not connected. |
| oss data path to evaluation | text_input | No | — | OSS path to evaluation images. Required if the evaluation input port is not connected. |
| oss annotation path for evaluation data | text_input | No | — | OSS path to the evaluation annotation file. Required if the evaluation annotation port is not connected. |
| oss path to dataset info file | text_input | No | — | OSS path to the dataset info file. Required if the dataset info port is not connected. |
| Data Source Type | dropdown | Yes | DetSourceCOCO | Input data format. Only DetSourceCOCO is supported. |
| oss path to pretrained model | text_input | No | — | OSS path to a custom pre-trained model. If blank, PAI uses its default pre-trained model. |
| backbone | dropdown | Yes | hrnet, lite_hrnet | Backbone model. Valid values: hrnet, lite_hrnet. |
| num keypoints | number_input | Yes | — | Number of keypoint categories in the dataset. |
| image size after resizing | text_input | Yes | 192,256 | Fixed input image size (width,height). Separate values with a comma. |
| initial learning rate | number_input | Yes | 0.01 | Starting learning rate for training. |
| learning rate policy | dropdown | Yes | step | Learning rate schedule. Only step is supported: the rate decays at the epochs specified in lr step. |
| lr step | text_input | Yes | 170,200 | Epochs at which the learning rate decays by 90%, that is, is multiplied by 0.1. Separate multiple values with commas. |
| train batch size | number_input | Yes | 32 | Number of samples per training iteration. |
| eval batch size | number_input | Yes | 32 | Number of samples per evaluation iteration. |
| total train epochs | number_input | Yes | 200 | Total number of passes over the training data. |
| save checkpoint epoch | number_input | No | 1 | Checkpoint save frequency. 1 saves a checkpoint after every epoch. |
| optimizer | dropdown | Yes | SGD, Adam | Optimizer for model training. Valid values: SGD, Adam. |
| number process of reading data per gpu | number_input | No | 2 | Number of data-loading threads per GPU. |
| evtorch model with fp16 | checkbox | No | — | Enable FP16 mixed precision to reduce GPU memory usage. |
| single worker or distributed on DLC | dropdown | Yes | single_on_dlc, distribute_on_dlc | Compute mode. Valid values: single_on_dlc, distribute_on_dlc. |
| number of worker | number_input | No | 1 | Number of worker nodes. Required when distribute_on_dlc is selected. |
| cpu machine type | text_input | No | 16vCPU+64GB Mem-ecs.g6.4xlarge | CPU instance type. Required when distribute_on_dlc is selected. |
| gpu machine type | text_input | Yes | 8vCPU+60GB Mem+1xp100-ecs.gn5-c8g1.2xlarge | GPU instance type. |
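
To make the interaction between **initial learning rate** and **lr step** concrete, here is a minimal sketch of a step schedule using the default values from the table (0.01 and 170,200). It assumes that the decay takes effect from the listed epoch onward and that each decay multiplies the rate by 0.1; the exact epoch indexing used by PAI is not stated in this guide, and the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, initial_lr=0.01, lr_steps=(170, 200), decay_factor=0.1):
    """Learning rate at a given epoch under a step schedule.

    Assumption: a decay applies once `epoch` reaches a value in `lr_steps`.
    """
    steps_passed = sum(1 for step in lr_steps if epoch >= step)
    return initial_lr * decay_factor ** steps_passed


# Print the rate just before and just after the first decay point.
for epoch in (0, 169, 170, 199):
    print(epoch, step_lr(epoch))
```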

### Configure Resource Policies

**Navigation**: Console > PAI > Workspace Settings > Resource Policies

**Prerequisites**:
- Workspace administrator privileges
- Understanding of RAM roles and permissions
- Knowledge of DLC and DSW modules

1. Navigate to the **Resource Policies** section
   - Element: **Resource Policies** (menu) — in the left navigation panel under Workspace Settings
   - Notes: Ensure you are in the correct workspace

2. Select the target module (**DLC** or **DSW**) using the module selector
   - Element: **Module Selection Dropdown** (dropdown) — in the main content area
   - Notes: Policy settings differ by module

3. Set the **Maximum runtime** for DLC jobs
   - Element: **Maximum runtime** (text_input) — in the DLC configuration panel
   - Notes: Enter time in hours:minutes:seconds format (e.g., 08:00:00)

4. Define the allowed **Job priority** range
   - Element: **Job priority** (text_input) — in the DLC configuration panel
   - Notes: Only integers are accepted. A higher number means a higher priority

5. Disable **Allow public resources** to block pay-as-you-go spending
   - Element: **Allow public resources** (checkbox) — in the DLC configuration panel
   - Notes: When unchecked, users cannot create pay-as-you-go DLC jobs

6. Configure the **Shutdown policy** for DSW instances
   - Element: **Shutdown policy** (dropdown) — in the DSW Auto-shutdown section
   - Notes: Options include idle time, CPU utilization, memory utilization, GPU utilization

7. Set the **Exclusion policy** for instances exempt from auto-shutdown
   - Element: **Exclusion policy** (text_input) — in the DSW Auto-shutdown section
   - Notes: Specify by instance name or priority level

8. Set the **Maximum instance wait time**
   - Element: **Maximum instance wait time** (text_input) — in the DSW configuration panel
   - Notes: Instances exceeding this time are stopped automatically
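
Step 3 expects the maximum runtime in hours:minutes:seconds format. The following sketch shows one way to validate such a string before entering it in the console. It assumes that hours may span more than two digits and that minutes and seconds stay within 00 to 59; the console's own validation rules are not documented here, and `parse_runtime` is a hypothetical helper.

```python
import re
from datetime import timedelta

# Hours: one or more digits; minutes and seconds: 00-59.
_RUNTIME_RE = re.compile(r"^(\d+):([0-5]\d):([0-5]\d)$")


def parse_runtime(text):
    """Convert an 'HH:MM:SS' string into a timedelta, or raise ValueError."""
    match = _RUNTIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"expected HH:MM:SS, got {text!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


limit = parse_runtime("08:00:00")
print(limit.total_seconds())
```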

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Resource usage modules | dropdown | No | DSW, DLC | Restricts the quota to specific modules such as DSW or DLC. |
| Resource usage roles | dropdown | No | — | Restricts the quota to specific roles within the workspace. |
| Resource usage amount | text_input | No | — | Sets the maximum GPUs, CPU cores, and memory that different roles can use from this quota. |
| Resource specification template | dropdown | No | — | Users must select this template when creating DSW instances or DLC jobs. |
| Maximum runtime | text_input | No | — | Sets the maximum time a job can run before being automatically stopped. |
| Job priority | text_input | No | — | Sets the maximum priority level that different roles or members can assign when submitting jobs. |
| Allow public resources | checkbox | No | — | When disabled, users cannot create pay-as-you-go DLC jobs in this workspace. |
| Maximum job wait time | text_input | No | — | Sets the maximum time a job can spend waiting, queuing, or preparing its environment. A notification is sent when the threshold is exceeded. |
| Instance priority | text_input | No | — | Sets the maximum priority level that different roles or members can assign when creating instances. |
| Maximum instance wait time | text_input | No | — | Sets the maximum time an instance can spend queuing or preparing its environment. Instances that exceed this threshold are stopped automatically. |
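
The way these settings constrain a DLC job can be pictured with a small model. The sketch below is purely illustrative: the `DlcPolicy` and `JobRequest` classes and both functions are hypothetical and do not correspond to any PAI API. It only mirrors the rules described in the table: a priority cap, a switch for public (pay-as-you-go) resources, and a runtime limit after which a job is stopped.

```python
from dataclasses import dataclass


@dataclass
class DlcPolicy:
    max_runtime_seconds: int
    max_priority: int
    allow_public_resources: bool


@dataclass
class JobRequest:
    priority: int
    uses_public_resources: bool


def submission_problems(policy, job):
    """List the policy rules that a job request would violate."""
    problems = []
    if job.priority > policy.max_priority:
        problems.append("priority exceeds the allowed maximum")
    if job.uses_public_resources and not policy.allow_public_resources:
        problems.append("pay-as-you-go (public) resources are disabled")
    return problems


def exceeds_runtime(policy, elapsed_seconds):
    """True once a running job has passed the maximum runtime."""
    return elapsed_seconds > policy.max_runtime_seconds


policy = DlcPolicy(max_runtime_seconds=8 * 3600, max_priority=5, allow_public_resources=False)
print(submission_problems(policy, JobRequest(priority=7, uses_public_resources=True)))
```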

## FAQ

Q: Where do I find the pose detection component in Machine Learning Designer?
A: Navigate to Console > Machine Learning Platform for AI (PAI) > Machine Learning Designer > Components > Video Algorithm > Offline Training > pose detection.

Q: What happens if I leave the OSS paths blank in the pose detection component?
A: If the corresponding input ports (e.g., training data) are connected via Read File Data components, the OSS path fields are optional. Otherwise, they are required.
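
As a sketch of this rule, the snippet below reports which OSS path parameters still need a value for a given set of connected ports. The parameter names come from the pose detection parameter table; the port names and the function `missing_oss_paths` are hypothetical, chosen only for illustration.

```python
# Hypothetical port names mapped to the parameter names in the pose detection table.
PORT_TO_PARAM = {
    "training_images": "oss data path to training",
    "training_annotations": "oss annotation path for training data",
    "evaluation_images": "oss data path to evaluation",
    "evaluation_annotations": "oss annotation path for evaluation data",
    "dataset_info": "oss path to dataset info file",
}


def missing_oss_paths(connected_ports, params):
    """Return parameters that are required because their port is not connected."""
    missing = []
    for port, param in PORT_TO_PARAM.items():
        has_value = bool(params.get(param, "").strip())
        if port not in connected_ports and not has_value:
            missing.append(param)
    return missing


# Only the two training ports are connected and no paths are filled in.
print(missing_oss_paths({"training_images", "training_annotations"}, {}))
```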

Q: Can I modify resource policies after they are applied?
A: Yes, workspace administrators can update resource policies at any time. Changes take effect immediately for new jobs and instances.

Q: What permissions are required to configure resource policies?
A: You must have workspace administrator privileges and appropriate RAM permissions to manage quotas and policies in PAI.

Q: Does enabling FP16 mixed precision affect model accuracy?
A: It can, slightly. FP16 reduces GPU memory usage at the cost of lower numerical precision, but it is generally safe for pose estimation tasks and often accelerates training.
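
To give a feel for the precision loss involved, the sketch below rounds a few values to IEEE 754 half precision using Python's standard `struct` module. It illustrates the FP16 number format only; it says nothing about how PAI implements mixed-precision training, and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Compare each value with its half-precision approximation.
for value in (1.0, 0.1, 2049.0):
    print(value, to_fp16(value))
```

Values such as 1.0 survive unchanged, while 0.1 and integers above 2048 are rounded to the nearest representable number.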

## Pricing & Billing

### Billing Model
Pose detection training jobs are billed per instance-hour. Configuring resource policies is free of charge.

### Price Reference
- Standard tier: ¥0.05/hour for input and output processing
- Distributed training tasks are billed per node per hour

### Free Tier
10 hours of free compute resources per month

### Billing Notes
- Billed by actual usage time; minimum unit is 1 minute
- Distributed training costs scale with number of worker nodes and duration
- Pay-as-you-go DLC jobs can be blocked via the "Allow public resources" policy setting
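
The notes above imply that distributed training cost grows with both the number of worker nodes and the run duration. The sketch below estimates a cost under two assumptions that this guide does not confirm: partial minutes are rounded up to the next whole minute, and every node is charged the same hourly rate. The rate used in the example is a placeholder, not an actual PAI price.

```python
import math


def estimate_cost(duration_seconds, workers, price_per_node_hour):
    """Estimate the cost of a job billed per node per hour in 1-minute units."""
    # Assumption: partial minutes are rounded up to the next whole minute.
    billed_minutes = math.ceil(duration_seconds / 60)
    return workers * billed_minutes / 60 * price_per_node_hour


# Placeholder rate of 10.0 per node-hour; 4 workers running for one hour.
print(estimate_cost(3600, workers=4, price_per_node_hour=10.0))
```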