# pai-training

# Platform for AI (PAI) Training Job Management Console Guide

## Operations Overview

| Operation | Console Entry | Prerequisites | Description |
|----------|---------------|---------------|-------------|
| Train Image Metric Learning Model | Console > Machine Learning Platform for AI > Workflows > Image Metric Learning Training (raw) | Activated OSS, Authorized Machine Learning Studio to access OSS | Configure and launch a metric learning training job using raw image data with backbones like ResNet or Vision Transformers |

## Step-by-Step Instructions

### Train Image Metric Learning Model

**Navigation**: Console > Machine Learning Platform for AI > Workflows > Image Metric Learning Training (raw)

**Prerequisites**:
- Activated OSS
- Authorized Machine Learning Studio to access OSS

1. Prepare and annotate data using the PAI iTAG module  
   - Element: **iTAG** (link) — top navigation panel  
   - Notes: Refer to the iTAG documentation for detailed instructions.

2. Use Read OSS Data components to read training and validation annotation files  
   - Element: **Read OSS Data-4** (panel) — main content area  
   - Notes: Set the OSS Data Path parameter to the corresponding annotation file path.

3. Connect both Read OSS Data components to the Image Metric Learning Training (raw) component and configure parameters  
   - Element: **Image Metric Learning Training (raw)** (panel) — main content area  
   - Notes: Configure parameters such as OSS directory for training output, backbone model, image resize size, and loss function. After configuration, the workflow can be submitted for execution.

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Metric learning model type | dropdown | Yes | Data-parallel metric learning, Model-parallel metric learning | Algorithm parallelism strategy. Select Data-parallel for standard training or Model-parallel to distribute the model across multiple GPUs. |
| OSS directory for training output | text_input | Yes | — | OSS directory where trained models are saved. Example: oss://examplebucket/yun****/designer_test |
| Training annotation file path | text_input | No | — | Required only if no training annotation file is connected to the input port. If configured in both places, the input port takes precedence. |
| Validation annotation file path | text_input | No | — | Required only if no validation annotation file is connected to the input port. If configured in both places, the input port takes precedence. |
| File of class name list | text_input | No | — | Enter class names directly or specify an OSS path to a file containing class names. |
| Data source format | dropdown | Yes | ClsSourceImageList, ClsSourceItag | Input data format. ClsSourceImageList expects annotation files in the specified format; ClsSourceItag expects data annotated with the PAI iTAG module. |
| OSS path of the pre-trained model | text_input | No | — | OSS path to a custom pre-trained model. If not set, PAI uses its default pre-trained weights. |
| Metric learning backbone | dropdown | Yes | resnet_50, resnet18, resnet34, resnet101, swint_tiny, swint_small, swint_base, vit_tiny, vit_small, vit_base, xcit_tiny, xcit_small, xcit_base | Backbone model for feature extraction. |
| Image resize size | number_input | Yes | — | Image width and height in pixels after resizing. |
| Backbone output feature dimension | number_input | Yes | — | Output feature dimension of the backbone. Must be an integer. |
| Feature output dimension | number_input | Yes | — | Output feature dimension of the Neck. Must be an integer. |
| Number of training classes | number_input | Yes | — | Total number of classes in your dataset. Set this to match your label count. |
| Loss function | dropdown | Yes | AMSoftmax (margin 0.4, scale 30), ArcFaceLoss (margin 28.6, scale 64), CosFaceLoss (margin 0.35, scale 64), LargeMarginSoftmaxLoss (margin 4, scale 1), SphereFaceLoss (margin 4, scale 1), Model-parallel AMSoftmax (classification limit scales with GPU count), Model-parallel Softmax (classification limit scales with GPU count) | Loss function for training. Use the recommended parameters in parentheses as a starting point. |
| Loss function scale parameter | number_input | Yes | — | Scale parameter for the selected loss function. |
| Loss function margin parameter | number_input | Yes | — | Margin parameter for the selected loss function. |
| Loss function weight | number_input | No | — | Balances optimization between metric and classification objectives. |
| Optimizer | dropdown | Yes | SGD, AdamW | Optimizer for training. SGD suits most cases. AdamW often converges faster and is a reasonable alternative when SGD converges slowly. |
| Initial learning rate | number_input | Yes | — | Initial learning rate. Must be a floating-point number. |
| Training batch_size | number_input | Yes | — | Number of samples per training iteration. Use a smaller batch size for larger backbones. |
| Total training epochs | number_input | Yes | — | Total number of passes through the training dataset. |
| Checkpoint saving frequency | number_input | No | — | Number of epochs between checkpoint saves. Set to 1 to save after every epoch. |
| Training data read threads | number_input | No | — | Number of parallel processes for reading training data. |
| Enable half-precision | checkbox | No | — | Enables half-precision (float16) training to reduce GPU memory usage. |
| Computing mode | dropdown | Yes | Standalone DLC, Distributed DLC | Computing engine mode. Standalone DLC runs on a single machine. Distributed DLC distributes training across multiple worker processes. |
| Workers | number_input | No | — | Number of concurrent worker processes. Configure when using Distributed DLC. |
| GPU model | dropdown | Yes | 8vCPU+60GB Mem+1xp100-ecs.gn5-c8g1.2xlarge | GPU specification for training. Select a larger instance for bigger backbones or larger batch sizes. |
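
The recommended margin and scale values listed for the Loss function parameter can be kept in a small lookup when preparing settings before entering them in the console. The following is a minimal Python sketch: the `RECOMMENDED_LOSS_PARAMS` dictionary and the `recommended_loss_params` helper are hypothetical names used for illustration and are not part of any PAI SDK. The values are copied from the table above.

```python
# Hypothetical helper for illustration only; this is not a PAI API.
# Margin and scale values are copied from the "Loss function" row above.
RECOMMENDED_LOSS_PARAMS = {
    "AMSoftmax": {"margin": 0.4, "scale": 30},
    "ArcFaceLoss": {"margin": 28.6, "scale": 64},
    "CosFaceLoss": {"margin": 0.35, "scale": 64},
    "LargeMarginSoftmaxLoss": {"margin": 4, "scale": 1},
    "SphereFaceLoss": {"margin": 4, "scale": 1},
}


def recommended_loss_params(loss_name):
    """Return a copy of the suggested starting margin and scale for a loss."""
    try:
        return dict(RECOMMENDED_LOSS_PARAMS[loss_name])
    except KeyError:
        raise ValueError(
            f"No recommended parameters listed for {loss_name!r}"
        ) from None


if __name__ == "__main__":
    print(recommended_loss_params("ArcFaceLoss"))
```

The two model-parallel loss options have no margin or scale listed in the table, so this sketch raises `ValueError` for them instead of guessing a value.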

## FAQ

Q: Where do I find the Image Metric Learning Training (raw) component?
A: Navigate to Console > Machine Learning Platform for AI > Workflows, then locate the "Image Metric Learning Training (raw)" component in the algorithm palette under offline training components.

Q: What happens if I leave the Training annotation file path empty?
A: If the field is empty and no annotation file is connected to the training input port of the component, the job will fail. The field can be left empty only if you have already connected a Read OSS Data component to the training input port.
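
The precedence rule can be sketched in a few lines of Python. This is an illustrative model of the behavior described in this guide, not PAI code: the function name `resolve_annotation_source` and the example paths are hypothetical.

```python
# Hypothetical model of the documented precedence; not a PAI API.
def resolve_annotation_source(port_path=None, parameter_path=None):
    """Return the annotation file the job would read.

    The input port takes precedence over the path parameter.
    If neither is set, the job cannot start, modeled here as an error.
    """
    if port_path:
        return port_path
    if parameter_path:
        return parameter_path
    raise ValueError(
        "No training annotation file: connect a Read OSS Data component "
        "to the input port or set the annotation file path parameter"
    )


if __name__ == "__main__":
    # Both paths below are made-up examples.
    print(resolve_annotation_source(
        port_path="oss://examplebucket/port/train.txt",
        parameter_path="oss://examplebucket/param/train.txt",
    ))
```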

Q: Can I modify training parameters after submitting the job?
A: No. Training parameters must be finalized before submission. To change settings, you must clone or recreate the workflow and reconfigure the component.

Q: What permissions are required to run this training job?
A: Your account must have OSS read/write permissions for the specified buckets and authorization for Machine Learning Studio to access OSS resources.

Q: Why is my backbone output feature dimension fixed to certain values?
A: The value must match the actual output dimension of the selected backbone (e.g., ResNet-50 outputs 2048 features). Mismatched dimensions will cause training errors.
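
A quick pre-submission check can catch a mismatched dimension. The sketch below is a hypothetical helper, not a PAI API. Only the ResNet-50 value (2048) comes from this guide. The ResNet-18, ResNet-34, and ResNet-101 values are the final feature widths of the standard ResNet architectures and are assumed to apply to the PAI backbones. Other backbones are left out because their dimensions are not stated here.

```python
# Hypothetical lookup; only resnet_50 = 2048 is stated in this guide.
# The other entries assume the standard ResNet architectures.
STANDARD_RESNET_FEATURE_DIMS = {
    "resnet18": 512,
    "resnet34": 512,
    "resnet_50": 2048,
    "resnet101": 2048,
}


def check_backbone_dim(backbone, configured_dim):
    """Compare a configured dimension with the known backbone output.

    Returns True or False for backbones in the lookup, and None when
    the backbone is not listed and no check can be made.
    """
    expected = STANDARD_RESNET_FEATURE_DIMS.get(backbone)
    if expected is None:
        return None
    return configured_dim == expected


if __name__ == "__main__":
    print(check_backbone_dim("resnet_50", 2048))
    print(check_backbone_dim("resnet_50", 512))
    print(check_backbone_dim("vit_base", 768))
```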

## Pricing & Billing

### Billing Model
Per instance hour

### Price Reference
| Tier | Price |
|------|-------|
| Standard | 0.05 / |

### Free Tier
No free tier available

### Billing Notes
Billed based on actual usage time, with a minimum billing unit of 1 minute. Maximum single training duration is 24 hours.
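
A rough cost estimate follows from these rules. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and `hourly_rate` is a placeholder you must supply from your own price list, since this guide does not state the price's currency or unit. It also assumes that partial minutes are rounded up to the 1-minute billing unit, which is not stated above.

```python
import math

# Maximum single training duration from the billing notes: 24 hours.
MAX_DURATION_MINUTES = 24 * 60


def estimate_cost(duration_seconds, hourly_rate, instances=1):
    """Estimate the cost of one training job billed per instance hour.

    Assumption: usage is rounded up to whole minutes before billing.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    billed_minutes = math.ceil(duration_seconds / 60)
    if billed_minutes > MAX_DURATION_MINUTES:
        raise ValueError("a single training job cannot exceed 24 hours")
    return instances * hourly_rate * billed_minutes / 60


if __name__ == "__main__":
    # 90 seconds on one instance at a made-up rate of 6.0 per hour.
    print(estimate_cost(90, hourly_rate=6.0))
```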