# alinux-ai

Part of **ALINUX**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Deploy AI models for inference or training](../../intent/alinux-deploy-model/SKILL.md). If you're unsure which path to take, check the routing skill first.

# Alibaba Cloud Linux AI and GPU Workloads Console Guide

## Operations Overview

| Operation | Console Entry | Prerequisites | Description |
|----------|---------------|---------------|-------------|
| Run AC2 Container on ECS | Console > Elastic Compute Service > Instances > Create Instance | - ECS instance selected and connected<br>- Recommended OS: Alibaba Cloud Linux 3.2104 LTS 64-bit<br>- GPU resources available (for GPU tasks) | Deploy AC2 AI containers supporting PyTorch CPU/GPU training on ECS |
| Deploy CPU-Based Models (AMD) | ECS > Instance Creation Wizard | - AMD CPU-based ECS instance ready<br>- Alibaba Cloud Linux 3.2104 LTS 64-bit OS<br>- Public IPv4 assigned<br>- Docker installed and enabled | Deploy ChatGLM3-6B using AMD-optimized AC2 AI container image |
| Deploy CPU-Based Models (Intel) | ECS > Instance Creation Wizard | - Intel CPU-based ECS instance (e.g., ecs.g8i.4xlarge)<br>- Alibaba Cloud Linux 3.2104 LTS 64-bit OS<br>- Public IPv4 with 100 Mbps bandwidth<br>- 100 GiB data disk<br>- Docker installed | Deploy Qwen-7B-Chat using Intel-optimized AC2 AI container image |
| Deploy GPU-Accelerated Models | ECS > Instance Creation Wizard | - GPU instance with ≥16 GiB VRAM (e.g., ecs.gn6i-c4g1.xlarge)<br>- Alibaba Cloud Linux 3.2104 LTS 64-bit OS<br>- Public IPv4 with 100 Mbps bandwidth<br>- 100 GiB data disk<br>- Docker and NVIDIA drivers installed | Deploy Qwen-7B-Chat with NVIDIA GPU acceleration via AC2 container |
| Monitor GPU Performance | Console > ECS > Instances > Instance Details > Monitoring & Diagnostics | - Running GPU-enabled instance<br>- GPU driver installed and loaded<br>- Console access permissions granted | View real-time GPU utilization, memory usage, and performance metrics |
| Diagnose GPU Issues | Console > ECS > GPU Performance & Diagnostics > GPU Diagnostics | - GPU instance<br>- SysOM component installed<br>- Instance in supported region | Analyze system anomalies, NCCL errors, and slow nodes using diagnostic reports |
| View GPU Topology | Console > GPU Performance & Diagnostics > GPU Resource Topology | - ACK GPU cluster<br>- NVIDIA GPUs<br>- NCCL v2.22.23<br>- SysOM v3.9.1+ | Visualize multi-node GPU communication topology for distributed training |

## Operation Steps

### Run AC2 Container on ECS

**Navigation**: Console > Elastic Compute Service > Instances > Create Instance

**Prerequisites**:
- ECS instance already selected and accessible
- Recommended OS: **Alibaba Cloud Linux 3.2104 LTS 64-bit**
- GPU resources available if running GPU-accelerated workloads

1. Click the **Create Instance** button to start configuration  
   - Element: **Create Instance** (button) — located in the top-right corner of the ECS console
   - Notes: This opens the instance creation wizard

2. In the operating system selection section, choose **Alibaba Cloud Linux 3.2104 LTS 64-bit**
   - Element: operating system image (dropdown), found in the "Image" section of the instance configuration page
   - Notes: Using this OS ensures full compatibility with AC2 AI containers and support services

3. Select a GPU-capable instance type if needed (e.g., **gn7i**)
   - Element: instance type (dropdown), in the "Instance Type" section
   - Notes: Only required for GPU workloads; skip for CPU-only tasks

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Instance type | dropdown | Yes | GPU instance families such as gn7i | Choose an instance type suitable for AI training tasks |
| Image | dropdown | Yes | Alibaba Cloud Linux 3.2104 LTS 64-bit, CentOS 7, Ubuntu 20.04 | Recommended: **Alibaba Cloud Linux 3.2104 LTS 64-bit** for optimal support |
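
Once the instance exists, the container itself is started from a shell on the instance rather than from the console. The sketch below assembles a `docker run` command line for that step. The function name, the container name, and the `AC2_IMAGE` placeholder are assumptions for illustration (substitute the AC2 image reference you actually intend to use), and the `/bin/bash` entry command assumes the image ships a Bash shell. The `--gpus all` flag is Docker's standard option for exposing NVIDIA GPUs and requires the NVIDIA Container Toolkit on the host.

```python
import shlex


def build_ac2_run_command(image: str, use_gpu: bool = False,
                          name: str = "ac2-workload") -> list[str]:
    """Return an argv list for starting an interactive container.

    `image` is whatever AC2 image reference you chose; this helper
    does not know or validate real image names.
    """
    cmd = ["docker", "run", "-it", "--rm", "--name", name]
    if use_gpu:
        # Expose all NVIDIA GPUs to the container (GPU workloads only).
        cmd += ["--gpus", "all"]
    # Assumption: the image provides /bin/bash as an interactive shell.
    cmd += [image, "/bin/bash"]
    return cmd


if __name__ == "__main__":
    # AC2_IMAGE is a placeholder, not a real image reference.
    print(shlex.join(build_ac2_run_command("AC2_IMAGE", use_gpu=True)))
```

For CPU-only tasks, leave `use_gpu` at its default so the GPU flag is omitted.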

### Deploy CPU-Based Models (AMD)

**Navigation**: ECS > Instance Creation Wizard

**Prerequisites**:
- AMD CPU-based ECS instance prepared
- OS: **Alibaba Cloud Linux 3.2104 LTS 64-bit**
- Public IPv4 address assigned
- Docker installed and daemon enabled

1. Navigate to the instance creation page
   - Element: instance creation entry (link), accessible from the top navigation bar or console homepage

2. Select the instance specification
   - Element: instance type (dropdown), in the instance type selection area
   - Notes: Recommended: **ecs.g8a.4xlarge** (64 GiB memory) for stable ChatGLM3-6B operation

3. Choose the operating system image
   - Element: image (dropdown), in the image selection section
   - Notes: Must select **Alibaba Cloud Linux 3.2104 LTS 64-bit**

4. Enable public IP assignment
   - Element: public **IPv4** address (checkbox), in network and security group settings
   - Notes: Set bandwidth peak to **100 Mbps** to accelerate model downloads

5. Configure data disk size
   - Element: data disk size (text_input), in storage settings
   - Notes: Set to **100 GiB** to accommodate model files

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Instance type | dropdown | Yes | ecs.g8a.4xlarge | Recommended: **ecs.g8a.4xlarge** to meet the 30 GiB memory requirement |
| Image | dropdown | Yes | Alibaba Cloud Linux 3.2104 LTS 64-bit | Must use this OS for AMD-optimized AI containers |
| Public IPv4 address | checkbox | No | - | Enables remote access and model downloading |
| Bandwidth billing method | dropdown | Yes | - | Billing method for the public bandwidth |
| Peak bandwidth | number_input | Yes | - | Set to **100** Mbps for faster downloads |
| Data disk size | text_input | Yes | - | Set to **100 GiB** for model storage |

### Deploy CPU-Based Models (Intel)

**Navigation**: ECS > Instance Creation Wizard

**Prerequisites**:
- Intel CPU instance (minimum **ecs.g8i.4xlarge**, 64 GiB RAM)
- OS: **Alibaba Cloud Linux 3.2104 LTS 64-bit**
- Public IPv4 with 100 Mbps bandwidth
- 100 GiB data disk
- Docker installed and running

1. Go to the instance creation page
   - Element: instance creation entry (link), in top navigation

2. Select instance type
   - Element: instance type (dropdown), in instance specification area
   - Notes: Use **ecs.g8i.4xlarge** or higher for Qwen-7B-Chat stability

3. Choose OS image
   - Element: image (dropdown), in image section
   - Notes: Must be **Alibaba Cloud Linux 3.2104 LTS 64-bit**

4. Configure public IP
   - Element: public **IPv4** address (checkbox), in network settings
   - Notes: Choose the bandwidth billing mode and set peak bandwidth to **100 Mbps**

5. Set data disk size
   - Element: data disk size (text_input), in storage section
   - Notes: Allocate **100 GiB** for model files

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Instance type | dropdown | Yes | ecs.g8i.4xlarge | Must provide ≥64 GiB memory |
| Image | dropdown | Yes | Alibaba Cloud Linux 3.2104 LTS 64-bit | Required OS for Intel-optimized containers |
| Public IPv4 address | checkbox | No | - | Optional but recommended for model download |
| Bandwidth billing method | dropdown | Yes | - | Choose the method that best fits cost control |
| Peak bandwidth | number_input | Yes | - | Set to **100** Mbps |
| Data disk size | text_input | Yes | - | Set to **100 GiB** |
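
Before submitting the wizard, it can help to compare the chosen values against the prerequisites listed for this path. The following is a minimal sketch of such a check, not part of any Alibaba Cloud tooling: `InstancePlan` and `review_plan` are hypothetical names, and the thresholds (64 GiB memory, 100 GiB data disk, 100 Mbps peak bandwidth) are taken from the prerequisites above.

```python
from dataclasses import dataclass


@dataclass
class InstancePlan:
    """Hypothetical record of the values chosen in the creation wizard."""
    memory_gib: int
    data_disk_gib: int
    peak_bandwidth_mbps: int
    public_ipv4: bool


def review_plan(plan: InstancePlan) -> list[str]:
    """Return findings; an empty list means the plan matches the guide."""
    findings = []
    if plan.memory_gib < 64:
        findings.append(f"memory is {plan.memory_gib} GiB, guide asks for at least 64 GiB")
    if plan.data_disk_gib < 100:
        findings.append(f"data disk is {plan.data_disk_gib} GiB, guide asks for 100 GiB")
    if not plan.public_ipv4:
        # Public IPv4 is optional, but without it the model must come from elsewhere.
        findings.append("no public IPv4, so the model download needs another route")
    elif plan.peak_bandwidth_mbps < 100:
        findings.append(f"peak bandwidth is {plan.peak_bandwidth_mbps} Mbps, guide suggests 100 Mbps")
    return findings


if __name__ == "__main__":
    for finding in review_plan(InstancePlan(32, 100, 100, True)):
        print("check:", finding)
```

The same check applies to the AMD path, which lists the same disk and bandwidth values.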

### Deploy GPU-Accelerated Models

**Navigation**: ECS > Instance Creation Wizard

**Prerequisites**:
- GPU instance with ≥16 GiB VRAM (e.g., **ecs.gn6i-c4g1.xlarge**)
- OS: **Alibaba Cloud Linux 3.2104 LTS 64-bit**
- Public IPv4 with 100 Mbps bandwidth
- 100 GiB data disk
- Docker and NVIDIA drivers pre-installed

1. Access the instance creation page
   - Element: instance creation entry (link), from top navigation or product entry

2. Complete configuration and create the instance
   - Element: confirmation button, at bottom of main content area
   - Notes: Ensure correct settings for instance type, OS, public IP, and disk

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Instance type | dropdown | Yes | ecs.gn6i-c4g1.xlarge (16 GiB VRAM) | Must support NVIDIA GPU with ≥16 GiB VRAM |
| Image | dropdown | Yes | Alibaba Cloud Linux 3.2104 LTS 64-bit | OS for AI container deployment |
| Public IPv4 address | checkbox | Yes | IPv4 | Required for model download |
| Bandwidth billing method | dropdown | Yes | - | Choose the option recommended for cost efficiency |
| Peak bandwidth | number_input | Yes | - | Set to **100** Mbps |
| Data disk size | number_input | Yes | - | Set to **100 GiB** |
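
The console only provisions the instance, so the software prerequisites have to be confirmed on the host. A minimal sketch of that check follows: it looks for the `docker`, `nvidia-smi`, and `nvidia-ctk` executables on `PATH`. The function name is an assumption, and finding a binary on `PATH` is only a rough signal. It does not prove that the Docker daemon is running or that the driver is loaded.

```python
import shutil

# docker: container runtime; nvidia-smi: ships with the NVIDIA driver;
# nvidia-ctk: CLI installed by the NVIDIA Container Toolkit.
REQUIRED_TOOLS = ("docker", "nvidia-smi", "nvidia-ctk")


def missing_tools(which=shutil.which, tools=REQUIRED_TOOLS) -> list[str]:
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the function can be exercised without
    the real binaries being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("docker, NVIDIA driver CLI, and container toolkit CLI found")
```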

### Monitor GPU Performance

**Navigation**: Console > ECS > Instances > Instance Details > Monitoring & Diagnostics

**Prerequisites**:
- Running GPU-enabled ECS instance
- GPU driver installed and loaded
- Appropriate console permissions

1. Navigate to the Instances menu  
   - Element: **Instances** (menu) — in left navigation panel

2. Select the target GPU instance  
   - Element: **Instance ID or Name** (link) — in main content area

3. Open the monitoring tab  
   - Element: **Monitoring & Diagnostics** (tab) — at top of instance details panel

4. View GPU metrics  
   - Element: **GPU Utilization Chart** (graph) — in main content area
   - Notes: Charts update every 60 seconds; hover for detailed values
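
The console charts can be cross-checked from a shell on the instance. The sketch below reads per-GPU utilization and memory through `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and parses the CSV output. It is an on-host complement to the console view, not how the console gathers its data, and the function names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse lines such as '0, 35, 1024, 15360' into dictionaries.

    Fields reported as '[N/A]' are not handled and raise ValueError.
    """
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def read_gpu_metrics() -> list[dict]:
    """Run nvidia-smi on this host and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Call `read_gpu_metrics()` only on a host where the driver is installed; `parse_gpu_csv` can be used on saved output anywhere.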

### Diagnose GPU Issues

**Navigation**: Console > ECS > GPU Performance & Diagnostics > GPU Diagnostics

**Prerequisites**:
- GPU instance
- SysOM component installed
- Instance in supported region

1. In the left navigation panel, expand **GPU Performance & Diagnostics**
   - Element: **GPU Performance & Diagnostics** (menu), left sidebar

2. Click **GPU Diagnostics**
   - Element: **GPU Diagnostics** (menu), under the GPU performance section

3. Enter training task ID
   - Element: training task **ID** (text_input), in diagnostic parameters area
   - Notes: Typically the pod name, e.g., **traintask-82asd33as-master-0**

4. Enter runtime duration
   - Element: runtime duration (text_input), next to task ID field
   - Notes: Unit is **seconds**

5. View diagnostic report
   - Element: report view button, in analysis records section
   - Notes: Report includes system info, conclusions, and raw data
   - Notes: Report includes system info, conclusions, and raw data

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Task ID | text_input | Yes | - | Pod name of the training task (e.g., traintask-82asd33as-master-0) |
| Runtime duration | number_input | Yes | - | Duration in seconds for diagnostic analysis |
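
Because the duration field expects seconds, a common slip is entering minutes. The sketch below prepares the two form values from a pod name and a duration in minutes. It is a hypothetical helper for filling in the console form, not a call to any diagnostics API, and the dictionary keys are illustrative.

```python
def make_diagnosis_inputs(task_id: str, minutes: float) -> dict:
    """Validate the two form values and convert the duration to seconds."""
    task_id = task_id.strip()
    if not task_id:
        raise ValueError("task ID (usually the pod name) must not be empty")
    seconds = int(round(minutes * 60))
    if seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")
    return {"task_id": task_id, "duration_seconds": seconds}


if __name__ == "__main__":
    # Pod name taken from the example in the steps above.
    print(make_diagnosis_inputs("traintask-82asd33as-master-0", minutes=5))
```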

### View GPU Topology

**Navigation**: Console > GPU Performance & Diagnostics > GPU Resource Topology

**Prerequisites**:
- ACK GPU cluster
- NVIDIA GPUs
- NCCL v2.22.23
- SysOM v3.9.1 or higher

1. In the SysOM console, go to the configuration template page
   - Element: navigation link, left navigation panel

2. Click the button that creates a new configuration template
   - Element: create button, in main content area

3. Enable GPU topology feature
   - Element: **GPU** topology option (checkbox), in configuration parameters
   - Notes: Enabling increases the SysOM Agent memory limit by 100 MB (e.g., 350 MB → 450 MB)

4. Confirm configuration
   - Element: confirm button, bottom of dialog
   - Notes: Creates a new configuration template

5. Navigate to managed resources
   - Element: managed resources (link), left panel, under "Resource Management"

6. Initiate component update for target cluster
   - Element: component update button, in action column of cluster list

7. Select the new configuration template
   - Element: configuration template (dropdown), in configuration change panel
   - Notes: Choose the template created in step 4 (e.g., "GPU")

8. Submit the change
   - Element: submit button, bottom of panel
   - Notes: Wait for task completion; GPU topology becomes active after success

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Template name | text_input | Yes | - | Descriptive name (e.g., **GPU**) |
| GPU topology | checkbox | No | - | Must be checked to enable the feature |
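
The version prerequisites are easy to get wrong with plain string comparison (for example, "3.10.0" sorts before "3.9.1" as text). The sketch below compares versions as integer tuples and computes the SysOM Agent memory limit after enabling the feature. The function names are hypothetical, and treating the NCCL prerequisite as an exact match is an assumption, since the guide lists the version without a "+".

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v3.9.1' or '3.9.1' into (3, 9, 1)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def topology_prerequisites_met(sysom: str, nccl: str,
                               min_sysom: str = "3.9.1",
                               required_nccl: str = "2.22.23") -> bool:
    """Check SysOM >= min_sysom and, by assumption, NCCL == required_nccl."""
    return (parse_version(sysom) >= parse_version(min_sysom)
            and parse_version(nccl) == parse_version(required_nccl))


def agent_memory_limit_mb(current_mb: int, extra_mb: int = 100) -> int:
    """Memory limit after enabling GPU topology (adds 100 MB per the guide)."""
    return current_mb + extra_mb


if __name__ == "__main__":
    print(topology_prerequisites_met("v3.10.0", "2.22.23"))
    print(agent_memory_limit_mb(350))
```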

## FAQ

Q: Where can I find the GPU diagnostics feature in the console?
A: Navigate to **Console > ECS > GPU Performance & Diagnostics > GPU Diagnostics**. Ensure your instance has SysOM installed and is in a supported region.

Q: Can I change the instance type or disk size after creating an ECS instance for AI workloads?
A: Disk size can often be expanded, but instance type changes may require stopping the instance or recreating it. For GPU/CPU-specific optimizations, it's best to select the correct type during initial creation.

Q: What happens if I don’t assign a public IP when deploying large models like Qwen-7B-Chat?
A: Without a public IP, you cannot download model files from public repositories. You’ll need alternative methods (e.g., internal mirrors or pre-loaded images), which complicates deployment.

Q: Do I need to install NVIDIA drivers manually for GPU-accelerated deployments?
A: Yes, for AC2 container deployments on GPU instances, you must install NVIDIA drivers and the NVIDIA Container Toolkit before running containers. The console only provisions the instance.

Q: Is the GPU resource topology feature available for single-instance deployments?
A: No, GPU resource topology is designed for **multi-node ACK GPU clusters** to analyze inter-GPU communication in distributed training scenarios.

## Pricing & Billing

### Billing Model
All ECS instances are billed **per instance hour**, with separate charges for compute, data disk storage, and outbound network traffic.

### Price Reference

| Instance Type | Price (per instance hour) |
|---------------|-------|
| ecs.g8a.4xlarge | 0.12 |
| ecs.g8i.4xlarge | 0.89 |
| ecs.gn6i-c4g1.xlarge | Specific price not listed; refer to official Alibaba Cloud pricing page |
| GPU-optimized instance (general) | 0.85 |

### Free Tier
- **AI-enhanced Alibaba Cloud Linux 3 images**: Free to use
- **GPU diagnostics and topology features**: Free system-level tools with no additional charge

### Billing Notes
- GPU instances incur higher hourly costs; release instances when not in use
- Public network traffic (outbound) is billed separately based on actual usage
- Data disk storage fees apply independently of instance runtime
- Model downloads do not incur extra fees beyond standard network and storage costs
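
The billing notes can be turned into a rough estimate by adding the three separately charged components. The sketch below is illustrative arithmetic only: the function and parameter names are hypothetical, the 0.85 hourly figure is the general GPU reference from the table above in whatever currency that table uses, and the disk and traffic rates are made-up placeholders that should be replaced with values from the official pricing page.

```python
def estimate_cost(hourly_price: float, compute_hours: float,
                  disk_gib: float = 0.0, disk_price_per_gib_hour: float = 0.0,
                  disk_hours: float = 0.0,
                  egress_gib: float = 0.0, egress_price_per_gib: float = 0.0) -> dict:
    """Sum compute, data disk, and outbound traffic charges.

    Disk hours are separate from compute hours because disk storage
    is charged independently of instance runtime.
    """
    compute = hourly_price * compute_hours
    storage = disk_gib * disk_price_per_gib_hour * disk_hours
    traffic = egress_gib * egress_price_per_gib
    return {
        "compute": compute,
        "storage": storage,
        "traffic": traffic,
        "total": compute + storage + traffic,
    }


if __name__ == "__main__":
    # Placeholder rates for disk and traffic; only 0.85 comes from the table.
    estimate = estimate_cost(0.85, compute_hours=10,
                             disk_gib=100, disk_price_per_gib_hour=0.0001, disk_hours=24,
                             egress_gib=5, egress_price_per_gib=0.1)
    for item, amount in estimate.items():
        print(f"{item}: {amount:.2f}")
```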