# airec-deployment

Part of **AIREC**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Troubleshoot AIRec deployment failures](../../intent/airec-troubleshoot-failure/SKILL.md). If you're unsure which path to take, check the routing skill first.

# AIRec System Deployment Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|----------|------------------|
| Deployment Fails Due to Missing Permissions | Error: `AccessDeniedException` or `UnauthorizedOperation` during resource creation | High | Assign required IAM roles to the deployment identity |
| Instance Startup Timeout | Error: `Deployment timed out after 30 minutes` | Medium | Increase deployment timeout setting in configuration |
| Invalid Configuration Parameters | Error: `ValidationError: Invalid value for parameter X` | High | Correct malformed or unsupported parameter values per schema |
| Service Health Check Fails Post-Deployment | Error: `Health check endpoint returns 503` | High | Verify dependencies (e.g., database, model store) are reachable and healthy |

## Problem Details

### Problem 1: Deployment Fails Due to Missing Permissions

**Symptoms**
- Error message: `AccessDeniedException: User is not authorized to perform: airec:CreateInstance`
- Behavior: Deployment process halts during infrastructure provisioning phase
- Context: Occurs when using a custom role or user without sufficient permissions

**Root Cause**
The identity (user, role, or service principal) executing the deployment lacks required permissions to create or manage AIRec resources, such as instances, networks, or storage buckets.

**Solution**
1. Attach the `AliyunAIRecFullAccess` managed policy to the deployment identity:
   ```bash
   aliyun ram AttachPolicyToUser --PolicyType System --PolicyName AliyunAIRecFullAccess --UserName <your-user>
   ```
2. If using a custom role, ensure it includes at minimum these actions:
   ```json
   {
     "Version": "1",
     "Statement": [
       {
         "Action": [
           "airec:*",
           "vpc:DescribeVpcs",
           "vpc:DescribeVSwitches",
           "oss:GetBucketLocation"
         ],
         "Resource": "*",
         "Effect": "Allow"
       }
     ]
   }
   ```
3. For deployments via Terraform or SDKs, verify the credential profile has active session tokens if using temporary credentials.

**Verification**
- Re-run the deployment command
- Expected behavior: Deployment proceeds past the resource creation stage without `AccessDeniedException`

### Problem 2: Instance Startup Timeout

**Symptoms**
- Error message: `Deployment timed out after 30 minutes`
- Behavior: Deployment status remains `Pending` indefinitely, then fails
- Context: Common in regions with high resource demand or when initializing large models

**Root Cause**
The default deployment timeout (30 minutes) is insufficient for the instance to complete initialization, especially when loading large recommendation models or connecting to remote data sources.

**Solution**
1. Locate your deployment configuration file (e.g., `deployment.yaml`)
2. Increase the `timeout` parameter under the `instance` section:
   ```yaml
   instance:
     type: standard
     timeout: 3600  # Set to 3600 seconds (60 minutes)
   ```
3. If using API-based deployment, include the timeout in the request body:
   ```json
   {
     "InstanceConfig": {
       "TimeoutInSeconds": 3600
     }
   }
   ```

**Verification**
- Monitor deployment logs via:
  ```bash
  aliyun airec DescribeInstanceLogs --InstanceId <id> --LogType Deployment
  ```
- Expected output: Log shows `Instance initialized successfully` within the new timeout window

### Problem 3: Invalid Configuration Parameters

**Symptoms**
- Error message: `ValidationError: Invalid value 'xyz' for parameter InstanceType. Valid values: [standard, high-memory]`
- Behavior: Deployment fails immediately during validation phase
- Context: Occurs when copying examples or migrating from other environments

**Root Cause**
The provided configuration contains parameters that violate the AIRec schema—either unsupported values, incorrect types, or deprecated fields.

**Solution**
1. Validate your configuration against the official schema:
   ```bash
   aliyun airec ValidateDeploymentConfig --ConfigFile ./deployment.yaml
   ```
2. Correct invalid fields using only allowed values from documentation:
   - `InstanceType`: must be `standard` or `high-memory`
   - `RegionId`: must match an [AIRec-supported region](https://help.aliyun.com/product/airec.html)
   - `ModelFormat`: must be `ONNX` or `PMML`
3. Remove any undocumented or custom top-level keys from the config file

**Verification**
- Run validation command again
- Expected output: `{"Valid": true, "Message": "Configuration is valid"}`

### Problem 4: Service Health Check Fails Post-Deployment

**Symptoms**
- Error message: `Health check endpoint /health returns HTTP 503`
- Behavior: Instance appears running but rejects inference requests
- Context: After successful deployment, during integration testing

**Root Cause**
One or more critical dependencies (e.g., feature store, model registry, or metadata database) are unreachable due to network misconfiguration, authentication failure, or service downtime.

**Solution**
1. Check connectivity to dependent services:
   ```bash
   # Test OSS bucket access (example)
   aliyun oss ls oss://<your-model-bucket> --region <region>
   ```
2. Verify VPC and security group settings allow outbound traffic to dependency endpoints
3. Confirm credentials for external stores (e.g., AccessKey for OSS) are correctly configured in the instance’s environment variables or secret manager
4. Restart the instance to reload configurations:
   ```bash
   aliyun airec RestartInstance --InstanceId <id>
   ```

**Verification**
- Query the health endpoint directly:
  ```bash
  curl -f http://<instance-endpoint>/health
  ```
- Expected response: HTTP 200 with body `{"status": "healthy"}`

## FAQ

**Q: How do I check if the AIRec deployment is progressing normally?**  
A: Use `aliyun airec DescribeInstance --InstanceId <id>` and monitor the `Status` field. Valid progression: `Creating` → `Starting` → `Running`. Also inspect logs with `DescribeInstanceLogs`.

**Q: What permissions are absolutely required for basic AIRec deployment?**  
A: At minimum, the deploying identity needs `airec:CreateInstance`, `airec:DescribeInstances`, `vpc:DescribeVpcs`, `vpc:DescribeVSwitches`, and `oss:GetBucketLocation`. The managed policy `AliyunAIRecFullAccess` covers all required actions.

**Q: Where can I find detailed error logs for a failed deployment?**  
A: Retrieve deployment logs using the CLI: `aliyun airec DescribeInstanceLogs --InstanceId <id> --LogType Deployment`. Logs include timestamped events and error stack traces.

**Q: Can I roll back a failed deployment automatically?**  
A: AIRec does not support automatic rollback. However, you can delete the partially created instance (`aliyun airec DeleteInstance`) and redeploy with corrected configuration. Ensure cleanup of associated resources (e.g., VPC endpoints) if needed.

**Q: Why does my deployment succeed but the instance never becomes ready?**  
A: This usually indicates a runtime dependency issue—such as an inaccessible model file in OSS or invalid database credentials. Check the `Runtime` log type (`--LogType Runtime`) for connection errors or model loading failures.