# airec-instance

# AIRec Instance Management Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|---------|----------|------------------|
| Resource Conflict During Deployment | Error: `NOT YET FOUND` or `PrepareResource failed, new resource Invalid check failed` | High | Check for duplicate service deployments and remove conflicting instances |
| Service Fails to Reach Desired State | Error: `ServiceNotInDesiredState` or `RollingTaskFailed`; rolling task stuck | High | Inspect operation logs, verify server role status, and terminate outdated rolling tasks |
| Installation Progress Stuck | Installation halts at specific percentage (e.g., 30%, 70%) | Medium | Diagnose PXE boot, diskless system, or network issues using dmesg and DNS checks |
| Password-Based Logon Denied for Linux Instances | Error: `OperationDenied` when creating an ECS instance with a password | Low | Use SSH key pairs instead; the denial is expected when a RAM policy denies requests in which `ecs:PasswordCustomized` is true |

## Problem Details

### Problem 1: Resource Conflict During Deployment

**Symptoms**
- Error message: `NOT YET FOUND`
- Error message: `PrepareResource failed, new resource Invalid check failed`
- Behavior: Deployment fails during resource allocation phase
- Context: Occurs when deploying a new service or updating an existing one in the Apsara Infrastructure Management Framework

**Root Cause**
- The required resources (e.g., ports, IPs, machine slots) are already claimed by another identical or conflicting service instance.
- Common when redeploying without cleaning up previous deployments or when multiple teams deploy overlapping configurations.

**Solution**
1. Log on to the Apsara Infrastructure Management Framework console.
2. Navigate to **Operations > Cluster Operations**.
3. Identify any existing service instances with the same name or configuration.
4. If a duplicate exists, delete or undeploy it:
   ```bash
   # Illustrative only: `airec-cli` may not exist in your environment.
   # Substitute the undeploy command or console action your deployment provides.
   airec-cli undeploy --service-name <conflicting-service>
   ```
5. Retry the original deployment after confirming no conflicts remain.
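
Step 3 can be scripted when the service instance names are available as plain text. The sketch below assumes a hypothetical export with one service name per line (this guide does not define an export format) and prints every name that appears more than once. The function name and sample service names are illustrative.

```bash
# find_duplicate_services: read service instance names (one per line) on stdin
# and print each name that occurs more than once.
# Assumption: the input is a hypothetical plain-text export, not a documented format.
find_duplicate_services() {
  # Drop blank lines, sort so equal names are adjacent, then report repeats.
  grep -v '^[[:space:]]*$' | sort | uniq -d
}

# Usage with inline sample data (service names are illustrative):
printf '%s\n' airec-web airec-worker airec-web | find_duplicate_services
```

Any name printed by the function is a candidate for step 4. Names that differ only in case or whitespace are treated as distinct, so normalize the input first if your environment is not strict about naming.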

**Verification**
- Re-run the deployment command or UI action.
- Confirm the deployment progresses beyond the resource allocation stage.
- Check the **Cluster Dashboard** for successful resource binding.

### Problem 2: Service Fails to Reach Desired State

**Symptoms**
- Error message: `ServiceNotInDesiredState`
- Error message: `RollingTaskFailed`
- Behavior: Service remains in `UPDATING` or `FAILED` state indefinitely
- Context: After initiating a version update or cluster upgrade

**Root Cause**
- A rolling task is stuck or outdated, blocking state convergence.
- Dependent server roles are not in `GOOD` state (e.g., due to `fail to update target version`).
- Post-check scripts fail (`PostCheckDetail:ExitedCode is 1`).
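
To confirm the post-check cause quickly, log text can be filtered for nonzero exit codes. This is a minimal sketch that assumes only that failing lines contain the phrase `ExitedCode is N`, as in the error above; the rest of the log layout and the function name are assumptions.

```bash
# find_failed_postchecks: read log text on stdin and print lines that report a
# nonzero post-check exit code.
# Assumption: failing lines contain "ExitedCode is N"; the surrounding log
# layout is unknown, so nothing else is parsed.
find_failed_postchecks() {
  grep -E 'ExitedCode is [1-9][0-9]*'
}

# Usage with inline sample lines:
printf '%s\n' \
  'PostCheckDetail:ExitedCode is 0' \
  'PostCheckDetail:ExitedCode is 1' | find_failed_postchecks
```

The function exits with a nonzero status when no failing line is found, which makes it usable in a shell conditional.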

**Solution**
1. Log on to the Apsara Infrastructure Management Framework console.
2. Go to **Operations > Cluster Operations**.
3. Click the cluster name to open the **Cluster Dashboard**.
4. From the **Operation Menu**, select **Operation Logs** to inspect recent failures.
5. Check the **Service Instances** section for services that are not healthy.
6. Click **Details** for the affected service, then review the **Server Role List**.
7. For any machine not in desired state, click **Details** to view the **Server Role Dashboard**.
8. Verify:
   - Target version matches current version
   - No error messages in logs
   - Service status is active
9. If a rolling task is stuck:
   - Terminate it via the console
   - Retry the deployment or update operation

**Verification**
- On the **Server Role Dashboard**, confirm:
  - "Current Version" = "Target Version"
  - Status shows "GOOD" or "RUNNING"
  - No error banners or alerts
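
When many machines are involved, the version and status comparison above can be automated. The sketch below assumes a hypothetical four-column text layout (`machine current_version target_version status`), for example values copied from the Server Role Dashboard into a file. The layout and the function name are assumptions, not a documented export.

```bash
# check_desired_state: read "machine current_version target_version status"
# rows on stdin and print each machine that has not converged.
# Assumption: this four-column layout is hypothetical.
check_desired_state() {
  awk '
    NF < 4 { next }   # skip blank or incomplete rows
    # Appending "" forces a string comparison, so 1.0 and 1.00 are not equal.
    ($2 "") != ($3 "") || ($4 != "GOOD" && $4 != "RUNNING") {
      printf "%s current=%s target=%s status=%s\n", $1, $2, $3, $4
      bad = 1
    }
    END { exit (bad ? 1 : 0) }   # exit 1 if any machine is off target
  '
}

# Usage with inline sample rows:
check_desired_state <<'EOF' || echo "At least one machine has not converged"
m01 v2.1 v2.1 GOOD
m02 v2.0 v2.1 UPDATING
EOF
```

The exit status mirrors the verification criteria: zero only when every row has matching versions and a status of `GOOD` or `RUNNING`.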

### Problem 3: Installation Progress Stuck

**Symptoms**
- Installation halts at a fixed percentage (e.g., 30%, 50%, 70%)
- Behavior: No progress for more than 10 minutes during initial system setup
- Context: During first-time deployment of diskless or PXE-booted instances

**Root Cause**
- Network connectivity issues preventing PXE boot or image download
- Hard drive failure or misconfiguration on the target machine
- DNS resolution failure for internal deployment services
- Kernel panic or hardware incompatibility logged in early boot

**Solution**
1. Access the physical or virtual machine console.
2. Run the following to inspect early boot logs:
   ```bash
   # Kernel log lines mentioning errors, failures, or panics (may require root)
   dmesg | grep -iE 'error|fail|panic'
   ```
3. Verify network connectivity:
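   ```bash
   # Send four probes and stop; a bare ping runs until interrupted on Linux
   ping -c 4 <deployment-server-ip>
   # Confirm the internal deployment hostname resolves
   nslookup ais-deploy.internal
   ```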
4. Ensure the machine is configured for PXE boot in BIOS/UEFI.
5. Confirm the hard drive is detected and healthy (replace if faulty).
6. If using DHCP, validate IP assignment:
   ```bash
   ip addr show
   ```
7. Retry the installation after resolving the underlying hardware or network issue.
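
The network checks in steps 3 and 6 can be bundled into one pass/fail report. This is a sketch under stated assumptions: the function names are hypothetical, `getent hosts` stands in for `nslookup` because its exit status reliably signals a failed lookup, and the hostname defaults to `ais-deploy.internal` from step 3. It requires the Linux `ping` and `ip` utilities.

```bash
# run_check LABEL CMD [ARGS...]: run a command quietly and report PASS or FAIL.
run_check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# has_global_ipv4: succeed if any interface holds a non-loopback IPv4 address.
has_global_ipv4() {
  ip -4 -o addr show scope global | grep -q 'inet '
}

# diagnose_install SERVER_IP [DEPLOY_HOST]: run the network checks in sequence.
# Assumption: DEPLOY_HOST defaults to the hostname used in step 3.
diagnose_install() {
  local server_ip="$1" deploy_host="${2:-ais-deploy.internal}" failed=0
  run_check "ping $server_ip" ping -c 4 -W 2 "$server_ip" || failed=1
  # getent exits nonzero when the name does not resolve.
  run_check "resolve $deploy_host" getent hosts "$deploy_host" || failed=1
  run_check "IPv4 address assigned" has_global_ipv4 || failed=1
  return "$failed"
}

# Usage (replace the placeholder with your deployment server address):
# diagnose_install <deployment-server-ip>
```

Every check runs even if an earlier one fails, so a single run shows whether the problem is reachability, name resolution, or address assignment.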

**Verification**
- Installation resumes and completes successfully
- Machine reboots into the deployed OS
- Service appears as healthy in the Cluster Dashboard

### Problem 4: Password-Based Logon Denied for Linux Instances

**Symptoms**
- Error message: `OperationDenied`
- Behavior: RAM user cannot create ECS instance with custom password
- Context: Attempting to launch Linux instance via console or API with password authentication

**Root Cause**
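- A custom RAM policy explicitly denies instance creation when `ecs:PasswordCustomized` is true, that is, when the request sets a custom password.
- The policy enforces a security best practice: Linux instances must be accessed with SSH key pairs rather than passwords.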

**Solution**
1. When creating an instance, use **SSH key pair** authentication instead of password:
   - In the ECS console, under **Logon Credentials**, select **Key Pair**
2. If you must reset credentials, use **Session Manager** for secure access without passwords.
3. To modify policy (admin only):
   - Go to **RAM Console > Policies**
   - Edit the policy that contains the deny statement (`ecs-password-control` in this example) and adjust its conditions. This is not recommended for production.

Example policy statement that causes the denial:
```json
{
  "Effect": "Deny",
  "Action": "ecs:RunInstances",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "ecs:PasswordCustomized": "true"
    }
  }
}
```
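
Before editing anything, it helps to identify which policies carry this condition. The sketch below assumes the policy documents have already been saved locally as JSON files. It searches them for the condition key as a fixed string and does not parse the JSON, so treat matches as candidates to review. The function name and file paths are hypothetical.

```bash
# policies_with_password_condition FILE...: print the name of each policy file
# that references the ecs:PasswordCustomized condition key.
# Assumption: policy documents were saved locally as JSON files beforehand.
policies_with_password_condition() {
  # Without file arguments grep would wait on stdin, so refuse to run.
  [ "$#" -gt 0 ] || { echo "usage: policies_with_password_condition FILE..." >&2; return 2; }
  # -l prints only matching file names; -F treats the key as a fixed string.
  grep -lF '"ecs:PasswordCustomized"' -- "$@"
}

# Usage (the directory is illustrative):
# policies_with_password_condition ./policies/*.json
```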

**Verification**
- Instance creation succeeds when using SSH key pair
- Attempting password-based creation returns `OperationDenied` (expected behavior under policy)

## FAQ

**Q: How do I check if a service has reached its desired state?**  
A: Navigate to the **Cluster Dashboard** > **Service Instances** > click **Details** for the service > review the **Server Role List**. Each machine should show matching "Current Version" and "Target Version" with no error messages.

**Q: What permissions are required to manage AIRec instances?**  
A: Users need appropriate RAM policies granting `ecs:RunInstances`, `airec:DeployService`, and `airec:ViewCluster` actions. Custom policies may restrict password usage via the `ecs:PasswordCustomized` condition.

**Q: Where can I find detailed error logs for a failed deployment?**  
A: In the Apsara Infrastructure Management Framework console, go to **Operations > Cluster Operations** > select your cluster > **Operation Menu** > **Operation Logs**. Additionally, check post-check logs on the affected machine for `ExitedCode is 1` errors.

**Q: Why does my cluster upgrade fail with "upgrade result failed"?**  
A: This typically indicates a resource application error during the upgrade. Check for resource conflicts, ensure all dependent server roles are in `GOOD` state, and verify no duplicate services are deployed.

**Q: How do I recover from a stuck rolling update?**  
A: Terminate the outdated rolling task from the **Operation Logs** page in the console, then retry the deployment. Ensure no manual changes conflict with the target configuration before restarting.