# ess-instance

Part of **ESS**

# Auto Scaling Instance Management Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|---------|---------|----------|------------------|
| Alarm task shows "insufficient data" | Alarm task state is `INSUFFICIENT_DATA` | Medium | Install or restart Cloud Monitor agent on ECS instances |
| Instances removed due to overdue payment | Pay-as-you-go or preemptible instances stopped and released automatically | High | Recharge account balance and verify instance health |
| Scaling activity fails with parameter conflict | Error: `The value of parameter X and parameter Y are conflict` | Medium | Remove conflicting parameters (e.g., LaunchTemplateId vs ScalingConfigurationId) |
| Scaling activity rejected due to running InstanceRefresh | Error: `This operation cannot be performed because a InstanceRefresh task is running` | Medium | Wait for InstanceRefresh to complete before triggering new scaling activities |
| Health check fails with 404 error | Error: `Not Found: The specified auto scaling group or instance does not exist` | High | Verify scaling group ID and instance ID are correct and active |

## Problem Details

### Problem 1: Alarm task shows "insufficient data"

**Symptoms**
- Error message: `INSUFFICIENT_DATA`
- Behavior: Alarm task does not trigger scaling actions despite metric thresholds being crossed
- Context: Occurs when Cloud Monitor cannot collect metrics from ECS instances in the scaling group

**Root Cause**
- The Cloud Monitor agent (Argus Agent) is not installed or not running properly on ECS instances
- Without the agent, system metrics like CPU utilization cannot be reported to Cloud Monitor, causing alarm tasks to enter the "insufficient data" state

**Solution**
1. Log on to the ECS console at https://ecs.console.aliyun.com
2. Navigate to **Auto Scaling > Scaling Groups**
3. Find your scaling group and click its ID to open the details page
4. Click the **Instances** tab and note an ECS instance ID
5. Log on to the Cloud Monitor console at https://cloudmonitornext.console.aliyun.com/
6. In the left-side navigation pane, click **Host Monitoring**
7. Enter the ECS instance ID in the search box
8. Check the **Argus Agent Status** column:
   - If the agent is not installed, click the installation icon to install it automatically
   - If the agent is installed but shows abnormal status, follow the guide to restart it

```bash
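# Manually restart the Cloud Monitor agent on Linux.
# The binary name depends on the CPU architecture and agent version;
# adjust the path if this file does not exist on your instance.
sudo /usr/local/cloudmonitor/CmsGoAgent.linux-amd64 restart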
```

**Verification**
- Return to the Cloud Monitor Host Monitoring page and confirm the agent status shows as **Active**
- Wait 2-3 minutes for metrics to appear
- Verify in the Auto Scaling console that the alarm task state changes from `INSUFFICIENT_DATA` to `ALARM` or `OK`
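
When a scaling group has many alarm tasks, it can help to list the ones stuck in `INSUFFICIENT_DATA` before checking agents instance by instance. The following is a minimal sketch of that filter. The record shape (dicts with `Name` and `State` keys) and the function name `alarms_missing_data` are assumptions for illustration, not a documented API response format; adapt the keys to whatever your tooling returns.

```python
from typing import Dict, Iterable, List

# Alarm task state discussed in this section.
INSUFFICIENT_DATA = "INSUFFICIENT_DATA"


def alarms_missing_data(alarms: Iterable[Dict[str, str]]) -> List[str]:
    """Return the names of alarm tasks whose state is INSUFFICIENT_DATA.

    Each record is assumed (hypothetically) to be a dict with "Name" and
    "State" keys. Records without a "State" key are skipped.
    """
    return [
        alarm.get("Name", "<unnamed>")
        for alarm in alarms
        if alarm.get("State") == INSUFFICIENT_DATA
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not real output.
    sample = [
        {"Name": "cpu-high", "State": "INSUFFICIENT_DATA"},
        {"Name": "mem-high", "State": "OK"},
        {"Name": "net-in", "State": "ALARM"},
    ]
    print(alarms_missing_data(sample))
```

Any alarm task the function returns is a candidate for the agent check described in the Solution steps.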

### Problem 2: Instances removed due to overdue payment

**Symptoms**
- Error message: `OverduePayment`
- Behavior: Pay-as-you-go or preemptible ECS instances are stopped and automatically removed from the scaling group
- Context: Occurs when the account balance is insufficient to cover ongoing instance charges

**Root Cause**
- When an account has overdue payments, the system stops pay-as-you-go and preemptible instances
- Auto Scaling detects these stopped instances as unhealthy during health checks
- Unhealthy instances are automatically removed and released according to scaling group policies

**Solution**
1. Recharge your account to clear the overdue balance
2. Verify that your account has sufficient funds for ongoing instance usage
3. Check the scaling group's health check settings to understand removal behavior
4. Consider using subscription instances for critical workloads that shouldn't be interrupted

**Verification**
- After recharging, check the **Billing Management** console to confirm account status is normal
- Monitor the scaling group's **Instances** tab to see if new instances are launched during the next scale-out event
- Review instance lifecycle events in the **Scaling Activities** tab to confirm no further removals due to health check failures

### Problem 3: Scaling activity fails with parameter conflict

**Symptoms**
- Error message: `The value of parameter LaunchTemplateId and parameter ScalingConfigurationId are conflict`
- Error message: `The value of parameter ImageId and parameter LaunchTemplateId are conflict`
- Behavior: The API request returns an HTTP 400 error when creating scaling configurations or triggering scaling activities
- Context: Occurs when mutually exclusive parameters are specified together in API requests

**Root Cause**
- Auto Scaling does not allow certain parameter combinations that would create ambiguous instance configurations
- Common conflicts include:
  - Specifying both `LaunchTemplateId` and `ScalingConfigurationId`
  - Specifying both `ImageId` and `LaunchTemplateId` (since launch templates already contain image information)
  - Specifying both `ScalingConfigurationId` and `InstanceTypeOverrides`

**Solution**
1. Choose one configuration source method and remove conflicting parameters:
   - Use **launch templates** OR **scaling configurations**, not both
   - If using launch templates, do not specify `ImageId`, `InstanceType`, or other parameters already defined in the template
2. Review your API request payload or console configuration to ensure only one configuration source is used

Correct, using only `LaunchTemplateId`:

```json
{
  "LaunchTemplateId": "lt-xxxxxxxxx",
  "MaxSize": 10,
  "MinSize": 2
}
```

Incorrect, mixing `LaunchTemplateId` and `ScalingConfigurationId`:

```json
{
  "LaunchTemplateId": "lt-xxxxxxxxx",
  "ScalingConfigurationId": "asc-xxxxxxxxx"
}
```
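
A request can be checked for these conflicts on the client side before it is sent. The sketch below encodes only the three conflicting pairs listed under Root Cause; the function name `find_conflicts` is hypothetical, and the service may reject other combinations that this list does not cover.

```python
from typing import Iterable, List, Tuple

# Mutually exclusive parameter pairs listed in the Root Cause section.
CONFLICTING_PAIRS: Tuple[Tuple[str, str], ...] = (
    ("LaunchTemplateId", "ScalingConfigurationId"),
    ("ImageId", "LaunchTemplateId"),
    ("ScalingConfigurationId", "InstanceTypeOverrides"),
)


def find_conflicts(param_names: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every known conflicting pair present in the given parameter names."""
    present = set(param_names)
    return [pair for pair in CONFLICTING_PAIRS if present.issuperset(pair)]


if __name__ == "__main__":
    # Iterating over a dict yields its keys, so a request payload can be passed directly.
    request = {
        "LaunchTemplateId": "lt-xxxxxxxxx",
        "ScalingConfigurationId": "asc-xxxxxxxxx",
    }
    for first, second in find_conflicts(request):
        print(f"Conflict: remove either {first} or {second}")
```

An empty result means none of the known pairs are present; it does not guarantee that the request will be accepted.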

**Verification**
- Resubmit the API request with corrected parameters
- Expect an HTTP 200 response confirming that the operation succeeded
- Check the scaling group's configuration history to confirm the new settings were applied

### Problem 4: Scaling activity rejected due to running InstanceRefresh

**Symptoms**
- Error message: `This operation cannot be performed because a InstanceRefresh task is running`
- Behavior: Scale-out or scale-in requests are rejected with an HTTP 400 error
- Context: Occurs when attempting to trigger scaling activities while an InstanceRefresh operation is in progress

**Root Cause**
- InstanceRefresh is a process that replaces existing instances with new ones based on updated configuration
- During InstanceRefresh, the scaling group is locked to prevent conflicting operations that could interfere with the refresh process

**Solution**
1. Check the current status of InstanceRefresh tasks:
   - In the Auto Scaling console, go to the scaling group details page
   - Look for **Instance Refresh** in the navigation tabs
2. Wait for the InstanceRefresh task to complete (status becomes **Successful** or **Failed**)
3. Only then trigger new scaling activities or modify scaling group parameters

```bash
# Check InstanceRefresh status via CLI (example)
aliyun ess DescribeInstanceRefreshes --ScalingGroupId asg-xxxxxxxxx
```
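
In automation, the "wait, then retry" step can be written as a polling loop. The sketch below is one way to do that. It takes a caller-supplied `get_status` function, so how the status is fetched (CLI, SDK, or anything else) is left open. The function name `wait_for_refresh`, the default intervals, and the assumption that only **Successful** and **Failed** are terminal statuses are illustrative; the service may report other statuses that you need to handle.

```python
import time
from typing import Callable

# Terminal statuses named in this guide; the service may define others.
TERMINAL_STATUSES = {"Successful", "Failed"}


def wait_for_refresh(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status, then return it.

    Raises TimeoutError if waiting another poll interval would exceed
    timeout_seconds.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() + poll_seconds > deadline:
            raise TimeoutError(f"InstanceRefresh still {status!r} at timeout")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical status sequence standing in for real lookups.
    fake_statuses = iter(["InProgress", "InProgress", "Successful"])
    result = wait_for_refresh(lambda: next(fake_statuses), poll_seconds=0)
    print(result)
```

Passing the status lookup and the `sleep` function as arguments keeps the loop testable without real API calls. If the returned status is **Failed**, investigate the refresh task before retrying the scaling activity.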

**Verification**
- Confirm InstanceRefresh status is no longer **InProgress**
- Retry the original scaling activity request
- Monitor the **Scaling Activities** tab to see the new activity proceed normally

### Problem 5: Health check fails with 404 error

**Symptoms**
- Error message: `404 Not Found: The specified auto scaling group or instance does not exist`
- Behavior: Health diagnosis API calls fail, and instances appear to be missing from the scaling group
- Context: Occurs when referencing invalid or deleted scaling group/instance IDs

**Root Cause**
- The specified scaling group ID or instance ID does not exist in the current region
- Possible causes:
  - Typo in the ID
  - Scaling group was deleted
  - Request sent to wrong region
  - Instance was already removed from the scaling group

**Solution**
1. Verify the scaling group ID exists:
   - Check the Auto Scaling console for the correct ID
   - Ensure you're operating in the correct region
2. If checking instance health, confirm the instance is still part of the scaling group:
   - View the **Instances** tab in the scaling group details
   - Note that instances removed due to health checks or scale-in events will no longer be associated
3. Update your API calls or scripts with correct, current IDs

```bash
# List scaling groups to verify existence
aliyun ess DescribeScalingGroups --RegionId cn-hangzhou

# List instances in a specific scaling group
aliyun ess DescribeScalingInstances --ScalingGroupId asg-xxxxxxxxx
```
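
Once you have the scaling group and instance listings, you can narrow down the possible causes from the Root Cause list mechanically. The sketch below does this over plain Python data. The function name `diagnose_not_found`, its inputs, and the returned messages are all hypothetical; it assumes you have already collected the scaling group IDs per region and the instance IDs of the target group.

```python
from typing import Dict, Optional, Set


def diagnose_not_found(
    group_id: str,
    region: str,
    groups_by_region: Dict[str, Set[str]],
    instance_id: Optional[str] = None,
    group_instances: Optional[Set[str]] = None,
) -> str:
    """Suggest the most likely cause of a 'does not exist' error."""
    if group_id in groups_by_region.get(region, set()):
        # The group exists in the requested region; check instance membership.
        if instance_id is not None and instance_id not in (group_instances or set()):
            return "instance is not (or no longer) in the scaling group"
        return "IDs look valid; retry the request"
    # The group is missing from the requested region; look in the other regions.
    other_regions = sorted(
        r for r, ids in groups_by_region.items() if r != region and group_id in ids
    )
    if other_regions:
        return "scaling group exists in another region: " + ", ".join(other_regions)
    return "scaling group not found in any listed region (typo or deleted)"


if __name__ == "__main__":
    # Hypothetical listings, not real output.
    groups = {"cn-hangzhou": {"asg-aaa"}, "cn-beijing": {"asg-bbb"}}
    print(diagnose_not_found("asg-bbb", "cn-hangzhou", groups))
```

The result is only a hint. A group that is absent from every listed region could still exist in a region you did not query.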

**Verification**
- API calls with verified IDs return HTTP 200 with expected data
- Console shows the scaling group and instances as active
- Health check operations complete successfully

## FAQ

**Q: How do I check if my Auto Scaling service is working properly?**
A: Verify by checking three areas: (1) Scaling group status should be **Active**, (2) Instances tab should show healthy instances, and (3) Scaling Activities tab should show recent successful activities. Also ensure Cloud Monitor agents are running on instances for alarm-based scaling.

**Q: What permissions are required to manage Auto Scaling instances?**
A: You need RAM permissions for Auto Scaling actions like `ess:DescribeScalingInstances`, `ess:EnterStandby`, `ess:ExitStandby`, and `ess:RemoveInstances`. For full instance management, include ECS permissions like `ecs:DescribeInstances` and `ecs:StopInstance`. Custom policies can restrict access by resource tags or regions.
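
As an illustration, the actions named in this answer can be assembled into a custom policy document. The sketch below builds one as a Python dict and prints it as JSON. The helper name `build_policy` is hypothetical, and `"Resource": "*"` is a placeholder assumption; narrow it to specific resources or add conditions according to your own access requirements.

```python
import json
from typing import Dict, List

# Actions named in the answer above.
ESS_ACTIONS = [
    "ess:DescribeScalingInstances",
    "ess:EnterStandby",
    "ess:ExitStandby",
    "ess:RemoveInstances",
]
ECS_ACTIONS = ["ecs:DescribeInstances", "ecs:StopInstance"]


def build_policy(actions: List[str], resource: str = "*") -> Dict[str, object]:
    """Build a single-statement Allow policy document for the given actions."""
    return {
        "Version": "1",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": resource}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(ESS_ACTIONS + ECS_ACTIONS), indent=2))
```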

**Q: Why are my instances not being added to the SLB after scale-out?**
A: Verify that: (1) The scaling group is properly associated with an SLB instance, (2) The SLB backend server group type is "Instance" (not IP), (3) Instances pass health checks, and (4) There are no network connectivity issues between instances and SLB.

**Q: How can I prevent accidental instance removal during scale-in?**
A: Use instance protection by calling the `SetInstancesProtection` API operation (RAM action `ess:SetInstancesProtection`) to protect specific instances, or configure removal policies that prioritize older or less critical instances. You can also use lifecycle hooks to implement custom validation before instance termination.

**Q: What causes timeout errors during scaling activities?**
A: Common causes include: (1) Insufficient ECS instance quota in the region, (2) Launch template referencing unavailable resources (like deleted images), (3) Security group rules blocking necessary traffic, or (4) vSwitch IP address exhaustion. Check scaling activity details for specific error messages.