# Elasticsearch Monitoring and Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|--------|--------|---------|------------------|
| Instance or cluster initialization in progress | `The instance is being created.` or `Cluster is being initialized.` | Medium | Wait for the current operation to complete before proceeding |
| Resource limit exceeded | `Maximum number of indexes has been exceeded.` or `Data source limit exceeded.` | High | Delete unused resources or upgrade your plan/quota |
| Invalid request parameters | `Invalid request parameter.` or `Invalid data source type.` | Medium | Validate all input against documentation and correct syntax |
| Connection pool timeout in Java SDK | `ConnectionPoolTimeoutException` | High | Increase max connections using `HttpClientManager.setMaxConnections()` |
| Model training or prediction failure | `The model failed to be trained.` or `Model prediction failed.` | High | Verify data quality, model state, and input parameters |

## Problem Details

### Problem 1: Instance or Cluster Initialization In Progress

**Symptoms**
- Error message: `The instance is being created. Wait for completion before proceeding.`
- Error message: `Cluster is being initialized. Wait for initialization to complete.`
- Behavior: API requests fail with 400 error during creation or setup phase
- Context: Occurs immediately after initiating instance or cluster creation

**Root Cause**
- Elasticsearch requires time to provision infrastructure and initialize services
- Concurrent operations are blocked until the initial setup completes successfully

**Solution**
1. Wait for the current initialization process to finish
2. Monitor status through the console or by polling the instance/cluster status API
3. Avoid submitting additional modification requests during this period
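
The wait-and-poll steps above could be scripted roughly as follows. This is a minimal sketch, not an official client: `get_status` is a hypothetical callable that you would implement against your own instance or cluster status API, the `"active"` ready state is taken from the expected response in the Verification notes, and the interval and timeout defaults are illustrative assumptions.

```python
import time
from typing import Callable, FrozenSet


def wait_until_ready(
    get_status: Callable[[], str],  # hypothetical: returns the current status string
    ready_states: FrozenSet[str] = frozenset({"active"}),
    interval_s: float = 10.0,       # assumed polling interval
    timeout_s: float = 1800.0,      # assumed upper bound on provisioning time
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it reports a ready state or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in ready_states:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"status is still {status!r} after {waited:.0f}s")
        # Do not send modification requests here; just wait and re-check.
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.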

**Verification**
- Query the instance or cluster status endpoint
- Expected response: `"status": "active"` or equivalent ready state
- Subsequent API calls should succeed without 400 initialization errors

### Problem 2: Resource Limit Exceeded

**Symptoms**
- Error message: `Maximum number of indexes has been exceeded. Delete unused indexes or upgrade plan.`
- Error message: `Data source limit exceeded. Remove unused sources or upgrade quota.`
- Error message: `The sum of start and hit exceeds 5000.`
- Behavior: New resource creation fails despite valid configuration

**Root Cause**
- Account or instance has reached predefined quotas for indexes, data sources, or query depth
- These limits prevent resource exhaustion and ensure system stability

**Solution**
1. For index/data source limits:
   ```bash
   # List existing indexes
   curl -X GET "localhost:9200/_cat/indices?v"
   # Delete unused indexes
   curl -X DELETE "localhost:9200/unused_index_name"
   ```
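2. For query pagination limits (`start + hit > 5000`):
   - Use scroll search instead of deep pagination. In Elasticsearch, the scroll keep-alive is passed as a URL parameter (for example `POST /<index>/_search?scroll=1m`), so the request body only carries the query and page size: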
   ```json
   {
     "query": { "match_all": {} },
     "size": 1000
   }
   ```
3. If cleanup isn't sufficient, upgrade your service plan to increase quotas
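
Before a paginated request is sent, the `start + hit` limit can be checked on the client side. The sketch below assumes the limit of 5000 quoted in the error message; the function names `check_page` and `page_offsets` are illustrative and not part of any SDK.

```python
from typing import Iterator

MAX_WINDOW = 5000  # limit quoted in the error message: start + hit must not exceed it


def check_page(start: int, hit: int, max_window: int = MAX_WINDOW) -> None:
    """Raise ValueError if the requested page would exceed the pagination window."""
    if start < 0 or hit <= 0:
        raise ValueError("start must be >= 0 and hit must be > 0")
    if start + hit > max_window:
        raise ValueError(
            f"start + hit = {start + hit} exceeds {max_window}; "
            "switch to scroll search for deeper results"
        )


def page_offsets(hit: int, max_window: int = MAX_WINDOW) -> Iterator[int]:
    """Yield every start offset that stays inside the window for a given page size."""
    start = 0
    while start + hit <= max_window:
        yield start
        start += hit
```

With a page size of 1000, only five pages are reachable through plain pagination; anything beyond that needs scroll search.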

**Verification**
- After cleanup, attempt to create the new resource again
- For scroll search, confirm results return without error code 6013
- Check quota usage in the console to ensure you're below limits

### Problem 3: Invalid Request Parameters

**Symptoms**
- Error message: `Invalid request parameter. Review request body for correct syntax and values.`
- Error message: `Invalid data source type. Verify the data source type matches supported formats.`
- Error message: `Failed to parse JSON. The JSON field may contain unescaped double quotation marks...`
- Behavior: API returns 400 error with validation failure

**Root Cause**
- Request contains malformed JSON, invalid field types, unsupported values, or incorrect structure
- Common causes include unescaped quotes, wrong data types, or missing required fields

**Solution**
1. Validate JSON syntax using a linter or online validator
2. Escape special characters properly:
   ```json
   {
     "description": "This field contains \"quoted\" text"
   }
   ```
3. Verify all parameter values against API documentation
4. For data sources, confirm type matches supported formats (e.g., MySQL, Kafka)
5. Check naming rules: avoid reserved words and invalid characters
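
One way to avoid hand-escaping mistakes is to build the request body with a JSON library rather than string concatenation. The sketch below uses Python's standard `json` module; the `description` field mirrors the example above, and the helper name `is_valid_json` is illustrative.

```python
import json

# Build the payload as a native dict; json.dumps escapes the inner quotes.
payload = {"description": 'This field contains "quoted" text'}
body = json.dumps(payload)


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# A hand-written body with unescaped inner quotes fails this check.
broken = '{"description": "This field contains "quoted" text"}'
```

Running `is_valid_json` on a request body before sending it catches unescaped quotation marks locally instead of through a 400 response.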

**Verification**
- Resubmit the corrected request
- Expected result: 200 OK or successful operation response
- No validation error messages in response body

### Problem 4: Connection Pool Timeout in Java SDK

**Symptoms**
- Error message: `ConnectionPoolTimeoutException`
- Behavior: Java application fails to establish new connections under load
- Context: Occurs when making many concurrent requests using OpenSearch SDK for Java

**Root Cause**
- Default connection pool size (50 connections) is insufficient for high-concurrency workloads
- New connection requests time out waiting for available slots in the pool

**Solution**
1. Increase the maximum connection pool size in your Java application:
   ```java
   import com.aliyun.opensearch.util.HttpClientManager;
   
   // Set max connections to 100 (adjust based on needs)
   HttpClientManager.setMaxConnections(100);
   ```

**Verification**
- Run load test with same concurrency level as before
- Confirm no `ConnectionPoolTimeoutException` occurs
- Monitor connection metrics to ensure pool size is adequate

### Problem 5: Model Training or Prediction Failure

**Symptoms**
- Error message: `The model failed to be trained. The next operation cannot be performed.`
- Error message: `The data required for model training is not ready.`
- Error message: `The number of page views (PVs) on the last day is insufficient.`
- Behavior: Model operations block subsequent actions

**Root Cause**
- Insufficient or poor-quality training data
- Model is in an intermediate state (training/predicting) and can't accept new commands
- Input parameters don't meet validation requirements

**Solution**
1. If model is still training:
   - Wait for training to complete before performing next operation
2. If training failed:
   - Verify data quality and volume:
     - Ensure the required minimum number of days of historical data is available
     - Confirm sufficient PVs/IPVs on recent days
     - Clean behavioral data to reduce invalid types
3. Validate all model parameters:
   - Correct model ID format
   - Valid app group existence and status
   - Proper cron expression syntax

**Verification**
- Check model status via API or console
- Expected state: `trained` or `ready` for prediction
- Retry the blocked operation after conditions are met

## FAQ

**Q: How do I check if my Elasticsearch instance is healthy?**  
A: Query the cluster health endpoint: `GET /_cluster/health`. A status of `green` means all primary and replica shards are allocated. `yellow` means all primary shards are allocated but at least one replica shard is not. `red` means at least one primary shard is unallocated.
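
A small script can fetch the health document and translate the status into the meanings described above. This is a sketch under assumptions: the base URL `http://localhost:9200` is borrowed from the `curl` examples in this guide and assumes an unauthenticated cluster, and the function names are illustrative.

```python
import json
import urllib.request

HEALTH_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, at least one replica is not",
    "red": "at least one primary shard is unallocated",
}


def describe_health(health: dict) -> str:
    """Summarize a parsed _cluster/health response in one line."""
    status = health.get("status", "unknown")
    meaning = HEALTH_MEANING.get(status, "unrecognized status")
    unassigned = health.get("unassigned_shards", 0)
    return f"{status}: {meaning} (unassigned shards: {unassigned})"


def fetch_health(base_url: str = "http://localhost:9200", timeout_s: float = 5.0) -> dict:
    """GET /_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout_s) as resp:
        return json.load(resp)
```

Calling `describe_health(fetch_health())` from a cron job or a readiness probe gives a one-line summary suitable for logs.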

**Q: What causes throttling errors and how do I handle them?**  
A: Throttling occurs when you exceed request rate limits (`Throttling.User` or `Algorithm.Throttling`). Implement exponential backoff with jitter in your retry logic. Start with a base delay (e.g., 100ms) and double it after each failure, up to a maximum (e.g., 5s).
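
The retry policy described above might look like the following sketch. It assumes your client raises some exception on a throttling response; `ThrottledError` is a hypothetical stand-in for that exception, and the jitter variant shown (a random delay between zero and the current ceiling) is one common choice rather than the only one.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical error raised when the service returns a throttling code."""


def backoff_ceilings(base_s: float = 0.1, cap_s: float = 5.0, retries: int = 6) -> Iterator[float]:
    """Yield the delay ceiling per retry: base, 2*base, 4*base, ... capped at cap_s."""
    for attempt in range(retries):
        yield min(cap_s, base_s * (2 ** attempt))


def call_with_backoff(
    call: Callable[[], T],
    base_s: float = 0.1,
    cap_s: float = 5.0,
    retries: int = 6,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run call(), retrying on ThrottledError with jittered exponential backoff."""
    for ceiling in backoff_ceilings(base_s, cap_s, retries):
        try:
            return call()
        except ThrottledError:
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(rng() * ceiling)
    return call()  # final attempt; any error propagates to the caller
```

The `sleep` and `rng` parameters exist only to make the delays deterministic in tests.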

**Q: How do I debug authentication failures?**  
A: Verify your AccessKey pair is enabled and correct. Confirm you're using the right API endpoint (found in the console under Instance Management > Application Details). Ensure your signature includes all required parameters sorted correctly and uses proper encoding.

**Q: What should I do when I get an InternalError?**  
A: For 4xx `InternalError` responses, review your request for subtle validation issues. For 5xx errors, wait briefly and retry, since the issue is likely transient. If errors persist beyond 5 minutes, submit a support ticket with request IDs and timestamps.

**Q: How can I avoid deep pagination errors (code 6013)?**  
A: Keep `start + hit` at or below 5000. For large result sets, use the scroll API for initial data extraction or the `search_after` parameter for real-time pagination. Scroll is ideal for batch processing; `search_after` works better for user navigation.
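
A `search_after` loop can be sketched as follows, assuming the Elasticsearch request shape in which each hit carries a `sort` array and the next request passes the last hit's `sort` values as `search_after`. The `search` callable is hypothetical: it stands for whatever function sends a request body to your index and returns the parsed response.

```python
from typing import Callable, Iterator, List


def iter_hits(
    search: Callable[[dict], dict],  # hypothetical: sends the body, returns parsed JSON
    query: dict,
    sort: List[dict],                # should end with a unique tiebreaker field
    page_size: int = 1000,
) -> Iterator[dict]:
    """Yield every hit by repeatedly searching after the last hit's sort values."""
    body = {"query": query, "sort": sort, "size": page_size}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume strictly after the last hit of this page.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Because each page is addressed by sort values rather than by an offset, the loop never builds a `start + hit` sum that could exceed the limit.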