# es-text

Part of **ES**

# Elasticsearch Text Processing

## Capabilities Overview

| Sub-capability | Calling Mode | Description |
|----------------|--------------|-------------|
| Create Custom Analyzer | Synchronous | Create custom text analyzers for specialized text processing needs. |
| Define Custom Analyzer Intervention | Synchronous | Define intervention rules for custom analyzer behavior. |
| Manage User Dictionary | Synchronous | Create and manage user-defined dictionaries for text analysis. |
| Manage Custom Analyzers | Synchronous | Create, list, describe, and remove user-defined text analyzers. |
| Manage Intervention Dictionaries | Synchronous | Create, list, describe, push entries to, and remove intervention dictionaries for query understanding. |
| Parse Document | Async Task | Extract structured content from uploaded documents asynchronously. |
| Split Document | Synchronous | Divide documents into logical chunks for downstream processing. |

## API Calling Patterns

### Authentication
The primary authentication method is Bearer Token authentication.

- Use the header: `Authorization: Bearer <your_api_key>`
- Store your API key in the environment variable: `DASHSCOPE_API_KEY`
- While other authentication methods may exist, Bearer Token is the recommended approach for all text processing APIs.
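
The header convention above can be wrapped in a small helper so the key is never hard-coded. This is a minimal sketch: the function name `build_auth_headers` is hypothetical, and only the header format and the `DASHSCOPE_API_KEY` variable come from this document.

```python
import os


def build_auth_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Return request headers carrying the Bearer token read from the environment."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early instead of sending a request with an empty token.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as `headers=` to `requests.get` or `requests.post`.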

### Service Endpoint
The APIs use region-specific endpoints with the pattern: `https://opensearch.{region}.aliyuncs.com`

Common regions include:
- `cn-hangzhou` (China Hangzhou)
- `cn-shanghai` (China Shanghai)
- `us-west-1` (US West 1)

For document parsing and splitting services, the endpoint follows: `{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}` or `{host}/v3/openapi/workspaces/{workspace_name}/document-split/{service_id}`
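
The two URL patterns above can be assembled with plain string formatting. The sketch below assumes the helper names `service_host` and `document_service_url`, which are illustrative and not part of any SDK. The host is passed in explicitly because your actual service host may differ from the regional pattern.

```python
def service_host(region: str) -> str:
    """Build the regional host following the pattern given in this document."""
    return f"https://opensearch.{region}.aliyuncs.com"


def document_service_url(host: str, workspace_name: str, service_id: str, kind: str) -> str:
    """Build a document-analyze or document-split URL from its parts."""
    if kind not in ("document-analyze", "document-split"):
        raise ValueError(f"unknown service kind: {kind!r}")
    return f"{host.rstrip('/')}/v3/openapi/workspaces/{workspace_name}/{kind}/{service_id}"
```

For example, `document_service_url(service_host("cn-hangzhou"), "default", "ops-document-split-001", "document-split")` yields the splitting endpoint for the `default` workspace.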

### Async Task Pattern (Document Parsing)
Document parsing uses an asynchronous task pattern:

1. **Submit Task**: POST to the async endpoint with document parameters
2. **Receive Task ID**: The response contains a `task_id` for tracking
3. **Poll Status**: GET the task status endpoint with the `task_id` 
4. **Get Results**: When status is "SUCCESS", retrieve the parsed content
5. **Handle Errors**: Check for error statuses and retry if appropriate

Polling should use exponential backoff (e.g., start with 5-second intervals).
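
The polling steps above can be sketched independently of any SDK. In this sketch, `fetch_status` is a hypothetical callable that you supply (for example, a wrapper around the status request) and that returns a dictionary with a `status` key. The doubling factor, the 60-second cap, and the attempt limit are illustrative assumptions, not documented values.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    initial_delay: float = 5.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task leaves PENDING, waiting longer after each attempt."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        result = fetch_status()
        if result.get("status") != "PENDING":
            # SUCCESS or any error status: hand the result back to the caller.
            return result
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * factor, max_delay)
    raise TimeoutError(f"task still PENDING after {max_attempts} polls")
```

Injecting `sleep` keeps the function testable without real waiting. The caller decides how to treat any non-`SUCCESS` status that is returned.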

### Synchronous Pattern (All Other Operations)
Most operations (analyzer management, dictionary management, document splitting) use synchronous patterns:

1. **Direct Request**: Send HTTP request with required parameters
2. **Immediate Response**: Receive JSON response with results or error
3. **Error Handling**: Check HTTP status codes and error fields
4. **Pagination**: For list operations, use `pageNumber` and `pageSize` parameters
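
For the pagination step, a generator can walk pages until a short page signals the end. This is a sketch under assumptions: `fetch_page` is a hypothetical callable you provide that takes `pageNumber` and `pageSize` and returns the list of items on that page, and the stop condition (a page shorter than `page_size`) is a common convention rather than documented behavior of these APIs.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 10,
) -> Iterator[dict]:
    """Yield items page by page, starting at pageNumber 1."""
    page_number = 1
    while True:
        items = fetch_page(page_number, page_size)
        yield from items
        if len(items) < page_size:
            # A short (or empty) page is treated as the last page.
            break
        page_number += 1
```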

## Parameter Reference

### Create Custom Analyzer

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| name | String | true | | | The name of the custom analyzer. |
| business | String | true | | one of: chn_standard, chn_scene_name, chn_ecommerce, chn_it_content, en_min, th_standard, th_ecommerce, vn_standard, chn_community_it, chn_ecommerce_general, chn_esports_general, chn_edu_question | The built-in analyzer on which the custom analyzer is based. |
| dicts[] | Object | false | | | The custom dictionaries for analysis. |
| available | Boolean | false | true | | Specifies whether the custom analyzer is available. |
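
Validating the request body on the client side catches an unsupported `business` value before the call is made. The sketch below uses the hypothetical helper `build_analyzer_payload`. The allowed values are copied from the table above, and the structure of `dicts` entries is left opaque because this document does not specify it.

```python
from typing import List, Optional

# Built-in analyzers listed in the parameter table above.
ALLOWED_BUSINESS = {
    "chn_standard", "chn_scene_name", "chn_ecommerce", "chn_it_content",
    "en_min", "th_standard", "th_ecommerce", "vn_standard",
    "chn_community_it", "chn_ecommerce_general", "chn_esports_general",
    "chn_edu_question",
}


def build_analyzer_payload(
    name: str,
    business: str,
    dicts: Optional[List[dict]] = None,
    available: bool = True,
) -> dict:
    """Check the documented constraints and return a JSON-serializable body."""
    if not name:
        raise ValueError("name is required")
    if business not in ALLOWED_BUSINESS:
        raise ValueError(f"unsupported business: {business!r}")
    payload = {"name": name, "business": business, "available": available}
    if dicts:
        payload["dicts"] = dicts  # entry structure is not specified in this document
    return payload
```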

### Define Custom Analyzer Intervention

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| cmd | string | true | | one of: add / delete | The action to perform on the intervention entry. |
| key | string | true | | | The search query to segment. |
| value | string | true | | | The analysis result for the search query. |
| status | string | false | ACTIVE | one of: ACTIVE / PENDING_ACTIVE | The status of the intervention entry. |
| splitEnabled | boolean | false | true | | Specifies whether to further segment the tokens generated after the search query is segmented. |

### Parse Document

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| service_id | String | true | | | Built-in service ID |
| document.url | String | false | | | Publicly downloadable document URL (HTTP or HTTPS). |
| document.content | String | false | | | Document content encoded in Base64 |
| document.file_name | String | false | | | The document file name. Supply it when `document.url` is blank. |
| document.file_type | String | false | | one of: txt, pdf, html, doc, docx, ppt, pptx | The document type. Must be specified explicitly if it cannot be inferred. |
| output.image_storage | String | false | base64 | one of: base64 / url | How images are stored in the output. |
| strategy.enable_semantic | Boolean | false | false | | Specifies whether to enable semantic hierarchy extraction. |
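
The nested parameter names above map onto a JSON body with `document`, `output`, and `strategy` objects. The following sketch builds that body and Base64-encodes raw bytes for `document.content`. The helper name `build_parse_request` is hypothetical, and the rule that at least one of `url` or `content` must be present is an assumption, since the table marks both as optional.

```python
import base64
from typing import Optional


def build_parse_request(
    url: Optional[str] = None,
    content: Optional[bytes] = None,
    file_name: Optional[str] = None,
    file_type: Optional[str] = None,
    image_storage: str = "base64",
    enable_semantic: bool = False,
) -> dict:
    """Assemble a document-parsing request body from the documented fields."""
    if not url and content is None:
        # Assumption: the service needs some document source to work on.
        raise ValueError("provide either url or content")
    if image_storage not in ("base64", "url"):
        raise ValueError("image_storage must be 'base64' or 'url'")
    document = {}
    if url:
        document["url"] = url
    if content is not None:
        document["content"] = base64.b64encode(content).decode("ascii")
    if file_name:
        document["file_name"] = file_name
    if file_type:
        document["file_type"] = file_type
    return {
        "document": document,
        "output": {"image_storage": image_storage},
        "strategy": {"enable_semantic": enable_semantic},
    }
```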

### Split Document

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| document.content | string | true | | max 8 MB request body size | The plain text to split. |
| document.content_encoding | string | false | utf8 | only 'utf8' supported | The character encoding. |
| document.content_type | string | false | text | one of: html / markdown / text | The document format. |
| strategy.type | string | false | default | one of: default | The chunking strategy. |
| strategy.max_chunk_size | int | false | 300 | default 300 tokens | The maximum chunk size in tokens. |
| strategy.compute_type | string | false | token | only 'token' supported | The method for measuring chunk length. |
| strategy.need_sentence | boolean | false | false | default false | Specifies whether to return sentence-level chunks. |
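
Because the 8 MB limit applies to the whole request body, it is worth measuring the serialized JSON rather than the raw text. This is a sketch with a hypothetical helper `build_split_request`. Interpreting 8 MB as 8 × 1024 × 1024 bytes is an assumption.

```python
import json

MAX_BODY_BYTES = 8 * 1024 * 1024  # assumed byte interpretation of the 8 MB limit


def build_split_request(
    content: str,
    content_type: str = "text",
    max_chunk_size: int = 300,
    need_sentence: bool = False,
) -> bytes:
    """Serialize a document-splitting request and reject oversized bodies."""
    if content_type not in ("html", "markdown", "text"):
        raise ValueError(f"unsupported content_type: {content_type!r}")
    body = {
        "document": {
            "content": content,
            "content_encoding": "utf8",
            "content_type": content_type,
        },
        "strategy": {
            "type": "default",
            "max_chunk_size": max_chunk_size,
            "compute_type": "token",
            "need_sentence": need_sentence,
        },
    }
    encoded = json.dumps(body, ensure_ascii=False).encode("utf-8")
    if len(encoded) > MAX_BODY_BYTES:
        raise ValueError(f"request body is {len(encoded)} bytes, over the 8 MB limit")
    return encoded
```

The returned bytes can be sent as the request body with `Content-Type: application/json`.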

### Manage Intervention Dictionaries

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| name | string | true | | | The name of the intervention dictionary. |
| type | string | true | | Valid values: stopword, synonym, correction, category_prediction, ner, term_weighting | The type of the intervention dictionary. |
| analyzer | string | false | | | The custom analyzer. Required if type is ner. |

## Code Examples

### Create Custom Analyzer - curl - all

```bash
curl -X POST https://your-endpoint/v4/openapi/user-analyzers \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "name": "lsh_test_user_analyzer",
  "business": "chn_standard"
}'
```

### List Custom Analyzers - Python - all

```python
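import os

import requests

url = "https://opensearch.cn-hangzhou.aliyuncs.com/v4/openapi/user-analyzers"
params = {
    "pageNumber": 1,
    "pageSize": 10
}
headers = {
    # Python does not expand "$DASHSCOPE_API_KEY" inside a string, so read it from the environment.
    "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
    "Content-Type": "application/json"
}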

response = requests.get(url, params=params, headers=headers)
print(response.json())
```

### Parse Document Async - Python - all

```python
from alibabacloud_tea_openapi.models import Config
from alibabacloud_searchplat20240529.client import Client
from alibabacloud_searchplat20240529.models import (
    CreateDocumentAnalyzeTaskRequestDocument,
    CreateDocumentAnalyzeTaskRequestOutput,
    CreateDocumentAnalyzeTaskRequest,
    CreateDocumentAnalyzeTaskResponse,
)

config = Config(
    bearer_token="<your-api-key>",
    endpoint="<your-api-endpoint>",
    protocol="http"
)
client = Client(config=config)

document = CreateDocumentAnalyzeTaskRequestDocument(
    url="https://example.com/sample.pdf",
    file_type="pdf"
)
output = CreateDocumentAnalyzeTaskRequestOutput(
    image_storage="url"
)
request = CreateDocumentAnalyzeTaskRequest(document=document, output=output)

response: CreateDocumentAnalyzeTaskResponse = client.create_document_analyze_task(
    "default", "ops-document-analyze-001", request
)
task_id = response.body.result.task_id
print("Task ID:", task_id)
```

### Poll Document Parsing Status - Python - all

```python
import time
from alibabacloud_searchplat20240529.models import (
    GetDocumentAnalyzeTaskStatusRequest,
    GetDocumentAnalyzeTaskStatusResponse,
)

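# Reuses `client` and `task_id` from the Parse Document Async example.
request = GetDocumentAnalyzeTaskStatusRequest(task_id=task_id)

delay = 5  # seconds between polls; doubled after each PENDING response
max_delay = 60  # illustrative cap, not a documented limit

while True:
    response: GetDocumentAnalyzeTaskStatusResponse = client.get_document_analyze_task_status(
        "default", "ops-document-analyze-001", request
    )
    status = response.body.result.status
    print("Status: ", status)

    if status == "PENDING":
        time.sleep(delay)
        delay = min(delay * 2, max_delay)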
    elif status == "SUCCESS":
        data = response.body.result.data
        usage = response.body.usage
        print("Parsed content (first 1000 chars):\n", data.content[:1000])
        print("Page count: ", data.page_num)
        print("Usage: ", usage)
        break
    else:
        print("Task ended with status: ", response.body.result)
        break
```

### Split Document - bash - all

```bash
curl -XPOST -H "Content-Type: application/json" \
  "http://***-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-split/ops-document-split-001" \
  -H "Authorization: Bearer Your-API-Key" \
  -d '{
    "document": {
      "content": "Product benefits\nIndustry algorithm edition\nIntelligent\nBuilt-in rich customizable algorithm models...",
      "content_encoding": "utf8",
      "content_type": "text"
    },
    "strategy": {
      "type": "default",
      "max_chunk_size": 300,
      "compute_type": "token",
      "need_sentence": false
    }
  }'
```

### List Intervention Dictionaries - Python - all

```python
import os

import requests

url = "https://opensearch.cn-hangzhou.aliyuncs.com/v4/openapi/intervention-dictionaries"
params = {
    "pageSize": 10,
    "pageNumber": 1,
    "types": "synonym"
}
headers = {
    # Read the key from the environment; "$DASHSCOPE_API_KEY" is not expanded inside a Python string.
    "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
    "Content-Type": "application/json"
}

response = requests.get(url, params=params, headers=headers)
print(response.json())
```

### Push Intervention Dictionary Entries - bash - all

```bash
curl -X POST 'https://opensearch.cn-hangzhou.aliyuncs.com/v4/openapi/intervention-dictionaries/abccc/entries/actions/bulk' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '[
  {
    "cmd": "add",
    "relevance": {
      "100": 0
    },
    "word": "hah"
  }
]'
```

### Describe User Analyzer - Python - all

```python
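import os

import requests

url = 'https://your-endpoint/v4/openapi/user-analyzers/kevin_test?with=all'
headers = {
    # Read the key from the environment; '$DASHSCOPE_API_KEY' is not expanded inside a Python string.
    'Authorization': f"Bearer {os.environ['DASHSCOPE_API_KEY']}"
}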

response = requests.get(url, headers=headers)
print(response.json())
```

## Response Format

```json
{
    "requestId": "1201EB10-FF51-01EF-068A-144393E618B5",
    "result": {
        "name": "lsh_test_user_analyzer",
        "business": "chn_standard",
        "available": true,
        "dicts": {
            "id": 2078,
            "type": "segment",
            "entriesLimit": 4,
            "entriesCount": 0,
            "available": true,
            "created": 1597200767,
            "updated": 1597200767
        },
        "created": 1597200764,
        "updated": 1597200764
    }
}
```

**Key Fields**:
- `requestId` — Unique identifier for the API request
- `result.name` — Name of the created/queried analyzer
- `result.business` — Base analyzer type used
- `result.available` — Availability status of the analyzer
- `result.dicts.id` — Dictionary ID associated with the analyzer
- `result.created` — Creation timestamp in seconds
- `result.updated` — Last update timestamp in seconds
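
Since `created` and `updated` are epoch seconds, converting them to timezone-aware datetimes makes the response easier to log or compare. A minimal sketch, assuming the response has already been decoded into a dictionary shaped like the sample above; `summarize_analyzer` is a hypothetical name.

```python
from datetime import datetime, timezone


def summarize_analyzer(response: dict) -> dict:
    """Extract the key fields and convert epoch-second timestamps to UTC datetimes."""
    result = response["result"]
    return {
        "request_id": response.get("requestId"),
        "name": result["name"],
        "business": result["business"],
        "available": result.get("available", False),
        "created": datetime.fromtimestamp(result["created"], tz=timezone.utc),
        "updated": datetime.fromtimestamp(result["updated"], tz=timezone.utc),
    }
```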

## Error Handling

| Error Code | Description | Recommended Action |
|-------------------|---------------------------|----------------------------------------|
| AccessDenied | The user does not have sufficient permissions to perform this operation. | Verify your RAM policies and ensure you have the necessary permissions for the OpenSearch service. |
| InvalidParameter | One or more parameters are invalid. | Check that all required parameters (such as `name` and `business`) are provided and conform to the specified constraints. |
| ResourceConflict | A custom analyzer with the same name already exists. | Use a unique name for your custom analyzer. |
| 404 | The specified analyzer does not exist. | Verify the analyzer name exists and is spelled correctly. |
| 403 | Access denied due to insufficient permissions. | Ensure your API key has the required permissions for the requested operation. |
| 429 | Too many requests; the rate limit was exceeded. | Wait before retrying, implement exponential backoff, and respect the rate limits. |
| BadRequest.TaskNotExist | Task not found | Verify the task_id is correct and the task hasn't expired. |
| InvalidParameter | Invalid request parameters | Check that document parameters are valid and within size limits. |

### Rate Limits & Retry
- Custom analyzer operations: 100 QPS per account
- Document parsing: 10 QPS for ops-document-analyze-001 (submit a ticket to increase)
- Document splitting: 2 QPS (Alibaba Cloud account and RAM users combined)
- For rate limit errors (429), implement exponential backoff starting with 1-second delays
- Async document parsing tasks should poll with 5-second intervals initially
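
The 429 guidance above can be sketched as a retry wrapper. Here `send` is a hypothetical callable that you supply; it performs one request and returns a `(status_code, body)` tuple. The 1-second starting delay comes from the list above, while the doubling factor and the retry count are illustrative assumptions.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Retry only on HTTP 429, doubling the wait after each rate-limited attempt."""
    for attempt in range(max_retries + 1):
        status_code, body = send()
        if status_code != 429:
            return status_code, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the final attempt: return the last response.
    return status_code, body
```

Other error codes are returned immediately, since retrying an `InvalidParameter` or `AccessDenied` response will not change the outcome.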

## Environment Requirements

- SDK package: `alibabacloud_tea_openapi>=0.3.0, alibabacloud_searchplat20240529>=1.0.0`
- Environment variable setup: `export DASHSCOPE_API_KEY=your_api_key_here`
- Python version: no specific version requirement is stated

## FAQ

Q: How do I handle large documents that exceed the 8MB limit?
A: Split large documents into smaller segments before uploading, or use URL-based input instead of Base64 encoding to avoid size limitations in the request body.

Q: What's the difference between custom analyzers and intervention dictionaries?
A: Custom analyzers define the overall text analysis pipeline (tokenization, filtering), while intervention dictionaries provide specific rules for particular text processing tasks like synonyms, stopwords, or named entity recognition within those analyzers.

Q: How long do async document parsing tasks remain available?
A: Task results are typically available for a limited time (usually 24-48 hours). Poll for results promptly after task submission and store the parsed content in your own system for long-term access.

Q: Can I use the same API key across different regions?
A: Yes, the same DASHSCOPE_API_KEY can be used across all regions, but you must use the correct regional endpoint for each request.

Q: What happens if I exceed my free tier quota?
A: Once you exceed the free tier (e.g., 1000 requests/month for analyzer operations), you'll be charged according to the standard pricing rates for additional usage.

## Pricing & Billing

### Billing Model
- Custom analyzer and intervention dictionary operations: Per request
- Document parsing: Per token (based on input tokens processed)
- Document splitting: Per token (based on input tokens processed)

### Price Reference

| Tier/Model | Input Price | Output Price |
|------------|-------------|--------------|
| standard | 0.0001 / | 0.0001 / |
| ops-document-analyze-001 | 0.002 /tokens | 0.002 /tokens |
| ops-document-split-001 | 0.0001 /tokens | 0.0001 /tokens |

### Free Tier
- Custom analyzer operations: 1000 free requests per month
- Document parsing: 100 tokens free per month
- Intervention dictionaries: 1000 free requests per month

### Usage Limits
- Custom analyzer operations: 100 QPS per account
- Document parsing: 10 QPS, maximum 8 MB request body size
- Document splitting: 2 QPS, maximum 8 MB request body size
- User dictionaries: Maximum 4 intervention entries per dictionary

### Billing Notes
- Semantic hierarchy extraction in document parsing is billed separately based on `usage.semantic_token_count`
- If `usage.semantic_token_count` is absent from the response, you are not charged for semantic extraction
- Async tasks are billed only upon successful completion
- Sentence-level chunking in document splitting doubles token usage and cost