# es-text-generation

Part of **ES**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Deploy a Retrieval-Augmented Generation (RAG) AI application](../../intent/es-deploy-application/SKILL.md). If you're unsure which path to take, check the routing skill first.

# Elasticsearch AI and RAG

## Capabilities Overview

| Sub-capability | Models | Calling Mode | Description |
|----------------|--------|--------------|-------------|
| Generate Text | qwen3-235b-a22b, qwq-32b, ops-qwen-turbo, +9 more | Synchronous / OpenAI Compatible | Produce natural language output from a prompt using large language models. |
| Analyze Query | — | Synchronous | Interpret and structure user queries for downstream processing. |
| Calculate Token Count | ops-qwen-turbo, qwen-turbo, qwen-plus, qwen-max | Synchronous | Determine the number of tokens in input text for billing or truncation. |
| Document Split | ops-document-split-001 | Synchronous | Split long documents into smaller chunks for processing by LLMs. |
| Perform Web Search | ops-web-search-001 | Synchronous | Execute live searches on the public web to retrieve current information. |

## Model Selection Guide

### Generate Text

| Model ID | Calling Mode |
|----------|--------------|
| qwen3-235b-a22b | Synchronous |
| qwq-32b | Synchronous |
| ops-qwen-turbo | Synchronous / OpenAI Compatible |
| qwen-turbo | Synchronous |
| qwen-plus | Synchronous |
| qwen-max | Synchronous |
| deepseek-r1 | Synchronous |
| deepseek-v3 | Synchronous |
| deepseek-r1-distill-qwen-7b | Synchronous |
| deepseek-r1-distill-qwen-14b | Synchronous |
| deepseek-v4-pro | Synchronous |
| deepseek-v4-flash | Synchronous |

## API Calling Patterns

### Authentication
Use **Bearer Token** authentication as the primary method.

- Include the header: `Authorization: Bearer <your_api_key>`
- Store your API key in the environment variable: `DASHSCOPE_API_KEY` or `OS_API_KEY`
- The API key is created in the AI Search Open Platform console under API Key Management.
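
The header construction described above can be sketched as a small helper. This is a minimal sketch, not an official client: the function name `build_auth_headers` and the lookup order between the two environment variables are assumptions made for illustration.

```python
import os


def build_auth_headers(env=None):
    """Return HTTP headers carrying the API key as a Bearer token.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # Assumption: DASHSCOPE_API_KEY is checked first, then OS_API_KEY.
    api_key = env.get("DASHSCOPE_API_KEY") or env.get("OS_API_KEY")
    if not api_key:
        raise RuntimeError("Set DASHSCOPE_API_KEY or OS_API_KEY before calling the API")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; the same dictionary can be passed to whichever HTTP client you use.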

### Service Endpoint
The base URL pattern is region-specific:

```text
http://{instance_id}.{region}.opensearch.aliyuncs.com
```

Common regions include:
- `cn-hangzhou`
- `cn-shanghai`
- `cn-beijing`

For OpenAI-compatible endpoints, use:
```text
http://{instance_id}.{region}.opensearch.aliyuncs.com/compatible-mode/v1
```
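
As an illustration of the two URL patterns above, the following sketch fills in the placeholders. The helper name `base_url` is hypothetical, and the hostname that your console displays should take precedence over anything assembled this way.

```python
def base_url(instance_id, region, openai_compatible=False):
    """Assemble the service base URL from the documented pattern."""
    root = f"http://{instance_id}.{region}.opensearch.aliyuncs.com"
    # The OpenAI-compatible mode lives under a fixed path suffix.
    return f"{root}/compatible-mode/v1" if openai_compatible else root
```

For example, `base_url("my-instance", "cn-hangzhou", openai_compatible=True)` yields a value suitable for the `base_url` argument of an OpenAI SDK client, where `my-instance` stands in for a real instance ID.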

### Synchronous Calls
Used for Generate Text, Analyze Query, Token Count, Document Split, and Web Search.

1. Send a POST request to the appropriate endpoint with JSON body
2. Include `Authorization: Bearer <API_KEY>` header
3. Receive a complete JSON response immediately
4. Parse the result from the response body (e.g., `result.text`, `result.chunks`)
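
The four steps above can be sketched with only the Python standard library. The function names are hypothetical, and the sketch assumes the response body is JSON with a top-level `result` object, as in the response sample in this document.

```python
import json
import urllib.request


def build_sync_request(url, api_key, body):
    """Steps 1 and 2: a POST request with a JSON body and a Bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def call_sync(url, api_key, body, timeout=30):
    """Steps 3 and 4: send the request and return the `result` object."""
    request = build_sync_request(url, api_key, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumption: callers only need `result`; keep `payload` if you also need `usage`.
    return payload.get("result", {})
```

Splitting request construction from sending makes the request easy to inspect or log before any network traffic happens.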

### OpenAI-Compatible Calls
Used primarily with `ops-qwen-turbo` for Generate Text.

1. Set the base URL to `{endpoint}/compatible-mode/v1`
2. Use standard OpenAI SDKs (Python, Java, etc.)
3. Pass `model="ops-qwen-turbo"` in the request
4. The response follows OpenAI’s `chat.completion` format

## Parameter Reference

### Generate Text

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| messages | List | Yes | — | — | Conversation history with role/content pairs (system, user, assistant) |
| stream | Boolean | No | false | — | Enable streaming response mode |
| enable_search | Boolean | No | false | Supported only for deepseek-r1 | Enable web search during generation |
| csi_level | String | No | strict | none, loose, strict, rigorous | Green Net filtering level |
| parameters.search_return_result | Boolean | No | false | Only when enable_search=true | Return raw search results |
| parameters.search_top_k | Integer | No | 5 | Only when enable_search=true | Number of search results (deepseek-r1 only) |
| parameters.search_way | String | No | normal | Only when enable_search=true | Search strategy: normal, fast, full |
| parameters.max_tokens | Integer | No | — | max 1500 (qwen-turbo), max 2000 (qwen-max/plus) | Max tokens to generate |
| parameters.temperature | Float | No | — | [0, 2) | Controls randomness of output |
| parameters.top_p | Float | No | — | (0, 1.0) | Nucleus sampling threshold |
| parameters.top_k | Integer | No | — | Values above 100 disable top-k sampling | Candidate token set size |
| parameters.repetition_penalty | Float | No | — | >0 | Reduces repetition |
| parameters.presence_penalty | Float | No | — | [-2.0, 2.0] | Penalizes repeated topics |
| parameters.stop | string/array | No | — | Don’t mix strings and token_ids | Stops generation at specified sequence |
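
Several of the constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch under the assumption that the ranges are exactly as listed; the function name and the error messages are illustrative, and the service remains the authority on what it accepts.

```python
# max_tokens ceilings taken from the table above; other models are not checked.
MAX_TOKENS_LIMITS = {"qwen-turbo": 1500, "qwen-max": 2000, "qwen-plus": 2000}


def validate_generation_parameters(model, parameters):
    """Return a list of constraint violations (empty when everything passes)."""
    errors = []
    temperature = parameters.get("temperature")
    if temperature is not None and not (0 <= temperature < 2):
        errors.append("temperature must be in [0, 2)")
    top_p = parameters.get("top_p")
    if top_p is not None and not (0 < top_p < 1.0):
        errors.append("top_p must be in (0, 1.0)")
    repetition_penalty = parameters.get("repetition_penalty")
    if repetition_penalty is not None and repetition_penalty <= 0:
        errors.append("repetition_penalty must be > 0")
    presence_penalty = parameters.get("presence_penalty")
    if presence_penalty is not None and not (-2.0 <= presence_penalty <= 2.0):
        errors.append("presence_penalty must be in [-2.0, 2.0]")
    max_tokens = parameters.get("max_tokens")
    limit = MAX_TOKENS_LIMITS.get(model)
    if max_tokens is not None and limit is not None and max_tokens > limit:
        errors.append(f"max_tokens must be <= {limit} for {model}")
    return errors
```

Returning a list rather than raising on the first problem lets a caller report every violation at once.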

### Analyze Query

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| query | string | Yes | — | — | Current user query to analyze |
| history | list<Message> | No | — | — | Prior conversation context |
| functions | list<Function> | No | — | pre, intent, similar_query, nl2sql | Enabled analysis functions |
| functions[].name | string | No | — | — | Function name (e.g., "intent") |
| functions[].parameters.enable | boolean | No | true | — | Whether to enable the function |
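
A request body matching the table above might be assembled as follows. This is a sketch: the nesting of `name` and `parameters.enable` is inferred from the parameter paths in the table, and the helper name and its `functions` mapping argument are hypothetical.

```python
ALLOWED_FUNCTIONS = {"pre", "intent", "similar_query", "nl2sql"}


def build_query_analysis_body(query, history=None, functions=None):
    """Build a query-analysis body; `functions` maps function name -> enabled flag."""
    if not query:
        raise ValueError("query is required")
    body = {"query": query}
    if history:
        body["history"] = list(history)
    if functions:
        unknown = set(functions) - ALLOWED_FUNCTIONS
        if unknown:
            raise ValueError(f"unsupported functions: {sorted(unknown)}")
        # Assumed shape: one object per function with a nested `enable` flag.
        body["functions"] = [
            {"name": name, "parameters": {"enable": bool(enabled)}}
            for name, enabled in functions.items()
        ]
    return body
```

Calling `build_query_analysis_body("What is the population?", functions={"intent": True})` produces a dictionary that can be serialized with `json.dumps` and sent as the request body.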

### Document Split

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| document.content | string | Yes | — | max 8 MB | Text to split |
| document.content_type | string | Yes | text | html, markdown, text | Document format |
| strategy.max_chunk_size | integer | No | 300 | 1–10000 | Max chunk size in characters |
| strategy.need_sentence | boolean | No | false | — | Preserve sentence boundaries |
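
The constraints above lend themselves to a small payload builder. The sketch below assumes that the 8 MB limit is measured on the UTF-8 encoded content and that 1 MB means 1024 * 1024 bytes; neither detail is stated in this document, and the function name is illustrative.

```python
MAX_CONTENT_BYTES = 8 * 1024 * 1024  # assumption: 8 MB counted as UTF-8 bytes
CONTENT_TYPES = {"html", "markdown", "text"}


def build_document_split_body(content, content_type="text",
                              max_chunk_size=300, need_sentence=False):
    """Validate the inputs and return a document-split request body."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"content_type must be one of {sorted(CONTENT_TYPES)}")
    if not 1 <= max_chunk_size <= 10000:
        raise ValueError("max_chunk_size must be between 1 and 10000")
    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        raise ValueError("document content exceeds 8 MB")
    return {
        "document": {"content": content, "content_type": content_type},
        "strategy": {"max_chunk_size": max_chunk_size, "need_sentence": need_sentence},
    }
```

The returned dictionary has the same shape as the one passed to `GetDocumentSplitRequest().from_map(...)` in the SDK example in this document.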

### Perform Web Search

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| query | string | Yes | — | — | Search query |
| query_rewrite | boolean | No | true | — | Use LLM to rewrite query |
| top_k | integer | No | 5 | — | Number of results to return |
| history | list | No | — | Must alternate roles | Conversation context |
| content_type | string | No | snippet | snippet, summary | Result content format |
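
The "must alternate roles" constraint on `history` can be checked before sending a request. The sketch below interprets it as follows, which is an assumption: an optional leading `system` message is allowed, and the remaining messages switch between `user` and `assistant` with no role repeated twice in a row.

```python
def history_alternates(history):
    """Return True when user/assistant turns alternate (optional leading system)."""
    roles = [message["role"] for message in history]
    if roles and roles[0] == "system":
        roles = roles[1:]  # a single system prompt at the start is tolerated
    if any(role not in ("user", "assistant") for role in roles):
        return False
    # Every adjacent pair must differ for the turns to alternate.
    return all(first != second for first, second in zip(roles, roles[1:]))
```

A history of `system`, `user`, `assistant` passes this check, while two consecutive `user` messages fail it.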

## Code Examples

### Text Generation with Web Search - Bash - All Regions

```bash
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
"http://xxxx-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/text-generation/qwen-max" \
-d '{
      "messages":[
      {
          "role":"system",
          "content":"You are a robot assistant"
      },
      {
          "role":"user",
          "content":"What is the capital of Henan?"
      },
      {
          "role":"assistant",
          "content":"Zhengzhou"
      },
      {
          "role":"user",
          "content":"What is the weather like in Zhengzhou today?"
      }
      ],
      "parameters":{
          "search_return_result":true,
          "search_top_k":5,
          "search_way":"normal"
      },
       "stream":false,
       "enable_search":true
}'
```

### OpenAI-Compatible Chat Completion - Python - All Regions

```python
import os

from openai import OpenAI

def get_response():
    client = OpenAI(
        # Read the API key created on the platform from the OS_API_KEY environment variable
        api_key=os.getenv("OS_API_KEY"),
        base_url="http://xxxx-hangzhou.opensearch.aliyuncs.com/compatible-mode/v1",
    )

    completion = client.chat.completions.create(
        model="ops-qwen-turbo",
        messages=[
            {"role": "system", "content": "You are a robot assistant"},
            {"role": "user", "content": "What is the capital of Henan"},
            {"role": "assistant", "content": "Zhengzhou"},
            {"role": "user", "content": "What are some interesting places there"}
        ]
    )

    print(completion.model_dump_json())

if __name__ == '__main__':
    get_response()
```

### Document Splitting - Python - All Regions

```python
from alibabacloud_tea_openapi.models import Config
from alibabacloud_searchplat20240529.client import Client
from alibabacloud_searchplat20240529.models import GetDocumentSplitRequest

if __name__ == '__main__':
    config = Config(
        bearer_token='OS-****',
        endpoint='****.platform-cn-shanghai.opensearch.aliyuncs.com',
        protocol='http'
    )

    client = Client(config=config)
    request = GetDocumentSplitRequest().from_map({
        "document": {
            "content": "test123",
            "content_type": "text"
        },
        "strategy": {
            "max_chunk_size": 300,
            "need_sentence": False
        }
    })
    response = client.get_document_split("default", "ops-document-split-001", request)
    print(response)
```

### Query Analysis with History - Python - All Regions

```python
from alibabacloud_tea_openapi.models import Config
from alibabacloud_searchplat20240529.client import Client
from alibabacloud_searchplat20240529.models import GetQueryAnalysisRequest, GetQueryAnalysisRequestHistory

if __name__ == '__main__':
    config = Config(
        bearer_token="<your-api-key>",
        endpoint="<your-api-endpoint>",
        protocol="http"
    )
    client = Client(config=config)

    history = [
        GetQueryAnalysisRequestHistory(content="Where is the capital of China", role="user"),
        GetQueryAnalysisRequestHistory(content="Beijing", role="assistant")
    ]

    request = GetQueryAnalysisRequest(history=history, query="What is the population?")
    response = client.get_query_analysis("default", "ops-query-analyze-001", request)
    print(response)
```

### Web Search with Context - Bash - All Regions

```bash
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
"http://xxxx-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/web-search/ops-web-search-001" \
-d '{
      "history": [
        {"role": "system", "content": "You are a robot assistant"},
        {"role": "user", "content": "What is the capital of Zhejiang province?"},
        {"role": "assistant", "content": "Hangzhou"}
        ],
      "query":"What is the weather like in Hangzhou today?",
      "query_rewrite":true,
      "top_k":5,
      "content_type":"snippet"
}'
```

### Token Counting - Bash - All Regions

```bash
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
"http://****-shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/text-generation/ops-qwen-turbo/tokenizer" \
-d '{"messages":[{"role":"user","content":"Test token calculation interface"}]}'
```

## Response Format

```json
{
  "request_id": "450fcb80-f796-****-8d69-e1e86d29aa9f",
  "latency": 564.903929,
  "result": {
    "text":"According to the latest weather forecast, Zhengzhou will be cloudy today with temperatures ranging from 9°C to 19°C and northeast winds around level 2...",
    "search_results":[
      {
        "url":"https://xxxxx.com",
        "title":"xxxx",
        "snippet":"Zhengzhou weather is sunny today"
      }
    ]
   },
  "usage": {
      "output_tokens": 934,
      "input_tokens": 798,
      "total_tokens": 1732
  }
}
```

**Key Fields**:
- `result.text` — The generated natural language response
- `result.search_results` — Web search results when `enable_search=true`
- `usage.output_tokens` — Number of tokens in the generated response
- `usage.input_tokens` — Number of tokens in the input prompt and context
- `usage.total_tokens` — Sum of input and output tokens
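
The key fields listed above can be pulled out of a response with a short parser. This is a sketch that assumes the response has the shape of the sample in this section; the `GenerationResult` class and the shortened `SAMPLE` payload are illustrative.

```python
import json
from dataclasses import dataclass


@dataclass
class GenerationResult:
    text: str
    search_results: list
    input_tokens: int
    output_tokens: int
    total_tokens: int


def parse_generation_response(raw):
    """Extract the generated text, search results, and token usage."""
    payload = json.loads(raw)
    result = payload.get("result", {})
    usage = payload.get("usage", {})
    return GenerationResult(
        text=result.get("text", ""),
        # search_results is only expected when enable_search=true
        search_results=result.get("search_results", []),
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
    )


# Shortened version of the sample response, for illustration only.
SAMPLE = json.dumps({
    "result": {"text": "Zhengzhou will be cloudy today...", "search_results": []},
    "usage": {"output_tokens": 934, "input_tokens": 798, "total_tokens": 1732},
})
```

With the sample values, `input_tokens + output_tokens` equals `total_tokens` (798 + 934 = 1732), which is a cheap consistency check when logging usage.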

## Error Handling

| Error Code | Description | Recommended Action |
|------------|-------------|-------------------|
| InvalidParameter | The request contains invalid parameters. | Validate the JSON structure, check required fields, and ensure parameter constraints are met. |
| 400 | Bad request, usually caused by invalid parameters, missing required fields, or malformed JSON. | Verify the request body format and parameter values. |
| 401 | Unauthorized. The API key in the Authorization header is invalid or missing. | Ensure the API key is correct and sent as `Bearer <key>`. |
| 404 | Not found. The specified endpoint or model does not exist. | Confirm that the service ID and workspace name are correct. |
| 429 | Too many requests. The rate limit was exceeded. | Retry with exponential backoff and respect the rate limits. |
| 500 | Internal server error. | Retry with jittered backoff; contact support if the issue persists. |
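
One way to act on the table above is to map each code to a coarse handling category. The category labels (`retry`, `check_credentials`, and so on) are invented for this sketch and are not part of the API.

```python
def recommended_action(code):
    """Map an error code from the table to a coarse handling category."""
    if code in (429, 500):
        return "retry"              # transient: back off and try again
    if code == 401:
        return "check_credentials"  # API key missing or wrong
    if code == 404:
        return "check_path"         # service ID or workspace name
    if code in (400, "InvalidParameter"):
        return "fix_request"        # retrying the same body will not help
    return "unknown"
```

Only the `retry` category is worth resending unchanged; the others call for a change to the request or its credentials first.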

### Rate Limits & Retry
- Generate Text: 3 QPS (Alibaba Cloud account and RAM users combined)
- Document Split: 2–100 QPS depending on service
- Query Analysis: 10 QPS
- Web Search: 3 QPS

Implement exponential backoff with jitter for retries. Respect the `Retry-After` header if provided.
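
The retry guidance above can be sketched as follows. The sketch uses the "full jitter" variant of exponential backoff and assumes that `Retry-After` carries a number of seconds (the HTTP-date form is not handled). The `send` callable, its return tuple, and the default `base`, `cap`, and `max_attempts` values are hypothetical choices for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random.random):
    """Seconds to wait before the next attempt (attempt is 0-based)."""
    if retry_after is not None:
        return max(0.0, float(retry_after))  # the server's hint takes priority
    # Full jitter: a uniform draw between 0 and the capped exponential ceiling.
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it stops returning 429/500 or attempts run out.

    send() must return a (status, retry_after, body) tuple.
    """
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status not in (429, 500):
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, retry_after=retry_after))
    return status, body
```

Injecting `sleep` and `rng` keeps the helpers deterministic under test. With limits as low as 3 QPS for some services, the jitter also helps keep several clients from retrying in lockstep.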

## Environment Requirements

- Python SDK: `pip install "alibabacloud_searchplat20240529>=1.0.0" "alibabacloud_tea_openapi>=0.1.0"` (quote the version specifiers so the shell does not treat `>` as a redirect)
- OpenAI SDK (for OpenAI-compatible mode): `pip install "openai>=1.0.0"`
- Java SDK: `com.aliyun:searchplat20240529`
- Set environment variable: `export DASHSCOPE_API_KEY=your_api_key_here`

## FAQ

Q: How do I enable web search in text generation?
A: Set `enable_search=true` in the request body and use a supported model like `deepseek-r1`. You can also control result inclusion with `parameters.search_return_result`.

Q: Can I use the OpenAI SDK with Elasticsearch AI APIs?
A: Yes, for the `ops-qwen-turbo` model, use the OpenAI-compatible endpoint `/compatible-mode/v1` and set the base URL accordingly in your SDK.

Q: What is the maximum request size?
A: The request body must not exceed 8 MB for all synchronous APIs.

Q: How are tokens counted for billing?
A: Tokens are counted separately for input (prompt + context) and output (generated text). Use the Token Calculation API to estimate costs before generation.

Q: Why am I getting a 401 error?
A: Verify that your API key is correctly formatted as a Bearer token (`Authorization: Bearer YOUR_KEY`) and that it has permissions for the requested service.

## Pricing & Billing

### Billing Model
Billing is primarily **per-token** for text generation and token counting, and **per-request** for document splitting, query analysis, and web search.

### Price Reference

| Model/Service | Input Price | Output Price |
|---------------|-------------|--------------|
| qwen-turbo | 0.002 /tokens | 0.002 /tokens |
| qwen-plus | 0.004 /tokens | 0.004 /tokens |
| qwen-max | 0.012 /tokens | 0.012 /tokens |
| deepseek-r1 | 0.015 /tokens | 0.015 /tokens |
| deepseek-v3 | 0.020 /tokens | 0.020 /tokens |
| deepseek-v4-pro | 0.030 /tokens | 0.030 /tokens |
| deepseek-v4-flash | 0.010 /tokens | 0.010 /tokens |
| ops-qwen-turbo | 0.002 /tokens | 0.003 /tokens |
| ops-document-split-001 | 0.0001 / | 0.0001 / |
| ops-web-search-001 | 0.002 /tokens | 0.002 /tokens |

### Free Tier
- ops-qwen-turbo: 1 million tokens free per month
- Document Split: 1,000 free calls per month
- Query Analysis: 1,000 free requests per month (in some configurations)

### Usage Limits
- Request body size: max 8 MB
- ops-qwen-turbo: max 8,192 tokens per request
- Document Split: max 10,000 characters per chunk
- Rate limits: 2–100 QPS depending on service

### Billing Notes
- Web search billing includes tokens used in query rewriting and result filtering
- Minimum charge per request is often 100 tokens
- Async tasks are billed upon completion
- Sentence-level chunking in Document Split doubles token usage