# OpenSearch Model and AI Services

## Capabilities Overview

| Sub-capability | Models | Calling Mode | Description |
|----------------|--------|--------------|-------------|
| Generate Text | qwen3-235b-a22b, qwq-32b, ops-qwen-turbo, +9 more | Synchronous | Generate text responses using large language models. |
| List NER Results | — | Synchronous | Retrieve NER results from intervention dictionaries. |
| Generate Natural Language Text | qwen3-235b-a22b, qwq-32b, ops-qwen-turbo, +9 more | Synchronous | Produce human-like text responses using large language models with optional web search. |
| Calculate Token Count | ops-qwen-turbo, qwen-turbo, qwen-plus, qwen-max | Synchronous | Determine the number of tokens in input text for billing or processing purposes. |
| Compatible with OpenAI SDK | ops-qwen-turbo, qwen-turbo, qwen-plus, qwen-max, +5 more | OpenAI Compatible | Use OpenSearch text generation APIs with OpenAI-compatible SDKs. |
| Call Deep Models | haixuan_pvlist_cate_er100_zhibing, recall_pvlist_batchneg_v4, nx_tower_v5_distillation, +2 more | Synchronous | Invoke pre-trained deep learning models for inference tasks. |
| Parse Retrieval Results | — | Synchronous | Process and interpret retrieval results from the engine. |
| Perform Web Search | ops-web-search-001 | Synchronous | Execute searches on the internet to retrieve current information. |
| Get Web Search | — | Synchronous | Performs a live web search and returns results to supplement OpenSearch responses. |

## Model Selection Guide

### Generate Text

| Model ID | Calling Mode |
|----------|--------------|
| qwen3-235b-a22b | Synchronous |
| qwq-32b | Synchronous |
| ops-qwen-turbo | Synchronous |
| qwen-turbo | Synchronous |
| qwen-plus | Synchronous |
| qwen-max | Synchronous |
| deepseek-r1 | Synchronous |
| deepseek-v3 | Synchronous |
| deepseek-r1-distill-qwen-7b | Synchronous |
| deepseek-r1-distill-qwen-14b | Synchronous |
| deepseek-v4-pro | Synchronous |
| deepseek-v4-flash | Synchronous |

### Generate Natural Language Text

| Model ID | Calling Mode |
|----------|--------------|
| qwen3-235b-a22b | Synchronous |
| qwq-32b | Synchronous |
| ops-qwen-turbo | Synchronous |
| qwen-turbo | Synchronous |
| qwen-plus | Synchronous |
| qwen-max | Synchronous |
| deepseek-r1 | Synchronous |
| deepseek-v3 | Synchronous |
| deepseek-r1-distill-qwen-7b | Synchronous |
| deepseek-r1-distill-qwen-14b | Synchronous |
| deepseek-v4-pro | Synchronous |
| deepseek-v4-flash | Synchronous |

### Calculate Token Count

| Model ID | Calling Mode |
|----------|--------------|
| ops-qwen-turbo | Synchronous |
| qwen-turbo | Synchronous |
| qwen-plus | Synchronous |
| qwen-max | Synchronous |

### Compatible with OpenAI SDK

| Model ID | Calling Mode |
|----------|--------------|
| ops-qwen-turbo | OpenAI Compatible |
| qwen-turbo | OpenAI Compatible |
| qwen-plus | OpenAI Compatible |
| qwen-max | OpenAI Compatible |
| qwen-max-longcontext | OpenAI Compatible |
| ops-text-embedding-001 | OpenAI Compatible |
| ops-text-embedding-zh-001 | OpenAI Compatible |
| ops-text-embedding-en-001 | OpenAI Compatible |
| ops-text-embedding-002 | OpenAI Compatible |

### Call Deep Models

| Model ID | Calling Mode |
|----------|--------------|
| haixuan_pvlist_cate_er100_zhibing | Synchronous |
| recall_pvlist_batchneg_v4 | Synchronous |
| nx_tower_v5_distillation | Synchronous |
| yueqi_grk_drr_ls_aware | Synchronous |
| base_bs_cvr | Synchronous |

### Perform Web Search

| Model ID | Calling Mode |
|----------|--------------|
| ops-web-search-001 | Synchronous |

## API Calling Patterns

### Authentication
Use **Bearer Token** authentication as the primary method.

- Include the header: `Authorization: Bearer <your_api_key>`
- Store your API key in the environment variable: `DASHSCOPE_API_KEY`
- Some endpoints may use `OS_API_KEY` instead; check your service configuration
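
A minimal sketch of assembling the request headers from the environment. The helper name `build_headers` is hypothetical, and the lookup order (`DASHSCOPE_API_KEY` first, then `OS_API_KEY`) is an assumption made for illustration rather than documented behavior.

```python
import os


def build_headers(env=None):
    """Return JSON request headers carrying the Bearer token.

    Looks up DASHSCOPE_API_KEY first and falls back to OS_API_KEY.
    The fallback order is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY") or env.get("OS_API_KEY")
    if not api_key:
        raise RuntimeError("Set DASHSCOPE_API_KEY (or OS_API_KEY) before calling the API")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

Passing a dictionary as `env` makes the helper easy to exercise without touching the real process environment.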

### Service Endpoint
APIs use region-specific endpoints with the pattern:
`http://{instance_id}.{region}.opensearch.aliyuncs.com`

Common regions include:
- cn-hangzhou
- cn-shanghai
- cn-beijing

For OpenAI-compatible interfaces, use:
`{host}/compatible-mode/v1/chat/completions`
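
The two URL patterns above can be composed with a pair of small helpers. This is a sketch only: the function names are hypothetical, and the instance ID in the usage note is a made-up placeholder.

```python
def service_host(instance_id, region):
    """Build the region-specific base URL from the documented pattern."""
    if not instance_id or not region:
        raise ValueError("instance_id and region are both required")
    return f"http://{instance_id}.{region}.opensearch.aliyuncs.com"


def chat_completions_url(host):
    """Append the OpenAI-compatible chat completions path to a host URL."""
    return host.rstrip("/") + "/compatible-mode/v1/chat/completions"
```

For example, `chat_completions_url(service_host("my-instance", "cn-hangzhou"))` would yield the chat completions URL for a hypothetical instance named `my-instance` in the `cn-hangzhou` region.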

### Synchronous Pattern
Used for most text generation, web search, and model inference calls.

1. Send a POST request with JSON body containing required parameters
2. Receive a complete JSON response immediately
3. Parse the `result` or `choices` field for generated content
4. Check `usage` for token consumption metrics
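
Steps 1 and 2 can be sketched with the Python standard library alone. The URL path mirrors the curl examples in this document; the function names and the 30-second timeout are assumptions of this sketch, not part of any official SDK.

```python
import json
import urllib.request


def build_text_generation_request(host, model, messages, api_key,
                                  workspace="default", parameters=None):
    """Prepare (but do not send) a synchronous text-generation POST."""
    url = (f"{host.rstrip('/')}/v3/openapi/workspaces/{workspace}"
           f"/text-generation/{model}")
    body = {"messages": messages, "stream": False}
    if parameters:
        body["parameters"] = parameters
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the first half testable offline; `send` is the only function that touches the network.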

### OpenAI Compatible Pattern
Enables use of standard OpenAI SDKs with OpenSearch services.

1. Configure your OpenAI client with:
   - `base_url = "http://{host}/compatible-mode/v1"`
   - `api_key = your_api_key`
2. Call `client.chat.completions.create()` with standard OpenAI parameters
3. Handle responses identically to OpenAI API responses
4. Supports both streaming and non-streaming modes
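
For the streaming mode mentioned in step 4, the following sketch collects text fragments from a streamed chat completion. It assumes the chunks follow the standard OpenAI delta format, as step 3 implies; `stream_chat` is a hypothetical helper, and `client` is expected to be an `openai.OpenAI` instance configured as in step 1.

```python
def stream_chat(client, model, messages):
    """Yield text fragments from a streaming chat completion.

    `client` is expected to be an openai.OpenAI instance pointed at the
    OpenSearch base_url; any object exposing the same interface works.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        # Some chunks (for example a trailing usage chunk) carry no choices.
        if not chunk.choices:
            continue
        fragment = chunk.choices[0].delta.content
        if fragment:
            yield fragment
```

With a real client, `"".join(stream_chat(client, "ops-qwen-turbo", messages))` would reassemble the full reply, while iterating directly lets you print fragments as they arrive.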

## Parameter Reference

### Generate Text / Generate Natural Language Text

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| messages | List | Yes | — | — | The conversation history between the user and the model. Each element has the form {"role": role, "content": content}. Valid roles: system, user, assistant. |
| stream | Boolean | No | false | — | Whether to return responses in streaming mode. When true, each response contains the entire sequence generated so far. |
| enable_search | Boolean | No | false | Supported only for deepseek-r1 | Whether to enable web search. When true, the LLM uses an internal prompt to decide whether to perform a web search. |
| csi_level | String | No | strict | one of: none, loose, strict, rigorous | Green Net filtering level. |
| parameters.search_return_result | Boolean | No | false | Only effective when enable_search is true | true: Returns web search results. false: Does not return web search results. |
| parameters.search_top_k | Integer | No | 5 | Only effective when enable_search is true | Number of web search results to return. Supported only for deepseek-r1. |
| parameters.search_way | String | No | normal | Only effective when enable_search is true | Web search strategy. Options: normal, fast, full. |
| parameters.max_tokens | Integer | No | — | max 1500 for qwen-turbo; max 2000 for qwen-max and qwen-plus | Limits the number of tokens the model generates. |
| parameters.temperature | Float | No | — | range [0, 2) | Controls randomness and diversity in generation. |
| parameters.top_p | Float | No | — | range (0, 1.0) | Probability threshold for nucleus sampling. |
| parameters.presence_penalty | Float | No | — | range [-2.0, 2.0] | Controls repetition across the entire generated sequence. |
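
The constraints in the table can be checked on the client before a request is sent. The sketch below is illustrative: `validate_generation_request` is a hypothetical helper, it covers only the constraints listed above, and the service remains the authority on what it accepts.

```python
MAX_TOKENS_LIMIT = {"qwen-turbo": 1500, "qwen-max": 2000, "qwen-plus": 2000}
CSI_LEVELS = {"none", "loose", "strict", "rigorous"}
SEARCH_WAYS = {"normal", "fast", "full"}


def validate_generation_request(model, body):
    """Return a list of constraint violations for a text-generation body."""
    errors = []
    params = body.get("parameters", {})

    if not body.get("messages"):
        errors.append("messages is required")
    if body.get("csi_level", "strict") not in CSI_LEVELS:
        errors.append("csi_level must be one of none, loose, strict, rigorous")
    if params.get("search_way", "normal") not in SEARCH_WAYS:
        errors.append("parameters.search_way must be normal, fast, or full")

    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature < 2:
        errors.append("parameters.temperature must be in [0, 2)")
    top_p = params.get("top_p")
    if top_p is not None and not 0 < top_p < 1:
        errors.append("parameters.top_p must be in (0, 1.0)")
    penalty = params.get("presence_penalty")
    if penalty is not None and not -2.0 <= penalty <= 2.0:
        errors.append("parameters.presence_penalty must be in [-2.0, 2.0]")

    # Only models with a documented max_tokens ceiling are checked.
    limit = MAX_TOKENS_LIMIT.get(model)
    max_tokens = params.get("max_tokens")
    if limit is not None and max_tokens is not None and max_tokens > limit:
        errors.append(f"parameters.max_tokens exceeds {limit} for {model}")
    return errors
```

Returning a list rather than raising on the first problem lets a caller report every violation in one pass.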

### Calculate Token Count

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| messages | List | Yes | — | list must end with role[user] | Conversation history to tokenize. Each item has role and content fields. |
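
A small sketch of preparing a token-count call: one helper builds the endpoint URL (the `/tokenizer` suffix is taken from the curl example in this document) and another enforces the constraint that the list ends with a `user` message. Both helper names are hypothetical.

```python
def tokenizer_url(host, model, workspace="default"):
    """Build the token-count endpoint URL for a model."""
    return (f"{host.rstrip('/')}/v3/openapi/workspaces/{workspace}"
            f"/text-generation/{model}/tokenizer")


def check_ends_with_user(messages):
    """Return messages unchanged, or raise if the last role is not 'user'."""
    if not messages or messages[-1].get("role") != "user":
        raise ValueError("messages must end with a message whose role is 'user'")
    return messages
```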

### Perform Web Search / Get Web Search

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| query | String | Yes | — | — | The search query. |
| query_rewrite | Boolean | No | true | — | Specifies whether to use an LLM to rewrite the query. |
| top_k | Integer | No | 5 | — | The number of search results to return. |
| history | List | No | null | Must alternate between user/assistant; system must be first if used | Conversation history between user and model. |
| content_type | String | No | snippet | one of: snippet, summary | Content type of the search results. |
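
The `history` ordering rule is the easiest constraint in this table to get wrong, so a client-side check can help. The sketch below assumes that alternation starts with a `user` message once any leading `system` message is set aside; the helper names are hypothetical.

```python
def validate_history(history):
    """Check the role-ordering constraint on `history`.

    An optional system message must come first; the remaining messages
    must alternate user, assistant, user, ... (starting with user is an
    assumption of this sketch).
    """
    turns = list(history or [])
    if turns and turns[0]["role"] == "system":
        turns = turns[1:]
    for position, message in enumerate(turns):
        expected = "user" if position % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"expected role {expected!r} but found {message['role']!r}"
            )


def build_web_search_body(query, history=None, query_rewrite=True,
                          top_k=5, content_type="snippet"):
    """Assemble a web search request body using the documented defaults."""
    if not query:
        raise ValueError("query is required")
    if content_type not in ("snippet", "summary"):
        raise ValueError("content_type must be 'snippet' or 'summary'")
    validate_history(history)
    body = {
        "query": query,
        "query_rewrite": query_rewrite,
        "top_k": top_k,
        "content_type": content_type,
    }
    if history:
        body["history"] = history
    return body
```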

### Call Deep Models

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| score_model_bizs | String | No | — | — | Business configurations that access the scoring model. |
| recall_model_bizs | String | No | — | — | Business configurations that access the retrieval model. |
| fg_user | String | No | — | max length 6000 chars | User-specific qinfo in map-like format (k1:v1;k2:v2;...). |
| qinfo | String | No | — | max length 6000 chars | User-specific qinfo in JSON format, Base64 encoded for SQL. |
| topk | Integer | No | — | range 1-10000 | Number of top entries to return in retrieval results. |
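
The two feature encodings described above (`k1:v1;k2:v2;...` for `fg_user`, Base64-encoded JSON for `qinfo`) can be produced as follows. This is a sketch under assumptions: the helper names are hypothetical, and the 6000-character limit is applied to the final encoded string, which the table does not state explicitly.

```python
import base64
import json

MAX_FEATURE_CHARS = 6000


def encode_fg_user(features):
    """Serialize features as k1:v1;k2:v2;... for the `fg_user` parameter."""
    text = ";".join(f"{key}:{value}" for key, value in features.items())
    if len(text) > MAX_FEATURE_CHARS:
        raise ValueError("fg_user exceeds 6000 characters")
    return text


def encode_qinfo(features):
    """Serialize features as Base64-encoded compact JSON for `qinfo`."""
    raw = json.dumps(features, separators=(",", ":"), ensure_ascii=False)
    encoded = base64.b64encode(raw.encode("utf-8")).decode("ascii")
    if len(encoded) > MAX_FEATURE_CHARS:
        raise ValueError("qinfo exceeds 6000 characters")
    return encoded


def check_topk(topk):
    """Return topk if it lies in the documented 1-10000 range."""
    if not 1 <= topk <= 10000:
        raise ValueError("topk must be between 1 and 10000")
    return topk
```

Decoding the result of `encode_qinfo` with `base64.b64decode` followed by `json.loads` recovers the original dictionary, which is a convenient way to inspect a `qinfo` value taken from an existing query.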

## Code Examples

### Text Generation with Web Search - Bash - All Regions

```bash
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer Your-API-Key" \
"http://xxxx-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/text-generation/deepseek-r1" \
-d '{
      "messages":[
      {
          "role":"system",
          "content":"You are a robot assistant"
      },
      {
          "role":"user",
          "content":"What is the capital of Henan?"
      },
      {
          "role":"assistant",
          "content":"Zhengzhou"
      },
      {
          "role":"user",
          "content":"What is the weather like in Zhengzhou today?"
      }
      ],
      "parameters":{
          "search_return_result":true,
          "search_top_k":5,
          "search_way":"normal"
      },
       "stream":false,
       "enable_search":true
}'
```

### OpenAI-Compatible Text Generation - Python - All Regions

```python
import os

from openai import OpenAI

def get_response():
    client = OpenAI(
        api_key=os.environ["OS_API_KEY"],  # API key created on the platform, read from the environment
        base_url="http://xxxx-hangzhou.opensearch.aliyuncs.com/compatible-mode/v1",
    )

    completion = client.chat.completions.create(
        model="ops-qwen-turbo",
        messages=[
            {"role": "system", "content": "You are a robot assistant"},
            {"role": "user", "content": "What is the capital of Henan"},
            {"role": "assistant", "content": "Zhengzhou"},
            {"role": "user", "content": "What are some interesting places there"}
        ]
    )

    print(completion.model_dump_json())

if __name__ == '__main__':
    get_response()
```

### Token Calculation - Bash - All Regions

```bash
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
"http://****-shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/text-generation/ops-qwen-turbo/tokenizer" \
-d '{"messages":[{"role":"user","content":"Test token calculation interface"}]}'
```

### Web Search with History - Bash - All Regions

```bash
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
"http://xxxx-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/web-search/ops-web-search-001" \
-d '{
      "history": [
        {"role": "system", "content": "You are a robot assistant"},
        {"role": "user", "content": "What is the capital of Zhejiang province?"},
        {"role": "assistant", "content": "Hangzhou"}
        ],
      "query":"What is the weather like in Hangzhou today?",
      "query_rewrite":true,
      "top_k":5,
      "content_type":"snippet"
}'
```

### Deep Model Call - SQL - All Regions

```sql
select * from gul_jhs_match_mind_item_v2 where aitheta('ju_recall_model','qinfo:eyJ1c2VyOmFnZSI6IjI5IiwidXNlcjp0bV9sZXZlbCI6IlQyIiwidXNlcjpwaG9uZV9tb2RlbCI6ImlwaG9uZSA3IiwidXNlcjpnZW5kZXIiOiJNIiwidXNlcjp0aW1lX2lkIjpbMjMsMjUsMjUsMjUsMjUsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjYsMjZdLCJ1c2VyOmFjdGlvbl90eXBlIjpbMSwxLDEsMywxLDEsMSwxLDEsMywxLDEsMSwxLDEsMSwxLDEsMSwxLDEsMSwxLDEsMSwxLDEsMSwxLDEsMywxLDEsMSwyLDEsMSwxLDQsMSwzLDEsMywxLDNdLCJ1c2VyOmlkIjoiMTkwODcyMzAwNSIsInVzZXI6Y291bnR5X25hbWUiOiLpqazlhbPljr8iLCJ1c2VyOnByZWRfaGFzX2hvdXNlIjoiTiIsInVzZXI6cGhvbmVfYnJhbmQiOiJhcHBsZSIsInVzZXI6Y2xpY2tfc2VxIjpbNTg3NjQ0MDkyNjAyLDY0NDk3NjE1ODg5Myw2NDE1NDY0NDM3MjUsNjQ0MDMxNDg1ODI4LDYyMDY3MjkzNjg5MSw2NDAyMTk3MDIyNTUsNjE1NzU2NTg4OTg5LDY0Mjg5NTM0MDg0Nyw2Mzg1MTc3MDI5NDQsNjM4NjYzOTc0MTQxLDU0NDUzMDI5MTg0MCw2Mzk2MzMwMDU4NjgsNjM3MjkxOTUyMjExLDYxODEwMDc2NzI0MCw2NDc5Mjg5NTU0NTEsNjE0MjU2NzMyNzA5LDYzODY2ODc3MTIyOCw2NDY1Nzg3NjU4MTEsNjM5NDcwNjg2NzAyLDY0MjU4MTA1NjIyMCw2NDQ4NDAyMDIxMDUsNjI5OTE3MTM2MzcyLDY0MTIzMDUwNjM4MSw2MzU3ODQwOTIwMTQsNjA0MjcyNDg5NzM1LDU5OTQ2OTQ3MjU1OCw2MTYxMTk0NDIwODIsNjI5MDY0OTQzNDYyLDY0NDE0MTAzMDE0Nyw1OTgzMTAzMDM3NTksNjE3MjA0ODUyNjgxLDY0NjgyNzE5MDEyMyw2NDM4MDYyMDE3MjcsNjMxMDU5MDUyMjcyLDY0NzQxOTI1ODI0Niw2MjE5NTUwNTMxNDEsNjUwODk5MDIzNTU0LDU5ODAzMzUyNzc2Nyw2NDcwNzQ0MjQyNDksNjQxNTAwMjg2ODk2LDYzMzYzMDQ5Nzg1Nyw2NDYzODAxODMwODJdLCJ1c2VyOnNlcXVlbmNlX2xlbmd0aCI6NDIsInVzZXI6aG91c2VfYWdlIjoiIiwidXNlcjpwcm9wZXJ0eV9ob3Vyc2VfbGV2ZWwiOiIifQ==,category:1\,2,topk:1')"
```

### Parse Protobuf Retrieval Results - Java - All Regions

```java
import com.aliyun.ha3engine.Client;
import com.aliyun.ha3engine.models.*;
import com.aliyun.tea.TeaException;
import com.aliyun.demo.protobuf.Ha3ResultProto;
import org.junit.Before;
import org.junit.Test;

import java.nio.ByteBuffer;
import java.util.*;

public class DataFormatService {
    private Client client;

    @Before
    public void clientInit() throws Exception {
        Config config = new Config();
        config.setEndpoint("");
        config.setInstanceId("");
        config.setAccessUserName("");
        config.setAccessPassWord("");
        config.setHttpProxy("");
        client = new Client(config);
    }

    @Test
    public void protobufFormat() throws Exception {
        try {
            SearchRequestModel request = new SearchRequestModel();
            SearchQuery query = new SearchQuery();
            query.setQuery("query=id:8148508889615505646&&config=start:0,hit:100,format:protobuf&&cluster=general");
            request.setQuery(query);

            SearchBytesResponseModel response = client.SearchBytes(request);
            System.out.println("Raw bytes: " + Arrays.toString(response.getBody()));

            Ha3ResultProto.PBResult pbResult = Ha3ResultProto.PBResult.parseFrom(response.getBody());
            System.out.println("Parsed result: " + pbResult);

        } catch (TeaException e) {
            System.out.println(e.getCode());
            System.out.println(e.getMessage());
            System.out.println(com.aliyun.teautil.Common.toJSONString(e.getData()));
        }
    }
}
```

### Document Splitting - Python - China

```python
from alibabacloud_tea_openapi.models import Config
from alibabacloud_searchplat20240529.client import Client
from alibabacloud_searchplat20240529.models import GetDocumentSplitRequest

if __name__ == '__main__':
    config = Config(
        bearer_token='OS-****',
        endpoint='****.platform-cn-shanghai.opensearch.aliyuncs.com',
        protocol='http'
    )

    client = Client(config=config)
    request = GetDocumentSplitRequest().from_map({
        "document": {
            "content": "test123",
            "content_type": "text"
        },
        "strategy": {
            "max_chunk_size": 300,
            "need_sentence": False
        }
    })
    response = client.get_document_split("default", "ops-document-split-001", request)
    print(response)
```

## Response Format

```json
{
  "request_id": "450fcb80-f796-****-8d69-e1e86d29aa9f",
  "latency": 564.903929,
  "result": {
    "text":"According to the latest weather forecast, Zhengzhou will be cloudy today with temperatures ranging from 9°C to 19°C and northeast winds around level 2...",
    "search_results":[
      {
        "url":"https://xxxxx.com",
        "title":"xxxx",
        "snippet":"Zhengzhou weather is sunny today"
      }
    ]
   },
  "usage": {
      "output_tokens": 934,
      "input_tokens": 798,
      "total_tokens": 1732
  }
}
```

**Key Fields**:
- `result.text`: The generated text response from the model
- `result.search_results`: Array of web search results, returned when `enable_search` and `parameters.search_return_result` are both true
- `usage.output_tokens`: Number of tokens in the generated response
- `usage.input_tokens`: Number of tokens in the input messages
- `usage.total_tokens`: Total tokens consumed (input + output)
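
A minimal sketch of reading these fields into a typed structure. The `GenerationResult` class and `parse_generation_response` function are hypothetical names; missing fields fall back to empty defaults, which is a choice of this sketch rather than documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationResult:
    text: str
    search_results: list = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0


def parse_generation_response(payload):
    """Extract the key fields from a decoded text-generation response."""
    result = payload.get("result", {})
    usage = payload.get("usage", {})
    return GenerationResult(
        text=result.get("text", ""),
        # search_results is absent unless web search results were requested.
        search_results=result.get("search_results", []),
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
    )
```

Applied to the sample response above, the parsed object would report 798 input tokens and 934 output tokens, which sum to the 1732 total.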

## Error Handling

| Error Code | Description | Recommended Action |
|------------|-------------|-------------------|
| InvalidParameter | The request contains invalid parameters. | Check that all required fields are present, then validate the JSON structure and parameter values against the API specification. |
| 400 | Bad request, typically caused by a missing or invalid required parameter such as `query`. | Ensure all required parameters are provided and correctly formatted. |
| 401 | Unauthorized: the API key in the `Authorization` header is invalid or missing. | Verify that your API key is correct and sent as a Bearer token. |
| 404 | Not found: the specified endpoint or model does not exist. | Check your endpoint URL and model ID for typos. |
| 429 | Too many requests: the rate limit was exceeded. | Wait for the specified interval, then retry using exponential backoff with jitter. |
| 500 | Internal server error. | Retry after a short delay with exponential backoff; contact support if the error persists. |

### Rate Limits & Retry
- Text generation: 3 QPS (combined across the Alibaba Cloud account and its RAM users)
- Query analysis: 10 QPS per Alibaba Cloud account and RAM user
- OpenAI-compatible interfaces: 100 QPS per model
- Web search: 3 QPS

Implement retry logic with exponential backoff (e.g., 1s, 2s, 4s, 8s) when receiving 429 errors. For 500 errors, retry after a short delay (100-500 ms) and back off exponentially if the error persists.
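
The retry policy can be sketched as follows. `ApiError` is a hypothetical exception type standing in for whatever your HTTP layer raises, and the jitter size and the default of five attempts are assumptions of this sketch.

```python
import random
import time


class ApiError(Exception):
    """Hypothetical error type that carries the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def retry_delay(status, attempt, rng=random.random):
    """Return the wait in seconds before retry number `attempt` (0-based)."""
    if status == 429:
        # 1s, 2s, 4s, 8s, ... plus up to 1s of jitter (jitter size is assumed).
        return 2 ** attempt + rng()
    if status == 500:
        # Start in the 100-500 ms window, then double on each further attempt.
        return (0.1 + 0.4 * rng()) * 2 ** attempt
    raise ValueError(f"status {status} is not retryable")


def call_with_retry(func, max_attempts=5, sleep=time.sleep):
    """Call `func`, retrying on ApiError with status 429 or 500."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in (429, 500) or last_attempt:
                raise
            sleep(retry_delay(err.status, attempt))
```

The `sleep` and `rng` arguments are injectable so the policy can be tested without real waiting or randomness. Non-retryable statuses such as 401 are re-raised immediately.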

## Environment Requirements

- Python SDK: `pip install "alibabacloud_searchplat20240529>=1.0.0"`
- Java SDK: Maven dependency `com.aliyun:searchplat20240529`
- OpenAI SDK: `pip install "openai>=1.0.0"`
- Environment variable setup: `export DASHSCOPE_API_KEY=your_api_key_here`
- Python version: >=3.8 for most SDKs

## FAQ

Q: How do I enable web search in my text generation requests?
A: Set `enable_search: true` in your request and include `parameters.search_return_result: true` to get the actual search results. Note that web search is only supported for the deepseek-r1 model.

Q: What's the difference between synchronous and OpenAI-compatible APIs?
A: Synchronous APIs use OpenSearch-specific endpoints and response formats, while OpenAI-compatible APIs follow the standard OpenAI interface, allowing you to use existing OpenAI SDKs with minimal code changes.

Q: How are tokens counted for billing purposes?
A: Tokens are counted separately for input (prompt) and output (completion). You can use the token calculation API to estimate costs before making expensive calls. Free tier includes 1 million tokens per month.

Q: Can I use streaming responses with the OpenAI-compatible interface?
A: Yes, set `stream: true` in your request parameters. The response will be returned as a series of SSE chunks that you need to iterate through to get incremental results.

Q: What should I do if I get a 413 error (request too large)?
A: The request body cannot exceed 8 MB. Reduce your input text size or split large documents into smaller chunks before sending them to the API.
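
A client-side guard can catch oversized requests before they are sent. In this sketch the helper name is hypothetical, and interpreting 8 MB as 8 × 1024 × 1024 bytes is an assumption.

```python
import json

MAX_BODY_BYTES = 8 * 1024 * 1024  # 8 MB, read as binary megabytes (assumed)


def encode_body(payload):
    """Serialize payload to UTF-8 JSON, rejecting bodies over the size limit."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    if len(data) > MAX_BODY_BYTES:
        raise ValueError(
            f"request body is {len(data)} bytes; the limit is {MAX_BODY_BYTES}"
        )
    return data
```

The size is measured on the encoded bytes rather than the character count, because non-ASCII text occupies more than one byte per character in UTF-8.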

## Pricing & Billing

### Billing Model
Per-token billing for text generation and embedding models; per-request billing for web search, document splitting, and deep model calls.

### Price Reference

| Model/Specification | Input Price | Output Price |
|---------------------|-------------|--------------|
| qwen3-235b-a22b | 0.002 /tokens | 0.004 /tokens |
| qwq-32b | 0.0015 /tokens | 0.003 /tokens |
| ops-qwen-turbo | 0.0005 /tokens | 0.001 /tokens |
| qwen-turbo | 0.0005 /tokens | 0.001 /tokens |
| qwen-plus | 0.001 /tokens | 0.002 /tokens |
| qwen-max | 0.002 /tokens | 0.004 /tokens |
| deepseek-r1 | 0.0018 /tokens | 0.0036 /tokens |
| deepseek-v3 | 0.0016 /tokens | 0.0032 /tokens |
| deepseek-r1-distill-qwen-7b | 0.0012 /tokens | 0.0024 /tokens |
| deepseek-r1-distill-qwen-14b | 0.0014 /tokens | 0.0028 /tokens |
| deepseek-v4-pro | 0.003 /tokens | 0.006 /tokens |
| deepseek-v4-flash | 0.0008 /tokens | 0.0016 /tokens |
| ops-web-search-001 | 0.002 / request | 0.002 / request |
| ops-document-split-001 | 0.0001 / request | 0.0001 / request |

### Free Tier
Monthly free tier of 1 million tokens for most text generation models. Document splitting and web search have separate free quotas (1000 calls/month).

### Usage Limits
- Request body maximum size: 8 MB
- ops-qwen-turbo maximum tokens: 4000
- Standard models maximum tokens: 8192
- Web search QPS limit: 3 requests per second
- General QPS limits: 3-100 depending on service (see Rate Limits & Retry)

### Billing Notes
Billing occurs based on actual token consumption for text generation. Web search functionality may incur additional charges beyond base model usage. Async tasks are billed upon completion. Minimum billing unit is 100 tokens for most services.