# opensearch-document

Part of **OPENSEARCH**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Manage data sources for ingestion](../../intent/opensearch-manage-sources/SKILL.md). If you're unsure which path to take, check the routing skill first.

# OpenSearch Document Retrieval

## Capabilities Overview

| Sub-capability | Calling Mode | Description |
|----------------|--------------|-------------|
| Fetch Document by Primary Key Hash | Synchronous | Retrieve a document by its primary key hash, raw primary key, or document ID via the `fetch_summary` clause. |
| Get Document Summary | Synchronous | Retrieve summarized document content, with optional field selection and text highlighting, via the `summary` clause. |

## API Calling Patterns

### Authentication
The primary authentication method is **Bearer Token**.
- Include the header: `Authorization: Bearer <your-api-key>`
- Store your API key in the environment variable: `OPENSEARCH_API_KEY`
- While some endpoints may not require authentication in internal deployments, public API usage requires this header.
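
A minimal sketch of building the header described above from the `OPENSEARCH_API_KEY` environment variable. The helper name `build_auth_headers` is hypothetical, and failing fast on a missing key is a design choice of this sketch, not documented service behavior.

```python
import os


def build_auth_headers(env=None):
    """Return HTTP headers carrying the Bearer token from OPENSEARCH_API_KEY.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENSEARCH_API_KEY", "").strip()
    if not api_key:
        # Assumption: a missing key is treated as a configuration error.
        raise RuntimeError("OPENSEARCH_API_KEY is not set")
    return {"Authorization": f"Bearer {api_key}"}
```

Reading the key at call time, rather than at import time, keeps the secret out of module-level state.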

### Service Endpoint
APIs use the standard OpenSearch RESTful endpoint:
- Base URL pattern: `https://{instance-id}.{region}.opensearch.aliyuncs.com/_search`
- Common regions: `cn-hangzhou`, `cn-shanghai`, `cn-beijing`
- The `_search` endpoint is used for both `fetch_summary` (as a query parameter) and `summary` (as a JSON body clause).
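
The URL pattern above can be filled in with a small helper. This is an illustrative sketch: `search_url` is a hypothetical name, and the instance ID in the usage note is a placeholder.

```python
def search_url(instance_id, region):
    """Fill the documented pattern with an instance ID and a region."""
    if not instance_id or not region:
        raise ValueError("instance_id and region are both required")
    return f"https://{instance_id}.{region}.opensearch.aliyuncs.com/_search"
```

For example, `search_url("my-instance", "cn-hangzhou")` yields the endpoint for a placeholder instance in the Hangzhou region.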

### Synchronous Request Pattern
Both sub-capabilities are called synchronously:
1. Construct a POST request to `/_search`.
2. For **Fetch Document by Primary Key Hash**: pass the `config` and `fetch_summary` clauses in the query string, separated by `&&` (e.g., `?config=fetch_summary_type:pk&&fetch_summary=...`).
3. For **Get Document Summary**: send a JSON body containing the `config` and `summary` objects.
4. Receive the response in the same call: XML for `fetch_summary`, JSON for `summary`.
5. Parse the response to extract `fields`, `raw_pk`, or highlighted content.

No asynchronous or streaming patterns are supported for document retrieval.
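
The request-building steps above can be sketched with the standard library. The function names are hypothetical, the `Content-Type: application/json` header is an assumption, and the clause string is appended to the URL as-is (whether the service expects further percent-encoding is not stated here). The sketch only builds the requests; sending them with `urllib.request.urlopen` is left to the caller.

```python
import json
import urllib.request


def fetch_summary_request(base_url, summary_type, fetch_summary, headers):
    """Build a POST request that carries both clauses in the query string."""
    query = f"config=fetch_summary_type:{summary_type}&&fetch_summary={fetch_summary}"
    return urllib.request.Request(f"{base_url}?{query}", headers=headers, method="POST")


def summary_request(base_url, body, headers):
    """Build a POST request that carries `config` and `summary` as a JSON body."""
    data = json.dumps(body).encode("utf-8")
    # Assumption: the service expects a JSON content type for body requests.
    all_headers = {**headers, "Content-Type": "application/json"}
    return urllib.request.Request(base_url, data=data, headers=all_headers, method="POST")
```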

## Parameter Reference

### Fetch Document by Primary Key Hash

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| fetch_summary_type | string | Yes | — | one of: pk, rawpk, docid | Specifies the lookup method: pk (primary key hash), rawpk (raw primary key), or docid (document ID). |
| fetch_summary | string | Yes | — | comma-separated list; semicolon separates clusters in rawpk mode; reserved characters must be escaped | List of document identifiers. For pk/docid: use gid values. For rawpk: use `cluster:pk` format. |
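
As an illustration of the `rawpk` format, the sketch below joins primary keys per cluster with commas and clusters with semicolons. `rawpk_clause` is a hypothetical helper, and percent-encoding is assumed as the escaping mechanism for reserved characters inside a key.

```python
from urllib.parse import quote


def rawpk_clause(pks_by_cluster):
    """Format {cluster: [pk, ...]} as 'cluster1:pk1,pk2;cluster2:pk3,pk4'."""
    groups = []
    for cluster, pks in pks_by_cluster.items():
        if not pks:
            raise ValueError(f"cluster {cluster!r} has no primary keys")
        # Assumption: reserved characters in a key (',', ':', ';', '|') are
        # escaped by percent-encoding so they cannot be read as separators.
        encoded = ",".join(quote(str(pk), safe="") for pk in pks)
        groups.append(f"{cluster}:{encoded}")
    return ";".join(groups)
```

The result is the value of the `fetch_summary` clause when `fetch_summary_type` is `rawpk`.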

### Get Document Summary

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| config.fetch_summary_type | string | Yes | — | one of: docid, pk, rawpk | Method to identify documents for summary retrieval. |
| summary.gids | array | Yes | — | max length 100 items | List of document identifiers (gids) to retrieve summaries for. |
| summary.fetch_fields | array | No | — | max length 50 fields | Fields to include in the summary output. |
| summary.highlight.highlighter | string | No | plain | one of: plain, lucene | Name of the highlighter to use for text highlighting. |
| summary.highlight.pre_tag | string | No | `<em>` | max length 10 characters | HTML tag to wrap the beginning of highlighted text. |
| summary.highlight.post_tag | string | No | `</em>` | max length 10 characters | HTML tag to wrap the end of highlighted text. |
| summary.highlight.fields.fragment_size | integer | No | 100 | range 1-500 | Length of each fragment of highlighted text. |
| summary.highlight.fields.number_of_fragments | integer | No | 3 | range 1-10 | Number of fragments to return per field. |
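
The constraints in this table can be checked client-side before a request is sent. The following is a sketch under that assumption: `build_summary_body` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
ALLOWED_TYPES = {"docid", "pk", "rawpk"}


def build_summary_body(summary_type, gids, fetch_fields=None, highlight_fields=None):
    """Assemble a `config` + `summary` body, enforcing the limits in the table."""
    if summary_type not in ALLOWED_TYPES:
        raise ValueError(f"fetch_summary_type must be one of {sorted(ALLOWED_TYPES)}")
    if not 1 <= len(gids) <= 100:
        raise ValueError("summary.gids must contain 1 to 100 items")
    summary = {"gids": list(gids)}
    if fetch_fields is not None:
        if len(fetch_fields) > 50:
            raise ValueError("summary.fetch_fields allows at most 50 fields")
        summary["fetch_fields"] = list(fetch_fields)
    if highlight_fields:
        for name, opts in highlight_fields.items():
            # Missing options fall back to the documented defaults (100 and 3).
            if not 1 <= opts.get("fragment_size", 100) <= 500:
                raise ValueError(f"{name}: fragment_size must be in 1-500")
            if not 1 <= opts.get("number_of_fragments", 3) <= 10:
                raise ValueError(f"{name}: number_of_fragments must be in 1-10")
        summary["highlight"] = {"fields": dict(highlight_fields)}
    return {"config": {"fetch_summary_type": summary_type}, "summary": summary}
```

Rejecting an oversized `gids` list locally avoids spending a request on a call that would presumably come back as a 400.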

## Code Examples

### Fetch Document by PK Hash - Plain - All Regions

```text
config=fetch_summary_type:pk&&fetch_summary=daogou|6|100|100|100|00000000000000004cd645cfd1c63041|184140777,daogou|6|200|200|200|00000000000000005b3ceae33e5ab800|184140777
```

### Fetch Documents by Raw Primary Key - Plain - All Regions

```text
config=fetch_summary_type:rawpk&&fetch_summary=cluster1:pk1,pk2;cluster2:pk3,pk4
```

### Get Summary by DocID - JSON - All Regions

```json
{
  "config" : {
    "fetch_summary_type" : "docid"
  },
  "summary" : {
    "gids" : [
        "daogou|6|0|0|0|00000000000000004cd645cfd1c63041|184140777",
        "daogou|6|0|0|1|00000000000000005b3ceae33e5ab800|184140777"
    ]
  }
}
```

### Get Summary with Field Selection - JSON - All Regions

```json
{
  "summary" : {
    "fetch_fields" : ["title", "body", "price"]
  }
}
```

### Get Summary with Highlighting - JSON - All Regions

```json
{
  "summary" : {
    "highlight" : {
      "highlighter" : "plain",
      "pre_tag" : "<em>",
      "post_tag" : "</em>",
      "fields" : {
        "title" : {
          "fragment_size" : 100,
          "number_of_fragments" : 3
        }
      }
    }
  }
}
```

### Full Summary Request with PK Lookup - JSON - All Regions

```json
{
  "config" : {
    "fetch_summary_type" : "pk"
  },
  "summary" : {
    "gids" : [
        "daogou|6|100|100|100|00000000000000004cd645cfd1c63041|184140777",
        "daogou|6|200|200|200|00000000000000005b3ceae33e5ab800|184140777"
    ]
  }
}
```

## Response Format

### Fetch Document Response (XML)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Root>
<TotalTime>0.003</TotalTime>
<hits numhits="2" totalhits="0" coveredPercent="0.00">
<hit cluster_name="daogou" hash_id="17871" docid="0" gid="daogou|0|0|17871|0|4cd645cfd1c63041391f27d3272cfeeb|4294967295">
<fields>
<id>1</id>
</fields>
<property>
</property>
<sortExprValues></sortExprValues>
<raw_pk>111</raw_pk>
</hit>
<hit cluster_name="daogou" hash_id="60131" docid="0" gid="daogou|0|0|60131|0|5b3ceae33e5ab800352f040b4d9c05e9|4294967295">
<fields>
<id>2</id>
</fields>
<property>
</property>
<sortExprValues></sortExprValues>
<raw_pk>112</raw_pk>
</hit>
</hits>
<AggregateResults>
</AggregateResults>
<Error>
<ErrorCode>0</ErrorCode>
<ErrorDescription></ErrorDescription>
</Error>
</Root>
```

**Key Fields**:
- `hits.hit.fields` — contains the retrieved document field values
- `hits.hit.raw_pk` — the original primary key value
- `hits.numhits` — number of documents returned
- `TotalTime` — total query execution time in seconds
- `Error.ErrorCode` — 0 indicates success; non-zero indicates failure
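
A minimal parsing sketch for the XML shape shown above, using the standard library. `parse_fetch_response` is a hypothetical helper; it assumes every response follows the sample layout and raises on a non-zero `ErrorCode`.

```python
import xml.etree.ElementTree as ET


def parse_fetch_response(xml_text):
    """Return one dict per <hit>, or raise if ErrorCode is non-zero."""
    root = ET.fromstring(xml_text)
    code = int((root.findtext("Error/ErrorCode") or "0").strip())
    if code != 0:
        description = root.findtext("Error/ErrorDescription") or ""
        raise RuntimeError(f"fetch_summary failed with ErrorCode {code}: {description}")
    hits = []
    for hit in root.iter("hit"):
        fields_el = hit.find("fields")
        # Each child of <fields> is one document field, e.g. <id>1</id>.
        fields = {child.tag: child.text for child in fields_el} if fields_el is not None else {}
        hits.append({"gid": hit.get("gid"), "raw_pk": hit.findtext("raw_pk"), "fields": fields})
    return hits
```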

### Get Document Summary Response (JSON)

```json
{
  "summary": {
    "results": [
      {
        "gid": "daogou|6|0|0|0|00000000000000004cd645cfd1c63041|184140777",
        "fields": {
          "title": "Sample Title",
          "body": "This is a sample body text..."
        },
        "highlighted": {
          "title": "<em>Sample</em> Title"
        }
      }
    ]
  }
}
```

**Key Fields**:
- `summary.results[].gid` — global document identifier
- `summary.results[].fields` — requested field values
- `summary.results[].highlighted` — highlighted versions of fields (if enabled)
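
The key fields above can be read with a short generator. This is a sketch that assumes the response matches the sample shape; `iter_summaries` is a hypothetical name, and missing `fields` or `highlighted` entries are treated as empty.

```python
import json


def iter_summaries(payload):
    """Yield (gid, fields, highlighted) for each entry in summary.results."""
    data = json.loads(payload) if isinstance(payload, (str, bytes)) else payload
    for item in data.get("summary", {}).get("results", []):
        # `highlighted` is present only when highlighting was requested.
        yield item["gid"], item.get("fields", {}), item.get("highlighted", {})
```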

## Error Handling

| Error Code | Description | Recommended Action |
|------------|-------------|---------------------|
| 400 | Invalid request syntax, such as malformed fetch_summary or invalid fetch_summary_type. | Validate parameter format, ensure correct gid structure, and escape reserved characters. |
| 404 | Document not found. The specified gid or primary key value does not exist in the cluster. | Verify the document exists and the primary key/hash matches the cluster schema. |
| 500 | Internal server error due to cluster instability or data update conflicts during summary extraction. | Retry the request after a short delay. |
| 503 | Service unavailable. The cluster may be unstable or overloaded. | Implement exponential backoff and retry. |

### Rate Limits & Retry
- Rate limit: 100 QPS per user account
- Recommended retry strategy: Exponential backoff with jitter (e.g., 1s, 2s, 4s delays)
- Do not retry 4xx errors (client-side issues); only retry 5xx errors
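
The retry guidance above can be sketched as a small wrapper. `call_with_backoff` and its `send` callable (returning a `(status, body)` tuple) are hypothetical, and the jitter range of up to half the delay is an arbitrary choice for illustration.

```python
import random
import time


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        # 2xx and 4xx are final: a 4xx means the request itself must be fixed.
        if status < 500:
            return status, body
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    return status, body
```

Passing `sleep` as a parameter lets tests substitute a recorder instead of waiting in real time.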

## Environment Requirements

- Set your API key: `export OPENSEARCH_API_KEY=your_api_key_here`
- Use any HTTP client that supports custom headers (e.g., `curl`, Python `requests`, Java `HttpClient`)
- No specific SDK is required; standard REST clients suffice

## FAQ

Q: What is the difference between `fetch_summary` and `summary`?
A: `fetch_summary` is a clause passed in the query string alongside `config`, and its response is XML. `summary` is a JSON clause in the request body, and its response is structured JSON with support for field selection and highlighting.

Q: How do I construct a valid gid for `fetch_summary_type:pk`?
A: A gid follows the format: `{cluster}|{partition}|{bucket}|{hash}|{version}|{md5}|{timestamp}`. Use the full gid from your phase-1 query results.
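
To illustrate the format, a gid can be split into its named parts. This is a sketch: `parse_gid` is a hypothetical helper, the part names are taken from the format string in this answer, and all values are kept as strings.

```python
GID_PARTS = ("cluster", "partition", "bucket", "hash", "version", "md5", "timestamp")


def parse_gid(gid):
    """Split a pipe-delimited gid into a dict keyed by part name."""
    parts = gid.split("|")
    if len(parts) != len(GID_PARTS):
        raise ValueError(f"expected {len(GID_PARTS)} pipe-separated parts, got {len(parts)}")
    return dict(zip(GID_PARTS, parts))
```

A quick length check like this can catch a truncated gid before it reaches the service.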

Q: Can I retrieve documents without knowing the primary key?
A: No. Both methods require a document ID, a primary key hash, or a raw primary key value. Ensure your cluster schema defines a primary key field with `has_primary_key_attribute=true`.

Q: Why am I getting a 400 error with `fetch_summary_type:rawpk`?
A: In rawpk mode, the `fetch_summary` value must use `cluster:pk` format, and multiple clusters are separated by semicolons. Also, primary key values containing commas, colons, or pipes must be URL-encoded.

Q: Is highlighting supported in the `fetch_summary` clause?
A: No. Text highlighting is only available when using the JSON-based `summary` clause with the `highlight` configuration.

## Pricing & Billing

### Billing Model
Billed per request (per phase-2 summary retrieval operation).

### Price Reference

| Tier | Input Price | Output Price |
|------|-------------|--------------|
| standard | 0.0001 / | 0.0002 / |

### Free Tier
 10,000 

### Usage Limits
100 QPS, 5000 

### Billing Notes
Summaries retrieved via phase-2 queries are billed per request. Async tasks are not supported for this operation.