# OpenSearch Data Ingestion and Processing Console Guide

## Operations Overview

| Operation | Console Entry | Prerequisites | Description |
|------|-----------|---------|------|
| Document Split | Console > AI Search Open Platform > Document Split Service > Quick Start | - API key obtained<br>- Correct endpoint configured<br>- Alibaba Cloud Python SDK installed | Access the quick start guide for document splitting via the console |
| Build Data Pipelines | Console > AI Search Open Platform > RAG > Scenario Center > Multimodal Data Preprocessing Scenario - Data Parsing and Embedding | - AI Search Open Platform activated<br>- Service endpoint and authentication credentials obtained | Launch a multimodal data preprocessing scenario to generate parsing and embedding code |
| Batch Synchronize Single Table | DataWorks > Workspaces > Quick Access > Data Integration | - OpenSearch Retrieval Engine Edition instance created in same region as resource group<br>- Workspace created in DataWorks<br>- Resource group associated with workspace<br>- Source table accessible | Perform offline batch synchronization of a single table into OpenSearch |
| Configure MaxCompute Data Source | Console > OpenSearch Vector Search Edition > Instance Management > Manage > Configuration Center > Data Source Configuration | - Basic MaxCompute knowledge<br>- Table permissions: DESCRIBE, SELECT, DOWNLOAD<br>- LABEL permission on all fields<br>- Partitioned MaxCompute table | Add, edit, or delete a MaxCompute data source for OpenSearch indexing |
| Add MaxCompute Data Source | OpenSearch Retrieval Engine Edition > Instances > Manage > Configuration Center > Data Source | - Familiarity with MaxCompute<br>- Partitioned internal MaxCompute table<br>- Supported field types (STRING, BOOLEAN, DOUBLE, BIGINT, DATETIME)<br>- DESCRIBE, SELECT, DOWNLOAD, and LABEL permissions | Create a new MaxCompute data source in OpenSearch Retrieval Engine Edition |
| Add API Data Source | Console > OpenSearch Retrieval Engine Edition > Instance Management > Manage > Configuration Center > Data Source | - OpenSearch instance created and accessible<br>- No index table associated (for deletion) | Add or delete an API-based data source |
| Configure OSS Data Source | Console > OpenSearch Retrieval Engine Edition > Instances > Manage > Configuration Center > Data Source | - OSS activated in same region as OpenSearch instance<br>- OSS bucket created<br>- Objects uploaded to bucket | Set up OSS as a data source for document indexing |

## Operation Steps

### Document Split

**Navigation**: Console > AI Search Open Platform > Document Split Service > Quick Start

**Prerequisites**:
- API key obtained
- Correct endpoint configured
- Alibaba Cloud Python SDK installed

1. Navigate to the top navigation bar and click **Console** (link)
   - Element: **Console** (link) — top navigation bar

2. In the product center, select **AI Search Open Platform** (link)
   - Element: **AI Search Open Platform** (link) — product center

3. On the Document Split Service page, click **Quick Start** (link)
   - Element: **Quick Start** (link) — Document Split Service module

### Build Data Pipelines

**Navigation**: Console > AI Search Open Platform > RAG > Scenario Center > Multimodal Data Preprocessing Scenario - Data Parsing and Embedding

**Prerequisites**:
- AI Search Open Platform activated
- Service endpoint and authentication credentials obtained

1. Log on to the **AI Search Open Platform console** (link)
   - Element: **AI Search Open Platform console** (link) — top navigation

2. In the top-right corner, select **China (Shanghai)** (dropdown) and switch to your target workspace
   - Element: **China (Shanghai)** (dropdown) — top-right corner

3. In the left navigation pane, go to **RAG service** (menu)
   - Element: **RAG service** (menu) — left navigation pane

4. In the main content area, choose **Scenario Center** and click the **Multimodal Data Preprocessing Scenario - Data Parsing and Embedding** (panel)
   - Element: **Multimodal Data Preprocessing Scenario - Data Parsing and Embedding** (panel) — main content area

5. At the bottom of the page, click **Next** (button) to view and download code
   - Element: **Next** (button) — bottom of the page

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| api_key | text_input | Yes | — | The API key used to call the service. To obtain an API key, see Manage API keys. |
| aisearch_endpoint | text_input | Yes | — | The service endpoint. Remove the http:// prefix. You can call the API over the Internet or through a VPC. |
| workspace_name | text_input | Yes | — | The name of your workspace in AI Search Open Platform. |
| service_id | text_input | Yes | — | The service ID. For convenience, you can configure the IDs for each service in the service_id_config object. |

### Batch Synchronize Single Table

**Navigation**: DataWorks > Workspaces > Quick Access > Data Integration

**Prerequisites**:
- An OpenSearch Retrieval Engine Edition instance must be created and configured in the same region as the Data Integration resource group
- A workspace must be created or selected in DataWorks
- The purchased DataWorks resource group must be associated with the workspace
- The source data table must be accessible and properly structured

1. Go to the DataWorks Workspaces page and access **Quick Access** (menu)
   - Element: **Quick Access** (menu) — top navigation panel

2. From the left navigation panel, choose **Data Integration** (link)
   - Element: **Data Integration** (link) — left navigation panel

3. In the left navigation pane, go to **Data Source** and click **Add Data Source** (button)
   - Element: **Add Data Source** (button) — left navigation panel

4. On the Add Data Source page, search for and select **OpenSearch** (dropdown)
   - Element: **OpenSearch** (dropdown) — search bar results

5. In the Basic Information section, set **Engine Type** (dropdown) to **Retrieval Engine Edition**
   - Element: **Engine Type** (dropdown) — Basic Information section

6. In the Connection Configuration section, click **Test Connectivity** (button) to verify
   - Element: **Test Connectivity** (button) — Connection Configuration section
   - Notes: If connectivity fails, use the Network Connectivity Diagnostic Tool to resolve issues

7. Return to the Synchronization Task module and click **Create Synchronization Task** (button)
   - Element: **Create Synchronization Task** (button) — top-left corner

8. On the Create Synchronization Task page, set **Synchronization Type** (dropdown) to **Batch synchronization for a single table**
   - Element: **Synchronization Type** (dropdown) — Create Synchronization Task page

9. In DataStudio, create a new node: in the **Create Node** (dialog), set **Node Type** to **Offline synchronization**
   - Element: **Create Node** (dialog) — center of screen

10. Configure Network and Resource Settings (Data Source, My Resource Group, Data Destination), then click **Next** (button)
    - Element: **Next** (button) — bottom of form
    - Notes: Ensure all connections are established before proceeding

11. In the main content area, configure **Field Mapping** (text_input) between source and destination fields
    - Element: **Field Mapping** (text_input) — main content area
    - Notes: Source and destination fields must correspond row by row; missing default values cause failure

12. Click **Run** (button) to start the offline data synchronization task
    - Element: **Run** (button) — upper-left corner

13. After completion, return to the OpenSearch console, find your instance, and click **Manage** (button)
    - Element: **Manage** (button) — Actions column

14. In the left navigation panel, go to Extended Features → SQL Development and click **Create SQL Instance** (button)
    - Element: **Create SQL Instance** (button) — left navigation panel

15. In the code editor, run `select count(*) from table_name;` and click **Run** (button)
    - Element: **Run** (button) — top-right corner of code editor

16. Run `select * from table_name where id="***"` and click **Run** (button) to retrieve specific data
    - Element: **Run** (button) — top-right corner of code editor
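Because source and destination fields are matched row by row in step 11, it can help to compare the two field lists before running the task. The sketch below is illustrative only and does not call DataWorks; `check_field_mapping` is a hypothetical helper and the field names are made up.

```python
from typing import List


def check_field_mapping(source_fields: List[str], destination_fields: List[str]) -> List[str]:
    """Return a list of problems found when pairing fields row by row."""
    problems = []
    if len(source_fields) != len(destination_fields):
        problems.append(
            f"field count differs: {len(source_fields)} source "
            f"vs {len(destination_fields)} destination"
        )
    # zip stops at the shorter list, so only complete rows are inspected here.
    for row, (src, dst) in enumerate(zip(source_fields, destination_fields), start=1):
        if not src.strip() or not dst.strip():
            problems.append(f"row {row}: empty field name")
    # A destination field that appears twice would receive two source columns.
    duplicates = sorted({f for f in destination_fields if destination_fields.count(f) > 1})
    if duplicates:
        problems.append("duplicate destination fields: " + ", ".join(duplicates))
    return problems


# Hypothetical field lists for illustration.
print(check_field_mapping(["id", "title", "body"], ["id", "title"]))
```

An empty result means every source row has exactly one destination row; it does not confirm that the data types are compatible.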

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Table name | text_input | Yes | — | Customizable name for the index table |
| Number of data shards | number_input | Yes | — | Positive integer no greater than 256; recommended not to exceed three times the number of data nodes |
| Number of resources for data updates | number_input | No | — | Number of resources used for data updates; free quota provided per index |
| Full Data Source | dropdown | Yes | API | Indicates that data is pushed to the instance using an API |
| Engine Type | dropdown | Yes | Retrieval Engine Edition | Specifies the type of OpenSearch engine to connect to |
| Synchronization Type | dropdown | Yes | Batch synchronization for a single table | Defines the method of data transfer |
| Node Type | dropdown | Yes | Offline synchronization | Specifies the execution mode of the synchronization task |

### Configure MaxCompute Data Source

**Navigation**: Console > OpenSearch Vector Search Edition > Instance Management > Manage > Configuration Center > Data Source Configuration

**Prerequisites**:
- Basic understanding of MaxCompute (formerly ODPS)
- Required table permissions: DESCRIBE, SELECT, and DOWNLOAD
- LABEL permission on all fields in the table
- A partitioned MaxCompute table

1. Log on to the OpenSearch console and switch to **OpenSearch Vector Search Edition** (menu)
   - Element: **OpenSearch Vector Search Edition** (menu) — upper-left corner

2. Find your instance and click **Manage** (button) in the Actions column
   - Element: **Manage** (button) — Actions column

3. In the left-side navigation pane, choose **Configuration Center** (menu) > Data Source Configuration
   - Element: **Configuration Center** (menu) — left-side navigation panel

4. Click **Add Data Source** (button), select MaxCompute, and configure parameters
   - Element: **Add Data Source** (button) — main content area
   - Notes: Configure: Data Source Name, Project, AccessKey ID, AccessKey Secret, Table, Partition Key, Automatic Index Rebuilding

5. Click **Validate** (button) to test configuration
   - Element: **Validate** (button) — main content area
   - Notes: **OK** becomes clickable only after validation passes

6. Click **OK** (button) to save the data source
   - Element: **OK** (button) — main content area

7. To edit, click **Modify** (button) in the Actions column
   - Element: **Modify** (button) — Actions column

8. On the Edit Data Source page, modify parameters and click **Validate** (button)
   - Element: **Validate** (button) — main content area

9. Click **OK** (button) to save changes after validation
   - Element: **OK** (button) — main content area

10. To delete, click **Delete** (button) in the Actions column
    - Element: **Delete** (button) — Actions column
    - Notes: If the data source is referenced by an index table, deletion fails with an error message

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Data Source Name | text_input | Yes | — | The name of the data source. Must follow the format instance_name_custom_name. |
| Project | text_input | Yes | — | The MaxCompute project name. |
| AccessKey ID | text_input | Yes | — | The AccessKey ID used to authenticate access to MaxCompute. |
| AccessKey Secret | text_input | Yes | — | The AccessKey Secret used to authenticate access to MaxCompute. |
| Table | text_input | Yes | — | The name of the MaxCompute table to be used as the data source. |
| Partition Key | text_input | Yes | — | The partition key of the MaxCompute table. |
| Automatic Index Rebuilding | toggle | No | Enabled, Disabled | If enabled, the system automatically rebuilds the index when the data source changes. |
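The Data Source Name format `instance_name_custom_name` can be built and checked programmatically before the form is filled in. A minimal sketch follows; the helper names and the instance name `ha-cn-example` are hypothetical, and the exact characters the console accepts are not specified in this guide.

```python
def build_data_source_name(instance_name: str, custom_name: str) -> str:
    """Join the instance name and a custom suffix with an underscore."""
    if not instance_name.strip() or not custom_name.strip():
        raise ValueError("both instance_name and custom_name are required")
    return f"{instance_name.strip()}_{custom_name.strip()}"


def has_instance_prefix(data_source_name: str, instance_name: str) -> bool:
    """True when the name starts with '<instance_name>_' and has a non-empty suffix."""
    prefix = instance_name + "_"
    return data_source_name.startswith(prefix) and len(data_source_name) > len(prefix)


# Hypothetical instance name for illustration.
name = build_data_source_name("ha-cn-example", "orders")
print(name, has_instance_prefix(name, "ha-cn-example"))
```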

### Add MaxCompute Data Source

**Navigation**: OpenSearch Retrieval Engine Edition > Instances > Manage > Configuration Center > Data Source

**Prerequisites**:
- Familiarity with MaxCompute (formerly ODPS)
- MaxCompute table must be a partitioned internal table
- Table fields must use only STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME data types
- Account must have DESCRIBE, SELECT, DOWNLOAD permissions on the table and LABEL permission on fields

1. Log on to the OpenSearch console and select **OpenSearch Retrieval Engine Edition** (menu)
   - Element: **OpenSearch Retrieval Engine Edition** (menu) — upper-left corner

2. Find your instance and click **Manage** (button) in the Actions column
   - Element: **Manage** (button) — Actions column

3. In the left-side navigation pane, choose Configuration Center > Data Source, then click **Add Data Source** (button)
   - Element: **Add Data Source** (button) — main content area

4. Select **MaxCompute** (radio) as the data source type
   - Element: **MaxCompute** (radio) — data source type selection

5. Click **Verify** (button) to validate configuration, then click **OK**
   - Element: **Verify** (button) — bottom of panel

6. To modify, click **Modify** (button) in the Actions column
   - Element: **Modify** (button) — Actions column
   - Notes: Data source name cannot be changed

7. To delete, click **Delete** (button) in the Actions column
   - Element: **Delete** (button) — Actions column
   - Notes: If the data source is referenced by an index table, delete the index table first

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Data Source Name | text_input | Yes | — | Name of the data source. Format: InstanceName_CustomName. Cannot be changed after creation. |
| Project | text_input | Yes | — | The MaxCompute project that contains your table. |
| AccessKey | text_input | Yes | — | The AccessKey ID of the account. |
| AccessKey Secret | text_input | Yes | — | The AccessKey secret of the account. |
| Table | text_input | Yes | — | The MaxCompute table to use as the data source. Must be a partitioned internal table. |
| Partition Key | text_input | Yes | — | The partition key of the table. Use yyyymmddhh format for hourly partitions. |
| Automatic Reindexing | checkbox | No | — | When enabled, OpenSearch automatically rebuilds indexes each time a change is detected in the data source. Requires a done table. |
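Two of the constraints above lend themselves to a quick local check: the supported field types and the `yyyymmddhh` format for hourly partitions. The sketch below works on a plain dictionary of column names and types and does not connect to MaxCompute; the helper names and the sample schema are hypothetical.

```python
from datetime import datetime
from typing import Dict

# Field types listed in the prerequisites of this section.
SUPPORTED_TYPES = {"STRING", "BOOLEAN", "DOUBLE", "BIGINT", "DATETIME"}


def unsupported_columns(schema: Dict[str, str]) -> Dict[str, str]:
    """Return the columns whose declared type is not in the supported set."""
    return {
        column: declared_type
        for column, declared_type in schema.items()
        if declared_type.strip().upper() not in SUPPORTED_TYPES
    }


def hourly_partition_value(moment: datetime) -> str:
    """Format a timestamp as yyyymmddhh, the hourly partition format named above."""
    return moment.strftime("%Y%m%d%H")


# Hypothetical table schema for illustration.
schema = {"id": "bigint", "title": "STRING", "price": "DECIMAL", "created": "datetime"}
print(unsupported_columns(schema))
print(hourly_partition_value(datetime(2024, 3, 5, 7, 30)))
```

Columns reported by `unsupported_columns` would need to be cast or dropped in MaxCompute before the table is used as a data source.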

### Add API Data Source

**Navigation**: Console > OpenSearch Retrieval Engine Edition > Instance Management > Manage > Configuration Center > Data Source

**Prerequisites**:
- An OpenSearch instance must be created and accessible
- To delete an API data source, no index table may be associated with it

1. Log on to the OpenSearch console and select **OpenSearch Retrieval Engine Edition** (menu)
   - Element: **OpenSearch Retrieval Engine Edition** (menu) — upper-left corner

2. Find the target instance and click **Manage** (button) in the Actions column
   - Element: **Manage** (button) — Actions column

3. In the left-side navigation pane, choose **Configuration Center** (menu) > Data Source
   - Element: **Configuration Center** (menu) — left-side navigation panel

4. Click **Add Data Source** (button)
   - Element: **Add Data Source** (button) — main content area

5. In the Add Data Source panel, select **API Data Source** (radio) and enter a **Data Source Name** (text_input)
   - Element: **API Data Source** (radio) — data source type selection

6. Click **Verify** (button) to proceed
   - Element: **Verify** (button) — Add Data Source panel

7. To delete an API data source, find it on the Data Source page and click **Delete** (button)
   - Element: **Delete** (button) — Actions column
   - Notes: Deleted API data sources cannot be recovered.

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Data Source Name | text_input | Yes | — | A unique name for the API data source to identify it in the system. |

### Configure OSS Data Source

**Navigation**: Console > OpenSearch Retrieval Engine Edition > Instances > Manage > Configuration Center > Data Source

**Prerequisites**:
- Object Storage Service (OSS) is activated in the same region as your OpenSearch Retrieval Engine Edition instance
- An OSS bucket is created in that region
- Objects are uploaded to the OSS bucket

1. Log on to the OpenSearch console, select OpenSearch Retrieval Engine Edition, find your instance, and click **Manage** (button)
   - Element: **Manage** (button) — Actions column

2. In the left-side pane, choose Configuration Center > Data Source and click **Add Data Source** (button)
   - Element: **Add Data Source** (button) — Data Source page

3. Set **Data Source Type** (dropdown) to OSS and configure: **Data Source Name**, **OSS Path**, **Bucket**
   - Element: **Data Source Type** (dropdown) — Add Data Source panel
   - Notes: OSS Path must contain 'opensearch' and cannot include '=', '&', or '?'

4. Click **Verify** (button) to validate the connection
   - Element: **Verify** (button) — Add Data Source panel

5. Go to Configuration Center > Index Schema and click **Create Index Table** (button). Enter a name and select the OSS data source
   - Element: **Create Index Table** (button) — Index Schema page

6. Go to O&M Center > O&M Management and click **Reindexing** (button). Configure parameters to start reindexing
   - Element: **Reindexing** (button) — O&M Management page
   - Notes: After reindexing completes, run a query test to verify indexing

7. To delete the data source, find it on the Data Source page and click **Delete** (button)
   - Element: **Delete** (button) — Actions column
   - Notes: If an index table was created for this data source, delete the index table first

| Parameter | Type | Required | Options/Values | Description |
|-----------|------|----------|----------------|-------------|
| Data Source Name | text_input | Yes | — | A custom name for the OSS data source. Must start with a letter and can contain letters, digits, and underscores (_). |
| OSS Path | text_input | Yes | — | The path used to access objects in the bucket. Must contain 'opensearch' and cannot contain '=', '&', or '?'. Example: /opensearch_index_data/ |
| Bucket | text_input | Yes | — | The name of the OSS bucket. |

## FAQ

Q: Where can I find the Document Split quick start guide in the console?
A: Navigate to Console > AI Search Open Platform > Document Split Service > Quick Start.

Q: What permissions are required to add a MaxCompute data source?
A: You need DESCRIBE, SELECT, and DOWNLOAD permissions on the MaxCompute table, plus LABEL permission on all fields.

Q: Can I modify the Data Source Name after creating a MaxCompute data source?
A: No, the Data Source Name cannot be changed after creation.

Q: Why does my OSS Path need to contain 'opensearch'?
A: This is a security requirement to ensure the path is dedicated to OpenSearch indexing and avoids conflicts with other services.

Q: What happens if I delete an API data source that is linked to an index table?
A: Deletion will fail. You must delete the associated index table first before removing the data source.

## Pricing & Billing

### Billing Model
- Document Split: billed per request (ops-document-split-001: 0.0005 per request)
- Multimodal Pipeline Services: billed per 1,000 tokens (e.g., ops-document-analyze-001: 0.002 per 1,000 tokens)
- OpenSearch Retrieval Engine Edition: billed per instance hour

### Price Reference
| Service ID | Price |
|------|-------|
| ops-document-split-001 | 0.0005 / request |
| ops-document-analyze-001 | 0.002 / 1,000 tokens |
| ops-image-analyze-vlm-001 | 0.003 / 1,000 tokens |
| ops-image-analyze-ocr-001 | 0.001 / 1,000 tokens |
| ops-document-split-001 (pipeline) | 0.001 / 1,000 tokens |
| ops-text-embedding-001 | 0.002 / 1,000 tokens |
| ops-text-embedding-002 | 0.001 / 1,000 tokens |
| ops-text-embedding-zh-001 | 0.0015 / 1,000 tokens |
| ops-text-embedding-en-001 | 0.0018 / 1,000 tokens |
| ops-text-sparse-embedding-001 | 0.0005 / 1,000 tokens |

### Free Tier
- Document Split: First 1,000 requests per month free
- OpenSearch Retrieval Engine Edition: Two resources with 4 vCPUs and 8 GB memory provided free per index
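As a rough illustration of how the Document Split free tier interacts with the per-request price, the sketch below estimates a monthly charge. It assumes the free 1,000 requests are simply subtracted from the monthly total and uses the 0.0005 figure listed for ops-document-split-001; the currency is not stated in this guide, so the result is in unspecified price units. `document_split_monthly_cost` is a hypothetical helper, not a billing API.

```python
from decimal import Decimal

FREE_REQUESTS_PER_MONTH = 1000
# Listed price for ops-document-split-001; currency unit not stated in this guide.
PRICE_PER_REQUEST = Decimal("0.0005")


def document_split_monthly_cost(requests: int) -> Decimal:
    """Estimate the monthly charge after subtracting the free requests."""
    if requests < 0:
        raise ValueError("requests must not be negative")
    billable = max(0, requests - FREE_REQUESTS_PER_MONTH)
    # Decimal avoids binary floating-point rounding in the multiplication.
    return billable * PRICE_PER_REQUEST


for monthly_requests in (800, 1000, 10000):
    print(monthly_requests, document_split_monthly_cost(monthly_requests))
```

Actual invoices may differ; the billing overview remains the reference for how quotas and prices are applied.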

### Billing Notes
- Document Split: Each request is charged once regardless of output size
- Multimodal Services: Billed based on actual usage per service invoked
- OpenSearch Instance: Charges apply for resources exceeding the free quota; refer to billing overview for full details
- Quota Limits: Single request body max 8 MB (Document Split); 100 QPS per model (Multimodal); one resource group can sync to only one OpenSearch instance at a time
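The 8 MB request body limit for Document Split can be checked on the client side before a request is sent. The sketch below measures a JSON-serialized payload; it assumes "8 MB" means 8 × 1024 × 1024 bytes and that the body is UTF-8 encoded JSON, and the `content` key is a made-up payload field rather than the service's request schema.

```python
import json

# Assumption: "8 MB" is interpreted as 8 * 1024 * 1024 bytes.
MAX_BODY_BYTES = 8 * 1024 * 1024


def body_size_bytes(payload: dict) -> int:
    """Size of the payload once serialized to UTF-8 encoded JSON."""
    return len(json.dumps(payload, ensure_ascii=False).encode("utf-8"))


def fits_request_limit(payload: dict) -> bool:
    """True when the serialized payload is within the request body limit."""
    return body_size_bytes(payload) <= MAX_BODY_BYTES


# Hypothetical payload for illustration.
small = {"content": "A short document to split."}
print(body_size_bytes(small), fits_request_limit(small))
```

A payload that fails this check would need to be divided into smaller documents before submission.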