# pai-dataset_management

# Platform for AI (PAI) Dataset Management

## Capabilities Overview

| Sub-capability | Calling Mode | Description |
|----------------|--------------|-------------|
| Manage Dataset | Synchronous | Create, delete, update, get details, list, or publish datasets. |
| Manage Dataset Version | Synchronous | Create, delete, update, or get details of dataset versions. |
| Manage Dataset Label | Synchronous | Add or delete labels for datasets and dataset versions. |
| Manage Dataset File | Synchronous | Create, delete, update, get metadata, or list files within datasets. |
| Manage Dataset Job | Synchronous | Create, delete, update, get details, list, or stop dataset processing jobs. |
| Manage Dataset Job Config | Synchronous | Create, delete, update, or get configuration for dataset jobs. |
| Manage Dataset Metadata | Synchronous | Create, get, and update metadata for dataset files. |
| Manage Dataset Jobs | Synchronous | Configure and create jobs for dataset processing. |
| Label Datasets | Synchronous | Apply labels to datasets for organization and training. |
| Manage Dataset Sharing | Synchronous | Configure sharing relationships for datasets across workspaces or users. |
| Manage Dataset Versions | Synchronous | Define and structure dataset versions. |
| Get Dataset Details | Synchronous | Retrieve comprehensive information about a dataset. |
| Define Relationship Structure | Synchronous | Define the structure of relationships between data entities. |
| Define Lineage Structures | Synchronous | Define the structure of lineage entities and their relationships. |
| Configure DataJuicer Task | Synchronous | Set up configuration for DataJuicer data processing tasks. |
| Run Chi-Square Goodness-of-Fit Test | Synchronous | Perform chi-square goodness-of-fit statistical tests on datasets. |
| Compute Percentiles | Synchronous | Calculate percentile values for datasets through API calls. |

## API Calling Patterns

### Authentication
The primary authentication method is Bearer Token authentication.

- **Header format**: `Authorization: Bearer <your_api_key>`
- **Environment variable**: `DASHSCOPE_API_KEY`
- While other authentication methods may exist in the broader Alibaba Cloud ecosystem, Bearer Token is the standard and recommended approach for PAI Dataset Management APIs.
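
As a minimal sketch of the header format above, the helper below reads the key from `DASHSCOPE_API_KEY` and builds the request headers. The function name `build_auth_headers` is hypothetical and used only for illustration.

```python
import os


def build_auth_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Return headers carrying the Bearer token read from the environment.

    `build_auth_headers` is an illustrative helper, not part of any SDK.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early instead of sending a request with an empty token.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Reading the key at call time keeps it out of source code and makes a missing variable fail before any request is sent.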

### Service Endpoint
The base URL depends on the site (China or international), and the operation name is appended as the final path segment:

`https://api.aliyun.com/api/AIWorkSpace/2021-02-04/{operation}` (for China regions)  
`https://api.alibabacloud.com/api/AIWorkSpace/2021-02-04/{operation}` (for international regions)

Common regions include:
- `cn-hangzhou`
- `cn-shanghai`
- `cn-beijing`
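
The two URL patterns above can be captured in a small lookup. This is only a sketch: the site keys `"china"` and `"international"` and the helper name `endpoint_for` are assumptions made for illustration.

```python
# Base URLs copied from the patterns above; the dictionary keys are illustrative.
BASE_URLS = {
    "china": "https://api.aliyun.com/api/AIWorkSpace/2021-02-04",
    "international": "https://api.alibabacloud.com/api/AIWorkSpace/2021-02-04",
}


def endpoint_for(operation: str, site: str = "china") -> str:
    """Build the full endpoint URL for an operation such as "CreateDataset"."""
    if site not in BASE_URLS:
        raise ValueError(f"unknown site {site!r}; expected one of {sorted(BASE_URLS)}")
    return f"{BASE_URLS[site]}/{operation}"
```

For example, `endpoint_for("ListDatasets", site="international")` returns the international URL with `ListDatasets` as the final path segment.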

### Synchronous API Pattern
All operations in the Dataset Management domain follow a synchronous calling pattern:

1. **Send Request**: Make an HTTP request (GET, POST, PUT, or DELETE) to the appropriate endpoint with required parameters in the query string, path, or request body.
2. **Include Authentication**: Add the `Authorization: Bearer $DASHSCOPE_API_KEY` header to every request.
3. **Receive Immediate Response**: The server processes the request and returns a JSON response immediately with either success data or error details.
4. **Handle Response**: Parse the JSON response to extract results (e.g., `DatasetId`, `VersionName`) or handle errors based on HTTP status codes and error messages.

No polling or asynchronous result retrieval is required since all operations complete synchronously.
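
The four steps can be sketched with the Python standard library as follows. The helper names are hypothetical, and the `Code` and `Message` fields read from error bodies are an assumption about the error payload, not something this document specifies.

```python
import json
import urllib.request


def build_request(url: str, method: str, api_key: str, body: dict = None):
    """Steps 1 and 2: assemble the HTTP request with the Bearer header."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def handle_response(status: int, raw_body: bytes) -> dict:
    """Steps 3 and 4: parse the JSON body and raise on HTTP error statuses."""
    payload = json.loads(raw_body.decode("utf-8")) if raw_body else {}
    if status >= 400:
        # "Code" and "Message" are assumed field names for error details.
        raise RuntimeError(
            f"HTTP {status}: {payload.get('Code')} {payload.get('Message')}"
        )
    return payload
```

A request built this way could be sent with `urllib.request.urlopen(req)`, passing the resulting status and body to `handle_response`. Because every operation is synchronous, no polling loop is needed.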

## Parameter Reference

### Manage Dataset

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| Name | string | true | null | 1 to 127 characters, starts with letter/number/Chinese char, contains only letters, numbers, underscores, hyphens | The name of the dataset. |
| Property | string | true | null | one of: FILE, DIRECTORY | The property of the dataset. |
| DataSourceType | string | true | null | one of: OSS, NAS, EXTREMENAS, CPFS, BMCPFS, MAXCOMPUTE, URL | The type of the data source. |
| Uri | string | true | null | format depends on DataSourceType | The URI of the data. |
| DataType | string | false | COMMON | one of: COMMON, PIC, TEXT, VIDEO, AUDIO | The data type of the dataset. |
| Labels | array | false | null | null | A list of labels. |
| SourceType | string | false | USER | one of: PAI_PUBLIC_DATASET, ITAG, USER | The source of the data. |
| WorkspaceId | string | false | null | null | The ID of the workspace to which the dataset belongs. |
| Accessibility | string | false | PRIVATE | one of: PRIVATE, PUBLIC, ROLE_PUBLIC | The visibility of the dataset in the workspace. |
| Edition | string | false | BASIC | one of: BASIC, ADVANCED, LOGICAL | The edition of the dataset. |
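
Checking a payload against this table before sending it can catch `InvalidParameter` errors early. The sketch below is a client-side check only. It assumes that Chinese characters are allowed anywhere in `Name` (the table only states this for the first character), it skips `SourceType`, and the helper name is hypothetical.

```python
import re

# Assumption: CJK characters are accepted in every position of the name.
NAME_PATTERN = re.compile(r"[A-Za-z0-9\u4e00-\u9fff][A-Za-z0-9\u4e00-\u9fff_-]{0,126}")

REQUIRED = ("Name", "Property", "DataSourceType", "Uri")

# Enum values copied from the parameter table above.
ALLOWED = {
    "Property": {"FILE", "DIRECTORY"},
    "DataSourceType": {"OSS", "NAS", "EXTREMENAS", "CPFS", "BMCPFS", "MAXCOMPUTE", "URL"},
    "DataType": {"COMMON", "PIC", "TEXT", "VIDEO", "AUDIO"},
    "Accessibility": {"PRIVATE", "PUBLIC", "ROLE_PUBLIC"},
    "Edition": {"BASIC", "ADVANCED", "LOGICAL"},
}


def validate_create_dataset(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []
    for key in REQUIRED:
        if not params.get(key):
            errors.append(f"missing required parameter: {key}")
    name = params.get("Name")
    if name and not NAME_PATTERN.fullmatch(name):
        errors.append("Name does not match the documented naming rule")
    for key, allowed in ALLOWED.items():
        if key in params and params[key] not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}")
    return errors
```

The function returns all problems at once rather than raising on the first, which makes it easier to fix a payload in one pass.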

### Manage Dataset Version

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| DatasetId | string | true | null | null | The dataset ID. |
| Property | string | true | null | one of: FILE, DIRECTORY | The property of the dataset. |
| DataSourceType | string | true | null | one of: NAS, OSS, CPFS | The type of the data source. |
| Uri | string | true | null | null | The URI of the data. |
| SourceType | string | false | USER | one of: PAI-PUBLIC-DATASET, ITAG, USER | The source of the data. |
| Description | string | false | null | null | A custom description for the dataset version. |
| DataSize | integer | false | null | null | The size of the space occupied by the dataset files. Unit: bytes. |
| DataCount | integer | false | null | null | The number of files in the dataset. |

### Manage Dataset Label

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| DatasetId | string | false | null | null | The dataset ID. |
| Labels | array | false | null | null | The list of labels. |
| LabelKeys | string | false | null | null | The keys of the labels. Separate multiple keys with commas (,). |

### Manage Dataset File

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| DatasetId | string | true | null | null | The dataset ID. |
| WorkspaceId | string | true | null | null | The ID of the workspace where the dataset is located. |
| DatasetVersion | string | true | null | null | The name of the dataset version. |
| DatasetFileMetaIds | string | true | null | null | The ID of the dataset file metadata. |
| AggregateBy | string | false | null | one of: filedir, filetype, tags.user, tags.user-delete-ai-tags, tags.ai, tags.all | The metadata field used for statistical aggregation. |
| QueryType | string | false | MIX | one of: MIX, TAG, VECTOR | The search type. |

### Manage Dataset Job

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| DatasetId | string | true | null | null | The dataset ID. |
| WorkspaceId | string | true | null | null | The workspace ID. |
| JobAction | string | true | null | one of: SemanticIndex, IntelligentTag, FileMetaExport, FileMetaBuild, IntelligentTagRevert, FileMetaImport | The task operation. |
| JobMode | string | false | Full | one of: Full, Increment | The task type. |
| JobSpec | string | true | null | null | The task details. |
| Status | string | false | null | one of: Succeeded, Failed, Starting, Running, Deleted, Pending, PartialFailed, Deleting, ManuallyStop | The job status. |
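
`JobSpec` is typed as a string, and the `CreateDatasetJob` example in this document passes it as JSON text. A hedged sketch of building such a payload follows; the helper name `build_job_payload` is hypothetical, and the exact keys expected inside `JobSpec` depend on the job action and are not specified here.

```python
import json

# Enum values copied from the parameter table above.
JOB_ACTIONS = {
    "SemanticIndex", "IntelligentTag", "FileMetaExport",
    "FileMetaBuild", "IntelligentTagRevert", "FileMetaImport",
}
JOB_MODES = {"Full", "Increment"}


def build_job_payload(dataset_id: str, workspace_id: str, job_action: str,
                      job_spec: dict, job_mode: str = "Full") -> dict:
    """Assemble a job request body, serializing JobSpec to a JSON string."""
    if job_action not in JOB_ACTIONS:
        raise ValueError(f"unsupported JobAction: {job_action}")
    if job_mode not in JOB_MODES:
        raise ValueError(f"unsupported JobMode: {job_mode}")
    return {
        "DatasetId": dataset_id,
        "WorkspaceId": workspace_id,
        "JobAction": job_action,
        "JobMode": job_mode,
        # JobSpec travels as a string, so the nested object is encoded separately.
        "JobSpec": json.dumps(job_spec),
    }
```

Using `json.dumps` for the nested object avoids the hand-written escaping that the curl form requires.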

### Statistical Analysis Functions

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| inputTableName | string | true | null | null | The name of the input table. |
| colName | string | true | null | null | The name of the column to test. |
| outputTableName | string | true | null | null | The name of the output table. |
| probConfig | string | false | Not specified. All categories are assumed to have equal probability. | All probabilities must sum to 1 | The expected class probability for each category. |
| inputPartitions | string | false | Empty (all data) | null | The partition to use from the input table. |
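
The `probConfig` constraint (probabilities must sum to 1) can be checked before the command is issued. The sketch below assumes the `category:probability` comma-separated format seen in the chi-square example (`0:0.3,1:0.7`); the helper name is hypothetical.

```python
import math


def format_prob_config(probabilities: dict) -> str:
    """Render {category: probability} as "cat:prob,cat:prob" after validation."""
    if any(p < 0 for p in probabilities.values()):
        raise ValueError("probabilities must be non-negative")
    total = sum(probabilities.values())
    # Compare with a tolerance because floating-point sums are rarely exact.
    if not math.isclose(total, 1.0, rel_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return ",".join(f"{category}:{prob}" for category, prob in probabilities.items())
```

The tolerance matters in practice: `0.3 + 0.7` is not exactly `1.0` in binary floating point, so an equality check would reject a valid configuration.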

## Code Examples

### Create a Dataset - curl - all

```bash
curl -X POST https://api.aliyun.com/api/AIWorkSpace/2021-02-04/CreateDataset \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
  "Name": "myDataset",
  "Property": "DIRECTORY",
  "DataSourceType": "NAS",
  "Uri": "nas://09f****f2.cn-hangzhou/",
  "DataType": "COMMON",
  "Description": "This is a description of the dataset.",
  "WorkspaceId": "478**",
  "Options": "{\"mountPath\": \"/mnt/data/\"}",
  "Accessibility": "PRIVATE",
  "ProviderType": "Ecs",
  "Provider": "Github",
  "UserId": "2485765****023475",
  "Edition": "ADVANCED"
}'
```

### Delete a Dataset - Python - all

```python
import requests

url = "https://api.aliyun.com/api/AIWorkSpace/2021-02-04/DeleteDataset/d-rbvg5wzlj****9ks92"
headers = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json"
}

response = requests.delete(url, headers=headers)
print(response.json())
```

### List Dataset Files with Vector Search - Python - all

```python
import requests

url = "https://api.aliyun.com/api/AIWorkSpace/2021-02-04/ListDatasetFileMetas"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
params = {
    "DatasetId": "d-rbvg5*****jhc9ks92",
    "DatasetVersion": "v1",
    "WorkspaceId": "105173",
    "QueryType": "MIX",
    "QueryText": "A fallen water",
    "TopK": 100,
    "ScoreThreshold": 0.6
}

response = requests.get(url, headers=headers, params=params)
print(response.json())
```

### Create a Dataset Processing Job - curl - all

```bash
curl -X POST https://api.aliyun.com/api/AIWorkSpace/2021-02-04/CreateDatasetJob \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "DatasetId": "d-rbvg5wz****c9ks92",
  "WorkspaceId": "478**",
  "JobAction": "SemanticIndex",
  "JobMode": "Full",
  "DatasetVersion": "v1",
  "Description": "This is a job description.",
  "JobSpec": "{\"modelId\": \"xxx\", \"contentList\": [\"file\"]}"
}'
```

### Run Chi-Square Goodness-of-Fit Test - bash - all

```bash
PAI -name chisq_test
    -project algo_public
    -DinputTableName=pai_chisq_test_input
    -DcolName=f0
    -DprobConfig=0:0.3,1:0.7
    -DoutputTableName=pai_chisq_test_output0
    -DoutputDetailTableName=pai_chisq_test_output0_detail
```

### Compute Percentiles - bash - all

```bash
PAI -name Percentile
    -project algo_public
    -DinputTableName=maple_test_percentile_3col_input
    -DcolName=col0,col1,col2
    -DoutputTableName=maple_test_percentile_3col_output;
```

### Get Dataset Version Details - Python - all

```python
import os

import requests

url = "https://api.aliyun.com/api/AIWorkSpace/2021-02-04/GetDatasetVersion"
params = {
    "DatasetId": "d-lfd60v0p****ujtsdx",
    "VersionName": "v1"
}
headers = {
    # Read the key from the environment; Python does not expand "$VAR" inside strings.
    "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
    "Content-Type": "application/json"
}

response = requests.get(url, params=params, headers=headers)
print(response.json())
```

### Update Dataset File Metadata - curl - all

```bash
curl -X PUT https://api.aliyun.com/api/AIWorkSpace/2021-02-04/datasets/d-lfd60v0p****ujtsdx/datasetfilemetas \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
  "DatasetVersion": "v1",
  "WorkspaceId": "796**",
  "DatasetFileMetas": [
    {
      "DatasetFileMetaId": "07914c9534586e4e7aa6e9dbca5009082df*******8a0d857b33296c59bf6",
      "Tags": ["updated", "processed"]
    }
  ],
  "TagJobId": "dsjob-hv0b1****u8taig3y"
}'
```

## Response Format

```json
{
  "RequestId": "B2C51F93-1C07-5477-9705-5FDB****F19F",
  "DatasetId": "d-rbvg5*****jhc9ks92"
}
```

**Key Fields**:
- `RequestId` — Unique identifier for the API request, useful for troubleshooting
- `DatasetId` — The unique identifier of the created or retrieved dataset
- `VersionName` — The name of the dataset version (e.g., "v1")
- `Status` — Indicates whether batch operations succeeded (`true`/`false`)
- `TotalCount` — Total number of items matching the query criteria

## Error Handling

| Error Code | Description | Recommended Action |
|-------------------|---------------------------|----------------------------------------|
| InvalidParameter | One or more parameters are invalid. Check the request body for malformed values, missing required fields, or invalid enum values. | Validate all parameters against the API specification, especially enum values and required fields. |
| Unauthorized | The user does not have sufficient permissions to perform this operation. Ensure the user has the paidataset:CreateDataset permission and appropriate RAM policies. | Verify that your account has the necessary RAM permissions for the specific dataset operation. |
| ResourceNotFound | The specified resource (e.g., workspace, dataset) could not be found. Verify the WorkspaceId or SourceDatasetId is correct. | Double-check resource IDs and ensure they exist in your workspace. |
| Throttling | The request rate exceeds the allowed limit. Reduce the frequency of requests or implement exponential backoff. | Implement retry logic with exponential backoff and respect rate limits. |
| 400 | Bad Request: The request parameters are invalid or missing. | Review the request payload and ensure all required parameters are present and correctly formatted. |
| 404 | Not Found: The specified dataset ID does not exist. | Confirm the dataset ID is correct and that the dataset hasn't been deleted. |
| 403 | Forbidden: The user does not have sufficient permissions to access the resource. | Check your account permissions and workspace access rights. |

### Rate Limits & Retry
- Standard rate limit: 100 QPS per user/account
- For batch operations: Single request maximum 100MB data size
- Recommended retry strategy: Exponential backoff with jitter (start with 1s delay, double each attempt up to 30s)
- When receiving 429/Throttling errors, respect the `Retry-After` header if present
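
The retry strategy above can be sketched as a delay calculator. This version uses "equal jitter" (half the ceiling fixed, half random), which is one common variant and not necessarily the one the service expects. It also assumes `Retry-After` is given in seconds; the function name is hypothetical.

```python
import random


def retry_delay(attempt: int, retry_after=None, base: float = 1.0, cap: float = 30.0) -> float:
    """Return the number of seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Assumption: the header value is a number of seconds, not an HTTP date.
        return float(retry_after)
    # Start at `base`, double each attempt, and never exceed `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Equal jitter: keep at least half the ceiling, randomize the remainder.
    return random.uniform(ceiling / 2, ceiling)
```

A caller would sleep for `retry_delay(attempt, response_headers.get("Retry-After"))` after each throttled response and give up after a fixed number of attempts.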

## Environment Requirements

- **Environment variable setup**: `export DASHSCOPE_API_KEY=your_api_key_here`
- **Python requirements**: `dashscope>=1.14.0` (for SDK-based usage)
- **Authentication**: Bearer token authentication via the `Authorization` header

## FAQ

Q: How do I authenticate API requests to the PAI Dataset Management service?
A: Use Bearer Token authentication by including the header `Authorization: Bearer $DASHSCOPE_API_KEY` in all requests, where `DASHSCOPE_API_KEY` is your API key stored as an environment variable.

Q: What's the difference between BASIC, ADVANCED, and LOGICAL dataset editions?
A: BASIC edition doesn't support file metadata management. ADVANCED supports metadata management for up to 1M files per version and is only available for OSS datasets. LOGICAL is also OSS-only and suitable for most use cases but requires using the SDK.

Q: Can I delete the v1 version of a dataset?
A: No, version v1 cannot be deleted individually. It is only removed when the entire dataset is deleted. All subsequent versions (v2, v3, etc.) can be deleted independently.

Q: How do I handle large-scale file metadata operations efficiently?
A: Use batch operations like `CreateDatasetFileMetas` and `UpdateDatasetFileMetas` which support processing multiple files in a single request, and check the `FailedDetails` in responses to handle partial failures.
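
As a rough illustration of the batching approach, the helpers below split a list of file metadata entries into fixed-size batches and gather failures from the responses. The batch size is up to the caller, since this document does not state a per-request item limit, and treating `FailedDetails` as a list is an assumption about the response shape.

```python
def chunked(items: list, batch_size: int):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def collect_failures(responses: list) -> list:
    """Merge the FailedDetails entries of several batch responses into one list."""
    failed = []
    for response in responses:
        # Assumption: FailedDetails is a list, and may be absent or null on success.
        failed.extend(response.get("FailedDetails") or [])
    return failed
```

Each batch from `chunked` would be sent as one `CreateDatasetFileMetas` or `UpdateDatasetFileMetas` request, and the entries returned by `collect_failures` can then be retried or logged.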

Q: What data sources are supported for datasets?
A: Supported data sources include OSS, NAS (including Extreme NAS), CPFS (both 1.0 and 2.0), MaxCompute, and URL-based sources. The required `Uri` format depends on the `DataSourceType` value listed in the Manage Dataset parameters.

## Pricing & Billing

### Billing Model
Billing is per request: each API call counts as one request regardless of payload size or success/failure status, unless noted otherwise.

### Price Reference

| Operation | Input Price | Output Price | Other Fees |
|------|-------------|--------------|------------|
| CreateDataset | 0.001 / | null | null |
| DeleteDataset | 0.001 / | null | null |
| GetDataset | 0.0001 / | null | null |
| ListDatasets | 0.001 / | 0.001 / | null |
| CreateDatasetJob | 0.001 / | 0.002 / | null |
| Chi-square test | 0.001 / | 0.0005 / | null |
| Percentile computation | 0.001 / | 0.001 / | null |

### Free Tier
- Most operations: 100-1000 free requests per month
- Dataset read operations: 1000 free calls monthly
- Statistical analysis functions: 100 free calls monthly

### Usage Limits
- Rate limits: 100 QPS per user/account
- Data size limits: Single request maximum 100MB for dataset creation
- File limits: ADVANCED edition supports up to 1M files per version

### Billing Notes
- Failed requests are generally billed unless specifically noted (e.g., GetDataset only bills successful calls)
- Dataset edition affects pricing (ADVANCED/LOGICAL may have additional charges)
- Statistical analysis costs are based on compute time and data volume processed
- Free tier quotas reset monthly