# opensearch-aggregation

Part of **OPENSEARCH**

# OpenSearch Data Aggregation

## Capabilities Overview

| Sub-capability | Calling Mode | Description |
|----------------|--------------|-------------|
| Get Group Statistics | Synchronous | Retrieve statistical aggregations grouped by specified fields using the aggregate clause. |
| Get Document Statistics | Synchronous | Obtain statistical information about documents in indices via aggregation functions. |
| Perform Data Aggregation | Synchronous | Execute various data aggregation operations using the Aggregate class in Java SDK. |
| Group Data by Field | Synchronous | Use GROUP BY to organize data into groups by field values and compute aggregates. |
| Group Rows and Compute Aggregate Metrics | Synchronous | Group rows by field and calculate aggregate statistics like SUM, AVG, COUNT simultaneously. |

## API Calling Patterns

### Authentication
OpenSearch aggregation APIs do not require explicit authentication headers when called from within an authorized Alibaba Cloud environment (e.g., via VPC or with proper RAM permissions). For public endpoint access:
- No `Authorization` header is needed in standard query requests.
- Ensure your OpenSearch instance is properly configured with access control policies in the Alibaba Cloud console.
- Environment variables are not used for credentials; access is managed through instance-level security settings.

### Service Endpoint
OpenSearch aggregation is performed through search query endpoints, typically structured as:
```text
http://{instance_id}.{region}.opensearch.aliyuncs.com
```
Common regions include:
- cn-hangzhou
- cn-shanghai
- cn-beijing

The exact endpoint is configured when you create your OpenSearch application and is used in all query requests.

### Synchronous Query Pattern
All aggregation operations in OpenSearch are synchronous and follow this flow:
1. Construct a search query string that includes either:
   - An `aggregate=...` clause for statistical grouping (used with the Search API)
   - A SQL `SELECT ... GROUP BY ...` statement (used with the SQL API)
2. Send a GET or POST request to the search endpoint with the query in the request body or as a URL parameter.
3. Receive a JSON response containing aggregated results under the `result.facet` field (for aggregate clause) or as standard SQL result rows (for GROUP BY).
4. Parse the response immediately—no polling or async handling is required.

Key headers (if using POST):
- `Content-Type: application/x-www-form-urlencoded` (for form-encoded queries)
- `Accept: application/json`

## Parameter Reference

### Get Group Statistics / Get Document Statistics

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| group_key:field | field (attribute) | Yes | — | INT, LITERAL, INT_ARRAY, or LITERAL_ARRAY | Specifies the field to group by. Must be configured as an attribute field in the schema. |
| agg_fun | string | Yes | — | count(), sum(id), max(id), min(id), distinct_count(id) | Built-in function(s) to compute. Multiple functions separated by `#`. Supports arithmetic expressions like `sum(hits+replies)`. |
| range | string | No | — | Format: `number1~number2` | Defines value ranges for distribution analysis. Only one range allowed. Not supported for STRING fields. |
| agg_filter | string | No | — | — | Filter condition to restrict documents included in aggregation (e.g., `create_timestamp>1423456781`). |
| agg_sampler_threshold | INT | No | — | — | Threshold rank above which documents are counted fully; below are sampled. |
| agg_sampler_step | INT | No | — | — | Sampling interval for documents below threshold. Final stats = full stats + (sampled stats × step). |
| max_group | INT | No | 1000 | — | Maximum number of groups to return. |

### Perform Data Aggregation (Java SDK)

| Parameter | Type | Required | Default | Constraints | Description |
|----------|------|----------|---------|-------------|-------------|
| groupkey | String | Yes | — | Attribute field of int or literal type | Field name for grouping. |
| aggfun | String | No | — | count(), sum(id), max(id), min(id) | Aggregation function to apply. |
| aggFilter | String | No | — | — | Filter expression for documents. |
| range | String | No | — | Format: `number1~number2` | Range specification for segmented stats. |
| aggSamplerThreHold | String | No | — | — | Sampling threshold (as string). |
| aggSamplerStep | String | No | — | — | Sampling step size (as string). |
| maxGrop | String | No | "1000" | — | Max groups to return (note: typo in parameter name `maxGrop`). |

## Code Examples

### Basic Aggregation with Multiple Functions - JavaScript - all

```javascript
query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:sum(price)#max(price);group_key:company_id,agg_fun:count()
```

### Aggregation with Sampling - JavaScript - all

```javascript
query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:sum(price), agg_sampler_threshold:10000, agg_sampler_step:5
```

### Aggregation with Range and Filter - JavaScript - all

```javascript
query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:max(hits+replies),agg_filter:create_timestamp>1423456781
```

### GROUP BY with SUM - SQL - all

```sql
SELECT brand, SUM(price) FROM phone WHERE nid < 8 GROUP BY brand
```

### GROUP BY with HAVING Clause - SQL - all

```sql
SELECT brand FROM phone GROUP BY brand HAVING COUNT(brand) > 10
```

### Using Aggregate Class in Java - Java - all

```java
Aggregate agg = new Aggregate("group_id");
agg.setAggFun("sum(price)#max(price)");
agg.setAggFilter("create_timestamp>1423456781");
agg.setRange("10~50");
agg.setMaxGroup("500");
// Use agg in SearchParams or query builder
```

### Aggregation with Distinct Count - JavaScript - all

```javascript
query=default:'product'&&aggregate=group_key:category,agg_fun:distinct_count(user_id)
```

### GROUP BY with AVG - SQL - all

```sql
SELECT brand, AVG(price) FROM phone WHERE nid < 8 GROUP BY brand
```

## Response Format

```json
{
    "status": "OK",
    "result": {
        "searchtime": 0.015634,
        "total": 5,
        "num": 1,
        "viewtotal": 5,
        "items": [],
        "facet": [
            {
                "key": "group_id",
                "items": [
                    {
                        "value": 43,
                        "sum": 81,
                        "max": 20
                    },
                    {
                        "value": 63,
                        "sum": 91,
                        "max": 50
                    }
                ]
            },
            {
                "key": "company_id",
                "items": [
                    {
                        "value": 13,
                        "count": 4
                    },
                    {
                        "value": 10,
                        "count": 1
                    }
                ]
            }
        ]
    },
    "errors": [],
    "tracer": ""
}
```

**Key Fields**:
- `status` — Overall request status ("OK" indicates success)
- `result.facet[].key` — Name of the grouped field
- `result.facet[].items[].value` — Value of the group key
- `result.facet[].items[].count` — Number of documents in the group
- `result.facet[].items[].sum` — Sum of specified field values in the group
- `result.facet[].items[].max` — Maximum field value in the group

## Pricing & Billing

### Billing Model
Billing is based on **per-request** usage. Each search query that includes aggregation logic counts as one request.

### Usage Limits
The system can return accurate statistics for up to 100,000 documents. If the number of matching documents exceeds 100,000, returned statistics may be inaccurate due to engine performance limits. For exclusive clusters, accurate statistics require enabling the `enable_accurate_statistics` parameter in the `kvpairs` clause. The `distinct_count` feature is supported only in exclusive clusters.

### Billing Notes
Async tasks or large-scale aggregations may incur higher resource usage. Exclusive cluster features (like accurate distinct_count) may have additional cost implications.

## FAQ

Q: What field types can be used in group_key?
A: Only attribute fields of type INT, LITERAL, INT_ARRAY, or LITERAL_ARRAY are supported. The field must be explicitly configured as an attribute in your OpenSearch application schema.

Q: Why are my aggregation results inaccurate for large datasets?
A: By default, OpenSearch uses sampling for datasets larger than 100,000 documents to maintain performance. For accurate results on large datasets, use an exclusive cluster and set `enable_accurate_statistics=true` in your query's `kvpairs` clause.

Q: Can I use GROUP BY with non-SQL queries?
A: Yes. The `aggregate` clause provides equivalent functionality to GROUP BY in the standard search API (non-SQL mode). Use `aggregate=group_key:field,agg_fun:sum(x)` instead of SQL syntax if you're not using the SQL endpoint.

Q: How do I filter documents before aggregation?
A: Use the `agg_filter` parameter in the aggregate clause (e.g., `agg_filter:timestamp>1700000000`) or include a WHERE clause in your SQL query (e.g., `WHERE timestamp > 1700000000`).

Q: Is distinct_count supported in all clusters?
A: No. The `distinct_count(id)` function is only available in exclusive clusters. Shared clusters do not support this operation.