# bailian-multimodal

Part of **BAILIAN**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Extract and understand information from documents and images](../../intent/bailian-extract-documents/SKILL.md). If you're unsure which path to take, check the routing skill first.

# Bailian Multimodal Understanding and Interaction

## Capabilities Overview

| Sub-capability | Models | API Pattern | Description |
|----------------|--------|-------------|-------------|
| Visual Reasoning | qvq-max, qwen3-vl-plus, qwen3.6-plus + 24 more | OpenAI Compatible (Streaming) | Solve complex visual problems and answer questions requiring step-by-step image analysis. |
| Image and Video Understanding | qwen3.6-plus, qwen-vl-max, qwen3-vl-plus, and others | OpenAI Compatible | Analyze visual content, describe scenes, and answer questions about images and videos. |
| Audio Understanding | qwen3-omni-30b-a3b-captioner, qwen3-omni-captioner | OpenAI Compatible | Transcribe, caption, and analyze audio files and speech. |
| Text Extraction (OCR) | qwen-vl-ocr-latest, qwen-vl-ocr-2025-11-20 + 3 more | Synchronous | Extract structured text and layout information from images and scanned documents. |
| Document Data Mining | qwen-doc-turbo | OpenAI Compatible (Streaming) | Extract structured data, tables, and specific fields from complex PDF and document files. |
| GUI Automation | gui-plus, gui-plus-2026-02-26 | OpenAI Compatible | Understand user interfaces and generate automation actions based on UI screenshots. |
| GUI Interaction | gui-plus-2026-02-26, gui-plus | OpenAI Compatible | Automate and interact with graphical user interfaces using vision-based models. |

## Model Selection Guide

### Visual Reasoning

| Model ID | API Pattern |
|----------|-------------|
| qwen3.6-plus | OpenAI Compatible (Streaming) |
| qwen3.6-plus-2026-04-02 | OpenAI Compatible (Streaming) |
| qwen3.6-flash | OpenAI Compatible (Streaming) |
| qwen3.6-flash-2026-04-16 | OpenAI Compatible (Streaming) |
| qwen3.6-35b-a3b | OpenAI Compatible (Streaming) |
| qwen3.5-plus | OpenAI Compatible (Streaming) |
| qwen3.5-plus-2026-02-15 | OpenAI Compatible (Streaming) |
| qwen3.5-flash | OpenAI Compatible (Streaming) |
| qwen3.5-flash-2026-02-23 | OpenAI Compatible (Streaming) |
| qwen3.5-397b-a17b | OpenAI Compatible (Streaming) |
| qwen3.5-122b-a10b | OpenAI Compatible (Streaming) |
| qwen3.5-27b | OpenAI Compatible (Streaming) |
| qwen3.5-35b-a3b | OpenAI Compatible (Streaming) |
| qwen3-vl-plus | OpenAI Compatible (Streaming) |
| qwen3-vl-plus-2025-12-19 | OpenAI Compatible (Streaming) |
| qwen3-vl-plus-2025-09-23 | OpenAI Compatible (Streaming) |
| qwen3-vl-flash | OpenAI Compatible (Streaming) |
| qwen3-vl-flash-2025-10-15 | OpenAI Compatible (Streaming) |
| qwen3-vl-235b-a22b-thinking | OpenAI Compatible (Streaming) |
| qwen3-vl-32b-thinking | OpenAI Compatible (Streaming) |
| qwen3-vl-30b-a3b-thinking | OpenAI Compatible (Streaming) |
| qwen3-vl-8b-thinking | OpenAI Compatible (Streaming) |
| qvq-max | OpenAI Compatible (Streaming) |
| qvq-plus | OpenAI Compatible (Streaming) |
| kimi-k2.6 | OpenAI Compatible (Streaming) |
| kimi-k2.5 | OpenAI Compatible (Streaming) |
| qwen3-vl | OpenAI Compatible (Streaming) |

### Image and Video Understanding

| Model ID | API Pattern |
|----------|-------------|
| qwen3.6-plus | OpenAI Compatible |
| qwen3.6-flash | OpenAI Compatible |
| qwen3.6-35b-a3b | OpenAI Compatible |
| qwen3.5-plus | OpenAI Compatible |
| qwen3.5-flash | OpenAI Compatible |
| qwen3.5-397b-a17b | OpenAI Compatible |
| qwen3.5-122b-a10b | OpenAI Compatible |
| qwen3.5-27b | OpenAI Compatible |
| qwen3.5-35b-a3b | OpenAI Compatible |
| qwen3-vl-plus | OpenAI Compatible |
| qwen3-vl-flash | OpenAI Compatible |
| qwen3-vl-235b-a22b-thinking | OpenAI Compatible |
| qwen3-vl-235b-a22b-instruct | OpenAI Compatible |
| qwen3-vl-32b-thinking | OpenAI Compatible |
| qwen3-vl-32b-instruct | OpenAI Compatible |
| qwen3-vl-30b-a3b-thinking | OpenAI Compatible |
| qwen3-vl-30b-a3b-instruct | OpenAI Compatible |
| qwen3-vl-8b-thinking | OpenAI Compatible |
| qwen3-vl-8b-instruct | OpenAI Compatible |
| qwen-vl-max | OpenAI Compatible |
| qwen-vl-max-latest | OpenAI Compatible |
| qwen-vl-max-2025-08-13 | OpenAI Compatible |
| qwen-vl-plus-latest | OpenAI Compatible |
| qwen-vl-plus-2025-08-15 | OpenAI Compatible |
| qwen-vl-plus-2025-07-10 | OpenAI Compatible |
| qwen-vl-max-2025-04-02 | OpenAI Compatible |
| qwen-vl-max-2025-01-25 | OpenAI Compatible |
| qwen-vl-max-2024-12-30 | OpenAI Compatible |
| qwen-vl-max-2024-11-19 | OpenAI Compatible |
| qwen-vl-max-2024-10-30 | OpenAI Compatible |
| qwen-vl-max-2024-08-09 | OpenAI Compatible |
| qwen-vl-plus-2025-05-07 | OpenAI Compatible |
| qwen-vl-plus-2025-01-25 | OpenAI Compatible |
| qwen-vl-plus-2025-01-02 | OpenAI Compatible |
| qwen-vl-plus-2024-08-09 | OpenAI Compatible |
| qwen2.5-vl-3b-instruct | OpenAI Compatible |
| qwen2.5-vl-7b-instruct | OpenAI Compatible |
| qwen2.5-vl-32b-instruct | OpenAI Compatible |
| qwen2.5-vl-72b-instruct | OpenAI Compatible |
| qvq-max | OpenAI Compatible |
| qvq-max-latest | OpenAI Compatible |
| qvq-max-2025-05-15 | OpenAI Compatible |
| qvq-max-2025-03-25 | OpenAI Compatible |
| qvq-plus | OpenAI Compatible |
| qvq-plus-latest | OpenAI Compatible |
| qvq-plus-2025-05-15 | OpenAI Compatible |
| qwen2-vl | OpenAI Compatible |
| qwen2.5-vl | OpenAI Compatible |
| qwen3-vl | OpenAI Compatible |
| qwen3-vl-flash-2025-10-15 | OpenAI Compatible |
| qwen3-vl-plus-2025-09-23 | OpenAI Compatible |
| qwen3.5-120b-a10b | OpenAI Compatible |
| qwen3.6-plus-2026-02-15 | OpenAI Compatible |
| qwen3.6-plus-2026-02-23 | OpenAI Compatible |

### Audio Understanding

| Model ID | API Pattern |
|----------|-------------|
| qwen3-omni-30b-a3b-captioner | OpenAI Compatible |
| qwen3-omni-captioner | OpenAI Compatible |

### Text Extraction (OCR)

| Model ID | API Pattern |
|----------|-------------|
| qwen-vl-ocr-latest | Synchronous |
| qwen-vl-ocr-2025-11-20 | Synchronous |
| qwen-vl-ocr-2025-08-28 | Synchronous |
| qwen-vl-ocr-2024-10-28 | Synchronous |
| qwen-vl-ocr | Synchronous |

### Document Data Mining

| Model ID | API Pattern |
|----------|-------------|
| qwen-doc-turbo | OpenAI Compatible (Streaming) |

### GUI Automation & Interaction

| Model ID | API Pattern |
|----------|-------------|
| gui-plus | OpenAI Compatible |
| gui-plus-2026-02-26 | OpenAI Compatible |

## API Calling Modes

### Authentication
The primary and recommended authentication method is the **Bearer Token**.
- **Header format**: `Authorization: Bearer $DASHSCOPE_API_KEY`
- **Environment variable**: `DASHSCOPE_API_KEY`
- Obtain your API key from the Model Studio console. Note that API keys are region-specific.

### Service Endpoints
The APIs use region-specific base URLs. 

**OpenAI-Compatible Endpoints:**
- China (Beijing): `https://dashscope.aliyuncs.com/compatible-mode/v1`
- Singapore (International): `https://dashscope-intl.aliyuncs.com/compatible-mode/v1`
- US (Virginia): `https://dashscope-us.aliyuncs.com/compatible-mode/v1`

**DashScope Native Endpoints:**
- China (Beijing): `https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation`
- Singapore (International): `https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation`
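
Because API keys and base URLs are both region-specific, it can help to resolve the base URL from one place. The following is a minimal sketch: the URLs are the OpenAI-compatible endpoints listed above, while the region keys (`cn-beijing`, `ap-singapore`, `us-virginia`) and the names `OPENAI_COMPAT_BASE_URLS` and `base_url_for` are illustrative assumptions, not identifiers defined by the service.

```python
# OpenAI-compatible base URLs copied from the endpoint list above.
# The region keys are hypothetical labels chosen for this sketch.
OPENAI_COMPAT_BASE_URLS = {
    "cn-beijing": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "ap-singapore": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "us-virginia": "https://dashscope-us.aliyuncs.com/compatible-mode/v1",
}


def base_url_for(region: str) -> str:
    """Return the OpenAI-compatible base URL for a region key."""
    try:
        return OPENAI_COMPAT_BASE_URLS[region]
    except KeyError:
        known = ", ".join(sorted(OPENAI_COMPAT_BASE_URLS))
        raise ValueError(f"Unknown region {region!r}; expected one of: {known}") from None
```

The result would be passed as `base_url=` when constructing the `OpenAI` client, together with an API key issued for the same region.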

### OpenAI Compatible (Streaming)
Used primarily for Visual Reasoning and Document Data Mining.
1. Initialize the standard OpenAI SDK client with the Bailian base URL and API key.
2. Set `stream=True` in the `chat.completions.create` call.
3. Iterate over the returned chunks. For reasoning models, chunks will contain `delta.reasoning_content` followed by `delta.content`.
4. To receive token usage in the final chunk, pass `stream_options={"include_usage": True}`.
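
The chunk-handling steps above can be factored into a small helper. This is a sketch under stated assumptions: `collect_stream` and `_fake_chunk` are hypothetical names, and the demo uses stand-in objects shaped like the streaming chunks described above rather than a live API call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split streamed chunks into (reasoning_text, answer_text, usage)."""
    reasoning, answer, usage = [], [], None
    for chunk in chunks:
        if not chunk.choices:
            # With include_usage enabled, the final chunk has no choices, only usage.
            usage = getattr(chunk, "usage", None)
            continue
        delta = chunk.choices[0].delta
        thought = getattr(delta, "reasoning_content", None)
        if thought:
            reasoning.append(thought)
        text = getattr(delta, "content", None)
        if text:
            answer.append(text)
    return "".join(reasoning), "".join(answer), usage


def _fake_chunk(reasoning=None, content=None):
    """Build a stand-in chunk with the same attribute layout as a real one."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


demo = [
    _fake_chunk(reasoning="Compare the two areas. "),
    _fake_chunk(content="The answer "),
    _fake_chunk(content="is 42."),
    SimpleNamespace(choices=[], usage=SimpleNamespace(total_tokens=30)),
]
thinking, final, usage = collect_stream(demo)
```

In real use, the iterable returned by `chat.completions.create(..., stream=True)` would be passed in place of `demo`.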

### OpenAI Compatible (Synchronous)
Used for Image/Video Understanding, Audio Understanding, OCR, and GUI Automation.
1. Initialize the OpenAI SDK client.
2. Call `chat.completions.create` without `stream=True` (or with `stream=False`).
3. Parse the full JSON response containing `choices[0].message.content`.

### DashScope Native API
An alternative to the OpenAI-compatible mode, using the `dashscope` SDK or direct HTTP POST.
- **SDK**: Use `dashscope.MultiModalConversation.call()` or `dashscope.Generation.call()`.
- **HTTP**: Send a POST request to the native endpoint. For streaming, include the header `X-DashScope-SSE: enable`. The request body uses an `input.messages` structure instead of the OpenAI `messages` structure.
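
As a rough illustration of the HTTP variant, the sketch below only assembles the URL, headers, and body without sending anything. `build_native_request` is a hypothetical helper; the `{"text": ...}` content item in the usage note is an assumption about the native message shape, so confirm the exact item format against the native API reference.

```python
import json
import os

# Native endpoint for China (Beijing), taken from the Service Endpoints list.
NATIVE_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"


def build_native_request(model, messages, stream=False, api_key=None):
    """Return (url, headers, body) for a native-mode POST; nothing is sent."""
    api_key = api_key or os.getenv("DASHSCOPE_API_KEY", "")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if stream:
        # Streaming over HTTP is requested through this header.
        headers["X-DashScope-SSE"] = "enable"
    # Native mode nests the conversation under input.messages.
    body = {"model": model, "input": {"messages": messages}}
    return NATIVE_URL, headers, json.dumps(body)
```

For example, `build_native_request("qwen3-vl-plus", [{"role": "user", "content": [{"text": "Describe the image."}]}], stream=True)` returns a header set that includes `X-DashScope-SSE: enable`.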

## Parameter Reference

### Visual Reasoning & Image/Video Understanding

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | See Model Selection Guide | The model ID to use for the request. |
| messages | array | Yes | - | Array of message objects | List of messages in the conversation. Content can include text, image_url, or video_url. |
| stream | boolean | No | false | true / false | Whether to stream the response. |
| enable_thinking | boolean | No | - | true / false | Enables or disables the thinking process. For hybrid-thinking models, set to true to enable. |
| thinking_budget | integer | No | - | max 81920 | Maximum number of tokens for the thinking process. |
| extra_body | object | No | - | - | Additional parameters passed to the API (e.g., enable_thinking, thinking_budget). |
| vl_high_resolution_images | boolean | No | false | true / false | When true, disables max_pixels limit for high-resolution image processing. |
| fps | number | No | 2.0 | 0.1 - 10 | Frame extraction frequency in frames per second for videos. |
| max_frames | integer | No | - | - | Maximum number of frames to extract from a video (DashScope SDK only). |
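
With the OpenAI SDK, `enable_thinking` and `thinking_budget` travel inside `extra_body`. A minimal sketch of building that dictionary follows; `thinking_extra_body` is a hypothetical helper, and rejecting non-positive budgets is an assumption (the table only states the 81920 maximum).

```python
MAX_THINKING_BUDGET = 81920  # upper bound from the parameter table


def thinking_extra_body(enable_thinking, thinking_budget=None):
    """Build the extra_body dict that carries the thinking parameters."""
    extra = {"enable_thinking": bool(enable_thinking)}
    if thinking_budget is not None:
        # Assumption: a budget must be positive; the table only gives the maximum.
        if not 0 < thinking_budget <= MAX_THINKING_BUDGET:
            raise ValueError(
                f"thinking_budget must be between 1 and {MAX_THINKING_BUDGET}"
            )
        extra["thinking_budget"] = int(thinking_budget)
    return extra


extra_body = thinking_extra_body(True, thinking_budget=4096)
```

The resulting dictionary would be passed as `extra_body=extra_body` to `client.chat.completions.create`, alongside `stream=True` for reasoning models.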

### Audio Understanding

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | qwen3-omni-30b-a3b-captioner, qwen3-omni-captioner | The model ID to use for the request. |
| messages | array | Yes | - | Only one user message allowed | List of messages. Content must contain audio input via `input_audio` or `audio` type. |
| stream | boolean | No | false | true / false | If true, enables streaming output. |
| stream_options | object | No | - | - | Options for streaming, e.g., `{"include_usage": true}`. |
| incremental_output | boolean | No | false | true / false | When true, each chunk contains only new content (DashScope SDK/HTTP only). |

### Text Extraction (OCR)

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | qwen-vl-ocr-latest, etc. | The model ID to use for the OCR task. |
| messages | array | Yes | - | - | List of message objects containing user input with image and prompt. |
| ocr_options | object | No | - | - | Configuration for built-in OCR tasks. |
| task | string | No | - | text_recognition, key_information_extraction, table_parsing, document_parsing, formula_recognition, multi_lan, advanced_recognition | Specifies the built-in OCR task to perform. |
| result_schema | object | No | - | Max 3 nested layers | Schema defining the fields to extract in key_information_extraction mode. |
| min_pixels | integer | No | 3072 | - | Minimum pixel threshold for image scaling. Set on the image item in `messages`. |
| max_pixels | integer | No | 8388608 | - | Maximum pixel threshold for image scaling. Set on the image item in `messages`. |
| enable_rotate | boolean | No | false | true / false | Whether to enable automatic image rotation correction. |
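
The three-layer limit on `result_schema` can be checked locally before a request is sent. The sketch below assumes that a flat dictionary counts as one layer and that lists do not add a layer of their own; both conventions, the helper names, and the sample field names are illustrative assumptions.

```python
MAX_SCHEMA_DEPTH = 3  # "Max 3 nested layers" from the parameter table


def schema_depth(node):
    """Count nested dict layers; a flat dict has depth 1."""
    if isinstance(node, dict):
        return 1 + max((schema_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        # Assumption: a list is transparent and only its items contribute depth.
        return max((schema_depth(v) for v in node), default=0)
    return 0


def validate_result_schema(schema):
    """Raise ValueError if the schema nests deeper than the documented limit."""
    depth = schema_depth(schema)
    if depth > MAX_SCHEMA_DEPTH:
        raise ValueError(f"result_schema has {depth} layers; maximum is {MAX_SCHEMA_DEPTH}")
    return schema


# Hypothetical schema for a train ticket; empty strings mark fields to extract.
ticket_schema = validate_result_schema(
    {"Train Number": "", "Passenger": {"Name": "", "ID Card Number": ""}}
)
```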

### Document Data Mining

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | qwen-doc-turbo | The model to use for the request. |
| messages | array | Yes | - | Max 262,144 tokens total | List of messages. Can include `doc_url` or `fileid://` references. |
| stream | boolean | No | false | Must be true for PPT generation | If true, the response will be streamed. |
| skill | array | No | - | Streaming mode only | Specifies additional capabilities like PPT generation. |
| file_parsing_strategy | string | No | auto | auto / text_only / text_and_images | How to parse the document when using doc_url. |

### GUI Automation & Interaction

| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | gui-plus, gui-plus-2026-02-26 | The model to use for the request. |
| messages | array | Yes | - | Must include system prompt | List of messages, including system prompt, user input with image, and previous assistant responses. |
| vl_high_resolution_images | boolean | No | false | true / false | Enable high-resolution image processing for better accuracy on detailed screenshots. |
| temperature | float | No | 0.01 | [0, 2) | Sampling temperature. Recommend setting only one of temperature or top_p. |
| top_p | float | No | 0.01 | (0, 1.0] | Nucleus sampling threshold. |
| top_k | integer | No | 1 | >= 0 | Size of the candidate set for sampling. |
| presence_penalty | float | No | 1.5 | [-2.0, 2.0] | Controls content repetition. Positive values reduce repetition. |

## Code Examples

### Visual Reasoning - Python - China

```python
from openai import OpenAI
import os

# Initialize the OpenAI client
client = OpenAI(
    # If not configured, replace with: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Accumulates the full thinking process
answer_content = ""     # Accumulates the full response
is_answering = False    # Becomes True once the thinking phase ends and the answer starts

# Create a chat completion request
completion = client.chat.completions.create(
    model="qvq-max",  # Example uses qvq-max. Replace with other model names as needed.
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                    },
                },
                {"type": "text", "text": "How do I solve this problem?"},
            ],
        },
    ],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "Thinking process" + "=" * 20 + "\n")

for chunk in completion:
    # With include_usage enabled, the final chunk has no choices and carries only usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
        continue

    delta = chunk.choices[0].delta
    # Print the thinking process as it streams in
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end='', flush=True)
        reasoning_content += delta.reasoning_content
    # Skip empty or None content (for example, role-only or final chunks)
    elif delta.content:
        # Print the header once, when the answer starts
        if not is_answering:
            print("\n" + "=" * 20 + "Full response" + "=" * 20 + "\n")
            is_answering = True
        print(delta.content, end='', flush=True)
        answer_content += delta.content
```

### Image Understanding - Python - China

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
                {"type": "text", "text": "What is shown in the picture?"}
            ]
        }
    ]
)
print(completion.choices[0].message.content)
```

### Audio Understanding - Python - China

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                    }
                }
            ]
        }
    ]
)
print(completion.choices[0].message.content)
```

### Text Extraction (OCR) - Python - China

```python
from openai import OpenAI
import os

PROMPT_TICKET_EXTRACTION = """Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"""

try:
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-ocr-latest",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                        "min_pixels": 32 * 32 * 3,
                        "max_pixels": 32 * 32 * 8192
                    },
                    {"type": "text", "text": PROMPT_TICKET_EXTRACTION}
                ]
            }
        ])
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error message: {e}")
```

### Document Data Mining - Python - China

```python
import os
import dashscope

try:
    # Keep the call inside the try block so network and SDK errors are caught too
    response = dashscope.Generation.call(
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model='qwen-doc-turbo',
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "From these two product manuals, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."
                    },
                    {
                        "type": "doc_url",
                        "doc_url": [
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/jockge/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CA.docx",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/ztwxzr/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CB.docx"
                        ],
                        "file_parsing_strategy": "auto"
                    }
                ]
            }
        ]
    )
    # A non-200 status means the request was rejected; print the service's error details
    if response.status_code == 200:
        print(response.output.choices[0].message.content)
    else:
        print(f"Request failed, status code: {response.status_code}")
        print(f"Error code: {response.code}")
        print(f"Error message: {response.message}")
except Exception as e:
    print(f"An error occurred: {e}")
```

### GUI Automation - Python - China

```python
import os
from openai import OpenAI

system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.", "parameters": {"properties": {"action": {"description": "The action to perform.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "coordinate": {"description": "(x, y) coordinates.", "type": "array"}, "text": {"type": "string"}, "time": {"type": "number"}, "status": {"type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""

messages = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"}},
            {"type": "text", "text": "Open the browser for me"}
        ]
    }
]

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="gui-plus-2026-02-26",
    messages=messages,
    extra_body={"vl_high_resolution_images": True}
)

print(completion.choices[0].message.content)
```

## Response Format

### Visual Reasoning (Streaming Chunk)

```json
{
  "choices": [
    {
      "delta": {
        "content": "",
        "role": "assistant",
        "reasoning_content": "Okay, I need to solve this problem about the surface area..."
      },
      "index": 0,
      "logprobs": null,
      "finish_reason": null
    }
  ],
  "object": "chat.completion.chunk",
  "usage": null,
  "created": 1742983020,
  "model": "qvq-max",
  "id": "chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"
}
```

**Key Fields**:
- `choices[].delta.reasoning_content` — The step-by-step thinking process generated by the model.
- `choices[].delta.content` — The final answer content.
- `usage.total_tokens` — Total tokens consumed (available in the final chunk if `include_usage` is enabled).

### Image and Video Understanding (Synchronous)

```json
{
  "choices": [
    {
      "message": {
        "content": "This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand...",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "model": "qwen3.6-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
```

**Key Fields**:
- `choices[].message.content` — The text description or answer generated by the model.
- `usage.prompt_tokens` — Number of tokens in the input (including image tokens).
- `usage.completion_tokens` — Number of tokens in the output.

### Text Extraction (OCR) (DashScope Native)

```json
{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "```json\n{\n    \"Invoice Number\": \"24329116804000\",\n    \"Train Number\": \"G1948\"\n}\n```"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "total_tokens": 765,
    "input_tokens": 606,
    "output_tokens": 159,
    "image_tokens": 427
  },
  "request_id": "b3ca3bbb-2bdd-9367-90bd-f3f39e480db0"
}
```

**Key Fields**:
- `output.choices[].message.content[].text` — The extracted text or structured JSON data.
- `usage.image_tokens` — Number of tokens consumed specifically by the image input.
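
In the sample above, the extracted data arrives as a JSON document wrapped in a Markdown code fence inside the `text` field. One way to recover a Python dictionary is sketched below; `parse_fenced_json` is a hypothetical helper, and it assumes the model returned valid JSON (with or without the fence).

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


def parse_fenced_json(text):
    """Strip an optional Markdown code fence and parse the JSON inside it."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Same text as the sample response's content[0].text field.
sample = (
    FENCE
    + 'json\n{\n    "Invoice Number": "24329116804000",\n    "Train Number": "G1948"\n}\n'
    + FENCE
)
fields = parse_fenced_json(sample)
```

`json.loads` raises `json.JSONDecodeError` if the model output is not valid JSON, so callers may want to catch that and fall back to the raw text.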

## Error Handling

| Code | Description | Recommended Action |
|------|-------------|--------------------|
| 400 | Bad Request - Invalid request format, missing required parameters, or malformed JSON. | Check request body and parameter constraints. |
| 401 | Unauthorized - Invalid or missing API key. | Ensure `DASHSCOPE_API_KEY` is set correctly and matches the region. |
| 403 | Forbidden - Access denied due to insufficient permissions or region restrictions. | Verify account permissions and region availability. |
| 429 | Too Many Requests - Rate limit exceeded. | Implement exponential backoff and retry logic. |
| 500 | Internal Server Error - Temporary server issue. | Retry after a short delay. |
| 502 | Bad Gateway - Failed to forward request to backend service. | Retry later. |
| 503 | Service Unavailable - Service temporarily unavailable. | Retry after a delay. |
| 504 | Gateway Timeout - Request timed out. | Reduce input size or increase client timeout. |
| InvalidParameter | One or more parameters are invalid. | Check parameter constraints and values. |
| Throttling | Request throttled due to high load. | Reduce request frequency or upgrade your plan. |
| File parsing in progress | The file is still being parsed (Document Data Mining). | Retry after a delay (e.g., 2 seconds). |

### Rate Limits & Retry
- **Standard Limit**: 100 QPS per model, with a maximum of 10 concurrent requests per model.
- **Retry Strategy**: For 429 and 5xx errors, implement an exponential backoff strategy. Check for the `Retry-After` header if provided.
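
The retry strategy can be sketched as a small wrapper. This is an illustration, not the service's prescribed client logic: `call_with_backoff` is a hypothetical helper, it assumes the raised exception exposes an HTTP `status_code` attribute (as the OpenAI SDK's status errors do), and it does not read the `Retry-After` header.

```python
import time


def is_retryable(status):
    """429 and 5xx responses are worth retrying; everything else is not."""
    return status == 429 or (status is not None and 500 <= status < 600)


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying retryable failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            last_attempt = attempt == max_attempts - 1
            if not is_retryable(status) or last_attempt:
                raise
            # Delays grow as base_delay * 1, 2, 4, ... and are capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

A request would be wrapped as `call_with_backoff(lambda: client.chat.completions.create(...))`. Adding random jitter to each delay is a common refinement when many clients retry at once.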

## Requirements

- **Python SDK**: `pip install "openai>=1.0.0"` (for OpenAI-compatible mode) or `pip install "dashscope>=1.14.0"` (for native mode). Quote the version specifier so the shell does not treat `>` as a redirect.
- **Java SDK**: DashScope SDK version `>= 2.19.0` (for visual reasoning) or `>= 2.21.8` (for OCR).
- **Node.js SDK**: `npm install openai` (standard OpenAI library).
- **Environment Variable**: `export DASHSCOPE_API_KEY=your_api_key`

## FAQ

**Q: How do I enable the thinking process for visual reasoning models?**
A: For hybrid-thinking models (like qwen3.6-plus), pass `extra_body={"enable_thinking": True}` in the OpenAI SDK. For models with a `-thinking` suffix, thinking is enabled by default and cannot be disabled. The thinking process will be returned in the `reasoning_content` field of the streaming chunks.

**Q: How are image tokens calculated?**
A: Image tokens are calculated based on the image resolution after smart resizing. For OCR and GUI tasks, you can control the resolution using `min_pixels` and `max_pixels`. Enabling `vl_high_resolution_images=True` bypasses the `max_pixels` limit for finer detail, which will increase token consumption.

**Q: Can I use the standard OpenAI SDK for these multimodal models?**
A: Yes. Most models support the OpenAI-compatible interface. Simply initialize the OpenAI client with your Bailian API key and the region-specific `base_url` (e.g., `https://dashscope.aliyuncs.com/compatible-mode/v1`).

**Q: How do I handle high-resolution images in OCR or GUI tasks?**
A: Pass `vl_high_resolution_images=True` in the `extra_body` (OpenAI SDK) or as a top-level parameter (DashScope SDK). This disables the default `max_pixels` limit and applies a fixed high-resolution policy, improving accuracy for dense text or small UI elements.

**Q: What is the token conversion rule for Audio Understanding?**
A: For the Qwen3-Omni-Captioner model, the total tokens are calculated as: Audio duration (in seconds) × 12.5. If the audio is less than 1 second, it is counted as 1 second.
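
The conversion rule can be written as a one-line calculation. The sketch below applies the stated formula directly; `audio_tokens` is a hypothetical helper, and whether the service rounds the result is not stated here, so the function returns the unrounded value.

```python
TOKENS_PER_SECOND = 12.5  # conversion factor stated in the answer above


def audio_tokens(duration_seconds: float) -> float:
    """Estimate audio tokens; clips shorter than 1 second count as 1 second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return max(duration_seconds, 1.0) * TOKENS_PER_SECOND


one_minute = audio_tokens(60)    # 60 s * 12.5 = 750.0
short_clip = audio_tokens(0.4)   # billed as 1 s = 12.5
```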

## Pricing & Billing

### Billing Model
All multimodal models use a **per-token** billing model. Input tokens (including text, image, video, and audio tokens) and output tokens are priced separately.

### Price Reference

| Model / Tier | Input Price | Output Price |
|--------------|-------------|--------------|
| qwen3.6-plus | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen3.6-flash | CNY 0.001 / 1K tokens | CNY 0.002 / 1K tokens |
| qwen3.6-35b-a3b | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3.5-plus | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen3.5-flash | CNY 0.001 / 1K tokens | CNY 0.002 / 1K tokens |
| qwen3.5-397b-a17b | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qwen3.5-122b-a10b | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3.5-27b | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen3.5-35b-a3b | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3-vl-plus | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3-vl-flash | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen3-vl-235b-a22b-thinking | CNY 0.005 / 1K tokens | CNY 0.010 / 1K tokens |
| qwen3-vl-235b-a22b-instruct | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qwen3-vl-32b-thinking | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qwen3-vl-32b-instruct | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3-vl-30b-a3b-thinking | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qwen3-vl-30b-a3b-instruct | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3-vl-8b-thinking | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen3-vl-8b-instruct | CNY 0.001 / 1K tokens | CNY 0.002 / 1K tokens |
| qwen-vl-max | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qwen-vl-plus-latest | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen2.5-vl-3b-instruct | CNY 0.001 / 1K tokens | CNY 0.002 / 1K tokens |
| qwen2.5-vl-7b-instruct | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen2.5-vl-32b-instruct | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qwen2.5-vl-72b-instruct | CNY 0.006 / 1K tokens | CNY 0.012 / 1K tokens |
| qvq-max | CNY 0.004 / 1K tokens | CNY 0.008 / 1K tokens |
| qvq-plus | CNY 0.003 / 1K tokens | CNY 0.006 / 1K tokens |
| qwen3-omni-30b-a3b-captioner | CNY 0.002 / 1K tokens | CNY 0.002 / 1K tokens |
| qwen-vl-ocr-latest | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |
| qwen-vl-ocr-2025-11-20 | CNY 0.0025 / 1K tokens | CNY 0.005 / 1K tokens |
| qwen-doc-turbo | CNY 0.0006 / 1K tokens | CNY 0.001 / 1K tokens |
| gui-plus | CNY 0.0015 / 1K tokens | CNY 0.0045 / 1K tokens |
| gui-plus-2026-02-26 | CNY 0.002 / 1K tokens | CNY 0.004 / 1K tokens |

*(Note: Dated snapshots of models like qwen-vl-max-2025-08-13 share the same pricing as their base model family.)*
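
To turn a response's `usage` block into an approximate cost, multiply each token count by the matching per-1K price. The sketch below copies a few rows from the table above; `PRICES_CNY_PER_1K` and `estimate_cost_cny` are hypothetical names, and the figures are only as current as this table.

```python
# (input price, output price) in CNY per 1K tokens, copied from the table above.
PRICES_CNY_PER_1K = {
    "qwen3.6-plus": (0.002, 0.004),
    "qwen3.6-flash": (0.001, 0.002),
    "qwen-vl-ocr-latest": (0.002, 0.004),
    "qwen-doc-turbo": (0.0006, 0.001),
}


def estimate_cost_cny(model, input_tokens, output_tokens):
    """Estimate the cost of one request from its token counts."""
    input_price, output_price = PRICES_CNY_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


# Token counts taken from the sample Image and Video Understanding response.
cost = estimate_cost_cny("qwen3.6-plus", input_tokens=1270, output_tokens=54)
```

For that sample response the estimate is about CNY 0.002756 (0.00254 for input plus 0.000216 for output), before any free quota is applied.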

### Free Tier
- **Visual Reasoning & Image/Video**: 1 million tokens free per month, valid for 90 days, only in China (Beijing) or Singapore regions.
- **Audio Understanding**: 1 million tokens free per month.
- **OCR**: 1 million tokens free per month (Beijing or Singapore regions only).
- **GUI Automation**: 1 million tokens for input and output each (valid for 90 days after activating Model Studio).
- **Document Data Mining**: No free quota.

### Usage Limits
- **General**: 100 QPS per model. Total input tokens must not exceed the model's maximum input limit.
- **Video**: Maximum video duration is 2 hours for qwen3.6 series, 1 hour for qwen3-vl series, and 10 minutes for other models.
- **Audio**: Max 40 minutes of audio per request.
- **OCR**: Max 8K tokens per request, max image size 10MB.
- **Document Data Mining**: Max 253,952 input tokens per request; max 32,768 output tokens; max 9,000 tokens per message.
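
The media duration limits above can be checked before upload. This is a sketch only: `max_video_seconds` and `check_audio` are hypothetical helpers, and matching a series by model-name prefix is an assumption about how model IDs map to series.

```python
MAX_AUDIO_SECONDS = 40 * 60  # 40 minutes of audio per request


def max_video_seconds(model: str) -> int:
    """Return the maximum video duration in seconds for a model ID."""
    # Assumption: the series is identified by the model-name prefix.
    if model.startswith("qwen3.6"):
        return 2 * 60 * 60   # 2 hours
    if model.startswith("qwen3-vl"):
        return 60 * 60       # 1 hour
    return 10 * 60           # 10 minutes for other models


def check_audio(duration_seconds: float) -> None:
    """Raise ValueError if an audio clip exceeds the per-request limit."""
    if duration_seconds > MAX_AUDIO_SECONDS:
        raise ValueError(f"Audio is {duration_seconds:.0f}s; limit is {MAX_AUDIO_SECONDS}s")
```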

### Billing Notes
- **Thinking Process**: The reasoning chain (`reasoning_content`) is billed as output tokens. If there is no thinking output, the non-thinking mode price applies.
- **High Resolution**: Enabling high-resolution mode (`vl_high_resolution_images=true`) bypasses the `max_pixels` limit, which will increase image token consumption.
- **Audio Conversion**: Total tokens = Audio duration (in seconds) × 12.5. If less than 1 second, counted as 1 second.
- **Document Mining**: PPT generation costs are based on token usage: input_tokens (document + outline) + output_tokens (outline). Image rendering and PPT file generation are not billed.
- **GUI Image Tokens**: Image tokens are calculated using smart_resize logic. Each image dimension must be at least 10 px, and the aspect ratio must not exceed 200:1.

## Source Documents

- `Visual reasoning_5566501.xdita`
- `Image and video understanding_5169800.xdita`
- `Audio understanding Qwen3-Omni-Captioner_6100371.xdita`
- `Text extraction Qwen-OCR_5381717.xdita`
- `Data mining Qwen-Doc_5893944.xdita`
- `User interface interaction_6231423.xdita`
- `GUI-Plus A dedicated model for interface interaction_6239358.xdita`