# bailian-asr

Part of **BAILIAN**

<!-- intent-backlink:auto -->

> 💡 **Path Selection**: This skill is one implementation path for [Transcribe, recognize, and translate speech audio](../../intent/bailian-transcribe-speech/SKILL.md). If you're unsure which path to take, check the routing skill first.

# Alibaba Cloud Model Studio Speech and Audio Processing

## Capabilities Overview

| Sub-capability | Models | API Pattern | Description |
|----------------|--------|-------------|-------------|
| Real-Time Speech Recognition | qwen3-asr-flash-realtime, fun-asr-realtime, paraformer-realtime-v2, + 8 more | WebSocket | Transcribe live audio streams to text with low latency. |
| Audio File Transcription | qwen3-asr-flash, fun-asr, paraformer-v2, sensevoice-v1, + 5 more | Async Task / OpenAI Compatible | Transcribe pre-recorded audio files asynchronously. |
| Speech Translation | gummy-realtime-v1, gummy-chat-v1 | WebSocket / Streaming | Perform real-time speech recognition and translation simultaneously. |
| Standard Text-to-Speech | cosyvoice-v3-flash, qwen3-tts-flash, MiniMax/speech-2.8-hd, + 5 more | Synchronous | Generate complete audio files from text. |
| Real-Time Speech Synthesis | cosyvoice-v3-flash, qwen3-tts-flash-realtime, sambert-zhichu-v1, + 5 more | WebSocket | Stream audio generation in real-time for interactive applications. |
| Voice Cloning and Design | qwen-voice-enrollment, voice-enrollment, MiniMax/speech-2.8-turbo | Synchronous | Create custom voices from audio samples or text descriptions. |
| Custom Vocabulary | speech-biasing | Synchronous | Manage hotwords to improve recognition accuracy for domain-specific terms. |
| Music Generation | fun-music-v1, wan2.7-music | Synchronous | Generate music tracks from text prompts or lyrics. |

## Model Selection Guide

### Real-Time Speech Recognition
| Model ID | API Pattern |
|----------|-------------|
| qwen3-asr-flash-realtime | WebSocket |
| fun-asr-realtime | WebSocket |
| paraformer-realtime-v2 | WebSocket |
| paraformer-realtime-8k-v2 | WebSocket |

### Audio File Transcription
| Model ID | API Pattern |
|----------|-------------|
| qwen3-asr-flash | OpenAI Compatible / Synchronous |
| qwen3-asr-flash-filetrans | Async Task |
| fun-asr | Async Task |
| paraformer-v2 | Async Task |
| sensevoice-v1 | Async Task |

### Text-to-Speech (Standard & Real-Time)
| Model ID | API Pattern |
|----------|-------------|
| cosyvoice-v3-flash | Synchronous / WebSocket |
| qwen3-tts-flash | Synchronous |
| qwen3-tts-flash-realtime | WebSocket |
| MiniMax/speech-2.8-hd | Synchronous |
| sambert-zhichu-v1 | WebSocket |

### Voice Cloning & Music Generation
| Model ID | API Pattern |
|----------|-------------|
| qwen-voice-enrollment | Synchronous |
| voice-enrollment | Synchronous |
| fun-music-v1 | Synchronous |

## API Calling Modes

### Authentication
Authenticate every request with a **Bearer token** in the `Authorization` header; this is the primary and recommended method.
- **Header Format**: `Authorization: Bearer $DASHSCOPE_API_KEY`
- **Environment Variable**: Store your API key in `DASHSCOPE_API_KEY`.
- Note: API keys are region-specific. A key generated for the China (Beijing) region will not work on the International (Singapore) endpoint, and vice versa.
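
As a minimal sketch of the header format above, the helper below reads the key from `DASHSCOPE_API_KEY` and builds the request headers. The function name `build_auth_headers` is hypothetical and not part of any SDK.

```python
import os


def build_auth_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Return HTTP headers that carry the API key as a Bearer token."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError(f"{env_var} is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing fast on a missing key keeps a configuration mistake from surfacing later as a `401` response.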

### Service Endpoints
Endpoints are divided by region and protocol.

**China (Beijing) Region:**
- REST API: `https://dashscope.aliyuncs.com/api/v1/...`
- WebSocket API: `wss://dashscope.aliyuncs.com/api-ws/v1/...`

**International (Singapore) Region:**
- REST API: `https://dashscope-intl.aliyuncs.com/api/v1/...`
- WebSocket API: `wss://dashscope-intl.aliyuncs.com/api-ws/v1/...`
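
One way to keep the region and protocol choice in a single place is a small lookup table. The region labels `beijing` and `singapore` and the function name `base_url` are illustrative assumptions; only the URLs come from the lists above.

```python
BASE_URLS = {
    "beijing": {
        "rest": "https://dashscope.aliyuncs.com/api/v1",
        "websocket": "wss://dashscope.aliyuncs.com/api-ws/v1",
    },
    "singapore": {
        "rest": "https://dashscope-intl.aliyuncs.com/api/v1",
        "websocket": "wss://dashscope-intl.aliyuncs.com/api-ws/v1",
    },
}


def base_url(region: str, protocol: str) -> str:
    """Look up the base URL for a region ('beijing' or 'singapore') and protocol."""
    try:
        return BASE_URLS[region][protocol]
    except KeyError:
        raise ValueError(f"unknown region/protocol: {region}/{protocol}") from None
```

Because API keys are region-specific, the same region value should drive both the endpoint lookup and the choice of key.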

### Async Task Flow (File Transcription)
Used for long audio files (Fun-ASR, Paraformer, SenseVoice).
1. **Submit Task**: Send a POST request to the transcription endpoint with the header `X-DashScope-Async: enable`.
2. **Get Task ID**: Extract the `task_id` from the response.
3. **Poll Status**: Send GET requests to `/api/v1/tasks/{task_id}` until `task_status` is `SUCCEEDED` or `FAILED`.
4. **Fetch Result**: Download the final JSON transcript from the `transcription_url` provided in the success response.
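
The polling step can be sketched with the standard library alone. This is an illustration under assumptions, not the SDK's implementation: the function names, the 2-second interval, and the 10-minute timeout are hypothetical, and the response is assumed to carry `output.task_status` as in the submission response shown later in this document.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def fetch_task(task_id: str, api_key: str,
               base: str = "https://dashscope.aliyuncs.com/api/v1") -> dict:
    """GET /api/v1/tasks/{task_id} and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base}/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_task(task_id: str, fetch: Callable[[str], dict],
                  interval: float = 2.0, timeout: float = 600.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the task reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch(task_id)
        if body["output"]["task_status"] in TERMINAL_STATUSES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
        sleep(interval)
```

In practice `fetch` would be something like `lambda tid: fetch_task(tid, api_key)`. Passing it in as a parameter keeps the loop testable without network access.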

### WebSocket Flow (Real-Time ASR / TTS)
Used for streaming audio (Qwen-ASR, CosyVoice, Paraformer-Realtime).
1. **Connect**: Open a WebSocket connection to the regional endpoint, passing the Bearer token in the headers.
2. **Initialize**: Send a `run-task` (DashScope native) or `session.update` (OpenAI-compatible) event to configure model, voice, and audio formats.
3. **Stream Data**: 
   - For ASR: Send binary audio frames or base64-encoded audio chunks.
   - For TTS: Send `continue-task` or `input_text_buffer.append` events with text chunks.
4. **Receive Events**: Listen for `result-generated` / `response.audio.delta` events containing incremental text or audio.
5. **Terminate**: Send `finish-task` or `session.finish` to gracefully close the session.
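
For the DashScope native variant of this flow, the control messages and audio chunking might look like the sketch below. The JSON field layout (`header`, `payload`, `task_group`, `streaming`, and so on) and the 100 ms chunk size are assumptions that should be checked against the WebSocket API reference; only the `run-task` and `finish-task` event names come from the steps above. The helper names are hypothetical.

```python
import json
import uuid
from typing import Iterator


def new_task_id() -> str:
    """Generate a client-side task ID shared by all events of one session."""
    return uuid.uuid4().hex


def run_task_message(task_id: str, model: str, audio_format: str, sample_rate: int) -> str:
    """Build the `run-task` event that configures the recognition session."""
    return json.dumps({
        "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {
            "task_group": "audio",
            "task": "asr",
            "function": "recognition",
            "model": model,
            "parameters": {"format": audio_format, "sample_rate": sample_rate},
            "input": {},
        },
    })


def finish_task_message(task_id: str) -> str:
    """Build the `finish-task` event that ends the session gracefully."""
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {}},
    })


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
               bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Split mono PCM audio into fixed-duration binary frames."""
    size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At 16 kHz, 16-bit mono, a 100 ms frame is 3,200 bytes. A WebSocket client would send the `run-task` text message first, then each chunk as a binary frame, then `finish-task`.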

## Parameter Reference

### Real-Time Speech Recognition (WebSocket)
| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | Must be a valid realtime model ID | The ASR model to use. |
| format | string | Yes | - | pcm / wav / mp3 / opus / speex / aac / amr | Audio format of the input stream. |
| sample_rate | integer | Yes | - | 8000 / 16000 | Sample rate in Hz. |
| vocabulary_id | string | No | - | Valid ID from Custom Vocabulary API | ID of the hotword list to improve accuracy. |
| language_hints | array | No | - | e.g., ["zh", "en"] | Hints to improve accuracy for specific languages. |

### Audio File Transcription (Async Task)
| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | e.g., fun-asr, paraformer-v2 | The transcription model. |
| file_urls | array | Yes | - | Max 100 URLs, max 2GB per file | Publicly accessible URLs of audio files. |
| channel_id | array | No | [0] | Array of integers | Audio tracks to process. |
| diarization_enabled | boolean | No | false | true / false | Enable speaker diarization. |
| language_hints | array | No | - | e.g., ["zh", "en"] | Language codes to guide recognition. |
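
The constraints in this table can be checked before a request is submitted. The sketch below builds a request body and applies the 100-URL limit. The nesting of `file_urls` under `input` and of the other options under `parameters` is an assumption to verify against the RESTful API reference, and `build_transcription_request` is a hypothetical helper.

```python
from typing import Optional, Sequence

# Header that switches the REST call to asynchronous task mode.
ASYNC_HEADERS = {"X-DashScope-Async": "enable"}


def build_transcription_request(model: str, file_urls: Sequence[str],
                                channel_id: Optional[Sequence[int]] = None,
                                diarization_enabled: bool = False,
                                language_hints: Optional[Sequence[str]] = None) -> dict:
    """Validate the documented limits and return a JSON-serializable body."""
    if not file_urls:
        raise ValueError("file_urls must contain at least one URL")
    if len(file_urls) > 100:
        raise ValueError("file_urls accepts at most 100 URLs per task")
    parameters = {
        "channel_id": list(channel_id) if channel_id else [0],
        "diarization_enabled": diarization_enabled,
    }
    if language_hints:
        parameters["language_hints"] = list(language_hints)
    return {
        "model": model,
        "input": {"file_urls": list(file_urls)},
        "parameters": parameters,
    }
```

The 2 GB per-file limit cannot be checked from the URL alone, so it is left to the service.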

### Text-to-Speech (Synchronous)
| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | e.g., cosyvoice-v3-flash | The TTS model. |
| text | string | Yes | - | Max 10,000 characters | The text to synthesize. |
| voice | string | Yes | - | Valid voice ID | The voice to use (e.g., longanyang, Cherry). |
| format | string | No | mp3 | mp3 / pcm / wav / opus | Output audio format. |
| sample_rate | integer | No | 22050 | 8000 to 48000 | Output sample rate in Hz. |
| volume | integer | No | 50 | 0 to 100 | Volume level. |
| rate | float | No | 1.0 | 0.5 to 2.0 | Speech rate multiplier. |
| pitch | float | No | 1.0 | 0.5 to 2.0 | Pitch adjustment factor. |
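
The defaults and ranges in this table map naturally onto a validated data class. The class name `TtsRequest` is hypothetical; the limits are copied from the table, and individual models may accept narrower ranges.

```python
from dataclasses import dataclass

ALLOWED_FORMATS = ("mp3", "pcm", "wav", "opus")


@dataclass(frozen=True)
class TtsRequest:
    """TTS parameters with the defaults and constraints listed in the table."""
    model: str
    text: str
    voice: str
    format: str = "mp3"
    sample_rate: int = 22050
    volume: int = 50
    rate: float = 1.0
    pitch: float = 1.0

    def __post_init__(self) -> None:
        if not self.text or len(self.text) > 10_000:
            raise ValueError("text must contain 1 to 10,000 characters")
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
        if not 8000 <= self.sample_rate <= 48000:
            raise ValueError("sample_rate must be between 8000 and 48000 Hz")
        if not 0 <= self.volume <= 100:
            raise ValueError("volume must be between 0 and 100")
        if not 0.5 <= self.rate <= 2.0:
            raise ValueError("rate must be between 0.5 and 2.0")
        if not 0.5 <= self.pitch <= 2.0:
            raise ValueError("pitch must be between 0.5 and 2.0")
```

For example, `TtsRequest(model="cosyvoice-v3-flash", text="Hello", voice="longanyang")` picks up every default, while `rate=2.5` raises `ValueError` before any request is sent.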

### Voice Cloning
| Parameter | Type | Required | Default | Constraints | Description |
|-----------|------|----------|---------|-------------|-------------|
| model | string | Yes | - | qwen-voice-enrollment / voice-enrollment | The voice cloning model. |
| action | string | Yes | - | create / list / delete | The operation to perform. |
| target_model | string | Yes | - | Must match subsequent TTS model | The TTS model that will use this voice. |
| preferred_name | string | Yes | - | Max 16 chars, alphanumeric / underscore | A recognizable name for the voice. |
| audio.data | string | Yes | - | Data URI or Audio URL, max 10MB | The audio sample for cloning. |
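
A sketch of preparing the `create` input from a local audio sample follows. It assumes the 10 MB limit applies to the raw audio bytes and means 10 × 1024 × 1024 bytes; both points should be confirmed against the Voice Cloning API reference. The helper names are hypothetical.

```python
import base64
import re

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # assumed interpretation of "max 10MB"
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,16}")


def to_data_uri(audio: bytes, mime_type: str = "audio/mpeg") -> str:
    """Encode raw audio bytes as a base64 data URI."""
    if len(audio) > MAX_AUDIO_BYTES:
        raise ValueError("audio sample exceeds 10 MB")
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_create_voice_input(target_model: str, preferred_name: str,
                             audio: bytes, mime_type: str = "audio/mpeg") -> dict:
    """Build the `input` object for a voice `create` action."""
    if not NAME_PATTERN.fullmatch(preferred_name):
        raise ValueError("preferred_name: up to 16 letters, digits, or underscores")
    return {
        "action": "create",
        "target_model": target_model,
        "preferred_name": preferred_name,
        "audio": {"data": to_data_uri(audio, mime_type)},
    }
```

The resulting dictionary goes under the `input` key of the request body, alongside the top-level `model` field.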

## Code Examples

### Audio File Transcription - Python - Async Task
```python
from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json

# Set the regional endpoint. 
# For Singapore region, use: 'https://dashscope-intl.aliyuncs.com/api/v1'
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'],
    language_hints=['zh', 'en']
)

transcribe_response = Transcription.wait(task=task_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
    print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
    print('transcription done!')
else:
    # On a failed call `output` may be empty; the error text is on the response itself.
    print('Error:', transcribe_response.message)
```

### Real-Time Speech Recognition - Java - WebSocket
```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.utils.Constants;
import java.io.File;

public class Main {
    public static void main(String[] args) {
        // For Singapore region, use: "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference"
        Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
        
        Recognition recognizer = new Recognition();
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .parameter("language_hints", new String[]{"zh", "en"})
                .build();

        try {
            System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            recognizer.getDuplexApi().close(1000, "bye");
        }
        System.exit(0);
    }
}
```

### Non-Real-Time Text-to-Speech - Python - Synchronous
```python
import os
from dashscope.audio.http_tts.http_speech_synthesizer import HttpSpeechSynthesizer

api_key = os.getenv("DASHSCOPE_API_KEY")

# Non-streaming call, returns an audio URL
result = HttpSpeechSynthesizer.call(
    model="cosyvoice-v3-flash",
    text="Today is a great day to build products that people love!",
    voice="longanhuan",
    format="wav",
    sample_rate=24000,
    stream=False,
    api_key=api_key,
)

print(f"Audio URL: {result.audio_url}")
print(f"Audio ID: {result.audio_id}")
print(f"Expiration time: {result.expires_at}")
```

### Voice Cloning - Bash - Synchronous
```bash
# Note: Use https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization for the Singapore region.
curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-voice-enrollment",
    "input": {
        "action": "create",
        "target_model": "qwen3-tts-vc-2026-01-22",
        "preferred_name": "guanyu",
        "audio": {
            "data": "data:audio/mpeg;base64,<YOUR_BASE64_AUDIO_DATA>"
        }
    }
}'
```

### Music Generation - Bash - Synchronous
```bash
curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/audio/music/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "fun-music-v1",
    "input": {
        "prompt": "Fresh summer folk song, acoustic guitar and harmonica accompaniment, upbeat tempo",
        "gender": "female"
    }
}'
```

## Response Format

### Async Task Submission (File Transcription)
```json
{
  "output": {
    "task_status": "PENDING",
    "task_id": "c2e5d63b-96e1-4607-bb91-xxxxxxxxxxxx"
  },
  "request_id": "77ae55ae-be17-97b8-9942-xxxxxxxxxxxx"
}
```

**Key Fields**:
- `output.task_id` — The unique identifier used to poll for task completion.
- `output.task_status` — Current state: PENDING, RUNNING, SUCCEEDED, or FAILED.

### Synchronous TTS Response
```json
{
  "request_id": "ee88b03d-0457-9286-8c67-xxxxxxxxxxxx",
  "output": {
    "finish_reason": "stop",
    "audio": {
      "url": "http://dashscope-result-bj.oss-cn-beijing.aliyuncs.com/.../audio.wav",
      "id": "audio_ee88b03d-0457-9286-8c67-xxxxxxxxxxxx",
      "expires_at": 1772697707
    }
  },
  "usage": {
    "characters": 15
  }
}
```

**Key Fields**:
- `output.audio.url` — The temporary URL to download the generated audio file.
- `output.audio.expires_at` — Unix timestamp indicating when the URL expires (typically 24 hours).
- `usage.characters` — The number of characters billed for this synthesis request.
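
Since `expires_at` is a Unix timestamp, the remaining lifetime of the URL is a simple subtraction. The helper name below is hypothetical, and the `now` parameter exists only to make the function testable.

```python
import time
from typing import Optional


def seconds_until_expiry(response: dict, now: Optional[float] = None) -> float:
    """Return how many seconds remain before the audio URL expires."""
    expires_at = response["output"]["audio"]["expires_at"]
    current = time.time() if now is None else now
    return expires_at - current
```

A negative result means the URL has already expired and the audio must be synthesized again.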

## Error Handling

| HTTP Status / Error Code | Description | Recommended Action |
|------------|-------------|--------------------|
| 400 | Bad Request: Invalid parameters or malformed JSON. | Verify request body structure, model IDs, and parameter constraints. |
| 401 | Unauthorized: Invalid or missing API key. | Ensure `Authorization: Bearer $DASHSCOPE_API_KEY` is correctly set. |
| 403 | Forbidden: Access denied or region mismatch. | Check if your API key matches the endpoint region (China vs International). |
| 404 | Not Found: Resource or vocabulary ID does not exist. | Verify the ID exists and hasn't been deleted. |
| 429 | Too Many Requests: Rate limit exceeded. | Implement exponential backoff and reduce request frequency. |
| 500 | Internal Server Error: Unexpected server-side issue. | Retry the request after a short delay. |
| InvalidFile.DownloadFailed | The audio file URL cannot be accessed. | Ensure the file URL is publicly accessible and properly URL-encoded. |
| Audio.PreprocessError | Voice cloning audio failed preprocessing. | Provide a cleaner audio sample without background noise, matching the transcript. |

### Rate Limits & Retry
- Most REST APIs are limited to **100 QPS** per model.
- WebSocket APIs typically allow up to **10 concurrent connections** per host.
- For `429` errors, implement an exponential backoff strategy. For Async Tasks, use the `wait()` SDK method or poll with a 1-2 second delay to avoid hitting query limits.
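
A minimal exponential backoff sketch is shown below. `RateLimitError` is a hypothetical exception that the calling code would raise when it sees a `429` response, and the base delay, cap, and jitter range are illustrative choices rather than documented values.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by the caller on an HTTP 429 response."""


def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield delays that double on each retry and never exceed `cap`."""
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt)


def call_with_backoff(func: Callable[[], T], max_retries: int = 5,
                      base: float = 1.0, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      jitter: Callable[[], float] = random.random) -> T:
    """Call `func`, retrying with jittered exponential backoff on rate limits."""
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return func()
        except RateLimitError:
            # Wait between 50% and 100% of the nominal delay to spread retries.
            sleep(delay * (0.5 + 0.5 * jitter()))
    # Final attempt: let any remaining error propagate to the caller.
    return func()
```

With the defaults, the nominal delays are 1, 2, 4, 8, and 16 seconds before the last attempt.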

## Requirements

- **Python SDK**: `pip install "dashscope>=1.25.17"` (recommended for the latest TTS/ASR features). Quote the requirement so the shell does not interpret `>` as an output redirect.
- **Java SDK**: `com.alibaba:dashscope-sdk-java:2.22.15` or later.
- **Environment Variable**: `export DASHSCOPE_API_KEY=your_api_key`
- **Audio Requirements**: Real-time ASR typically requires 16kHz, 16-bit, mono PCM audio. Voice cloning requires 10-20 seconds of clean audio (WAV/MP3/M4A, max 10MB).

## FAQ

**Q: How do I handle long audio files for transcription?**
A: Use the Async Task API (e.g., `fun-asr` or `paraformer-v2`). It supports files up to 2GB and 12 hours in duration. Submit the task, poll the `task_id`, and download the result from the returned URL.

**Q: Why is my WebSocket connection closing unexpectedly during Real-Time ASR?**
A: WebSocket connections may time out if there is prolonged silence. Enable the `heartbeat` parameter in your request to keep the connection alive, or ensure you send a `finish-task` / `session.finish` event when the audio stream ends.

**Q: How can I improve recognition accuracy for domain-specific terms or names?**
A: Use the Custom Vocabulary API (`speech-biasing` model) to create a hotword list. Assign weights (1-5) to your terms, then pass the returned `vocabulary_id` in your ASR requests.
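
Hotword weights can be validated locally before the list is uploaded. In the sketch below, the entry shape with `text` and `weight` keys is an assumption about the vocabulary format and should be checked against the Custom hotwords reference; `build_hotwords` is a hypothetical helper.

```python
def build_hotwords(terms: dict) -> list:
    """Turn a {term: weight} mapping into a list of hotword entries."""
    entries = []
    for text, weight in terms.items():
        if not text.strip():
            raise ValueError("hotword text must not be empty")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(weight, bool) or not isinstance(weight, int) or not 1 <= weight <= 5:
            raise ValueError(f"weight for {text!r} must be an integer from 1 to 5")
        entries.append({"text": text, "weight": weight})
    return entries
```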

**Q: What is the difference between `server_commit` and `commit` modes in Real-Time TTS?**
A: In `server_commit` mode, the server automatically segments the text buffer and triggers synthesis based on punctuation. In `commit` mode, you must manually send an `input_text_buffer.commit` event to trigger synthesis, giving you precise control over latency and chunking.
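
The difference shows up in the sequence of client events. The sketch below generates that sequence for either mode. The event names come from this document, while the JSON field names (`type`, `text`, `session`, `mode`) are assumptions to verify against the client events reference, and a real `session.update` would also carry voice and format settings.

```python
from typing import Iterable, Iterator


def tts_text_events(chunks: Iterable[str], mode: str) -> Iterator[dict]:
    """Yield the client events for one synthesis session in the given mode."""
    if mode not in ("server_commit", "commit"):
        raise ValueError("mode must be 'server_commit' or 'commit'")
    yield {"type": "session.update", "session": {"mode": mode}}
    for chunk in chunks:
        yield {"type": "input_text_buffer.append", "text": chunk}
        if mode == "commit":
            # In commit mode the client decides when synthesis starts.
            yield {"type": "input_text_buffer.commit"}
    yield {"type": "session.finish"}
```

In `server_commit` mode the generator emits only `append` events between `session.update` and `session.finish`, because the server decides where to segment the text.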

**Q: Are generated audio URLs permanent?**
A: No. Audio URLs returned by synchronous TTS and Async ASR tasks are temporary and typically expire after 24 hours. You must download and store the files in your own persistent storage (like OSS) before they expire.
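
Copying a result to local storage before it expires needs only the standard library. This is a sketch with a hypothetical helper name; uploading the saved file to OSS or another store is left out.

```python
import shutil
import urllib.request
from pathlib import Path


def save_audio(url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Stream the file at `url` to `dest` and return the destination path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        # Copy in chunks so large audio files are not held in memory.
        shutil.copyfileobj(response, out)
    return dest
```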

## Pricing & Billing

### Billing Model
Pricing varies by capability:
- **ASR (Real-time & File)**: Billed per second of audio processed or per 1,000 tokens.
- **TTS**: Billed per 1,000 characters or tokens processed.
- **Voice Cloning**: Billed per creation attempt (with free tiers available).
- **Music Generation**: Billed per minute of generated audio or per request.

### Price Reference

| Model / Capability | Input Price | Output Price | Other Fees |
|--------------------|-------------|--------------|------------|
| fun-asr (File ASR) | ~0.00022 CNY / sec | - | - |
| paraformer-v2 (File ASR) | ~0.00008 CNY / sec | - | - |
| cosyvoice-v3-flash (TTS) | 0.002 CNY / 1K tokens | 0.002 CNY / 1K tokens | - |
| qwen3-tts-flash (TTS) | 0.0002 CNY / 1K tokens | 0.0004 CNY / 1K tokens | - |
| MiniMax/speech-2.8-hd (TTS) | 3.5 CNY / 10K chars | 3.5 CNY / 10K chars | 9.9 CNY voice unlock fee |
| Voice Cloning (Qwen/Cosy) | - | - | 0.01 CNY / attempt |
| fun-music-v1 (Music) | 0.002 CNY / min | 0.002 CNY / min | - |
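
As a worked example of the per-second rates, the sketch below estimates file transcription cost from audio duration. The rates are the approximate values from the table and may not match current pricing; the function name is hypothetical.

```python
# Approximate per-second prices in CNY, copied from the table above.
FILE_ASR_PRICE_PER_SECOND = {
    "fun-asr": 0.00022,
    "paraformer-v2": 0.00008,
}


def estimate_file_asr_cost(model: str, duration_seconds: float) -> float:
    """Estimate the transcription cost in CNY for one audio file."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must not be negative")
    return FILE_ASR_PRICE_PER_SECOND[model] * duration_seconds
```

One hour of audio (3,600 seconds) comes to roughly 0.79 CNY with `fun-asr` and roughly 0.29 CNY with `paraformer-v2`.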

### Free Tier
- **ASR / TTS**: Many models include 1,000,000 free tokens or 100 free minutes per month upon activation.
- **Voice Cloning**: 1,000 free voice creation attempts within 90 days of activating Model Studio.

### Billing Notes
- Async tasks are billed upon completion. If a task fails due to user error (e.g., invalid file format), minimum charges may still apply.
- For TTS, punctuation and SSML tags are generally excluded from character counts, but this varies slightly by model.
- Audio files generated by TTS are stored temporarily; storage fees do not apply, but download bandwidth may be subject to standard network rates.

## Source Documents

- Client events_5722209.xdita
- Java SDK_5907036.xdita
- Python SDK_5905421.xdita
- Real-time multimodal_5722208.xdita
- Server events_5722210.xdita
- Real-time long-form speech recognition Gummy_5465112.xdita
- Real-time short speech recognition Gummy_5470425.xdita
- WebSocket API_5523267.xdita
- iOS SDK_6232482.xdita
- Android SDK_6223775.xdita
- Client events_6564851.xdita
- Java SDK_6124089.xdita
- Python SDK_6124088.xdita
- Real-time speech recognition Fun-ASR_6124087.xdita
- Server-side events_6564852.xdita
- Java SDK_5497417.xdita
- Paraformer real-time speech recognition client events_6564849.xdita
- Python SDK_5497421.xdita
- Real-time speech recognition Paraformer_4759742.xdita
- Client events_6184310.xdita
- Interaction flow_6185930.xdita
- Java SDK_6203300.xdita
- Qwen-ASR-Realtime Python SDK - API reference_6203288.xdita
- Real-time speech recognition Qwen-ASR-Realtime_6184304.xdita
- Server events_6184673.xdita
- WebSocket API_6185930.xdita
- Audio file recognition Qwen-ASR_6183219.xdita
- Java SDK_5509678.xdita
- Python SDK_5509720.xdita
- RESTful API_5509740.xdita
- Java SDK_6006103.xdita
- Python SDK_6006104.xdita
- RESTful API_6006106.xdita
- Recording file recognition Fun-ASR_6124084.xdita
- Recording file recognition Paraformer_4759743.xdita
- iOS SDK_6190271.xdita
- Speech-to-text_6488501.xdita
- Audio file recognition - Fun-ASRParaformerSenseVoice_5603927.xdita
- Audio file recognition - Qwen_6019701.xdita
- Non-real-time speech recognition_6019701.xdita
- Custom hotwords_6560848.xdita
- Non-real-time Speech Synthesis MiniMax_6410314.xdita
- Synchronous speech synthesis_6410695.xdita
- HTTP API reference_6528508.xdita
- Java SDK Reference_6528341.xdita
- Non-real-time speech synthesis CosyVoice_6473969.xdita
- Python SDK reference_6528342.xdita
- Speech synthesis CosyVoice_6473969.xdita
- Speech synthesis Qwen-TTS_5615937.xdita
- Voice list_6235383.xdita
- Android SDK_6169714.xdita
- Java SDK_5505671.xdita
- Python SDK_5505804.xdita
- Speech Synthesis MiniMax_6410314.xdita
- Android SDK_6164947.xdita
- Client events_6551229.xdita
- Java SDK_5490310.xdita
- Python SDK_5490851.xdita
- Real-time speech synthesis CosyVoice_4995365.xdita
- Server-side events_6551246.xdita
- WebSocket API reference_5261848.xdita
- WebSocket API_5261848.xdita
- Client events_5877973.xdita
- Interaction flow for real-time speech synthesis_5934372.xdita
- Java SDK_5907082.xdita
- Python SDK_5907037.xdita
- Real-time speech synthesis Qwen-TTS-Realtime_5877972.xdita
- Server events_5877974.xdita
- WebSocket API_5934372.xdita
- Client events_6551994.xdita
- Real-time speech synthesis Sambert_4759665.xdita
- Sambert Speech Synthesis WebSocket API_5300092.xdita
- Server-sent events_6551995.xdita
- WebSocket API_5300092.xdita
- Qwen voice cloning_5991773.xdita
- Qwen voice design_6285161.xdita
- Voice cloning_6411076.xdita
- Voice management_6468432.xdita
- HTTP API reference_6495058.xdita
- Java SDK reference_6495059.xdita
- Python SDK reference_6495061.xdita
- Voice Cloning API reference_6495057.xdita
- Voice Design_6551975.xdita
- Real-time speech synthesis - Qwen_5876806.xdita
- Real-time speech synthesis_5876806.xdita
- Speech synthesis - Qwen_5585737.xdita
- Non-real-time speech synthesis_5585737.xdita
- Voice list_6475333.xdita
- SSML_6551098.xdita
- Voice cloning_6527639.xdita
- Voice Design_6551073.xdita
- Voice cloning_6551072.xdita
- Music generation API referenceFun-Music_6533469.xdita
- Music generationFun-music_6533469.xdita
- Music generation_6547626.xdita
- Music generation_6533468.xdita