Document AI RAG Pipeline

A developer extracts text and structured data from unstructured source documents (PDFs, scanned images) using Bailian's document extraction, ingests the processed content into Elasticsearch as a searchable index, and then builds a RAG knowledge base and retrieval pipeline on top to power an AI-driven document Q&A application.

Scenario

Use this pipeline when you need to transform unstructured enterprise documents (PDFs, scanned invoices, technical manuals) into a searchable, AI-ready knowledge base. It bridges high-accuracy visual/text extraction with scalable vector search, enabling low-latency, context-grounded Q&A applications without manual data preprocessing.

Integration steps

Extract document content: Call Bailian’s extraction API with qvq-max for complex layout parsing:

```bash
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-process/extract \
  -H "Authorization: Bearer $BAILIAN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qvq-max", "input": {"url": "https://docs.example.com/manual.pdf"}, "parameters": {"enable_thinking": true, "output_format": "json"}}'
```

Parse & chunk output: Extract text_blocks and tables from the JSON response. Split text into 512-token chunks with 10% overlap using tiktoken.
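
The chunking rule above can be sketched as a sliding window over a token list. This is a minimal sketch under stated assumptions: chunk_tokens is a hypothetical helper, and the whitespace split only stands in for a real tokenizer such as tiktoken, whose encode and decode calls would replace it.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], size: int = 512, overlap_ratio: float = 0.10
) -> List[List[str]]:
    """Split tokens into windows of `size` tokens that overlap by `overlap_ratio`."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(size * overlap_ratio)  # 512 * 0.10 gives 51 shared tokens
    step = size - overlap                # each window starts 461 tokens later
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):  # this window already reached the end
            break
    return chunks


# Whitespace tokens stand in for real tokenizer output (assumption).
sample_tokens = "text extracted from a manual page".split()
sample_chunks = chunk_tokens(sample_tokens, size=4, overlap_ratio=0.25)
```

Stopping as soon as a window reaches the end avoids emitting a trailing chunk that only repeats the overlap region.
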
Generate embeddings: Vectorize each chunk via Bailian’s text-embedding-v3:

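```python
import os

import dashscope

dashscope.api_key = os.getenv("BAILIAN_API_KEY")

# `chunks` is the list of text chunks produced in the previous step.
resp = dashscope.TextEmbedding.call(model="text-embedding-v3", input=chunks)
# Each returned item carries one "embedding" vector.
vectors = [item["embedding"] for item in resp["output"]["embeddings"]]
```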
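
Embedding endpoints usually cap the number of inputs per request, so a long chunk list is normally sent in batches. The sketch below shows one way to do that while preserving order. The names embed_in_batches and embed_fn are hypothetical, embed_fn stands in for a wrapper around the embedding call shown above, and the batch size of 10 is an assumption to check against the model's documented limit.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 10,  # assumed per-request cap, not taken from the source
) -> List[Vector]:
    """Embed chunks batch by batch, keeping vectors in chunk order."""
    vectors: List[Vector] = []
    for batch in batched(chunks, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedding call returned a different item count")
        vectors.extend(batch_vectors)
    return vectors


def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in embedder for local testing: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]
```
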

Index in Elasticsearch: Push chunks + vectors using _bulk for high-throughput writes (up to 100 QPS):

```bash
# --data-binary keeps the newlines that the NDJSON bulk format depends on
curl -X POST "https://$ES_HOST:9200/doc-rag-index/_bulk?refresh=wait_for" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @bulk_payload.json
```

Map the embedding field as dense_vector with dims: 1024 and index: true.
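
The bulk_payload.json file is newline-delimited JSON: one action line followed by one document line per chunk, with a trailing newline. A minimal sketch of building it follows. The helper build_bulk_payload, the text field name, and the chunk-N ID scheme are assumptions for illustration; only the embedding field and the index name come from the steps above.

```python
import json
from typing import List, Sequence

EXPECTED_DIMS = 1024  # must equal the dims value in the index mapping


def build_bulk_payload(
    chunks: Sequence[str],
    vectors: Sequence[List[float]],
    index: str = "doc-rag-index",
    dims: int = EXPECTED_DIMS,
) -> str:
    """Return an NDJSON body for the Elasticsearch _bulk endpoint."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    lines: List[str] = []
    for i, (text, vector) in enumerate(zip(chunks, vectors)):
        if len(vector) != dims:
            raise ValueError(f"chunk {i}: expected {dims} dims, got {len(vector)}")
        # Action line: tells _bulk which index and document ID to write.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"chunk-{i}"}}))
        # Source line: the document itself.
        lines.append(json.dumps({"text": text, "embedding": vector}, ensure_ascii=False))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

Writing the returned string to bulk_payload.json with UTF-8 encoding yields a file suitable for the bulk request. Checking the vector length here surfaces a dimension mismatch before Elasticsearch rejects the write.
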

Register ES as a RAG data source: Attach the index to Bailian’s knowledge base:

```bash
# $ES_HOST is spliced in outside the single quotes so the shell expands it
curl -X POST https://dashscope.aliyuncs.com/api/v1/knowledge-bases \
  -H "Authorization: Bearer $BAILIAN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "doc-rag-kb", "data_source": {"type": "elasticsearch", "endpoint": "'"$ES_HOST"'", "index": "doc-rag-index", "vector_field": "embedding"}}'
```

Configure retrieval & reranking: Set top_k: 5, enable hybrid search (BM25 + KNN), and attach gte-rerank to filter context before LLM generation.
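
To make the hybrid setting concrete, the sketch below builds the kind of request body Elasticsearch 8.x accepts when a BM25 match query and a kNN clause are combined in one search. It illustrates what a direct query against the index could look like and is not Bailian's internal retrieval call. The helper build_hybrid_query, the text field name, and num_candidates=50 are assumptions.

```python
from typing import Any, Dict, List


def build_hybrid_query(
    question: str,
    query_vector: List[float],
    top_k: int = 5,
    num_candidates: int = 50,  # assumed candidate pool per shard
) -> Dict[str, Any]:
    """Build a search body that combines a BM25 match with a kNN clause."""
    if num_candidates < top_k:
        raise ValueError("num_candidates must be at least top_k")
    return {
        "size": top_k,
        # Lexical side: BM25 scoring over the chunk text.
        "query": {"match": {"text": question}},
        # Vector side: approximate nearest neighbours on the embedding field.
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],
    }
```

The returned dictionary can be sent as the JSON body of a search request on doc-rag-index. The hits would then go to the reranker before prompt assembly.
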

Architecture

Data flows unidirectionally: Bailian’s qvq-max extracts and structures raw documents → chunks are embedded via text-embedding-v3 → payloads are batched into Elasticsearch for scalable hybrid search → Bailian’s RAG engine queries ES, reranks results, and injects context into the generation prompt. ES handles storage and retrieval; Bailian manages AI reasoning, embedding, and orchestration.

Prerequisites

Active Alibaba Cloud Bailian workspace with API key (BAILIAN_API_KEY)
Running Elasticsearch/OpenSearch cluster (v8.0+) with dense_vector and knn support
Python 3.9+ with dashscope and elasticsearch SDKs installed
Pre-configured ES index mapping matching embedding dimensions (1024)
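
An index mapping that satisfies the last prerequisite might look like the sketch below. The text field name and the cosine similarity setting are assumptions; the embedding field, 1024 dimensions, and index: true come from this page.

```python
import json
from typing import Any, Dict


def build_index_mapping(dims: int = 1024, similarity: str = "cosine") -> Dict[str, Any]:
    """Return a create-index body with a text field and an indexed dense_vector."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,              # must equal the embedding size
                    "index": True,             # enables kNN search on the field
                    "similarity": similarity,  # assumed metric
                },
            }
        }
    }


# Print the body that would be sent when creating doc-rag-index.
print(json.dumps(build_index_mapping(), indent=2))
```
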

Common pitfalls

Token limit overflow: Skipping chunking causes text-embedding-v3 to truncate; enforce strict 512-token limits with overlap.
ES visibility lag: Omitting ?refresh=wait_for on bulk writes delays retrieval; always use explicit refresh or schedule background sync.
Vector dimension mismatch: ES dense_vector dims must exactly match the model output (1024), or KNN queries will throw mapping errors.
Hybrid search imbalance: Overweighting BM25 vs. vector similarity degrades semantic recall; tune alpha (0.5–0.7) in Bailian’s retrieval config.
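
The effect of alpha can be illustrated with a simple weighted blend of normalized scores. This is a sketch of one common fusion scheme, not a description of how Bailian computes its ranking: the functions are hypothetical, and treating alpha as the weight on vector similarity is an assumption.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend(
    bm25: Dict[str, float], vector: Dict[str, float], alpha: float = 0.6
) -> List[Tuple[str, float]]:
    """Rank documents by alpha * vector score + (1 - alpha) * BM25 score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    b, v = min_max(bm25), min_max(vector)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1.0 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(b) | set(v)
    }
    # Highest fused score first; ties are broken by document ID.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Document "a" wins on keywords, document "b" wins on semantic similarity.
bm25_scores = {"a": 10.0, "b": 2.0}
vector_scores = {"a": 0.2, "b": 0.9}
ranking = blend(bm25_scores, vector_scores, alpha=0.6)
```

With these sample scores, lowering alpha toward BM25 moves the keyword match back to the top, which is the trade-off being tuned.
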

Typical questions

build RAG from PDF documents
extract documents and build knowledge base
extract data from PDFs to build RAG
document extraction to search pipeline
ingest scanned documents for AI Q&A
import extracted documents into Elasticsearch for retrieval
OCR extract then build RAG
unstructured documents to retrieval system