A developer extracts text and structured data from unstructured source documents (PDFs, scanned images) using Bailian's document extraction, ingests the processed content into Elasticsearch as a searchable index, and then builds a RAG knowledge base and retrieval pipeline on top to power an AI-driven document Q&A application.
Use this pipeline when you need to transform unstructured enterprise documents (PDFs, scanned invoices, technical manuals) into a searchable, AI-ready knowledge base. It bridges high-accuracy visual/text extraction with scalable vector search, enabling low-latency, context-grounded Q&A applications without manual data preprocessing.

Extract document content with `qvq-max` for complex layout parsing:

```bash
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-process/extract \
  -H "Authorization: Bearer $BAILIAN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qvq-max", "input": {"url": "https://docs.example.com/manual.pdf"}, "parameters": {"enable_thinking": true, "output_format": "json"}}'
```

Parse `text_blocks` and `tables` from the JSON response. Split text into 512-token chunks with 10% overlap using `tiktoken`.

Embed the chunks with `text-embedding-v3`:

```python
import os

import dashscope

dashscope.api_key = os.getenv("BAILIAN_API_KEY")

# `chunks` is the list of text chunks produced by the splitting step.
resp = dashscope.TextEmbedding.call(model="text-embedding-v3", input=chunks)
vectors = [item["embedding"] for item in resp["output"]["embeddings"]]
```
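
The 512-token split with 10% overlap can be sketched as a small windowing helper. This is a minimal illustration rather than the pipeline's actual splitter: the function name `chunk_tokens` and its parameters are hypothetical, and it works on any sequence of token IDs so that it does not depend on a particular tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int],
    size: int = 512,
    overlap_ratio: float = 0.10,
) -> List[List[int]]:
    """Split a token sequence into windows of at most `size` tokens.

    Consecutive windows share int(size * overlap_ratio) tokens.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(size * overlap_ratio)  # 51 tokens when size is 512
    step = size - overlap                # distance between window starts

    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once the current window already reaches the end of the input.
        if start + size >= len(tokens):
            break
    return chunks
```

With `tiktoken`, the token IDs could come from `enc = tiktoken.get_encoding("cl100k_base")` and `enc.encode(text)`, and each window can be turned back into text with `enc.decode(window)`. The choice of `cl100k_base` is an assumption; it is not the tokenizer of `text-embedding-v3`, so the counts are approximate.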
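
Ingest the chunks into Elasticsearch with `_bulk` for high-throughput writes (up to 100 QPS):

```bash
# --data-binary preserves the newlines that the NDJSON bulk format requires;
# plain -d would strip them and the request would be rejected.
curl -X POST "https://$ES_HOST:9200/doc-rag-index/_bulk?refresh=wait_for" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @bulk_payload.json
```

Map `embedding` as `dense_vector` with `dims: 1024` and `index: true`.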
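
A possible way to produce `bulk_payload.json` and the matching index mapping is sketched below. The `text` field name, the `chunk-N` document IDs, the `cosine` similarity setting, and the helper name `build_bulk_payload` are assumptions made for illustration; only the `embedding` field, `dims: 1024`, and `index: true` come from the steps above.

```python
import json
from typing import List, Sequence

INDEX_NAME = "doc-rag-index"
EMBEDDING_DIMS = 1024

# Mapping body that could be sent when creating the index.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},  # assumed field name for chunk text
            "embedding": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
                "index": True,
                "similarity": "cosine",  # assumption, not stated in the guide
            },
        }
    }
}


def build_bulk_payload(
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
    index: str = INDEX_NAME,
) -> str:
    """Return an NDJSON string: one action line plus one document line per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")

    lines: List[str] = []
    for i, (text, vector) in enumerate(zip(chunks, vectors)):
        lines.append(json.dumps({"index": {"_index": index, "_id": f"chunk-{i}"}}))
        lines.append(json.dumps({"text": text, "embedding": list(vector)}))

    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Writing the returned string to `bulk_payload.json` gives a file that can be posted to `_bulk` as-is.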
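
Create a Bailian knowledge base that uses the Elasticsearch index as its data source:

```bash
# The single-quoted JSON is closed around $ES_HOST so the shell expands it.
curl -X POST https://dashscope.aliyuncs.com/api/v1/knowledge-bases \
  -H "Authorization: Bearer $BAILIAN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "doc-rag-kb", "data_source": {"type": "elasticsearch", "endpoint": "'"$ES_HOST"'", "index": "doc-rag-index", "vector_field": "embedding"}}'
```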

Set `top_k: 5`, enable hybrid search (BM25 + KNN), and attach `gte-rerank` to filter context before LLM generation.

Data flows unidirectionally: Bailian’s `qvq-max` extracts and structures raw documents → chunks are embedded via `text-embedding-v3` → payloads are batched into Elasticsearch for scalable hybrid search → Bailian’s RAG engine queries ES, reranks results, and injects context into the generation prompt. ES handles storage and retrieval; Bailian manages AI reasoning, embedding, and orchestration.
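
To make the hybrid (BM25 + KNN) retrieval step concrete, here is a sketch of a request body that could be sent directly to the Elasticsearch 8.x `_search` endpoint. It is not the query Bailian's RAG engine issues internally, which the guide does not specify; the `text` field name, the `num_candidates` default, and the helper name `build_hybrid_query` are assumptions.

```python
from typing import Any, Dict, Sequence


def build_hybrid_query(
    question: str,
    query_vector: Sequence[float],
    top_k: int = 5,
    num_candidates: int = 50,
) -> Dict[str, Any]:
    """Build a search body combining a BM25 match with an approximate kNN clause."""
    if num_candidates < top_k:
        raise ValueError("num_candidates must be at least top_k")

    return {
        "size": top_k,
        # Lexical side: BM25 scoring over the chunk text (assumed field name).
        "query": {"match": {"text": question}},
        # Vector side: approximate nearest neighbours on the dense_vector field.
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": top_k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],
    }
```

The hits returned for such a body would be the candidates passed on to the reranker before generation.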
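
Prerequisites:

- A Bailian API key (exported as `BAILIAN_API_KEY`)
- An Elasticsearch cluster with `dense_vector` and `knn` support
- The `dashscope` and `elasticsearch` SDKs installed

Common pitfalls:

- Oversized chunks cause `text-embedding-v3` to truncate; enforce strict 512-token limits with overlap.
- Bulk writes sent without `?refresh=wait_for` are not searchable until the next index refresh, which delays retrieval; always use explicit refresh or schedule background sync.
- `dense_vector` `dims` must exactly match the model output (1024), or KNN queries will throw mapping errors.
- Tune the hybrid search weight `alpha` (0.5–0.7) in Bailian’s retrieval config.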
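
The dimension mismatch is cheap to catch before anything is written to the index. The check below is a minimal sketch; the helper name `validate_vectors` is hypothetical, and the expected size of 1024 is taken from the mapping described in this guide.

```python
from typing import Sequence


def validate_vectors(
    vectors: Sequence[Sequence[float]],
    expected_dims: int = 1024,
) -> int:
    """Raise ValueError on the first vector whose length differs from expected_dims.

    Returns the number of vectors checked.
    """
    for i, vector in enumerate(vectors):
        if len(vector) != expected_dims:
            raise ValueError(
                f"vector {i} has {len(vector)} dims, expected {expected_dims}"
            )
    return len(vectors)
```

Calling `validate_vectors(vectors)` right after the embedding step fails fast in the pipeline instead of surfacing later as an Elasticsearch mapping error.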