Document AI RAG Pipeline

A developer extracts text and structured data from unstructured source documents (PDFs, scanned images) using Bailian's document extraction, ingests the processed content into Elasticsearch as a searchable index, and then builds a RAG knowledge base and retrieval pipeline on top to power an AI-driven document Q&A application.

Products involved

Scenario

Use this pipeline when you need to transform unstructured enterprise documents (PDFs, scanned invoices, technical manuals) into a searchable, AI-ready knowledge base. It bridges high-accuracy visual/text extraction with scalable vector search, enabling low-latency, context-grounded Q&A applications without manual data preprocessing.

Integration steps

  1. Extract document content: Call Bailian’s extraction API with qvq-max for complex layout parsing:

     ```bash
     curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-process/extract \
       -H "Authorization: Bearer $BAILIAN_API_KEY" \
       -H "Content-Type: application/json" \
       -d '{"model": "qvq-max", "input": {"url": "https://docs.example.com/manual.pdf"}, "parameters": {"enable_thinking": true, "output_format": "json"}}'
     ```

  2. Parse & chunk output: Extract text_blocks and tables from the JSON response. Split the text into 512-token chunks with 10% overlap (about 51 tokens shared between neighboring chunks) using tiktoken.

  3. Generate embeddings: Vectorize each chunk via Bailian’s text-embedding-v3:

     ```python
     import os

     import dashscope

     dashscope.api_key = os.getenv("BAILIAN_API_KEY")

     # chunks is the list of text chunks produced by the chunking step
     resp = dashscope.TextEmbedding.call(model="text-embedding-v3", input=chunks)
     # One embedding is returned per input chunk
     vectors = [item["embedding"] for item in resp["output"]["embeddings"]]
     ```

  4. Index in Elasticsearch: Push chunks + vectors using _bulk for high-throughput writes (up to 100 QPS):

     ```bash
     # --data-binary keeps the newlines that the _bulk format requires (-d would strip them)
     curl -X POST "https://$ES_HOST:9200/doc-rag-index/_bulk?refresh=wait_for" \
       -H "Content-Type: application/x-ndjson" \
       --data-binary @bulk_payload.json
     ```

     Map embedding as dense_vector with dims: 1024 and index: true. The dims value must match the length of the embedding vectors.

  5. Register ES as a RAG data source: Attach the index to Bailian’s knowledge base:

     ```bash
     # The quoting around $ES_HOST lets the shell expand it inside the JSON body
     curl -X POST https://dashscope.aliyuncs.com/api/v1/knowledge-bases \
       -H "Authorization: Bearer $BAILIAN_API_KEY" \
       -H "Content-Type: application/json" \
       -d '{"name": "doc-rag-kb", "data_source": {"type": "elasticsearch", "endpoint": "'"$ES_HOST"'", "index": "doc-rag-index", "vector_field": "embedding"}}'
     ```

  6. Configure retrieval & reranking: Set top_k: 5, enable hybrid search (BM25 + KNN), and attach gte-rerank to filter context before LLM generation.
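
The chunking, mapping, and bulk-indexing steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation from Bailian or Elasticsearch. The helper names (chunk_tokens, build_index_mapping, build_bulk_payload), the text field, the chunk-N document IDs, and the cosine similarity setting are all hypothetical. Only the index name doc-rag-index, the embedding field, and the 1024 dimensions come from the steps above. Chunking operates on a list of token IDs, for example the output of a tiktoken encoder, so the sketch has no third-party dependencies.

```python
import json
from typing import Dict, List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 512,
                 overlap_ratio: float = 0.10) -> List[List[int]]:
    """Split token IDs into windows of `size` tokens that overlap by `overlap_ratio`."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = int(size * overlap_ratio)   # 512 * 0.10 -> 51 tokens
    step = size - overlap                 # each window starts 461 tokens later
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):   # this window already reached the end
            break
    return chunks


def build_index_mapping(dims: int = 1024) -> Dict:
    """Index mapping with a searchable dense_vector field (similarity is an assumption)."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},  # hypothetical field for the chunk text
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def build_bulk_payload(texts: Sequence[str], vectors: Sequence[Sequence[float]],
                       index: str = "doc-rag-index") -> str:
    """Build the newline-delimited JSON body expected by the _bulk API."""
    if len(texts) != len(vectors):
        raise ValueError("each chunk needs exactly one vector")
    lines = []
    for i, (text, vector) in enumerate(zip(texts, vectors)):
        # Action line first, then the document source on the next line
        lines.append(json.dumps({"index": {"_index": index, "_id": f"chunk-{i}"}}))
        lines.append(json.dumps({"text": text, "embedding": list(vector)}))
    return "\n".join(lines) + "\n"        # the body must end with a newline


if __name__ == "__main__":
    token_ids = list(range(1000))         # stand-in for real tokenizer output
    windows = chunk_tokens(token_ids)
    print([len(w) for w in windows])      # [512, 512, 78]
    print(build_bulk_payload(["first chunk"], [[0.1, 0.2]]), end="")
```

Writing the returned string to bulk_payload.json gives a file in the newline-delimited shape that the _bulk request expects. Decoding each window back to text before embedding is left out for brevity.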

Architecture

Data flows unidirectionally: Bailian’s qvq-max extracts and structures raw documents → chunks are embedded via text-embedding-v3 → payloads are batched into Elasticsearch for scalable hybrid search → Bailian’s RAG engine queries ES, reranks results, and injects context into the generation prompt. ES handles storage and retrieval; Bailian manages AI reasoning, embedding, and orchestration.
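
How Bailian’s RAG engine queries Elasticsearch internally is not specified here. As an illustration only, a hybrid BM25 + kNN request against the index could be built as below. The sketch assumes an Elasticsearch 8.x search body that combines a match query with a top-level knn clause. The function name build_hybrid_query, the text field, and the num_candidates default are hypothetical. The query vector is assumed to come from the same embedding model used at indexing time.

```python
from typing import Dict, Sequence


def build_hybrid_query(query_text: str, query_vector: Sequence[float],
                       top_k: int = 5, num_candidates: int = 50) -> Dict:
    """Build a search body that combines lexical (BM25) and vector (kNN) retrieval."""
    if num_candidates < top_k:
        raise ValueError("num_candidates must be at least top_k")
    return {
        "size": top_k,                           # number of hits returned overall
        "query": {"match": {"text": query_text}},  # BM25 over the chunk text
        "knn": {
            "field": "embedding",                # dense_vector field from the mapping
            "query_vector": list(query_vector),
            "k": top_k,
            "num_candidates": num_candidates,    # per-shard candidates considered
        },
    }


if __name__ == "__main__":
    body = build_hybrid_query("How do I reset the device?", [0.0] * 1024)
    print(body["size"], body["knn"]["k"], len(body["knn"]["query_vector"]))
```

The resulting dictionary can be sent as the JSON body of a _search request on doc-rag-index. The returned hits would then go to the reranker before being placed in the generation prompt.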

Prerequisites

Common pitfalls

Typical questions