DaaS / Products / Document Extraction to Searchable Index Pipeline

Document Extraction to Searchable Index Pipeline

A developer extracts text and structured data from PDFs, scanned images, and other documents using Bailian's document understanding capabilities, then ingests the processed content into Elasticsearch to build a searchable full-text index over previously unsearchable document archives.

Products involved

Scenario

When developers must unlock full-text search across legacy PDFs, scanned invoices, or image-heavy archives, they need a pipeline that converts unstructured pixels into query-ready text. This workflow combines Bailian’s vision-language extraction with Elasticsearch’s high-throughput ingestion and relevance tuning to transform static document dumps into a production-grade, low-latency search index.

Integration steps

  1. Configure Bailian Extraction: Set DASHSCOPE_API_KEY and target the qvq-max model. Call POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions with {"model": "qvq-max", "enable_thinking": true} to preserve complex layouts during OCR.
  2. Extract & Normalize Payloads: Send base64-encoded documents. Parse the JSON response to isolate {"doc_id": "inv_001", "content": "...", "tables": [...]}. Strip headers/footers using regex before indexing.
  3. Define ES Index Mapping: Run PUT /document-archive with mappings.properties.content.type: "text" and mappings.properties.doc_id.type: "keyword". Set "settings.index.refresh_interval": "30s" to optimize write throughput.
  4. Batch Ingest via Bulk API: Use POST /_bulk with {"index": {"_index": "document-archive", "_id": "inv_001"}} followed by the document JSON. Keep payloads under 5MB per request to sustain up to 100 QPS.
  5. Stage & Commit Changes: Ingest with ?refresh=false to buffer writes. Once the batch completes, trigger POST /document-archive/_refresh to atomically commit staged documents to the searchable index.
  6. Apply Relevance Tuning: Deploy neural reranking via POST /_search with "rank": {"type": "rrf", "window_size": 100} or load domain terms using PUT /_ingest/pipeline/synonym-pipeline to expand query matching.

Architecture

Raw files enter a client-side processor or message queue. Bailian acts as the extraction layer, running multimodal inference to convert PDFs/images into structured JSON. The normalized output is pushed to Elasticsearch via the Bulk API. Elasticsearch handles inverted index construction, storage, and query routing, while optional ingest pipelines and rerankers intercept queries to refine result ordering before returning to the client.

Prerequisites

Common pitfalls

Typical questions