A developer extracts text and structured data from PDFs, scanned images, and other documents using Bailian's document understanding capabilities, then ingests the processed content into Elasticsearch to build a searchable full-text index over previously unsearchable document archives.
When developers must unlock full-text search across legacy PDFs, scanned invoices, or image-heavy archives, they need a pipeline that converts unstructured pixels into query-ready text. This workflow combines Bailian’s vision-language extraction with Elasticsearch’s high-throughput ingestion and relevance tuning to transform static document dumps into a production-grade, low-latency search index.
1. Set DASHSCOPE_API_KEY and target the qvq-max model. Call POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions with {"model": "qvq-max", "enable_thinking": true} to preserve complex layouts during OCR.
2. Normalize the extraction output into {"doc_id": "inv_001", "content": "...", "tables": [...]}. Strip headers/footers using regex before indexing.
3. Create the index with PUT /document-archive, using mappings.properties.content.type: "text" and mappings.properties.doc_id.type: "keyword". Set "settings.index.refresh_interval": "30s" to optimize write throughput.
4. Ingest with POST /_bulk: send {"index": {"_index": "document-archive", "_id": "inv_001"}} followed by the document JSON on the next line. Keep payloads under 5MB per request to sustain up to 100 QPS.
5. Leave bulk requests at ?refresh=false (the default) to buffer writes. Once the batch completes, trigger POST /document-archive/_refresh to make the newly indexed documents visible to search.
6. Tune relevance with POST /_search and "rank": {"rrf": {"window_size": 100}} on Elasticsearch versions that accept the top-level rank parameter; RRF fuses two or more result sets, so pair it with at least two query sources. To expand query matching with domain terms, load them as a synonym set (PUT /_synonyms/<set-id>) referenced from a synonym_graph filter in the search analyzer; an ingest pipeline such as PUT /_ingest/pipeline/synonym-pipeline only rewrites documents at index time.

Raw files enter a client-side processor or message queue. Bailian acts as the extraction layer, running multimodal inference to convert PDFs/images into structured JSON. The normalized output is pushed to Elasticsearch via the Bulk API. Elasticsearch handles inverted index construction, storage, and query routing. Optional ingest pipelines transform documents at index time, and optional rerankers refine result ordering before results return to the client.
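The normalization, index-creation, and bulk steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the helper names (strip_headers_footers, normalize, bulk_body), the INDEX_BODY constant, and the two header/footer regex patterns are hypothetical and should be replaced with whatever matches your archive. The index name, field names, and refresh interval come from the steps above.

```python
import json
import re

# Hypothetical patterns: adjust to the headers/footers your archive really contains.
HEADER_FOOTER_PATTERNS = [
    re.compile(r"^\s*Page \d+ of \d+\s*$", re.IGNORECASE),
    re.compile(r"^\s*Confidential\s*$", re.IGNORECASE),
]


def strip_headers_footers(text):
    """Drop every line that fully matches one of the header/footer patterns."""
    kept = [
        line
        for line in text.splitlines()
        if not any(p.match(line) for p in HEADER_FOOTER_PATTERNS)
    ]
    return "\n".join(kept).strip()


def normalize(doc_id, raw_text, tables=None):
    """Shape extracted text into the {"doc_id", "content", "tables"} document."""
    return {
        "doc_id": doc_id,
        "content": strip_headers_footers(raw_text),
        "tables": tables or [],
    }


# Request body for PUT /document-archive, mirroring the settings in step 3.
# The "tables" field is left to dynamic mapping in this sketch.
INDEX_BODY = {
    "settings": {"index": {"refresh_interval": "30s"}},
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "doc_id": {"type": "keyword"},
        }
    },
}


def bulk_body(index, docs):
    """Build the newline-delimited JSON body for POST /_bulk.

    Each document contributes an action line and a source line, and the
    body ends with a trailing newline as the Bulk API requires.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["doc_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


doc = normalize("inv_001", "Page 1 of 3\nInvoice total: 42 EUR\nConfidential\n")
body = bulk_body("document-archive", [doc])
```

The resulting body would be sent to POST /_bulk with the Content-Type header set to application/x-ndjson, using either requests or the official elasticsearch client.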
Prerequisites:
- qvq-max model quota and API key
- requests or the official elasticsearch client

Pitfalls:
- Large inputs can exceed qvq-max limits; chunk by page or section before sending to Bailian.
- Setting refresh=true per document drops throughput below 100 QPS; always use refresh=false and batch 500–1000 docs.
- Disabling enable_thinking strips table boundaries and reading order; validate extraction output against complex forms before indexing.
- Configure ingest pipeline remove processors to drop unneeded fields before bulk ingestion.
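The chunking and batching advice above can be illustrated with a small Python sketch. It assumes pages arrive as a list of already-extracted strings and that documents carry a doc_id field. The names chunk_pages and batch_actions and the MAX_DOCS and MAX_BYTES constants are hypothetical, with the limits taken from the 500–1000 docs and 5MB figures in the text.

```python
import json

MAX_DOCS = 500               # lower end of the 500-1000 docs range in the text
MAX_BYTES = 5 * 1024 * 1024  # 5MB payload ceiling from the bulk step


def chunk_pages(pages, pages_per_chunk=1):
    """Yield consecutive groups of pages so each extraction request stays small."""
    for start in range(0, len(pages), pages_per_chunk):
        yield pages[start:start + pages_per_chunk]


def batch_actions(index, docs, max_docs=MAX_DOCS, max_bytes=MAX_BYTES):
    """Yield NDJSON bulk bodies capped by document count and encoded size.

    A single document larger than max_bytes is still emitted, alone in
    its own batch, so nothing is silently dropped.
    """
    batch, size, count = [], 0, 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index, "_id": doc["doc_id"]}})
        pair = action + "\n" + json.dumps(doc) + "\n"
        pair_bytes = len(pair.encode("utf-8"))
        # Flush before adding when either limit would be exceeded.
        if batch and (count >= max_docs or size + pair_bytes > max_bytes):
            yield "".join(batch)
            batch, size, count = [], 0, 0
        batch.append(pair)
        size += pair_bytes
        count += 1
    if batch:
        yield "".join(batch)


docs = [{"doc_id": f"inv_{i:03d}", "content": "x"} for i in range(1200)]
batches = list(batch_actions("document-archive", docs))
```

Each yielded body would go to POST /_bulk without refresh=true, followed by a single POST /document-archive/_refresh once the last batch has been accepted.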