Document-Aware App with Unified Search

This pattern uses Supabase as the primary CRUD datastore for structured records and for the metadata and text extracted from uploaded documents (via Bailian OCR/document understanding). All records, both native structured data and OCR-extracted content, are then synced into Elasticsearch for unified full-text search across the entire dataset.

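Products involved

Supabase (transactional datastore and file storage)
Bailian / DashScope (document parsing and OCR)
Elasticsearch (unified full-text search index)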

Scenario

Use this workflow when building a document-heavy application where Supabase manages transactional CRUD operations and file metadata, while Bailian extracts structured text from uploaded PDFs or images. The extracted content and native records are synchronized into Elasticsearch to deliver a single, low-latency full-text search interface across both structured and unstructured data.

Integration steps

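Initialize Supabase Schema: Create a documents table (id, file_url, ocr_status, ocr_text) and a records table for native structured data. Enable Row Level Security (RLS) on both tables so that client CRUD access is governed by policies; note that the service role key used by the sync worker bypasses RLS.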
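Trigger Bailian Extraction: On file upload, call POST https://dashscope.aliyuncs.com/api/v1/services/document/document-async/parse with {"model": "doc-parser-v2", "input": {"file_url": "<supabase_storage_url>"}}. The file URL must be readable by Bailian, so use a public URL or a signed URL that outlives the parsing job. Keep the task ID returned by this call for polling.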
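Store Extracted Content: Poll GET /api/v1/tasks/{task_id} until the task reports status: "SUCCEEDED", and treat a failed task as terminal instead of polling indefinitely. Then update the row through the Supabase REST API, which selects rows with a filter rather than a path segment: PATCH /rest/v1/documents?id=eq.{id} with {"ocr_text": "<extracted>", "ocr_status": "ready"}.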
Configure Elasticsearch Index: Create a unified index: PUT /unified-search with {"mappings": {"properties": {"source": {"type": "keyword"}, "content": {"type": "text", "analyzer": "standard"}}}}.
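Sync Supabase Records: Listen for changes via Supabase Realtime. Format each change as an action line followed by a source line and push to ES using POST /_bulk: {"index":{"_index":"unified-search","_id":"<id>"}}\n{"source":"supabase","content":"<record_data>"}. The body is newline-delimited JSON, so send it with Content-Type: application/x-ndjson and terminate it with a final newline.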
Sync OCR Results: When ocr_status="ready", read the stored ocr_text for the affected documents and push it to ES in batches: {"index":{"_index":"unified-search","_id":"<doc_id>_ocr"}}\n{"source":"bailian","content":"<ocr_text>"}.
Execute Unified Query: Search across both: GET /unified-search/_search with {"query":{"multi_match":{"query":"<input>","fields":["content"]}}}.
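
The polling and write-back step can be sketched as follows. This is a minimal illustration, not the Bailian or Supabase client API: get_status is a caller-supplied function that wraps GET /api/v1/tasks/{task_id}, the response shape ({"status": ..., "text": ...}) and the failure state names are assumptions to verify against the Bailian documentation, and build_patch_request is a hypothetical helper that only assembles the REST path and body using the PostgREST filter syntax (id=eq.<value>).

```python
import time
from typing import Any, Callable, Dict, Tuple
from urllib.parse import quote

# Assumption: terminal failure states carry these names; verify against the Bailian docs.
FAILED_STATES = ("FAILED", "CANCELED")


def wait_for_task(
    get_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the task succeeds; raise on failure or when attempts run out."""
    for _ in range(max_attempts):
        task = get_status(task_id)
        status = task.get("status")  # assumed field name, taken from the step text
        if status == "SUCCEEDED":
            return task
        if status in FAILED_STATES:
            raise RuntimeError(f"task {task_id} ended with status {status}")
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")


def build_patch_request(doc_id: str, ocr_text: str) -> Tuple[str, Dict[str, str]]:
    """Return the REST path and JSON body that mark a document as ready."""
    path = f"/rest/v1/documents?id=eq.{quote(str(doc_id), safe='')}"
    return path, {"ocr_text": ocr_text, "ocr_status": "ready"}


# Usage with a fake status function standing in for the real HTTP call.
responses = iter([{"status": "RUNNING"}, {"status": "SUCCEEDED", "text": "invoice 42"}])
task = wait_for_task(lambda _tid: next(responses), "task-1", sleep=lambda _s: None)
path, body = build_patch_request("42", task["text"])
print(path)  # /rest/v1/documents?id=eq.42
print(body)  # {'ocr_text': 'invoice 42', 'ocr_status': 'ready'}
```

Injecting get_status and sleep keeps the loop testable without network access; a real worker would pass functions that perform the HTTP requests with the appropriate API keys.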

Architecture

Supabase serves as the primary transactional datastore and object storage. Bailian acts as an asynchronous processing layer, consuming storage URLs and returning parsed text and tables. A lightweight sync worker bridges both to Elasticsearch, which functions as a read-optimized, denormalized search index. Data flows in one direction (Supabase/Bailian → ES): Supabase remains the ACID-compliant source of truth, while ES holds an eventually consistent copy that serves full-text queries and relevance scoring and can be rebuilt from Supabase at any time.
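
To make the sync worker's role concrete, the sketch below turns one change event into Elasticsearch bulk lines. The event shape (eventType, table, new, old) is an assumption modelled on the payload fields exposed by Supabase Realtime clients, and actions_for_event, es_id, and to_ndjson are hypothetical helper names. It also assumes that rows from the documents table are indexed only as OCR entries and that all other tables are indexed as serialized JSON.

```python
import json
from typing import Any, Dict, Iterable, List

INDEX = "unified-search"


def es_id(table: str, row_id: Any) -> str:
    """OCR entries get an _ocr suffix so they never collide with record IDs."""
    return f"{row_id}_ocr" if table == "documents" else str(row_id)


def actions_for_event(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Translate one change event into bulk lines (action line, optional source line)."""
    kind, table = event["eventType"], event["table"]
    if kind == "DELETE":
        # Delete actions have no source line.
        return [{"delete": {"_index": INDEX, "_id": es_id(table, event["old"]["id"])}}]
    row = event["new"]
    action = {"index": {"_index": INDEX, "_id": es_id(table, row["id"])}}
    if table == "documents":
        # Gate OCR content: nothing is indexed until extraction has finished.
        if row.get("ocr_status") != "ready":
            return []
        return [action, {"source": "bailian", "content": row.get("ocr_text", "")}]
    content = json.dumps(row, ensure_ascii=False, sort_keys=True)
    return [action, {"source": "supabase", "content": content}]


def to_ndjson(lines: Iterable[Dict[str, Any]]) -> str:
    """Serialize bulk lines; every line, including the last, ends with a newline."""
    return "".join(json.dumps(line, ensure_ascii=False) + "\n" for line in lines)


event = {"eventType": "INSERT", "table": "records", "new": {"id": 7, "title": "Q3 report"}}
print(to_ndjson(actions_for_event(event)), end="")
```

Because the index action overwrites by _id, replaying the same event is harmless, which is what allows the index to be rebuilt from the source tables.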

Prerequisites

Supabase project with Storage enabled and service role key
Bailian (DashScope) API key with document parsing permissions
Elasticsearch 8.x or later cluster, reachable over the network from the sync worker, with credentials allowed to call the _bulk and _search APIs
Sync runtime (Supabase Edge Functions or Node.js worker)
IAM roles configured for cross-service access (Storage → Bailian, Worker → ES)

Common pitfalls

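Mapping Conflicts: Supabase JSONB columns whose shape varies between rows (a field that is a string in one row and an object or array in another) cause ES mapping conflicts and rejected documents. Serialize such columns into the content text field or map them with the flattened field type, and set dynamic: "strict" so unexpected fields fail loudly instead of being mapped implicitly.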
Sync Race Conditions: Indexing before Bailian finishes causes partial search results. Gate ES writes behind an ocr_status = "ready" flag.
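Bulk API Throttling: Indexing one document per request fills the ES write thread pool queue and triggers 429 rejections. Group documents into bulk requests of roughly 5–10MB, retry with exponential backoff on 429 responses, and inspect the per-item results, because a bulk request can succeed overall while individual items are rejected.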
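Expired Storage URLs: Supabase signed URLs time out. Store the permanent storage object path in Supabase and in ES metadata, and generate a fresh signed URL with a sufficient lifetime each time a file is handed to Bailian or linked from a search result.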
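
The batching and retry advice above can be sketched as follows. This is an illustration under assumptions: each entry is one action line plus its source line already serialized as a single NDJSON string (so a pair is never split across batches), send is a caller-supplied function that posts a body to /_bulk and returns the HTTP status code, and the helper names and default delays are hypothetical.

```python
import random
import time
from typing import Callable, Iterable, Iterator, List


def batch_by_bytes(entries: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group entries so each batch stays within max_bytes; an oversized entry goes alone."""
    batch: List[str] = []
    size = 0
    for entry in entries:
        n = len(entry.encode("utf-8"))
        if batch and size + n > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(entry)
        size += n
    if batch:
        yield batch


def send_with_backoff(
    send: Callable[[str], int],
    body: str,
    max_retries: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry on HTTP 429 with exponentially growing, slightly jittered delays."""
    for attempt in range(max_retries + 1):
        status = send(body)
        if status != 429:
            return status
        if attempt < max_retries:
            sleep(base_delay_s * 2 ** attempt + random.uniform(0, 0.1))
    return 429  # still throttled after all retries; the caller decides what to do


# Usage with a fake sender that is throttled twice before succeeding.
codes = iter([429, 429, 200])
for batch in batch_by_bytes(["{}\n{}\n"] * 3, max_bytes=10 * 1024 * 1024):
    print(send_with_backoff(lambda _b: next(codes), "".join(batch), sleep=lambda _s: None))
```

This sketch only looks at the overall HTTP status; a production worker would also parse the bulk response and retry the individual items that were rejected.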

Typical questions

search across structured data and PDFs
Supabase app with document OCR and full-text search
extract documents store in Supabase search in ES
unified search over Supabase data plus document OCR
upload PDF, extract content, store in Supabase and sync to ES for search
unified search for app data and scanned documents
Bailian OCR Supabase Elasticsearch pipeline
document management app with Supabase and ES

FAQ

Q: How do I implement unified full-text search across structured data and uploaded documents using Supabase and Elasticsearch?
A: Keep Supabase as the system of record and sync it into a single Elasticsearch index. Bailian OCR extracts text from uploaded files such as PDFs, the extracted text is stored in Supabase alongside the native records, and a sync worker indexes both into Elasticsearch so that one query searches everything.