Document-Aware App with AI Recommendations

This solution describes a full-stack application that takes documents from ingestion to personalization. Supabase (or RDS) is the primary datastore for structured records and document metadata. Bailian OCR extracts content from uploaded PDFs stored in OSS. The extracted text is indexed into Elasticsearch for unified full-text search, OpenSearch adds semantic (RAG) retrieval, and AIRec delivers personalized document and content recommendations based on user behavior.

Products involved

OSS, Bailian (OCR), Supabase or RDS, Elasticsearch, OpenSearch, AIRec

Scenario

Use this combination when building a document intelligence platform requiring end-to-end processing: ingesting raw PDFs, extracting structured text, enabling hybrid full-text and semantic search, and delivering behavior-driven personalized recommendations. Ideal for knowledge bases, legal tech, or enterprise portals where contextual discovery drives user engagement.

Integration steps

Upload to OSS: Push raw files via CLI: ossutil cp ./docs/ oss://my-doc-bucket/raw/ --recursive.
Extract with Bailian: Call POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-parse/async with {"model": "doc-parser-v1", "input": {"file_url": "oss://my-doc-bucket/raw/report.pdf"}}. Poll /api/v1/tasks/{task_id} until status: "SUCCEEDED".
Store in Supabase: Insert metadata and extracted text via REST: POST https://<proj>.supabase.co/rest/v1/documents with headers apikey: <key>, Authorization: Bearer <key>, and Content-Type: application/json, and body {"title": "...", "ocr_text": "..."}.
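Index to Elasticsearch: Sync using the es-ingest-documents pipeline. Logstash config: input { http_poller { urls => { supabase => { method => get url => "https://<proj>.supabase.co/rest/v1/documents" headers => { apikey => "<key>" } } } schedule => { every => "5m" } codec => "json" } } output { elasticsearch { hosts => ["https://<es-endpoint>:9200"] index => "docs-index" document_id => "%{id}" } }. The http_poller input requires a schedule; the 5-minute interval is an example value. The document_id setting assumes the table has an id column, so repeated polls overwrite existing documents instead of duplicating them.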
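Enable Semantic Search in OpenSearch: Create the index with the settings "index.knn": true and "index.knn.algo_param.ef_search": 100, and map vector_field as {"type": "knn_vector", "dimension": 1024} (setting and type names as in the open-source OpenSearch k-NN plugin). Then generate embeddings and index each document: POST https://<os-endpoint>/docs-index/_doc/{id} with {"vector_field": <1024-dim-array>, "text": "..."}.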
Configure AIRec: Register items: POST https://airec.aliyuncs.com/v2/openapi/instances/{inst}/items with {"itemId": "...", "itemType": "document"}. Log interactions: POST .../behaviors with {"behaviorType": "click", "itemId": "..."}.
Query & Recommend: Fetch results: GET https://airec.aliyuncs.com/v2/openapi/instances/{inst}/recommendations?userId={uid}&size=10. Merge with ES/OpenSearch hybrid queries for final UI rendering.
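The asynchronous extraction in the second step amounts to a poll-until-done loop. The following is a minimal Python sketch of that loop, not a definitive client. It assumes the task record exposes a top-level "status" field as written in the step, and that "FAILED" and "CANCELED" are terminal states (an assumption; the source names only "SUCCEEDED"). The names http_fetch_status and poll_task are hypothetical helpers introduced for illustration.

```python
import json
import time
import urllib.request
from typing import Callable

# Path taken from the extraction step; the response shape is an assumption.
TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"


def http_fetch_status(task_id: str, api_key: str) -> dict:
    """Fetch one task record over HTTP (hypothetical helper, needs network)."""
    req = urllib.request.Request(
        TASK_URL.format(task_id=task_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def poll_task(
    fetch: Callable[[str], dict],
    task_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(task_id) until the task succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch(task_id)
        status = task.get("status")  # assumed field name, per the step above
        if status == "SUCCEEDED":
            return task
        if status in ("FAILED", "CANCELED"):  # assumed terminal states
            raise RuntimeError(f"task {task_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        sleep(interval)
```

In an application this might be called as poll_task(lambda tid: http_fetch_status(tid, api_key), task_id). Passing the fetch function in keeps the loop testable without network access.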

Architecture

Raw files reside in OSS. Bailian OCR asynchronously extracts text and metadata, which Supabase persists as structured records. A sync pipeline pushes this data to Elasticsearch for BM25 full-text indexing. Dense vector embeddings are generated from the same corpus and stored in OpenSearch for RAG queries. AIRec operates in parallel, ingesting item metadata and real-time user behavior logs to train a ranking model. The app orchestrates queries across Elasticsearch (keyword), OpenSearch (semantic), and AIRec (personalized) to deliver unified results.
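The architecture does not specify how the three result lists are combined. One common option is reciprocal rank fusion (RRF), which needs only each source's ranking and not its raw scores. The sketch below is an illustration of that swapped-in technique, not the method prescribed here; the function name rrf_merge and the constant k = 60 are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rrf_merge(ranked_lists: Iterable[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Merge ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Assumes ids are unique within each list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


keyword_hits = ["a", "b", "c"]    # e.g. ids from the Elasticsearch query
semantic_hits = ["b", "a", "d"]   # e.g. ids from the OpenSearch k-NN query
personalized = ["b", "e"]         # e.g. ids from AIRec
merged = rrf_merge([keyword_hits, semantic_hits, personalized])
# merged == ["b", "a", "e", "c", "d"]
```

Rank-based fusion avoids having to normalize BM25 scores, vector distances, and recommendation scores onto a common scale.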

Prerequisites

Provisioned OSS bucket, Bailian API key, Supabase project, Elasticsearch & OpenSearch clusters, and AIRec instance.
VPC peering between Supabase, ES, and OpenSearch; IAM roles granting OSS read access to Bailian.
OpenSearch k-NN mapping explicitly matching Bailian’s embedding dimension (e.g., 1024).
AIRec schema pre-configured with itemId, userId, and behaviorType fields.
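The k-NN prerequisite can be checked in application code before any document is sent. The sketch below builds an index body and validates an embedding against it. It assumes an endpoint compatible with the open-source OpenSearch k-NN plugin (knn_vector field type, index.knn settings). The field names vector_field and text come from the integration steps; the helper names are hypothetical.

```python
from typing import Any, Dict, Sequence

EMBEDDING_DIM = 1024  # dimension cited in this guide; confirm it for your embedding model


def build_knn_index_body(dim: int = EMBEDDING_DIM, ef_search: int = 100) -> Dict[str, Any]:
    """Return an index-creation body with k-NN enabled and a vector field of size dim."""
    return {
        "settings": {"index": {"knn": True, "knn.algo_param.ef_search": ef_search}},
        "mappings": {
            "properties": {
                "vector_field": {"type": "knn_vector", "dimension": dim},
                "text": {"type": "text"},
            }
        },
    }


def check_vector(vector: Sequence[float], index_body: Dict[str, Any]) -> None:
    """Raise ValueError if the embedding length differs from the mapped dimension."""
    expected = index_body["mappings"]["properties"]["vector_field"]["dimension"]
    if len(vector) != expected:
        raise ValueError(f"embedding has {len(vector)} dims, index expects {expected}")


body = build_knn_index_body()
check_vector([0.0] * 1024, body)  # passes silently
```

The body would be sent once when creating the index, and check_vector would run before each write so that a dimension mismatch surfaces as a clear client-side error.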

Common pitfalls

Vector dimension mismatch: Bailian defaults to 1024 dims; OpenSearch k-NN must explicitly match or ingestion fails with IllegalArgumentException.
Supabase-to-ES sync lag: Polling adds latency, and direct webhooks can cause race conditions under load. Use Supabase Realtime plus a message queue to batch ES writes.
AIRec cold start: Recommendations return empty until ~500 interactions are logged. Seed with popularity-based fallback queries initially.
OCR text truncation: Bailian caps output at 10k tokens/page. Chunk documents before indexing to preserve full context in ES.
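Two of the pitfalls above, OCR truncation and AIRec cold start, can be handled in application code. The sketch below is one possible approach under stated assumptions: chunk_text splits on whitespace and uses word counts as a rough proxy for tokens, and recommend_with_fallback takes two caller-supplied functions (hypothetical stand-ins for an AIRec call and a popularity query). None of these names come from a real SDK.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def recommend_with_fallback(
    personalized: Callable[[str], List[str]],
    popular: Callable[[], List[str]],
    user_id: str,
    size: int = 10,
) -> List[str]:
    """Return personalized ids, topped up with popular ids when too few come back."""
    items = list(personalized(user_id))[:size]
    if len(items) < size:
        seen = set(items)
        for item in popular():
            if item not in seen:
                items.append(item)
                seen.add(item)
            if len(items) == size:
                break
    return items


sample = " ".join(f"w{i}" for i in range(10))
parts = chunk_text(sample, max_words=4, overlap=1)
# parts == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

The overlap keeps sentences that straddle a chunk boundary searchable from either side, and the fallback returns a full list even while the recommendation model has little behavior data.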

Typical questions

Supabase app with OCR search and personalized recommendations
document management platform with AI recommendations
upload PDFs extract search and recommend documents
full-stack document intelligence app with personalization
Bailian OCR Supabase ES OpenSearch AIRec pipeline
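document management app with intelligent recommendations
Supabase document OCR with semantic search and personalized recommendations
complete app from document upload to intelligent recommendations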

FAQ

Q: How does the Document-Aware App with AI Recommendations architecture handle document ingestion, search, and personalization?
A: Uploaded PDFs are stored in OSS and processed by Bailian OCR. Supabase or RDS holds the metadata and extracted text, which is indexed into Elasticsearch for unified full-text search. OpenSearch layers semantic retrieval on top, and AIRec delivers personalized suggestions based on user behavior.