Document-Aware App with AI Recommendations

A developer builds a full-stack application that uses Supabase or RDS as the primary datastore for structured records and document metadata. Bailian OCR extracts content from uploaded PDFs stored in OSS, Elasticsearch indexes the extracted text for unified full-text search, OpenSearch adds semantic/RAG retrieval, and AIRec delivers personalized document and content recommendations based on user behavior. The result is a complete document intelligence platform, from ingestion to personalization.

Products involved

Scenario

Use this combination when building a document intelligence platform that needs end-to-end processing: ingesting raw PDFs, extracting structured text, enabling hybrid full-text and semantic search, and delivering behavior-driven personalized recommendations. It suits knowledge bases, legal tech, and enterprise portals where contextual discovery drives user engagement.

Integration steps

  1. Upload to OSS: Push raw files via CLI: ossutil cp ./docs/ oss://my-doc-bucket/raw/ --recursive.
  2. Extract with Bailian: Call POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-parse/async with {"model": "doc-parser-v1", "input": {"file_url": "oss://my-doc-bucket/raw/report.pdf"}}. Poll /api/v1/tasks/{task_id} until status: "SUCCEEDED".
  3. Store in Supabase: Insert metadata and extracted text via REST: POST https://<proj>.supabase.co/rest/v1/documents with headers apikey: <key>, Authorization: Bearer <key>, and Content-Type: application/json, and body {"title": "...", "ocr_text": "..."}.
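  4. Index to Elasticsearch: Sync using the es-ingest-documents pipeline. The http_poller input requires a schedule, and the Supabase endpoint expects the apikey header, so the Logstash config needs both (the 5-minute interval is illustrative): input { http_poller { urls => { supabase => { url => "https://<proj>.supabase.co/rest/v1/documents" headers => { "apikey" => "<key>" } } } schedule => { every => "5m" } codec => "json" } } output { elasticsearch { hosts => ["https://<es-endpoint>:9200"] index => "docs-index" } }.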
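  5. Enable Semantic Search in OpenSearch: Create the index with the settings "knn": true and "knn.algo_param.ef_search": 100 (these are index settings, not field mappings), and map vector_field as a vector field whose dimension matches the embeddings (1024 here). Then generate embeddings and index each document: POST https://<os-endpoint>/docs-index/_doc/{id} with {"vector_field": <1024-dim-array>, "text": "..."}.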
  6. Configure AIRec: Register items: POST https://airec.aliyuncs.com/v2/openapi/instances/{inst}/items with {"itemId": "...", "itemType": "document"}. Log interactions: POST .../behaviors with {"behaviorType": "click", "itemId": "..."}.
  7. Query & Recommend: Fetch results: GET https://airec.aliyuncs.com/v2/openapi/instances/{inst}/recommendations?userId={uid}&size=10. Merge with ES/OpenSearch hybrid queries for final UI rendering.
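
The polling in step 2 can be sketched as a small loop with a retry limit. This is a minimal illustration, not Bailian's client: the function name poll_task, the flat {"status": ...} response shape, the failure states, and the default interval and attempt count are all assumptions. The HTTP call is passed in as fetch_status, so the loop can be exercised without network access.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal failure states; check the service's documentation for the real set.
TERMINAL_FAILURES = {"FAILED", "CANCELED"}


def poll_task(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(task_id) until the task reports SUCCEEDED.

    fetch_status is a hypothetical callable that performs the GET on
    /api/v1/tasks/{task_id} and returns the parsed JSON as a dict
    containing a "status" key (assumed shape).
    """
    for _ in range(max_attempts):
        task = fetch_status(task_id)
        status = task.get("status")
        if status == "SUCCEEDED":
            return task
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"task {task_id} ended with status {status}")
        # Any other status is treated as still in progress.
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Injecting fetch_status and sleep keeps the retry policy separate from the HTTP client, so either can be swapped without touching the loop.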

Architecture

Raw files reside in OSS. Bailian OCR asynchronously extracts text and metadata, which Supabase persists as structured records. A sync pipeline pushes this data to Elasticsearch for BM25 full-text indexing. Dense vector embeddings generated from the same corpus are stored in OpenSearch for RAG queries. AIRec operates in parallel, ingesting item metadata and real-time user behavior logs to train a ranking model. The app orchestrates queries across ES (keyword), OpenSearch (semantic), and AIRec (personalized) to deliver unified results.
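
The text does not say how the app combines the three result lists. One common option is reciprocal rank fusion (RRF), sketched below as an assumption and not as the method this architecture prescribes. The function name, the constant k = 60, and the sample document IDs are illustrative. Each input is a list of document IDs ordered best first, as might be returned by ES, OpenSearch, and AIRec.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists into one list using RRF.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Ties are broken by document ID so the
    output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the three systems.
keyword_hits = ["a", "b", "c"]    # ES, BM25
semantic_hits = ["b", "a", "d"]   # OpenSearch, vector search
personalized = ["b", "e"]         # AIRec

merged = reciprocal_rank_fusion([keyword_hits, semantic_hits, personalized])
print(merged)  # ['b', 'a', 'e', 'c', 'd']
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores, vector similarities, and recommendation scores onto a common scale.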

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How does the Document-Aware App with AI Recommendations architecture handle document ingestion, search, and personalization?

A: The architecture integrates Supabase or RDS for metadata, Bailian OCR for PDF extraction, Elasticsearch and OpenSearch for full-text and semantic search, and AIRec for personalized recommendations. Uploaded PDFs are stored in OSS and processed by Bailian OCR before indexing into Elasticsearch for unified search. OpenSearch then layers semantic retrieval on top, while AIRec delivers personalized suggestions based on user behavior to complete the pipeline.