DaaS / Products / Document-Aware App with RAG Recommendations

Document-Aware App with RAG Recommendations

A developer builds a full-stack application using Supabase as the primary CRUD datastore, extracts text from uploaded documents (PDFs, scanned images) via Bailian's document understanding, indexes content into Elasticsearch for unified full-text search, builds a RAG knowledge base for conversational retrieval, and layers AIRec-powered semantic recommendations on top — delivering an intelligent document management platform with personalized search, chatbot Q&A, and content suggestions.

Products involved

Scenario

Use this integration when building an intelligent document management platform that requires automated OCR extraction, unified full-text/vector search, conversational RAG Q&A, and personalized content discovery. It bridges raw unstructured files with a Supabase-backed CRUD layer, Elasticsearch indexing, and AIRec-driven semantic recommendations.

Integration steps

  1. Upload & Extract: Push PDFs/scans to OSS. Trigger Bailian via POST https://dashscope.aliyuncs.com/api/v1/services/document-understanding/async/process with {"oss_uri": "oss://<bucket>/<file>", "task_type": "ocr_layout_analysis"}.
  2. Store Metadata in Supabase: On status: SUCCEEDED, insert structured output via supabase.from('documents').insert({ id: uuid, title, extracted_text, oss_url }).
  3. Index in Elasticsearch: Sync text to ES using _bulk. Define mapping: {"mappings": {"properties": {"content": {"type": "text", "analyzer": "ik_max_word"}, "embedding": {"type": "dense_vector", "dims": 1024}}}}.
  4. Generate Embeddings: Chunk extracted_text and call Bailian POST /v1/services/embedding/text-embedding/v2 with {"model": "text-embedding-v2", "input": {"texts": ["<chunk>"]}}. Update ES embedding field via POST /_update/<index>/_doc/<id>.
  5. Build RAG Pipeline: Query ES with hybrid retrieval: {"query": {"knn": {"field": "embedding", "query_vector": <vec>, "k": 5, "filter": {"match": {"content": "<query>"}}}}}. Inject top hits into LLM context.
  6. Enable AIRec Recommendations: Log interactions via POST https://airec.cn-shanghai.aliyuncs.com/v2/openapi/instances/<id>/actions with {"action_type": "click", "item_id": "<doc_id>", "user_id": "<uid>"}. Train a semantic rule targeting document_id to surface related files.

Architecture

Raw files land in OSS and trigger Bailian for async OCR/layout extraction. Extracted text and metadata persist in Supabase as the system of record. A background worker chunks text, generates dense vectors via Bailian, and pushes both to Elasticsearch. The RAG layer queries ES using hybrid BM25+KNN retrieval, while AIRec ingests real-time interaction logs to train a personalized recommendation model that surfaces contextually relevant documents alongside search results.

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How can I build a full-stack document application with PDF extraction, search, RAG chatbot, and AI recommendations? A: You can build this application by using Supabase as the primary datastore, Bailian for extracting text from PDFs, Elasticsearch for indexing and search, and AIRec for semantic recommendations. This combination delivers an intelligent platform featuring unified full-text search, a RAG-based chatbot for conversational Q&A, and personalized content suggestions.