Document-Aware App with RAG Recommendations

A developer builds a full-stack application with Supabase as the primary CRUD datastore. Text is extracted from uploaded documents (PDFs, scanned images) with Bailian's document understanding, indexed into Elasticsearch for unified full-text search, and used to build a RAG knowledge base for conversational retrieval. AIRec-powered semantic recommendations are layered on top. The result is an intelligent document management platform with personalized search, chatbot Q&A, and content suggestions.

Products involved

Supabase, Bailian, OSS, Elasticsearch, AIRec

Scenario

Use this integration when building an intelligent document management platform that requires automated OCR extraction, unified full-text/vector search, conversational RAG Q&A, and personalized content discovery. It bridges raw unstructured files with a Supabase-backed CRUD layer, Elasticsearch indexing, and AIRec-driven semantic recommendations.

Integration steps

Upload & Extract: Push PDFs/scans to OSS. Trigger Bailian via POST https://dashscope.aliyuncs.com/api/v1/services/document-understanding/async/process with {"oss_uri": "oss://<bucket>/<file>", "task_type": "ocr_layout_analysis"}.
Store Metadata in Supabase: On status: SUCCEEDED, insert structured output via supabase.from('documents').insert({ id: uuid, title, extracted_text, oss_url }).
Index in Elasticsearch: Sync text to ES using _bulk. Define mapping: {"mappings": {"properties": {"content": {"type": "text", "analyzer": "ik_max_word"}, "embedding": {"type": "dense_vector", "dims": 1024}}}}.
Generate Embeddings: Chunk extracted_text and call Bailian POST /v1/services/embedding/text-embedding/v2 with {"model": "text-embedding-v2", "input": {"texts": ["<chunk>"]}}. Write each vector to the ES embedding field with a partial update: POST /<index>/_update/<id> with {"doc": {"embedding": <vec>}}.
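Build RAG Pipeline: Query ES with a filtered kNN search: {"query": {"knn": {"field": "embedding", "query_vector": <vec>, "k": 5, "filter": {"match": {"content": "<query>"}}}}}. In this form the match clause only restricts which documents are eligible as nearest neighbors; it does not contribute a BM25 score. For scored hybrid retrieval, send a BM25 query on content together with the vector search. Inject the top hits into the LLM context.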
Enable AIRec Recommendations: Log interactions via POST https://airec.cn-shanghai.aliyuncs.com/v2/openapi/instances/<id>/actions with {"action_type": "click", "item_id": "<doc_id>", "user_id": "<uid>"}. Train a semantic rule targeting document_id to surface related files.
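
The indexing and embedding-update steps above can be sketched as plain request builders. This is a minimal sketch under stated assumptions, not an official client: the index name documents, the helper names build_bulk_body and update_embedding_request, and the input document shape are hypothetical. It relies only on two standard Elasticsearch REST conventions, the newline-delimited _bulk body and the POST /<index>/_update/<id> partial-update path.

```python
import json

INDEX_NAME = "documents"  # assumed index name; the steps above do not name one


def build_bulk_body(docs, index=INDEX_NAME):
    """Return an NDJSON string for the Elasticsearch _bulk endpoint.

    Each document contributes two lines: an action line and a source line.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps({"content": doc["extracted_text"]}, ensure_ascii=False))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def update_embedding_request(doc_id, vector, index=INDEX_NAME):
    """Return (method, path, body) for a partial update of the embedding field."""
    return "POST", f"/{index}/_update/{doc_id}", {"doc": {"embedding": vector}}
```

The returned values can be passed to whichever HTTP or Elasticsearch client the project already uses; keeping the builders free of network calls makes them easy to unit test.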

Architecture

Raw files land in OSS and trigger Bailian for async OCR/layout extraction. Extracted text and metadata persist in Supabase as the system of record. A background worker chunks text, generates dense vectors via Bailian, and pushes both to Elasticsearch. The RAG layer queries ES using hybrid BM25+KNN retrieval, while AIRec ingests real-time interaction logs to train a personalized recommendation model that surfaces contextually relevant documents alongside search results.
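
One way to express the hybrid BM25+KNN retrieval described above is a single search body that carries both a query clause (BM25 on content) and a top-level knn clause (vector search on embedding), which Elasticsearch 8.x combines into one score. The sketch below only builds the request body and the prompt context. The names build_hybrid_search and build_context, the num_candidates value of 50, and the character budget are assumptions for illustration, and the options available depend on the cluster version.

```python
def build_hybrid_search(query_text, query_vector, k=5, num_candidates=50):
    """Build a _search body that combines BM25 (query) and kNN (knn) retrieval."""
    return {
        "size": k,
        "query": {"match": {"content": query_text}},
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,  # assumed value, tune per cluster
        },
        "_source": ["content"],
    }


def build_context(hits, max_chars=2000):
    """Join hit contents, highest score first, within a character budget."""
    ordered = sorted(hits, key=lambda h: h["_score"], reverse=True)
    parts, used = [], 0
    for hit in ordered:
        text = hit["_source"]["content"]
        if used + len(text) > max_chars:
            break  # stop before exceeding the budget
        parts.append(text)
        used += len(text)
    return "\n---\n".join(parts)
```

The hits argument is expected to be the list found under hits.hits in a search response; the resulting string is what gets injected into the LLM prompt.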

Prerequisites

Alibaba Cloud account with Bailian, OSS, Elasticsearch, and AIRec instances provisioned
Supabase project with documents and user_interactions tables
DASHSCOPE_API_KEY and AIREC_ACCESS_KEY configured in environment
ES cluster with ik analyzer and dense_vector field support enabled
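Node.js or Python runtime with a Supabase client (@supabase/supabase-js on Node.js), an Elasticsearch client (@elastic/elasticsearch on Node.js, elasticsearch on Python), and the dashscope SDK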
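
Before any step runs, it can help to fail fast when a credential is missing. A minimal sketch, assuming the two variable names listed above are read from the process environment; the function name missing_env_vars is hypothetical.

```python
import os

REQUIRED_ENV_VARS = ("DASHSCOPE_API_KEY", "AIREC_ACCESS_KEY")


def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling missing_env_vars() at worker startup and aborting when the list is non-empty surfaces configuration problems before the first upload is processed.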

Common pitfalls

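Vector dimension mismatch: The dims value in the ES mapping must equal the output dimension of the embedding model you call (the mapping above assumes 1024; verify this against the model's documentation). If the mapping declares a different size, such as 768, indexing fails. Explicitly align dims in the index template.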
IAM permission gaps: If the service role used by Bailian's async processor lacks oss:GetObject, it cannot read the uploaded file and extraction tasks time out. Attach AliyunOSSReadOnlyAccess to the service role.
AIRec cold start: Recommendations return empty until ≥1,000 interaction events are logged. Seed with synthetic click/view events during QA.
Chunk overlap loss: Retrieved chunks can cut off mid-sentence when chunking uses fixed-size windows without overlap. Use overlap=128 and split on sentence boundaries.
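
The chunking advice above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it measures sizes in characters (the text does not say whether overlap=128 means characters or tokens), splits on a simple punctuation pattern that also covers Chinese sentence marks, and the names split_sentences and chunk_text and the default max_chars=512 are assumptions.

```python
import re

# Latin punctuation must be followed by whitespace; CJK marks need none.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def split_sentences(text):
    """Split text into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_text(text, max_chars=512, overlap=128):
    """Pack whole sentences into chunks of at most max_chars characters.

    Each new chunk starts with the trailing sentences of the previous chunk,
    up to `overlap` characters. A single sentence longer than max_chars
    becomes its own oversized chunk. Sentences are joined with a space.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried, size = [], 0
            for prev in reversed(current):
                if size + len(prev) > overlap:
                    break
                carried.insert(0, prev)
                size += len(prev)
            # Drop the overlap if it would push the new chunk over the limit.
            if len(" ".join(carried + [sentence])) > max_chars:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are built from whole sentences, no chunk starts or ends mid-sentence, and the carried-over sentences give the retriever shared context across neighboring chunks.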

Typical questions

build document app with search and recommendations
Supabase app with PDF extraction and RAG chatbot
document management platform with personalized recommendations
upload PDFs extract search and recommend content
full-stack doc app with RAG and AIRec
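Supabase plus document OCR plus RAG plus intelligent recommendations
build a document management app and integrate RAG and a recommendation system
all-in-one PDF upload, extraction, search, recommendations, and chatbot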

FAQ

Q: How can I build a full-stack document application with PDF extraction, search, RAG chatbot, and AI recommendations?
A: You can build this application by using Supabase as the primary datastore, Bailian for extracting text from PDFs, Elasticsearch for indexing and search, and AIRec for semantic recommendations. This combination delivers an intelligent platform featuring unified full-text search, a RAG-based chatbot for conversational Q&A, and personalized content suggestions.