Document AI: OCR to RAG to Recommendations

A developer uploads raw scanned documents to OSS, uses Bailian OCR to extract text and structured data, deploys OpenSearch embedding models to vectorize content for RAG retrieval, then layers AIRec personalized recommendations on top. The result is a complete pipeline from raw scans to personalized document discovery.

Products involved

OSS, Bailian (called through DashScope), Elasticsearch, OpenSearch, AIRec

Scenario

This pipeline fits enterprise knowledge bases and digital libraries where legacy scanned documents must become discoverable. It connects unstructured physical archives to modern AI workflows, letting developers deliver context-aware semantic search (RAG) and personalized content recommendations without manual data labeling.

Integration steps

  1. Upload to OSS: Push raw PDFs/images to your bucket.

       ossutil cp ./scanned/ oss://my-doc-bucket/raw/ --recursive

  2. Extract with Bailian: Call bailian-extract-documents via DashScope.

       POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-ocr
       Payload: {"model": "qwen-vl-max", "input": {"oss_uri": "oss://my-doc-bucket/raw/doc1.pdf"}, "parameters": {"output_format": "json"}}

  3. Ingest to Elasticsearch: Use es-ingest-documents to map extracted JSON.

       PUT /documents/_bulk
       {"index": {"_id": "doc1"}}
       {"title": "...", "content": "...", "tags": ["finance", "2024"]}

  4. Vectorize with OpenSearch: Load an embedding model and index dense vectors.

       POST /_plugins/_ml/models/_load
       {"model_id": "text-embedding-v3"}

     Map vector_field as knn_vector (dims: 1024) and bulk-index embeddings.

  5. Sync to AIRec: Push item metadata using the AIRec PushItems API.

       POST /v2/openapi/instances/{instance_id}/scenes/{scene_id}/items

     Include item_id, category, status, and features in the request body.

  6. Serve Recommendations: Call the AIRec Recommend API, passing user context and RAG-retrieved document IDs to rank personalized suggestions.
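
The request bodies for the Bailian, Elasticsearch, and OpenSearch steps above can be assembled programmatically. The following is a minimal Python sketch that only builds the payloads and sends nothing. The function names are hypothetical. The OCR payload shape is copied from the Bailian step as written here and has not been checked against the DashScope API. The index mapping assumes the open-source OpenSearch k-NN plugin's knn_vector field type, and the "id" key on each extracted record is an assumption made for illustration.

```python
import json


def build_ocr_request(oss_uri, model="qwen-vl-max"):
    """Payload for the Bailian extraction call (shape taken from the step above)."""
    return {
        "model": model,
        "input": {"oss_uri": oss_uri},
        "parameters": {"output_format": "json"},
    }


def build_bulk_body(docs):
    """Build an NDJSON body for the _bulk endpoint.

    Each document contributes two lines: an action line carrying the _id,
    then the source line. The bulk API requires a trailing newline.
    Assumes every record has an "id" key (hypothetical field name).
    """
    lines = []
    for doc in docs:
        source = {k: v for k, v in doc.items() if k != "id"}
        lines.append(json.dumps({"index": {"_id": doc["id"]}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


def build_knn_mapping(field="vector_field", dims=1024):
    """Index definition with a dense vector field (assumes the OpenSearch k-NN plugin)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {field: {"type": "knn_vector", "dimension": dims}}
        },
    }


if __name__ == "__main__":
    docs = [{"id": "doc1", "title": "...", "content": "...", "tags": ["finance", "2024"]}]
    print(json.dumps(build_ocr_request("oss://my-doc-bucket/raw/doc1.pdf")))
    print(build_bulk_body(docs), end="")
    print(json.dumps(build_knn_mapping()))
```

Keeping payload construction separate from the HTTP calls makes each stage easy to unit-test before credentials or endpoints are wired in.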

Architecture

Data flows in one direction, from storage to recommendation. OSS acts as the durable landing zone for raw scans. Bailian processes files asynchronously and returns structured JSON (text, tables, layout). Elasticsearch stores the cleaned content for fast full-text filtering and metadata queries. OpenSearch hosts the dense vector index and handles the semantic similarity searches that supply RAG context. Finally, AIRec consumes item metadata and interaction logs to generate personalized recommendation feeds, completing the path from ingestion to discovery.
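
To make the last two stages concrete, here is a small, self-contained Python sketch of semantic retrieval followed by personalized re-ranking. It is an illustration only: the brute-force cosine search stands in for the OpenSearch vector index, the user_affinity dictionary stands in for scores that AIRec would produce, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k=3):
    """Return the k (doc_id, similarity) pairs closest to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def rerank(candidates, user_affinity):
    """Re-order retrieved candidates by a per-user score (stand-in for AIRec).

    Documents with no known affinity default to 0.0 and sort last.
    """
    return sorted(
        candidates,
        key=lambda pair: user_affinity.get(pair[0], 0.0),
        reverse=True,
    )


if __name__ == "__main__":
    index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    hits = retrieve([1.0, 0.0], index, k=2)
    ranked = rerank(hits, {"c": 0.9, "a": 0.1})
    print([doc_id for doc_id, _ in ranked])
```

In this toy run, retrieval narrows the corpus to the two documents most similar to the query, and the affinity scores then decide their final order, which mirrors how RAG-retrieved document IDs are handed to the recommendation stage.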

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: What is the complete pipeline for processing scanned documents through OCR, RAG, and personalized recommendations?

A: The pipeline begins by uploading raw scanned documents to OSS, where Bailian OCR extracts the text and structured data. OpenSearch embedding models then vectorize this content for RAG retrieval, and AIRec is layered on top to generate personalized document recommendations.