Document AI: OCR to RAG to Recommendations

A developer uploads raw scanned documents to OSS, extracts text and structured data with Bailian OCR, indexes the cleaned content in Elasticsearch, deploys an OpenSearch embedding model to vectorize that content for RAG retrieval, and then layers AIRec personalized recommendations on top. The result is a complete pipeline from raw scans to intelligent, personalized document discovery.

Products involved

OSS, Bailian (DashScope), Elasticsearch, OpenSearch, AIRec

Scenario

This pipeline is essential when building enterprise knowledge bases or digital libraries where legacy scanned documents must become discoverable. It bridges unstructured physical archives with modern AI workflows, enabling developers to deliver context-aware semantic search (RAG) and personalized content recommendations without manual data labeling.

Integration steps

Upload to OSS: Push raw PDFs/images to your bucket.

ossutil cp ./scanned/ oss://my-doc-bucket/raw/ --recursive

Extract with Bailian: Call bailian-extract-documents via DashScope.

POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-ocr
Payload: {"model": "qwen-vl-max", "input": {"oss_uri": "oss://my-doc-bucket/raw/doc1.pdf"}, "parameters": {"output_format": "json"}}
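As a rough sketch, the request above could be assembled in Python using only the standard library. The endpoint, model name, and payload fields are copied from this step and have not been checked against the DashScope API reference. The helper name build_extract_request is hypothetical, and sending the key as a Bearer token is an assumption.

```python
import json
import os
import urllib.request

# Endpoint and payload shape are copied from the step above; treat them as
# placeholders until checked against the DashScope API reference.
OCR_ENDPOINT = "https://dashscope.aliyuncs.com/api/v1/services/aigc/document-ocr"


def build_extract_request(oss_uri: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the extraction request for one document."""
    payload = {
        "model": "qwen-vl-max",
        "input": {"oss_uri": oss_uri},
        "parameters": {"output_format": "json"},
    }
    return urllib.request.Request(
        OCR_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the API key is passed as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_extract_request(
        "oss://my-doc-bucket/raw/doc1.pdf",
        os.environ.get("DASHSCOPE_API_KEY", "<your-key>"),
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=60)
```

Separating request construction from sending keeps the payload easy to inspect and unit-test before any network call is made.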

Ingest to Elasticsearch: Use es-ingest-documents to map extracted JSON.

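POST /documents/_bulk with a newline-delimited JSON body: one action line followed by one document line per document, each terminated by a newline.
{"index": {"_id": "doc1"}}
{"title": "...", "content": "...", "tags": ["finance", "2024"]}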
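The bulk body format is easy to get wrong by hand, so here is a minimal sketch of a serializer for it. The function name build_bulk_body and the "id" key on the input dictionaries are hypothetical; only the action-line/document-line layout comes from the bulk API itself.

```python
import json
from typing import Iterable


def build_bulk_body(docs: Iterable[dict], id_field: str = "id") -> str:
    """Serialize documents into a newline-delimited body for _bulk.

    Each document produces two lines: an "index" action line carrying the
    document ID, then the document source. Every line, including the last,
    ends with a newline.
    """
    lines = []
    for doc in docs:
        # The ID travels in the action line, not in the stored source.
        source = {k: v for k, v in doc.items() if k != id_field}
        lines.append(json.dumps({"index": {"_id": doc[id_field]}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    body = build_bulk_body(
        [{"id": "doc1", "title": "Q1 report", "tags": ["finance", "2024"]}]
    )
    print(body, end="")
```

The resulting string is sent as the request body with the Content-Type header set to application/x-ndjson.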

Vectorize with OpenSearch: Load an embedding model and index dense vectors.

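POST /_plugins/_ml/models/{model_id}/_load, where {model_id} identifies the registered embedding model (for example text-embedding-v3). In OpenSearch 2.7 and later the same operation is named _deploy.
Map vector_field as knn_vector (dimension: 1024) and bulk-index embeddings.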
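A minimal sketch of the mapping step, paired with a client-side dimension check that runs before bulk indexing. The field name vector_field and the dimension 1024 come from this step; the helper names knn_index_body and check_dimension, and the extra content field, are hypothetical.

```python
from typing import Sequence

EMBEDDING_DIM = 1024  # dimension stated in the step above


def knn_index_body(dim: int = EMBEDDING_DIM) -> dict:
    """Return index settings and mappings for a k-NN enabled index."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "vector_field": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def check_dimension(
    doc_id: str, vector: Sequence[float], dim: int = EMBEDDING_DIM
) -> None:
    """Raise before indexing if an embedding does not match the mapping."""
    if len(vector) != dim:
        raise ValueError(
            f"{doc_id}: embedding has {len(vector)} dimensions, expected {dim}"
        )


if __name__ == "__main__":
    print(knn_index_body()["mappings"]["properties"]["vector_field"])
    check_dimension("doc1", [0.0] * EMBEDDING_DIM)  # passes silently
```

The dictionary returned by knn_index_body would be the body of the index-creation request. Validating on the client surfaces a mismatch as an exception at the point where the embedding is produced, rather than as a per-item error buried in a bulk response.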

Sync to AIRec: Push item metadata using AIRec PushItems API.

POST /v2/openapi/instances/{instance_id}/scenes/{scene_id}/items with item_id, category, status, and features.
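For illustration, an indexed Elasticsearch document could be mapped to an item payload along these lines. Only the path template and the four field names (item_id, category, status, features) come from this step. The helper names, the choice of the first tag as the category, and the shape of features are assumptions, so check them against the AIRec item schema before use.

```python
# Path template copied from the step above.
ITEMS_PATH_TEMPLATE = "/v2/openapi/instances/{instance_id}/scenes/{scene_id}/items"


def items_path(instance_id: str, scene_id: str) -> str:
    """Fill in the path template for one instance and scene."""
    return ITEMS_PATH_TEMPLATE.format(instance_id=instance_id, scene_id=scene_id)


def to_airec_item(es_doc_id: str, es_source: dict, status: str) -> dict:
    """Map an indexed document to an item payload (hypothetical shape)."""
    tags = es_source.get("tags", [])
    return {
        "item_id": es_doc_id,
        # Assumption: the first tag doubles as the category.
        "category": tags[0] if tags else "",
        # The caller supplies whatever status value the instance expects.
        "status": status,
        # Assumption: features is a free-form object of content attributes.
        "features": {"title": es_source.get("title", ""), "tags": tags},
    }


if __name__ == "__main__":
    source = {"title": "Q1 report", "tags": ["finance", "2024"]}
    print(items_path("my-instance", "my-scene"))
    print(to_airec_item("doc1", source, status="active"))
```

Reusing the Elasticsearch document ID as item_id keeps the two catalogs joinable, so IDs returned by RAG retrieval can be passed to the recommendation call without a lookup table.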

Serve Recommendations: Call AIRec Recommend API, passing user context and RAG-retrieved document IDs to rank personalized suggestions.

Architecture

Data flows in one direction, from storage to intelligence. OSS acts as the durable landing zone for raw scans. Bailian processes files asynchronously, returning structured JSON (text, tables, layout). Elasticsearch stores cleaned content for fast full-text filtering and metadata queries. OpenSearch hosts the dense vector index, handling semantic similarity searches for RAG context retrieval. Finally, AIRec consumes item metadata and interaction logs to generate personalized recommendation feeds, completing the path from ingestion to discovery.

Prerequisites

Provisioned Alibaba Cloud instances: OSS bucket, Elasticsearch cluster, OpenSearch cluster, Bailian (DashScope) workspace, AIRec instance.
RAM role with the AliyunOSSFullAccess, AliyunElasticsearchFullAccess, AliyunOpenSearchFullAccess, and AliyunAIRecFullAccess policies attached.
DASHSCOPE_API_KEY exported in your runtime environment.
OpenSearch ML plugin enabled with a compatible embedding model pre-downloaded.
AIRec scene_id and instance_id configured with a valid recommendation strategy.

Common pitfalls

OCR Layout Fragmentation: Bailian returns page-level chunks by default. Failing to merge overlapping blocks or strip headers creates noisy ES documents that degrade RAG accuracy.
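Vector Dimension Mismatch: OpenSearch requires exact dimension alignment (e.g., 1024). Embeddings with a different length are rejected per item during bulk indexing. Because the bulk response can still return HTTP 200, the failures go unnoticed unless you check its errors flag, and the affected documents never appear in knn results.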
ES-to-AIRec Sync Latency: Pushing items before ES _refresh completes results in stale catalogs. Implement a webhook to trigger AIRec sync only after successful indexing.
AIRec Cold Start: New archives lack interaction history. Without seeding user_behavior logs or enabling content-based fallback rules, AIRec returns empty or generic recommendations.
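To illustrate the first pitfall, here is a minimal sketch of merging page-level output while dropping running headers and footers. It assumes each page has already been reduced to a list of text lines, which is an assumption about the extraction output rather than a documented Bailian format. The function name merge_pages and the repeat threshold are hypothetical. It does not deduplicate overlapping blocks.

```python
import math
from collections import Counter


def merge_pages(pages: list[list[str]], min_repeat_ratio: float = 0.6) -> str:
    """Join page-level line lists into one text, dropping repeated edge lines.

    A line is treated as a running header or footer when it is the first or
    last line of a page and occurs as an edge line on at least
    min_repeat_ratio of the pages (and on at least two pages).
    """
    edge_counts: Counter = Counter()
    for lines in pages:
        if lines:
            # A set avoids double-counting on single-line pages.
            edge_counts.update({lines[0].strip(), lines[-1].strip()})

    threshold = max(2, math.ceil(len(pages) * min_repeat_ratio))
    boilerplate = {text for text, n in edge_counts.items() if n >= threshold}

    merged = []
    for lines in pages:
        for i, line in enumerate(lines):
            text = line.strip()
            is_edge = i == 0 or i == len(lines) - 1
            if not text or (is_edge and text in boilerplate):
                continue
            merged.append(text)
    return "\n".join(merged)


if __name__ == "__main__":
    pages = [
        ["ACME Confidential", "Revenue grew.", "Costs fell."],
        ["ACME Confidential", "Outlook stable."],
    ]
    print(merge_pages(pages))
```

Footers that change on every page, such as page numbers, are not caught by exact matching and would need a pattern-based rule in addition.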

Typical questions

full pipeline from scanned documents to recommendations
OCR extract then RAG then personalized recommendations
scanned PDFs to semantic search and AIRec
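complete pipeline from scanned-document OCR to RAG retrieval to intelligent recommendations
end-to-end pipeline for document processing plus semantic search plus recommendations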
upload PDFs build RAG add recommendations end-to-end
raw scans to personalized knowledge base
OSS Bailian OpenSearch RAG AIRec complete pipeline

FAQ

Q: What is the complete pipeline for processing scanned documents through OCR, RAG, and personalized recommendations?
A: The complete pipeline begins by uploading raw scanned documents to OSS, where Bailian OCR extracts the text and structured data. OpenSearch embedding models then vectorize this content for RAG retrieval, and AIRec is layered on top to generate personalized document recommendations.