A developer uploads raw scanned documents to OSS, uses Bailian OCR to extract text and structured data, deploys OpenSearch embedding models to vectorize content for RAG retrieval, then layers AIRec personalized recommendations on top. The result is a complete pipeline from raw scans to intelligent, personalized document discovery.
This pipeline is essential when building enterprise knowledge bases or digital libraries where legacy scanned documents must become discoverable. It bridges unstructured physical archives with modern AI workflows, enabling developers to deliver context-aware semantic search (RAG) and personalized content recommendations without manual data labeling.
Upload the raw scans to OSS:
ossutil cp ./scanned/ oss://my-doc-bucket/raw/ --recursive
Run bailian-extract-documents via DashScope to extract text and structured data.
POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-ocr
Payload: {"model": "qwen-vl-max", "input": {"oss_uri": "oss://my-doc-bucket/raw/doc1.pdf"}, "parameters": {"output_format": "json"}}
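As a rough illustration of the extraction call, the sketch below assembles the request without sending it. The endpoint URL and payload fields are copied from the step above and have not been checked against the DashScope API reference. The helper name `build_ocr_request` is hypothetical, and the Bearer-token header assumes the key is read from `DASHSCOPE_API_KEY`.

```python
import json
import os
import urllib.request

# Assumption: endpoint and payload shape are taken verbatim from the step above.
OCR_ENDPOINT = "https://dashscope.aliyuncs.com/api/v1/services/aigc/document-ocr"


def build_ocr_request(oss_uri: str, api_key: str,
                      model: str = "qwen-vl-max") -> urllib.request.Request:
    """Build (but do not send) the extraction request for one OSS object."""
    if not oss_uri.startswith("oss://"):
        raise ValueError(f"expected an oss:// URI, got {oss_uri!r}")
    payload = {
        "model": model,
        "input": {"oss_uri": oss_uri},
        "parameters": {"output_format": "json"},
    }
    return urllib.request.Request(
        OCR_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("DASHSCOPE_API_KEY", "")
    req = build_ocr_request("oss://my-doc-bucket/raw/doc1.pdf", key)
    # Only prints the request line; sending it is left to the caller.
    print(req.get_method(), req.full_url)
```

Keeping construction separate from sending makes the payload easy to inspect or log before any network call is made.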
Use es-ingest-documents to map the extracted JSON into Elasticsearch.
PUT /documents/_bulk
{"index": {"_id": "doc1"}}
{"title": "...", "content": "...", "tags": ["finance", "2024"]}
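The bulk body is newline-delimited JSON: one action line, then one source line per document, with a trailing newline at the end. A minimal sketch of that serialization follows. The helper name `to_bulk_body` and the convention of carrying the ID in an `_id` key on each input dict are assumptions made for illustration.

```python
import json
from typing import Iterable, Mapping


def to_bulk_body(docs: Iterable[Mapping]) -> str:
    """Serialize documents into the newline-delimited body that _bulk expects."""
    lines = []
    for doc in docs:
        source = dict(doc)            # copy so the caller's dict is untouched
        doc_id = source.pop("_id")    # hypothetical convention: ID travels with the doc
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    extracted = [
        {"_id": "doc1", "title": "Q1 report", "content": "OCR text",
         "tags": ["finance", "2024"]},
    ]
    print(to_bulk_body(extracted), end="")
```

The resulting string would be sent with the `Content-Type: application/x-ndjson` header.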
Load the embedding model in OpenSearch.
POST /_plugins/_ml/models/_load
{"model_id": "text-embedding-v3"}
Then map vector_field as knn_vector (dimension: 1024) and bulk-index the embeddings.
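The mapping described in this step can be sketched as an index-creation body, paired with a client-side length check so that a vector of the wrong size is caught before it reaches the index. The field name and dimension come from the step above. The helper names `knn_index_body` and `check_dimension` are hypothetical, and the body follows the open-source OpenSearch k-NN plugin syntax, which is assumed to apply to your deployment.

```python
from typing import Sequence

EMBEDDING_DIM = 1024  # from the step above


def knn_index_body(field: str = "vector_field", dim: int = EMBEDDING_DIM) -> dict:
    """Index-creation body that enables k-NN and declares the vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def check_dimension(vector: Sequence[float], dim: int = EMBEDDING_DIM) -> None:
    """Raise early if an embedding does not match the mapped dimension."""
    if len(vector) != dim:
        raise ValueError(f"embedding has {len(vector)} dims, mapping expects {dim}")


if __name__ == "__main__":
    check_dimension([0.0] * EMBEDDING_DIM)  # passes silently
    print(knn_index_body()["mappings"]["properties"]["vector_field"])
```

Running the check at ingestion time keeps the model's output size and the mapped dimension from drifting apart unnoticed.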
Push the indexed documents to AIRec as items with the PushItems API.
POST /v2/openapi/instances/{instance_id}/scenes/{scene_id}/items
Each item carries item_id, category, status, and features.
Call the Recommend API, passing user context and RAG-retrieved document IDs to rank personalized suggestions.

Data flows unidirectionally from storage to intelligence. OSS acts as the durable landing zone for raw scans. Bailian processes files asynchronously, returning structured JSON (text, tables, layout). Elasticsearch stores cleaned content for fast full-text filtering and metadata queries. OpenSearch hosts the dense vector index, handling semantic similarity searches for RAG context retrieval. Finally, AIRec consumes item metadata and interaction logs to generate personalized recommendation feeds, closing the loop from ingestion to discovery.
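To make the final ranking step more concrete, here is one possible client-side sketch of combining RAG-retrieved document IDs with per-user recommendation scores. The text does not specify how AIRec combines the two, so everything here is an assumption: the function name `rank_candidates`, the shape of the score mapping, and the default score for unseen documents are all hypothetical.

```python
from typing import List, Mapping, Sequence


def rank_candidates(rag_ids: Sequence[str],
                    rec_scores: Mapping[str, float],
                    default: float = 0.0) -> List[str]:
    """Order RAG-retrieved IDs by personalized score, highest first.

    sorted() is stable, so documents with equal scores keep their
    original retrieval order.
    """
    return sorted(rag_ids,
                  key=lambda doc_id: rec_scores.get(doc_id, default),
                  reverse=True)


if __name__ == "__main__":
    retrieved = ["doc1", "doc2", "doc3"]        # order from vector search
    scores = {"doc2": 0.9, "doc3": 0.4}         # hypothetical per-user scores
    print(rank_candidates(retrieved, scores))
```

Documents with no score fall back to `default`, so semantic retrieval order still decides among items the recommender has not seen.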
Required RAM policies: AliyunOSSFullAccess, AliyunElasticsearchFullAccess, AliyunOpenSearchFullAccess, and AliyunAIRecFullAccess.
DASHSCOPE_API_KEY must be exported in your runtime environment.
AIRec needs a scene_id and instance_id configured with a valid recommendation strategy.

[...] knn query failures.
Syncing to AIRec before _refresh completes results in stale catalogs. Implement a webhook to trigger AIRec sync only after successful indexing.
Without supplying user_behavior logs or enabling content-based fallback rules, AIRec returns empty or generic recommendations.

Q: What is the complete pipeline for processing scanned documents through OCR, RAG, and personalized recommendations?
A: The complete pipeline begins by uploading raw scanned documents to OSS, where Bailian OCR extracts the text and structured data. OpenSearch embedding models then vectorize this content for RAG retrieval, and AIRec is layered on top to generate personalized document recommendations.