Custom ML OCR-to-Recommendations RAG Pipeline

A developer trains domain-specific embedding models on PAI using proprietary corpus data. Those models feed a pipeline in which Bailian extracts text via OCR from scanned documents (PDFs and images stored in OSS), and the extracted content is indexed into Elasticsearch with the PAI-trained embeddings for hybrid keyword and vector RAG retrieval. AIRec semantic recommendations are layered on top, yielding a custom end-to-end document intelligence and personalized retrieval system.

Products involved

PAI, Bailian, OSS, Elasticsearch, AIRec

Scenario

Use this pipeline when you need to transform proprietary scanned documents into a searchable, personalized knowledge base. It combines PAI-trained domain embeddings, Bailian OCR extraction, Elasticsearch/OpenSearch hybrid retrieval, and AIRec semantic ranking to deliver context-aware document recommendations tailored to user behavior.

Integration steps

Ingest raw files to OSS: Upload the PDFs and images with ossutil cp -r ./scanned_docs oss://rag-corpus/ --meta "Content-Type:application/pdf". The --meta header is applied to every uploaded object, so upload image files in a separate command with their own Content-Type.
Extract text via Bailian: Call the Document Understanding API: POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-processing/generation with payload {"model": "doc-parser-v2", "input": {"oss_uri": "oss://rag-corpus/"}}. Read the parsed text from output.chunks in the response.
Train embeddings on PAI: Prepare a JSONL corpus and submit the training job: pai submit --job-name domain-emb --framework pytorch --script train.py --dataset oss://rag-corpus/train.jsonl --output oss://rag-corpus/models/.
Deploy to OpenSearch ML: Register the trained .pt model: PUT /_plugins/_ml/models/_upload {"model_id": "domain-emb-v1", "model_path": "oss://rag-corpus/models/model.pt"}.
Index in Elasticsearch: Create the hybrid index: PUT /rag_docs {"mappings": {"properties": {"content": {"type": "text"}, "embedding": {"type": "dense_vector", "dims": 768}}}}. Then bulk ingest via POST /_bulk, pairing each Bailian text chunk with its PAI-generated vector.
Configure AIRec recommendations: Map the ES fields to the AIRec schema: POST /v2/openapi/instances/{inst}/data/documents with {"itemId": "doc_1", "title": "...", "tags": ["legal", "contract"]}. Then enable the semantic_recommend strategy.
Execute the hybrid RAG query: POST /rag_docs/_search {"query": {"hybrid": {"text_query": {"query": "compliance clause"}, "vector_query": {"field": "embedding", "vector": [0.12, ...]}}}, "ext": {"airec_rank": true}}.
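
The bulk ingest in the indexing step can be sketched as follows. This is a minimal illustration, not the pipeline's actual ingest code: it only builds the newline-delimited JSON body that POST /_bulk expects and does not send it. The function name build_bulk_body and the doc_<n> ID scheme are assumptions made for the example; the index name, field names, and the 768 dimension come from the mapping above.

```python
import json
from typing import List, Sequence

EMBEDDING_DIMS = 768  # must equal "dims" in the rag_docs mapping


def build_bulk_body(
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
    index: str = "rag_docs",
    dims: int = EMBEDDING_DIMS,
) -> str:
    """Build an NDJSON payload for POST /_bulk, one action line plus one source line per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    lines: List[str] = []
    for i, (text, vec) in enumerate(zip(chunks, vectors)):
        # Reject wrong-sized vectors before they reach the cluster.
        if len(vec) != dims:
            raise ValueError(f"vector {i} has {len(vec)} dims, expected {dims}")
        # Hypothetical ID scheme: doc_0, doc_1, ...
        lines.append(json.dumps({"index": {"_index": index, "_id": f"doc_{i}"}}))
        lines.append(json.dumps({"content": text, "embedding": list(vec)}))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)
```

The returned string would be sent as the request body with Content-Type application/x-ndjson. Checking vector length on the client side surfaces a dimension mismatch before the cluster rejects the batch.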

Architecture

OSS stores raw and processed artifacts. Bailian asynchronously parses documents into structured text. PAI trains domain-specific embeddings on curated data, which are deployed to OpenSearch for inference. Elasticsearch serves as the unified vector+keyword index, executing hybrid retrieval. AIRec ingests ES metadata and user interaction logs to re-rank results and surface personalized document recommendations.
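
The hand-off from Elasticsearch metadata to AIRec can be sketched as a small mapping function. This is an illustrative sketch under stated assumptions: it takes a standard Elasticsearch search response (hits.hits with _id and _source) and produces items in the itemId/title/tags shape shown in the integration steps. The helper name hits_to_airec_items is hypothetical, and it assumes the indexed documents carry title and tags fields, which the rag_docs mapping above does not define.

```python
from typing import Any, Dict, List


def hits_to_airec_items(search_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert Elasticsearch search hits into AIRec-style item dictionaries."""
    items: List[Dict[str, Any]] = []
    for hit in search_response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        items.append(
            {
                "itemId": hit["_id"],
                # Assumed fields: fall back to empty values when they are absent.
                "title": source.get("title", ""),
                "tags": list(source.get("tags", [])),
            }
        )
    return items
```

Each resulting dictionary could then be posted to the AIRec data endpoint from the integration steps.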

Prerequisites

RAM roles with AliyunOSSFullAccess, PAIWorkspaceAdmin, BailianFullAccess, ESFullAccess, and AIRecFullAccess.
OSS bucket with versioning enabled and separate prefixes for raw and processed artifacts.
Bailian API key with Document Understanding quota.
PAI workspace with GPU instance (e.g., ecs.gn7i-c8g1.2xlarge).
Elasticsearch 8.x cluster with knn and OpenSearch ML plugins enabled.
AIRec instance with custom recommendation schema.

Common pitfalls

Dimension mismatch: PAI output dims must exactly match ES dense_vector mapping. Mismatches cause mapper_parsing_exception during bulk ingest.
OCR layout degradation: Bailian struggles with multi-column PDFs. Enable layout_analysis: true and validate chunk boundaries before vectorization.
Hybrid score skew: BM25 often dominates vector relevance. Use rank_combination: {"type": "rrf", "k": 60} to balance keyword precision and semantic recall.
AIRec cold start: New items lack click signals. Seed with content_tags from ES and set exploration_rate: 0.1 to force initial exposure.
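
To make the hybrid score skew pitfall concrete, here is a minimal client-side sketch of reciprocal rank fusion (RRF) with k = 60. It is not the engine's internal implementation. It only illustrates why RRF is insensitive to the raw score scales of BM25 and vector similarity: each document's fused score is the sum of 1 / (k + rank) over the result lists it appears in. The function name rrf_fuse and the sample document IDs are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs using reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; only the position matters, not the raw score.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]    # keyword results, best first
vector_ranking = ["c", "a", "d"]  # vector results, best first
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

In this example, documents a and c appear in both lists and therefore rank ahead of b and d, which each appear in only one.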

Typical questions

train custom embeddings then OCR documents and recommend
PAI trained models powering OCR RAG with recommendations
domain-specific embedding model for scanned document RAG
custom ML embeddings plus Bailian OCR plus AIRec pipeline
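OCR document extraction and intelligent recommendations after training a custom embedding model
PAI-trained embeddings plus Bailian document extraction plus a recommendation system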
full pipeline from model training to OCR extraction to recommendations
custom trained vector search on scanned documents with recommendations

FAQ

Q: How does the custom ML OCR-to-recommendations RAG pipeline integrate model training, document extraction, and recommendations?
A: The pipeline combines domain-specific embedding models trained on PAI with Bailian OCR extraction and AIRec semantic recommendations to deliver a fully custom end-to-end document intelligence system. It extracts text from scanned documents in OSS, indexes the content in Elasticsearch using the custom embeddings for hybrid keyword and vector retrieval, and applies AIRec to generate personalized recommendations.