Custom ML OCR-to-Recommendations RAG Pipeline

A developer trains domain-specific embedding models on PAI using proprietary corpus data and feeds them into a pipeline with three further stages. Bailian extracts text via OCR from scanned documents (PDFs and images stored in OSS). Elasticsearch indexes that content with the PAI-trained embeddings for hybrid keyword and vector RAG retrieval. AIRec layers semantic recommendations on top. The result is a fully custom, end-to-end document intelligence and personalized retrieval system.

Products involved

Scenario

Use this pipeline when you need to transform proprietary scanned documents into a searchable, personalized knowledge base. It combines PAI-trained domain embeddings, Bailian OCR extraction, Elasticsearch/OpenSearch hybrid retrieval, and AIRec semantic ranking to deliver context-aware document recommendations tailored to user behavior.

Integration steps

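  1. Ingest raw files to OSS: Upload PDFs and images with ossutil cp -r ./scanned_docs oss://rag-corpus/ --meta "Content-Type:application/pdf". The Content-Type shown applies to PDFs only; upload images in a separate command with a matching type.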
  2. Extract text via Bailian: Call the Document Understanding API with POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-processing/generation and payload {"model": "doc-parser-v2", "input": {"oss_uri": "oss://rag-corpus/"}}, then read output.chunks from the response.
  3. Train embeddings on PAI: Prepare a JSONL corpus and submit the training job: pai submit --job-name domain-emb --framework pytorch --script train.py --dataset oss://rag-corpus/train.jsonl --output oss://rag-corpus/models/.
  4. Deploy to OpenSearch ML: Register the .pt model: PUT /_plugins/_ml/models/_upload {"model_id": "domain-emb-v1", "model_path": "oss://rag-corpus/models/model.pt"}.
  5. Index in Elasticsearch: Create the hybrid index: PUT /rag_docs {"mappings": {"properties": {"content": {"type": "text"}, "embedding": {"type": "dense_vector", "dims": 768}}}}. Then bulk ingest via POST /_bulk, pairing each Bailian text chunk with its PAI-generated vector.
  6. Configure AIRec recommendations: Map ES fields to the AIRec schema: POST /v2/openapi/instances/{inst}/data/documents with {"itemId": "doc_1", "title": "...", "tags": ["legal", "contract"]}. Then enable the semantic_recommend strategy.
  7. Execute hybrid RAG query: POST /rag_docs/_search {"query": {"hybrid": {"text_query": {"query": "compliance clause"}, "vector_query": {"field": "embedding", "vector": [0.12, ...]}}}, "ext": {"airec_rank": true}}.
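
The bulk ingest in step 5 can be sketched as follows. This is a minimal illustration, not the pipeline's actual ingestion code: it assumes the Bailian chunks and PAI vectors are already in memory as parallel lists, and the helper name build_bulk_body and the doc_N ID scheme are hypothetical. It only builds the newline-delimited JSON body that the _bulk endpoint expects (one action line followed by one document line per chunk) and checks each vector against the 768 dimensions declared in the rag_docs mapping.

```python
import json
from typing import Sequence

EMBEDDING_DIMS = 768  # matches "dims" in the rag_docs mapping from step 5


def build_bulk_body(index: str,
                    chunks: Sequence[str],
                    vectors: Sequence[Sequence[float]],
                    id_prefix: str = "doc") -> str:
    """Pair each text chunk with its vector and return a _bulk NDJSON body.

    Hypothetical helper: the name and the "<prefix>_<n>" ID scheme are
    illustrative assumptions, not part of any SDK.
    """
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one vector")

    lines = []
    for i, (text, vec) in enumerate(zip(chunks, vectors), start=1):
        if len(vec) != EMBEDDING_DIMS:
            raise ValueError(
                f"vector {i} has {len(vec)} dims, expected {EMBEDDING_DIMS}"
            )
        # Action line: tells _bulk which index and document ID to write.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"{id_prefix}_{i}"}}))
        # Source line: the document itself, using the field names from the mapping.
        lines.append(json.dumps({"content": text, "embedding": list(vec)}))

    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

The returned string would be sent as the body of POST /_bulk with a Content-Type of application/x-ndjson. Validating dimensions before sending keeps a single malformed vector from failing partway through a large batch.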

Architecture

OSS stores raw and processed artifacts. Bailian asynchronously parses documents into structured text. PAI trains domain-specific embeddings on curated data, which are deployed to OpenSearch for inference. Elasticsearch serves as the unified vector+keyword index, executing hybrid retrieval. AIRec ingests ES metadata and user interaction logs to re-rank results and surface personalized document recommendations.
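
To make the hybrid retrieval idea concrete, here is a small sketch of one common way to merge a keyword ranking and a vector ranking into a single list: reciprocal rank fusion (RRF). This is an illustrative stand-in, not a description of how Elasticsearch or AIRec combine scores internally. The function name, the example document IDs, and the constant k = 60 (a conventional default for RRF) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list where it appears
    (rank starts at 1); scores are summed across lists.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the keyword and vector halves of a query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_c", "doc_d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_c', 'doc_a', 'doc_d']
```

Documents that appear in both lists rise to the top, which is the behavior hybrid retrieval aims for. A personalization stage such as AIRec would then re-rank this fused list using user interaction signals.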

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How does the custom ML OCR-to-recommendations RAG pipeline integrate model training, document extraction, and recommendations?

A: The pipeline combines domain-specific embedding models trained on PAI with Bailian OCR extraction and AIRec semantic recommendations to deliver a fully custom end-to-end document intelligence system. It extracts text from scanned documents in OSS, indexes the content in Elasticsearch using the custom embeddings for hybrid keyword and vector retrieval, and applies AIRec to generate personalized recommendations.