A developer trains domain-specific embedding models on PAI using proprietary corpus data, then feeds those models into a pipeline in which Bailian extracts text via OCR from scanned documents (PDFs and images in OSS), Elasticsearch indexes the content with the PAI-trained embeddings for hybrid keyword+vector RAG retrieval, and AIRec layers semantic recommendations on top. The result is a fully custom, end-to-end document intelligence and personalized retrieval system.
Use this pipeline when you need to transform proprietary scanned documents into a searchable, personalized knowledge base. It combines PAI-trained domain embeddings, Bailian OCR extraction, Elasticsearch/OpenSearch hybrid retrieval, and AIRec semantic ranking to deliver context-aware document recommendations tailored to user behavior.
1. Upload the scans to OSS: ossutil cp -r ./scanned_docs oss://rag-corpus/ --meta "Content-Type:application/pdf"
2. Parse the documents with Bailian: POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-processing/generation with payload {"model": "doc-parser-v2", "input": {"oss_uri": "oss://rag-corpus/"}}. Extract output.chunks.
3. Train the embeddings on PAI: pai submit --job-name domain-emb --framework pytorch --script train.py --dataset oss://rag-corpus/train.jsonl --output oss://rag-corpus/models/
4. Upload the .pt model: PUT /_plugins/_ml/models/_upload {"model_id": "domain-emb-v1", "model_path": "oss://rag-corpus/models/model.pt"}
5. Create the index: PUT /rag_docs {"mappings": {"properties": {"content": {"type": "text"}, "embedding": {"type": "dense_vector", "dims": 768}}}}. Bulk ingest via POST /_bulk, pairing Bailian text with PAI-generated vectors.
6. Send document metadata to AIRec: POST /v2/openapi/instances/{inst}/data/documents with {"itemId": "doc_1", "title": "...", "tags": ["legal", "contract"]}. Enable the semantic_recommend strategy.
7. Run a hybrid query: POST /rag_docs/_search {"query": {"hybrid": {"text_query": {"query": "compliance clause"}, "vector_query": {"field": "embedding", "vector": [0.12, ...]}}}, "ext": {"airec_rank": true}}

How the services fit together: OSS stores raw and processed artifacts. Bailian asynchronously parses documents into structured text. PAI trains domain-specific embeddings on curated data, which are deployed to OpenSearch for inference. Elasticsearch serves as the unified vector+keyword index and executes hybrid retrieval. AIRec ingests ES metadata and user interaction logs to re-rank results and surface personalized document recommendations.
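The bulk-ingest step above pairs Bailian text chunks with PAI-generated vectors. As a minimal sketch of what that pairing might look like on the client side, the function below builds the newline-delimited JSON body that the `_bulk` endpoint expects and rejects vectors whose length differs from the 768 dims declared in the mapping. The function name `build_bulk_body` and the `doc_<n>` ID scheme are illustrative assumptions, not part of the pipeline described here; sending the body over HTTP is left out.

```python
import json


def build_bulk_body(chunks, vectors, index="rag_docs", dims=768):
    """Pair text chunks with embedding vectors and return an NDJSON _bulk body.

    Assumptions (hypothetical, for illustration): document IDs follow a
    doc_<n> pattern, and the field names match the rag_docs mapping
    ("content" and "embedding").
    """
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one vector")

    lines = []
    for i, (text, vec) in enumerate(zip(chunks, vectors)):
        # Catch dimension mismatches locally instead of at ingest time.
        if len(vec) != dims:
            raise ValueError(
                f"chunk {i}: vector has {len(vec)} dims, mapping expects {dims}"
            )
        # Each document is an action line followed by a source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"doc_{i}"}}))
        lines.append(json.dumps({"content": text, "embedding": list(vec)}))

    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The returned string could then be sent as the request body of POST /_bulk with the `Content-Type: application/x-ndjson` header. Checking the vector length before sending keeps a single malformed embedding from failing partway through a large batch.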
- Permissions: AliyunOSSFullAccess, PAIWorkspaceAdmin, BailianFullAccess, ESFullAccess, and AIRecFullAccess.
- A GPU instance type (e.g., ecs.gn7i-c8g1.2xlarge).
- knn and OpenSearch ML plugins enabled.

- Embedding dims must exactly match the ES dense_vector mapping. Mismatches cause mapper_parsing_exception during bulk ingest.
- Set layout_analysis: true and validate chunk boundaries before vectorization.
- Set rank_combination: {"type": "rrf", "k": 60} to balance keyword precision and semantic recall.
- Use content_tags from ES and set exploration_rate: 0.1 to force initial exposure.

Q: How does the custom ML OCR-to-recommendations RAG pipeline integrate model training, document extraction, and recommendations?
A: The pipeline combines domain-specific embedding models trained on PAI with Bailian OCR extraction and AIRec semantic recommendations to deliver a fully custom end-to-end document intelligence system. It extracts text from scanned documents in OSS, indexes the content in Elasticsearch using the custom embeddings for hybrid keyword and vector retrieval, and applies AIRec to generate personalized recommendations.
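The rank_combination setting mentioned above uses type rrf with k set to 60. To illustrate what reciprocal rank fusion computes, the sketch below fuses a keyword ranking and a vector ranking on the client side: each document scores the sum of 1 / (k + rank) over every list it appears in. This shows the general RRF formula only; it is not a description of how the search engine implements the setting, and the function name `rrf_fuse` and the document IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each ranking is ordered best-first. A document's fused score is the
    sum of 1 / (k + rank) over the rankings that contain it, with ranks
    starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the keyword and vector halves of a query.
keyword_hits = ["doc_3", "doc_1", "doc_7"]
vector_hits = ["doc_1", "doc_9", "doc_3"]

fused = rrf_fuse([keyword_hits, vector_hits])
print(fused)  # ['doc_1', 'doc_3', 'doc_9', 'doc_7']
```

Documents that appear in both lists (doc_1 and doc_3) rise to the top, which is the balance between keyword precision and semantic recall that the setting aims for. A larger k flattens the difference between adjacent ranks, so the top position in any one list carries less weight.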