A team trains domain-specific embedding models and fine-tunes LLMs on PAI, then builds an end-to-end document intelligence pipeline. The pipeline ingests scanned PDFs and images via Bailian OCR, embeds the extracted text with the custom-trained models, indexes it into Elasticsearch for hybrid BM25 and vector retrieval, and deploys the full RAG system to production with OpenSearch handling answer generation.
Use this combination when your engineering team must turn unstructured scanned documents into a production-ready conversational search system. It combines domain-specific model training, automated OCR extraction, hybrid retrieval, and LLM-powered answer generation in a single automated pipeline.
1. aliyun oss cp ./scanned_docs/ oss://rag-pipeline-bucket/raw/ --recursive
2. POST https://dashscope.aliyuncs.com/api/v1/services/ocr/document_parse with payload {"model": "doc-parser-v2", "input": {"file_url": "oss://rag-pipeline-bucket/raw/doc1.pdf"}}. Parse the JSON response to extract structured text blocks.
3. POST https://<pai-eas-endpoint>/v1/embeddings with {"model": "custom-domain-embed", "input": ["chunk_1_text", ...]}.
4. PUT /rag-index/_mapping {"properties": {"content": {"type": "text", "analyzer": "standard"}, "embedding": {"type": "dense_vector", "dims": 768, "index": true, "similarity": "cosine"}}}. Bulk index documents using POST /_bulk.
5. PUT /_plugins/_ml/pipelines/rag-processor {"processors": [{"neural_sparse": {"field": "embedding"}}, {"text_expansion": {"field": "content", "model_id": "fine-tuned-llm-id"}}]}.
6. POST /rag-index/_search using a bool query combining match (BM25) and knn (vector). Route top-5 hits to OpenSearch’s ML connector to generate the final answer via your PAI-fine-tuned LLM.

Data flows unidirectionally from OSS (storage) to Bailian (OCR/text extraction). Extracted chunks are vectorized by PAI-hosted custom models and indexed into Elasticsearch, which maintains both BM25 keyword and dense vector fields. OpenSearch sits at the query layer, executing hybrid retrieval, applying reranking logic, and invoking the fine-tuned LLM to synthesize natural language answers from the retrieved context.
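The request bodies for the mapping, bulk-index, and hybrid-search steps above can be sketched in Python as plain dictionaries, which makes them easy to inspect before anything is sent to a cluster. This is a minimal sketch, not a definitive implementation: the function names, the `chunk-{i}` document IDs, and the `num_candidates` default are illustrative assumptions, and placing `knn` inside a `bool` query assumes Elasticsearch 8.12 or later, where `knn` is available as a query clause. The index name, field names, and `dims` value come from the steps above.

```python
import json
from typing import Any

INDEX_NAME = "rag-index"  # index name used in the steps above
EMBED_DIMS = 768          # must equal the output size of the embedding model


def build_mapping(dims: int = EMBED_DIMS) -> dict[str, Any]:
    """Body for PUT /rag-index/_mapping: a BM25 text field plus a dense vector field."""
    return {
        "properties": {
            "content": {"type": "text", "analyzer": "standard"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


def build_bulk_body(chunks: list[str], vectors: list[list[float]],
                    index: str = INDEX_NAME) -> str:
    """NDJSON payload for POST /_bulk: one action line and one source line per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one embedding vector")
    lines = []
    for i, (text, vec) in enumerate(zip(chunks, vectors)):
        # "chunk-{i}" is a hypothetical ID scheme chosen for illustration.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"chunk-{i}"}}))
        lines.append(json.dumps({"content": text, "embedding": vec}))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


def build_hybrid_query(question: str, query_vector: list[float], size: int = 5,
                       bm25_boost: float = 1.0, knn_boost: float = 1.0,
                       num_candidates: int = 50) -> dict[str, Any]:
    """Body for POST /rag-index/_search: match (BM25) and knn (vector) in one bool query."""
    return {
        "size": size,  # top-5 hits are passed on to answer generation
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": {"query": question, "boost": bm25_boost}}},
                    {"knn": {
                        "field": "embedding",
                        "query_vector": query_vector,
                        "num_candidates": num_candidates,
                        "boost": knn_boost,
                    }},
                ]
            }
        },
    }
```

Because both clauses sit under `should`, a document can score from either the keyword match or the vector match, and the two `boost` values control how much each contributes to the combined score.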
requests, elasticsearch, and opensearch-py SDKs installed
dims in the mapping before bulk ingestion.
weight parameters in the bool query to balance keyword and semantic matches.
pipeline.batch_size: 100 and monitor _nodes/stats memory usage.

Q: How do I build a custom-trained OCR RAG pipeline that processes scanned documents and deploys to production?
A: You build this pipeline by training domain-specific embedding models on PAI and ingesting scanned PDFs or images via Bailian OCR. The extracted text is embedded with your custom models, indexed into Elasticsearch for hybrid BM25 and vector retrieval, and deployed to production with OpenSearch handling the final answer generation.
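The `dims` check and the batch size of 100 mentioned in the checklist above can be sketched as two small client-side helpers that run before bulk ingestion. This is an illustration under stated assumptions: the helper names `check_dims` and `batched` are hypothetical, the default of 768 is taken from the mapping in the steps, and applying the batch size of 100 on the client side is an assumption about how that setting is meant to be used.

```python
from typing import Iterator, Sequence


def check_dims(vectors: Sequence[Sequence[float]], expected_dims: int = 768) -> None:
    """Raise early if any embedding length differs from the mapping's dims value."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dims:
            raise ValueError(
                f"vector {i} has {len(vec)} dims, but the mapping expects {expected_dims}"
            )


def batched(items: Sequence, batch_size: int = 100) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items for separate bulk requests."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Running `check_dims` on each batch before it is sent catches a mismatched embedding model before the cluster rejects the documents, and smaller batches keep individual bulk requests from growing large enough to pressure node memory.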