Custom-Trained Embedding Models Powering RAG Inference

Train domain-specific embedding models on PAI and store the optimized vector indexes in OSS (Skill 1). Then feed those custom embeddings into a fully custom RAG pipeline in which a fine-tuned LLM deployed on Bailian generates answers grounded in the vector-retrieved context (Skill 3). The result is a complete train-to-inference loop that uses both custom embeddings and a custom generative model.

Products involved

Scenario

Use this workflow when generic embeddings and foundation models underperform on proprietary domain data and a fully customized RAG pipeline is needed. Both the embedding model and the generative LLM are trained on PAI. OSS holds the corpus and model artifacts, Elasticsearch/OpenSearch serves the vector index, and Bailian hosts the fine-tuned LLM. The aim is more precise semantic retrieval and more domain-accurate generation than generic components provide, within a single train-to-inference loop.

Integration steps

  1. Ingest corpus to OSS: ossutil cp -r ./data oss://<bucket>/raw-docs/
  2. Train embedding model on PAI: pai submit --workspace <ws-id> --job-name emb-train --framework pytorch --script train_emb.py --data oss://<bucket>/raw-docs/ --output oss://<bucket>/models/emb-v1/
  3. Fine-tune LLM on PAI: pai submit --job-name llm-ft --framework deepspeed --script train_llm.py --base-model qwen-7b --data oss://<bucket>/qa-pairs/ --output oss://<bucket>/models/llm-ft/
  4. Deploy LLM to Bailian: POST https://dashscope.aliyuncs.com/api/v1/models with payload {"model_name": "custom-llm-v1", "model_path": "oss://<bucket>/models/llm-ft/"}
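  5. Generate & store vectors: Run batch inference over the corpus with the embedding model trained in step 2, saving outputs as Parquet to oss://<bucket>/vector-indexes/.
  6. Configure ES hybrid index: PUT /rag-index {"mappings": {"properties": {"embedding": {"type": "dense_vector", "dims": 768}, "text": {"type": "text"}}}}, where dims must match the embedding model's output size. On Elasticsearch versions that do not index dense_vector fields by default, also set "index": true and a "similarity" on the embedding field so that the knn query in step 7 can run. Then load the vectors from step 5 into the index together with their source text.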
  7. Orchestrate RAG: Embed the user query with the same custom embedding model, then query ES via POST /rag-index/_search {"knn": {"field": "embedding", "query_vector": [...], "k": 5}}. Route the returned chunks to Bailian: POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions with {"model": "custom-llm-v1", "messages": [{"role": "user", "content": "Context: <chunks>\nQuestion: <query>"}]}, where <chunks> and <query> are filled in by the application.
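
The request bodies in steps 6 and 7 can be assembled programmatically. The Python sketch below is a minimal illustration, not a definitive implementation. The helper names (build_index_mapping, build_knn_query, build_chat_payload) are hypothetical, and the "index", "similarity", "num_candidates", and "_source" settings are assumptions that the steps above do not specify. The code only builds dictionaries; sending them to Elasticsearch and Bailian is left to the application's HTTP client.

```python
import json
from typing import Sequence

EMBEDDING_DIMS = 768          # matches "dims" in step 6
MODEL_NAME = "custom-llm-v1"  # model name used in steps 4 and 7


def build_index_mapping(dims: int = EMBEDDING_DIMS) -> dict:
    """Body for PUT /rag-index (step 6)."""
    return {
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    # Assumptions: step 6 lists only "type" and "dims".
                    "index": True,
                    "similarity": "cosine",
                },
                "text": {"type": "text"},
            }
        }
    }


def build_knn_query(query_vector: Sequence[float], k: int = 5,
                    num_candidates: int = 50) -> dict:
    """Body for POST /rag-index/_search (step 7)."""
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(
            f"expected {EMBEDDING_DIMS} dimensions, got {len(query_vector)}"
        )
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            # Assumption: candidate pool size is a tuning value, not from the steps.
            "num_candidates": num_candidates,
        },
        # Assumption: only the chunk text is needed downstream.
        "_source": ["text"],
    }


def build_chat_payload(chunks: Sequence[str], query: str,
                       model: str = MODEL_NAME) -> dict:
    """Body for the chat/completions call in step 7."""
    context = "\n\n".join(chunks)
    return {
        "model": model,
        "messages": [
            {"role": "user",
             "content": f"Context: {context}\nQuestion: {query}"}
        ],
    }


if __name__ == "__main__":
    # Print an example chat payload built from two placeholder chunks.
    payload = build_chat_payload(["chunk one", "chunk two"], "What is covered?")
    print(json.dumps(payload, indent=2))
```

Validating the vector length before the search is a cheap guard against querying the index with embeddings from a different model version than the one used in step 5.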

Architecture

Raw documents reside in OSS as the immutable source of truth. PAI consumes this data to train the embedding model (retrieval layer) and the generative LLM (reasoning layer). The embedding model outputs high-dimensional vectors, which are stored in OSS and indexed in Elasticsearch/OpenSearch for low-latency kNN search. At runtime, the application embeds the user query with the same embedding model, queries the vector index, retrieves the top-k context chunks, and routes them alongside the user prompt to the Bailian-deployed LLM endpoint, closing the end-to-end RAG loop.
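
The runtime path described above can be sketched as a small Python function. This is an illustration under stated assumptions rather than the product's implementation: answer and extract_chunks are hypothetical names, and the embed, search, and generate callables stand in for the custom embedding model, the Elasticsearch/OpenSearch search call, and the Bailian endpoint. The sketch also assumes the search response uses the standard hits.hits[]._source layout with a "text" field.

```python
from typing import Callable, List, Sequence


def extract_chunks(search_response: dict) -> List[str]:
    """Collect the "text" field from each hit of a search response."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        hit["_source"]["text"]
        for hit in hits
        if "text" in hit.get("_source", {})
    ]


def answer(query: str,
           embed: Callable[[str], Sequence[float]],
           search: Callable[[dict], dict],
           generate: Callable[[List[dict]], str],
           k: int = 5) -> str:
    """Run one retrieve-then-generate round trip."""
    # 1. Embed the query with the same model that produced the index vectors.
    query_vector = embed(query)
    # 2. Retrieve the top-k chunks from the vector index.
    response = search({
        "knn": {"field": "embedding",
                "query_vector": list(query_vector),
                "k": k}
    })
    chunks = extract_chunks(response)
    # 3. Send the retrieved context plus the question to the LLM.
    context = "\n\n".join(chunks)
    messages = [{"role": "user",
                 "content": f"Context: {context}\nQuestion: {query}"}]
    return generate(messages)


if __name__ == "__main__":
    # Stub components so the flow can be exercised without any network access.
    def stub_embed(text: str) -> Sequence[float]:
        return [0.0, 1.0]

    def stub_search(body: dict) -> dict:
        return {"hits": {"hits": [{"_source": {"text": "example chunk"}}]}}

    def stub_generate(messages: List[dict]) -> str:
        return "echo: " + messages[0]["content"]

    print(answer("example question", stub_embed, stub_search, stub_generate))
```

Passing the three components in as callables keeps the orchestration logic testable offline and lets the embedding model, index, or LLM endpoint be swapped independently.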

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I build an end-to-end RAG pipeline using custom-trained embeddings and a fine-tuned LLM?
A: You can create a complete train-to-inference RAG loop by training domain-specific embedding models on PAI, storing the optimized vector indexes in OSS, and feeding them into a pipeline where a fine-tuned LLM deployed on Bailian generates answers grounded in the vector-retrieved context. This workflow is implemented through predefined integration combinations such as the Full Custom RAG and Full-Stack Custom RAG skills.