Custom LLM + Embeddings Full RAG System

A developer uses PAI to fine-tune a domain-specific LLM and train a custom embedding model, builds a vector retrieval pipeline on OpenSearch/Elasticsearch and OSS, then deploys the full RAG system through Bailian, where the custom LLM generates answers grounded in context retrieved with the custom embeddings.

Products involved

PAI, OSS, OpenSearch/Elasticsearch, Bailian

Scenario

This integration applies when developers need a fully customized Retrieval-Augmented Generation (RAG) system built on proprietary domain data. Teams fine-tune a domain-specific LLM and a custom embedding model on PAI, run vector retrieval on OpenSearch/Elasticsearch, and deploy the inference pipeline through Bailian. Because both the generator and the retriever are adapted to the domain, answer accuracy on domain questions improves, and the data stays in resources the team controls.

Integration steps

Fine-tune the LLM on PAI: Submit a supervised fine-tuning job: aliyun pai CreateJob --JobName "domain-llm-sft" --AlgorithmSpec "qwen2.5-7b-sft" --DatasetUri "oss://<bucket>/sft_data.jsonl" --InstanceType "ecs.gn7i-c8g1.2xlarge".
Train Custom Embeddings: Run a parallel embedding training job: aliyun pai CreateJob --JobName "custom-emb-train" --AlgorithmSpec "text-embedding-v3" --DatasetUri "oss://<bucket>/emb_corpus.csv" --OutputUri "oss://<bucket>/models/emb_v1".
Deploy LLM to Bailian: Register the trained PAI model as a managed Bailian endpoint: aliyun bailian CreateModel --ModelName "domain-llm-v1" --ModelSource "PAI" --ModelId "<pai-job-id>" --EndpointType "managed".
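Provision Vector Index in OpenSearch/ES: Create a k-NN optimized index. The mapping syntax depends on the engine, and the two forms cannot be mixed. On Elasticsearch 8.x: PUT /rag_vectors { "mappings": { "properties": { "embedding": { "type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine" }, "chunk_text": { "type": "text" } } } }. On an engine that uses the open-source OpenSearch k-NN plugin: PUT /rag_vectors { "settings": { "index.knn": true }, "mappings": { "properties": { "embedding": { "type": "knn_vector", "dimension": 1024, "method": { "name": "hnsw", "space_type": "cosinesimil" } }, "chunk_text": { "type": "text" } } } }.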
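Ingest & Embed Data: Use the trained embedding model to vectorize document chunks, store the raw chunks in OSS, and batch-index the vectors with POST /_bulk. Each document takes two newline-delimited JSON lines: an action line {"index": {"_index": "rag_vectors"}} followed by a source line containing chunk_text and embedding.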
Configure Bailian RAG Pipeline: Bind the vector store and LLM endpoint in Bailian: aliyun bailian CreateApplication --Name "DomainRAG" --Model "domain-llm-v1" --RetrievalConfig '{"vector_store": "opensearch", "endpoint": "<es-endpoint>", "index": "rag_vectors", "top_k": 5}'.
Validate & Deploy: Test the pipeline via aliyun bailian InvokeApplication --AppId "<app-id>" --Query "domain-specific-question" --Stream true.
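
The index and ingestion steps above can be sketched in Python. This is a minimal illustration, not part of any Alibaba Cloud SDK: the function names build_es_mapping and build_bulk_body are hypothetical, the mapping assumes an Elasticsearch 8.x-style dense_vector field, and the field names embedding and chunk_text are taken from the index step. Sending the body to the cluster (HTTP client, authentication, endpoint) is left out.

```python
import json
from typing import Sequence


def build_es_mapping(dims: int) -> dict:
    """Return an index body for an Elasticsearch 8.x dense_vector field.

    Assumption: the cluster accepts the 8.x `index`/`similarity` options.
    """
    return {
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
                "chunk_text": {"type": "text"},
            }
        }
    }


def build_bulk_body(
    index: str,
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> str:
    """Build the newline-delimited JSON body for POST /_bulk.

    Each document contributes an action line and a source line.
    The body must end with a trailing newline.
    """
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    lines = []
    for text, vector in zip(chunks, vectors):
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({"chunk_text": text, "embedding": list(vector)}))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_bulk_body("rag_vectors", ["first chunk"], [[0.1, 0.2, 0.3]])
    print(body, end="")
```

Building the body as a string keeps the sketch independent of any client library; in practice the same payload would be sent with Content-Type: application/x-ndjson.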

Architecture

PAI acts as the training engine for both the generative LLM and the embedding model. OSS serves as the centralized data lake for raw datasets, training artifacts, and chunked documents. OpenSearch/Elasticsearch hosts the dense vector index for low-latency k-NN retrieval. Bailian orchestrates the runtime RAG workflow: it receives the user query, retrieves the top-k chunks from the vector index, injects them into the prompt as context, and routes generation to the fine-tuned LLM endpoint deployed on Bailian.
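
The runtime flow described above (retrieve top-k, inject context, generate) can be illustrated with a small in-memory sketch. It is not how Bailian or OpenSearch implement retrieval: the brute-force cosine search stands in for the k-NN index, and the names cosine, retrieve_top_k, and build_prompt, as well as the prompt wording, are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec: Sequence[float], store: List[Dict], k: int = 5) -> List[str]:
    """Brute-force stand-in for the k-NN index: return the k most similar chunks."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_vec, item["embedding"]),
        reverse=True,
    )
    return [item["chunk_text"] for item in ranked[:k]]


def build_prompt(question: str, contexts: Sequence[str]) -> str:
    """Inject retrieved chunks into the prompt ahead of the user question."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, 1))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    store = [
        {"chunk_text": "Chunk about billing", "embedding": [1.0, 0.0]},
        {"chunk_text": "Chunk about quotas", "embedding": [0.0, 1.0]},
    ]
    contexts = retrieve_top_k([0.9, 0.1], store, k=1)
    print(build_prompt("How is billing calculated?", contexts))
```

The query vector must come from the same embedding model that produced the stored vectors; otherwise the similarity scores are not meaningful.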

Prerequisites

Alibaba Cloud account with a RAM user or role that has the AliyunPAIFullAccess, AliyunBailianFullAccess, and AliyunOpenSearchFullAccess policies attached.
Active PAI workspace with GPU quota (e.g., ecs.gn7i series).
Provisioned OpenSearch/Elasticsearch cluster with the k-NN plugin enabled.
OSS bucket with lifecycle policies for training data.
Domain datasets formatted as JSONL (for SFT) and CSV (for embedding training).
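
As an illustration of the last prerequisite, the following sketch serializes examples into JSONL and CSV. The exact schema expected by the training jobs is not specified here, so the field names instruction and output and the CSV columns query and positive are assumptions; adjust them to whatever the chosen PAI algorithm documents.

```python
import csv
import io
import json
from typing import Dict, Iterable, Sequence, Tuple


def to_sft_jsonl(examples: Iterable[Dict[str, str]]) -> str:
    """Serialize SFT examples as JSONL: one JSON object per line.

    ensure_ascii=False keeps non-ASCII domain text readable in the file.
    """
    return "".join(json.dumps(ex, ensure_ascii=False) + "\n" for ex in examples)


def to_embedding_csv(pairs: Sequence[Tuple[str, str]]) -> str:
    """Serialize (query, positive passage) pairs as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query", "positive"])  # hypothetical column names
    writer.writerows(pairs)
    return buffer.getvalue()


if __name__ == "__main__":
    sft = to_sft_jsonl([{"instruction": "Define SLA.", "output": "A service-level agreement."}])
    emb = to_embedding_csv([("what is an SLA", "An SLA is a service-level agreement.")])
    print(sft, end="")
    print(emb, end="")
```

The resulting strings can be written to local files and uploaded to the OSS paths referenced by the training jobs.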

Common pitfalls

Dimension mismatch: The dims parameter in the OpenSearch index must exactly match the output dimension of the PAI-trained embedding model (e.g., an index declared with 1024 dimensions cannot accept 768-dimensional vectors); otherwise, vector ingestion fails.
Cross-region latency: Serving the application in one region while querying OpenSearch in another adds a cross-region network round trip, which can exceed 50 ms, to every retrieval step; keep all resources in the same region and VPC.
OSS permission boundaries: Bailian and OpenSearch require explicit RAM roles (for example, AliyunServiceRoleForBailian) to read chunks from OSS; if the role's trust policy does not allow the service to call sts:AssumeRole, the ingestion pipeline fails.
Prompt grounding failure: If the Bailian application does not set a low sampling temperature (for example, temperature: 0.1) and a strict system_prompt that restricts answers to the retrieved context, the LLM may hallucinate instead of using that context.
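
The dimension-mismatch pitfall can be caught before ingestion with a simple pre-flight check. This is a minimal sketch; the helper names find_dimension_mismatches and assert_dimensions are hypothetical, and expected_dims is assumed to be the value declared in the index mapping.

```python
from typing import List, Sequence


def find_dimension_mismatches(
    vectors: Sequence[Sequence[float]],
    expected_dims: int,
) -> List[int]:
    """Return the positions of vectors whose length differs from expected_dims."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dims]


def assert_dimensions(vectors: Sequence[Sequence[float]], expected_dims: int) -> None:
    """Raise ValueError before ingestion if any vector has the wrong length."""
    bad = find_dimension_mismatches(vectors, expected_dims)
    if bad:
        raise ValueError(
            f"{len(bad)} vector(s) do not have {expected_dims} dimensions; "
            f"first offending position: {bad[0]}"
        )


if __name__ == "__main__":
    batch = [[0.0] * 1024, [0.0] * 768]
    print(find_dimension_mismatches(batch, expected_dims=1024))
```

Running the check on each batch before the bulk request makes the failure explicit on the client side rather than surfacing as a rejected indexing call.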

Typical questions

build fully custom RAG with fine-tuned LLM and embeddings
train custom LLM and custom embeddings for RAG
PAI end-to-end custom AI pipeline
fine-tune LLM and train embeddings then deploy RAG
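fully custom RAG system with a fine-tuned model plus custom embeddings
train a custom LLM and embedding vectors on PAI and deploy RAG on Bailian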
custom model RAG with domain-specific LLM
train both LLM and embeddings deploy Bailian

FAQ

Q: How do I build and deploy a fully custom RAG system using a fine-tuned LLM and custom embeddings?
A: Use PAI to fine-tune a domain-specific LLM and train a custom embedding model, index the embedded document chunks in OpenSearch or Elasticsearch with the source chunks stored in OSS, and deploy the pipeline through Bailian. At query time, the custom LLM generates answers grounded in the chunks retrieved with the custom embeddings, which improves domain accuracy.