Custom LLM + Embeddings Full RAG System

A developer uses PAI to fine-tune a domain-specific LLM and train custom embedding models, builds a vector retrieval pipeline on OpenSearch/Elasticsearch and OSS, and then deploys the whole RAG system through Bailian. At query time, the custom LLM generates answers grounded in the documents retrieved with the custom embeddings, which improves accuracy on domain-specific questions.

Products involved

PAI, OSS, OpenSearch/Elasticsearch, Bailian

Scenario

This integration applies when developers need a fully customized Retrieval-Augmented Generation (RAG) system built on proprietary domain data. Teams fine-tune both a domain-specific LLM and custom embedding models on PAI, run vector retrieval on OpenSearch/Elasticsearch, and deploy the inference pipeline through Bailian. Because both the generator and the retriever are trained on the domain corpus, answers are more accurate for that domain, and the training data, model artifacts, and document chunks stay in the team's own OSS buckets.

Integration steps

  1. Fine-tune the LLM on PAI: Submit a supervised fine-tuning job: aliyun pai CreateJob --JobName "domain-llm-sft" --AlgorithmSpec "qwen2.5-7b-sft" --DatasetUri "oss://<bucket>/sft_data.jsonl" --InstanceType "ecs.gn7i-c8g1.2xlarge".
  2. Train Custom Embeddings: Run a parallel embedding training job: aliyun pai CreateJob --JobName "custom-emb-train" --AlgorithmSpec "text-embedding-v3" --DatasetUri "oss://<bucket>/emb_corpus.csv" --OutputUri "oss://<bucket>/models/emb_v1".
  3. Deploy LLM to Bailian: Register the trained PAI model as a managed Bailian endpoint: aliyun bailian CreateModel --ModelName "domain-llm-v1" --ModelSource "PAI" --ModelId "<pai-job-id>" --EndpointType "managed".
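  4. Provision Vector Index in OpenSearch/ES: Create a k-NN optimized index. With the OpenSearch k-NN plugin: PUT /rag_vectors { "settings": { "index.knn": true }, "mappings": { "properties": { "embedding": { "type": "knn_vector", "dimension": 1024, "method": { "name": "hnsw", "space_type": "cosinesimil" } }, "chunk_text": { "type": "text" } } } }. On Elasticsearch, omit the index.knn setting and map the field as { "type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine" } instead. In both cases the dimension must match the output size of the trained embedding model.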
  5. Ingest & Embed Data: Use the trained embedding model to vectorize documents, store raw chunks in OSS, and batch-index vectors: POST /_bulk with {"index": {"_index": "rag_vectors"}} payloads.
  6. Configure Bailian RAG Pipeline: Bind the vector store and LLM endpoint in Bailian: aliyun bailian CreateApplication --Name "DomainRAG" --Model "domain-llm-v1" --RetrievalConfig '{"vector_store": "opensearch", "endpoint": "<es-endpoint>", "index": "rag_vectors", "top_k": 5}'.
  7. Validate & Deploy: Test the pipeline via aliyun bailian InvokeApplication --AppId "<app-id>" --Query "domain-specific-question" --Stream true.
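
Step 5 can be sketched in Python as follows. This is a minimal illustration, not the guide's reference implementation: it splits a document into word-bounded chunks and builds the newline-delimited JSON body that POST /_bulk expects, using the field names (embedding, chunk_text) and index name (rag_vectors) from step 4. The helper names chunk_text, build_bulk_body, and fake_embed are hypothetical, and fake_embed is only a placeholder for a call to the embedding model trained in step 2.

```python
import json
from typing import Callable, Iterable, List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_bulk_body(chunks: Iterable[str],
                    embed: Callable[[str], List[float]],
                    index: str = "rag_vectors") -> str:
    """Build an NDJSON body for POST /_bulk.

    Each chunk contributes two lines: an action line naming the target
    index, then the document itself (vector plus raw text).
    """
    lines = []
    for chunk in chunks:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({"embedding": embed(chunk), "chunk_text": chunk}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def fake_embed(text: str, dims: int = 4) -> List[float]:
    """Hypothetical placeholder: replace with a call to the trained embedding model."""
    return [float(len(text) % (i + 2)) for i in range(dims)]


if __name__ == "__main__":
    body = build_bulk_body(chunk_text("a b c d e", max_words=2), fake_embed)
    print(body, end="")
```

The returned string can be sent as the request body with the Content-Type header set to application/x-ndjson. In a real pipeline the vector length returned by the embedding call has to equal the dimension declared in the index mapping (1024 in step 4), not the 4 used by the placeholder.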

Architecture

PAI acts as the training engine for both the generative LLM and the embedding models. OSS serves as the centralized data lake for raw datasets, training artifacts, and chunked documents. OpenSearch/Elasticsearch hosts the dense vector indexes for low-latency k-NN retrieval. Bailian orchestrates the runtime RAG workflow: it receives the user query, runs top-k retrieval against the OpenSearch/Elasticsearch index, injects the retrieved chunks into the prompt as context, and routes generation to the fine-tuned LLM endpoint deployed on Bailian.
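
The runtime flow described above (retrieve top-k chunks, then inject them into the prompt) can be illustrated with a small in-memory sketch. It is not how Bailian or OpenSearch implement retrieval internally: the vector store is replaced by a plain Python list, similarity is exact cosine rather than approximate HNSW search, and the function names and prompt wording are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[str]:
    """Return the text of the k stored chunks most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the user question."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real embeddings.
    store = [("alpha", [1.0, 0.0]), ("beta", [0.0, 1.0]), ("gamma", [0.7, 0.7])]
    hits = top_k([1.0, 0.1], store, k=2)
    print(build_prompt("What is alpha?", hits))
```

The query vector must come from the same custom embedding model that produced the indexed vectors. Vectors from a different model, or from a different version of the same model, are not comparable, and retrieval quality degrades accordingly.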

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I build and deploy a fully custom RAG system using fine-tuned LLMs and custom embeddings?
A: Use PAI to fine-tune a domain-specific LLM and train custom embedding models, then serve the pipeline through Bailian. The trained components are paired with a vector retrieval pipeline built on OpenSearch or Elasticsearch, with OSS as the storage layer. At query time, Bailian retrieves the most relevant chunks using the custom embeddings and passes them to the custom LLM as context, so answers are grounded in your own documents rather than in the model's general knowledge.