Custom RAG with Tuned Search Relevance

Build a fully custom RAG pipeline (fine-tuned LLM trained on PAI and deployed to Bailian, custom embeddings, OpenSearch vector store), then tune the OpenSearch retrieval layer (BM25 weights, tokenization, NER, and ranking models) so that the retrieval stage contributes as much as possible to answer quality.

Products involved

OSS: stores the domain corpus and model checkpoints
PAI: fine-tunes the LLM and the custom embedding model
Bailian: serves the fine-tuned LLM as a managed inference endpoint
OpenSearch: vector store and tuned retrieval layer

Scenario

Use this workflow when building a production RAG system requiring domain-specific generation and highly tuned retrieval. You will fine-tune a custom LLM and embedding model on PAI, deploy the LLM via Bailian, and store vectors in OpenSearch while optimizing BM25, tokenization, NER, and learning-to-rank models to maximize context precision.

Integration steps

Stage Data in OSS: Upload the domain corpus with ossutil cp -r ./docs oss://rag-data/. Format each record as one JSONL line: {"id": "doc1", "text": "...", "meta": {}}.
Train on PAI: In a PAI-DSW notebook, submit the training job via the pai-deploy-inference SDK: pai.training.submit(model="qwen-7b", dataset="oss://rag-data/train.jsonl", instance="ecs.gn7i-c8g1.2xlarge"). Train both the LLM and a custom embedding model (e.g., starting from the bge-m3 base model).
Deploy LLM to Bailian: Register the fine-tuned checkpoint with bailian.models.register(name="custom-llm", path="oss://models/llm/"), then deploy it with bailian.deploy_model(model_id="custom-llm", endpoint="rag-gen-prod", min_instances=1).
Index Vectors in OpenSearch: Generate embeddings with the PAI-trained embedding model, then create the index: PUT /rag-index {"mappings": {"properties": {"embedding": {"type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine"}, "content": {"type": "text", "analyzer": "custom_ik"}}}}. The dims value (1024 here) must match the output dimension of the embedding model.
Tune Relevance Layer:

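Configure domain NER & tokenizer: PUT /rag-index/_settings {"analysis": {"analyzer": {"custom_ik": {"tokenizer": "ik_max_word", "filter": ["domain_ner_filter"]}}}}. In Elasticsearch-compatible engines, analysis and similarity settings can only be changed while the index is closed, and an analyzer must exist before a mapping references it, so apply these settings at index creation or close the index first.
Adjust BM25: PUT /rag-index/_settings {"index.similarity.default": {"type": "BM25", "k1": 1.2, "b": 0.75}}. Here k1 controls how quickly repeated term matches saturate and b controls document-length normalization. The values 1.2 and 0.75 are the standard defaults, so treat them as a baseline and adjust them against your own relevance judgments.
Train ranking model: POST /_plugins/_ml/models/_train {"algorithm": "ltr", "training_index": "rag-train", "features": ["bm25", "cosine_sim", "ner_overlap"]}. The features combine the lexical score (bm25), the vector similarity (cosine_sim), and the entity overlap between query and chunk (ner_overlap).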
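
To build intuition for the k1 and b settings above, here is a minimal sketch of the textbook BM25 score for a single query term. It is not the engine's exact implementation (engines may differ by a constant factor), and the function name and sample numbers are illustrative assumptions.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document.

    tf          : occurrences of the term in the document
    doc_len     : document length in tokens
    avg_doc_len : average document length in the index
    n_docs      : total number of documents
    doc_freq    : number of documents containing the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b scales how strongly long documents are penalised (b=0 disables it).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


if __name__ == "__main__":
    for b in (0.0, 0.75):
        short = bm25_term_score(3, 50, 100, 1000, 10, b=b)
        long_ = bm25_term_score(3, 400, 100, 1000, 10, b=b)
        print(f"b={b}: short chunk {short:.3f}, long chunk {long_:.3f}")
```

With b set to 0 the short and long chunk score the same. Raising b favours shorter chunks, which matters when chunk sizes in the index vary widely.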

Execute Hybrid Query: POST /rag-index/_search {"query": {"bool": {"should": [{"match": {"content": {"query": "q", "boost": 0.35}}}, {"knn": {"embedding": {"vector": [...], "k": 5, "boost": 0.65}}}]}}}. The boosts weight the lexical and vector clauses at 0.35 and 0.65. Pass the top-k chunks as context to Bailian: bailian.inference(endpoint="rag-gen-prod", prompt=f"Context: {ctx}\nQ: {q}").
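
The hybrid query and prompt from the last step can be assembled in application code roughly as follows. This sketch only builds the request body and the prompt string and makes no network calls. The helper names are hypothetical, and the response shape (hits.hits[*]._source.content) is an assumption based on the common Elasticsearch-style search response.

```python
import json
from typing import Any, Dict, List, Sequence


def build_hybrid_query(question: str, query_vector: Sequence[float], k: int = 5,
                       text_boost: float = 0.35, vector_boost: float = 0.65) -> Dict[str, Any]:
    """Build the bool/should body that combines BM25 and k-NN clauses."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": {"query": question, "boost": text_boost}}},
                    {"knn": {"embedding": {"vector": list(query_vector),
                                           "k": k, "boost": vector_boost}}},
                ]
            }
        }
    }


def extract_contexts(response: Dict[str, Any]) -> List[str]:
    """Pull chunk text out of a search response (assumed response shape)."""
    return [hit["_source"]["content"] for hit in response["hits"]["hits"]]


def build_prompt(question: str, contexts: Sequence[str]) -> str:
    """Format retrieved chunks and the question in the prompt layout used above."""
    ctx = "\n\n".join(contexts)
    return f"Context: {ctx}\nQ: {question}"


if __name__ == "__main__":
    body = build_hybrid_query("How does b affect BM25?", [0.1, 0.2, 0.3])
    print(json.dumps(body, indent=2))
```

Keeping query construction separate from the HTTP call makes the boost ratio easy to vary during relevance experiments.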

Architecture

Raw documents flow from OSS to PAI for dual training (LLM + embeddings). The LLM is deployed as a managed inference endpoint on Bailian. OpenSearch acts as the retrieval orchestrator, applying custom analyzers, BM25 tuning, and ML ranking to score chunks. The application queries OpenSearch for context, then forwards it to the Bailian endpoint for generation.
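
The request path described above (embed the question, retrieve from OpenSearch, generate on Bailian) can be sketched as a small pipeline object with injected callables. The class and its three fields are hypothetical stand-ins for the real embedding, search, and inference clients, which are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RagPipeline:
    # Hypothetical stand-ins for the real service clients:
    embed: Callable[[str], Sequence[float]]                    # embedding model trained on PAI
    search: Callable[[str, Sequence[float], int], List[str]]   # OpenSearch hybrid retrieval
    generate: Callable[[str], str]                             # Bailian inference endpoint

    def answer(self, question: str, k: int = 5) -> str:
        """Run one question through retrieval, then generation."""
        vector = self.embed(question)
        chunks = self.search(question, vector, k)
        context = "\n\n".join(chunks)
        return self.generate(f"Context: {context}\nQ: {question}")


if __name__ == "__main__":
    # Stub services so the flow can be exercised without any cloud resources.
    pipeline = RagPipeline(
        embed=lambda text: [float(len(text))],
        search=lambda text, vec, k: ["chunk one", "chunk two"][:k],
        generate=lambda prompt: f"(stub answer based on {len(prompt)} prompt chars)",
    )
    print(pipeline.answer("What does b control in BM25?"))
```

Injecting the three stages as callables lets each one be replaced by a stub in tests or swapped independently when the retrieval layer is retuned.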

Prerequisites

Alibaba Cloud account with PAI, Bailian, OpenSearch, and OSS enabled
GPU quota for PAI training (e.g., ecs.gn7i series)
Domain corpus (≥10k documents) in JSONL format
OpenSearch instance with NLP/ML plugin enabled
VPC peering between PAI, Bailian, and OpenSearch
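
Before uploading the corpus, it can help to check that every record matches the JSONL shape used in this guide ({"id", "text", "meta"}). The following is a minimal validation sketch. The specific rules (string ids, non-empty text, unique ids) are assumptions about what a clean corpus should look like, not requirements stated by any of the products.

```python
import json
from typing import Iterable, List, Tuple

REQUIRED_KEYS = ("id", "text", "meta")


def validate_jsonl(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (number of valid records, list of error messages)."""
    valid = 0
    errors: List[str] = []
    seen_ids = set()
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(rec, dict):
            errors.append(f"line {n}: record is not a JSON object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing:
            errors.append(f"line {n}: missing keys {missing}")
            continue
        if not isinstance(rec["id"], str) or not isinstance(rec["text"], str):
            errors.append(f"line {n}: 'id' and 'text' must be strings")
            continue
        if not rec["text"].strip():
            errors.append(f"line {n}: empty text")
            continue
        if rec["id"] in seen_ids:
            errors.append(f"line {n}: duplicate id {rec['id']!r}")
            continue
        seen_ids.add(rec["id"])
        valid += 1
    return valid, errors
```

Pass an open file handle (for example, the training file before it is copied to OSS) and compare the valid count against the 10k-document minimum listed above.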

Common pitfalls

Tokenizer mismatch: Different tokenizers for embedding training vs. OpenSearch indexing cause semantic drift; align both to ik_max_word or a shared custom tokenizer.
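BM25/Vector weight imbalance: Over-weighting BM25 drowns out semantic matches. BM25 scores are unbounded while cosine similarity is bounded, so tune k1/b and set explicit boost ratios (e.g., 0.35 lexical / 0.65 vector) rather than relying on raw scores.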
NER misalignment: Custom NER dictionaries missing domain acronyms reduce recall; validate against a held-out corpus before applying.
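Bailian cold-start: Managed endpoints can scale to zero when idle, so the first request after a quiet period pays the model load time. Set min_instances=1 in bailian.deploy_model to keep one instance warm and stay within a sub-500 ms latency target.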
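
One way to examine the weight-imbalance pitfall offline is to normalise each score list before applying the 0.35/0.65 weights. The sketch below uses min-max normalisation on the client side. This is a substitute technique for experimentation, not the in-query boost mechanism shown in the integration steps, and the function names and sample scores are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to 1.0 for every doc."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25: Dict[str, float], vector: Dict[str, float],
         w_text: float = 0.35, w_vec: float = 0.65) -> List[Tuple[str, float]]:
    """Weighted sum of normalised lexical and vector scores, best first."""
    norm_text, norm_vec = min_max(bm25), min_max(vector)
    docs = set(norm_text) | set(norm_vec)
    fused = {
        doc: w_text * norm_text.get(doc, 0.0) + w_vec * norm_vec.get(doc, 0.0)
        for doc in docs
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    lexical = {"a": 12.0, "b": 3.0, "c": 7.5}    # raw BM25 scores (unbounded)
    semantic = {"a": 0.2, "b": 0.9, "d": 0.55}   # cosine similarities
    for doc, score in fuse(lexical, semantic):
        print(doc, round(score, 3))
```

Without normalisation, a raw BM25 score of 12.0 would dominate any cosine value regardless of the chosen weights.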

Typical questions

build custom RAG and optimize retrieval quality
tune search relevance in custom RAG pipeline
improve RAG retrieval ranking
fine-tune LLM and optimize OpenSearch for RAG
custom RAG with tuned BM25 and ranking
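set up a custom RAG and optimize retrieval relevance
train custom models and tune OpenSearch retrieval ranking
increase retrieval precision of a RAG system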

FAQ

Q: How do I build a custom RAG pipeline and tune its search relevance and retrieval ranking?
A: Fine-tune an LLM and a custom embedding model on PAI, deploy the LLM via Bailian, and use OpenSearch as the vector store. Then tune the OpenSearch retrieval layer (BM25 weights, tokenization, named entity recognition, and ranking models) to maximize answer quality. These capabilities are delivered through integrated product combinations such as full-custom-rag-custom-llm-custom-embeddings and opensearch-optimize-relevance.