Build a fully custom RAG pipeline (fine-tuned LLM on PAI deployed to Bailian, custom embeddings, OpenSearch vector store) and then tune the OpenSearch retrieval layer — BM25 weights, tokenization, NER, and ranking models — to maximize answer quality from the retrieval stage.
Use this workflow when building a production RAG system requiring domain-specific generation and highly tuned retrieval. You will fine-tune a custom LLM and embedding model on PAI, deploy the LLM via Bailian, and store vectors in OpenSearch while optimizing BM25, tokenization, NER, and learning-to-rank models to maximize context precision.
1. Upload the source documents to OSS: ossutil cp -r ./docs oss://rag-data/. Format the data as JSONL, one record per line: {"id": "doc1", "text": "...", "meta": {}}.
2. Submit the training jobs with the pai-deploy-inference SDK: pai.training.submit(model="qwen-7b", dataset="oss://rag-data/train.jsonl", instance="ecs.gn7i-c8g1.2xlarge"). Train both the LLM and a custom embedding model (e.g., bge-m3 base).
3. Register and deploy the fine-tuned LLM on Bailian: bailian.models.register(name="custom-llm", path="oss://models/llm/"), then bailian.deploy_model(model_id="custom-llm", endpoint="rag-gen-prod", min_instances=1).
4. Create the OpenSearch index with a vector field and an analyzed text field: PUT /rag-index {"mappings": {"properties": {"embedding": {"type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine"}, "content": {"type": "text", "analyzer": "custom_ik"}}}}.
5. Define the custom_ik analyzer (tokenizer plus NER filter): PUT /rag-index/_settings {"analysis": {"analyzer": {"custom_ik": {"tokenizer": "ik_max_word", "filter": ["domain_ner_filter"]}}}}.
6. Set the BM25 parameters: PUT /rag-index/_settings {"index.similarity.default": {"type": "BM25", "k1": 1.2, "b": 0.75}}.
7. Train the ranking model: POST /_plugins/_ml/models/_train {"algorithm": "ltr", "training_index": "rag-train", "features": ["bm25", "cosine_sim", "ner_overlap"]}.
8. Run the hybrid query: POST /rag-index/_search {"query": {"bool": {"should": [{"match": {"content": {"query": "q", "boost": 0.35}}}, {"knn": {"embedding": {"vector": [...], "k": 5, "boost": 0.65}}}]}}}. Pass the top-k context to Bailian: bailian.inference(endpoint="rag-gen-prod", prompt=f"Context: {ctx}\nQ: {q}").

Raw documents flow from OSS to PAI for dual training (LLM + embeddings). The LLM is deployed as a managed inference endpoint on Bailian. OpenSearch acts as the retrieval orchestrator, applying custom analyzers, BM25 tuning, and ML ranking to score chunks. The application queries OpenSearch for context, then forwards it to the Bailian endpoint for generation.
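The query-then-generate flow described above can be sketched in plain Python without calling any service. The helper names (build_hybrid_query, extract_chunks, build_prompt) are hypothetical and not part of any SDK. The request body mirrors the hybrid search request shown above, and the response shape (hits.hits[]._source) is assumed to follow the usual search-API layout. Sending the request and calling the Bailian endpoint are deliberately left out.

```python
from typing import Any


def build_hybrid_query(
    question: str,
    vector: list[float],
    k: int = 5,
    text_boost: float = 0.35,
    vector_boost: float = 0.65,
) -> dict[str, Any]:
    """Build the bool/should body that combines BM25 text match and k-NN."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Lexical (BM25) clause on the analyzed text field.
                    {"match": {"content": {"query": question, "boost": text_boost}}},
                    # Vector clause on the embedding field.
                    {"knn": {"embedding": {"vector": vector, "k": k, "boost": vector_boost}}},
                ]
            }
        }
    }


def extract_chunks(response: dict[str, Any], field: str = "content") -> list[str]:
    """Pull the text of each hit, in ranked order (assumed response shape)."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_source"][field] for hit in hits]


def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble the prompt in the 'Context: ...' / 'Q: ...' layout used above."""
    ctx = "\n".join(chunks)
    return f"Context: {ctx}\nQ: {question}"


if __name__ == "__main__":
    # Stand-in for a real search response; no network call is made here.
    fake_response = {
        "hits": {"hits": [
            {"_source": {"content": "Chunk one."}},
            {"_source": {"content": "Chunk two."}},
        ]}
    }
    body = build_hybrid_query("How do I rotate keys?", [0.1, 0.2, 0.3])
    prompt = build_prompt(extract_chunks(fake_response), "How do I rotate keys?")
    print(body)
    print(prompt)
```

Keeping the boosts as parameters makes it easy to sweep the text/vector ratio during relevance evaluation instead of hard-coding 0.35/0.65.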
Training instances: ecs.gn7i series.
Tokenization: ik_max_word or a shared custom tokenizer.
Relevance: tune k1/b and use explicit boost ratios (e.g., 0.35/0.65).
Latency: set min_instances=1 in bailian.deploy_model to guarantee sub-500ms latency.

Q: How do I build a custom RAG pipeline and tune its search relevance and retrieval ranking?
A: Build the pipeline by fine-tuning an LLM on PAI, deploying it via Bailian, training custom embeddings, and using OpenSearch as the vector store. Then tune the OpenSearch retrieval layer to maximize answer quality by adjusting its BM25 weights, tokenization, named entity recognition, and ranking models. These capabilities are implemented through integrated product combinations such as full-custom-rag-custom-llm-custom-embeddings and opensearch-optimize-relevance.
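To make the effect of the k1 and b settings mentioned above concrete, here is a minimal sketch of the textbook BM25 score for a single query term. The function name bm25_term_score is hypothetical, and the formula is the standard one from the literature. The search engine's own implementation may differ in constant factors, so treat the numbers as illustrative only.

```python
import math


def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 contribution of one term to one document's score."""
    # Rare terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalized (b=0 disables it).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


if __name__ == "__main__":
    common = dict(tf=3, avg_doc_len=200, n_docs=10_000, doc_freq=50)
    for b in (0.0, 0.75):
        short = bm25_term_score(doc_len=100, b=b, **common)
        long_ = bm25_term_score(doc_len=400, b=b, **common)
        print(f"b={b}: short chunk {short:.3f}, long chunk {long_:.3f}")
```

With b=0 the two chunks score the same. Raising b favors shorter chunks, which matters when chunk sizes in the index vary widely.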