DaaS / Products / Deploy Complete RAG System with AI Models

Deploy Complete RAG System with AI Models

Deploy AI models (embeddings and LLM) on Alibaba Cloud Linux for inference serving, then deploy a RAG application using Elasticsearch as the vector knowledge base that calls these models to build an end-to-end enterprise AI chatbot with retrieval-augmented generation.

Products involved

Scenario

Developers use this workflow when they need a self-hosted, enterprise-grade RAG chatbot that processes proprietary documents without sending data to third-party APIs. By deploying embedding and LLM models directly on Alibaba Cloud Linux and using Elasticsearch as a low-latency vector knowledge base, teams achieve full data sovereignty, customizable retrieval pipelines, and scalable inference.

Integration steps

  1. Prepare Alibaba Cloud Linux: Launch an ECS instance with alinux3 image. Install NVIDIA drivers and Docker: sudo yum install -y nvidia-driver docker-ce.
  2. Deploy Inference Server: Run vLLM containers for your LLM and embedding models:
  3. ``bash docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \ --model Qwen/Qwen-7B-Chat --tensor-parallel-size 1 --api-key sk-xxx ``

  4. Configure Elasticsearch Index: Create a vector index with dense_vector mapping for 1024-dim embeddings:
  5. ``json PUT /rag-knowledge { "mappings": { "properties": { "text": { "type": "text" }, "embedding": { "type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine" } } } ``

  6. Ingest & Embed Documents: Chunk documents, call http://localhost:8000/v1/embeddings, and bulk-index into ES:
  7. ``python es = elasticsearch.Elasticsearch("https://<es-endpoint>:9200", basic_auth=("elastic", "<pwd>")) es.bulk(index="rag-knowledge", operations=[{"index": {"_id": i}}, {"text": chunk, "embedding": vec}] for i, (chunk, vec) in enumerate(batch)) ``

  8. Build Retrieval Pipeline: Query ES with knn search using the query embedding:
  9. ``json POST /rag-knowledge/_search { "knn": { "field": "embedding", "query_vector": [0.12, ...], "k": 5, "num_candidates": 100 } } ``

  10. Synthesize Response: Pass top-k chunks to http://localhost:8000/v1/chat/completions with temperature: 0.1 and max_tokens: 2048.

Architecture

Alibaba Cloud Linux hosts the inference endpoints (LLM + embeddings) via GPU-accelerated containers. The RAG application acts as the orchestrator: it routes user queries to the local embedding API, performs vector similarity search against Elasticsearch’s knn engine, and feeds the top-k retrieved passages back to the local LLM for answer generation. Elasticsearch exclusively manages document storage, metadata filtering, and sub-millisecond vector retrieval.

Prerequisites

Common pitfalls

Typical questions