Deploy AI models (embeddings and LLM) on Alibaba Cloud Linux for inference serving, then deploy a RAG application using Elasticsearch as the vector knowledge base that calls these models to build an end-to-end enterprise AI chatbot with retrieval-augmented generation.
Developers use this workflow when they need a self-hosted, enterprise-grade RAG chatbot that processes proprietary documents without sending data to third-party APIs. By deploying embedding and LLM models directly on Alibaba Cloud Linux and using Elasticsearch as a low-latency vector knowledge base, teams achieve full data sovereignty, customizable retrieval pipelines, and scalable inference.
alinux3 image. Install NVIDIA drivers and Docker: sudo yum install -y nvidia-driver docker-ce.``bash docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \ --model Qwen/Qwen-7B-Chat --tensor-parallel-size 1 --api-key sk-xxx ``
dense_vector mapping for 1024-dim embeddings:``json PUT /rag-knowledge { "mappings": { "properties": { "text": { "type": "text" }, "embedding": { "type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine" } } } ``
http://localhost:8000/v1/embeddings, and bulk-index into ES:``python es = elasticsearch.Elasticsearch("https://<es-endpoint>:9200", basic_auth=("elastic", "<pwd>")) es.bulk(index="rag-knowledge", operations=[{"index": {"_id": i}}, {"text": chunk, "embedding": vec}] for i, (chunk, vec) in enumerate(batch)) ``
knn search using the query embedding:``json POST /rag-knowledge/_search { "knn": { "field": "embedding", "query_vector": [0.12, ...], "k": 5, "num_candidates": 100 } } ``
http://localhost:8000/v1/chat/completions with temperature: 0.1 and max_tokens: 2048.Alibaba Cloud Linux hosts the inference endpoints (LLM + embeddings) via GPU-accelerated containers. The RAG application acts as the orchestrator: it routes user queries to the local embedding API, performs vector similarity search against Elasticsearch’s knn engine, and feeds the top-k retrieved passages back to the local LLM for answer generation. Elasticsearch exclusively manages document storage, metadata filtering, and sub-millisecond vector retrieval.
knn plugin activeelasticsearch, requests, and tiktoken installeddense_vector mapping; otherwise, knn queries fail with mapper_parsing_exception.CUDA_VISIBLE_DEVICES isolation causes OOM crashes.verify_certs=False or mount CA bundles.max_tokens and chunk-size caps (≤512 tokens).