Custom RAG with Edge-Optimized Model Serving

Train custom embedding models on PAI and orchestrate a RAG application in Bailian over data held in OpenSearch, Elasticsearch, or OSS. Deploy the generative inference endpoint on PAI's managed serving (PAI-EAS) with auto-scaling and model versioning, and front it with a Cloudflare Worker edge proxy that adds caching and rate limiting for low-latency global API delivery.

Products involved

Scenario

Use this architecture when off-the-shelf embeddings fail to capture domain-specific jargon or compliance requirements, and you require globally distributed, low-latency RAG inference. By training custom embeddings on PAI, orchestrating retrieval via Bailian across OpenSearch/OSS, and fronting PAI-EAS generative endpoints with a Cloudflare edge proxy, you achieve high-precision, auto-scaled AI delivery with built-in caching and rate limiting.

Integration steps

  1. Stage raw data in OSS: Upload domain documents with ossutil cp -r ./data oss://<bucket>/raw/.
  2. Train custom embeddings on PAI: Mount the OSS bucket in a PAI-DSW notebook and submit the job: pai submit --job-name custom-emb --oss-path oss://<bucket>/raw/ --framework pytorch --instance-type ecs.gn6v-c8g1.2xlarge.
  3. Index vectors in OpenSearch: Export trained embeddings to OSS, then ingest via the _bulk API with a dense_vector mapping. Configure ingest-time inference so that new documents are embedded as they are indexed: PUT /_ingest/pipeline/vector-ingest {"processors": [{"inference": {"model_id": "custom-emb"}}]}.
  4. Deploy generative model on PAI-EAS: Package the LLM and deploy via eascmd create service.json, enabling auto-scaling: "AutoScaling": {"MinInstances": 2, "MaxInstances": 10, "TargetCpuUtilization": 70} and versioning via --version v1.
  5. Orchestrate RAG in Bailian: In the Bailian console, create an application, bind the OpenSearch index as the knowledge base, and set the PAI-EAS endpoint as the backend. Configure retrieval: top_k: 5, similarity_threshold: 0.75.
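  6. Deploy Cloudflare Worker edge proxy: Write a Worker to intercept /v1/chat/completions, cache responses with cache.put(cacheKey, response), and enforce limits via a Rate Limiting binding. The Workers Cache API stores only GET requests and takes its TTL from the response's Cache-Control header (for example, max-age=3600) rather than from an options argument, so derive a GET cacheKey from a hash of the POST body. Route to origin: fetch("https://<pai-eas-endpoint>/predict", { method: "POST" }).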
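
Steps 3 and 5 can be sketched in Python to make the data shapes concrete. This is a minimal illustration under stated assumptions, not the Bailian or OpenSearch implementation: the field names text and embedding, the index name kb, and the sample documents are hypothetical, and cosine similarity is assumed as the scoring function behind similarity_threshold. A real engine may scale its scores differently.

```python
import json
import math


def build_bulk_body(index, docs):
    """Build an NDJSON body for the _bulk API: an action line, then a source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # "text" and "embedding" are hypothetical field names for this sketch.
        lines.append(json.dumps({"text": doc["text"], "embedding": doc["embedding"]}))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, top_k=5, similarity_threshold=0.75):
    """Return ids of at most top_k docs scoring at or above the threshold, best first."""
    scored = [(cosine(query_vec, d["embedding"]), d) for d in docs]
    kept = [(s, d) for s, d in scored if s >= similarity_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d["id"] for _, d in kept[:top_k]]


# Made-up two-dimensional vectors, for illustration only.
docs = [
    {"id": "a", "text": "policy clause on data retention", "embedding": [1.0, 0.0]},
    {"id": "b", "text": "related retention guidance", "embedding": [0.9, 0.1]},
    {"id": "c", "text": "unrelated travel notice", "embedding": [0.0, 1.0]},
]

bulk_body = build_bulk_body("kb", docs)
matches = retrieve([1.0, 0.0], docs)
```

Applying the threshold before the top_k cut means a query with no sufficiently similar documents returns an empty context rather than five weak matches, which is usually the safer default for compliance-sensitive answers.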

Architecture

Raw documents flow from OSS into PAI for custom embedding training. The resulting vectors are persisted in OpenSearch/Elasticsearch for semantic search. Bailian acts as the orchestration layer: it queries OpenSearch for context and routes prompts to the PAI-EAS generative endpoint. The Cloudflare Worker sits at the edge, where TLS is terminated, and applies response caching and rate limiting before forwarding requests to Bailian/PAI-EAS. Alibaba Cloud Linux (Alinux) serves as the underlying runtime for PAI nodes and containerized inference services.
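
The edge layer's request path can be modelled in a few lines. The following Python sketch is not Worker code and does not use any Cloudflare API. It only illustrates the decision order assumed here: rate-limit check, then cache lookup keyed on a hash of the POST body, then a call to the origin. The EdgeProxy class, the fixed-window limiter, and the default limits are all hypothetical choices made for the example.

```python
import hashlib
import json
import time


class EdgeProxy:
    """In-memory model of an edge proxy: rate limit, then cache, then origin."""

    def __init__(self, origin, ttl_seconds=3600, limit=10, window_seconds=60,
                 clock=time.monotonic):
        self.origin = origin          # callable standing in for Bailian/PAI-EAS
        self.ttl = ttl_seconds
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._cache = {}              # cache key -> (expires_at, response)
        self._hits = {}               # client id -> (window_start, count)

    @staticmethod
    def cache_key(path, body):
        # Canonical JSON so that key order in the body does not change the hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return path + "#" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def _allow(self, client_id):
        # Fixed-window counter: at most `limit` requests per `window` seconds.
        now = self.clock()
        start, count = self._hits.get(client_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0
        if count >= self.limit:
            self._hits[client_id] = (start, count)
            return False
        self._hits[client_id] = (start, count + 1)
        return True

    def handle(self, client_id, path, body):
        if not self._allow(client_id):
            return 429, None
        key = self.cache_key(path, body)
        now = self.clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return 200, entry[1]      # cache hit: origin is not contacted
        response = self.origin(body)
        self._cache[key] = (now + self.ttl, response)
        return 200, response


# Usage with a stub origin; a real deployment would call the inference endpoint.
proxy = EdgeProxy(origin=lambda body: {"answer": "stub"})
status, payload = proxy.handle("client-1", "/v1/chat/completions", {"q": "hello"})
```

Checking the rate limit before the cache means cached responses still count against a client's quota. Reversing the order would let cache hits bypass the limit, which may be preferable when the goal is only to protect the origin.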

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I train custom embeddings, orchestrate a RAG application, and deploy it with edge-optimized inference?
A: Train the custom embedding models on PAI and orchestrate the RAG application in Bailian over OpenSearch, Elasticsearch, or OSS. Deploy the generative inference endpoint on PAI’s managed serving with auto-scaling and model versioning, and place a Cloudflare Worker edge proxy in front of it to provide low-latency global delivery with caching and rate limiting.