AI Model with Edge API Gateway

Deploy an AI inference model on Alibaba Cloud Linux (Alinux) as the backend, then deploy a Cloudflare Worker in front of it as an edge proxy and API gateway. The Worker routes client requests to the model endpoint and caches responses, giving clients low-latency global access.

Products involved

Alibaba Cloud Linux (Alinux)
Cloudflare Workers

Scenario

Use this integration when you need to serve compute-heavy AI inference models (e.g., Qwen-7B) globally with low-latency routing. Alibaba Cloud Linux handles GPU/CPU inference, while a Cloudflare Worker acts as a distributed edge API gateway to cache responses, enforce rate limits, and securely route traffic.
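The rate limiting mentioned above can be sketched as a fixed-window counter. This is only an illustration: the class name `FixedWindowLimiter` and its parameters are hypothetical, and the counter lives in the memory of a single Worker isolate, so it is not shared across Cloudflare locations. A shared limit would need shared state (for example a Durable Object) or Cloudflare's own rate limiting features.

```typescript
// Fixed-window request counter keyed by a client identifier (for example, the client IP).
// Assumption: state is per-isolate and resets whenever the isolate is recycled.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly counts = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if a request from `clientId` at time `now` (in milliseconds) is allowed.
  allow(clientId: string, now: number): boolean {
    const entry = this.counts.get(clientId);
    // Start a new window if none exists or the current one has expired.
    if (entry === undefined || now - entry.windowStart >= this.windowMs) {
      this.counts.set(clientId, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

Inside a Worker, a rejected request would typically be answered with HTTP 429 before any call to the backend is made.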

Integration steps

Provision Alinux Backend: Launch an instance running Alibaba Cloud Linux 3.2104 LTS 64-bit. SSH in and install NVIDIA drivers + CUDA for GPU acceleration.
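Deploy the AI Model: Run pip install vllm, then vllm serve Qwen/Qwen-7B-Chat --host 0.0.0.0 --port 8000 --tensor-parallel-size 2. The --tensor-parallel-size 2 flag assumes the instance has two GPUs; omit it on a single-GPU instance. Qwen-7B-Chat ships custom code on Hugging Face, so vLLM may also need --trust-remote-code. The server exposes an OpenAI-compatible REST API at http://<alinux-ip>:8000/v1/completions.
Initialize Worker: Run npm create cloudflare@latest ai-gateway -- --type worker and cd ai-gateway. Template names vary between create-cloudflare versions; if the --type value is rejected, pick the basic Worker template from the interactive prompt.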
Write Proxy Logic: In src/index.ts, implement edge routing and caching:

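```typescript
interface Env {
  BACKEND: string; // backend hostname from [vars]
  KEY: string; // shared secret forwarded as X-Auth
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const cache = caches.default;
    // The Cache API stores only GET requests, so key each POST by a digest of its body.
    const body = await request.text();
    const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(body));
    const hash = [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
    const cacheKey = new Request(new URL(`/__cache/${hash}`, request.url).toString());

    const hit = await cache.match(cacheKey);
    if (hit) return hit;

    // Copy headers with the Headers constructor; object spread on Headers copies nothing.
    const headers = new Headers(request.headers);
    headers.set('X-Auth', env.KEY);
    const upstream = await fetch(`https://${env.BACKEND}/v1/completions`, {
      method: request.method,
      headers,
      body: request.method === 'GET' || request.method === 'HEAD' ? undefined : body,
    });

    // Re-wrap so headers are mutable, then cache a clone because put() consumes the body.
    const res = new Response(upstream.body, upstream);
    res.headers.set('Cache-Control', 'public, max-age=30');
    if (res.ok) ctx.waitUntil(cache.put(cacheKey, res.clone()));
    return res;
  },
};
```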

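Configure Environment: In wrangler.toml, set name = "ai-gateway" and main = "src/index.ts", and add [vars] for BACKEND and KEY. BACKEND is the backend hostname that the Worker reaches over HTTPS. For anything beyond local testing, store KEY with wrangler secret put KEY instead of a plain-text var, since it authenticates the Worker to the backend.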
Deploy: Run wrangler deploy and bind a custom domain via the Cloudflare dashboard.
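
Once the steps above are complete, the deployment can be smoke-tested from any client. The sketch below only builds the request; the hostname `ai-gateway.example.com` is a placeholder for whatever custom domain was bound in the Deploy step, and the body fields (`model`, `prompt`, `max_tokens`) assume the OpenAI-style completions format that vLLM serves.

```typescript
// Placeholder hostname (assumption): substitute the custom domain bound to the Worker.
const GATEWAY_URL = "https://ai-gateway.example.com/v1/completions";

type CompletionRequest = {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
};

// Build a completion request for the gateway without sending it.
function buildCompletionRequest(prompt: string, maxTokens: number = 64): CompletionRequest {
  return {
    url: GATEWAY_URL,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The model name is assumed to match the one passed to `vllm serve`.
    body: JSON.stringify({ model: "Qwen/Qwen-7B-Chat", prompt, max_tokens: maxTokens }),
  };
}
```

Passing the result to fetch(url, { method, headers, body }) sends the request. Repeating the identical call within the cache lifetime should be answered from the edge cache rather than by the backend.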

Architecture

Client requests reach the nearest Cloudflare PoP, where Cloudflare terminates TLS. The Worker checks that location's edge cache for an identical prompt. On a miss, it forwards the request over HTTPS to the Alinux backend, where the vLLM engine generates tokens and returns JSON. The Worker caches the payload, strips internal headers, and returns it to the client. Because clients only ever talk to Cloudflare, the backend IP address stays hidden from them.
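
What counts as an "identical prompt" depends on how the cache key is derived. The following is a minimal sketch, assuming OpenAI-style body fields; the function names and the `/__cache/` prefix are hypothetical. It serializes the body with sorted keys so that field order does not matter, skips streaming requests, and keeps the `user` field in the key so that requests from different users never share an entry.

```typescript
// Assumed subset of an OpenAI-style completion body; only the fields used here.
type CompletionBody = {
  model?: string;
  prompt?: string;
  max_tokens?: number;
  temperature?: number;
  stream?: boolean;
  user?: string;
};

// Serialize with sorted keys so that {a, b} and {b, a} produce the same string.
function canonicalize(body: CompletionBody): string {
  const entries = Object.entries(body)
    .filter(([, value]) => value !== undefined)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return JSON.stringify(entries);
}

// Assumption: streamed responses are passed through and never cached.
function isCacheable(body: CompletionBody): boolean {
  return body.stream !== true;
}

// Return a synthetic path to use as the cache key, or null to bypass the cache.
// A real Worker would likely hash the canonical string (for example with SHA-256)
// to keep keys short.
function cacheKeyPath(body: CompletionBody): string | null {
  if (!isCacheable(body)) return null;
  return "/__cache/" + encodeURIComponent(canonicalize(body));
}
```

Two requests that differ only in field order map to the same key, while changing any field, including `user`, yields a different key.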

Prerequisites

Alibaba Cloud account with Alinux 3.2104 LTS instance (GPU recommended)
Cloudflare account with wrangler CLI installed globally
Node.js 18+ environment
Pre-downloaded model weights or Hugging Face token
Public IP/NAT gateway for Alinux

Common pitfalls

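Worker Timeouts: Worker limits are expressed in CPU time (30 s by default on paid plans, far lower on the free plan), and time spent waiting on the backend fetch does not count toward that limit. Long generations can still run into connection timeouts between the client, Cloudflare, and the backend, so stream the response (text/event-stream) to keep bytes flowing. Workers Unbound, the older usage model for long-running Workers, has been superseded by the Standard usage model; check the current Workers limits documentation before planning around it.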
Over-Caching: Caching identical prompts works, but caching user-specific responses causes data leakage. Vary cache keys by session or disable caching for non-idempotent routes.
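Unrestricted Backend: Failing to restrict the Alinux security group to Cloudflare IP ranges (https://www.cloudflare.com/ips/) exposes your inference endpoint to direct scraping. Those ranges are shared by all Cloudflare customers, so the backend should also verify the X-Auth header that the Worker sends.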
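Missing CORS: Browsers send an OPTIONS preflight before a cross-origin POST with Content-Type: application/json. The Worker must answer that preflight and return Access-Control-Allow-Origin, along with the allowed methods and headers, or browsers will block calls to the proxy.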
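
The CORS pitfall above can be handled at the edge before any backend call. This is a sketch under stated assumptions: the allow-list origin `https://app.example.com` and the function names are hypothetical, and the allowed methods and headers should be adjusted to what your clients actually send.

```typescript
// Hypothetical allow-list; replace with the origins of your own front ends.
const ALLOWED_ORIGINS = new Set<string>(["https://app.example.com"]);

// Build CORS response headers for the given Origin header value.
// Returns an empty object when the origin is missing or not allowed.
function corsHeaders(origin: string | null): Record<string, string> {
  if (origin === null || !ALLOWED_ORIGINS.has(origin)) return {};
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Methods": "POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type, Authorization",
    "Access-Control-Max-Age": "86400",
    "Vary": "Origin",
  };
}

// Answer a preflight directly; return null for any non-OPTIONS request
// so the caller can continue with normal proxying.
function handlePreflight(
  method: string,
  origin: string | null,
): { status: number; headers: Record<string, string> } | null {
  if (method !== "OPTIONS") return null;
  const headers = corsHeaders(origin);
  // 204 with CORS headers for allowed origins, 403 otherwise.
  return Object.keys(headers).length > 0
    ? { status: 204, headers }
    : { status: 403, headers: {} };
}
```

In the Worker, the result of handlePreflight(request.method, request.headers.get("Origin")) can be passed as the init argument of new Response(null, ...) when it is not null. The same corsHeaders output should also be set on the proxied response so that the browser accepts the actual POST.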

Typical questions

deploy AI model with edge proxy
deploy model behind cloudflare worker
model serving with CDN gateway
deploy AI inference with global routing
deploy an AI model and add an edge proxy
model deployment together with cloudflare worker
deploy model with API gateway
serve model at edge