AI Model with Edge API Gateway

Deploy an AI inference model on Alibaba Cloud Linux (Alinux) as the backend, then put a Cloudflare Worker in front of it as an edge API gateway. The Worker routes client requests to the model endpoint and caches responses, so repeated prompts are answered from a location close to the client.

Products involved

Scenario

Use this integration when you need to serve compute-heavy AI inference models (e.g., Qwen-7B) globally with low-latency routing. Alibaba Cloud Linux handles GPU/CPU inference, while a Cloudflare Worker acts as a distributed edge API gateway to cache responses, enforce rate limits, and securely route traffic.
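As a rough illustration of the rate-limiting role mentioned above, the following is a minimal sketch of a fixed-window limiter. The class name `FixedWindowLimiter` and its parameters are hypothetical, and the state lives in the memory of a single Worker isolate, so the limit is approximate rather than global.

```typescript
// Fixed-window rate limiter: at most `limit` requests per client per window.
// Hypothetical helper; counts are kept in one isolate's memory only.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the client is over the limit.
  allow(clientId: string, now: number = Date.now()): boolean {
    const entry = this.windows.get(clientId);
    if (!entry || now - entry.start >= this.windowMs) {
      // First request from this client, or the previous window has expired.
      this.windows.set(clientId, { start: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Example: 2 requests per 1000 ms window, using explicit timestamps.
const limiter = new FixedWindowLimiter(2, 1000);
const decisions = [
  limiter.allow("203.0.113.7", 0),
  limiter.allow("203.0.113.7", 10),
  limiter.allow("203.0.113.7", 20), // third call inside the window is rejected
  limiter.allow("203.0.113.7", 1000), // a new window starts
];
// decisions → [true, true, false, true]
```

Inside a Worker, one option is to key the limiter on the `CF-Connecting-IP` request header and return a 429 response when `allow` returns false.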

Integration steps

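  1. Provision Alinux Backend: Launch an ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit. For GPU acceleration, SSH in and install the NVIDIA driver and CUDA. Step 2 uses `--tensor-parallel-size 2`, so the instance needs at least two GPUs.
  2. Deploy the AI Model: Run `pip install vllm`, then `vllm serve Qwen/Qwen-7B-Chat --trust-remote-code --host 0.0.0.0 --port 8000 --tensor-parallel-size 2`. The `--trust-remote-code` flag is needed because Qwen-7B-Chat ships a custom tokenizer. This exposes an OpenAI-compatible REST API at `http://<alinux-ip>:8000/v1/completions`. Because the server binds to `0.0.0.0`, restrict port 8000 in the instance's security group.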
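  3. Initialize Worker: Run `npm create cloudflare@latest ai-gateway -- --type hello-world --lang ts`, then `cd ai-gateway`. The accepted `--type` values depend on the create-cloudflare version; if the flag is rejected, omit the flags and pick a TypeScript Worker template at the interactive prompts.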
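  4. Write Proxy Logic: In `src/index.ts`, implement edge routing and caching. The completions endpoint takes POST requests, and the Workers Cache API only stores GET requests, so the cache key is a synthetic GET request derived from a hash of the request body:

     ```typescript
     export default {
       async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
         if (request.method !== 'POST') return new Response('Method Not Allowed', { status: 405 });
         const body = await request.text();

         // Key the cache on a SHA-256 hash of the body so identical prompts share an entry.
         const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(body));
         const hash = Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
         const cacheKey = new Request(new URL(`/cache/${hash}`, request.url).toString(), { method: 'GET' });

         const cache = caches.default;
         const hit = await cache.match(cacheKey);
         if (hit) return hit;

         // Copy the client headers and attach the backend credential.
         const headers = new Headers(request.headers);
         headers.set('X-Auth', env.KEY);
         const res = await fetch(`https://${env.BACKEND}/v1/completions`, { method: 'POST', headers, body });

         // Re-wrap the response so its headers are mutable, then cache successful replies for 30 s.
         const cached = new Response(res.body, res);
         cached.headers.set('Cache-Control', 'public, max-age=30');
         if (res.ok) ctx.waitUntil(cache.put(cacheKey, cached.clone()));
         return cached;
       }
     };
     ```

  5. Configure Environment: In `wrangler.toml`, set `name = "ai-gateway"`, `main = "src/index.ts"`, and a `compatibility_date`, and add `BACKEND` under `[vars]`. Set `BACKEND` to a hostname that serves the vLLM API over HTTPS (not the bare IP and port from step 2), because the Worker fetches `https://<BACKEND>/v1/completions`. Values under `[vars]` are stored as plain text, so store `KEY` as a secret with `npx wrangler secret put KEY` instead.
  6. Deploy: Run `npx wrangler deploy`, then bind a custom domain in the Cloudflare dashboard.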
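Once the gateway is deployed, a client call might look like the sketch below. The domain `ai-gateway.example.com` and the helper `buildCompletionCall` are placeholders for illustration; the payload fields (`model`, `prompt`, `max_tokens`, `temperature`) follow the OpenAI-style completions format that vLLM accepts.

```typescript
// Shape of an OpenAI-style completion request body.
interface CompletionPayload {
  model: string;
  prompt: string;
  max_tokens: number;
  temperature: number;
}

// Builds the URL and fetch options for one completion call through the gateway.
function buildCompletionCall(gatewayUrl: string, prompt: string) {
  const payload: CompletionPayload = {
    model: "Qwen/Qwen-7B-Chat", // should match the model name passed to `vllm serve`
    prompt,
    max_tokens: 64,
    temperature: 0, // greedy decoding, so repeated prompts stay consistent with cached replies
  };
  return {
    url: `${gatewayUrl.replace(/\/+$/, "")}/v1/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Hypothetical gateway domain; replace with the custom domain bound to the Worker.
const call = buildCompletionCall("https://ai-gateway.example.com/", "Hello");

// To send it from an async context:
//   const res = await fetch(call.url, call.init);
//   console.log(await res.json());
```

Sending the same prompt twice within the cache lifetime is a simple way to check whether the second response is served from the edge.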

Architecture

Client requests hit the nearest Cloudflare PoP, where Cloudflare terminates TLS. The Worker checks that PoP's edge cache for a response to an identical prompt. On a miss, it forwards the request over HTTPS to the Alinux backend, where the vLLM engine generates tokens and returns JSON. The Worker caches the payload, strips internal headers, and returns the response to the client. Because clients only ever talk to the Worker's domain, the backend IP address stays hidden.
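The header-stripping step described above could be sketched as follows. The deny-list contents and the helper name `stripInternalHeaders` are assumptions for illustration; which headers count as internal depends on the backend setup.

```typescript
// Hypothetical deny-list: headers the backend may emit that clients should not see.
const INTERNAL_HEADERS = new Set(["server", "x-auth", "x-request-id"]);

// Returns the header pairs that are safe to forward, comparing names case-insensitively.
function stripInternalHeaders(
  headers: Iterable<[string, string]>,
  denied: Set<string> = INTERNAL_HEADERS,
): [string, string][] {
  const kept: [string, string][] = [];
  for (const [name, value] of headers) {
    if (!denied.has(name.toLowerCase())) kept.push([name, value]);
  }
  return kept;
}

// Example with headers a backend might return.
const backendHeaders: [string, string][] = [
  ["Content-Type", "application/json"],
  ["Server", "uvicorn"],
  ["X-Request-Id", "abc123"],
];
const clientHeaders = stripInternalHeaders(backendHeaders);
// clientHeaders → [["Content-Type", "application/json"]]
```

Because a `Headers` object iterates as name/value pairs, the Worker could pass the backend response's `headers` to this helper and build the client response from the result.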

Prerequisites

Common pitfalls

Typical questions