Deploy an AI inference model on Alibaba Cloud Linux (Alinux) as the backend, then deploy a Cloudflare Worker as an edge proxy/API gateway that routes client requests to the model endpoint and caches responses, providing low-latency global access.
Use this integration when you need to serve compute-heavy AI inference models (e.g., Qwen-7B) globally with low-latency routing. Alibaba Cloud Linux handles GPU/CPU inference, while a Cloudflare Worker acts as a distributed edge API gateway to cache responses, enforce rate limits, and securely route traffic.
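1. On the Alinux instance, install and start vLLM: `pip install vllm && vllm serve Qwen/Qwen-7B-Chat --host 0.0.0.0 --port 8000 --tensor-parallel-size 2`. This exposes a REST API at `http://<alinux-ip>:8000/v1/completions`.
2. Scaffold the Worker: run `npm create cloudflare@latest ai-gateway -- --type worker` and `cd ai-gateway`.
3. In `src/index.ts`, implement edge routing and caching:

   ```typescript
   interface Env { BACKEND: string; KEY: string } // backend host and shared secret

   export default {
     async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
       const cache = caches.default;
       // The Cache API only stores GET requests, so key each POST by a hash of its body.
       const body = await request.text();
       const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(body));
       const hash = Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
       const cacheKey = new Request(`${new URL(request.url).origin}/cache/${hash}`);
       const hit = await cache.match(cacheKey);
       if (hit) return hit;

       // Copy the Headers object; spreading it into an object literal drops every header.
       const headers = new Headers(request.headers);
       headers.set('X-Auth', env.KEY);
       const res = await fetch(`https://${env.BACKEND}/v1/completions`, {
         method: request.method,
         headers,
         body: request.method === 'GET' || request.method === 'HEAD' ? undefined : body,
       });
       const cached = new Response(res.body, res); // re-wrap so headers are mutable
       cached.headers.set('Cache-Control', 'public, max-age=30');
       // Store a clone, because put() consumes the body it is given.
       if (res.ok) ctx.waitUntil(cache.put(cacheKey, cached.clone()));
       return cached;
     }
   };
   ```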
4. In `wrangler.toml`, set `name = "ai-gateway"`, `main = "src/index.ts"`, and add `[vars]` for `BACKEND` and `KEY`.
5. Run `wrangler deploy` and bind a custom domain via the Cloudflare dashboard.

Client requests hit the nearest Cloudflare PoP. The Worker checks the edge cache for identical prompts. On a miss, it forwards HTTPS requests to the Alinux backend. Alinux runs the vLLM engine, generates tokens, and returns JSON. The Worker caches the payload, strips internal headers, and returns it to the client, handling TLS termination and IP masking.
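The header-stripping step in the request flow can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the guide's actual implementation: the function name `stripInternalHeaders` and the header list are hypothetical, and the headers your backend really emits may differ.

```typescript
// Hypothetical list of backend-only headers; adjust to what your backend returns.
const INTERNAL_HEADERS: string[] = ['x-auth', 'server', 'x-request-id'];

// Return a copy of the given headers with the internal ones removed.
// Header names are case-insensitive, so 'X-Auth' and 'x-auth' match.
function stripInternalHeaders(
  source: Headers,
  internal: string[] = INTERNAL_HEADERS,
): Headers {
  const out = new Headers(source); // copy, so the original stays untouched
  for (const name of internal) {
    out.delete(name);
  }
  return out;
}

// Build the client-facing response from a backend response.
function toClientResponse(backend: Response): Response {
  return new Response(backend.body, {
    status: backend.status,
    statusText: backend.statusText,
    headers: stripInternalHeaders(backend.headers),
  });
}
```

Inside the Worker, one option is to call `toClientResponse(res)` on the backend response before caching it, so the cached copy never contains the internal headers either.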
- `wrangler` CLI installed globally
- […] `text/event-stream`) or upgrade to Workers Unbound.
- […] `cf.com/ips/`) exposes your inference endpoint to direct scraping.
- Handle `OPTIONS` preflight requests and forward `Content-Type: application/json`, or browser clients will block the proxy.
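The note on `OPTIONS` preflight requests can be illustrated with a short sketch. The helper names `handlePreflight` and `withCors` are hypothetical. The wildcard origin is an assumption for demonstration only, so you would likely restrict it to your own front-end origin.

```typescript
// Assumed CORS policy for illustration; tighten the origin for real deployments.
const CORS_HEADERS: Record<string, string> = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Methods': 'POST, OPTIONS',
  'Access-Control-Allow-Headers': 'Content-Type',
  'Access-Control-Max-Age': '86400',
};

// Answer a preflight directly at the edge; return null for any other method
// so the caller can continue with normal routing.
function handlePreflight(request: Request): Response | null {
  if (request.method !== 'OPTIONS') return null;
  return new Response(null, { status: 204, headers: CORS_HEADERS });
}

// Attach the CORS headers to a response coming back from the backend.
function withCors(response: Response): Response {
  const headers = new Headers(response.headers);
  for (const [name, value] of Object.entries(CORS_HEADERS)) {
    headers.set(name, value);
  }
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}
```

In a Worker's `fetch` handler, `handlePreflight(request)` would run first, and `withCors(...)` would wrap whatever response is returned to a browser client.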