01 · System design

An LLM chatbot over 550M SKUs.

The system that turns "how many ergonomic chairs are in the West-region warehouse" into a BigQuery SQL plan, executes it, and streams a grounded answer back. Click any node to see what it does, how it fails, and where the latency lives.

02 · Live retrieval

Semantic search over this site.

Type a question — "Go production gateway" or "big data spark". Every project description, role, and skill on this site is indexed below with TF-IDF over normalised tokens, scored by cosine similarity at keystroke time. Honest about what it is — no LLM in the loop, no API call.

corpus 14 docs vocab — terms scored in — ms algorithm TF·IDF + cos

try production llm gateway in Go rag pipeline vertex ai big data spark airflow walmart inventory chatbot payment backend fintech

03 · Deep dive

Designing go-llm-gateway: a production proxy for LLM traffic.

Shubham Bakre · ~6 min read · Go 1.21 · zero external dependencies

Every team that puts an LLM in front of users eventually rebuilds the same five things in some form: auth, rate limiting, cost tracking, model failover, and structured logging. The fifth one is the only one anyone enjoys.

go-llm-gateway is the version of those five things I keep coming back to. It is intentionally small — under 800 lines of Go, no external dependencies beyond net/http — because the alternative is to ship a 40MB binary that re-implements bad versions of http.ServeMux. The whole point is that an LLM gateway is boring HTTP with a few decisions that have to be right.

1 · The shape of the request

Every request gets a single struct shape. The gateway accepts OpenAI-compatible POST /v1/chat/completions and rewrites model internally — so a downstream switch from gpt-4o to gemini-1.5-pro is a config change, not a client change.

// internal/proxy/request.go type ChatRequest struct { Model string `json:"model"` Messages []Message `json:"messages"` MaxTokens int `json:"max_tokens,omitempty"` Temperature float64 `json:"temperature,omitempty"` Stream bool `json:"stream,omitempty"` }

2 · Rate limiting — per-IP token bucket

A sync.Mutex-protected map of net.IP → *bucket is enough for a single instance handling thousands of RPS. The bucket leaks tokens at a configured rate; when it empties, requests return 429. No Redis dependency for v1 — and crucially, no distributed rate limit lying to you about its accuracy.

type bucket struct { tokens float64 last time.Time capacity float64 rate float64 // tokens per second } func (b *bucket) allow() bool { now := time.Now() elapsed := now.Sub(b.last).Seconds() b.tokens = min(b.capacity, b.tokens + elapsed*b.rate) b.last = now if b.tokens < 1 { return false } b.tokens-- return true }

3 · Cost tracking with atomics

Two atomic.Int64 counters per model — one for prompt tokens, one for completion tokens. Per-model cost is config; the gateway exposes GET /metrics with the running total in USD. No locks on the hot path. A background goroutine flushes to a logger every 10s.

4 · Failover

A request lists models in priority order: [gemini-1.5-pro, gpt-4o, claude-3-sonnet]. On any 5xx or context-deadline error from the primary, the gateway retries against the next model with the same payload, decrementing a budget. Once the budget hits zero, it returns the last upstream error verbatim. Failures should be loud, not hidden.

  ┌─ primary: gemini-1.5-pro     ─┐  5xx
  │                                  │   │
  └→ fallback: gpt-4o            ─┘   │  5xx
     │                                  │
     └→ fallback: claude-3-sonnet ─┘   │
        │                                  │
        └→ return last upstream error     ↓

5 · Structured logging

Every request emits one JSON line at the end of its lifecycle — request_id, model, route, latency, prompt_tokens, completion_tokens, cost_usd, status. That's the contract any observability tool can read. No log levels, no debug spew. One line in, one line out.

The unsexy decision underneath all of this: do not introduce a dependency until the absence of it is causing a real bug.

What's next

Streaming SSE pass-through — currently buffers the response; needs proper flushing for streamed completions.
Per-tenant budgets — the cost tracker is per-model, not per-customer. The right shape is (tenant_id, model) → counter.
Embedding & rerank routes — same proxy shape, different endpoints. Probably a 50-line addition.
Drop-in OpenTelemetry — tracing for the failover chain so the long-tail latency is visible.

The gateway has been the load-bearing piece in three different projects I've shipped. The code is public: github.com/shubhambakre/go-llm-gateway. If you find a sharp edge, file an issue — or better, send the patch.

· · ·

04 · Interactive Game

A 1-Minute Skillset Sprint.

A fully functional, retro synthwave 8-bit platformer game where you play as an Engineering Wizard, dodging bugs, collecting skillset items, and deploying to production before the sprint deadline. Built entirely on HTML5 Canvas using Web Audio API for custom audio synthesis.