Gangsta AI

Inside an AI Aggregator: Fanning Out to 30+ Models at Once

2026-06-30

Querying one LLM is a POST request. Querying thirty of them, in parallel, and rendering the answers as they stream in — without one slow provider stalling the whole page or one dead API key taking down the request — is an architecture problem. Here's how a real multi-model aggregator holds together.

1. A provider abstraction that hides the chaos

Every provider has its own auth, request shape, streaming format, and failure modes. The first job is a normalization layer: one internal queryModel(prompt, model, attachments) interface, and per-provider adapters that translate to/from each API. New model next month? Add an adapter, not a rewrite.
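One way to picture the normalization layer is a small registry of adapters behind a single entry point. This is a minimal sketch, not the production code: only the queryModel(prompt, model, attachments) signature comes from the text above, while ProviderAdapter, AdapterRegistry, ModelResult, and the echo-small stub are hypothetical names used for illustration.

```typescript
// Hypothetical shapes; only queryModel(prompt, model, attachments) is from the article.
interface Attachment { mimeType: string; data: Uint8Array }
interface ModelResult { model: string; text: string }

interface ProviderAdapter {
  // Model ids this adapter knows how to serve.
  readonly models: readonly string[];
  // Translates the internal call into the provider's own request and back.
  query(prompt: string, model: string, attachments: Attachment[]): Promise<ModelResult>;
}

class AdapterRegistry {
  private byModel = new Map<string, ProviderAdapter>();

  // Supporting a new model means registering one more adapter.
  register(adapter: ProviderAdapter): void {
    for (const model of adapter.models) this.byModel.set(model, adapter);
  }

  // The single internal interface the rest of the system calls.
  async queryModel(prompt: string, model: string, attachments: Attachment[] = []): Promise<ModelResult> {
    const adapter = this.byModel.get(model);
    if (!adapter) throw new Error(`No adapter registered for model "${model}"`);
    return adapter.query(prompt, model, attachments);
  }
}

// Stub adapter standing in for a real provider client (assumption for the demo).
const echoAdapter: ProviderAdapter = {
  models: ["echo-small"],
  async query(prompt, model) {
    return { model, text: `echo: ${prompt}` };
  },
};

const registry = new AdapterRegistry();
registry.register(echoAdapter);
```

Callers only ever see registry.queryModel(...), so provider-specific auth and wire formats stay inside each adapter.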

2. Fan-out with independent failure

The core move is a parallel fan-out where each model call is isolated. One provider returning a 429 or a 500 must not reject the batch. In practice that means each call is its own promise with its own timeout and its own try/catch, and the UI renders whatever comes back — a grid of cards that fill in as each model responds, with failures shown as "this model errored" rather than a blank screen.
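The isolation described above can be sketched as follows. It assumes a hypothetical QueryFn that returns a model's full answer as a string; CardState, withTimeout, and fanOut are illustrative names, not the product's API. Each call catches its own errors, so the combined promise never rejects.

```typescript
type CardState =
  | { model: string; status: "ok"; text: string }
  | { model: string; status: "error"; reason: string };

// Hypothetical signature for a single model call.
type QueryFn = (prompt: string, model: string) => Promise<string>;

// Reject if the call has not settled within `ms`. The timer is cleared either way.
function withTimeout<T>(call: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    call.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function fanOut(
  prompt: string,
  models: string[],
  query: QueryFn,
  timeoutMs: number,
  onCard: (card: CardState) => void = () => {},
): Promise<CardState[]> {
  const calls = models.map(async (model): Promise<CardState> => {
    let card: CardState;
    try {
      // Own timeout and own try/catch per model.
      const text = await withTimeout(query(prompt, model), timeoutMs);
      card = { model, status: "ok", text };
    } catch (err) {
      card = { model, status: "error", reason: err instanceof Error ? err.message : String(err) };
    }
    onCard(card); // the UI can paint this card immediately
    return card;
  });
  // Safe to use Promise.all: every call resolves to a card, even on failure.
  return Promise.all(calls);
}
```

Note that this timeout only stops waiting; it does not cancel the underlying request. A real adapter would likely also pass an AbortSignal to the HTTP client so timed-out calls stop consuming resources.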

3. Streaming without head-of-line blocking

Users will forgive latency they can watch. Stream every model independently so the fast ones (a small, cheap model) paint in under a second while a heavy reasoning model is still thinking. The trick is not to gate rendering on the slowest response.
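A minimal sketch of independent streaming, assuming each adapter exposes its response as an AsyncIterable of text pieces (an assumption; real provider streaming formats differ and would be normalized by the adapter). The names pump, streamAll, and fakeStream are hypothetical.

```typescript
type Chunk = { model: string; delta: string };
type OnChunk = (chunk: Chunk) => void;

// Drain one model's stream, forwarding each piece as soon as it arrives.
// Errors are caught here so one broken stream cannot stop the others.
async function pump(model: string, stream: AsyncIterable<string>, onChunk: OnChunk): Promise<void> {
  try {
    for await (const delta of stream) onChunk({ model, delta });
  } catch {
    onChunk({ model, delta: "[stream failed]" });
  }
}

// Start every pump at once; nothing waits on the slowest model.
async function streamAll(streams: Map<string, AsyncIterable<string>>, onChunk: OnChunk): Promise<void> {
  const pumps: Promise<void>[] = [];
  streams.forEach((stream, model) => pumps.push(pump(model, stream, onChunk)));
  await Promise.all(pumps);
}

// Stand-in for a provider stream: emits tokens with a fixed delay before each one.
async function* fakeStream(tokens: string[], delayMs: number): AsyncGenerator<string> {
  for (const token of tokens) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    yield token;
  }
}
```

Because each pump runs on its own, a fast model's chunks reach onChunk while a slower model has yet to produce its first token. The order of registration does not affect the order of rendering.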

4. Cost control as a first-class concern

Thirty models per query gets expensive fast. Guardrails that matter: per-request and per-user cost caps, a hard ceiling that halts fan-out, cheaper default model sets with elite models behind a higher tier, and token accounting per provider (tokens ÷ 1000 × the provider's rate per 1,000 tokens) tracked in real time, not reconciled at month-end.
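The accounting formula and the hard ceiling can be sketched like this. The split into input and output rates, the rate values used in any example, and the names Rate, costUsd, RequestBudget, and selectWithinBudget are assumptions for illustration; real price sheets differ by provider and change over time.

```typescript
// Hypothetical price entry, in USD per 1,000 tokens.
interface Rate { inputPer1k: number; outputPer1k: number }

// tokens / 1000 * rate, applied separately to input and output tokens.
function costUsd(inputTokens: number, outputTokens: number, rate: Rate): number {
  return (inputTokens / 1000) * rate.inputPer1k + (outputTokens / 1000) * rate.outputPer1k;
}

class RequestBudget {
  private spent = 0;
  private readonly capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // Reserve an estimated cost; returns false if it would push spend past the cap.
  tryReserve(estimateUsd: number): boolean {
    if (this.spent + estimateUsd > this.capUsd) return false;
    this.spent += estimateUsd;
    return true;
  }

  get spentUsd(): number {
    return this.spent;
  }
}

// Walk candidates in priority order and halt fan-out at the first one that does not fit.
function selectWithinBudget(
  candidates: { model: string; estimateUsd: number }[],
  budget: RequestBudget,
): string[] {
  const selected: string[] = [];
  for (const candidate of candidates) {
    if (!budget.tryReserve(candidate.estimateUsd)) break; // hard ceiling reached
    selected.push(candidate.model);
  }
  return selected;
}
```

For example, with assumed rates of 0.5 and 1.5 USD per 1,000 input and output tokens, a call using 2,000 input tokens and 500 output tokens costs 2 × 0.5 + 0.5 × 1.5 = 1.75 USD. A per-user cap could reuse the same class with a longer-lived instance.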

5. Vision, files, and the lowest common denominator

The moment you accept image or document attachments, you inherit every provider's quirks. Not all models decode the same formats (HEIC is a notorious offender). Normalize attachments server-side at the ingest chokepoint — convert once, before fan-out — so every provider receives something it can actually read, and set a vision-capable fallback for models that choke.
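The "convert once, before fan-out" step might look like the sketch below. The set of widely accepted MIME types is an assumption, and the actual image conversion is left to an injected, hypothetical Converter function (for instance one backed by an image library), since the article does not say which tool performs it.

```typescript
interface Attachment { mimeType: string; data: Uint8Array }

// Hypothetical converter, e.g. HEIC in, JPEG out. Supplied by the caller.
type Converter = (input: Attachment) => Promise<Attachment>;

// Assumption: formats treated here as readable by every provider.
const WIDELY_ACCEPTED = new Set<string>(["image/jpeg", "image/png"]);

// Runs at ingest, once per request, so every provider receives the same bytes.
async function normalizeAttachments(attachments: Attachment[], convert: Converter): Promise<Attachment[]> {
  return Promise.all(
    attachments.map(async (attachment) => {
      // Already in a common format: pass through untouched.
      if (WIDELY_ACCEPTED.has(attachment.mimeType)) return attachment;
      const converted = await convert(attachment);
      if (!WIDELY_ACCEPTED.has(converted.mimeType)) {
        throw new Error(`converter returned unsupported type ${converted.mimeType}`);
      }
      return converted;
    }),
  );
}
```

Doing this before fan-out means the conversion cost is paid once per attachment rather than once per model.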

6. Fallbacks all the way down

Every model gets a safeFallback. If the primary errors, retry on a comparable model of the same capability class (vision → vision, reasoning → reasoning). A text-only fallback for a vision task cannot read the attachment, so it just fails differently.
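A sketch of capability-matched fallback, under stated assumptions: safeFallback is the name used in the text, while ModelSpec, Capability, the catalog shape, and queryWithFallback are hypothetical. The fallback is attempted only when it belongs to the same capability class as the primary.

```typescript
type Capability = "text" | "vision" | "reasoning";

interface ModelSpec {
  id: string;
  capability: Capability;
  safeFallback?: string; // id of a comparable model, if one is configured
}

async function queryWithFallback(
  modelId: string,
  catalog: Map<string, ModelSpec>,
  query: (id: string) => Promise<string>,
): Promise<{ servedBy: string; text: string }> {
  const primary = catalog.get(modelId);
  if (!primary) throw new Error(`unknown model ${modelId}`);
  try {
    return { servedBy: primary.id, text: await query(primary.id) };
  } catch (primaryError) {
    const fallback = primary.safeFallback ? catalog.get(primary.safeFallback) : undefined;
    // Refuse a mismatched fallback: a text-only model cannot serve a vision task.
    if (!fallback || fallback.capability !== primary.capability) throw primaryError;
    return { servedBy: fallback.id, text: await query(fallback.id) };
  }
}
```

Returning servedBy lets the UI label the card honestly when the answer came from the fallback rather than the model the user selected. This sketch retries once; a chain of fallbacks would need a guard against cycles.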

Why build it at all?

Because the alternative, humans tab-switching between four subscriptions to sanity-check one answer, doesn't scale. Gangsta AI is this architecture in production: one prompt, fanned out to 30+ models, streamed back side by side. The hard parts were never the API calls. They were isolation, streaming, cost control, and normalizing attachments to a lowest common denominator. Get those right and the "compare every AI" product falls out of the architecture.

