⬡
Shard-Based Execution
Infinite Memory via Sharding
Transformer models that exceed available VRAM are partitioned into pipeline stages
and streamed across multiple browser tabs running in parallel; each tab owns a
contiguous slice of the model graph. Inter-tab latency is amortized across the
prefill and decode phases, sidestepping the local GPU memory ceiling.
// Tab 0 → Layers 0–11 (embed + first half)
// Tab 1 → Layers 12–23 (second half + LM head)
BroadcastChannel("shard-grid") → activations
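A minimal sketch of the layer partitioning described above. `partitionLayers`, the `Shard` shape, and the `"shard-grid"` channel name are illustrative assumptions, not a published API; the BroadcastChannel hand-off is shown in comments since it only runs in a browser context.

```typescript
// Hypothetical sketch: split a model's layers into contiguous shards, one per tab.
interface Shard {
  tab: number;
  start: number; // first layer index (inclusive)
  end: number;   // last layer index (exclusive)
}

function partitionLayers(numLayers: number, numTabs: number): Shard[] {
  const base = Math.floor(numLayers / numTabs);
  const extra = numLayers % numTabs; // leftover layers spread over the first tabs
  const shards: Shard[] = [];
  let start = 0;
  for (let tab = 0; tab < numTabs; tab++) {
    const size = base + (tab < extra ? 1 : 0);
    shards.push({ tab, start, end: start + size });
    start += size;
  }
  return shards;
}

// In a browser, each tab would forward its output activations to the next stage:
//   const grid = new BroadcastChannel("shard-grid");
//   grid.postMessage({ fromTab: 0, activations });
```

For the two-tab example above, `partitionLayers(24, 2)` yields the 0–11 / 12–23 split.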
⇄
Model Swarming
Torrent-Style Weight Distribution
Multi-gigabyte model weights are distributed over torrent-style WebRTC data
channels. Nodes that have already downloaded shards become seeders, sharply
reducing warm-up time across the fleet. All weights are cached in the Origin
Private File System (OPFS), so they persist across restarts and, where
navigator.storage.persist() is granted, resist storage eviction.
OPFS → persistent model cache
WebRTC DataChannel → shard transport
Seeder ratio ↑ → warm-up latency ↓
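One way to realise the seeding behaviour above is rarest-first shard selection, as in BitTorrent: fetch the shard held by the fewest peers so every piece spreads quickly. This sketch is an assumption about the scheduling policy; `nextShard` and its inputs are hypothetical names, and the actual WebRTC transfer is out of scope here.

```typescript
// Hypothetical rarest-first picker: given per-shard seeder counts and the set of
// shards we already hold, choose the scarcest shard still needed.
function nextShard(
  availability: Map<number, number>, // shard index → number of peers seeding it
  have: Set<number>                  // shard indices already downloaded locally
): number | null {
  let best: number | null = null;
  let bestSeeders = Infinity;
  for (const [shard, seeders] of availability) {
    if (have.has(shard) || seeders === 0) continue; // skip owned or unavailable
    if (seeders < bestSeeders) {
      best = shard;
      bestSeeders = seeders;
    }
  }
  return best; // null when every available shard is already held
}
```

Once a shard finishes downloading, the node adds itself to that shard's seeder set, which is what drives the seeder-ratio/warm-up relationship above.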
⚙
Unified Runtime
Heterogeneous Backends
A single unified scheduler dispatches tensor ops to the fastest backend
available on each device: WebGPU for discrete and integrated GPUs, WebNN for
NPU-accelerated hardware such as Apple Silicon's Neural Engine, and WebAssembly
SIMD as the universal fallback. One graph, any hardware, no specialised builds
required.
Backend priority: WebGPU → WebNN → Wasm
Runtime detection → optimal kernel dispatch
INT8 / FP16 quantisation chosen per backend capabilities
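The priority chain above can be sketched as a pure selection function over a capability probe. `selectBackend` and the `Capabilities` shape are illustrative assumptions; the real feature detection would happen in the browser as noted in the comment.

```typescript
type Backend = "webgpu" | "webnn" | "wasm";

interface Capabilities {
  webgpu: boolean; // in a browser: "gpu" in navigator
  webnn: boolean;  // in a browser: "ml" in navigator
}

// Walk the priority chain WebGPU → WebNN → Wasm and return the first hit.
// Wasm SIMD is treated as always present, since it is the universal fallback.
function selectBackend(caps: Capabilities): Backend {
  if (caps.webgpu) return "webgpu"; // discrete and integrated GPUs
  if (caps.webnn) return "webnn";   // NPU-backed execution
  return "wasm";                    // universal SIMD fallback
}
```

Kernel dispatch and quantisation format would then be keyed off the returned backend, matching the spec lines above.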
⟳
OOM Recovery
Unkillable Compute
Out-of-memory crashes are an expected system event, not an error. When a shard
worker tab is killed by the browser's memory-pressure heuristic, the orchestrator
transparently spawns a replacement in a hidden background tab, reloads the cached
weights from OPFS, and re-syncs pipeline state, all before the calling code
observes a timeout.
OOM detected → respawn hidden tab
OPFS cache → instant weight reload
Pipeline state sync → <500ms recovery
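The recovery sequence above can be sketched as a retry loop over injected hooks. `recoverShard`, `RecoveryHooks`, and the retry count are hypothetical names for illustration; the real spawn/reload/sync steps would call browser APIs, so they are passed in here rather than implemented.

```typescript
// Hypothetical hooks standing in for the orchestrator's real browser-side steps.
interface RecoveryHooks {
  spawnHiddenTab: (shard: number) => Promise<void>;
  reloadWeightsFromOPFS: (shard: number) => Promise<void>;
  syncPipelineState: (shard: number) => Promise<void>;
}

// Run the OOM-recovery sequence, retrying if the replacement tab is itself
// killed under memory pressure. Returns true once the shard is back in sync.
async function recoverShard(
  shard: number,
  hooks: RecoveryHooks,
  maxAttempts = 3
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await hooks.spawnHiddenTab(shard);
      await hooks.reloadWeightsFromOPFS(shard);
      await hooks.syncPipelineState(shard);
      return true; // shard rejoined the pipeline
    } catch {
      // replacement tab died too; loop and respawn again
    }
  }
  return false; // give up after maxAttempts; caller surfaces the failure
}
```

Bounding the retry loop is what lets the orchestrator keep recovery inside the caller's timeout budget rather than retrying forever.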