FacialScore

Why We Built a Chip for Agent Meshes, Not for Large Models

CortexPod Engineering Blog — March 2026 Ho Chi Minh City / Singapore

“The H100 is the best computer ever built for a problem that enterprise AI is leaving behind.”

We get a version of the same question at every investor meeting, every technical review, every conference hallway conversation:

“Why build a new chip when you can just run on H100s?”

It is not a naive question. The H100 is genuinely extraordinary. 80GB HBM3e. 3.35TB/s memory bandwidth. NVLink 4.0 for multi-chip tensor parallelism. A CUDA ecosystem that has been refined across a dozen hardware generations and hundreds of thousands of engineer-years. NVIDIA has spent decades creating the most capable general-purpose AI accelerator ever built.

And yet here we are — three years into a company, burning through a seed round, trying to tape out a chip.

This post is the honest answer to that question. Not the investor deck version. The engineering version — the one that starts with the three walls we hit, walks through the specific reasons software cannot knock them down, and ends with why we believe the moment to build this chip is right now, not five years ago when the workload didn’t exist, and not five years from now when someone larger will build it instead.

Part 1: The Workload Changed. The Industry Hasn’t Caught Up.

Here is the sentence that contains the entire thesis: the dominant enterprise AI inference pattern of 2026 is structurally different from the pattern that GPU hardware was designed to serve.

This is not an opinion about future trends. It is an observation about what is running in production today.

The evolution of LLM inference — 2023 to 2026

What 2023 Looked Like

In 2023, the dominant inference workload was conceptually simple: one user sends one prompt to one large model and receives one response. The model is large — 70B, 175B parameters — and the challenge is moving its enormous weight matrix through a computation pipeline as fast as possible.

This is a computation problem. More FLOPS means faster generation. More memory bandwidth means faster weight loading. More HBM means larger models. Every architectural decision in the H100 — the 80GB HBM3e, the 3.35TB/s bandwidth, the massive tensor cores — is optimized for this problem.

The GPU was the right tool. It was designed for exactly this.

What 2026 Looks Like

Fast forward three years. The model is no longer the product. The mesh is the product.

A Vietnamese bank processing loan applications does not deploy one LLaMA-3 70B model. It deploys eight agents operating simultaneously on the same application:

Real agent mesh: Vietnamese bank loan processing pipeline

A Router (7B, <10ms SLA) that classifies the request and dispatches to the mesh
A Researcher (70B, <500ms) that extracts financial data from the 200-page application
A Data Fetcher (13B, <200ms) that queries the SBV regulatory database
A Fact Checker (13B, <300ms) that cross-validates every claim the Researcher made
A Compliance Auditor (13B, <200ms) that scans against Vietnamese banking guidelines — this one has a hard regulatory deadline
A Writer (70B, <1000ms) that synthesizes everything into a structured briefing
A Translator (13B, <300ms) that localizes the briefing to Vietnamese
A Summarizer (7B, <500ms) that produces the executive summary

All eight agents are active simultaneously. All eight reference the same 200-page source document. All eight hand off partial results to each other as they complete their work. The Researcher’s extracted data informs the Fact Checker. The Fact Checker’s validated claims inform the Writer. The Compliance Auditor runs in parallel with a hard deadline that cannot wait for the Writer.

This is not a research demo or a technical curiosity. This is production AI at a regulated financial institution. This is the workload that BFSI enterprises across Asia are deploying today, and it is growing at 48.5% CAGR — the fastest subsegment in the AI market.

And the GPU was not designed for it. Not because NVIDIA made a mistake. Because the workload did not exist when the H100 architecture was locked in.

Part 2: The Three Walls

Before we built CortexPod, we built agent-mesh systems on H100 clusters. We were not naive about the challenges — we knew GPUs were not purpose-built for this. What surprised us was how quickly the performance ceilings became absolute.

There is an important distinction between a software bottleneck and an architectural constraint. A software bottleneck can be optimized away — better scheduling, smarter caching, more efficient kernels. An architectural constraint is encoded in the hardware’s memory hierarchy or execution model. No amount of code changes what physics allows across PCIe.

We hit three architectural constraints. Each time, we optimized our way to a ceiling, and then the ceiling did not move.

Three walls no software optimization resolves

Wall One: Context Capacity

CUDA MPS on an H100 supports approximately 8 concurrent model contexts before memory pressure forces context swapping to host memory. This is not a driver limitation or a CUDA quirk. It is a consequence of HBM capacity.

Three 70B W4A8 models: 3 × 35GB = 105GB. An H100 has 80GB HBM. The math does not work. Before you have even considered KV caches, LoRA adapters, or activation buffers, you cannot hold three large models simultaneously on a single chip.

The eight-agent Vietnamese bank pipeline above requires holding eight model contexts active simultaneously. On H100:

Option A: over-provision hardware — 4–6 H100s just for weight footprint, adding $120K–$180K in hardware that sits mostly idle
Option B: serialize agents that should run in parallel — incurring latency penalties that compound across every sequential handoff

Neither option is acceptable for a production system with a compliance agent on a hard deadline. And neither is a software problem.

CortexChip’s answer is architectural: 256 hardware-managed Pod slots, each with dedicated 2MB SRAM allocation, backed by 96GB GDDR7. A 32-agent financial mesh runs natively on a single chip, no swapping required.

The wall: 8 practical contexts on H100. The CMFC’s answer: 256.

Wall Two: The 50–200ms Handoff

The second wall is the one that kills production SLAs.

When the Researcher agent in our bank pipeline completes its analysis and needs to hand off its accumulated context — the extracted financial data, the intermediate reasoning, the relevant document sections — to the Fact Checker, what happens?

On a GPU cluster, the full sequence is:

Serialize the KV cache to host RAM over PCIe. At PCIe 5.0 x16’s theoretical 64GB/s peak, moving a 4GB KV context takes approximately 62ms before any scheduling overhead. In practice, with contention and fragmentation, it is often longer.
Route through the software scheduler — a daemon running at OS context-switch frequency — to locate the Fact Checker’s context slot and determine where to put the incoming state.
Deserialize back from host RAM into the target agent’s VRAM allocation.
CUDA context switch — save the current executing model’s register state, flush relevant HBM pages, load the Fact Checker’s weights into active registers.

Under realistic load on a production H100 cluster with 32 agents competing for shared VRAM and PCIe bandwidth, this sequence takes 50–200ms per handoff.

Our pipeline has ten inter-agent exchanges per document. At 50ms each, that is 500ms of pure coordination overhead before a single output token reaches the loan officer. At 200ms each, it is two seconds.

We profiled this exhaustively. We tried better serialization. We tuned the scheduler. We experimented with pinned memory, huge pages, and NUMA-aware allocation. The ceiling did not move substantially, because the bottleneck was not in any of those places.

The bottleneck was the PCIe bus. Sixty-two milliseconds at theoretical peak. Physics.

The only way to eliminate this is to eliminate the PCIe hop for agent state transfers — which means the coordination fabric has to live on the same die as the inference compute. That is the CortexMesh Fabric Controller: an on-chip hardware block that transfers KV cache state between agent contexts at ~1TB/s fabric bandwidth, achieving <2ms P99 end-to-end handoff, with zero CPU involvement.

The 25–100× improvement is not a software optimization. It is what happens when you remove an order-of-magnitude hardware constraint from the critical path.

Wall Three: Computing the Same Thing Four Times

The third wall is the most wasteful, and the one that compounds most brutally at scale.

When four agents in the same mesh all process the same source document — and in our bank pipeline, they all do — each agent independently computes the full KV cache for that document. At W4A8 precision, the KV cache for a 200-page document with a 128K-token context on a 70B model runs approximately 21GB per agent.

Four agents × 21GB = 84GB of KV cache, of which 63GB is identical — it is the same document, processed through the same model architecture, producing the same KV tensors. The GPU cluster computes it four times and stores four copies.

On a single H100 with 80GB HBM: impossible without paging. On an 8×H100 NVL node with 640GB aggregate: it fits, but each agent still independently computes its 21GB.

Software workarounds exist — SGLang’s prefix caching, RadixAttention, vLLM’s paged attention — and they help. They reduce the storage of redundant KV data after the first computation. But they do not eliminate the computation: the first agent always pays the full 21GB compute cost. Subsequent agents pay a cache-lookup overhead. And all of them require explicit software orchestration, adding latency and engineering complexity.

CortexChip’s CMFC implements hardware shared KV cache: one computation, N reads, via a hardware Shared Read Bus with coherency managed in silicon. The shared document context is computed once, stored once at 21GB, and read by all four agents at <200ns latency over a hardware broadcast path.

The result: 75% of the redundant compute is eliminated at the hardware level, automatically, without orchestration.

Part 3: What Makes This an Architectural Problem, Not a Software Problem

Every serious engineer we show these numbers to goes through the same mental process. They nod at the problems. Then they start listing software solutions:

“What about NVIDIA Dynamo’s disaggregated scheduling?” “What about SGLang’s prefix caching?” “What about tensor parallelism across multiple GPUs?”

These are reasonable responses. The industry has been working on software mitigations for years, and they are real improvements. The question is not whether they help — they do. The question is how far they get.

The optimization ceiling — why software cannot close this gap

The ceiling breaks down as follows:

vLLM PagedAttention / prefix caching: Eliminates redundant KV storage after the first computation, and reduces re-encoding for repeated document prefixes. Real improvement — approximately 25–30% reduction in memory footprint for workloads with shared prefixes. The first agent still pays the full computation cost. Estimated combined gain for our target workload: ~30%.

CUDA MPS: Allows multiple processes to share GPU resources with reduced context switch overhead. Real improvement over baseline GPU multi-tenancy. No priority model within MPS, so interactive and batch workloads compete equally. Context switches still take 20–50ms — improved from the 100ms+ baseline, but not approaching the 2ms target. Estimated gain: ~20%.

NVIDIA Dynamo (GTC 2026): Disaggregates prefill and decode across separate hardware pools, eliminating the most common source of TTFT degradation in coupled prefill/decode systems. This is a genuine architectural improvement at the software layer. Two problems: it requires Blackwell hardware (export-controlled in every Tier 2 Asian market CortexPod serves), and it does not provide hardware-level concurrent Pod scheduling, sub-2ms context switching, or per-Pod SRAM isolation. It optimizes for throughput on large single models, not for concurrent agent coordination.

Combined optimistic best case with all three: approximately 50% improvement over baseline GPU agent-mesh performance.

Fifty percent is good. It is not enough. Here is why:

The three walls are not software bottlenecks that can be progressively optimized away. They are architectural properties fixed in the H100’s design:

PCIe bandwidth ceiling: 64GB/s maximum. Moving 4GB over PCIe = minimum ~62ms. No CUDA optimization changes this. It is physics.
CUDA non-preemptive scheduling: Long prefill operations cannot be interrupted for interactive requests. CUDA streams have no priority preemption model. Adding priority requires a completely different scheduling architecture — not a driver update.
HBM fixed at 80GB: 105GB of 70B models does not fit. Full stop.

Software can work around architectural constraints, but working around them is categorically different from solving them. The 50% software improvement leaves you at 25–100ms handoff latency and 8 practical concurrent contexts. CortexChip’s hardware targets <2ms handoff and 256 native contexts. These are not different points on the same optimization curve. They are different architectures.

Part 4: The Physics of the Decision

We want to be precise about what CortexChip claims and does not claim.

CortexChip does not claim to be a better general-purpose GPU. It is not. For training workloads, for FP16 inference, for single-large-model throughput maximization, an H100 is better. The H100 was built for those problems and it executes them extraordinarily well.

CortexChip claims to be the right architecture for a specific workload class: concurrent multi-agent W4A8 inference with shared document context, operating under heterogeneous SLA requirements, at the cost and supply chain constraints of the Asian enterprise market.

For that workload, the performance comparison is not “H100 is 2× faster than CortexChip on the same task.” It is “CortexChip serves a 32-agent mesh on a single chip where H100 requires 4–6 chips, with 10–100× lower handoff latency and 75% less redundant computation.”

The economics follow from the architecture.

Economics of a 256-agent production mesh

A 256-agent production mesh on H100 hardware: 32 chips, $960,000, 22,400W, ~8 practical concurrent contexts.

The same workload on CortexChip v1.0: 1 chip, $3,000, 300W, 256 native concurrent contexts.

The 320× capital cost difference is not because CortexChip is 320× better at anything. It is because the comparison unit is different. On H100, you need 32 chips to hold the weight footprint for the workload. On CortexChip, you need one. That asymmetry is entirely a consequence of the architectural mismatch between the GPU’s single-model design and the agent-mesh workload’s multi-context requirements.

For an AI startup in Vietnam or Indonesia raising a $3M Series A, the difference between $960,000 and $3,000 in hardware cost is not a footnote. It is the difference between feasible and impossible.

Part 5: Why Now — The Timing Argument

People sometimes ask why CortexPod is building this now rather than waiting for NVIDIA to solve it. It is a fair question, and it has a specific answer.

NVIDIA is solving it, in software, at the wrong layer. GTC 2026’s Dynamo announcement — disaggregated prefill/decode, inference microservices, dynamic batching across heterogeneous request types — is an explicit acknowledgment that the single-model inference architecture is inadequate for agent workloads. The GPU pipeline is being architecturally dismembered in software to approximate what dedicated coordination silicon provides natively.

This is not a criticism of NVIDIA’s approach. It is the right move given their constraint: they cannot redesign the H100’s memory hierarchy or scheduling model in software. They can add coordination logic on top of an existing architecture. CortexChip builds the coordination logic into the architecture from the start.

The window to establish this silicon is real and time-bounded. The APAC enterprise AI market is 18–24 months behind US adoption maturity, but it is closing fast. The enterprises that will anchor the AI infrastructure contracts in Vietnam, India, Indonesia, and Thailand in 2026–2027 are making infrastructure decisions right now. Once those decisions are made, switching costs rise sharply. The window to establish a hardware relationship is not infinite.

The supply constraint adds urgency. The US AI Diffusion Rule (January 2025) limits Vietnam, India, Indonesia, Thailand, and Malaysia to approximately 50,000 GPU equivalents annually through 2027. H100 is not just expensive in these markets — it is legally and practically constrained in the quantities required for serious production deployment. A chip manufactured outside the NVIDIA/TSMC supply chain, by a company building for this specific market, at $3,000 per unit, addresses a constraint that no amount of software optimization on inaccessible hardware can solve.

IBM said it plainly in March 2026: “A new class of chips for agentic workloads will emerge.”

The question is not whether. It is who and when.

Part 6: What We Are Actually Building

CortexChip is not a GPU with agent features bolted on. It is a chip designed from the ground up around the coordination problem.

The central block — the CortexMesh Fabric Controller (CMFC) — occupies 25% of the die. On an H100, 0% of the die is dedicated to inter-agent coordination. The entire coordination stack on GPU runs in software at OS frequency. The CMFC runs in hardware at 1GHz, with four dedicated primitives:

Context Slot Manager: 256-entry hardware directory, <100ns lookup, no software involvement
State Transfer Engine: direct KV cache transfer between context slots at ~1TB/s on-fabric bandwidth, <1.5ms for a 4GB block
SLA Arbiter: 256 hardware priority lanes with deadline enforcement, <50ns arbitration cycle
Shared Read Bus: hardware-coherent KV cache broadcast for multi-agent document sharing, <200ns

The remaining 75% of the die is necessary infrastructure: the Tensor Core Array for the actual inference computation (W4A8 native, 2 TFLOPS INT8 effective), the memory subsystem (512MB on-chip SRAM + 96GB GDDR7 via FOPLP packaging through ASE Group, no TSMC dependency), and the CXL 2.0 interface for multi-chip cluster scale-out.

We chose 12nm FinFET over 5nm for three reasons: the workload is memory-bandwidth-bound, not FLOPS-bound (additional FLOPS from advanced nodes do not translate to inference performance); the NRE is $60–80M versus $500M+ (the difference between feasible and impossible at seed stage); and 12nm on Samsung SF12 or GlobalFoundries 12LP+ eliminates TSMC CoWoS dependency — a supply chain prerequisite for the APAC customers we serve.

The W4A8 accuracy story is honest: our hardware calibration assist registers recover 0.3–0.5 percentage points versus software-only W4A8, landing at –1.4 to –1.8% versus FP16 on MMLU, HumanEval, and GSM8K. For financial document review and legal contract analysis, this is within acceptable enterprise thresholds. For safety-critical clinical AI, we recommend task-specific validation.

Part 7: What We Don’t Know Yet

We are in FPGA validation. The CMFC is running on a Xilinx Alveo U280 at 250MHz — approximately one-fifth of the 1.2GHz target silicon clock. The state machine correctness and arbitration logic validate cleanly. We have not yet seen the chip at speed.

The critical unknown is GDDR7 effective bandwidth under real attention workloads. The theoretical 2.1× margin over required bandwidth (1.5TB/s effective vs 700GB/s required for 70B W4A8 at <50ms TTFT) is calculated assuming ~75% bandwidth utilization efficiency. Real transformer attention with irregular KV cache access patterns may reduce this. The FPGA phase measures it. If the margin is insufficient — specifically, if FPGA emulation shows 13B TTFT exceeding 500ms — the memory architecture requires revision before the $60–80M tape-out commitment.

This is the correct risk-management sequence: $200K in FPGA validation before $60–80M in silicon.

The tape-out will almost certainly be late. First silicon is always late. The GPU-phase deployment platform — a vLLM-backed inference service that onboards customers before silicon ships — is designed to retain those customers through the delay. The same YAML pod definitions, the same API endpoints, the same Python client. Performance improves when CortexChip hardware arrives. No customer action required.

Conclusion: The Chip That Doesn’t Exist Yet

The agent-mesh era of enterprise AI did not exist when the H100 was designed. It exists now. It is the production workload at banks, law firms, hospitals, and research institutions across Asia. It is growing at 48.5% CAGR. And the chip built for it does not yet exist.

That is the honest answer to “why build a new chip when you can just run on H100s?”

Not because H100 is bad. Because the workload has changed, and the hardware hasn’t caught up.

The GPU is the best computer ever built for a problem that enterprise AI is leaving behind. CortexChip is being built for the problem enterprise AI is moving toward: not raw computation, but coordination — hundreds of specialized agents, operating in parallel, sharing context at hardware speed, producing outcomes none of them could produce alone.

The brain in our logo has six lobes. Each one is a different agent type — router, researcher, writer, compliance, fact-checker, summarizer. They operate in parallel, share context, hand off state, and produce something none of them could produce alone.

That is the agent mesh. That is what we built the chip for.

Technical status note: CMFC architecture is in FPGA validation phase on Xilinx Alveo U280. Production silicon performance targets are design objectives pending tape-out validation. All latency figures (P99 <2ms handoff, <200ns KV broadcast, <50ns SLA arbitration) reflect FPGA-validated design correctness at 250MHz; full-speed silicon behavior requires first-silicon bring-up at 1.2GHz target clock.

Tags: #AIHardware #AgentMesh #MultiAgentAI #ChipDesign #InferenceASIC #APACAI #EnterpriseAI #WhyWeBuild #Semiconductor #DeepTech #StartupEngineering