How to Outcompete a Data Center

May 28, 2026

Ultimately, frontier LLM inference is cloud-based because that is where it is profitable. Beyond R&D, the vast majority of the actual complexity in building an AI model lies in the training run; inference with an existing model on a single GPU is more or less as simple as attaching a static file to a for loop. If frontier-level LLM inference could be achieved locally at the same cost, there would be a much larger market for workstations like the DGX Spark, and vendors like NVIDIA would be spending significantly more on subsidizing open-source labs to sidestep the problem of value capture. But as it stands, a centralized GPU server rack can run frontier models (with profitable unit economics, if not gross margins) and no existing local solution really can.

As a quick refresher: while most computer programs run on CPUs, AI workloads tend to run better on GPUs, which happen to be better at the relevant computations: pointwise operations and large matrix multiplies. Most GPUs have dedicated VRAM in which the LLM has to sit, so the amount of VRAM you have determines the size of the largest model that you can run. Variable costs like per-session token storage (the cache that grows with every token in a conversation) also matter, and lots of engineering work goes into lowering them. The demand for more memory capacity is the most direct explanation for why SK Hynix's stock price has increased >700% in the past year.
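
To make the VRAM constraint concrete, here is a rough back-of-the-envelope sketch. It assumes standard attention with a key/value cache, and every number in the example (an 8B-parameter model, 32 layers, 8 KV heads of dimension 128, 16-bit values, a 24 GB card) is a hypothetical I picked for illustration, not a spec for any particular model or GPU. It also ignores activations and runtime overhead, so treat the result as an upper bound.

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory taken by the frozen model weights."""
    return n_params * bytes_per_param


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Per-token cache size: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(vram_bytes: float, n_params: float, bytes_per_param: float,
                      n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Tokens of cache that fit in whatever VRAM the weights leave free."""
    free = vram_bytes - weight_bytes(n_params, bytes_per_param)
    if free <= 0:
        return 0  # the weights alone do not fit
    return int(free // kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim))


# Hypothetical configuration, for illustration only.
print(kv_cache_bytes_per_token(32, 8, 128))          # 131072 bytes (128 KiB) per token
print(max_cached_tokens(24e9, 8e9, 2, 32, 8, 128))   # 61035 tokens, shared by all sessions
```

The weights are paid for once; the token budget at the end is what gets divided between however many sessions you run.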

A second bottleneck is memory bandwidth: the speed at which data can move back and forth between VRAM and the actual computational units of the GPU (or TPU, or NPU). This bandwidth is used to shuffle in parts of the LLM, which is essentially a frozen set of matrices, as well as the data relevant to each specific session. The former is a fixed cost and the latter is a much smaller variable cost, meaning that running many sessions at once, for many users at once, is the natural cost-effective option.
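
A minimal sketch of that fixed-versus-variable split, assuming memory traffic is the only cost, that every generation step reads all of the weights once, and that each session adds a fixed number of bytes of its own. The figures (16 GB of weights, 0.5 GB per session, 1 TB/s of bandwidth) are made up for illustration.

```python
def decode_step_seconds(batch_size: int, weight_bytes: float,
                        session_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time for one generation step if moving bytes is the only cost."""
    bytes_moved = weight_bytes + batch_size * session_bytes  # fixed + variable
    return bytes_moved / bandwidth_bytes_per_s


def total_tokens_per_second(batch_size: int, weight_bytes: float,
                            session_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Each step produces one token per session in the batch."""
    step = decode_step_seconds(batch_size, weight_bytes, session_bytes,
                               bandwidth_bytes_per_s)
    return batch_size / step


# Hypothetical hardware and model numbers, for illustration only.
for batch in (1, 4, 16):
    rate = total_tokens_per_second(batch, 16e9, 0.5e9, 1e12)
    print(f"batch={batch:2d}  total={rate:6.1f} tok/s  per session={rate / batch:5.1f} tok/s")
```

Under these assumptions, going from 1 to 16 sessions multiplies total throughput by roughly 11 while each individual session only slows down by about a third.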

Modern LLM infrastructure is about reaching the scale at which this fixed cost per session is as small as possible: you increase your batch size (the number of sessions you serve at once) until you are bottlenecked by the actual internal computation of the AI chips rather than by memory. As an example, Huawei's most recently published paper on hosting DeepSeek-R1 shows them running 384 NPUs with a total of 49 TB of RAM to serve a model that itself uses a total of 700 GB, and using the leftover memory for the caches of up to 36,864 concurrent sessions in total, or 96 per die.
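
The arithmetic behind that example is worth spelling out. This sketch uses only the figures quoted above, with decimal units (1 TB = 10^12 bytes), and assumes the leftover memory is split evenly between sessions and used for nothing else. That is my simplification, not something the paper necessarily claims, so read the per-session figure as an upper bound.

```python
TB = 1e12
GB = 1e9

total_memory = 49 * TB      # across all 384 NPUs
model_weights = 700 * GB    # the fixed cost
max_sessions = 36_864       # the variable cost is spread over these


def per_session_budget(total_bytes: float, weight_bytes: float, sessions: int) -> float:
    """Memory left for each session's cache once the weights are stored."""
    return (total_bytes - weight_bytes) / sessions


def weight_fraction(total_bytes: float, weight_bytes: float) -> float:
    """Share of total memory spent on the weights themselves."""
    return weight_bytes / total_bytes


budget = per_session_budget(total_memory, model_weights, max_sessions)
print(f"{budget / GB:.2f} GB of cache per session")                           # 1.31 GB
print(f"{weight_fraction(total_memory, model_weights):.1%} of memory is weights")  # 1.4%
```

At this scale the weights are a rounding error, which is the whole point.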

This fact has been known for some time, but its consequences are massively underdiscussed by local inference hobbyists and providers. Almost any local system that can run a single session of a model can run multiple sessions of that model at once, so there's a massive free lunch in finding basically any way to work the growing catalogue of multi-agent systems & harnesses into your final setup. But skimming through hobbyist forums like r/LocalLLaMA, and even looking at the strategy of would-be serious local inference players like truffle.net, there doesn't seem to be any acknowledgement of this. The answer to the title starts, at least, with making full use of the resources you already have.
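
As a rough illustration of what "making full use" could look like in a harness, here is a toy micro-batcher: several agents submit prompts concurrently, and a single worker groups whatever is waiting into one batched call. Everything here is hypothetical. `fake_batched_forward` is a stand-in I made up for a batched call into a local model, and `MicroBatcher` and `run_agents` are illustrative names, not the API of any real inference server.

```python
import asyncio


async def fake_batched_forward(prompts):
    """Hypothetical stand-in for one batched call into a local model."""
    await asyncio.sleep(0.01)  # pretend one pass costs the same at any batch size
    return [f"reply to: {p}" for p in prompts]


class MicroBatcher:
    """Collects requests from concurrent agents and serves them as one batch."""

    def __init__(self, max_batch=8, max_wait_s=0.005):
        self.queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.batch_sizes = []  # recorded so the grouping can be inspected

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut  # resolved by run() once the batch has been served

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until there is work
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            replies = await fake_batched_forward([p for p, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), reply in zip(batch, replies):
                fut.set_result(reply)


async def run_agents(prompts):
    batcher = MicroBatcher()  # created inside the running event loop
    worker = asyncio.create_task(batcher.run())
    replies = await asyncio.gather(*(batcher.submit(p) for p in prompts))
    worker.cancel()
    return replies, batcher.batch_sizes


if __name__ == "__main__":
    replies, sizes = asyncio.run(run_agents(["plan", "critique", "summarize"]))
    print(replies, sizes)
```

Real local servers that support batching do this grouping for you; the point of the sketch is only that the agents have to be issued concurrently, rather than one after another, for there to be anything to group.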

Beyond that, what I see as the next frontier for improving this balance involves more detailed hardware work and LLM co-design. ASIC providers like Cerebras are a good step in that direction, but I don't see them as the final destination: they build in fixed assumptions about what kind of computation you're doing, but don't go quite all the way. Look at something like the Taalas HC1: an inference-only chip that embeds the model into the hardware itself and boasts speeds >10x higher than even the specialized chips mentioned above. Although this comes at the cost of flexibility, it avoids the significant slowdown that comes with having to shuffle model layers in and out of the computational registers. Especially notable for our goal is that it completely eliminates the aforementioned fixed cost of storing those weights, meaning that the benefit of centralized inference completely disappears.

This is one of the primary reasons why I've spent the past few months designing my work around the implications of hardware like this. I think that reducing reliance on data center-based AI inference is something that can only be achieved by starting with a small base of passionate people, so you aren't going to get anywhere by hoping that the open frontier becomes "good enough" to serve in a sub-par package.