Back to Constrained Computing

The locally hosted LLM tax is arriving sooner than you like.

May 22, 2026

The cost of advanced AI is not dropping fast enough for most real-world software. Companies are already moving down the cost ladder. The next step is low-cost or free hosted models. Below that are models that run on a phone, a laptop, or a local server.

Each step has a different cost. Per-token models charge a direct bill. Free hosted models replace the bill with data collection, tracking, or platform dependence. Local models avoid both, depending on license and origin.

Local models also eat the machine. On consumer hardware in 2026, a serious local LLM will use 70 to 90 percent of available compute and memory while it is running. That number is not going to fall meaningfully in the next two to five years. It is also not going to stop people from running it and making do with what remains.

Model architectures are still growing faster than silicon. The 10 to 30 percent of the machine that is left has to run the rest of the application. Retrieval, memory, context assembly, the orchestration that holds the workflow together. That is now the entire engineering envelope.

Instruction through broken assumptions

For about twenty years, software has mostly been built as if computing resources were unlimited. The belief made sense at the time. Computers got faster. Memory got cheaper. Developers no longer had to work within strict limits. The common response to a performance problem became simple: add more hardware. That response worked well enough that many teams stopped questioning it.

The response fails when the AI model is already using most of the machine.

Engineering teams now need to rebuild the skill of programming under tight limits. The rule is simple. Important decisions are based on measurements from the real hardware. Theory, benchmarks, and estimates can narrow the search. They cannot make the final decision. Only the target machine can.

A recent kernel optimization effort taught me the lesson again. In five out of eight cases, the paper analysis was right. The other three were wrong in instructive ways. One option that looked promising was unusably slow on the actual silicon. One option that I nearly dismissed turned out to be the winner. Cache behavior, memory bus, and instruction scheduling were responsible for the surprises. None of it was visible from the published papers.

That experience produced a process. Old discipline, current necessity. Every candidate is tested on the target hardware. Every version has to produce output identical to the trusted reference before benchmarking. No option is rejected without measured proof. The process is slower than the unlimited-resources alternative. It is also the only version that ships software which survives contact with the device.

A useful side effect: arguments that used to be mooted across three meetings now resolve in an afternoon. The hardware has the final word.

The senior partner shift

Through the end of April, the frontier coding models behaved like junior programmers who never tired. They could grind. They could produce volume. They could not push back, synthesize across layers, or refuse a bad premise. The skill in working with them was prompt engineering: examples, structure, careful instruction, broken-down tasks.

In May, the models refreshed.

The act of prompt engineering is moot. The discipline existed because the April model needed it. The May model does not. The May model is a peer. It pushes back when the question is wrong. It mooted three alternative architectures during a recent design conversation. Two were unprompted. One was correct.

Most practitioners have not noticed. They are still composing prompts. They are still measuring AI use by tokens produced and pull requests closed. They are producing junior-shaped output from a senior-shaped tool.

The competence that replaces prompt engineering is being a useful interlocutor. Leading questions. Adversarial framing. Demanding citations. Setting creative scope and stepping back. Knowing when to intervene and when to let the peer run.

None of this was in the prompt engineering courses six months ago. Accept it prompt engineering is obsolete move on nothing to see here.

The convergence

Constrained computing is coming back because devices will force it. The peer-collaboration model arrived because frontier providers shipped it. The second makes the first survivable.

Eight kernels to evaluate becomes eighty when there is a senior peer who can write the variants, run the conformance gate, generate the benchmarks, and present the reductive results. The exhaustive testing no human team could reasonably do by hand is now in reach of a small team with measurement discipline and a new senior team player in the loop.

The bottleneck shifts. The question is no longer can we run this experiment. The question is which experiments are worth mooting against the hardware.

That is the rediscovered skill. Not generating more code. Asking the right adversarial question of a peer who can run the experiment. Looking at what the hardware actually said. Being willing to be wrong on the way to being right.

Companies that treat AI tokens as a replacement for engineering judgment will produce more code and learn less. The metric they have chosen, output velocity, is the wrong one. The right metric is the one engineering used before the unlimited-resources era. Did the right thing ship. Is there measured proof.

The lesson

The industry will return to constrained programming because devices will not negotiate. The science team at Aperture believed enough compute could brute-force any problem. That assumption did not end well for them either.

Teams that relearn the discipline first will look like wizards. Teams that wait will pay for the lesson in production, where the cost of a wrong implementation is measured in customer-visible latency, support tickets, and the slow erosion of trust that follows software which almost works.

There is a third option. Build the measurement infrastructure now. Practice the conversation skill with your new peer. Run the experiments while the option space is still cheap to explore. Get the discipline back in muscle memory before the constraint forces the lesson on a deadline.

Some of us are not waiting.

Bob Pankratz writes infrastructure. Off-Axis Labs publishes the notes.

Discussion about this post

Ready for more?