Two by hand, the rest by machine

The role for local AI in the space between deterministic and unbounded work has landed. Did you notice?

Jun 03, 2026

A model can maintain a port if you build the checker first. Most of the surprises were not where we expected.

The substrate underneath our current project is dense discrete mathematics. It is heavily scrutinized, tedious to build, tedious to tune, tedious to maintain. The system it supports has to run in many languages. The old answer to that problem is to hire port maintainers. The new answer is to throw a model at it. We took the new answer and watched what happened.

The bet is plain. Build a foundational example by hand. Let a model maintain the rest. Here is the strategy, and the lessons that followed.

The constraint

We are building a memory substrate that has to return the exact same answer on every machine. Not a similar answer. The same bytes. That requirement is what lets one machine trust what another one computed, and it is also why we could not keep building the normal way.

You cannot hand-maintain a byte-identical system in a dozen languages. The labor does not scale and the versions drift. Every language added is another implementation to keep in lockstep with the others as the design moves. For a system whose value depends on byte-for-byte agreement, drift is not a defect you fix later. It is the end of the product.

We made the call early. Two strong languages with multiplatform support and solid test harnesses, Swift and Rust, written and maintained by hand. Those two are the source of truth. Two on purpose, not one. Two independent implementations that agree on every byte are far stronger evidence that the design is right than one implementation that only ever agrees with itself. Where the two agree, we freeze the result as a test case. That frozen set is the conformance checker.
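A minimal sketch of that freezing step, in Python for brevity. The two reference functions here are hypothetical stand-ins (both simply wrap SHA-256), not our Swift and Rust code, and the vector format is an assumption. The shape is what matters: only inputs where both references agree become test cases, and a port passes only on exact equality.

```python
import hashlib
import json


def reference_a(data: bytes) -> bytes:
    # Hypothetical stand-in for the first hand-written reference.
    return hashlib.sha256(data).digest()


def reference_b(data: bytes) -> bytes:
    # Hypothetical stand-in for the second, independently written reference.
    h = hashlib.sha256()
    h.update(data)
    return h.digest()


def freeze_vectors(inputs):
    """Keep only the inputs where both references agree byte for byte."""
    frozen, disputed = [], []
    for data in inputs:
        a, b = reference_a(data), reference_b(data)
        if a == b:
            frozen.append({"input": data.hex(), "expected": a.hex()})
        else:
            disputed.append(data.hex())  # set aside, never frozen
    return frozen, disputed


def check_port(port, frozen) -> bool:
    """A port passes only if every frozen vector matches exactly."""
    return all(
        port(bytes.fromhex(v["input"])).hex() == v["expected"] for v in frozen
    )


frozen, disputed = freeze_vectors([b"", b"abc", bytes(range(16))])
print(json.dumps(frozen[1], indent=2))
print("reference_a passes:", check_port(reference_a, frozen))
```

Storing vectors as hex strings keeps the frozen set language-neutral, so any port can read the same file.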

Go is the tiebreaker for the rare case where Swift and Rust disagree and the cause is north pointing south. Rare. Inevitable.

Every other language is generated by a model and held to the same checker. If a model can write a correct port and keep it correct as the design moves, adding a language stops being a hiring decision and becomes a generation problem. The reference stays small enough to audit. The fan-out stays cheap.

What follows is what it actually took to earn that.

The loop

The method is plain. We ask a local coding model, a 30-billion-parameter model quantized to 4 bits and running on a single M5 Max Mac, to write a port of one function. This is not work worthy of frontier AI spend. We run the checker. We keep what passes.

At a normal sampling temperature the model gets a hard function right about a third of the time and wrong the rest. We sample a few dozen times, throw out the failures, keep the passes. Then we fine-tune the model on those survivors, which are nothing more than its own correct answers with the checker’s stamp on them.

After a few hundred steps of that, the model writes the function correctly on the first try, temperature zero, every time. Stock model: fails. Same model trained on its own verified output: passes, deterministically.
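The sample-and-filter half of the loop can be sketched in a few lines. Everything named here is hypothetical: `sample_model` is a stub that picks from three canned candidates instead of calling a model, and the toy task (a 64-bit popcount) stands in for a real function. The fine-tuning step is not shown.

```python
import random


def popcount_reference(x: int) -> int:
    # Ground truth for the toy task: number of set bits in a 64-bit word.
    return bin(x & 0xFFFFFFFFFFFFFFFF).count("1")


# Hypothetical completions a model might emit. Only the first is correct.
CANDIDATES = [
    "def popcount(x):\n    return bin(x & 0xFFFFFFFFFFFFFFFF).count('1')\n",
    "def popcount(x):\n    return bin(x & 0xFFFFFFFF).count('1')\n",  # drops high bits
    "def popcount(x):\n    return len(bin(x))\n",  # not a popcount at all
]

VECTORS = [0, 1, 0xFF, 1 << 63, 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0]


def sample_model(rng: random.Random) -> str:
    # Stand-in for one sampled completion at a nonzero temperature.
    return rng.choice(CANDIDATES)


def passes_checker(source: str) -> bool:
    namespace = {}
    try:
        exec(source, namespace)  # a real harness would sandbox this
        fn = namespace["popcount"]
        return all(fn(x) == popcount_reference(x) for x in VECTORS)
    except Exception:
        return False


def collect_survivors(n_samples: int, seed: int = 0):
    rng = random.Random(seed)
    samples = [sample_model(rng) for _ in range(n_samples)]
    # Failures are discarded; survivors would become the fine-tuning set.
    return [s for s in samples if passes_checker(s)]


survivors = collect_survivors(36)
print(f"{len(survivors)} of 36 samples passed the checker")
```

Nothing in the filter knows what a good port looks like. It only knows which outputs matched the frozen vectors.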

We proved it first on Rust, where we had a hand-written reference to grade the model against. That was the point. Before trusting the method on a language we do not maintain, we wanted to watch it reproduce one we do. Then we took it to the five languages we actually want a model to own: Python, Go, JavaScript, Julia, and C#. It held in all five. No human wrote a correct answer. No human labeled anything. The checker did that work.

One example was enough

The first number that stopped us was how little data it took.

You would expect to need many verified examples to move a model. We did not. One worked. A single correct port, trained for a few hundred steps, was enough to lock the model’s first-try output onto the right answer. Two examples gave the same result. So did three. There was no curve to climb. One was the whole effect.

That is the first sign you are not teaching. You cannot teach a function from one example. You can only point at one. The capability was already in the model. The single example told it which of its own instincts to trust.

The failures were never the math

When a port failed we kept it and read it. Almost none of the failures were wrong math. The model wrote the hash constants from memory, correctly, nearly every time. What broke was always smaller and dumber than the math.

In Go, the model wrote a kitchen-sink list of imports, and Go refuses to compile when one goes unused. In JavaScript, it tried to turn a byte array into a number using a call that throws on byte arrays. In C#, it handed a byte array to something that wanted a plain integer, and those types do not convert. In Julia, it called popcount, which Julia does not have under that name; the built-in is count_ones.

None of these are hard problems. They are the kind of thing a linter quietly fixes for a human. But each one is a single conventional detail with one right answer, and a model sampling freely will sometimes reach for the wrong one. The fixes were small and paid once. A sentence in the prompt, or a deterministic cleanup pass. One fix per language, never per port.
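To make "deterministic cleanup pass" concrete, here is a rough sketch of one for the Go case, written in Python. It is an illustration under narrow assumptions, not the pass we ran: it only handles a single parenthesized import block with unaliased paths, and it decides usage with a plain text search.

```python
import re

IMPORT_BLOCK = re.compile(r"import \((.*?)\)\n", re.DOTALL)


def strip_unused_go_imports(source: str) -> str:
    """Drop unaliased imports whose package name never appears as `name.`."""
    match = IMPORT_BLOCK.search(source)
    if match is None:
        return source
    body = source[match.end():]
    kept = []
    for line in match.group(1).splitlines():
        path = line.strip().strip('"')
        if not path:
            continue
        name = path.rsplit("/", 1)[-1]  # "encoding/binary" -> "binary"
        if re.search(rf"\b{re.escape(name)}\.", body):
            kept.append(f'\t"{path}"')
    block = "import (\n" + "\n".join(kept) + "\n)\n"
    return source[:match.start()] + block + body


GO_SOURCE = '''package hash

import (
    "encoding/binary"
    "fmt"
    "math/bits"
)

func Mix(b []byte) int {
    return bits.OnesCount64(binary.LittleEndian.Uint64(b))
}
'''

cleaned = strip_unused_go_imports(GO_SOURCE)
print(cleaned)
```

In practice an existing tool such as goimports does this job properly. The point of the sketch is the cost profile: the pass is written once per language and then applies to every port in that language.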

The clearest case was Python. One function kept failing for a different reason: the model knew the hash but not the exact order to feed the inputs, because we had never written that order down. We added one paragraph of documentation describing it. The pass rate on that function went from zero out of sixteen tries to fifteen out of sixteen. We changed no code and changed nothing about the model. We told it the part we had forgotten to say.

Correct examples can still be bad teachers

Two of the languages embarrassed us, and both taught the same lesson.

Every example we trained on had passed the checker, so each was correct by definition. Correct, it turns out, is not the same as good to imitate.

In JavaScript, most of our verified examples happened to end many lines the same way, with a long repeated piece of mask syntax. Train on enough of that and the model learns the rhythm too well. At temperature zero it falls into a loop, repeating that piece forever, and never finishes the function. Each example was individually perfect. The blend was poison.

In C#, the verified examples used two slightly different ways of writing one branch. The model learned both, then spliced them together into something that would not compile.

Both fixes were the same. Fewer examples, more uniform. Which returns to the point. What matters is not how many correct answers you have. It is which ones you choose to imitate. The checker tells you what is correct. It does not tell you what is worth copying.

One sentence, three times the yield

Late in the work we tried to make this measurable. For each language we counted how many genuinely different ways the model wrote the same function. Call it spread. More spread, lower pass rate. That was the prediction and it held loosely.
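One crude way to count spread, as a sketch. The normalization rule is an assumption chosen for illustration (strip `#` comments, collapse whitespace); what counts as "genuinely different" in a real port needs something smarter, and the four sample strings are made up.

```python
import re
from collections import Counter


def normalize(source: str) -> str:
    # Crude: drop '#' comments and collapse whitespace. This would mangle
    # a '#' inside a string literal, which is fine for a sketch.
    no_comments = re.sub(r"#.*", "", source)
    return " ".join(no_comments.split())


def spread(samples) -> int:
    """Number of distinct normalized forms among the samples."""
    return len({normalize(s) for s in samples})


def dominant_form(samples):
    """The most frequent normalized form and how often it appeared."""
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0]


# Four made-up ways a model might write the same low-byte mask.
samples = [
    "x & 0xFF  # mask low byte",
    "x   &   0xFF",
    "x % 256",
    "x & 255",
]

print("spread:", spread(samples))
print("dominant form:", dominant_form(samples))
```

The same count also suggests which form to pin in the prompt or keep in the training set: the dominant one.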

The sharper test was an intervention. JavaScript had the widest spread and the worst yield, seventeen percent. We added one sentence to the prompt telling the model to write the mask one specific way instead of leaving it open, and we wrote the prediction down before running it. Yield should rise.

It went to fifty-two percent. Three times the rate from one sentence. The looping problem disappeared at the same time and for the same reason. We had narrowed the model’s options, and with fewer options it stopped wandering. Producing more correct answers and producing one reliable answer turned out to be the same problem with the same fix.

The test that settled it

The functions so far all demanded an exact byte match. We also have functions that work in floating point, where an exact match is the wrong test because different machines round differently. For those, the checker accepts an answer that is close enough.
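The difference between the two kinds of check fits in a few lines. The tolerances below are assumed values for illustration, not the ones our checker uses.

```python
import math
import struct


def exact_match(got: bytes, expected: bytes) -> bool:
    # The byte-identical test used for the discrete functions.
    return got == expected


def close_enough(got: float, expected: float,
                 rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    # The forgiving test for floating-point functions (tolerances assumed).
    return math.isclose(got, expected, rel_tol=rel_tol, abs_tol=abs_tol)


# Two routes to "the same" number that differ in the last bit.
a = 0.1 + 0.2
b = 0.3

same_bytes = exact_match(struct.pack("<d", a), struct.pack("<d", b))
print("byte-identical:", same_bytes)
print("close enough:  ", close_enough(a, b))
```

An exact check on these two values fails and a tolerance check passes, which is why the floating-point functions get a different test.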

We ran the same loop on three of them. Two of them, an entropy measure and a z-score, the model got right on the first try with no training at all. That had never once happened on the exact functions.

The reason is the whole argument in a single observation. Those two are textbook formulas the model knows cold, and a close-enough target is wide enough to accept its answer as written. When the model knows the formula and the target is forgiving, it is already correct. There is nothing to fix, so training does nothing.

Which finally explains why the exact functions needed training at all. Not because the model lacked the knowledge. Because an exact target is unforgiving and the model’s best guess landed just beside it. Training never added knowledge. It nudged a deterministic answer the last few bytes onto a target the model was already circling.

The third floating-point function was a Fourier transform, and it was the exception that proves the rule. The model genuinely does not know that one well; it produced a correct version about one time in eighty. But one in eighty is not zero. The checker found the one correct version, we trained on it alone, and the model began producing it exactly. The rarest case in the project, and a single verified example was still enough.
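The arithmetic behind "one in eighty is not zero" is worth a look. If each sample passes independently with probability p, the chance of at least one pass in n samples is 1 - (1 - p)^n. The sketch below assumes independence and uses the two pass rates mentioned in this post; the 95 percent target is an arbitrary choice for illustration.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent samples passes."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, confidence: float) -> int:
    """Smallest n with p_at_least_one(p, n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


for label, p in [("one in three", 1 / 3), ("one in eighty", 1 / 80)]:
    n = samples_needed(p, 0.95)
    print(f"{label}: {n} samples for a 95% chance of one pass "
          f"(check: {p_at_least_one(p, n):.3f})")
```

A rare pass costs more samples, not a different method. The checker does the same job either way.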

What it means

If your work has a cheap mechanical way to separate right from wrong, and ours does, down to the byte, you already have most of what you need to turn an unreliable model into a reliable one. You do not need labeled data. You do not need a bigger model or a cluster. You need the model to be right once in a while and a way to notice when it is.

This sits inside a larger shift that is easy to miss. As models get more capable, the scarce thing is not raw capability. It is a trustworthy way to tell good output from bad, cheaply and at scale. We have that for this corner of our system because correctness was a hard requirement long before any of this, so we built the checker first and the model came later.

That ordering is the real lesson. Build the thing that decides right from wrong before you reach for the model to generate at scale. The checker was never going to teach the model anything. Its job was to find the answers the model already had and hand them back.

If you have a checker like that in your own work, you may be holding a teacher and calling it a test.

Bob Pankratz writes infrastructure. Off-Axis Labs publishes the notes.
