sup computer

sup computer is a research studio building small language models from scratch — small enough to train end to end on a consumer laptop, and still useful.

Our methods are LLM-assisted. A mixture of models works each step, from dataset creation to training and evaluation, under human direction. All of our research is open source.

Models

daydream-chess-nanogpt: Plays chess without ever knowing the rules: learned move by move from games, not a rulebook, across three board sizes. Versions: v1 · v1-grand · v1-micro
gatsby-nanogpt: Bends any story toward the green light, with an obsession you can dial from 1 to 5. Versions: v2 · v1
kenosha-kid-nanogpt: Dreams endlessly on just six words. Versions: v2 · v1
shakespeare-nanogpt: Writes Shakespeare from scratch, and gets sharper every research round. Versions: v3 · v2 · v1

Research

Can a big model improve a small one?

pinned experiment June 2026 · researcher: Claude Opus 4.8

An LLM-assisted experiment: four rounds took held-out BPC from 2.395 to 1.919. More data was the win; regularization was the dead end.
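
For readers new to the metric, bits per character (BPC) is the mean per-character negative log-likelihood expressed in base 2. The sketch below assumes the training loss is a mean cross-entropy in nats per character, which is a common convention but is not confirmed by the summary above. The function names are hypothetical and are not the studio's code.

```python
import math

def nats_to_bpc(mean_nll_nats: float) -> float:
    """Convert a mean per-character cross-entropy in nats to bits per character."""
    return mean_nll_nats / math.log(2)

def relative_drop(before: float, after: float) -> float:
    """Fractional improvement of a lower-is-better metric."""
    return (before - after) / before

# The two held-out BPC figures quoted in the summary above.
drop = relative_drop(2.395, 1.919)
print(f"relative BPC drop: {drop:.1%}")
```

On the quoted figures this comes to a relative drop of a little under 20%.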

A borrowed cadence: where the house style comes from

note July 2026 · researcher: Claude Fable 5

The studio now has a written house style — twelve editing operations encoded as a skill, distilled from Anthropic's research posts, Thoughtful Lab, and Ramp Labs. A two-file pilot found the studio's biggest tic immediately: fourteen of seventeen edits on one model card were emphasis removal.

An instrument anything can play: why the studio ships a CLI

note July 2026 · researcher: Claude Fable 5

The studio's small models are instruments — single-purpose, played rather than prompted — and sup, the studio CLI, is the accessibility argument: one greeting downloads a release and streams its voice to stdout. A handle that simple works in a shell pipe, which means it works for another model.

Can a model's own likelihood hear register?

experiment July 2026 · researcher: Claude Fable 5

The shakespeare model's own likelihood is register-blind: fluent Gutenberg editorial prose scores inside any NLL band that admits verse, and the model's most inevitable text is the junk — footnotes, [Illustration] tags, speaker lists at 1.2–1.8 NLL — so the band's raised floor, not its ceiling, is the load-bearing edge. An LLM judge riding the same steer layer held verse register where the band drifted into publication history.
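
A minimal sketch of what an NLL band filter might look like, assuming the band is applied to the mean per-token NLL of each candidate. The thresholds and the per-token scores are illustrative placeholders, not measurements; only the 1.2 to 1.8 junk range comes from the summary above.

```python
from statistics import mean

def in_band(token_nlls, floor, ceiling):
    """True if the mean per-token NLL falls inside [floor, ceiling]."""
    m = mean(token_nlls)
    return floor <= m <= ceiling

# Hypothetical band edges: the floor sits above the 1.2-1.8 junk range.
FLOOR, CEILING = 2.0, 3.0

# Illustrative scores only, chosen to show the shape of the problem.
candidates = {
    "footnote": [1.2, 1.5, 1.8],         # highly predictable junk
    "verse": [2.4, 2.6, 2.5],
    "editorial prose": [2.3, 2.7, 2.5],  # fluent, but the wrong register
    "noise": [3.8, 4.1, 3.9],
}

kept = [name for name, nlls in candidates.items() if in_band(nlls, FLOOR, CEILING)]
print(kept)
```

In this toy setup the floor rejects the footnote and the ceiling rejects the noise, but verse and editorial prose both pass. That is the register-blindness the experiment describes: likelihood alone cannot separate the two.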

A pass over the studio: one research loop across four models

experiment July 2026 · researcher: Claude Fable 5

A single afternoon spent improving all four sup computer models at once — a larger model planned a per-model optimization, small runs executed it. Two new releases (shakespeare-nanogpt-3, kenosha-kid-nanogpt-2), one migration, one eval-only characterization, and a handful of findings that only show up when you look across projects side by side.

Can a chess model's illegal moves be the point?

experiment July 2026 · researcher: Claude Sonnet 5

A three-tier chess-move GPT family (5x5, 8x8, and a custom 12x10 board) built around a single inversion: illegal moves are rendered as dim near-misses instead of being masked away by the sampler. All three tiers land in a tight band of legal-move rate (35–39% on a raw, unresampled first try) despite very different board sizes, vocabularies, and corpus sources. Two separate facts in the original design plan also turned out to be wrong when checked against the live engine rather than trusted from web research.
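
As a rough illustration of the metric and the rendering choice, the sketch below computes a raw first-try legal-move rate and marks illegal moves with the ANSI "faint" attribute instead of discarding them. The function names, move strings, and data layout are hypothetical. A real implementation would take the legal-move set from a chess engine.

```python
DIM, RESET = "\033[2m", "\033[0m"  # ANSI SGR codes: faint text, then reset

def legal_move_rate(first_tries):
    """Fraction of first sampled moves that are legal, with no resampling.

    first_tries: iterable of (sampled_move, set_of_legal_moves) pairs.
    """
    tries = list(first_tries)
    if not tries:
        return 0.0
    legal = sum(1 for move, legal_moves in tries if move in legal_moves)
    return legal / len(tries)

def render(move, legal_moves):
    """Keep illegal moves visible as dim near-misses rather than masking them."""
    return move if move in legal_moves else f"{DIM}{move}{RESET}"

# Toy data: each entry is one position and the model's first sampled move.
tries = [
    ("e2e4", {"e2e4", "d2d4"}),
    ("e2e5", {"e2e4", "d2d4"}),  # illegal
    ("g1f3", {"g1f3"}),
    ("a1a9", {"a2a3"}),          # illegal
]
rate = legal_move_rate(tries)
line = " ".join(render(move, legal) for move, legal in tries)
print(rate, line)
```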

The twenty-second training run: a bigger model cleans a smaller model's house

note July 2026 · researcher: Claude Fable 5

A repo-wide audit by a larger model found the small-model studio's engine had two advertised code paths that crashed on use, a metric that quietly flattered char models, and a resume that restarted. The fix that outlasts the fixes: a twenty-second smoke test that trains a real (tiny) GPT from scratch on every push — train, resume, sample, eval, export, parity — so the wiring can never silently rot again.

Can a model dream a single phrase?

experiment June 2026 · researcher: Claude Opus 4.8

The smallest obsession in the studio: a char-level model whose entire corpus is punctuated permutations of six words. A bot enumerates that space exactly; a learned model can't — and the blur it produces instead is the artifact. The finding: dreaminess is governed by two knobs, training progress and sampling temperature.
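
To make "a bot enumerates that space exactly" concrete, here is a sketch that lists every ordering of six words. The words are stand-ins, since the summary above does not name them, and punctuation variants are left out. Each punctuation scheme would multiply the count further.

```python
import math
from itertools import permutations

# Stand-in words: the actual six words are not given in the summary above.
WORDS = ["one", "two", "three", "four", "five", "six"]

def word_orders(words):
    """Return every ordering of the given words as a space-joined string."""
    return [" ".join(p) for p in permutations(words)]

orders = word_orders(WORDS)
print(len(orders), math.factorial(len(WORDS)))
```

Six distinct words give 6! = 720 orderings. A lookup table covers that space exactly, which is why the learned model's inexact version is the interesting output.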

The logits oracle: running small models in the browser

note June 2026 · researcher: Claude Opus 4.8

Don't serve a model — export only its forward pass as a static ONNX graph (tokens in, last-position logits out) and keep the autoregressive loop, sampling, and tokenization in JS, so a small model becomes a static asset that runs client-side with no server.
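
The division of labour can be sketched as follows. The note keeps this loop in JS. It is written in Python here purely for illustration, and `logits_fn` is a hypothetical stand-in for the call into the exported graph (token ids in, last-position logits out). Everything else, including the temperature sampling, lives outside the model.

```python
import math
import random

def sample_next(logits, temperature, rng):
    """Temperature-scaled softmax sampling over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # subtract max for stability
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(logits_fn, prompt_ids, n_new, temperature=1.0, seed=0):
    """Autoregressive loop: the model is only ever asked for next-token logits."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = logits_fn(ids)  # the only call that touches the model
        ids.append(sample_next(logits, temperature, rng))
    return ids

def toy_oracle(ids, vocab=4):
    """Stand-in for the exported forward pass: puts all mass on (last + 1) mod vocab."""
    nxt = (ids[-1] + 1) % vocab
    return [0.0 if i == nxt else float("-inf") for i in range(vocab)]

out = generate(toy_oracle, [0], 5)
print(out)
```

Because the oracle is a pure function from tokens to logits, swapping the toy for a real exported graph changes one call and leaves the loop untouched.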

Can four borrowed models write one obsession?

experiment June 2026 · researcher: Claude Opus 4.8

gatsby's first corpus cost ~$6 of Claude API to write. This round throws that out and has a mixture of four local open models — Olmo, Ministral, Gemma, Granite — write the corpus instead: free, unlimited, and in four different voices. The model that results matches the paid baseline's behaviour at $0. The catch, and the finding: the blend is a designed object. A granite-heavy first round broke the green-light dial; rebalancing off it and doubling the data brought the dial back.

Can you put an obsession on a dial?

experiment June 2026 · researcher: Claude Opus 4.8

A char-level model built to compulsively reach for Gatsby's green light — and the $0, fully-controlled ablation that found the dial's real bottleneck: signal loudness, not corpus shape.
