sup computer — a small language model studio
sup computer is a research studio building small language models from scratch — small enough to train end to end on a consumer laptop, and still useful.
Our methods are LLM-assisted. A mixture of models works each step, from dataset creation to training and evaluation, under human direction. All of our research is open source.
Models
- daydream-chess-nanogpt: Plays chess without ever knowing the rules, learned move by move from games, not a rulebook, across three board sizes. v1 · v1-grand · v1-micro
- gatsby-nanogpt: Bends any story toward the green light, with obsession you can dial from 1 to 5. v2 · v1
- kenosha-kid-nanogpt: Dreams endlessly on just six words. v2 · v1
- shakespeare-nanogpt: Writes Shakespeare from scratch and gets sharper every research round. v3 · v2 · v1
Research
An LLM-assisted experiment: four rounds took held-out BPC from 2.395 to 1.919. More data was the win; regularization was the dead end.
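For readers unfamiliar with the metric, bits per character (BPC) is the average negative log-likelihood per character, expressed in base 2. The sketch below assumes the loss is summed in nats, as most training loops report it; whether the studio's engine computes it exactly this way is an assumption, and the function names are illustrative.

```python
import math


def bits_per_char(total_nll_nats: float, n_chars: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per character."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    return total_nll_nats / n_chars / math.log(2)


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Sanity check: a model that gives every character probability 1/2 scores 1 BPC.
uniform_two = bits_per_char(total_nll_nats=100 * math.log(2), n_chars=100)

# The two held-out figures quoted above.
drop = relative_drop(2.395, 1.919)
print(f"{uniform_two:.3f} BPC baseline check, relative drop {drop:.1%}")
```

Read this way, the four rounds cut held-out BPC by roughly a fifth.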
The studio now has a written house style — twelve editing operations encoded as a skill, distilled from Anthropic's research posts, Thoughtful Lab, and Ramp Labs. A two-file pilot found the studio's biggest tic immediately: fourteen of seventeen edits on one model card were emphasis removal.
The studio's small models are instruments — single-purpose, played rather than prompted — and sup, the studio CLI, is the accessibility argument: one greeting downloads a release and streams its voice to stdout. A handle that simple works in a shell pipe, which means it works for another model.
The shakespeare model's own likelihood is register-blind. Fluent Gutenberg editorial prose scores inside any NLL band that admits verse, and the text the model finds most predictable is the junk: footnotes, [Illustration] tags, and speaker lists at 1.2–1.8 NLL. The band's raised floor, not its ceiling, is therefore the load-bearing edge. An LLM judge riding the same steer layer held verse register where the band drifted into publication history.
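A minimal sketch of what an NLL band filter could look like. The band edges below are hypothetical (the summary only gives the 1.2–1.8 range for junk), and the sample scores are made up for illustration. Note that a filter like this cannot tell verse from fluent editorial prose, which is the register-blindness described above.

```python
from typing import Dict, List


def mean_nll(token_nlls: List[float]) -> float:
    """Average per-token negative log-likelihood of one sample."""
    if not token_nlls:
        raise ValueError("sample has no tokens")
    return sum(token_nlls) / len(token_nlls)


def in_band(nll: float, floor: float, ceiling: float) -> bool:
    """Accept only samples whose mean NLL lies inside [floor, ceiling]."""
    return floor <= nll <= ceiling


# Hypothetical band edges: the floor sits just above the junk range.
FLOOR, CEILING = 1.9, 2.6

# Made-up per-token scores for three kinds of sample.
samples: Dict[str, List[float]] = {
    "speaker list": [1.3, 1.2, 1.4],  # highly predictable junk, below the floor
    "verse-like": [2.1, 2.3, 2.0],    # inside the band
    "noise": [3.4, 3.1, 3.6],         # above the ceiling
}

kept = [name for name, nlls in samples.items()
        if in_band(mean_nll(nlls), FLOOR, CEILING)]
print(kept)
```

The floor does the useful work here: without it, the most predictable text would pass as the best text.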
A single afternoon spent improving all four sup computer models at once: a larger model planned a per-model optimization and small runs executed it. Two new releases (shakespeare-nanogpt-3, kenosha-kid-nanogpt-2), one migration, one eval-only characterization, and a handful of findings that only show up when you look across projects side by side.
A three-tier chess-move GPT family (5x5, 8x8, and a custom 12x10 board) built around a single inversion: illegal moves are rendered as dim near-misses instead of being masked away by the sampler. All three tiers land in a tight band of legal-move rate (35–39% on a raw, unresampled first try) despite very different board sizes, vocabularies, and corpus sources. Two separate facts in the original design plan also turned out to be wrong when checked against the live engine instead of trusted from web research.
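The legal-move rate described above can be sketched as a first-try count with no resampling, where illegal samples are labelled rather than masked. The move strings, the positions, and the function names are all hypothetical; the real project checks legality against its own engine.

```python
from typing import Iterable, List, Set, Tuple


def classify(move: str, legal: Set[str]) -> str:
    """Label a sampled move instead of masking it out of the sampler."""
    return "legal" if move in legal else "near-miss"


def legal_move_rate(trials: Iterable[Tuple[str, Set[str]]]) -> float:
    """Share of first-try samples that are legal, with no resampling."""
    labels: List[str] = [classify(move, legal) for move, legal in trials]
    if not labels:
        raise ValueError("no trials")
    return labels.count("legal") / len(labels)


# Hypothetical trials: (sampled move, legal moves in that position).
trials = [
    ("e2e4", {"e2e4", "d2d4", "g1f3"}),
    ("e2e5", {"e2e4", "d2d4", "g1f3"}),  # illegal: kept and rendered dim
    ("b8c6", {"b8c6", "g8f6"}),
    ("a1a9", {"a2a3", "a2a4"}),          # illegal: kept and rendered dim
]

rate = legal_move_rate(trials)
print(f"legal-move rate: {rate:.0%}")
```

Keeping the near-misses in the output is the design choice: the rate stays an honest measure of what the model learned, because the sampler never hides a mistake.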
A repo-wide audit by a larger model found that the small-model studio's engine had two advertised code paths that crashed on use, a metric that quietly flattered char models, and a resume that restarted training instead of continuing it. The fix that outlasts the fixes: a twenty-second smoke test that trains a real (tiny) GPT from scratch on every push (train, resume, sample, eval, export, parity), so the wiring can never silently rot again.
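One of the checks in such a smoke test, the resume check, can be illustrated with a toy stand-in. The sketch below does not train a GPT; it uses a deterministic counter-style "model" (an assumption made purely for illustration) to show the property being guarded: training in one go and training with a checkpoint and resume should end in the same state.

```python
import copy
from typing import Dict

State = Dict[str, float]


def train(state: State, steps: int) -> State:
    """Toy deterministic training: each update depends on the global step."""
    state = copy.deepcopy(state)
    for _ in range(steps):
        state["step"] += 1
        state["weight"] += 1.0 / state["step"]
    return state


def fresh_state() -> State:
    return {"step": 0, "weight": 0.0}


def resume_matches_straight_run(total: int, split: int) -> bool:
    """Checkpoint at `split`, resume to `total`, compare with one straight run."""
    straight = train(fresh_state(), total)
    checkpoint = train(fresh_state(), split)
    resumed = train(checkpoint, total - split)
    return resumed == straight


def restarting_resume(total: int, split: int) -> State:
    """The bug being guarded against: the checkpoint is ignored on resume."""
    return train(fresh_state(), total - split)


ok = resume_matches_straight_run(total=20, split=7)
broken = restarting_resume(total=20, split=7) == train(fresh_state(), 20)
print(ok, broken)
```

A check of this shape fails loudly the moment resume silently starts over, which is the kind of rot the per-push test is meant to catch.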
The smallest obsession in the studio: a char-level model whose entire corpus is punctuated permutations of six words. A bot enumerates that space exactly; a learned model can't — and the blur it produces instead is the artifact. The finding: dreaminess is governed by two knobs, training progress and sampling temperature.
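Both halves of that contrast fit in a few lines. The sketch below enumerates word orderings exactly, as a bot would, and then shows the temperature knob on a softmax. The six tokens are placeholders because the summary does not list the words, punctuation variants are ignored, and the logits are made up.

```python
import itertools
import math
from typing import List

# Placeholder tokens; the actual six words are not listed in this summary.
WORDS = ["w1", "w2", "w3", "w4", "w5", "w6"]

# What a bot can do: enumerate every ordering exactly (punctuation ignored).
orderings = [" ".join(p) for p in itertools.permutations(WORDS)]


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
sharp = softmax_with_temperature(logits, 0.5)
soft = softmax_with_temperature(logits, 2.0)
print(len(orderings), round(sharp[0], 3), round(soft[0], 3))
```

Six distinct words give 6! = 720 orderings. A learned model never holds that list; it holds a distribution, and temperature decides how far sampling strays from its most likely choice.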
Don't serve a model — export only its forward pass as a static ONNX graph (tokens in, last-position logits out) and keep the autoregressive loop, sampling, and tokenization in JS, so a small model becomes a static asset that runs client-side with no server.
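The split described above can be sketched without any ONNX calls. In the sketch below, `forward` stands in for the exported graph (tokens in, last-position logits out) and everything else lives in the caller. The post keeps this loop in JS; Python is used here only for illustration, and the stub model and all names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

# Stand-in for the exported graph: tokens in, last-position logits out.
Forward = Callable[[Sequence[int]], List[float]]


def sample_next(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Sample one token id from temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights)[0]


def generate(forward: Forward, prompt: Sequence[int], n_new: int,
             temperature: float = 1.0, seed: int = 0) -> List[int]:
    """Autoregressive loop kept outside the model: call, sample, append."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        logits = forward(tokens)
        tokens.append(sample_next(logits, temperature, rng))
    return tokens


def stub_forward(tokens: Sequence[int]) -> List[float]:
    """Hypothetical 4-token model that always predicts (last + 1) mod 4."""
    vocab = 4
    target = (tokens[-1] + 1) % vocab
    return [0.0 if i == target else float("-inf") for i in range(vocab)]


out = generate(stub_forward, prompt=[0], n_new=5)
print(out)
```

Because the graph only answers "what comes next", the same static file serves any sampling strategy the client chooses to write.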
gatsby's first corpus cost about $6 in Claude API usage to write. This round throws that out and has a mixture of four local open models (Olmo, Ministral, Gemma, Granite) write the corpus instead: free, unlimited, and in four different voices. The resulting model matches the paid baseline's behaviour at $0. The catch, and the finding: the blend is a designed object. A Granite-heavy first round broke the green-light dial; rebalancing away from Granite and doubling the data brought the dial back.
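Treating the blend as a designed object means writing the mixture down as explicit weights. The sketch below splits a document budget across the four writers with largest-remainder rounding. The weights and corpus sizes are hypothetical, chosen only to mirror the "Granite-heavy, then rebalanced and doubled" story, and are not the studio's actual numbers.

```python
from typing import Dict


def allocate(total: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split `total` documents across sources using largest-remainder rounding."""
    if total < 0 or not weights or any(w < 0 for w in weights.values()):
        raise ValueError("need a non-negative total and non-negative weights")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining documents to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical blends: one source dominating, then an even mix on twice the data.
granite_heavy = {"Olmo": 1, "Ministral": 1, "Gemma": 1, "Granite": 3}
rebalanced = {"Olmo": 1, "Ministral": 1, "Gemma": 1, "Granite": 1}

first = allocate(1000, granite_heavy)
second = allocate(2000, rebalanced)
print(first, second)
```

With the weights in one place, a broken dial can be traced back to a specific blend and rerun with a different one.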
A char-level model built to compulsively reach for Gatsby's green light — and the $0, fully-controlled ablation that found the dial's real bottleneck: signal loudness, not corpus shape.