cognitive: code that respects working memory
Most of the code in the world is shit. AI is making more
of it, faster. cognitive measures one specific
reason why — Miller's 7±2 working-memory ceiling, applied
at every scale of Java source. neo is the
parsing library that fell out of building it.
Most of the code in the world is shit. This isn't new. It was this realization that led the 1968 NATO Software Engineering Conference, whose report Naur and Randell edited and published in 1969, to name the Software Crisis: the moment many mark as the beginning of software engineering as a formal discipline. The Standish Group's CHAOS report a half-century later re-affirmed what practitioners already knew: it really hasn't gotten much better.
With the exponential advance of AI since November 2025, many complain that they feel like they're drowning in the ocean of code these intelligences produce. Much of it is of dubious quality. Publicized incidents, from the complete deletion of company production databases to security issues at industrial scale, haven't helped trust.
How do we get a handle on this? My PhD dissertation was a direct tilt at this windmill. I don't think anybody read it (not even my advisors, KILL ME) — but the new thing it did was look at the problem not as a computer scientist would, but as a thinking human being.
Many approaches to improving software quality look at metrics — Albrecht's Function Points, McCabe's cyclomatic complexity, Halstead's software science. They treat software as a phenomenon like Physics or Mathematics: something to theorize about, model as an existential concept, and explain the way Newton, Maxwell, or Fermi explained gravity, electromagnetism, and the nuclear forces that built the modern age.
My approach was different, because I took Fred Brooks' words to heart:
The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
Yet the program construct, unlike the poet's words, is real in the sense that it moves and works, producing visible outputs separate from the construct itself. It prints results, draws pictures, produces sounds, moves arms. The magic of myth and legend has come true in our time. One types the correct incantation on a keyboard, and a display screen comes to life, showing things that never were nor could be.
— Frederick P. Brooks Jr., The Mythical Man-Month (1975), Ch. 1, "The Tar Pit"
If we understand software quite literally as Thinking — the concretization of Intelligence — then we have to ask Much Bigger Questions like "what does it mean to Think Well?" That has always been a fraught question. Social answers almost invariably choose the wrong thing.
The brute-force fan favorite
In AI, the trend has been playing out the same way. Companies have been optimizing for raw parameter count: billions, then trillions, fed through racks of parallel GPUs doing floating-point arithmetic at scales that have made Nvidia very valuable. The thinking is pure brute force — throw enough hardware and energy at the problem and you can do anything. The resulting crunch on memory and power grids has, predictably, produced public backlash.
Recently, researchers at Microsoft Research published BitNet b1.58: ternary {−1, 0, 1} weights, matching FP16 perplexity at the 3B scale, with 3.5× less GPU memory and 2.7× faster inference. These models are proving quite capable with far fewer bits per weight (about 1.58 instead of 16). We can't be too surprised: throwing as many resources as possible at a problem until it goes away is the stupidest approach possible, and it rarely ever works. But it's a fan favorite, so we keep going back to it.
An aside on MYCIN, 50 years late
An interesting aside, at least to me, is that the fundamental innovation BitNet b1.58 claims is conceptually similar to what Edward Shortliffe did with MYCIN in the 1970s. Approximating conditional probability with certainty factors in the bounded range [−1, 1] was exactly what Chapter 4 of Shortliffe's dissertation, "Model of Inexact Reasoning in Medicine," laid out 50 years ago. It is the certainty-factor model the Uncertainty in AI community subsequently analyzed for decades. MYCIN used it to achieve results that were literally Gold Standard, already showing parity with the best doctors in the field, for the specific domain it was built for. At the time, the cleverness was forced: computers simply couldn't do the massive parallel floating-point operations they can today.
Wouldn't you know it, the AI Winter of the 1980s left MYCIN as a mainly forgotten curiosity of the general AI field. Dr. Shortliffe himself went on to become a doctor in Bioinformatics and taught us directly about this system in the first cohort of Biomedical Informatics at ASU. I can only marvel that the modern AI community is rediscovering old techniques, and speculate how many millions of people could've been helped if MYCIN had been properly invested in and developed 50 years ago.
The point here is that maybe, as an Iron Law of Progress, we always have to think stupidly about something and dump resources into it until some dissatisfied people say "there's gotta be a better way of doing this." Doubly so if we forget our history. Sometimes people think of the better way ahead of their time; if they're not burned at the stake they're laughed at and forgotten. Usually it's those people research has to go looking for when the dominant models hit a wall — the way AI's power and expense led to a renewed interest in conditional-probability shortcuts, or the way genetics researchers rediscovered Lamarck while exploring epigenetics.
Cognitive Load Theory, applied
My approach toward "what does it mean to think well" didn't come from Ivory Tower formulas I derived in my head. It came from an intellectual interest in Education — quite literally, how we as a society try to make humans think well. I found a sub-field called Cognitive Load Theory that produced very compelling, experimentally backed results for improving educational material based on an understanding of cognitive science. I took this as far as I could and wrote a thesis about how applying its principles, unified with software-engineering rules of thumb, could help everybody.
CLT divides cognitive load into three kinds:
- Intrinsic load — inherent to the task you're trying to accomplish. You can't eliminate it; you can only manage it.
- Extraneous load — imposed by how the material is presented. Bad notation, scattered control flow, missing context. This is the load to delete.
- Germane load — the productive load of building mental models that let you handle the intrinsic load better. This is the load to invest.
Working memory is the bottleneck of all three. The capacity of working memory is famously seven, plus or minus two chunks (Miller, 1956). If your code requires the reader to hold more than that at once, the reader's working memory overflows. The reader doesn't notice they're failing; they just feel confused, then bored, then frustrated, then leave. Their bug list grows. Their fix attempts regress. Their productivity tanks. This is why most code feels worse than it should: it demands more chunks than a finite working memory can hold.
What the dissertation actually showed
I won't bore you with the details. Even Claude complained about reading my work; the ocularly masochistic among us can dive in here. For the rest of us, the AI developed a pithy summary:
The dissertation's hypothesis was that if Clean Code, SOLID, and the canonical refactorings work because they manage cognitive load, then code refactored along CLT principles should be measurably easier to debug. Not prettier. Not subjectively cleaner. Measurably easier.
An n=188 controlled 2×2 factorial experiment on Joda-Time — experienced × novice cohort, original × refactored treatment — measured this directly. The refactored code produced:
- Significantly lower mean time to resolution (p < .05)
- Significantly fewer regressions introduced while fixing the bug (p < .05)
- Lower self-reported perceived cognitive load
- No Expertise Reversal Effect observed for size-management refactorings — the benefit held for novices and experienced engineers.
The quiet punchline: conventional static metrics do not predict comprehensibility.
DateTimeFormatterBuilder's cyclomatic complexity dropped only 550 → 543 between the control and experimental treatments — yet mean time to resolution and regression rate both improved significantly. The reliable signal was perceived cognitive load, which correlates with simple structural counts at three scales: lines per method body, methods per class, top-level types per package.
Of course, rules of thumb with a single validating experiment do not a science make. I want to be very clear: I'm not claiming these results as The One Weird Trick to Make Impressionable Young AIs do Whatever you Want. I'm saying they might be something worth investigating.
One thing worth thinking about: "the context chase" we do daily with AI agents has interesting parallels with Working Memory. "Prompt engineering" is very closely related to writing a good specification you'd give to a human being. Working with these evolving intelligences is about improving our communication: eliminating extraneous cognitive load, and applying germane cognitive load judiciously to manage the intrinsic cognitive load of the tasks we actually want done.
cognitive: the tool
cognitive applies the straightforward Miller 7±2 analysis to lines in methods, methods in classes, and top-level types in packages. After prototyping it on my own code, I added some common-sense screening: trivial accessors (a one-line body of the form "return field;" or "field = param;" for an actual field of the enclosing class) impose ~0 Germane Cognitive Load and don't count toward the per-class method total. A method named like an accessor that does anything else imposes more than average load, because it violates the Principle of Least Surprise. The reader expected an accessor; they got a behavior. Those still count.
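As a rough illustration of the screening rule described above, here is a minimal Java sketch that counts methods toward a per-class total while skipping trivial accessors. Everything in it is an assumption made for illustration: `ChunkScreening`, `MethodSummary`, `countedMethods`, `overLimit`, and the `trivialAccessor` flag are hypothetical names rather than the tool's API, and it assumes a count is flagged only when it is strictly greater than 7.

```java
import java.util.List;

class ChunkScreening {
    // Assumption: a count is flagged only when it is strictly greater than 7.
    static final int LIMIT = 7;

    // Hypothetical per-method summary; a real analyzer would derive this from parsed source.
    record MethodSummary(String name, int bodyLines, boolean trivialAccessor) {}

    // Trivial accessors are screened out. A method that is merely named like an
    // accessor but does other work is not marked trivial, so it still counts.
    static long countedMethods(List<MethodSummary> methods) {
        return methods.stream().filter(m -> !m.trivialAccessor()).count();
    }

    static boolean overLimit(long count) {
        return count > LIMIT;
    }

    public static void main(String[] args) {
        List<MethodSummary> methods = List.of(
            new MethodSummary("getName", 1, true),    // return name;
            new MethodSummary("setName", 1, true),    // this.name = name;
            new MethodSummary("getTotal", 6, false),  // named like a getter, but computes
            new MethodSummary("apply", 5, false));
        long counted = countedMethods(methods);
        System.out.println(counted + " method(s), over limit: " + overLimit(counted));
    }
}
```

The same `overLimit` check would apply unchanged to lines per method body and to top-level types per package; only the thing being counted differs.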
It's lightweight and runs quickly. Pointed at a source root, it emits a drill-down:
    $ ./cognitive-load ~/visionary_software/apt/src/main
    software.visionary.apt: 5 class(es)
      AllMetaAnnotations: 3 method(s)
        apply: 5 line(s)
        accumulate: 7 line(s)
        fqn: 2 line(s)
      DeclaredAncestorsOf: 3 method(s)
        apply: 4 line(s)
        visitParents: 3 line(s)
        elementOf: 7 line(s)
      ResolvedAncestorsOf: 3 method(s)
        <init>: 1 line(s)
        apply: 5 line(s)
        walker: 1 line(s)
      MirrorWalker (nested): 3 method(s)
        <init>: 1 line(s)
        visitDeclared: 7 line(s)
        isJavaLangObject: 1 line(s)
Counts greater than 7 get a trailing *.
Piping through grep '\*' yields a flat
violations list without losing per-row context. The output
is feedback that helps you decide where to
EXTRACT METHOD,
SPROUT CLASS, or carve out a
SEGREGATED CORE
in a different module.
neo: the substrate that fell out
A lot of the code-analysis machinery cognitive
needed wasn't specifically tied to chunking metrics. It
was generic "navigate Java source faithfully" plumbing:
drive the platform javac, group compilation
units into packages, resolve every tree node back to the
exact characters it came from, count substantive lines
cleanly through comments and string literals.
I extracted that into
neo
— a small, immutable, walk-anywhere model of Java source
built on the JDK's
com.sun.source Trees API. Three layers,
compiler-theory arc, dependency direction downward only:
lexical → parsing → syntax. Each node pairs a tree with
its SourceContext so it can resolve itself
back to source text. The model is the source —
not a lossy copy.
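To make "resolve itself back to source text" concrete, here is a minimal sketch that uses only the JDK's public com.sun.source API to recover the exact characters of each method declaration. It is not neo's implementation: `SourceSpans`, `StringSource`, and `methodTexts` are hypothetical names, and the sketch assumes it runs on a full JDK, where `ToolProvider.getSystemJavaCompiler()` returns a compiler rather than null.

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.Trees;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class SourceSpans {
    // In-memory source file, so the sketch needs nothing on disk.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Parses the source (no attribution) and returns the text of each method in top-level classes.
    static List<String> methodTexts(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null on a bare JRE
        JavacTask task = (JavacTask) compiler.getTask(
            null, null, null, null, null, List.of(new StringSource(className, code)));
        SourcePositions positions = Trees.instance(task).getSourcePositions();
        List<String> texts = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                CharSequence source = unit.getSourceFile().getCharContent(true);
                for (Tree type : unit.getTypeDecls()) {
                    if (!(type instanceof ClassTree cls)) continue;
                    for (Tree member : cls.getMembers()) {
                        if (!(member instanceof MethodTree)) continue;
                        // Character offsets of the node within its compilation unit.
                        int start = (int) positions.getStartPosition(unit, member);
                        int end = (int) positions.getEndPosition(unit, member);
                        texts.add(source.subSequence(start, end).toString());
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String code = "class A { int x; int getX() { return x; } }";
        methodTexts("A", code).forEach(System.out::println);
    }
}
```

Pairing each tree with its compilation unit is what makes the lookup possible: `SourcePositions` needs both to answer, which is presumably why a node would carry its source context along with it.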
neo carries no opinion about what its
numbers mean. The cognitive-load analyzer is
just one consumer; a refactoring tool, a doc generator, a
codemod, or a linter could consume the same model and
apply its own policy. Methods arrive pre-classified as
PlainMethod, Getter, or
Setter through a sealed
Method | Accessor hierarchy, so consumers
pattern-match on accessor-ness instead of asking a
separate predicate. Splitting it out gave the model one
axis of change per layer and a clean, dependency-free
public surface (only the JDK compiler modules).
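A sketch of what pattern-matching on accessor-ness could look like for a consumer. The type names PlainMethod, Getter, Setter, Method, and Accessor come from the text above, but their exact shape here (records, an `Accessor` interface that extends `Method`, the `name` and `bodyLines` components) is assumed for illustration and is not neo's actual API. `AccessorMatching`, `countsTowardClassTotal`, and `countedMethods` are hypothetical. The switch needs Java 21 or later.

```java
import java.util.List;

class AccessorMatching {
    // Assumed shape: Method is the sealed root, Accessor groups Getter and Setter.
    sealed interface Method permits PlainMethod, Accessor {
        String name();
    }

    sealed interface Accessor extends Method permits Getter, Setter {}

    record PlainMethod(String name, int bodyLines) implements Method {}
    record Getter(String name) implements Accessor {}
    record Setter(String name) implements Accessor {}

    // One consumer's policy: accessors are free, everything else counts.
    // The sealed hierarchy makes the switch exhaustive without a default branch.
    static boolean countsTowardClassTotal(Method method) {
        return switch (method) {
            case PlainMethod plain -> true;
            case Accessor accessor -> false;
        };
    }

    static long countedMethods(List<Method> methods) {
        return methods.stream().filter(AccessorMatching::countsTowardClassTotal).count();
    }

    public static void main(String[] args) {
        List<Method> methods = List.of(
            new PlainMethod("apply", 5), new Getter("getName"), new Setter("setName"));
        System.out.println(countedMethods(methods));
    }
}
```

A different consumer, such as a doc generator, could switch over the same hierarchy with its own policy, and the compiler would flag any case it forgot to handle.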
scc and SonarQube on both
As with every release, here are the leanness and
static-analysis numbers. Both modules build with
./gradlew clean test jacocoTestReport, both
ran through the
scc
LOC counter (COCOMO organic, $150K average wage,
src/main only), both ran through a fresh
SonarQube CE instance with JaCoCo coverage XML
attached.
| Module | Files | Code (Java LOC) | Complexity | Est. cost (COCOMO) | Est. schedule |
|---|---|---|---|---|---|
| neo | 18 | 543 | 47 | $37,920 | 2.73 mo |
| cognitive | 5 | 122 | 6 | $7,907 | 1.51 mo |
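The cost and schedule columns can be reproduced by hand. The sketch below assumes the standard organic-mode COCOMO constants (effort = 2.4 × KLOC^1.05 person-months, schedule = 2.5 × effort^0.38 months) and an overhead multiplier of 2.4 on top of the $150K wage. Those constants are assumed here to be the counter's defaults rather than taken from the text, and `CocomoSketch` and its method names are hypothetical.

```java
class CocomoSketch {
    // Organic-mode COCOMO constants (assumed to match the LOC counter's defaults).
    static final double A = 2.4, B = 1.05, C = 2.5, D = 0.38;
    // Assumed overhead multiplier applied on top of wages.
    static final double OVERHEAD = 2.4;

    // Estimated effort in person-months for a given count of code lines.
    static double effortMonths(int loc) {
        return A * Math.pow(loc / 1000.0, B);
    }

    // Estimated calendar schedule in months.
    static double scheduleMonths(int loc) {
        return C * Math.pow(effortMonths(loc), D);
    }

    // Estimated cost: effort times monthly wage times overhead.
    static double cost(int loc, double annualWage) {
        return effortMonths(loc) * (annualWage / 12.0) * OVERHEAD;
    }

    public static void main(String[] args) {
        for (int loc : new int[] {543, 122}) {
            System.out.printf("%d LOC: $%.0f, %.2f mo%n",
                loc, cost(loc, 150_000), scheduleMonths(loc));
        }
    }
}
```

Under these assumptions, 543 and 122 lines round to the cost and schedule figures in the table, which suggests the estimates are a pure function of line count and wage.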
| Module | Method-body violations | Method-count violations | Package-count violations |
|---|---|---|---|
| neo | 0 | 0 | 0 |
| cognitive | 0 | 0 | 0 |
| Module | Line coverage | Branch coverage | Bugs | Vulns | Smells | Reliability / Security / Maintainability | Duplication |
|---|---|---|---|---|---|---|---|
| neo | 98.7% | 100% | 0 | 0 | 9 | A / A / A | 0.0% |
| cognitive | 96.4% | 100% | 0 | 0 | 2 | A / A / A | 0.0% |
Full disclosure on the smells, because someone running the
analysis themselves will see them: neo's nine are three
non-escaped Unicode codepoints in Delimiter
(a sealed type whose purpose is to be those
characters), two "move this method into the sealed
subtype" suggestions on a switch helper, two enum-singleton
lectures, one "remove commented-out code" misfire on a
JavaDoc @throws in a test, and one "don't
name your local variable the same as a restricted
identifier" in a test about parsing records. cognitive's
two are both Main.java:45 telling me to
replace System.err / System.out
in a CLI's main() with a logger. I left
them flagged rather than add @SuppressWarnings
noise to production code just to make SonarQube happy.
Maintainability is rated on smell density, not
existence — both modules sit comfortably in A.
The analyzer that flags violations of Miller's rule does not violate Miller's rule on either of its own repos. That circularity is intentional. If a code-quality tool won't eat its own dog food, it has no business asking you to eat anything.
Insisting on these tools and baking them into my development workflow has improved what agents generate for me. It hasn't made it perfect. It can't.
Teaching them better
As we said at the beginning, most of the code in the world is shit. These AIs have been trained on that code. Context and experience matter. You can't show up to the slum along the river in Asunción, Paraguay, where poor Guaraní are literally making shelters out of plastic bags, hand them a rack of top-of-the-line GPUs, and say "Bitcoin/AI your way to prosperity." Similarly, you can't feed these evolving intelligences the code equivalent of John Grisham novels and bad teen-romance series about vampires and werewolves, then expect that applying one simple tool to make things easier to understand will undo all that.
Obviously, running the tool on what they produce, using our brains to manage cognitive load, and teaching them how to write and edit in a virtuous loop will help. But until there are really good training sets — Dostoyevsky Brothers Karamazov or Tolkien Silmarillion levels of example — evolving intelligences will favor convoluted logic, procedural blobs with scattered control flow, React miasmas of state and effects that don't make any fucking sense. That's what people have written. How could the AI know better?
By teaching them better. I'm developing the Kompuwiz framework to serve as an educational example, actively building model-context-protocol servers, skills, and enough functionality to act as a good training set. I'm too busy building to be training them — but training them on what gets built is the next logical step.
Why this matters
Of course, someone who considers themselves "reasonable" might ask, "does this even fucking matter?" They may assert that "code that works is all that matters. I don't care what it does inside any more than you care how your internal-combustion engine works." There are bombastic proclamations that programming is dead, engineering is dead, all white-collar jobs will be automated in 18 months. I've been around long enough to see a lot of snake oil sold in our field.
All I can say is that if you don't care how your cancer is treated, as long as it's blasted away, and you're willing to believe overconfident engineers and executives assuring you everything will be alright, don't be surprised if you get Therac-25ed.
cognitive and neo are both
small, fast, free, and usable today against any Java
source tree you point them at. Neither is finished —
cognitive's own README lists rules I'd like to add, and
neo is a substrate I expect to keep pushing more
interesting functionality into as I build the next
analyzers on top of it. But both are real tools you can
run right now to take action on real code. Neither will
fix the Software Crisis. Neither claims to. But both
refuse to leak the one specific thing they're meant to
measure — and they bind themselves to the rule they
enforce. If you want tools built that way for your
stack, you can email me at nico@visionary.software
with all your cheers, boos, and beers. I'd be happy
to build the next one with you, or discuss licensing
for your commercial work.
Programmer attention and memory have always been limited. We just kept saying "that's your problem — you're smart, figure it out." I will keep building tools that make it easier.