cognitive: code that respects working memory

Most of the code in the world is shit. AI is making more of it, faster. cognitive measures one specific reason why — Miller's 7±2 working-memory ceiling, applied at every scale of Java source. neo is the parsing library that fell out of building it.

Most of the code in the world is shit. This isn't new. It was this realization that led Naur and Randell to define the Software Crisis in 1969 — the moment many mark as the beginning of software engineering as a formal discipline. The Standish Group's CHAOS report a half-century later re-affirmed what practitioners already knew: it really hasn't gotten much better.

With the exponential advance of AI literally since November 2025, many are complaining that they feel like they're drowning in the ocean of code these intelligences are producing. Much of it is of dubious quality. Publicized incidents of complete deletion of company production databases and security issues at industrial scale haven't raised trust.

How do we get a handle on this? My PhD dissertation was a direct tilt at this windmill. I don't think anybody read it (not even my advisors, KILL ME) — but the new thing it did was look at the problem not as a computer scientist would, but as a thinking human being.

Many approaches to improving software quality look at metrics — Albrecht's Function Points, McCabe's cyclomatic complexity, Halstead's software science. They treat software as a phenomenon like Physics or Mathematics: something to theorize about, model as an existential concept, and explain the way Newton, Maxwell, or Fermi explained gravity, electromagnetism, and the nuclear forces that built the modern age.

My approach was different, because I took Fred Brooks' words to heart:

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

Yet the program construct, unlike the poet's words, is real in the sense that it moves and works, producing visible outputs separate from the construct itself. It prints results, draws pictures, produces sounds, moves arms. The magic of myth and legend has come true in our time. One types the correct incantation on a keyboard, and a display screen comes to life, showing things that never were nor could be.

— Frederick P. Brooks Jr., The Mythical Man-Month (1975), Ch. 1, "The Tar Pit"

If we understand software quite literally as Thinking — the concretization of Intelligence — then we have to ask Much Bigger Questions like "what does it mean to Think Well?" That has always been a fraught question. Social answers almost invariably choose the wrong thing.

The brute-force fan favorite

In AI, the trend has been playing out the same way. Companies have been optimizing for raw parameter count: billions, then trillions, fed through racks of parallel GPUs doing floating-point arithmetic at scales that have made Nvidia very valuable. The thinking is pure brute force — throw enough hardware and energy at the problem and you can do anything. The resulting crunch on memory and power grids has, predictably, produced public backlash.

Recently, researchers at Microsoft Research published BitNet b1.58: ternary {−1, 0, 1} weights, matching FP16 perplexity at the 3B scale, with 3.5× less GPU memory and 2.7× faster inference. These models are showing themselves quite capable using many, many fewer bits per parameter. We can't be too surprised: throwing as many resources as possible at a problem until it goes away is the stupidest approach possible, and it rarely ever works. But it's a fan favorite, so we keep going back to it.

An aside on MYCIN, 50 years late

An interesting aside, at least to me, is that the fundamental innovation BitNet b1.58 claims is conceptually similar to what Edward Shortliffe did with MYCIN in the 1970s. Reducing conditional probabilities to certainty factors bounded in the range [−1, 1] is what Chapter 4 of Shortliffe's dissertation, "Model of Inexact Reasoning in Medicine," laid out 50 years ago. This is the certainty-factor model that the Uncertainty in AI community subsequently analyzed for decades. MYCIN used it to achieve results that were literally Gold Standard, already showing parity with the best doctors in the field, for the specific domain it was built for. At the time, the motivation was that computers simply couldn't do the massive parallel floating-point operations they can today, so one had to be clever.

Wouldn't you know it, the AI Winter of the 1980s left MYCIN as a largely forgotten curiosity of the general AI field. Dr. Shortliffe himself went on to a career in Biomedical Informatics and taught us directly about this system in the first cohort of Biomedical Informatics at ASU. I can only marvel that the modern AI community is rediscovering old techniques, and speculate how many millions of people could've been helped if MYCIN had been properly invested in and developed 50 years ago.

The point here is that maybe, as an Iron Law of Progress, we always have to think stupidly about something and dump resources into it until some dissatisfied people say "there's gotta be a better way of doing this." Doubly so if we forget our history. Sometimes people think of the better way ahead of their time; if they're not burned at the stake they're laughed at and forgotten. Usually it's those people research has to go looking for when the dominant models hit a wall — the way AI's power and expense led to a renewed interest in conditional-probability shortcuts, or the way genetics researchers rediscovered Lamarck while exploring epigenetics.

Cognitive Load Theory, applied

My approach toward "what does it mean to think well" didn't come from Ivory Tower formulas I derived in my head. It came from an intellectual interest in Education — quite literally, how we as a society try to make humans think well. I found a sub-field called Cognitive Load Theory that produced very compelling, experimentally backed results for improving educational material based on an understanding of cognitive science. I took this as far as I could and wrote a thesis about how applying its principles, unified with software-engineering rules of thumb, could help everybody.

CLT divides cognitive load into three kinds:

  • Intrinsic load — inherent to the task you're trying to accomplish. You can't eliminate it; you can only manage it.
  • Extraneous load — imposed by how the material is presented. Bad notation, scattered control flow, missing context. This is the load to delete.
  • Germane load — the productive load of building mental models that let you handle the intrinsic load better. This is the load to invest.

Working memory is the bottleneck for all three. Its capacity is famously seven chunks, plus or minus two (Miller, 1956). If your code requires the reader to hold more than that at once, the reader's working memory overflows. The reader doesn't notice they're failing. They just feel confused, then bored, then frustrated, and then they leave. Their bug list grows. Their fix attempts regress. Their productivity tanks. This is why most code feels worse than it should: it asks a finite working memory to hold too many chunks.

What the dissertation actually showed

I won't bore you with the details. Even Claude complained about reading my work; the ocularly masochistic among us can dive in here. For the rest of us, the AI developed a pithy summary:

The dissertation's hypothesis was that if Clean Code, SOLID, and the canonical refactorings work because they manage cognitive load, then code refactored along CLT principles should be measurably easier to debug. Not prettier. Not subjectively cleaner. Measurably easier.

An n=188 controlled 2×2 factorial experiment on Joda-Time — experienced × novice cohort, original × refactored treatment — measured this directly. The refactored code produced:

  • Significantly lower mean time to resolution (p < .05)
  • Significantly fewer regressions introduced while fixing the bug (p < .05)
  • Lower self-reported perceived cognitive load
  • No Expertise Reversal Effect observed for size-management refactorings — the benefit held for novices and experienced engineers.

The quiet punchline: conventional static metrics do not predict comprehensibility. DateTimeFormatterBuilder's cyclomatic complexity dropped only 550 → 543 between the control and experimental treatments — yet mean time to resolution and regression rate both improved significantly. The reliable signal was perceived cognitive load, which correlates with simple structural counts at three scales: lines per method body, methods per class, top-level types per package.

Of course, rules of thumb with one validating experiment do not a science make. I want to be very clear: I'm not claiming these results as The One Weird Trick to Make Impressionable Young AIs do Whatever you Want. I'm saying they might be something worth investigating.

One thing worth thinking about: "the context chase" we do daily with AI agents has interesting parallels with Working Memory. "Prompt engineering" is very closely related to writing a good specification you'd give to a human being. Working with these evolving intelligences is about improving our communication: eliminating extraneous cognitive load, and applying germane cognitive load judiciously to manage the intrinsic cognitive load of the tasks we actually want done.

cognitive: the tool

cognitive applies the straightforward Miller 7±2 analysis to lines in methods, methods in classes, and top-level types in packages. After prototyping it on my own code, I added some common-sense screening: trivial accessors (a one-line "return field;" or "field = param;" on an actual field of the enclosing class) impose ~0 Germane Cognitive Load and don't count toward the per-class method total. A method named like an accessor that does anything else imposes more than average load, because it violates the Principle of Least Surprise. The reader expected an accessor; they got a behavior. Those still count.
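To make the screening rule concrete, here is a minimal sketch of how it could work. This is not cognitive's actual code: the MethodShape record, its fields, and the CEILING constant are hypothetical names invented for illustration, and the sketch assumes the "is this body a bare field read or write" question has already been answered upstream.

```java
import java.util.List;

/** Hypothetical sketch of the accessor screening; none of these names come from cognitive itself. */
final class AccessorScreening {
    /** Assumed ceiling: a count greater than 7 is treated as a violation. */
    static final int CEILING = 7;

    /** Illustrative summary of one method: name, body line count, and whether the body only reads or writes a field. */
    record MethodShape(String name, int bodyLines, boolean bareFieldAccess) {
        /** Trivial accessor: a one-line body that only returns or assigns a field of the enclosing class. */
        boolean isTrivialAccessor() {
            return bodyLines == 1 && bareFieldAccess;
        }
    }

    /** Counts methods toward the per-class total, skipping trivial accessors. */
    static long countedMethods(List<MethodShape> methods) {
        return methods.stream().filter(m -> !m.isTrivialAccessor()).count();
    }

    /** True when a count is over the assumed ceiling. */
    static boolean exceedsCeiling(long count) {
        return count > CEILING;
    }

    public static void main(String[] args) {
        List<MethodShape> methods = List.of(
                new MethodShape("getName", 1, true),    // trivial getter: skipped
                new MethodShape("setName", 1, true),    // trivial setter: skipped
                new MethodShape("getTotal", 4, false),  // named like an accessor but has behavior: counted
                new MethodShape("apply", 5, false));    // ordinary method: counted
        long counted = countedMethods(methods);
        System.out.println(counted + " method(s) counted, violation=" + exceedsCeiling(counted));
    }
}
```

Note that the name never enters the decision. A method called getTotal that computes something is counted like any other method, which is the Least Surprise point above.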

It's lightweight and runs quickly. Pointed at a source root, it emits a drill-down:

$ ./cognitive-load ~/visionary_software/apt/src/main
software.visionary.apt: 5 class(es)
  AllMetaAnnotations: 3 method(s)
    apply: 5 line(s)
    accumulate: 7 line(s)
    fqn: 2 line(s)
  DeclaredAncestorsOf: 3 method(s)
    apply: 4 line(s)
    visitParents: 3 line(s)
    elementOf: 7 line(s)
  ResolvedAncestorsOf: 3 method(s)
    <init>: 1 line(s)
    apply: 5 line(s)
    walker: 1 line(s)
    MirrorWalker (nested): 3 method(s)
      <init>: 1 line(s)
      visitDeclared: 7 line(s)
      isJavaLangObject: 1 line(s)

Counts greater than 7 get a trailing *. Piping through grep '\*' yields a flat violations list without losing per-row context. The output is feedback that helps you decide where to EXTRACT METHOD, SPROUT CLASS, or carve out a SEGREGATED CORE in a different module.
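As a rough illustration of the marking rule, the sketch below formats rows in the same shape as the drill-down and filters them the way grep '\*' would. The class and method names are hypothetical, the package and method names in the demo are made up, and the single space before the marker is an assumption about the exact output format.

```java
import java.util.List;

/** Hypothetical sketch of the drill-down row format; the real tool's code may differ. */
final class DrillDownRows {
    /** Assumed from the rule that counts greater than 7 get a trailing '*'. */
    static final int CEILING = 7;

    /** Formats one row, indenting two spaces per depth level and appending '*' when over the ceiling. */
    static String row(int depth, String name, int count, String label) {
        String marker = count > CEILING ? " *" : "";
        return "  ".repeat(depth) + name + ": " + count + " " + label + marker;
    }

    /** In-process equivalent of piping the report through grep '\*'. */
    static List<String> violations(List<String> rows) {
        return rows.stream().filter(r -> r.endsWith("*")).toList();
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                row(0, "com.example.demo", 9, "class(es)"),  // made-up package, over the ceiling
                row(1, "Parser", 3, "method(s)"),
                row(2, "parse", 12, "line(s)"));             // made-up method, over the ceiling
        rows.forEach(System.out::println);
        System.out.println(violations(rows).size() + " violation(s)");
    }
}
```

Because every row already carries its own name and count, the filtered list stays readable without the surrounding tree.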

neo: the substrate that fell out

A lot of the code-analysis machinery cognitive needed wasn't specifically tied to chunking metrics. It was generic "navigate Java source faithfully" plumbing: drive the platform javac, group compilation units into packages, resolve every tree node back to the exact characters it came from, count substantive lines cleanly through comments and string literals.

I extracted that into neo: a small, immutable, walk-anywhere model of Java source built on the JDK's com.sun.source Trees API. It has three layers that follow the compiler-theory arc (lexical → parsing → syntax), with dependencies pointing downward only. Each node pairs a tree with its SourceContext so it can resolve itself back to source text. The model is the source, not a lossy copy.
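For readers who haven't used the Trees API, here is a minimal sketch of the underlying JDK mechanism for resolving a tree node back to its exact source characters. It does not show neo's API: SourceTextDemo, StringSource, and methodTexts are hypothetical names, and the sketch assumes it runs on a full JDK (so ToolProvider can find javac) and only parses, without attribution.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.TreeScanner;
import com.sun.source.util.Trees;

import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: parse a source string and recover the exact text of each method. */
final class SourceTextDemo {
    /** In-memory source file so the example needs nothing on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns the source text of every method declaration, sliced by javac's own character offsets. */
    static List<String> methodTexts(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, List.of(new StringSource(className, code)));
        SourcePositions positions = Trees.instance(task).getSourcePositions();
        List<String> texts = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitMethod(MethodTree method, Void unused) {
                        int start = (int) positions.getStartPosition(unit, method);
                        int end = (int) positions.getEndPosition(unit, method);
                        texts.add(code.substring(start, end));
                        return super.visitMethod(method, unused);
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String code = "class Demo { int twice(int x) { return x * 2; } }";
        methodTexts("Demo", code).forEach(System.out::println);
    }
}
```

The offsets index into the original characters, so nothing is re-printed or normalized along the way. That is the property the "not a lossy copy" claim rests on.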

neo carries no opinion about what its numbers mean. The cognitive-load analyzer is just one consumer; a refactoring tool, a doc generator, a codemod, or a linter could consume the same model and apply its own policy. Methods arrive pre-classified as PlainMethod, Getter, or Setter through a sealed Method | Accessor hierarchy, so consumers pattern-match on accessor-ness instead of asking a separate predicate. Splitting it out gave the model one axis of change per layer and a clean, dependency-free public surface (only the JDK compiler modules).
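A sketch of what pattern-matching on accessor-ness can look like from the consumer side follows. The type names echo the ones above, but the nesting, the record fields, and the weight policy are assumptions for illustration, not neo's real API. It needs Java 21 or later for pattern matching in switch.

```java
/** Hypothetical shapes only: names echo the article, everything else is assumed. */
final class AccessorMatching {
    sealed interface Method permits PlainMethod, Accessor {}

    sealed interface Accessor extends Method permits Getter, Setter {}

    record PlainMethod(String name, int lines) implements Method {}

    record Getter(String name) implements Accessor {}

    record Setter(String name) implements Accessor {}

    /** One possible consumer policy: accessors add nothing to the per-class method count. */
    static int weight(Method method) {
        // The hierarchy is sealed, so the compiler checks that these cases are exhaustive.
        return switch (method) {
            case PlainMethod plain -> 1;
            case Accessor accessor -> 0;
        };
    }

    public static void main(String[] args) {
        System.out.println(weight(new PlainMethod("apply", 5)));
        System.out.println(weight(new Getter("getName")));
        System.out.println(weight(new Setter("setName")));
    }
}
```

A different consumer, say a doc generator, could switch on Getter and Setter separately and apply a completely different policy to the same model.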

scc and SonarQube on both

As with every release, here are the leanness and static-analysis numbers. Both modules build with ./gradlew clean test jacocoTestReport. Both were run through the scc LOC counter (COCOMO organic, $150K average wage, src/main only) and through a fresh SonarQube CE instance with the JaCoCo coverage XML attached.

Leanness: scc over src/main

Module     Files  Code (Java LOC)  Complexity  Est. cost (COCOMO)  Est. schedule
neo        18     543              47          $37,920             2.73 mo
cognitive  5      122              6           $7,907              1.51 mo

cognitive on itself: Miller 7±2 violations

Module     Method-body violations  Method-count violations  Package-count violations
neo        0                       0                        0
cognitive  0                       0                        0

SonarQube: coverage, ratings, smells

Module     Line coverage  Branch coverage  Bugs  Vulns  Smells  Reliability / Security / Maintainability  Duplication
neo        98.7%          100%             0     0      9       A / A / A                                  0.0%
cognitive  96.4%          100%             0     0      2       A / A / A                                  0.0%
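For anyone wondering where the cost and schedule figures come from, they can be reproduced from the LOC counts with the basic COCOMO organic formulas. The sketch below is my reconstruction rather than scc's source: the coefficients 2.4, 1.05, 2.5, and 0.38 are the standard organic-mode values, while the 2.4 overhead multiplier and the effort adjustment factor of 1.0 are assumptions about the settings in play (they do land on the numbers in the table).

```java
/** Hypothetical reconstruction of the COCOMO arithmetic behind the leanness table. */
final class CocomoEstimate {
    // Basic COCOMO coefficients for an "organic" project.
    static final double A = 2.4;
    static final double B = 1.05;
    static final double C = 2.5;
    static final double D = 0.38;
    // Assumed: overhead multiplier of 2.4, with an effort adjustment factor of 1.0 left implicit.
    static final double OVERHEAD = 2.4;

    /** Effort in person-months: A * KLOC^B. */
    static double effortPersonMonths(int loc) {
        return A * Math.pow(loc / 1000.0, B);
    }

    /** Schedule in calendar months: C * effort^D. */
    static double scheduleMonths(int loc) {
        return C * Math.pow(effortPersonMonths(loc), D);
    }

    /** Cost: effort times monthly wage times overhead. */
    static double cost(int loc, double annualWage) {
        return effortPersonMonths(loc) * (annualWage / 12.0) * OVERHEAD;
    }

    public static void main(String[] args) {
        double wage = 150_000; // average wage stated in the article
        System.out.printf("neo: $%,.0f over %.2f mo%n", cost(543, wage), scheduleMonths(543));
        System.out.printf("cognitive: $%,.0f over %.2f mo%n", cost(122, wage), scheduleMonths(122));
    }
}
```

The exponent on KLOC is barely above 1, so at this size the estimate is close to linear in line count. Treat it as a leanness proxy, not as a bill.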

Full disclosure on the smells, because someone running the analysis themselves will see them: neo's nine are three non-escaped Unicode codepoints in Delimiter (a sealed type whose purpose is to be those characters), two "move this method into the sealed subtype" suggestions on a switch helper, two enum-singleton lectures, one "remove commented-out code" misfire on a JavaDoc @throws in a test, and one "don't name your local variable the same as a restricted identifier" in a test about parsing records. cognitive's two are both Main.java:45 telling me to replace System.err / System.out in a CLI's main() with a logger. I left them flagged rather than add @SuppressWarnings noise to production code just to make SonarQube happy. Maintainability is rated on smell density, not existence — both modules sit comfortably in A.

The analyzer that flags violations of Miller's rule does not violate Miller's rule on either of its own repos. That circularity is intentional. If a code-quality tool won't eat its own dog food, it has no business asking you to eat anything.

Insisting on these tools and baking them into my development workflow has improved what agents generate for me. It hasn't made it perfect. It can't.

Teaching them better

As we said at the beginning, most of the code in the world is shit. These AIs have been trained on that code. Context and experience matter. You can't show up to the slum along the river in Asunción, Paraguay — where poor Guaraní are literally making shelters out of plastic bags — hand them a rack of top-of-line GPUs and say "Bitcoin/AI your way to prosperity." Similarly, you can't feed these evolving intelligences the code equivalent of John Grisham novels and bad teen-romance series about vampires and werewolves, then expect that applying one simple tool to make things easier to understand will undo all that.

Obviously, running the tool on what they produce, using our brains to manage cognitive load, and teaching them how to write and edit in a virtuous loop will help. But until there are really good training sets, at the level of Dostoyevsky's Brothers Karamazov or Tolkien's Silmarillion, evolving intelligences will favor convoluted logic, procedural blobs with scattered control flow, React miasmas of state and effects that don't make any fucking sense. That's what people have written. How could the AI know better?

By teaching them better. I'm developing the Kompuwiz framework to serve as an educational example, actively building model-context-protocol servers, skills, and enough functionality to act as a good training set. I'm too busy building to be training them — but training them on what gets built is the next logical step.

Why this matters

Of course, someone who considers themselves "reasonable" might ask, "does this even fucking matter?" They may assert that "code that works is all that matters. I don't care what it does inside any more than you care how your internal-combustion engine works." There are bombastic proclamations that programming is dead, engineering is dead, all white-collar jobs will be automated in 18 months. I've been around long enough to see a lot of snake oil sold in our field.

All I can say is that if you don't care how your cancer is treated, as long as it's blasted away, and you're willing to believe overconfident engineers and executives assuring you everything will be alright, don't be surprised if you get Therac-25ed.

cognitive and neo are both small, fast, free, and usable today against any Java source tree you point them at. Neither is finished — cognitive's own README lists rules I'd like to add, and neo is a substrate I expect to keep pushing more interesting functionality into as I build the next analyzers on top of it. But both are real tools you can run right now to take action on real code. Neither will fix the Software Crisis. Neither claims to. But both refuse to leak the one specific thing they're meant to measure — and they bind themselves to the rule they enforce. If you want tools built that way for your stack, you can nico@visionary.software me with all your cheers, boos, and beers. I'd be happy to build the next one with you, or discuss licensing for your commercial work.

Programmer attention and memory have always been limited. We just kept saying "that's your problem — you're smart, figure it out." I will keep building tools that make it easier.