Grokking

In deep learning, grokking names a specific quirk of training dynamics. A neural network memorizes its training set early, then sits on a plateau where validation accuracy stays near chance for thousands of additional steps: the model is clearly overfitting, validation loss is stuck, and nothing useful appears to be happening. Then, sometimes long after a human would have stopped training, validation loss suddenly collapses toward zero. The model has undergone something like a phase transition. It no longer relies on recalling its training data. It has compressed the dataset into the underlying rule and can now generalize to inputs it has never seen.

Grokking is the mechanistic blueprint for what the word “understanding” actually names. And once you see it, the two pseudo-understandings that dominate modern pedagogy become legible as near enemies of the real thing — each one preserves the appearance of learning while making the phase transition impossible.

Simple Picture

You are learning to spot a counterfeit coin. Your boss shows you five fakes. You memorize every scratch, every nick, the exact placement of the mint mark on each one. Someone asks “is this fake?” and you compare it against your stored images. For the first five coins, you are perfect. For any sixth coin, you are useless. This is overfitting — a lookup table masquerading as a skill.

Your boss keeps throwing coins at you. Hundreds. Thousands. Memorizing every scratch becomes impossible — your brain runs out of space. Out of sheer laziness, your brain stops tracking scratches and suddenly notices: wait, all the fakes are slightly lighter. You stop comparing images. You just feel the weight.

You have compressed thousands of memories into one rule. The lookup table collapsed into a function. The “aha” is not inspiration — it is your cognitive substrate running out of memory and being forced, by pressure alone, to invent the structure that would let it stop hoarding. Understanding is what happens when memorization becomes too expensive to continue.
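The coin story can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the `Coin` type, the weights, and the 7.8 g threshold are hypothetical, chosen to show the difference between storing examples and storing a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coin:
    scratches: tuple  # surface details, unique to each individual coin
    weight_g: float   # mass in grams


# Hypothetical training set: five known fakes, all a little light.
GENUINE_WEIGHT_G = 8.0  # assumed nominal mass of a real coin
known_fakes = [
    Coin(scratches=(i, i + 1), weight_g=7.5 + 0.01 * i) for i in range(5)
]

# Strategy 1: lookup table. Stores every example it has seen, verbatim.
lookup_table = set(known_fakes)


def is_fake_by_lookup(coin: Coin) -> bool:
    return coin in lookup_table


# Strategy 2: compressed rule. One number replaces all stored examples.
THRESHOLD_G = 7.8  # assumed cutoff between fake and genuine mass


def is_fake_by_rule(coin: Coin) -> bool:
    return coin.weight_g < THRESHOLD_G


# A sixth fake with scratches never seen before, and one genuine coin.
new_fake = Coin(scratches=(99, 100), weight_g=7.52)
real_coin = Coin(scratches=(7, 8), weight_g=GENUINE_WEIGHT_G)

print(is_fake_by_lookup(new_fake))  # False: the table has never seen it
print(is_fake_by_rule(new_fake))    # True: the rule generalizes
```

Both strategies are perfect on the five training coins. Only the rule survives the sixth, and it costs one stored number instead of five stored coins.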

The Mechanism

Grokking in neural networks depends on three ingredients that a conceptual account of learning can import directly:

Massive exposure. The network must be shown enough examples that no finite lookup table can fit them all. Too few examples and memorization wins, because memorization is cheap when the dataset is small.
A resource constraint. The canonical mechanism is weight decay: a continuous pressure that penalizes large weights, shrinking every parameter toward zero unless the data keeps justifying it. Without weight decay, the overfit solution can stay stable indefinitely; there is no pressure to find a simpler one. Weight decay is the cognitive tax that forces compression.
Training past the point of apparent stagnation. The phase transition only happens if you keep going after the model looks done. Every signal a human observer has — loss curves, intermediate tests, felt sense of progress — tells you to stop right before the thing you are waiting for arrives.
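To make the "continuous pressure" of the second ingredient concrete, here is a minimal sketch of a decoupled weight-decay update, the form used by optimizers such as AdamW. The function names, learning rate, decay strength, and starting weights are illustrative assumptions, and the gradient is set to zero to mimic a network whose training loss has already flatlined.

```python
import math


def decayed_step(weights, grads, lr=0.01, weight_decay=1.0):
    """One SGD step with decoupled weight decay.

    Each weight moves against its gradient and is also shrunk toward
    zero by an amount proportional to its own size.
    """
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights))


# Hypothetical memorizing solution: large weights, training loss already
# minimized, so the loss gradient is assumed to be zero everywhere.
weights = [4.0, -3.0, 12.0]
zero_grads = [0.0, 0.0, 0.0]

norms = [l2_norm(weights)]
for _ in range(100):
    weights = decayed_step(weights, zero_grads)
    norms.append(l2_norm(weights))

print(norms[0], norms[-1])  # the norm shrinks by a factor of 0.99 per step
```

With `weight_decay=0.0` the same loop leaves the weights untouched, which is the stable overfit described above: nothing pushes the network off its lookup table. In a real run the gradient is not exactly zero, so the decay and the data pull against each other, and only weights the data keeps paying for survive.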

Translated to human cognition: memorization is not the opposite of understanding; it is the prerequisite scaffolding. You cannot grok a pattern you have not first been buried under. The compression algorithm has nothing to compress without the raw data, and the raw data must exceed the capacity you are willing to allocate to storage before compression becomes the rational move. 死读书 — dead reading — is what the scaffolding looks like before the collapse. It is also what it looks like if the collapse never comes.

Learning as behavior change is the output-side test of that collapse: after compression, the learner must behave differently under the next encounter with reality, or the “understanding” is still only a story about the training run.

The Two Near Enemies

Most of what passes for “learning” today is one of two near-enemy states that mimic understanding while preventing the phase transition.

Near Enemy 1: Pure Memorization (死读书)

The student memorizes perfectly and compresses nothing. They pass every exam in the training distribution and shatter at the first encounter with a question phrased differently. This is the network trained without weight decay, forever. There is no pressure to compress, because the environment rewards exact retrieval. The lookup table works — until it doesn’t, and by the time it doesn’t, the capacity for compression has atrophied from disuse.

This near enemy is legible in the East and systemically produced by any credentialing regime that scores for recall. It is Wakalixes made into a lifestyle: the student can name every bone, every date, every theorem, and cannot predict what any of them will do next.

Near Enemy 2: Conceptual-First Pedagogy

The modern Western inversion. “Rote learning kills creativity — teach the why before the what. Give students the big idea and let them skip the tedious drilling.” The result is a student who can narrate understanding they do not possess — who has the vocabulary of compression without any raw material to compress. This is the network given the target rule as a prior, never shown examples, and asked to generalize. It cannot. The rule floats free of anything it could be a rule of.

This near enemy is legible in the West and has colonized most pedagogical reform movements of the last forty years. It is the cached thought problem at the system level: the student is handed a compression that was produced by someone else’s grokking and mistakes the label for the thing. They know “energy conservation” the way they know “Wakalixes” — structurally identical, equally empty.

The near enemies are structurally symmetrical: both avoid the valley between overfitting and generalization. The first refuses to leave the overfit; the second refuses to enter it. The genuine article requires both terrors at once — the tedium of massive rote and the confusion of watching the rote fail to cohere until it suddenly does.

The Valley and the Phase Transition

The training curve for grokking has a specific shape that is visible in human learning too. Training loss drops fast (memorization is easy). Validation loss stays flat, then rises slightly (the overfit is stable). Then, after a long flat stretch, validation loss collapses suddenly. The period of apparent stagnation is not wasted time — it is the period during which compression is happening beneath the surface. Nothing visible changes until everything changes.
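The shape of that curve can be measured rather than eyeballed. The sketch below uses accuracy instead of loss for simplicity, and the curves are synthetic, hand-written numbers rather than results from any experiment. The helper names and the 0.99 threshold are assumptions made for illustration.

```python
def first_step_at_or_above(values, threshold):
    """Return the first index where values[i] >= threshold, else None."""
    for step, value in enumerate(values):
        if value >= threshold:
            return step
    return None


def plateau_length(train_acc, val_acc, threshold=0.99):
    """Steps between fitting the training set and generalizing.

    Returns None if either curve never reaches the threshold.
    """
    memorized_at = first_step_at_or_above(train_acc, threshold)
    generalized_at = first_step_at_or_above(val_acc, threshold)
    if memorized_at is None or generalized_at is None:
        return None
    return generalized_at - memorized_at


# Synthetic curves (not measured data): training accuracy saturates at
# step 3, validation accuracy jumps only at step 9.
train_acc = [0.20, 0.60, 0.90, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
val_acc = [0.01, 0.01, 0.02, 0.02, 0.02, 0.03, 0.03, 0.05, 0.40, 0.99, 1.00]

print(plateau_length(train_acc, val_acc))  # 6
```

A run that is stopped inside the plateau returns `None` here, which is the numerical version of mistaking the plateau for the destination.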

The Expert Beginner is what happens when the learner confuses the plateau for the destination. They hit an average of 160 in bowling, declare 160 the summit, and never discover the change-of-grip that would cost them a hundred points before it gave them back two hundred. Grokking requires tolerating the getting-worse-before-getting-better phase, and the Expert Beginner’s entire identity is organized around never getting worse.

The will to think is the character virtue that keeps the training run going past the plateau — the compulsive refusal to accept an answer you do not truly understand. Most learners stop when they have an answer that sounds right. Most careers stop when they have a skill level that pays. The will to think is the cognitive analog of leaving weight decay turned on after the loss has apparently flatlined — the willingness to keep paying the tax on storage long past the point where the storage itself seems sufficient.

Dimwit / Midwit / Highwit / Worse-Is-Better

The dimwit says: just do the drills. Flashcards, problem sets, rote. Enough reps and you’ll get it.

The midwit says: rote is a dead pedagogy. It kills creativity and produces robotic thinkers. We must teach the conceptual structure first, so students understand the why before they touch the what.

The highwit says: memorization is the brute-force terraforming required to build a latent space. You cannot grok without first overfitting. The drills are not the point — they are the raw material the compression algorithm needs to have something to compress. The conceptual framework is what emerges from the drills, not what is imposed on them.

Worse-is-better reality: just force them to do ten thousand equations. It is ugly, brutal, and intellectually unromantic. It also reliably triggers the phase transition. The dimwit is accidentally right, for the wrong reasons, against the sophisticated midwit who is elaborately wrong. This is the Bitter Lesson in the cognitive register: general methods that scale with compute beat clever hand-coded approaches in the long run, and the “clever hand-coded approach” is whatever curriculum thinks it can skip the drilling by being cleverer about it.

The Straussian Read

The surface text: grokking is a neat topological phenomenon in loss landscapes of neural networks, worth studying for what it reveals about how generalization emerges.

The subtext: the dominant educational consensus of the modern West — that rote is bad, that conceptual understanding must come first, that drilling produces parrots — is structurally backwards and actively prevents true mastery. And the institutions that know this implicitly continue to operate on it while telling everyone else not to.

Watch where high-capacity humans actually get made. Medical residencies run on thousands of repetitions under escalating pressure. Elite math olympiad camps drill problem sets for eight hours a day. Classical language immersion forces vocabulary into the substrate by sheer volume. Elite military training is memorization-under-stress until the procedures become reflexes. Music conservatories require thousands of hours of scales before anyone is allowed near an interpretation. In every case, the student is put through the overfitting phase at brutal intensity, held there until weight decay kicks in, and the phase transition is then trusted to arrive.

What gets sold publicly — project-based learning, conceptual-first curricula, gentle exploration — is exactly the near enemy that prevents the transition. Grade inflation is the transcript layer of the same failure: the student is certified for a phase transition that never happened. The exportable pedagogy is designed to produce students who can narrate understanding without possessing it. The pedagogy the elite uses on its own is designed to produce students who possess understanding without needing to narrate it. The mismatch is not an accident. It is weaponized taste applied to the curriculum: the method that makes the product is kept private; the method that signals cultural refinement is distributed widely.

How to Induce It

If the goal is to trigger grokking in a human — yourself or someone you are teaching — the protocol follows directly from the mechanism:

Pick a domain with a compressible structure. Grokking requires that a latent rule actually exists. Material with no underlying pattern to find (isolated vocabulary of an unrelated language, historical dates) cannot be grokked no matter how much you drill. Mathematics, syntax of natural languages, chess tactics, musical harmony, physical movement patterns: these have latent structure waiting to be compressed.
Saturate with examples. Volume is non-negotiable. The learner must encounter enough instances that no finite storage strategy fits. The drilling must feel excessive.
Apply continuous cognitive pressure. Time limits. Performance demands. The requirement to produce output faster than conscious deliberation allows. This is the human analog of weight decay — the pressure that makes storage expensive and compression cheap.
Continue past the plateau. The felt sense during the overfit phase is exhaustion, frustration, and the conviction that further practice is wasted. This is precisely when stopping destroys the run. The phase transition arrives on the other side of the plateau, not before it.
Do not narrate the rule prematurely. Handing the learner the compression before they have derived it corrupts the training. They will pattern-match their output to the narrated rule rather than build the rule from the data, and the result is a student who can recite the rule without being able to apply it. This is the conceptual-first near enemy in its purest form.
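The "continue past the plateau" step has a direct analog in how training runs are usually stopped. The sketch below implements a generic patience-based early-stopping rule and applies it to a synthetic validation-loss curve. The curve, the patience values, and the function name are illustrative assumptions, not data from any real run.

```python
def early_stop_step(val_losses, patience):
    """Return the step at which patience-based early stopping would halt.

    Halts once `patience` consecutive steps pass without a new best
    validation loss. Returns None if the rule never triggers.
    """
    best = float("inf")
    steps_without_improvement = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                return step
    return None


# Synthetic validation loss (not measured data): a flat plateau of 50
# steps, then a sudden collapse starting at step 50.
val_losses = [2.0] * 50 + [1.0, 0.3, 0.05, 0.01]
transition_step = 50

impatient = early_stop_step(val_losses, patience=10)
stubborn = early_stop_step(val_losses, patience=100)

print(impatient)  # 10: halts long before the transition at step 50
print(stubborn)   # None: never halts, so it is still running at step 50
```

The impatient rule is the reasonable one by every visible signal, and it quits forty steps early. Nothing in the plateau itself distinguishes a run that is about to collapse from one that never will, which is why the first step of the protocol, picking a domain where a rule actually exists, carries so much weight.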

Main Payoff

The deepest reframe grokking offers is that understanding is not a higher activity than memorization — it is memorization’s collapse under pressure. The two are the same substrate at different levels of compression. You cannot bypass the lower level to arrive at the higher one. You cannot stay at the lower level and pretend you have arrived.

This dissolves the false binary that has structured educational debate for a century. Drill-based pedagogy is not the opposite of conceptual pedagogy; conceptual pedagogy is not a more enlightened replacement. Drilling is the mechanism by which conceptual understanding arrives — and skipping it produces not conceptual understanding but its hollow narration. The person who “just memorizes” and the person who “understands the big picture without the drudgery” have both failed in structurally symmetric ways. Neither has grokked.

Cognitive tools before curriculum is the complementary constraint: the raw material must be vivid enough to be worth compressing. Story, image, heroism, rivalry, and anomaly are not decorations on the drill; they are what keep the training run alive long enough for compression to happen.

The payoff for the individual learner is brutal and liberating: the thing you are avoiding — the tedium, the confusion, the feeling that practice is not paying off, the long flat stretch where nothing seems to be happening — is the thing. It is not the obstacle to understanding. It is the interior of understanding’s arrival. The only people who grok are the ones who kept training past the point where a rational observer would have stopped.

References:

Power, Burda, et al., Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (2022)
Nanda, Chan, et al., Progress measures for grokking via mechanistic interpretability (2023)
