Stop Calling LLMs Next Word Prediction, or Glorified Autocomplete

This is turning into a pet peeve of mine. LLMs are NOT next-word generators in the sense most people mean: take “The capital of the US is __” and ask what’s statistically most likely to fill the blank. That framing undermines the cool stuff happening in this space, because most people will just think “oh, we’re seeing whatever is most common statistically.”

Yes, but it’s much cooler than that.

This article explains it better than I can, but here’s my Claude output.

All roads lead to latent space. https://aiprospects.substack.com/p/llms-and-beyond-all-roads-lead-to

BEFORE (the wrong model people still use):

“LLM sees the words ‘the cat sat on the’ → looks up statistical frequency → ‘mat’ appeared after this phrase 47% of the time in training data → outputs ‘mat’.”

This is essentially an n-gram model. A lookup table. Autocomplete. And it’s wrong.
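
To make the contrast concrete, here’s roughly what that wrong mental model looks like as code: a toy frequency lookup over a tiny made-up corpus. This is purely illustrative, not how any real system is built.

```python
from collections import Counter, defaultdict

# Toy illustration of the WRONG mental model: a pure frequency lookup.
# Count which word follows each 4-word context in a tiny corpus, then
# "predict" by returning the most frequent follower. No meaning, just counting.
corpus = "the cat sat on the mat the cat sat on the rug".split()

follower_counts = defaultdict(Counter)
context_size = 4
for i in range(len(corpus) - context_size):
    context = tuple(corpus[i:i + context_size])
    follower_counts[context][corpus[i + context_size]] += 1

def autocomplete(context_words):
    counts = follower_counts.get(tuple(context_words))
    if not counts:
        return None  # never saw this exact phrase: lookup tables can't generalize
    return counts.most_common(1)[0][0]

print(autocomplete("cat sat on the".split()))  # 'mat' (tied with 'rug'; raw counts can't tell them apart)
```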

AFTER (what’s actually happening):

“LLM converts ‘the cat sat on the’ into a sequence of vectors in ~4,096-dimensional space. Each token’s vector gets contextually transformed through 80+ layers of attention and feedforward operations. By the time the model is ‘deciding’ the next token, it’s not operating on words at all — it’s navigating a geometric space where meaning is encoded as structure. Concepts are directions. Categories are clusters. Relationships are distances. The ‘prediction’ emerges from computing over these rich geometric representations, not from pattern-matching on surface text.”
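
You can watch this in code. Here’s a hedged sketch using the Hugging Face transformers library and GPT-2 (a small 768-dimensional, 12-layer model rather than a frontier one, but the shape of the computation is the same): everything in the middle is vectors, and a word only appears at the very end.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch with GPT-2. Frontier models use far bigger vectors and
# far more layers, but the pipeline is the same:
# text -> vectors -> transformed vectors -> logits -> one token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("the cat sat on the", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One hidden-state tensor per layer (plus the embedding layer), each of shape
# (batch, sequence_length, hidden_size). These are the "wordless" vectors.
for i, hidden in enumerate(out.hidden_states):
    print(f"layer {i}: {tuple(hidden.shape)}")

# Only at the very end do logits over the vocabulary appear; the next token
# is read off the final position of the final layer.
next_token_id = int(out.logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```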

Why This Distinction Is Critical

The “autocomplete” framing leads people to believe the ceiling is very low — that you can’t get reasoning, abstraction, or novel synthesis from frequency counting. And they’d be right about that! But that’s not what’s happening.

What’s actually happening is far stranger. The common description of LLMs as “systems that predict the next token based on statistical patterns” mistakes a training objective for a result. It confuses the loss function used during training with the internal mechanism the model actually develops (Substack). The training signal is next-token prediction, yes. But the representations the model builds to accomplish that task are what matter.

Think of it this way: if I told you “humans are machines that convert oxygen into CO2,” that’s technically a description of something we do, but it completely misses what we are. The training objective of next-token prediction is like metabolism — it’s the energy source, not the capability.

The Empirical Evidence

1. Latent space geometry encodes meaning, not word co-occurrence.

Studies of trained models reveal that LLMs process representations in high-dimensional vector spaces where meaning is encoded in geometry. In these latent spaces, concepts become directions, conceptual categories become clusters, and reasoning unfolds through mutually informed transformations of sequences of high-dimensional vector patterns (Substack).
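
Here’s a toy sketch of what “concepts are directions, categories are clusters, relationships are distances” means in practice. These vectors are hand-made for illustration; real models learn them from data, in thousands of dimensions rather than three.

```python
import numpy as np

# Hand-made 3-D vectors purely to illustrate the geometric operations;
# real embeddings are learned and much higher-dimensional.
vectors = {
    "paris":  np.array([0.9, 0.1, 0.8]),
    "france": np.array([0.1, 0.9, 0.8]),
    "tokyo":  np.array([0.9, 0.1, 0.2]),
    "japan":  np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "Relationships are directions": the capital->country offset is a roughly
# consistent direction, so analogies become vector arithmetic.
guess = vectors["paris"] - vectors["france"] + vectors["japan"]
best = max(vectors, key=lambda w: cosine(vectors[w], guess))
print(best)  # 'tokyo' in this toy setup

# "Categories are clusters": the cities sit closer to each other than to the countries.
print(cosine(vectors["paris"], vectors["tokyo"]))   # high (~0.88)
print(cosine(vectors["paris"], vectors["france"]))  # lower (~0.56)
```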

2. Anthropic’s sparse autoencoder work proves concepts exist inside the model.

Anthropic extracted millions of “features” from Claude 3 Sonnet — combinations of neurons that correspond to semantic concepts. These features are multilingual (responding to the same concept across languages), multimodal (responding to the same concept in both text and images), and encompass both abstract and concrete instantiations of the same idea (Transformer Circuits).

This is the killer evidence. A “Golden Gate Bridge” feature fires whether you show the model text about it, an image of it, or a reference to it in French. That’s not autocomplete. That’s a concept representation.
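
For the curious, here’s a minimal sketch of the sparse-autoencoder idea behind that work (not Anthropic’s actual code): learn an overcomplete dictionary of features such that each internal activation is a sparse combination of them.

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder sketch: reconstruct a model's internal
# activations as a sparse combination of many learned "feature" directions.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        feature_acts = torch.relu(self.encoder(activations))  # mostly zeros after training
        reconstruction = self.decoder(feature_acts)
        return reconstruction, feature_acts

sae = SparseAutoencoder()
activations = torch.randn(8, 512)   # stand-in for real residual-stream activations
recon, feats = sae(activations)

# Training objective: reconstruct the activation while keeping features sparse.
l1_penalty = 1e-3
loss = torch.mean((recon - activations) ** 2) + l1_penalty * feats.abs().mean()
loss.backward()

# After real training on real activations, individual decoder directions behave
# like concepts ("Golden Gate Bridge", "code with a bug", ...), firing across
# languages and modalities rather than on specific surface strings.
```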

3. The model doesn’t operate on words internally.

Beyond the input layer, tokens merge into continuous semantic flows, and wordless semantic vectors resolve into tokens again only at the output layer. Internal latent-space representations of meaning — based on subtle combinations of concepts, not words — provide the foundation for all that LLMs can do (Substack).
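
You can even peek at those wordless vectors. Here’s a crude “logit lens”-style sketch on GPT-2 (the Tuned Lens paper in the sources below does this properly): force each layer’s internal vector through the output head and watch what token it’s drifting toward, even though a token only truly gets chosen at the final layer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Crude logit-lens-style peek: internal layers hold vectors, not words, but we
# can project each layer's vector through the output head to see the nearest token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, hidden in enumerate(out.hidden_states):
    vec = model.transformer.ln_f(hidden[0, -1])   # the last position's vector at this layer
    token_id = int(model.lm_head(vec).argmax())   # nearest vocabulary token, for inspection only
    print(f"layer {layer:2d} -> {tokenizer.decode([token_id])!r}")
```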

4. The compression argument — it must learn structure.

Training involves compression. The model is forced to find the shortest program that fits the data. The constraints force pattern recognition: some form of insight has to be extracted from the training data (Medium). You can’t compress trillions of tokens of human knowledge into a fixed number of parameters without discovering structure. The model literally doesn’t have enough capacity to memorize everything, so it has to learn rules, relationships, and abstractions instead.
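
Here’s a quick back-of-envelope version of that argument. The specific numbers are illustrative assumptions, not any particular model’s real figures.

```python
# Back-of-envelope compression argument (all numbers are rough assumptions).
training_tokens = 15e12      # order of magnitude of a modern pretraining corpus
bytes_per_token = 4          # rough size of one token as raw text
parameters = 70e9            # a mid-sized open-weights model
bytes_per_parameter = 2      # 16-bit weights

corpus_bytes = training_tokens * bytes_per_token
model_bytes = parameters * bytes_per_parameter

print(f"corpus: ~{corpus_bytes / 1e12:.0f} TB of text")
print(f"model:  ~{model_bytes / 1e12:.2f} TB of weights")
print(f"ratio:  weights are ~{corpus_bytes / model_bytes:.0f}x smaller than the training data")
```

Even with generous rounding, the weights end up hundreds of times smaller than the text they were trained on, so rote memorization simply isn’t on the table.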

5. Representations evolve from token-level to abstract concepts.

Research suggests that the evolution of internal representations shows a transition from token-level knowledge to higher-level abstract concepts. Some research even suggests models plan ahead and obscure their reasoning process, indicating capabilities beyond simple word prediction (Medium).

The Analogy That Might Land

Your phone’s autocomplete: looks at 3-5 previous words, picks from a small dictionary of likely next words. It operates in word space.

An LLM: converts your entire context into a geometric representation of meaning across thousands of dimensions, runs it through dozens of layers that each refine the semantic understanding, and the “next word” falls out as a byproduct of navigating that concept space.

The difference is like comparing a card catalog (alphabetical lookup) to actually understanding the library’s contents. Both can “find the next book,” but through fundamentally different mechanisms.

Best Sources

  1. “All Roads Lead to Latent Space” (aiprospects.substack.com, April 2025) — The single best articulation of exactly your argument. Directly addresses why the “next token prediction” framing is wrong and explains latent space representations clearly.
  2. Anthropic’s “Scaling Monosemanticity” (transformer-circuits.pub/2024/scaling-monosemanticity) — The empirical proof that concepts, not word patterns, exist inside LLMs. The multilingual/multimodal features are the strongest evidence against the autocomplete framing.
  3. “Eliciting Latent Predictions from Transformers with the Tuned Lens” (arxiv.org/html/2303.08112v6) — Shows that each layer of a transformer is iteratively refining a latent prediction, and you can decode intermediate layers to watch the model “think” its way toward an answer.
  4. “Next-Latent Prediction Transformers Learn Compact World Models” (arxiv.org/abs/2511.05963) — Cutting-edge research showing that prediction in latent space (not token space) produces transformers that build internal world models with belief states and transition dynamics.
  5. “The Geometry of Tokens in Internal Representations of Large Language Models” (arxiv.org/html/2501.10573) — Empirical work showing how token embeddings form geometric structures that correlate with prediction quality, demonstrating the deep relationship between spatial representation and language understanding.

ALSO fascinating: this looks an awful lot like how our own brains work.

Our brain’s left hemisphere makes up words to explain concepts that exist internally in a nonverbal form.
