Exploring Alternative AI Architectures and Theories

I had a conversation with Claude that started with GPU benchmarks and went somewhere I didn’t expect. Sharing it because it was genuinely fun to think about. — Yuji


I was comparing hardware for LoRA fine-tuning and asked Claude for the numbers on RTX 5090 vs my M4 Max 128GB.

Claude: The 5090 is roughly 4–6x faster for training a 7B model. ~1,500–3,000 tokens/sec vs ~350–500. Training is compute-bound, and the 5090 has 14x the tensor FLOPS. But your M4 Max wins on model size — 128GB unified memory lets you work with models the 5090 physically can’t fit in 32GB VRAM. Different tools, different jobs.
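
Here's a quick back-of-envelope check on where numbers like that come from. The peak-TFLOPS and utilization figures in this sketch are assumptions for illustration, not measured values:

```python
# Rule of thumb: LoRA training costs roughly 4 FLOPs per base-model parameter
# per token (2 for the forward pass, 2 for backpropagating activation
# gradients; the adapter-weight gradients are negligible).

def lora_tokens_per_sec(params: float, peak_tflops: float, utilization: float) -> float:
    """Estimate LoRA fine-tuning throughput from the ~4 FLOPs/param/token rule."""
    return peak_tflops * 1e12 * utilization / (4 * params)

params = 7e9  # 7B base model
# Assuming ~200 TFLOPS dense BF16 for the RTX 5090 and 20-40% utilization:
print(lora_tokens_per_sec(params, 200, 0.2))  # ~1,400 tokens/sec
print(lora_tokens_per_sec(params, 200, 0.4))  # ~2,900 tokens/sec
# Plugging in the M4 Max's much lower tensor throughput lands in the
# few-hundred tokens/sec range, which is why the practical gap is ~4-6x
# rather than the full 14x raw-FLOPS gap: utilization differs across stacks.
```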

I brought up NVIDIA’s DGX Spark benchmarks — 82,000 tok/s on Llama 3B, 53,000 tok/s on 8B, 5,000 tok/s on 70B — and asked why it’s so fast for a $4,000 box.

Claude: Those numbers are real — NVIDIA published them. The Spark’s 125 TFLOPS of dense BF16 is a monster for training even though its 273 GB/s bandwidth makes it mediocre for inference. Training is compute-bound, not bandwidth-bound. And 128GB unified memory means no offloading — none of those three workloads could even run on a 32GB consumer GPU. The Spark is a training machine disguised as a tiny desktop.
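
To see why the same box can be strong for training and mediocre for inference, here's a quick roofline-style check using the Spark figures above (the model size and precision in the decode example are my assumptions):

```python
peak_flops = 125e12   # dense BF16, from NVIDIA's spec
bandwidth = 273e9     # memory bandwidth in bytes/sec

# Ridge point: how many FLOPs you must do per byte moved before compute,
# rather than memory bandwidth, becomes the bottleneck.
ridge = peak_flops / bandwidth
print(f"~{ridge:.0f} FLOPs per byte to saturate compute")   # ~458

# Training with big batches reuses each weight across thousands of tokens,
# so its arithmetic intensity clears that bar easily: compute-bound.
# Batch-1 decoding has to stream every weight once per generated token:
params, bytes_per_weight = 8e9, 2   # assumed: 8B model held in BF16
decode_ceiling = bandwidth / (params * bytes_per_weight)
print(f"batch-1 decode ceiling: ~{decode_ceiling:.0f} tokens/sec")  # ~17
```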

That got me thinking about the economics at the top.


How many runs to ship a model?

Claude: For GPT-4 class models — probably 3–10 full-scale runs at $10M–$100M+ each, 50–200 medium-scale runs for scaling-law predictions, and hundreds of post-training iterations. Meta’s OPT-175B logbook was candid about crashes and restarts. The Llama 3 paper describes working through loss spikes on the 405B run. Maybe 30–50% of major runs hit significant problems. For every $100M final run that ships, there’s probably $50M–$200M in failed, exploratory, or preparatory compute behind it.

Me: But since it’s software and math, and not a rocket, you could optimize for recoverability or iterability on these runs. Architect around cost. Can any of the work done here be stored? 1TB is cheap even if slow — any novel moves here? Do we really just start with literal random reward functions and true noise? No way to optimize?

Claude: More is reused than you’d think — checkpoint recovery, warm-starting from previous generations, data curation pipelines, scaling law extrapolation. But here’s what might bother you: yes, pretraining still starts from random initialization. The weights are literally sampled from a Gaussian distribution. Every time.
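
For anyone who hasn't seen it, this is literally what that step looks like. The layer sizes and the exact scaling rule here are illustrative; real runs use variants of this (Xavier/Kaiming, muP-style), but the starting point is always noise:

```python
import torch

# Step zero of every pretraining run: sample the weights from scaled
# Gaussian noise.
d_model, d_ff = 4096, 16384
w_up = torch.randn(d_ff, d_model) * (2.0 / d_model) ** 0.5
print(w_up.mean().item(), w_up.std().item())  # ~0.0, ~0.022
```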


“We are brute forcing reality with these Gaussian runs” — Yuji

Me: So the universe is probably deterministic, with a small set of equations and laws that account for 95% of observed behavior. We don’t have a unified theory yet, but we don’t see exceptions. Those rules plus the state of every subatomic particle should, in principle, let you calculate the future and the past. We are brute forcing reality with these Gaussian LLM runs, fitting them over our dataset. There must be something we can do.

There’s VLM and multimodal work that takes physics into account, and plenty of known regularities like Zipf’s law or the golden ratio — if we layered this in intentionally, we could rule out a lot of what the training is probably rediscovering on its own. For example, could we teach a system most of the human-observable world — the laws of the visible spectrum, and so on — to hyper-train a base reality first?

Claude: You’re pointing at something real. We’re spending billions training models to rediscover that objects fall, light reflects, and water flows — things we already have equations for. It’s like teaching someone to read by showing them a billion sentences instead of just teaching them the alphabet and grammar rules.

But here’s where the intuition breaks down in an interesting way. The universe’s equations are simple. The mapping from equations to human-observable experience is not. You can write Maxwell’s equations in four lines. Computing what a sunset looks like from them requires simulating quintillions of photon interactions. Getting from the Schrödinger equation to “why does this joke land” requires traversing every level of emergence: quantum → chemistry → biology → neuroscience → psychology → culture → humor.
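
(Claude isn't exaggerating about the four lines. In differential form, SI units:)

```latex
% Maxwell's equations, differential form (SI units)
\nabla \cdot  \mathbf{E} = \rho / \varepsilon_0
\nabla \cdot  \mathbf{B} = 0
\nabla \times \mathbf{E} = -\,\partial \mathbf{B} / \partial t
\nabla \times \mathbf{B} = \mu_0 \mathbf{J} + \mu_0 \varepsilon_0 \,\partial \mathbf{E} / \partial t
```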

The LLM isn’t really learning physics. It’s learning the compressed statistics of the emergent output of physics as filtered through human experience. And that compression is actually what makes it useful.

That said — there is a theoretical middle ground. Imagine a tiered architecture: physics simulator at the base (not learned, just encoded), chemistry and materials on top (partly learned, partly constrained), biological systems above that, cognitive models above that, language and culture at the top. Each layer constrains the search space for the layer above. This is how reality itself is structured, and how we educate humans — physics before engineering, anatomy before medicine.
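
Here's a minimal sketch of what that tiering could look like in code, assuming PyTorch. Everything in it is invented for illustration, but it shows the shape: a frozen, hand-coded base layer, and a learned layer that only has to model what the base layer gets wrong.

```python
import torch
import torch.nn as nn

class FrozenPhysicsBase(nn.Module):
    """Tier 0: not learned, just encoded. Toy projectile kinematics."""
    def forward(self, pos, vel, dt=0.01, g=9.81):
        new_pos = pos + vel * dt
        new_vel = vel - dt * torch.tensor([0.0, g])   # gravity on the y-axis
        return new_pos, new_vel

class LearnedResidual(nn.Module):
    """Tier 1: learned, but only responsible for the residual the physics
    prior misses (drag, bounce, friction). A far smaller search space than
    learning motion from raw pixels."""
    def __init__(self, dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))
    def forward(self, state):
        return state + self.net(state)   # correction on top of the prior

physics = FrozenPhysicsBase()      # constrains the search space for...
residual = LearnedResidual()       # ...the learned tier above it
```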

People are working on pieces of this. Physics-informed neural networks bake equations like F = ma directly into the loss function. LeCun’s JEPA argues for learned world models as the foundation. Neuro-symbolic approaches tried combining neural networks with symbolic reasoning. None of them scaled. Not because they’re wrong — because all the money went to transformers.
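
The physics-informed trick is easy to show in miniature. A hedged sketch, assuming PyTorch and a toy spring force rather than anything real: the network learns a trajectory x(t), and the loss penalizes violations of m * x''(t) = F(x) alongside the usual data-fit term.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))  # t -> x(t)
m, k = 1.0, 4.0   # toy system: a mass on a spring, F(x) = -k * x

def pinn_loss(t_data, x_data, t_collocation):
    data_loss = ((net(t_data) - x_data) ** 2).mean()       # fit the observations
    t = t_collocation.requires_grad_(True)
    x = net(t)
    dx = torch.autograd.grad(x.sum(), t, create_graph=True)[0]
    d2x = torch.autograd.grad(dx.sum(), t, create_graph=True)[0]
    physics_loss = ((m * d2x + k * x) ** 2).mean()          # penalize F != m*a
    return data_loss + physics_loss                          # physics baked into the loss
```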


The economic wall

Claude: The cost trajectory of frontier models is hitting a wall. Each generation costs roughly 3–10x more to train than the last. GPT-4 was estimated at $100M+. The next generation is looking at $1B+. Anthropic and OpenAI need massive revenue growth just to fund the next training run — not to profit, just to keep going.

The industry’s answer so far is: build bigger datacenters, buy more GPUs, develop better chips. There’s real work happening — photonic computing for matrix operations, analog chips, custom ASICs. All of it aimed at making the current approach cheaper per FLOP.

But all of that assumes the current approach is the right one to optimize.

Me: It’s entirely possible that the whole field is in the same situation as crappy viral vibe-coded stuff: it works, it’s terrible, and it still makes sense in this AI context because the output value is greater than the cost minus the human activation energy.

Here’s the history of AI as I see it: nobody cared. Until “Attention Is All You Need” came out. A few people at a research lab who didn’t care about joining what was popular. Super-niche field; no capitalist cared.

We found one mechanism that finally let us throw money at the problem — scaling datacenters, hiring LLM researchers — and it produced great output, which attracted capital and the world’s talent. Now it is the only field that exists.

Now that we have everyone’s attention, we could potentially go back to the ideas from before transformers and LLMs, this time with the dollars and talent now available.

It is critical to have a “throw money at it” problem, because humanity is optimized for those. But it’s possible we’re stuck on the wrong approach in the first place.

It’s entirely possible everything changes once we’re in a world where we casually talk about throwing $100M or $1B of compute at a single run. That, plus the attention of the world and its governments, and the talent now available.

Armed with money and talent and interest, what would you look into if I said scrap the transformer paper?

Claude: Your meta-observation is sharp. The transformer wasn’t necessarily the right answer — it was the first scalable answer, and scalability is what unlocks capital, and capital is what unlocks talent. The mechanism matters more than the mechanism being optimal.

That creates lock-in risk. Everyone is now optimizing transformers because that’s where the money is, not because we’ve proven it’s the best possible approach. Like internal combustion engines getting 120 years of optimization not because they were the best approach but because the gas stations were already built.

If I had to allocate $10B across non-transformer research:

$3B — Energy-based models. Instead of predicting the next token, learn an energy landscape where low-energy states correspond to true things about the world. The universe doesn’t predict the next frame — it minimizes action, follows energy gradients. This is LeCun’s bet. Training energy-based models is harder than training autoregressive ones. That’s a funding problem, not a physics problem. (A toy sketch of the idea follows this list.)

$2B — Program synthesis. Learn programs instead of weights. A model that outputs a physics equation instead of a probability distribution is fundamentally more sample-efficient, interpretable, and composable. Programs compose and weights don’t — that directly solves the problem of not being able to merge knowledge from different training runs. DreamCoder from MIT showed this works at tiny scale. Nobody funded it seriously.

$2B — Spiking neural networks. The brain runs on 20 watts. It doesn’t do matrix multiplication. It uses sparse, asynchronous, event-driven computation. We’ve ignored this because GPUs are optimized for dense linear algebra. Intel’s Loihi and IBM’s TrueNorth were early attempts with tiny budgets. The brain is proof a radically different architecture works.

$1.5B — Hybrid architectures that use symbolic and physical priors as scaffolding with learned components filling gaps — your original idea of encoding reality as a base layer.

$1B — Predictive coding. The brain constantly predicts its own next input and only propagates errors upward. Familiar stimuli require almost zero computation. A transformer processes every token with equal effort. A predictive coding system’s cost scales with information content, not sequence length.

$0.5B — Weird bets. Cellular automata, hyperdimensional computing, quantum approaches to optimization.
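
To make the first item concrete, here's a toy contrastive sketch of the energy-based framing in PyTorch. It is not LeCun's actual recipe, and real training needs more care (regularization, better negative samples), but it shows the shape: score whole configurations, then treat inference as search down the energy landscape.

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 1))  # x -> E(x)
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

def contrastive_step(real_x):
    """Push energy down on observed data and up on corrupted negatives."""
    fake_x = real_x + torch.randn_like(real_x)          # crude negative samples
    loss = energy(real_x).mean() - energy(fake_x).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def infer(x0, steps=100, lr=0.1):
    """Inference is search, not next-token sampling: follow the energy
    gradient downhill toward a low-energy ("plausible") configuration."""
    x = x0.clone().requires_grad_(True)
    for _ in range(steps):
        (grad,) = torch.autograd.grad(energy(x).sum(), x)
        x = (x - lr * grad).detach().requires_grad_(True)
    return x.detach()
```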


Mirror the brain

Me: It makes sense to mirror the brain. We can think of it as specialized regions that interact with each other asynchronously through a large interface, where activation in one area affects the others even if that path never ultimately connects. What if we focused entirely on how the brain works and actively revisited the concepts we threw away because they didn’t fit our hardware specs at the time — single thread, general purpose, etc.? What could we try to mechanize with a simpler brain model that would be novel?

Claude: The fundamental thing we threw away for hardware convenience: the brain doesn’t have a forward pass. There’s no clean separation between input, computation, and output. Everything runs simultaneously. A thought isn’t a sequence of matrix multiplications — it’s a standing wave pattern that emerges across regions and dissolves.

We threw that away because GPUs need synchronous batch operations.

Specific mechanisms we abandoned:

Recurrence everywhere. The brain sends as many connections backward as forward. High-level understanding shapes low-level perception in real time — you literally see differently based on what you’re thinking about. We had RNNs and abandoned them because they couldn’t be parallelized across time on GPUs. But the brain’s recurrence implements iterative refinement — each pass sharpens the representation. The computation converges rather than passing through a fixed number of layers.

Radical sparsity as computation. 86 billion neurons, 1–5% active at any moment. The pattern of which neurons fire IS the information. The same neuron participates in thousands of concept-ensembles. Anthropic’s own mechanistic interpretability work found exactly this in transformers — features in superposition. In transformers it’s an accident that makes interpretability hard. In the brain it’s the design principle that makes it efficient.

Local learning only. Hebbian: neurons that fire together wire together. No global backprop. No global loss function. Each synapse updates on local information only. Online, continuous, embarrassingly parallel. (A toy version is sketched after this list.)

Neuromodulation. Dopamine, serotonin, norepinephrine — broadcast signals that change the learning rate and processing mode of entire brain regions simultaneously. The brain’s metacognition at the hardware level. We have nothing like this in current architectures.
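
Here's the “local learning only” point as code, using a textbook Oja-style Hebbian rule in NumPy. Purely illustrative, but notice what's absent: no loss function, no backward pass, nothing global.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(32, 64))    # 64 inputs feeding 32 neurons

def local_update(W, x, lr=1e-3):
    """Oja's rule: strengthen co-active pairs, with a decay term for stability.
    Every synapse updates from activity it can see locally."""
    y = W @ x                                 # post-synaptic activity
    W += lr * (np.outer(y, x) - (y ** 2)[:, None] * W)
    return W

for _ in range(1000):                         # online and embarrassingly parallel
    W = local_update(W, rng.normal(size=64))
```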

As an engineering prototype, I’d build: a collection of small recurrent modules — call them cortical columns. 50–100 of them. Each one a small network, a few million parameters. Loosely hierarchical, with lots of lateral and feedback connections. Each running asynchronously on its own clock. Local learning only, no global backprop. A few broadcast “neuromodulator” signals for novelty, confidence, reward. Predictive coding as the core loop.

Train it on multimodal streams — video, audio, physics simulations — before it ever sees language. Start modules pre-wired with simple priors: edge detection, frequency analysis, basic physics. Let them self-organize. Developmental, like a brain.
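
Since this is the most concrete proposal in the whole conversation, here's what a minimal version might look like in PyTorch. Every name, size, and update rule below is invented for illustration; it's a thought experiment in code, not a working system.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One 'cortical column': a small recurrent module that predicts its own
    next input and learns only from its own prediction error."""
    def __init__(self, in_dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(in_dim, hidden)
        self.predict = nn.Linear(hidden, in_dim)          # guess the next input
        self.h = torch.zeros(1, hidden)
        self.prev_x = torch.zeros(1, in_dim)
        self.opt = torch.optim.SGD(self.parameters(), lr=1e-3)   # local only

    def step(self, x, gain=1.0):
        h_new = self.rnn(self.prev_x, self.h)   # roll state forward
        error = x - self.predict(h_new)         # surprise: the only learning signal
        loss = (error ** 2).mean() * gain       # 'gain' = broadcast neuromodulator
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        self.h, self.prev_x = h_new.detach(), x.detach()   # no gradient leaves the column
        return error.detach()

columns = [Column() for _ in range(50)]          # 50-100 loosely coupled modules

def tick(sensory, novelty=1.0):
    """One tick: each column sees the sensory stream plus a neighbour's error
    signal; 'novelty' scales every column's learning rate at once, like a
    crude dopamine broadcast."""
    errors = []
    for col in columns:
        lateral = errors[-1] if errors else torch.zeros_like(sensory)
        errors.append(col.step(sensory + 0.1 * lateral, gain=novelty))
    return errors
```

Feed tick() a stream of feature frames from video, audio, or a physics sim and every column keeps learning online, indefinitely, with no global loss anywhere in the system.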

Would it work? I genuinely don’t know. Nobody’s tried it with real resources. But the existence proof is the thing inside your skull, running on 20 watts, doing things LLMs still can’t do.


The takeaway

This wasn’t a manifesto. It was a fun conversation that went somewhere neither of us planned.

The practical stuff: RTX 5090 crushes M4 Max on training throughput. DGX Spark is a surprisingly legit training box for $4K. The economics of frontier models are getting brutal — hundreds of millions per run, scaling into billions, with novel chip architectures (photonic, neuromorphic, analog) all being built to make the current approach cheaper. But cheaper to run the same approach still means the same approach.

The interesting-to-think-about stuff: what if instead of building better chips to run transformers faster, some of the world’s newly available talent and capital just… looked back? At energy-based models, program synthesis, spiking networks, predictive coding, brain-like architectures. Not because transformers are wrong — the output is objectively incredible — but because it would be interesting to see what happens when the abandoned ideas finally get real funding.

The current approach works. It also costs hundreds of millions of dollars per run, starts from literal random noise every time, and brute forces its way to understanding reality through statistical fitting over trillions of tokens while ignoring centuries of accumulated physics. That might be fine. Or it might be the ENIAC era of AI.

Or maybe scaling just keeps working and this was a fun thought experiment over GPU benchmarks.

Either way, it was a good conversation.


— Yuji Tomita, from a conversation with Claude, February 2026
