1. The Parameter Count Question
When people compare large language models to the human brain, the first number that surfaces is usually the synapse count. It's an intuitive comparison: synapses are the adjustable connection weights of the brain, and parameters are the adjustable connection weights of a neural network. So how many of each are there?
| Entity | Count | Order of Magnitude |
|---|---|---|
| Human brain synapses | ~100–150 trillion | 10¹⁴ |
| Human brain neurons | ~86 billion | 10¹⁰·⁹ |
| GPT-4 (estimated) | ~1–2 trillion | 10¹² |
| Llama 3 (405B) | 405 billion | 10¹¹·⁶ |
| GPT-3 | 175 billion | 10¹¹·² |
| GPT-2 | 1.5 billion | 10⁹·² |
On raw count, the brain's synapse count sits roughly 100–1000× larger than today's frontier models. That gap is real and worth acknowledging. But fixating on it misses the deeper story — because the synapse–parameter mapping is only superficially valid. The brain is doing several additional kinds of computation that have no parameter-count equivalent at all.
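The size of that gap is easy to check by hand. The sketch below recomputes the orders of magnitude and a couple of ratios from the table; the counts are copied from the table, the GPT-4 figure is an unofficial estimate, and the function names are illustrative only.

```python
import math

# Counts copied from the table above; the GPT-4 figure is an unofficial estimate.
COUNTS = {
    "Human brain synapses (low)": 100e12,
    "Human brain synapses (high)": 150e12,
    "GPT-4 (estimated, low)": 1e12,
    "Llama 3 (405B)": 405e9,
    "GPT-3": 175e9,
}

def order_of_magnitude(count: float) -> float:
    """Return log10 of a count, rounded to one decimal place."""
    return round(math.log10(count), 1)

def gap(synapses: float, parameters: float) -> float:
    """How many times larger the synapse count is than a parameter count."""
    return synapses / parameters

for name, count in COUNTS.items():
    print(f"{name}: 10^{order_of_magnitude(count)}")

print(gap(100e12, 1e12))    # low synapse estimate vs. the GPT-4 estimate
print(gap(150e12, 175e9))   # high synapse estimate vs. GPT-3
```

Depending on which synapse estimate and which model you pick, the ratio lands between about 100× and a little under 1000×, which is where the "2–3 orders of magnitude" figure comes from.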
2. Why the Synapse–Parameter Analogy Breaks Down
The analogy treats the brain as though it were a static feedforward network with learned weights. In reality, biological neural computation happens at multiple nested levels simultaneously:
LLM parameters map to one of these levels: the synaptic weights going into the dendritic branches. Every other level has no direct parameter-count analog. They are more like the learning algorithm itself running live during inference.
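The gaps can be organized into five distinct problems: dendritic nonlinearity, temporal integration, spike timing, neuromodulation, and online plasticity.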
3. Layer 1: Dendrites as Mini-Computers
The Standard Neuron Model (What LLMs Use)
Every artificial neuron in a modern network computes the same operation:
output = activation(Σ wᵢ · xᵢ + bias)
This is a weighted sum followed by a nonlinearity. Apart from their weights, the inputs are interchangeable: there is no spatial structure to the computation, and it doesn't matter which inputs arrive together or in which order.
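As a minimal sketch of the formula above, assuming a sigmoid activation (the text does not fix one) and made-up weights, the point neuron fits in a few lines. Reordering the (input, weight) pairs leaves the output unchanged, which is the "no spatial structure" property in concrete form.

```python
import math

def artificial_neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))

# Hypothetical values chosen only for illustration.
inputs = [1.0, 0.0, 1.0, 1.0]
weights = [0.5, -0.3, 0.8, 0.1]

# Reverse the order of the (input, weight) pairs: the sum has no notion of
# where on the cell an input arrived, so the result is the same.
pairs = list(reversed(list(zip(inputs, weights))))
y_original = artificial_neuron(inputs, weights, bias=-0.2)
y_shuffled = artificial_neuron([x for x, _ in pairs], [w for _, w in pairs], bias=-0.2)
```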
What Biological Dendrites Actually Do
A real pyramidal neuron has a dendritic tree with thousands of branches. Crucially, each branch is not a passive wire — it is an active computational subunit capable of generating its own local electrical event:
- NMDA spikes: When multiple synapses on the same branch activate close together in time and space, the NMDA receptors on that branch can trigger a local depolarization (a dendritic spike) that is nonlinearly amplified. This is effectively an AND-gate: only if co-active inputs cluster together on the same branch does the branch fire.
- Calcium (Ca²⁺) spikes: In distal dendrites, voltage-gated calcium channels produce slower, broader spikes that are sensitive to coincident input from different pathways.
The cell body (soma) never sees individual synaptic inputs. It receives the processed outputs of 10–20 dendritic sub-computations already performed in the branches.
The Poirazi & Mel result (2001) estimated that the computational capacity of a single layer-5 pyramidal neuron — counting its dendritic nonlinearities — is roughly equivalent to a 2-layer neural network with hundreds of hidden units. One neuron. Not one parameter. An entire sub-network.
This means the brain's true "parameter count equivalent" is dramatically underestimated by just counting synapses. You'd need to count all the nonlinear dendritic integration events as additional computational degrees of freedom.
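One way to make the "neuron as two-layer network" picture concrete is a toy model in which every branch applies its own steep nonlinearity before the soma sums the results. This is a sketch under stated assumptions, not the model from the cited work: the sigmoid shape, the threshold, the gain, and the four-branch layout are all hypothetical.

```python
import math

def branch_nonlinearity(drive, threshold=3.0, gain=4.0):
    """Steep sigmoid standing in for an NMDA-spike-like branch response."""
    return 1.0 / (1.0 + math.exp(-gain * (drive - threshold)))

def two_layer_neuron(branch_inputs, branch_weights):
    """Layer 1: each branch applies its nonlinearity to its local weighted sum.
    Layer 2: the soma sums the branch outputs."""
    branch_outputs = []
    for inputs, weights in zip(branch_inputs, branch_weights):
        local_drive = sum(w * x for w, x in zip(weights, inputs))
        branch_outputs.append(branch_nonlinearity(local_drive))
    return sum(branch_outputs)

# Four branches with four synapses each, all weights 1.0 (hypothetical values).
weights = [[1.0] * 4 for _ in range(4)]

# The same four active synapses, either clustered on one branch or spread out.
clustered = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
dispersed = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

soma_clustered = two_layer_neuron(clustered, weights)
soma_dispersed = two_layer_neuron(dispersed, weights)
```

A point neuron would see a total input of 4 in both cases. Here the clustered pattern drives the soma strongly while the dispersed pattern barely registers, which is the AND-gate behaviour described for NMDA spikes.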
4. Layer 2: Intra-Neuron Processing
Even after dendritic integration, a neuron performs several more processing steps before output:
Temporal Integration: Soma as a Leaky Integrator
The soma doesn't just sum its inputs at a single instant — it integrates over time. The membrane potential obeys roughly:
τ · dV/dt = -V + Σ(synaptic inputs)
Where τ is a time constant (typically 5–30 ms). With τ near the upper end of that range, a weak input from 50 ms ago still matters: about e^(−50/30) ≈ 19% of its deflection remains, enough to tip whether the neuron fires in response to a current input. Temporal context is built into the physical dynamics of the cell, not stored in a separate memory module.
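The membrane equation can be integrated numerically to see how much of a past input survives. The sketch below uses forward-Euler integration with a 0.1 ms step (an assumption; any small step works) and compares the two ends of the 5–30 ms range for τ.

```python
import math

def simulate_leaky_integrator(input_current, tau_ms, dt_ms=0.1):
    """Forward-Euler integration of tau * dV/dt = -V + I(t).
    input_current holds one value per time step."""
    v = 0.0
    trace = []
    for i_t in input_current:
        v += (dt_ms / tau_ms) * (-v + i_t)
        trace.append(v)
    return trace

def remaining_fraction(elapsed_ms, tau_ms):
    """Closed-form fraction of a past deflection left after elapsed_ms of silence."""
    return math.exp(-elapsed_ms / tau_ms)

dt = 0.1
steps_pulse = round(5 / dt)    # 5 ms input pulse
steps_gap = round(50 / dt)     # followed by 50 ms with no input
current = [1.0] * steps_pulse + [0.0] * steps_gap

trace_slow = simulate_leaky_integrator(current, tau_ms=30.0, dt_ms=dt)
trace_fast = simulate_leaky_integrator(current, tau_ms=5.0, dt_ms=dt)

# Ratio of the potential 50 ms after the pulse to the potential at pulse end.
ratio_slow = trace_slow[-1] / trace_slow[steps_pulse - 1]
ratio_fast = trace_fast[-1] / trace_fast[steps_pulse - 1]
```

With τ = 30 ms, roughly a fifth of the deflection is still present after 50 ms; with τ = 5 ms it has effectively vanished. The length of a neuron's "memory" is set directly by τ.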
LSTMs (see Section 6) are the closest engineered approximation to this, but their gating weights are fixed once training ends, whereas the biological time constant itself shifts with neuromodulatory state.
Back-Propagating Action Potentials
After a neuron fires, the spike doesn't just travel forward down the axon. It also propagates backward into the dendritic tree. This retrograde signal arrives at synapses that were recently active and modifies their future sensitivity — a form of online plasticity happening within a single neuron, mid-computation.
This has no analog in any feedforward architecture. The closest engineered concept is online Hebbian learning, but that typically changes weights after a full forward pass, not mid-pass.
5. Layer 3: Neuromodulation — The Most Alien Gap
This is where the comparison between brains and LLMs most completely collapses.
What Neuromodulation Is
Neuromodulation refers to a class of signalling systems in which a small number of neurons, located in specific brainstem and basal forebrain nuclei, broadcast chemical signals (neuromodulators) diffusely across vast brain regions. These signals don't transmit information the way synapses do. They change the operating mode of every neuron they reach, all at once.
The Four Key Neuromodulators
Dopamine — prediction error, not reward
Dopamine neurons in the VTA fire not when something good happens, but when something is better than expected. They stay near baseline for fully expected rewards, burst for surprising rewards, and dip below baseline when a predicted reward fails to arrive. This is a real-time temporal difference (TD) learning signal.
The engineering parallel is the TD error used in reinforcement learning, but in an LLM a learning signal of that kind only exists during training. The brain runs the equivalent of training and inference simultaneously, on the same hardware.
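The three dopamine cases map directly onto the sign of the standard TD error, δ = r + γ·V(s′) − V(s). The sketch below assumes a single cue state whose learned value is either 0 (reward not yet predicted) or 1 (reward fully predicted), with the trial ending after the reward; those values are illustrative.

```python
def td_error(reward, value_next, value_current, gamma=1.0):
    """Temporal-difference error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_current

# The trial ends after the reward, so the value of the next state is 0.
surprising_reward = td_error(reward=1.0, value_next=0.0, value_current=0.0)
expected_reward = td_error(reward=1.0, value_next=0.0, value_current=1.0)
omitted_reward = td_error(reward=0.0, value_next=0.0, value_current=1.0)
```

The surprising reward gives a positive error, the expected reward gives zero, and the omitted reward gives a negative error, matching the burst, baseline, and dip pattern of dopamine firing.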
Acetylcholine — the plasticity gate
ACh concentration controls whether synapses change when active. High ACh (novelty, attention, arousal) → high plasticity; synapses update easily. Low ACh (routine, familiar contexts) → plasticity suppressed; synapses freeze.
This is the brain's solution to the stability-plasticity dilemma: learn fast when the world is new, consolidate when it's familiar. No inference-time LLM has this. Weights are frozen the moment training ends.
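A plasticity gate of this kind can be sketched as a Hebbian update multiplied by a global scalar. This is an illustrative toy, not a model of cholinergic signalling: the `plasticity_gate` parameter, the learning rate, and the plain Hebbian rule are all assumptions.

```python
def gated_hebbian_update(weight, pre, post, plasticity_gate, learning_rate=0.1):
    """Hebbian update scaled by a global gate in [0, 1].
    Gate near 1: the synapse updates freely. Gate near 0: it is effectively frozen."""
    return weight + plasticity_gate * learning_rate * pre * post

w = 0.5
# Same pre/post activity, different global state.
w_novel = gated_hebbian_update(w, pre=1.0, post=1.0, plasticity_gate=1.0)
w_familiar = gated_hebbian_update(w, pre=1.0, post=1.0, plasticity_gate=0.0)
```

A deployed LLM behaves as if the gate were pinned at zero for every weight, regardless of what the input looks like.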
Norepinephrine — signal-to-noise control
When the locus coeruleus fires, it floods the cortex with norepinephrine. The effect is a global sharpening of neural responses: weak inputs get more suppressed, strong inputs get more amplified. The signal-to-noise ratio of the entire cortex shifts within a fraction of a second.
The closest LLM analog is temperature in sampling (softmax(logits / T)), but that's a fixed hyperparameter set before inference. The brain adjusts it dynamically in response to what's happening inside the network.
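The sharpening effect of temperature is easy to see numerically. The sketch below implements softmax(logits / T) with the usual max-subtraction for numerical stability; the logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T), computed with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # hypothetical values
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Lowering T pushes probability mass toward the largest logit (strong inputs amplified, weak ones suppressed), while raising T flattens the distribution. In a standard decoding loop T is chosen by the caller, not by the network's own state.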
Serotonin — temporal discounting and patience
Serotonin's computational role is still debated, but the leading theories tie it to temporal discounting — how far into the future the system considers consequences. High serotonin correlates with more patient, longer-horizon decision-making. This is a meta-level control over the objective function itself, not just the computation.
Why This Is Fundamentally Different from Any LLM Feature
Picture a two-by-two grid of scope (local or global) against timescale (static or fast). Neuromodulators operate in the global, fast quadrant. No LLM component sits there. Attention weights are local and fast; temperature is global but static; layer norm is local and fast. The quadrant that matters most for adaptive cognition is the one that's empty in artificial systems.
6. Pre-Transformer Architectures That Tried to Close These Gaps
Before "Attention Is All You Need" (2017) standardized the field, researchers had been working for decades on architectures motivated by each of these biological gaps. The efforts were largely parallel and uncoordinated. Each closed part of the problem while leaving others untouched.
Gap 1 — Dendritic Nonlinearity: Sigma-Pi Units (1986)
The problem they were solving: standard neurons can only represent linear combinations of their inputs before the activation function. Dendritic AND-gates require multiplicative interactions — input A times input B — which a simple weighted sum cannot express.
The solution: Rumelhart, Hinton & McClelland proposed Sigma-Pi units. Instead of Σ wᵢxᵢ, the neuron computes products of inputs first, then sums them:
output = σ( Σⱼ wⱼ · ∏_{i ∈ Sⱼ} xᵢ )
Here Sⱼ is the set of inputs grouped into product term j, and wⱼ is that term's weight. Each ∏ term corresponds to a dendritic branch performing local multiplication (coincidence detection). The outer Σ corresponds to the soma summing branch outputs.
Why it didn't scale: The number of possible product terms grows combinatorially with the number of inputs: n inputs admit 2ⁿ − 1 distinct non-empty groupings. In practice you had to specify by hand which inputs should multiply, because no learning rule of the time could discover the correct groupings efficiently. The architecture was biologically motivated but practically unscalable.
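A small sketch makes both the unit and its scaling problem concrete. It assumes the common formulation with one weight per product term and a sigmoid output; the groupings in `terms` are hand-specified and hypothetical, which is exactly the limitation described above.

```python
import math

def sigma_pi_unit(inputs, terms):
    """terms is a list of (weight, indices) pairs. Each term multiplies the
    selected inputs together; the weighted products are summed and squashed."""
    total = 0.0
    for weight, indices in terms:
        product = 1.0
        for i in indices:
            product *= inputs[i]
        total += weight * product
    return 1.0 / (1.0 + math.exp(-total))

def count_product_terms(n_inputs, max_order):
    """Number of distinct input groupings of size 1..max_order."""
    return sum(math.comb(n_inputs, k) for k in range(1, max_order + 1))

# Hand-specified groupings: inputs 0 and 1 share a "branch", as do 2 and 3.
terms = [(2.0, (0, 1)), (2.0, (2, 3))]
y_coincident = sigma_pi_unit([1, 1, 0, 0], terms)   # both inputs of one term active
y_split = sigma_pi_unit([1, 0, 1, 0], terms)        # one input from each term

terms_10_all = count_product_terms(10, 10)    # every non-empty grouping of 10 inputs
terms_100_triples = count_product_terms(100, 3)
```

Two co-active inputs in the same term move the output; the same two units of activity split across terms do nothing. The candidate-term count shows why the groupings could not simply be enumerated: even restricting 100 inputs to groups of at most three already yields more than 160,000 terms.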
Legacy: Capsule Networks (Hinton, 2017–2019) revisit the idea that parts of an input should be grouped before global decisions are made — a loose descendant of the Sigma-Pi insight. Second-order optimization methods also exploit input products implicitly.
Gap 2 — Temporal Integration: LSTMs and Reservoir Computing
The problem: biological neurons integrate inputs over time via membrane potential decay. A synaptic input from 50ms ago can still influence the current firing decision. Standard feedforward and even simple RNNs have no principled mechanism for this.
LSTMs (1997) are the most successful pre-transformer solution to any biological gap. The mapping is explicit:
The cell state is the membrane potential. The forget gate is the decay constant τ. The input gate is the threshold mechanism. The output gate controls what gets read out.
What LSTM still missed: the time constant τ is not fixed in biology. Acetylcholine, for example, can shift a neuron's integration timescale from milliseconds to hundreds of milliseconds. An LSTM's gate activations do vary with the input, but the weights that compute them are frozen after training, so nothing outside the learned function can reconfigure the gating at runtime.
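The correspondence between the forget gate and the membrane leak can be checked with a single-unit LSTM. This is a sketch of the analogy drawn in the text, not a claim about how LSTMs were derived; the scalar parameter names (`wf`, `uf`, `bf`, and so on) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM. p maps names to scalar parameters."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g    # cell state: decayed memory plus gated input
    h = o * math.tanh(c)      # exposed output
    return h, c, f

def leaky_step(v_prev, i_syn, tau_ms, dt_ms):
    """Forward-Euler step of tau * dV/dt = -V + I."""
    return (1.0 - dt_ms / tau_ms) * v_prev + (dt_ms / tau_ms) * i_syn

zero_params = dict.fromkeys(
    ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"], 0.0
)

# Pick the forget-gate bias so sigmoid(bf) = 1 - dt/tau = 0.95 (tau = 20 ms, dt = 1 ms).
matched = dict(zero_params, bf=math.log(0.95 / 0.05))
_, c_lstm, f_matched = lstm_step(x=0.0, h_prev=0.0, c_prev=1.0, p=matched)
v_leak = leaky_step(v_prev=1.0, i_syn=0.0, tau_ms=20.0, dt_ms=1.0)

# With a nonzero input weight on the forget gate, the same frozen parameters
# give a different effective decay when the input changes.
input_dependent = dict(matched, wf=-2.0)
_, _, f_with_input = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=input_dependent)
```

With no input, the cell state decays exactly like the leaky membrane. The gate value can still move with the input, but only along the function fixed by the trained parameters; there is no external signal that rescales it.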
Reservoir Computing / Echo State Networks (2001–2002) took a different approach. Instead of learning a gated recurrent structure, the reservoir is a large, fixed, randomly connected recurrent network. Its rich dynamics, typically tuned to sit just short of chaos so that the influence of old inputs fades, maintain an "echo" of recent temporal history. Only the linear readout layer is trained.
This is arguably more biologically plausible than LSTM — cortical dynamics look more like a high-dimensional chaotic reservoir than a gated sequence processor. But reservoirs couldn't be trained end-to-end, which severely limited their applicability.
Gap 3 — Spike Timing: Spiking Neural Networks
The problem: real neurons produce discrete spikes, and when spikes arrive carries information beyond firing rate alone. Two neurons sending 10 spikes/sec can mean entirely different things if one fires in tight bursts and the other fires uniformly. Mainstream ANNs use continuous activations and are completely blind to this.
Leaky Integrate-and-Fire (LIF) neurons model this most directly:
τ · dV/dt = -V + I_syn(t)
if V(t) ≥ V_threshold: emit spike, reset V → V_reset
The membrane potential integrates incoming spikes and leaks between them. When the threshold is crossed, a discrete spike is emitted and the potential resets. This is as close to the biology as a digital system gets.
The training problem: spikes are non-differentiable. The threshold crossing is a step function — gradient is zero everywhere except at the threshold, where it's undefined. Standard backpropagation cannot pass through it.
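Both the LIF dynamics and the gradient problem can be reproduced in a few lines. The sketch uses forward-Euler integration with a 1 ms step; the time constant, threshold, and input currents are illustrative values, and the finite-difference helper only demonstrates that the step function is flat away from threshold.

```python
def simulate_lif(input_current, tau_ms=20.0, v_threshold=1.0, v_reset=0.0, dt_ms=1.0):
    """Forward-Euler LIF: integrate, leak, spike on threshold crossing, reset."""
    v = v_reset
    spike_times = []
    for step, i_syn in enumerate(input_current):
        v += (dt_ms / tau_ms) * (-v + i_syn)
        if v >= v_threshold:
            spike_times.append(step * dt_ms)
            v = v_reset
    return spike_times

def spike_function(v, v_threshold=1.0):
    """Heaviside step mapping membrane potential to spike / no spike."""
    return 1.0 if v >= v_threshold else 0.0

def finite_difference(f, x, eps=1e-4):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

strong = simulate_lif([1.5] * 200)   # drive above threshold: regular spiking
weak = simulate_lif([0.8] * 200)     # drive below threshold: never fires

grad_below = finite_difference(spike_function, 0.5)
grad_above = finite_difference(spike_function, 1.5)
```

The suprathreshold input produces a regular spike train, the subthreshold input produces none, and the estimated gradient of the spike function is zero on both sides of the threshold, so an error signal has nothing to propagate through.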
Where SNNs won: energy efficiency on neuromorphic hardware. Intel's Loihi chip and IBM's TrueNorth exploit the sparse, event-driven nature of spiking computation: dynamic power is spent mainly when a spike occurs, not on every clock cycle. For certain sparse, event-driven inference tasks (gesture recognition, keyword spotting), SNN models running on neuromorphic hardware have been reported to achieve orders-of-magnitude better energy efficiency than GPU-based ANNs.
Gap 4 — Neuromodulation: ART and Fast Weights
The problem: no pre-transformer architecture had a mechanism for globally changing the operating mode of the network during inference, in response to internal state.
Adaptive Resonance Theory (ART, Grossberg 1976–2013) attacked the stability-plasticity problem most directly. ART networks maintain a top-down expectation signal that is compared against the bottom-up input. If they "resonate" (match within a vigilance threshold), the network reinforces the existing memory. If they mismatch, the network resets and opens a new category.
This is conceptually close to acetylcholine's role: only commit to memory when the input is consistent with prior expectations OR when a new category is warranted. ART worked on classification tasks but was fragile, required careful parameter setting, and never scaled to deep hierarchical representations.
Fast Weights (Hinton 1987, revisited Ba et al. 2016) are the most direct engineering attempt at the neuromodulation gap. The core idea is two weight timescales:
| Weight type | Timescale | Biological analog |
|---|---|---|
| Slow weights | Updated across training | Synaptic long-term potentiation (LTP) |
| Fast weights | Updated within a single forward pass | Short-term synaptic facilitation/depression |
Fast weights let the network temporarily reconfigure its own connectivity during inference, a loose engineering counterpart to the runtime reconfiguration that dopamine and acetylcholine enable biologically. A unit that recently fired strengthens its connections to recently co-active units within the current computation, not just across training episodes.
The 2016 revival showed strong results on few-shot learning tasks and is a conceptual ancestor of modern in-context learning. When a transformer "learns" from a few examples in its context window, it may be approximating fast weights — but the mechanism is emergent from scale, not designed.
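The fast-weight idea can be sketched as a matrix that accumulates decaying outer products of recent hidden states. This follows the spirit of the 2016 formulation but is a simplified illustration: the decay and fast learning rate values are assumptions, and the slow-weight network that would normally produce the hidden states is omitted.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def update_fast_weights(A, h, decay=0.95, fast_lr=0.5):
    """A <- decay * A + fast_lr * outer(h, h): a Hebbian trace of recent activity."""
    hh = outer(h, h)
    return [[decay * a + fast_lr * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, hh)]

def apply_fast_weights(A, query):
    """Matrix-vector product A @ query."""
    return [sum(a * q for a, q in zip(row, query)) for row in A]

dim = 3
A = [[0.0] * dim for _ in range(dim)]

# Two hidden states seen during a single forward pass (hypothetical values).
h_old = [1.0, 0.0, 0.0]
h_new = [0.0, 1.0, 0.0]
A = update_fast_weights(A, h_old)
A = update_fast_weights(A, h_new)

recalled_old = apply_fast_weights(A, h_old)
recalled_new = apply_fast_weights(A, h_new)
```

Applying the fast-weight matrix to a query returns past hidden states weighted by their similarity to the query and by their recency. That weighted lookup over recent activity is why fast weights are often described as a precursor of attention over a context window.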
Gap 5 — Online Plasticity: Hopfield Networks and Hebbian Rules
The problem: in LLMs, weights are completely frozen during inference. The brain has no such phase separation — synapses change continuously, including during the equivalent of inference.
Hopfield Networks (1982) model associative memory as energy minimization:
The weights encode memories as energy minima. Retrieval is an active dynamical process — the network evolves over time until it reaches a stable state — not a single matrix multiply. This is closer to biology than feedforward inference.
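Retrieval as a dynamical process is simple to demonstrate with a classical binary Hopfield network. The sketch below uses the standard Hebbian outer-product rule and asynchronous updates; the eight-unit pattern is made up for illustration.

```python
def train_hopfield(patterns):
    """Hebbian outer-product rule with zero diagonal; patterns use +1/-1 entries."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W

def energy(W, state):
    """E = -1/2 * sum_ij W_ij * s_i * s_j."""
    n = len(state)
    return -0.5 * sum(W[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def recall(W, state, max_sweeps=10):
    """Asynchronous updates until no unit changes (a fixed point of the dynamics)."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(W[i][j] * state[j] for j in range(len(state)))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

stored = [1, -1, 1, -1, 1, -1, 1, -1]
W = train_hopfield([stored])
corrupted = [1, -1, 1, -1, 1, -1, -1, 1]   # last two units flipped
recovered = recall(W, corrupted)
```

The corrupted state sits at a higher energy than the stored pattern, and the update loop slides it downhill until it reaches the stored memory.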
Modern Hopfield Networks (Ramsauer et al., 2020) dramatically increased storage capacity and proved a formal equivalence between Hopfield dynamics and self-attention. This wasn't coincidence — it revealed that transformers are implicitly computing something like energy-minimizing retrieval, though without the continuous-time dynamics.
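The correspondence can be illustrated in its simplest form, where the stored patterns serve as both keys and values and no learned projections are applied (an assumption made here for brevity; the published result covers the general case with projections). One retrieval step then has the same form as single-query dot-product attention, with the inverse temperature β playing the role of the attention scale.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def modern_hopfield_update(patterns, query, beta):
    """One retrieval step: new_state = sum_k softmax(beta * <x_k, query>)_k * x_k."""
    weights = softmax([beta * dot(p, query) for p in patterns])
    return [sum(w * p[d] for w, p in zip(weights, patterns))
            for d in range(len(query))]

def attention(query, keys, values, scale):
    """Single-query dot-product attention: softmax(scale * q K^T) V."""
    weights = softmax([scale * dot(k, query) for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

# Hypothetical stored patterns and a noisy version of the first one.
patterns = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
noisy_query = [0.9, 0.1, 0.0, 0.8]
beta = 8.0

retrieved = modern_hopfield_update(patterns, noisy_query, beta)
via_attention = attention(noisy_query, patterns, patterns, scale=beta)
```

A single update pulls the noisy query almost all the way to the first stored pattern, and the attention call returns the identical vector.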
Spike-Timing Dependent Plasticity (STDP) was the learning-rule attempt at online plasticity:
- If neuron A fires before neuron B: strengthen A→B (causal, A "caused" B)
- If neuron A fires after neuron B: weaken A→B (anti-causal)
STDP is one of the most biologically validated learning rules known. It has a beautiful temporal logic: if A fired before B, A probably caused B, so the connection should be strengthened. But STDP is purely local — it has no credit assignment for distant layers. Deep networks trained with STDP alone fail for the same reason you can't train a multi-layer network with simple Hebbian rules: there's no signal telling lower layers how to adjust for errors at the output.
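The timing rule in the two bullets above is usually written as a pair of exponential windows. The sketch below is a common pair-based form; the amplitudes and the 20 ms time constant are illustrative assumptions, not values from the text.

```python
import math

def stdp_weight_change(t_pre_ms, t_post_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Pair-based STDP with exponential windows.
    Pre before post (dt > 0): potentiation. Post before pre (dt < 0): depression."""
    dt = t_post_ms - t_pre_ms
    if dt > 0:
        return a_plus * math.exp(-dt / tau_ms)
    if dt < 0:
        return -a_minus * math.exp(dt / tau_ms)
    return 0.0

causal_close = stdp_weight_change(t_pre_ms=0.0, t_post_ms=5.0)    # A shortly before B
causal_far = stdp_weight_change(t_pre_ms=0.0, t_post_ms=50.0)     # A long before B
anti_causal = stdp_weight_change(t_pre_ms=5.0, t_post_ms=0.0)     # A after B
```

The update depends only on the two spike times at one synapse. Nothing in the function refers to an output error, which is the locality that keeps STDP from assigning credit across layers on its own.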
7. The Most Ambitious Combination Attempt: SPAUN
The Semantic Pointer Architecture Unified Network (SPAUN), built by Chris Eliasmith's group at the University of Waterloo in 2012, remains the most comprehensive attempt to build a large-scale biologically constrained brain model that could actually perform tasks.
What SPAUN combined:
What it could do — eight different cognitive tasks within a single architecture, switching based on visual input:
| Task | Cognitive demand |
|---|---|
| Digit recognition | Visual perception |
| Serial working memory | Temporal sequence storage |
| Digit span | Short-term memory capacity |
| Pattern completion | Associative retrieval |
| Addition via counting | Procedural arithmetic |
| Rapid variable creation | Abstract rule binding |
| Fluid reasoning (Raven's) | Analogical inference |
| Handwriting | Motor sequence output |
The scale: 2.5 million spiking neurons. The cost: a few seconds of "brain time" took several hours of compute on the hardware available in 2012.
What SPAUN proved: it's possible to build a single, unified, biologically constrained architecture that performs diverse cognitive tasks by combining spiking dynamics, attractor-based working memory, reservoir computation, and a basal ganglia model of action selection. Nothing before or since has done this with the same biological fidelity.
What SPAUN couldn't do: learn from data. The entire architecture was hand-engineered — connection weights were derived analytically from the Neural Engineering Framework, not learned by gradient descent. It was a magnificent proof of concept with no path to scale.
8. The Irony of the Transformer
The transformer architecture, by consciously ignoring most of this biological literature, accidentally approximates some of it — without any of it being designed in.
In-context learning as emergent fast weights: When a transformer "learns" from a few examples in its context window without any weight update, it is implementing something functionally equivalent to fast weights — the forward pass itself is modifying the computation based on recent input. This was not designed; it appeared as an emergent property of scale.
Modern Hopfield / attention equivalence: Ramsauer et al. (2020) proved that the update rule for modern Hopfield networks is mathematically equivalent to the attention operation in transformers. Transformers are implicitly computing energy-minimizing retrievals from their key-value memory — something that Hopfield Networks were explicitly designed to do in 1982.
The feedforward layers may be implicit dendritic computation: the two-layer MLP after each attention block in a transformer has enough parameters that it may be implicitly representing higher-order input interactions, something Sigma-Pi units tried to engineer explicitly. This is speculative, but interpretability work (e.g., Geva et al.'s 2021 analysis of transformer feed-forward layers as key-value memories) suggests the FFN layers are doing something closer to structured retrieval than to smooth function approximation.
9. What Remains Genuinely Unsolved
After all of this — fifty years of bio-motivated architectures, and seven years of transformer scaling — five gaps remain fundamentally open:
The deepest of these is probably embodiment and active inference. The brain is not predicting the next token in a sequence of text. It is a predictive machine embedded in a body, constantly generating predictions about incoming sensory data and updating those predictions via sensorimotor feedback. Karl Friston's Free Energy Principle frames the entire brain as minimizing surprise about sensory input — a fundamentally different computational objective from next-token prediction.
LLMs are extraordinarily good at what they do. But the parameter count comparison, taken alone, implies a competition between two systems trying to do the same thing at different scales. The more accurate picture is that they are running different algorithms, at different timescales, for different objectives, on different substrates. One of them (the biological one) has had roughly 500 million years of evolutionary iteration; the other has had about fifteen.
Summary: Gaps, Architectures, and Status
The honest summary: we are roughly within 2–3 orders of magnitude on synapse count. On computational architecture, we have partial solutions for two of the five major gaps (temporal integration via attention, and associative memory via modern Hopfield / in-context learning). The other three — dynamic gain control, inference-time plasticity, and the fundamental difference in embodied objective — remain as open as they were in 1986.
The path forward almost certainly isn't more parameters. It's probably closer to the questions Maass, Eliasmith, and Hinton were asking in the 1990s: how do you build a system that can change its own computation mid-inference, in response to what it's currently computing?
This article synthesizes research from Rumelhart, Hinton & McClelland (1986), Hochreiter & Schmidhuber (1997), Maass (2002), Jaeger (2001), Eliasmith et al. (2012), Ramsauer et al. (2020), Ba, Hinton et al. (2016), Poirazi & Mel (2001), and Grossberg (1976–2013).
license: "Creative Commons Attribution-ShareAlike 4.0 International"