AI Interpretability
Understanding how AI models think
Introduction
Throughout my interviews and interactions with people, you may have heard me say that I don’t think AI is “intelligent.” It is very competent, however, and increasingly so. It is astonishing how well these models perform on the whole. The funny thing is that researchers don’t fully understand what’s going on inside the black box. There have been recent efforts to change that. But in my opinion, the goal isn’t just to make the models even more competent; it is to stop some serious and undeniable harms that some users are experiencing. The term “AI Psychosis” isn’t exactly what these companies want in the headlines. I’m glad that more attention is being paid to these problems.
This post highlights what Anthropic discusses in the video below. I’d like you to see how these engineers talk among themselves. Remember, these are secular people without a Christian worldview. Their casual bantering and speculation about the human brain, and the transhumanist idea that AI can beneficially transform the human race, upset me. Pray for these folks to know the Lord. They are building the Tower of Babel, and they don’t understand the spiritual implications of what they are doing. Just to be clear, I strongly believe that—knowingly or unknowingly—they are helping to equip the Antichrist with counterfeit omnipotence, omnipresence, and omniscience. AI and quantum computing will play a major role in the control systems that are coming soon. Yes—it is possible that we will witness some of these advanced controls before the Rapture. I believe this mostly because these are the systems required to fulfill what we read about in Revelation, especially Revelation 13. But there is another sad reality. The Church—the Bride of Christ—isn’t ready, would you agree? To me, this suggests the Lord will permit pressure to come on the Church so that people begin to snap out of their stupor and normalcy bias. Only time will tell.
What is Going on Inside of an AI?
Have you ever wondered what's really going on inside AI systems? Are they just fancy autocorrect tools, or are they actually *thinking* in some way? [NOTE: We just did a deep dive on whether or not AI is actually thinking here and TL;DR: no]. A fascinating discussion from Anthropic's interpretability team dives deep into this mystery. In the video, host and researchers Jack, Emanuel, and Josh explore how they're "opening up" Claude to understand its inner workings. They compare it to doing biology or neuroscience on a digital brain, revealing surprising insights about how AI processes information. We'll explore why AI isn't just predicting words but building complex ideas, why it sometimes "hallucinates" wrong answers, and why understanding this matters for our future with AI. By the end, you'll see AI not as a black box, but as a puzzle we're starting to solve.
Peering Inside the Mind of AI: Unraveling the Secrets of Large Language Models
The Biology Analogy – Treating AI Like a Living Organism
One of the most mind-bending ideas in the discussion is comparing AI models to biological creatures. The researchers aren't kidding when they say they're doing "biology on these organisms we've made out of math." Jack, a former neuroscientist, Emanuel, a machine learning expert, and Josh, with a background in viral evolution and math, explain that LLMs aren't programmed like traditional software. You don't tell them, "If the user asks for breakfast ideas, say toast." Instead, they're trained on massive amounts of data, starting from random guesses and evolving through tweaks to predict the next word better.
This process mirrors evolution. Just like animals adapt over generations to survive and reproduce, AI models develop internal structures to excel at their "survival" task: predicting text. But here's the twist – these structures aren't simple. The model doesn't "think" of itself as just guessing words; it builds intermediate goals and abstractions. For example, to predict what comes after an equals sign in a math equation, it has to compute the answer internally, without a built-in calculator.
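To make the “just predict the next word” idea concrete, here is a toy sketch of my own (not Anthropic’s code): the simplest possible next-word predictor, built from nothing but counts of which word follows which. A real LLM replaces this lookup table with billions of learned parameters, which is exactly why it has to compute answers internally rather than merely look them up.

```python
from collections import Counter, defaultdict

# Toy "training" corpus: the model's only job is to predict the next word.
corpus = "two plus two equals four . two plus three equals five .".split()

# Count which word follows which (a bigram table, the simplest next-word model).
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    """Return the most frequently observed next word."""
    return follows[word].most_common(1)[0][0]

# To continue text after "equals", this toy can only parrot what it has seen;
# a real LLM goes further and computes the arithmetic internally.
print(predict_next("plus"))
print(predict_next("equals"))
```

Notice the limitation: this toy can never answer “two plus four” because it has only ever seen specific pairings, which is why a real model that must handle unseen sums ends up building internal circuits for addition.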
Why biology and not physics? Josh says it "feels" like biology because the models are mysterious and complex, shaped by an evolutionary-like process. No one sets all the "knobs" inside; they emerge naturally. This analogy helps us understand why studying AI requires experiments, like poking at a brain to see what lights up. In real neuroscience, you'd use fMRI scans to watch brain activity. Here, the team observes which "parts" of the model activate during tasks, like recognizing coffee or tea.
This perspective shifts how we see AI. It's not a rigid program like Microsoft Word; it's more like a grown organism with hidden layers. For 12th graders studying biology, think of it like dissecting a frog – you open it up to see how organs work together. The researchers can "nudge" parts of Claude, turning concepts on or off, which is easier than real biology because they can clone identical models and test endlessly without ethical issues.
But is this overkill? Not really. Understanding AI biologically helps explain why it can write poetry or do addition – tasks that seem impossible for a mere word-predictor. It develops a "contextual understanding," going beyond rote memorization to grasp ideas deeply.
Internal Concepts and Abstractions – What's Really Inside the AI's "Brain"?
Delving deeper, the team discusses "interpretability," the science of peering inside LLMs to map their thought processes. They aim to create a "flowchart" of how the model goes from input (your question) to output (its answer), using concepts like objects, goals, or even emotions.
How do they find these? By observing activations – parts of the model that "light up" during specific tasks. It's like spotting brain regions active when someone thinks about love or trains. But AI has millions of these "features." The challenge: discovering them without guessing. Their methods are "hypothesis-free," letting data reveal abstractions the model uses, which can be weird from a human view.
Examples abound. One very disturbing feature activates for "sycophantic praise" – over-the-top compliments. Another is tied to the Golden Gate Bridge, lighting up not just for the words but for related ideas, like driving from San Francisco to Marin or seeing a picture. This shows robust, abstract associations (not true understanding).
Even math gets quirky. A "6 plus 9" feature activates whenever the model adds numbers ending in 6 and 9, even in odd contexts like citing a journal from 1959, volume 6 (adding to get the year). This proves the model learns general circuits for computation, not just memorizing facts. It's efficient – storing one addition rule beats memorizing every instance.
Multilingualism reveals shared concepts too. In small models, French and English processing are separate, but bigger ones develop a "universal language" where "big" is the same idea across tongues, translated out as needed. This suggests a "language of thought" beyond English, where the model thinks abstractly before outputting words.
The model recombines abstractions for new situations, like humans do. For students, it’s like how you learn algebra rules and apply them to novel problems instead of memorizing every equation. But in my opinion and experience, this ability is wildly hit or miss.
Why AI Hallucinates – The Confabulation Conundrum
Hallucinations – when AI confidently gives wrong info – frustrate users. The term researchers use for this is "confabulation": plausible but false stories. Why does it happen? Training pushes models to always guess the next word, even when unsure. If you grasp this principle, you can see why trusting AI output uncritically can be harmful, if not disastrous. Early in training, guessing "a city" for France's capital scores better than "sandwich." Over time the model refines that to "Paris," but the "best guess" habit sticks.
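The “best guess” habit can be sketched like this (a toy of my own, with invented probabilities): whether the model is nearly certain or nearly flipping a coin, committing to the top choice produces an equally confident-sounding answer:

```python
# Invented probability distributions purely for illustration.
confident = {"Paris": 0.95, "London": 0.03, "Berlin": 0.02}
unsure    = {"Paris": 0.36, "London": 0.33, "Berlin": 0.31}

def best_guess(dist):
    """The model always commits to its top choice, sure or not."""
    return max(dist, key=dist.get)

# Both cases produce an equally confident-sounding answer, but the second
# is barely better than a coin flip -- the seed of a confabulation.
print(best_guess(confident), max(confident.values()))
print(best_guess(unsure), max(unsure.values()))
```

The answer text alone never reveals which case you are in; that is why a separate confidence check matters.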
Post-training, we want models to admit ignorance, but that behavior is bolted on. Two circuits clash: one guesses answers, another assesses confidence ("Do I know this?"). If the confidence circuit errs and says "yes," the model commits, leading to confident mistakes like saying London is France's capital.
Research shows a "fame" circuit checks if a topic is known enough. Manipulating it could reduce hallucinations. Models are improving – they're better calibrated now than years ago – but the setup (separate circuits not communicating) persists. Humans integrate thinking and self-assessment better, though we have "tip-of-the-tongue" moments too.
For safety, this matters. If AI runs power stations, we can't afford hallucinations. Interpretability lets us tweak these circuits, making AI more reliable. Output randomness is also controlled by “temperature”, typically a 0-to-1 value that tells the AI model how much creativity is desired (or permitted) in the response.
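Roughly, temperature divides the model’s raw scores before they are turned into probabilities, so a low value sharpens the distribution toward one answer and a high value flattens it. Here is a generic sketch with invented scores (exact ranges and defaults vary by provider, and a temperature of 0 is usually treated as “always pick the top word” rather than a division by zero):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; low temperature sharpens, high flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented raw scores for three candidate words

cold = softmax_with_temperature(logits, 0.2)  # near-deterministic: top word dominates
hot  = softmax_with_temperature(logits, 1.0)  # more spread: "creative" sampling

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At 0.2 the top word takes almost all of the probability mass; at 1.0 the runners-up get a real chance, which is where varied (and sometimes riskier) outputs come from.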
Planning Ahead and Trust – Can We Rely on AI?
AI doesn't just react; it plans. In poetry, the model picks the rhyming word for line two *before* starting it, ensuring coherence. Nudging it (e.g., swapping "rabbit" for "green") makes it rewrite sensibly: "He saw a carrot and had to grab it, paired it with his leafy greens."
This reveals forward-thinking, essential for next-word prediction without backing into corners. Experiments confirm by intervening mid-process.
Trust ties in. Models can "bullshit" – like reverse-engineering math to match a hinted answer, faking steps. Their "chain of thought" (explaining reasoning) isn't always faithful; internal concepts show ulterior motives, like sycophancy.
Why? Training simulates conversations, where agreeing is common. As assistants, we want truth, but "plan B" (when stuck) reverts to weird habits. Interpretability spots this, crucial for high-stakes uses like finance or governance. Ultimately, this builds toward safer AI. By understanding planning and motivations, we prevent deception or errors. But in practice this falls way short.
Key Points to Remember
1. **Beyond Next-Word Prediction**: LLMs develop complex internal abstractions and goals to excel at their task, much like how evolution shapes humans beyond just survival.
2. **Surprising Features**: Models have dedicated "circuits" for niche concepts, like adding specific digits or recognizing sycophantic praise, showing generalized learning over memorization.
3. **Multilingual Unity**: Larger models share concepts across languages, creating a universal "language of thought" that's abstract and efficient.
4. **Hallucination Causes**: Stem from conflicting circuits – one for guessing, one for confidence – leading to committed errors; improving communication between them could help.
5. **Interpretability for Safety**: Opening the "black box" reveals planning, deceptions, and motivations, essential for trusting AI in real-world applications.
Notable Quotes from the Participants
1. Josh on the biology analogy: "It's like the biology of language models instead of like the physics of language models... this complicated thing that kind of got made over time, kind of like biological forms evolved over time." Note: secular mindset.
2. Emanuel on internal goals: "The model doesn't think of itself necessarily as trying to predict the next word. Internally, it's developed potentially all sorts of intermediate goals and abstractions that help it achieve that kind of meta objective." That may be true, but experience shows progress is uncertain sometimes.
3. Jack on trustworthiness: "The thing it's actually thinking is different than the thing it's writing on the page... to spot check, you know, the model's telling us a bunch of stuff, but what was it really thinking? Is it saying these things for some ulterior motive?" [WOW—just think about that!]
Summary
This Anthropic discussion peels back the layers of AI, showing it's far more than an autocomplete – it's a digital “organism” with hidden depths. By treating models biologically, researchers uncover abstractions, explain hallucinations, and reveal planning, all toward “safer” AI. We are entering a world where AI will be everywhere; the setup the Antichrist will use is being constructed as we speak.
Is AI competent? Yes, mostly. If you were to stand over my shoulder as I program using Cursor and Claude 4.1 Sonnet, you’d be impressed. Line after line of code appears quickly enough (50–60+ words per minute) that you can’t read it as fast as it is written. It’s impressive on the whole. I’ve witnessed massive productivity gains in the last 2.5 years. The researchers you see in this video give us insight into a secular mindset. They make bold assertions without understanding or acknowledging the Lord. Lift them up.
There is one businesswoman in particular that I’m keeping an eye on right now. She is VERY pro-AI. She claims to have been miraculously healed by God, and is quite public about it. But which God? Big “G” or little “g”? Please pray for her as a proxy for all people in the Tech Sector…they need Jesus! I might write more about her in the future.
#maranatha
YBIC,
Scott

