The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI
Fri Feb 06 2026
From Palantir and Two Sigma to building Goodfire into the poster-child for actionable mechanistic interpretability, Mark Bissell (Member of Technical Staff) and Myra Deng (Head of Product) are trying to turn “peeking inside the model” into a repeatable production workflow by shipping APIs, landing real enterprise deployments, and now scaling the bet with a recent $150M Series B funding round at a $1.25B valuation.
In this episode, we go far beyond the usual “SAEs are cool” take. We talk about Goodfire’s core bet: that the AI lifecycle is still fundamentally broken because the only reliable control we have is data, so we post-train, RLHF, and fine-tune by “slurping supervision through a straw,” hoping the model picks up the right behaviors while quietly absorbing the wrong ones. Goodfire’s answer is to build a bi-directional interface between humans and models: read what’s happening inside, edit it surgically, and eventually use interpretability during training so customization isn’t just brute-force guesswork.
Mark and Myra walk through what that looks like when you stop treating interpretability like a lab demo and start treating it like infrastructure: lightweight probes that add near-zero latency, token-level safety filters that can run at inference time, and interpretability workflows that survive messy constraints (multilingual inputs, synthetic→real transfer, regulated domains, no access to sensitive data). We also get a live window into what “frontier-scale interp” means operationally (i.e. steering a trillion-parameter model in real time by targeting internal features), plus why the same tooling generalizes cleanly from language models to genomics, medical imaging, and “pixel-space” world models.
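To make the “lightweight probe” idea concrete, here is a minimal sketch (our illustration, not Goodfire’s actual code): a linear probe over per-token hidden states from an open-weights model, flagging tokens whose score crosses a threshold. The model name, layer index, threshold, and random probe weights are placeholder assumptions; in practice the probe would be fit offline on labeled activations.

```python
# Minimal sketch of a token-level activation probe (illustrative; not Goodfire's code).
# Assumptions: any open-weights chat model, a middle layer, and probe weights that in
# practice would be fit offline (e.g. logistic regression on labeled activations).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model choice
LAYER = 12                                  # placeholder layer choice
THRESHOLD = 0.8                             # placeholder decision threshold

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)

# A linear probe is just a weight vector and bias over the hidden size.
hidden_size = model.config.hidden_size
probe_w = torch.randn(hidden_size)   # placeholder weights; load trained ones in practice
probe_b = torch.tensor(0.0)

def flag_tokens(text: str):
    """Return (token, score) pairs whose probe score exceeds THRESHOLD."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    acts = out.hidden_states[LAYER][0]                 # (seq_len, hidden_size)
    scores = torch.sigmoid(acts @ probe_w + probe_b)   # one scalar per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return [(tok, s.item()) for tok, s in zip(tokens, scores) if s > THRESHOLD]

print(flag_tokens("My phone number is 555-0199 and my email is jane@example.com."))
```

Since the probe is a single matrix-vector product over activations the model already computes during generation, the added latency is negligible compared to hosting a second LLM as a judge.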
We discuss:
* Myra + Mark’s path: Palantir (health systems, forward-deployed engineering) → Goodfire early team; Two Sigma → Head of Product, translating frontier interpretability research into a platform and real-world deployments
* What “interpretability” actually means in practice: not just post-hoc poking, but a broader “science of deep learning” approach across the full AI lifecycle (data curation → post-training → internal representations → model design)
* Why post-training is the first big wedge: “surgical edits” for unintended behaviors like reward hacking, sycophancy, and noise learned during customization, plus the dream of targeted unlearning and bias removal without wrecking capabilities
* SAEs vs probes in the real world: why SAE feature spaces sometimes underperform classifiers trained on raw activations for downstream detection tasks (hallucination, harmful intent, PII), and what that implies about “clean concept spaces” (see the probe-comparison sketch after this list)
* Rakuten in production: deploying interpretability-based token-level PII detection at inference time to prevent routing private data to downstream providers, plus the gnarly constraints: no training on real customer PII, synthetic→real transfer, English + Japanese, and tokenization quirks
* Why interp can be operationally cheaper than LLM-judge guardrails: probes are lightweight, low-latency, and don’t require hosting a second large model in the loop
* Real-time steering at frontier scale: a live demo of steering Kimi K2 (~1T params) by finding features via SAE pipelines, auto-labeling them with LLMs, and toggling a “Gen-Z slang” feature across multiple layers without breaking tool use (see the steering sketch after this list)
* Hallucinations as an internal signal: the case that models have latent uncertainty / “user-pleasing” circuitry you can detect and potentially mitigate more directly than black-box methods
* Steering vs prompting: the emerging view that activation steering and in-context learning are more closely connected than people think, including work mapping between the two (even for jailbreak-style behaviors)
* Interpretability for science: using the same tooling across domains (genomics, medical imaging, materials) to debug spurious correlations and extract new knowledge, up to and including early biomarker discovery work with major partners
* World models + “pixel-space” interpretability: why vision/video models make concepts easier to see, how that accelerates the feedback loop, and why robotics/world-model partners are especially interesting design partners
* The north star: moving from “data in, weights out” to intentional model design where experts can impart goals and constraints directly, not just via reward signals and brute-force post-training
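As a rough illustration of the SAEs-vs-probes comparison above, the sketch below trains one classifier on raw activations and one on SAE feature activations and compares held-out accuracy. Everything here is stubbed: the activations, labels, and SAE encoder are random stand-ins and the sizes are toy; the point is only the shape of the experiment, not Goodfire’s methodology.

```python
# Minimal sketch of the "raw-activation probe vs SAE-feature probe" experiment
# (illustrative only). A real comparison would use activations from a model on
# labeled data (hallucination, harmful intent, PII, ...) and a trained SAE.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model, n_features = 2000, 512, 4096   # toy sizes

X = rng.normal(size=(n, d_model)).astype(np.float32)    # stand-in activations
y = (X[:, :8].sum(axis=1) > 0).astype(int)              # stand-in labels
W_enc = rng.normal(size=(d_model, n_features)).astype(np.float32) / np.sqrt(d_model)
b_enc = np.zeros(n_features, dtype=np.float32)

def sae_features(acts: np.ndarray) -> np.ndarray:
    """Encode activations into a (nominally sparse) SAE feature space."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

raw_probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
sae_probe = LogisticRegression(max_iter=2000).fit(sae_features(X_tr), y_tr)

print("raw-activation probe accuracy:", raw_probe.score(X_te, y_te))
print("SAE-feature probe accuracy:   ", sae_probe.score(sae_features(X_te), y_te))
```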
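And for the steering demo, here is a minimal sketch of what “toggling a feature” can mean mechanically: add a scaled concept direction (for example, a row of an SAE decoder) to the residual stream at one layer via a forward hook during generation. The model, layer, scale, and random placeholder direction are assumptions for illustration; a production system like the one demoed on Kimi K2 would target real, labeled feature directions, potentially across multiple layers.

```python
# Minimal sketch of activation steering via a forward hook (illustrative only).
# Assumptions: any decoder-only HF model, a single steering layer, and a unit-norm
# "feature direction" in the residual stream (in practice e.g. an SAE decoder row
# for a labeled concept; here it is a random placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model choice
LAYER = 12                                  # placeholder layer to steer at
SCALE = 8.0                                 # placeholder steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

d_model = model.config.hidden_size
feature_dir = torch.randn(d_model)
feature_dir = feature_dir / feature_dir.norm()   # keep the direction unit-norm

def steer_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + SCALE * feature_dir.to(hidden.dtype).to(hidden.device)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
try:
    inputs = tokenizer("Explain what a sparse autoencoder does.", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=80)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()   # detach the hook so later generations are unsteered
```

Removing the hook (or zeroing the scale) toggles the behavior off again, which is part of what makes this kind of edit attractive next to a full fine-tune.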
—
Goodfire AI
* Website: https://goodfire.ai
* LinkedIn: https://www.linkedin.com/company/goodfire-ai/
* X: https://x.com/GoodfireAI
Myra Deng
* Website: https://myradeng.com/
* LinkedIn: https://www.linkedin.com/in/myra-deng/
* X: https://x.com/myra_deng
Mark Bissell
* LinkedIn: https://www.linkedin.com/in/mark-bissell/
* X: https://x.com/MarkMBissell
Full Video Episode
Timestamps
00:00:00 Introduction
00:00:05 Introduction to the Latent Space Podcast and Guests from Goodfire
00:00:29 What is Goodfire? Mission and Focus on Interpretability
00:01:01 Goodfire’s Practical Approach to Interpretability
00:01:37 Goodfire’s Series B Fundraise Announcement
00:02:04 Backgrounds of Mark and Myra from Goodfire
00:02:51 Team Structure and Roles at Goodfire
00:05:13 What is Interpretability? Definitions and Techniques
00:07:29 Post-training vs. Pre-training Interpretability Applications
00:08:51 Using Interpretability to Remove Unwanted Behaviors
00:10:09 Grokking, Double Descent, and Generalization in Models
00:12:06 Subliminal Learning and Hidden Biases in Models
00:14:07 How Goodfire Chooses Research Directions and Projects
00:16:04 Limitations of SAEs and Probes in Interpretability
00:18:14 Rakuten Case Study: Production Deployment of Interpretability
00:21:12 Efficiency Benefits of Interpretability Techniques
00:21:26 Live Demo: Real-Time Steering in a Trillion Parameter Model
00:25:15 How Steering Features are Identified and Labeled
00:26:51 Detecting and Mitigating Hallucinations Using Interpretability
00:31:20 Equivalence of Activation Steering and Prompting
00:34:06 Comparing Steering with Fine-Tuning and LoRA Techniques
00:36:04 Model Design and the Future of Intentional AI Development
00:38:09 Getting Started in Mechinterp: Resources, Programs, and Open Problems
00:40:51 Industry Applications and the Rise of Mechinterp in Practice
00:41:39 Interpretability for Code Models and Real-World Usage
00:43:07 Making Steering Useful for More Than Stylistic Edits
00:46:17 Applying Interpretability to Healthcare and Scientific Discovery
00:49:15 Why Interpretability is Crucial in High-Stakes Domains like Healthcare
00:52:03 Call for Design Partners Across Domains
00:54:18 Interest in World Models and Visual Interpretability
00:57:22 Sci-Fi Inspiration: Ted Chiang and Interpretability
01:00:14 Interpretability, Safety, and Alignment Perspectives
01:04:27 Weak-to-Strong Generalization and Future Alignment Challenges
01:05:38 Final Thoughts and Hiring/Collaboration Opportunities at Goodfire
Transcript
Shawn Wang [00:00:05]: So welcome to the Latent Space pod. We’re back in the studio with our special MechInterp co-host, Vibhu. Welcome. Mochi, Mochi’s special co-host. And Mochi, the mechanistic interpretability doggo. We have with us Mark and Myra from Goodfire. Welcome. Thanks for having us on. Maybe we can sort of introduce Goodfire and then introduce you guys. How do you introduce Goodfire today?
Myra Deng [00:00:29]: Yeah, it’s a great question. So Goodfire, we like to say, is an AI research lab that focuses on using interpretability to understand, learn from, and design AI models. And we really believe that interpretability will unlock the new generation, next frontier of safe and powerful AI models. That’s our description right now, and I’m excited to dive more into the work we’re doing to make that happen.
Shawn Wang [00:00:55]: Yeah. And there’s always like the official description. Is there an understatement? Is there an unofficial one that sort of resonates more with a different audience?
Mark Bissell [00:01:01]: Well, being an AI research lab that’s focused on interpretability, there’s obviously a lot of people have a lot that they think about when they think of interpretability. And I think we have a pretty broad definition of what that means and the types of places that can be applied. And in particular, applying it in production scenarios, in high stakes industries, and really taking it sort of from the research world into the real world. Which, you know. It’s a new field, so that hasn’t been done all that much. And we’re excited about actually seeing that sort of put into practice.
Shawn Wang [00:01:37]: Yeah, I would say it wasn’t too long ago that Anthropic was like still putting out like toy models of superposition and that kind of stuff. And I wouldn’t have pegged it to be this far along. When you and I talked at NeurIPS, you were talking a little bit about your production use cases and your customers. And then not to bury the lede, today we’re also announcing the fundraise, your Series B. $150 million. $150 million at a 1.25B valuation. Congrats, Unicorn.
Mark Bissell [00:02:02]: Thank you. Yeah, no, things move fast.
Shawn Wang [00:02:04]: We were talking to you in December and already some big updates since then. Let’s dive, I guess, into a bit of your backgrounds as well. Mark, you were at Palantir working on health stuff, which is really interesting because Goodfire has some interesting like health use cases. I don’t know how related they are in practice.
Mark Bissell [00:02:22]: Yeah, not super related, but I don’t know. It was helpful context to know what it’s like. Just to work. Just to work with health systems and generally in that domain. Yeah.
Shawn Wang [00:02:32]: And Myra, you were at Two Sigma, which actually I w