SMDA Traces Training Data Influence on LLM Behavioral Policies

Reza Habibi, Darian Lee, Magy Seif El-Nasr · June 30, 2026

Summary

Researchers introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes specific training examples to the interpretable symbolic policies governing an LLM's high-level behavior. SMDA offers a fine-grained diagnostic tool to understand how training data shapes model decisions, revealing safety gaps and unintended influences.

A new framework called Symbolic Mechanistic Data Attribution (SMDA) has been developed to bridge the gap between identifying training data's influence on specific model circuits and understanding its impact on high-level behavioral policies. SMDA attributes individual training examples to the interpretable symbolic rules that dictate a model's actions, providing a deeper level of explainability.

SMDA functions by fitting a Ridge regression over sparse autoencoder (SAE) features to model a target behavior. It then analytically breaks down how each supervised fine-tuning example alters this policy through changes in feature activation (Delta_X) and output probability (Delta_Y). This allows for a mechanistic explanation of how training data shapes model decisions.

Applying SMDA to Llama-3.2-3B-Instruct's refusal behavior, the analysis uncovered several key insights. It exposed systematic gaps in the base model's safety policies, such as those related to religious stereotyping. Furthermore, SMDA could mechanistically explain why different training examples (harmful vs. harmless) have distinct impacts on specific features and identified instances where training pairs inadvertently influenced unintended features. This framework offers a more detailed and scalable diagnostic tool than previous methods.
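The Ridge-regression step described above can be illustrated with a minimal sketch. This summary does not give the paper's exact analytical decomposition, so the code below is an assumption: it uses the closed-form ridge solution and one simple, exact way to split a policy shift into a part driven by the change in feature activations (Delta_X) and a part driven by the change in output probability (Delta_Y). All function names, the regularization strength, and the synthetic data are hypothetical stand-ins for real SAE activations and refusal probabilities.

```python
import numpy as np

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def decompose_policy_shift(X_before, y_before, X_after, y_after, lam=1.0):
    """Split the change in ridge weights into an X-driven and a Y-driven part.

    Assumption (not taken from the paper): a sequential split that first
    changes the features with the targets held fixed, then changes the
    targets with the new features held fixed. The two parts sum exactly
    to the total weight change, but the split depends on this ordering.
    """
    w_before = ridge_weights(X_before, y_before, lam)
    w_after = ridge_weights(X_after, y_after, lam)
    w_mixed = ridge_weights(X_after, y_before, lam)  # new features, old targets
    effect_x = w_mixed - w_before   # contribution of Delta_X
    effect_y = w_after - w_mixed    # contribution of Delta_Y
    return w_before, w_after, effect_x, effect_y

# Synthetic stand-ins: rows are probe prompts, columns are SAE features.
rng = np.random.default_rng(0)
n_prompts, n_features = 50, 8
X_before = rng.random((n_prompts, n_features))   # feature activations before a fine-tuning step
y_before = rng.random(n_prompts)                 # e.g. refusal probability before the step

# Hypothetical effect of one fine-tuning example on activations and outputs.
delta_X = 0.05 * rng.standard_normal((n_prompts, n_features))
delta_Y = 0.05 * rng.standard_normal(n_prompts)
X_after = X_before + delta_X
y_after = y_before + delta_Y

w_before, w_after, effect_x, effect_y = decompose_policy_shift(
    X_before, y_before, X_after, y_after
)

# Report the feature whose policy weight moved the most, and why.
top = int(np.argmax(np.abs(w_after - w_before)))
print(f"feature {top}: total shift {w_after[top] - w_before[top]:+.4f}, "
      f"via Delta_X {effect_x[top]:+.4f}, via Delta_Y {effect_y[top]:+.4f}")
```

In this sketch the ridge weights play the role of the symbolic policy: each weight says how strongly one SAE feature pushes the target behavior. A first-order (linearized) split would be another reasonable choice; the sequential split is used here only because it is exact and easy to check.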

Why it matters

For professionals involved in AI safety, ethics, and model auditing, SMDA shows how individual training examples shape LLM behavior at the level of interpretable features. This makes it possible to pinpoint and correct biases, safety gaps, and unintended model responses, which is crucial for responsible AI deployment.

How to implement this in your domain

1. Integrate SMDA into your LLM development pipeline for auditing and debugging model behavior, especially for safety-critical applications.
2. Use SMDA to identify and address specific training examples that contribute to undesirable or biased model policies.
3. Apply SMDA to analyze the impact of fine-tuning datasets on model safety and ethical guidelines.
4. Develop internal expertise in mechanistic interpretability to fully leverage SMDA's capabilities for model transparency.

Who benefits

AI Ethics & Governance, Cybersecurity, Content Moderation, LegalTech, Financial Services

Key takeaways

SMDA links specific training data to high-level LLM behavioral policies.
It uses Ridge regression over SAE features for interpretable attribution.
SMDA reveals safety gaps and unintended influences from training data.
This framework offers fine-grained, scalable diagnostics for AI safety.

Original post by Reza Habibi, Darian Lee, Magy Seif El-Nasr

"arXiv:2606.29171v1 Announce Type: new Abstract: While existing data attribution methods can identify which training examples build specific mechanistic circuits, they cannot explain how training data shapes the high-level behavioral decisions a model learns to make. To bridge thi…"
