Bricken, Trenton, Adly Templeton, Joshua Batson, et al. 2023.
“Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Transformer Circuits Thread.
https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Elhage, Nelson, Tristan Hume, Catherine Olsson, et al. 2022.
“Toy Models of Superposition.” Transformer Circuits Thread.
https://transformer-circuits.pub/2022/toy_model/index.html.
Hardt, Moritz, Eric Price, and Nathan Srebro. 2016.
“Equality of Opportunity in Supervised Learning.” Advances in Neural Information Processing Systems 29.
https://arxiv.org/abs/1610.02413.
Hendrycks, Dan, and Kevin Gimpel. 2017.
“A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” International Conference on Learning Representations.
https://arxiv.org/abs/1610.02136.
Lundberg, Scott M., and Su-In Lee. 2017.
“A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems 30.
https://arxiv.org/abs/1705.07874.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Mothilal, Ramaravind K., Amit Sharma, and Chenhao Tan. 2020.
“Explaining Machine Learning Classifiers Through Diverse Counterfactual Explanations.” Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency.
https://arxiv.org/abs/1905.07697.
Neuer, Michael, et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017.
“Axiomatic Attribution for Deep Networks.” Proceedings of the 34th International Conference on Machine Learning.
https://arxiv.org/abs/1703.01365.
Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024.
“Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Transformer Circuits Thread.
https://transformer-circuits.pub/2024/scaling-monosemanticity/.