Abstract
Neural models have seen exponential growth, both in scale and in deployment, in the years
since transformers and large language models were developed. The scale of these models and of their
training data has enabled them to reach near- (and in some cases super-) human performance on several
tasks. However, this progress raises concerns about value misalignment and potential misbehaviour of these models
in high-stakes situations. This creates the need for a more fine-grained, general, and mathematical
understanding of how these models function, with the objective of reliably and generally
predicting and controlling their behaviour. This is the central effort of interpretability, a field of study
aiming to reduce the heavily overparameterized functions implemented by neural nets to simple, sparse,
and abstract causal models.
However, the relative immaturity of the discipline has meant that no consensus has formed on rigorous
paradigms, techniques, and experimental standards. In this thesis, we present a proof of concept that analogy with
the natural sciences can form a valuable foundation for achieving the long-term aims of interpretability;
in particular, we leverage the reductionist approach to understanding complex systems, and apply it to
the study of deep models.
We restrict our scope to models that operate on natural language – or, more generally, text – rather than
other modalities such as images, audio, or time series. We therefore take inspiration from computational
linguistics, which in its early phases relied on a remarkably expressive reduction of natural language
– formal grammars. We exploit this concept to idealize the conditions under which we examine neural
language models, and present a study that operationalizes this intuition.
Concretely, we examine the recently popular sparse autoencoder (SAE) method for interpretability.
This method centres on two-layer MLPs with a sparse, overcomplete hidden representation, trained
to encode a latent space of a large model, in the hope that a meaningful semantic decomposition of this
space arises. We use language models trained on formal grammars, attempt to uncover relevant features
with this approach, and try to identify properties of the approach that bear on its usability. Our
findings align for the most part with existing conclusions on the properties of SAEs (although these were
based mostly on experiments in the image domain), such as their sensitivity to inductive biases and their lack
of robustness. Most significantly, we note that the features identified by SAEs are rarely causally relevant
– ablating them fails to produce the expected effects most of the time. As causality has emerged as a
widely agreed-upon sine qua non among interpretability researchers, this is a major deficiency of the method. We accordingly propose a modification of the pipeline that aims to incentivize the causality
of the identified features, and demonstrate its efficacy in the same formal-grammar setting.
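To fix notation, a minimal sketch of the SAE formulation assumed in this summary is the standard one (the nonlinearity $\sigma$, the sparsity penalty, and the symbol names below are generic choices for illustration, not those of any particular implementation): an encoder–decoder pair
\[
z = \sigma\!\left(W_{\text{enc}}\, x + b_{\text{enc}}\right), \qquad \hat{x} = W_{\text{dec}}\, z + b_{\text{dec}},
\]
trained to minimize a reconstruction loss with a sparsity penalty on the overcomplete code $z$,
\[
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1,
\]
in the hope that individual coordinates of $z$ correspond to interpretable features of the model's latent space.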
Overall, we believe that our results demonstrate the potential of importing scientific modi operandi
into interpretability, and more specifically, the capacity of reductionism to provide useful insights into
the functioning of deep models.