Researchers at OpenAI have developed new methods for interpreting the neural activity within language models, providing a glimpse into the concepts and patterns that drive their behaviour.
As explained in the research paper, "The resulting networks are not well understood and cannot be easily decomposed into identifiable parts. This means we cannot reason about AI safety the same way we reason about something like car safety." By using sparse autoencoders, the researchers aim to decompose a model's internal activations into a large set of features, only a handful of which are active at a time, with the goal that each feature aligns with a human-understandable concept.
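To make the idea concrete, below is a minimal sketch of a sparse autoencoder with a top-k sparsity constraint, written in PyTorch. The dimensions, the class name, and the hyperparameters are illustrative assumptions, not the values or code used in the paper; the paper's actual training setup is more involved.

```python
# Minimal sketch of a top-k sparse autoencoder over model activations.
# Assumes PyTorch; all sizes and names here are illustrative.
import torch
import torch.nn as nn


class TopKSparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.k = k

    def forward(self, activations: torch.Tensor):
        # Project the dense activations into a much larger feature space.
        pre_acts = self.encoder(activations)
        # Keep only the k strongest feature activations per example,
        # zeroing the rest to enforce sparsity.
        topk = torch.topk(pre_acts, self.k, dim=-1)
        features = torch.zeros_like(pre_acts)
        features.scatter_(-1, topk.indices, torch.relu(topk.values))
        # Reconstruct the original activations from the sparse features.
        reconstruction = self.decoder(features)
        return features, reconstruction


# Training minimises reconstruction error, so each activation vector is
# explained by only a handful of active features.
sae = TopKSparseAutoencoder(d_model=4096, n_features=65536, k=32)
acts = torch.randn(8, 4096)  # stand-in for residual-stream activations
features, recon = sae(acts)
loss = torch.nn.functional.mse_loss(recon, acts)
```

The sparsity constraint is what makes the features candidates for interpretation: because only a few fire on any given input, each one is pushed to specialise.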
As they note in their paper, "We find that our methodology demonstrates smooth and predictable scaling, with better returns to scale than prior techniques." These scaling properties enable a more comprehensive analysis of the concepts encoded in the language model.
Using these improved training methods, the researchers trained a 16-million-feature sparse autoencoder on GPT-4 activations. By visualising the documents where each feature activates most strongly, they were able to identify a range of interpretable patterns (a sketch of this kind of inspection follows the list), such as:
- Human imperfection
- Price increases
- Rhetorical questions
- Algebraic rings
- Dopamine receptors
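The inspection step itself is simple in principle: for each feature, rank documents by how strongly that feature fires and read the top of the list. The sketch below assumes a trained autoencoder like the one sketched earlier, a hypothetical `get_activations(text)` helper that returns per-token model activations, and an in-memory corpus; these names are illustrative, not from the paper.

```python
# Sketch of surfacing the documents on which a single feature fires most
# strongly. `sae`, `get_activations`, and `documents` are assumed inputs.
import heapq


def top_activating_documents(sae, get_activations, documents, feature_idx, top_n=10):
    """Return the documents where feature `feature_idx` reaches its highest activation."""
    scored = []
    for doc in documents:
        acts = get_activations(doc)      # shape: (n_tokens, d_model)
        features, _ = sae(acts)          # shape: (n_tokens, n_features)
        # Score the document by the feature's peak activation over its tokens.
        score = features[:, feature_idx].max().item()
        scored.append((score, doc))
    # Keep the documents with the strongest activations for this feature.
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[0])
```

Reading the highest-scoring documents for a feature, and the specific tokens where it fires, is what lets researchers attach a human label such as "human imperfection" or "price increases".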
For example, the "Human Imperfection" feature activates on phrases related to the flaws and limitations of humans, while the "Price Increases" feature activates on text discussing rising costs and prices.
However, as the authors acknowledge, this research is still in its early stages. To fully map the concepts in frontier language models, autoencoders may need to scale to billions or even trillions of features.
As they state in the paper, "Ultimately, we hope that one day, interpretability can provide us with new ways to reason about model safety and robustness, and significantly increase our trust in powerful AI models by giving strong assurances about their behaviour."
By identifying interpretable features and patterns in the neural activity, this work provides some insight into the concepts and mechanisms that drive these systems.