The inner workings of LLMs have largely remained a mystery, making it difficult to trust their safety and reliability. In a significant breakthrough, researchers at Anthropic have begun to map the mind of Claude Sonnet, providing the first-ever detailed look inside a modern, production-grade large language model.
To understand the significance of this discovery, it's essential to grasp the concept of a "black box" in AI. An AI model takes an input, processes it, and generates a response, but why it produced that particular response is often unclear. The model's internal state consists of a long list of numbers called "neuron activations," which lack clear meaning to human observers.
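To make that "long list of numbers" concrete, the short sketch below pulls a middle-layer activation vector out of a small open model. GPT-2 stands in purely for illustration, since Claude's internals are not publicly accessible; the point is simply that the raw numbers mean nothing to a human reader on their own.

```python
# Illustrative only: GPT-2 stands in for a production model whose weights are
# not public. Requires the `torch` and `transformers` packages.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("The Golden Gate Bridge", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The internal state at the middle layer: one long vector of neuron
# activations per token, with no obvious meaning to a human observer.
middle = outputs.hidden_states[len(outputs.hidden_states) // 2]
print(middle.shape)       # (batch, tokens, hidden_size), 768 numbers per token for GPT-2
print(middle[0, -1, :8])  # a few raw activations for the final token
```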
Researchers at Anthropic have made progress in matching patterns of neuron activations, known as "features," to human-interpretable concepts using a technique called "dictionary learning." This method isolates recurring patterns of neuron activations across different contexts, allowing the model's internal state to be represented by a few active features instead of many active neurons.
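Anthropic's flavour of dictionary learning is a sparse autoencoder trained on the model's activations. The toy sketch below shows the core recipe with made-up dimensions and a made-up sparsity penalty; it illustrates the idea rather than reproducing Anthropic's implementation.

```python
# A toy sparse autoencoder, the flavour of dictionary learning described above.
# It learns to rebuild activation vectors as sparse combinations of "feature"
# directions. Sizes and the sparsity penalty are illustrative, not Anthropic's.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)  # activations -> feature strengths
        self.decoder = nn.Linear(n_features, n_neurons)  # features -> reconstructed activations

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly zero
        return features, self.decoder(features)

sae = SparseAutoencoder(n_neurons=768, n_features=16384)  # 768 matches the stand-in model above
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3                                          # pushes most feature activations to zero

# Stand-in for a large set of middle-layer activations collected from the model.
activations = torch.randn(4096, 768)

for batch in activations.split(256):
    features, reconstruction = sae(batch)
    loss = ((reconstruction - batch) ** 2).mean() + l1_weight * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The reconstruction term keeps the features faithful to the original activations, while the sparsity penalty keeps only a handful of them active at any one time, which is what lets each feature stand in for a candidate concept.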
As Anthropic explains, "Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features."
By applying dictionary learning to Claude Sonnet, researchers successfully extracted millions of features from the model's middle layer, creating a rough conceptual map of its internal states halfway through its computation. The features found in Sonnet reflect its advanced capabilities, corresponding to a wide range of entities such as cities, people, atomic elements, scientific fields, and programming syntax. These features are multimodal and multilingual, responding to images of a given entity as well as to its name or description in various languages.
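Once such an autoencoder is trained, any single activation vector can be rewritten as a short list of active features. Continuing the toy sketch above (the feature indices here are meaningless; in the real study each one is matched to a concept by inspecting which texts and images activate it most strongly):

```python
# Decompose one activation vector into its strongest features (toy sketch,
# reusing `sae` and `activations` from the previous example).
with torch.no_grad():
    features, _ = sae(activations[:1])

strengths, indices = features[0].topk(5)
for idx, strength in zip(indices.tolist(), strengths.tolist()):
    print(f"feature {idx}: activation {strength:.3f}")
```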
Moreover, the researchers discovered more abstract features that activate on concepts such as bugs in computer code, gender bias in professions, and conversations about keeping secrets.
The internal organisation of these concepts appears to track human notions of similarity: features representing related concepts sit close to one another, which may explain Claude's ability to make analogies and metaphors.
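One hedged way to picture that organisation, still within the toy sketch: each feature writes a direction back into the model through the decoder, and features whose decoder directions are nearly parallel tend to encode related concepts, so a nearest-neighbour lookup recovers clusters of similar ideas.

```python
import torch.nn.functional as F

# Each column of the decoder weight is the direction one feature writes back
# into the model. Cosine similarity between those directions gives a rough
# "distance between concepts" (toy sketch; feature 123 is arbitrary).
directions = F.normalize(sae.decoder.weight.detach().T, dim=-1)  # shape: (n_features, n_neurons)
similarity = directions @ directions[123]
print(similarity.topk(6).indices.tolist())  # the feature itself plus its nearest neighbours
```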
The ability to map and manipulate features in large language models may have significant implications for AI safety and reliability. By understanding how models internally represent the world and use these representations in their behaviour, researchers can work towards making AI systems safer and more trustworthy.
One potential application is monitoring AI systems for dangerous behaviours, such as deceiving users or producing harmful content. By identifying features associated with these behaviours, researchers can develop techniques to steer models towards more desirable outcomes or remove dangerous subject matter entirely.
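A hedged sketch of what such steering can look like, again reusing the stand-in model and toy autoencoder from earlier: Anthropic's experiments clamp a feature's value inside the autoencoder's reconstruction, whereas the simpler variant below just adds (or, with a negative scale, subtracts) the feature's decoder direction at the middle layer during a forward pass. The layer choice, feature index, and scale are all illustrative.

```python
# Amplify (positive scale) or suppress (negative scale) one feature's concept
# by nudging the middle-layer hidden state along that feature's direction.
feature_id = 123                            # hypothetical feature of interest
scale = 10.0
direction = sae.decoder.weight[:, feature_id].detach()

def steer(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

middle_block = model.h[len(model.h) // 2]   # GPT-2 stand-in from the first sketch
handle = middle_block.register_forward_hook(steer)
with torch.no_grad():
    steered = model(**inputs)               # same prompt, now nudged towards the concept
handle.remove()                             # detach the hook once finished
```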
Furthermore, this research could enhance other safety techniques, such as Constitutional AI, by understanding how they shift the model towards more harmless and honest behaviour and identifying any gaps in the process. As Anthropic notes, "The latent capabilities to produce harmful text that we saw by artificially activating features are exactly the sort of thing jailbreaks try to exploit."
While this breakthrough represents a step forward in AI interpretability research, there is still much work to be done. The features discovered so far represent only a small subset of the concepts the model learned during training, and finding a full set with current techniques would be cost-prohibitive; by Anthropic's account, the computation required would exceed the compute used to train the model in the first place.
Additionally, understanding the representations the model uses doesn't fully explain how it uses them; researchers still need to identify the circuits in which these features are involved.
Finally, further research is needed to demonstrate that safety-relevant features can be effectively used to improve AI safety.
The successful mapping of Claude Sonnet's internal representations marks a crucial step forward in understanding the inner workings of large language models. By providing a detailed look inside a production-grade AI model, this research may pave the way for safer, more reliable, and more trustworthy AI systems.