In a new research paper, Anthropic has begun to map out the inner workings of Claude, shedding light on the millions of concepts that activate within the model's "mind" when it encounters relevant text or images.

The research team at Anthropic has identified what they call "features" within Claude's neural network - specific combinations of neurons that activate when the model encounters mentions or images of particular concepts. For example, they discovered a feature that activates when Claude encounters the Golden Gate Bridge.

The researchers found that they can not only identify these features but also tune their activation strength up or down, leading to corresponding changes in Claude's behaviour. When the "Golden Gate Bridge" feature is strengthened, Claude's responses begin to focus on the bridge, even if it's not directly relevant to the query at hand. 

This demonstrates a level of interpretability and control over the model's internal representations that was previously unattainable. By gaining a deeper understanding of how these models work internally, researchers can potentially improve their performance, safety, and alignment with company values.

Anthropic explains in their paper, they can use these same techniques to change the strength of safety-related features, such as those related to dangerous computer code, criminal activity, or deception.

The process of identifying and manipulating features is currently time-consuming and requires significant computational resources. Additionally, the effects of manipulating features can sometimes be unexpected or undesirable, highlighting the need for further research and refinement. Anthropic plan to continue exploring the inner workings of Claude, with the goal of developing more robust and reliable techniques for interpretability and control.

By beginning to map out the millions of concepts that activate within these models' neural networks, researchers are gaining unprecedented insights into how they function internally. 

Anthropic's work on interpretability is a step in the right direction in the quest to understand the machinations of Large Language Models. 


Golden Gate Claude
When we turn up the strength of the “Golden Gate Bridge” feature, Claude’s responses begin to focus on the Golden Gate Bridge. For a short time, we’re making this model available for everyone to interact with.
Share this post
The link has been copied!