Anthropic develops ‘AI microscope’ to reveal how large language models think

In what might be a significant AI breakthrough, Anthropic researchers said they have developed a new tool to help understand how large language models (LLMs) actually work. The AI startup behind Claude said the new tool is capable of deciphering how LLMs think. Taking inspiration from the field of neuroscience, Anthropic said it was able to build a kind of AI microscope that “let us identify patterns of activity and flows of information.”
“Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to,” the company said in a blog post published on Thursday, March 27.

Beyond their capabilities, today’s LLMs are often described as black boxes, since AI researchers have yet to figure out exactly how these models arrive at a particular response without being explicitly programmed to do so. Other grey areas of understanding pertain to AI hallucinations, fine-tuning, and jailbreaking.
However, the potential breakthrough could make the inner workings of LLMs more transparent and understandable. This could, in turn, inform the development of safer, more secure, and more reliable AI models. Addressing AI risks such as hallucinations could also drive greater adoption among businesses.
What Anthropic did
The Amazon-backed startup said it has released two new scientific papers on building a microscope for “AI biology”.
While the first paper focuses on the “parts of the pathway” that transform user inputs into AI-generated outputs inside Claude, the second report sheds light on what exactly happens within Claude 3.5 Haiku when the LLM responds to a user prompt.

As part of its experiments, Anthropic trained an entirely different model called a cross-layer transcoder (CLT). But instead of using weights, the company trained the model using sets of interpretable features, such as the conjugations of a particular verb or any term that suggests “more than”, according to a report by Fortune.
“Our method decomposes the model, so we get pieces that are new, that aren’t like the original neurons, but there’s pieces, which means we can actually see how different parts play different roles,” Anthropic researcher Josh Batson was quoted as saying.

“It also has the advantage of allowing researchers to trace the entire reasoning process through the layers of the network,” he said.
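To make the idea more concrete, the sketch below is a rough, hypothetical illustration (not Anthropic’s actual code) of what a cross-layer transcoder-style “replacement model” could look like: a small network trained to re-express the original model’s internal activations as a sparse bank of more human-interpretable features. The class name, layer sizes, and the ReLU-plus-sparsity training hint are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class CrossLayerTranscoder(nn.Module):
    """Toy stand-in for a cross-layer transcoder: it re-expresses a
    transformer's activations as a sparse set of candidate features."""

    def __init__(self, d_model: int, n_features: int, n_layers: int):
        super().__init__()
        # Encoder maps one layer's activations to a larger bank of features.
        self.encoder = nn.Linear(d_model, n_features)
        # One decoder per downstream layer, so a single feature can
        # contribute to the model's computation at several later layers.
        self.decoders = nn.ModuleList(
            [nn.Linear(n_features, d_model, bias=False) for _ in range(n_layers)]
        )

    def forward(self, activations: torch.Tensor):
        # ReLU keeps only strongly active features, which (together with a
        # sparsity penalty during training) pushes each feature to stand for
        # something recognisable, e.g. terms that suggest "more than".
        features = torch.relu(self.encoder(activations))
        # Reconstruct each later layer's contribution from the same features.
        reconstructions = [decoder(features) for decoder in self.decoders]
        return features, reconstructions


# Illustrative usage with made-up sizes: the training objective (not shown)
# would be reconstruction error against the real model's activations plus a
# sparsity penalty on `features` to keep them few and interpretable.
clt = CrossLayerTranscoder(d_model=512, n_features=4096, n_layers=8)
features, reconstructions = clt(torch.randn(1, 512))
```

Because every feature in such a substitute model is tied to a recognisable pattern rather than a raw neuron, researchers can follow which features activate at each step, which is what Batson describes as tracing the reasoning process through the layers of the network.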
Findings of Anthropic researchers
After examining the Claude 3.5 Haiku model using its “AI microscope,” Anthropic found that the LLM plans ahead about what it is going to say. For instance, when asked to write a poem, Claude first identifies rhyming words related to the poem’s theme or topic and then works backwards to construct lines that end in those rhyming words.
Importantly, Anthropic said it discovered that Claude is capable of making up a fictitious reasoning process. This means that the reasoning model can sometimes only appear to “think through” a tough math problem, rather than accurately representing the steps it is actually taking.
This discovery appears to contradict what tech companies like OpenAI have been saying about reasoning AI models and “chain of thought”. “Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of this having occurred,” Batson said.
In the case of hallucinations, Anthropic said that “Claude’s default behaviour is to decline to speculate when asked a question, and it only answers questions when something inhibits this default reluctance.”
In response to an example jailbreak, Anthropic found that “the model recognised it had been asked for dangerous information well before it was able to gracefully bring the conversation back around.”
Research gaps in the study
Anthropic acknowledged that its method to open up the AI black box had a few drawbacks. “It is only an approximation of what is actually happening inside a complex model like Claude,” the company clarified.

It also pointed out that there may be neurons that exist outside the circuits identified through the CLT method, even though they may play a role in determining the outputs of the model.
“Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude, and the mechanisms we do see may have some artefacts based on our tools which don’t reflect what is going on in the underlying model,” Anthropic said.
