Deepdive: Mind of LLM, Mamba-2, Dask
Anthropic has unveiled a groundbreaking paper that delves into the internal workings of a Large Language Model (LLM), offering unprecedented insights into the previously mysterious “black box” nature of these models. By employing a technique called “dictionary learning,” the research team successfully mapped the internal states of Claude 3 Sonnet, isolating patterns of neuron activations and representing complex model states with fewer active features. This innovative approach revealed a conceptual map within the model, showing how features related to similar concepts, such as “inner conflict,” cluster together. Even more astonishing, the researchers found that by manipulating these features, they could alter the model’s behavior—an advancement with significant implications for AI safety. This study represents a major leap in understanding and potentially controlling LLMs, though challenges remain in fully mapping and leveraging these features for practical safety applications.
Mind of LLM by Anthropic
Anthropic has just released a groundbreaking paper that inspects a Large Language Model from the inside for the very first time.
The Problem of Neural Networks as Black Boxes
Until now, such models have acted like black boxes: an input produces an output, with no clear understanding of the reasons that lead to it.
More specifically, as you might know, these models are composed of neurons that either fire or don’t, depending on their activation functions. The combination of neurons that activate for a given input makes up features, and the combination of features forms the internal state of the model.
Well, most LLM neurons are uninterpretable, which stops us from mechanistically understanding the models.
A new technique to map the internal states of the model
The Anthropic team used a technique called “dictionary learning”, borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons, which amounts to decomposing the model into features.
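To get a feel for what that decomposition means, here is a minimal sketch of classical dictionary learning with scikit-learn applied to synthetic “activation” vectors. The sizes and data are hypothetical; Anthropic’s own implementation relies on sparse autoencoders trained on Claude’s activations and operates at a vastly larger scale.

import numpy as np
from sklearn.decomposition import DictionaryLearning

# Synthetic stand-in for neuron activations: 200 "internal states",
# each a dense vector of 64 neuron activations (hypothetical sizes).
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 64))

# Learn a dictionary of 16 features; each state is then re-expressed
# as a sparse combination of those features.
dl = DictionaryLearning(n_components=16, alpha=1.0, max_iter=50, random_state=0)
codes = dl.fit_transform(activations)

# Each row of `codes` has only a handful of non-zero entries: a few
# active features instead of many active neurons.
print((codes != 0).sum(axis=1).mean(), "active features per state on average")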
This technique was previously used on very tiny models and showed promising results. This is what motivated the researchers to apply the “dictionary learning” technique to Claude 3 Sonnet. And the results are mind-blowing.
The Anthropic team was able to extract millions of features, creating a conceptual map of the model’s internal states halfway through its computation. These features are remarkably abstract, often representing the same concept across contexts and languages, even generalizing to image inputs.
What the inside of an LLM looks like
What they found was that the distance between features reflects the similarity of the concepts they represent. In other words, the model holds an internal map in which related concepts sit close to one another, just like their associated features.
For example, looking near a feature related to the concept of “inner conflict”, they found features related to relationship breakups, conflicting allegiances, logical inconsistencies, and so on.
But what’s even more interesting is that it is possible to artificially manipulate those features, and that emphasizing or deemphasizing certain features changes the behavior of the model. This is a groundbreaking result and a huge step forward for ensuring the safety of these models.
For example, researchers found a specific feature associated with blindly agreeing with the user. Artificially activating this feature completely alters the response and behavior of the model.
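Conceptually, this kind of intervention amounts to nudging the model’s internal activations along a feature’s direction. The toy NumPy sketch below illustrates the idea only; the names, shapes, and steering function are hypothetical and unrelated to Claude’s actual internals.

import numpy as np

# Toy illustration of feature steering; all names and shapes are hypothetical.
hidden_state = np.random.randn(128)          # one internal activation vector
sycophancy_feature = np.random.randn(128)    # direction of a learned feature
sycophancy_feature /= np.linalg.norm(sycophancy_feature)

def steer(state, feature, strength):
    # Emphasize (strength > 0) or de-emphasize (strength < 0) a feature
    # by shifting the activation vector along the feature's direction.
    return state + strength * feature

amplified = steer(hidden_state, sycophancy_feature, strength=5.0)
suppressed = steer(hidden_state, sycophancy_feature, strength=-5.0)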
Why It Matters
This opens the door to mapping all of an LLM’s features and manipulating them to improve safety, for example by suppressing some features and artificially activating others.
However, there is still work to be done, as mapping all existing features would require more computation than training the model itself. Plus, knowing the activations and their significance doesn’t give us the circuits they are involved in. And lastly, nothing guarantees that this technique will actually make AI models safer, even if the approach seems promising.
Mamba-2
After the success of Mamba-1, which accumulated over 10k stars on GitHub, researchers Tri Dao and Albert Gu have introduced Mamba-2.
Mamba is a new state space model architecture showing promising performance on information-dense data such as language modeling, where previous subquadratic models fall short of transformers.
You can use Mamba-2 with PyTorch to build and train neural networks handling information-dense data efficiently.
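To try the standalone block, the authors’ mamba-ssm package (from the state-spaces/mamba repository) exposes a Mamba2 module; the sketch below follows the repository’s README example and assumes a recent mamba-ssm version and a CUDA GPU.

import torch
from mamba_ssm import Mamba2

batch, length, dim = 2, 64, 256
x = torch.randn(batch, length, dim).to("cuda")

block = Mamba2(
    d_model=dim,   # model dimension
    d_state=64,    # SSM state expansion factor (64-256 in Mamba-2)
    d_conv=4,      # local convolution width
    expand=2,      # block expansion factor
).to("cuda")

y = block(x)       # output has the same shape as the input
assert y.shape == x.shape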
Core Innovation: Structured State Space Duality (SSD)
Mamba-2’s core innovation, SSD, combines SSMs and attention mechanisms. SSD constrains the recurrent matrix A to a scalar-times-identity structure, simplifying computations and enhancing hardware efficiency.
This design enables multi-head SSMs, increasing the state size from 16 in Mamba-1 to 64-256 in Mamba-2, and uses matrix multiplications optimized for GPUs/TPUs.
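As a rough intuition for the scalar-times-identity constraint, the NumPy sketch below runs the underlying linear recurrence for a single head. Shapes are purely illustrative, and this is not the actual Mamba-2 implementation, which reformulates the same computation as blocked matrix multiplications.

import numpy as np

# Single-head linear recurrence with A_t = a_t * I; shapes are illustrative only.
T, d_state, d_head = 8, 64, 16
x = np.random.randn(T, d_head)      # input sequence for one head
a = np.random.rand(T)               # per-step scalar decay (the "A" matrix)
B = np.random.randn(T, d_state)     # per-step input projection
C = np.random.randn(T, d_state)     # per-step output projection

h = np.zeros((d_state, d_head))     # recurrent state
y = np.zeros_like(x)
for t in range(T):
    # Because A_t is a scalar times the identity, updating the state only
    # requires scaling it by one number before adding the new input.
    h = a[t] * h + np.outer(B[t], x[t])
    y[t] = C[t] @ h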
Performance Improvements
Mamba-2 trains 50% faster than Mamba-1 and handles larger state dimensions. At the 3B scale, Mamba-2, trained on 300B tokens, surpasses Mamba-1 and older transformers in performance. The model significantly outperforms Mamba-1 on tasks like multi-query associative recall (MQAR) due to its larger state sizes.
Architectural Changes
Mamba-2 introduces parallel parameter generation for A, B, and C, enabling tensor parallelism and scaling. It maintains efficient memory usage with a constant state size per channel and leverages matrix multiplications for faster computation. These changes simplify the architecture and improve scalability.
Empirical Results
In language modeling tasks, Mamba-2 shows slightly better scaling than Mamba-1 under Chinchilla scaling laws, with faster training times. Pretrained models ranging from 130M to 2.8B parameters are available, trained on datasets such as the Pile and SlimPajama. Performance is consistent across architectures, with minor variations in zero-shot results attributable to evaluation noise.
Key Specifications
- State Size: Increased from 16 (Mamba-1) to 64-256 (Mamba-2)
- Training Speed: 50% faster than Mamba-1
- Scale: Models range from 130M to 2.8B parameters
- Datasets: Trained on Pile and SlimPajama
- Evaluation Tasks: Includes MQAR and zero-shot evaluations on various benchmarks
Dask
Use Dask to handle larger-than-memory datasets with familiar Pandas-like syntax. Dask provides parallel computing, out-of-core computation, and scalability beyond what Pandas can offer, making it ideal for very large datasets.
Benefits:
- Scalability: Handle datasets larger than your available memory.
- Parallel Computing: Leverage multiple cores for faster computation.
- Familiar Syntax: Use a syntax similar to Pandas, minimizing the learning curve.
Using Dask can significantly optimize your data processing workflows, especially when dealing with very large datasets.
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
# Create a Dask client
client = Client()
# Create a sample DataFrame
data = {
'column_name': ['A', 'B', 'A', 'B', 'A', 'B'],
'another_column': [1, 2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)
# Convert the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(df, npartitions=2)
# Perform operations just like you would with Pandas
result = ddf.groupby('column_name').mean().compute()
# Print the result
print(result)
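The example above starts from an in-memory Pandas DataFrame for illustration. In the larger-than-memory case, you would typically read files directly into a Dask DataFrame; the path pattern below is a placeholder, and the column names reuse those from the example above.

import dask.dataframe as dd

# Lazily read many CSV files as one Dask DataFrame; nothing is loaded into
# memory until .compute() is called.
ddf = dd.read_csv('data/part-*.csv')   # placeholder glob pattern
result = ddf.groupby('column_name')['another_column'].mean().compute()
print(result)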