Review: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

3 minute read

A paper review highlighting the key discoveries about attention heads and the algorithms used to find them.

The Discovery

  1. For each attention head, patch its output and observe the impact on the logit difference between the IO and S names. Heads whose patching produces a strong increase in logit difference are writing information against the IO token, while heads whose patching reduces it are contributing toward the IO token. The results are plotted as a layer-by-head heatmap (a minimal patching sketch is given after this list).

  2. Name Mover Heads: Name mover heads attend to names and copy whatever they attend to. This split is possible because the QK circuit (which tokens to attend to) and the OV circuit (what to write once attended) are separate. The next two steps verify this behavior.

  3. Attention Probability Mapping: A high attention probability on a name token means that, in the attention matrix built from the query-key products, the key for that name token scores highly against the query at the END position, i.e. the position where the next-token prediction is made (see the attention sketch after this list).

  4. Copy Score: While the previous step checks whether the name mover heads attend to the correct tokens, the copy score checks whether actual copying takes place (a sketch of the computation also follows the list). The process:
    • Take the name's state from the residual stream after the first MLP layer (the paper uses this as the model's representation of the name; the reason for this particular choice is not spelled out) and project it through the OV matrix of a name mover head, since the OV circuit determines what information the head writes into the residual stream.
    • This projection simulates the head attending to that name with weight 1.0; the result is then multiplied by the unembedding matrix.
    • If the original name appears in the top 5 logits, the head is counted as copying. For name mover heads this happens 95% of the time; for other heads, below 20% of the time. This measure is the copy score.
  5. For the negative name mover heads, the same procedure is applied, except the OV matrix is negated before computing the score; the result is called the negative copy score. For negative name movers it is 98%, compared to 12% for an average head.

  6. Moving backwards from the name mover heads, the direct path from each earlier head (taken individually) into the name movers is patched and the resulting logit difference is assessed, along with changes in the name movers' attention patterns (a simplified sketch follows the list). The heads found this way, the S-inhibitor heads, reduce the extent to which the name movers attend to the S1 and S2 tokens, which would be the incorrect prediction; they are called S-inhibitors because they inhibit attention to the subject token. Specifically, only the paths into the queries of the name movers are analyzed, and the paper gives an argument for discarding the paths into the keys and values.

  7. After identifying the S-inhibitors, the analysis continues further back to the heads preceding them, and a few additional heads are identified. This time it is the effect of earlier heads on the values of the S-inhibitors that is examined; patching into their queries and keys yields no tangible effect.

  8. Duplicate Token Heads are identified as being active at the S2 token, primarily attending to the S1 token. They signal to the S-inhibitor heads that token duplication has occurred.
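
Sketches of the Key Measurements

To make step 1 concrete, here is a minimal sketch of per-head activation patching. It assumes the TransformerLens library (my choice, not the paper's tooling) and a single clean/corrupted prompt pair instead of the paper's full distributions, so it illustrates the idea rather than reproducing the exact path-patching setup.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Clean prompt: the correct next token is the IO name " Mary".
clean_tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
# Corrupted prompt repeats the other name, flipping the expected answer.
corrupt_tokens = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
io_token = model.to_single_token(" Mary")
s_token = model.to_single_token(" John")

_, corrupt_cache = model.run_with_cache(corrupt_tokens)

def logit_diff(logits):
    # Logit difference at the final position: IO name minus subject name.
    return (logits[0, -1, io_token] - logits[0, -1, s_token]).item()

results = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def patch_head(z, hook, head=head):
            # Overwrite this head's output with its value from the corrupted run.
            z[:, :, head, :] = corrupt_cache[hook.name][:, :, head, :]
            return z

        patched_logits = model.run_with_hooks(
            clean_tokens,
            fwd_hooks=[(utils.get_act_name("z", layer), patch_head)],
        )
        results[layer, head] = logit_diff(patched_logits)

# `results` is the (layer x head) grid plotted as the heatmap: heads whose
# patching raises the logit difference were writing against the IO token,
# heads whose patching lowers it were supporting the IO token.
```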
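
For step 3, a small sketch (again assuming TransformerLens) of reading off how strongly a candidate name mover head attends from the END position to the name tokens. Head 9.9 is one of the name mover heads reported in the paper.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

layer, head = 9, 9  # a name mover head reported in the paper
pattern = cache["pattern", layer][0, head]  # [query_pos, key_pos]

str_tokens = model.to_str_tokens(prompt)
io_pos = str_tokens.index(" Mary")   # the IO name
s1_pos = str_tokens.index(" John")   # first occurrence of the subject
end_pos = len(str_tokens) - 1        # END position, where the prediction is made

print(f"END -> IO attention: {pattern[end_pos, io_pos].item():.3f}")
print(f"END -> S1 attention: {pattern[end_pos, s1_pos].item():.3f}")
```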
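
Steps 4 and 5 can be sketched as follows (TransformerLens assumed; applying the final LayerNorm before unembedding is my simplification, not a claim about the paper's exact code). Negating the OV output turns the copy score into the negative copy score. Heads 9.9 and 10.7 are a name mover and a negative name mover reported in the paper.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

str_tokens = model.to_str_tokens(prompt)
name_pos = str_tokens.index(" Mary")
name_id = model.to_single_token(" Mary")

# The name's state in the residual stream after the first MLP layer (block 0).
resid = cache["resid_post", 0][0, name_pos]

def copies_name(layer, head, sign=1.0):
    # Simulate the head attending to the name with weight 1.0:
    # value projection, then output projection (the OV circuit).
    v = resid @ model.W_V[layer, head]          # [d_head]
    out = sign * (v @ model.W_O[layer, head])   # [d_model]
    logits = model.ln_final(out) @ model.W_U    # unembed
    # "Hit" if the original name is among the top-5 logits.
    return bool((logits.topk(5).indices == name_id).any())

print("name mover 9.9, copy score hit:", copies_name(9, 9))
print("negative name mover 10.7, negative copy score hit:", copies_name(10, 7, sign=-1.0))
```

Averaging these hits over many names and prompts gives the percentages quoted above.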
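
Step 6 is the hardest to reproduce faithfully, since the paper patches only the direct paths into the name movers' queries. The cruder sketch below (TransformerLens assumed) simply activation-patches one reported S-inhibitor head, 8.6, and checks both the logit difference and how the name mover's END-to-IO attention changes. The same pattern, aimed at the values of the S-inhibitors instead, corresponds to step 7.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When Mary and John went to the store, John gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
io_token, s_token = model.to_single_token(" Mary"), model.to_single_token(" John")

_, corrupt_cache = model.run_with_cache(corrupt_tokens)

s_layer, s_head = 8, 6    # an S-inhibitor head reported in the paper
nm_layer, nm_head = 9, 9  # a name mover head

def patch_s_inhibitor(z, hook):
    # Replace the S-inhibitor head's output with its corrupted-run value.
    z[:, :, s_head, :] = corrupt_cache[hook.name][:, :, s_head, :]
    return z

store = {}
def grab_name_mover_pattern(pattern, hook):
    # Record the name mover's attention pattern under the patched run.
    store["pattern"] = pattern[0, nm_head].detach()

patched_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[
        (utils.get_act_name("z", s_layer), patch_s_inhibitor),
        (utils.get_act_name("pattern", nm_layer), grab_name_mover_pattern),
    ],
)

str_tokens = model.to_str_tokens(clean_prompt)
io_pos, end_pos = str_tokens.index(" Mary"), len(str_tokens) - 1
print("patched logit diff:",
      (patched_logits[0, -1, io_token] - patched_logits[0, -1, s_token]).item())
print("name mover END -> IO attention:", store["pattern"][end_pos, io_pos].item())
```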