
I'm not sure how much is actually known to write about, but what I'd like to see explained is how transformer-based LLMs/AI really work - not at the mechanistic level of the architecture, but in terms of what they learn (some type of world model? details, not hand waving!) and how they utilize this when processing various types of input.

What type of representations are being used internally in these models? We've got token embeddings going in, and it seems like some type of semantic embeddings internally perhaps, but exactly what? OTOH it's outputting words (tokens) with only a linear layer between the last transformer block and the softmax, so what does that say about the representations at that last transformer block?
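
To make that last point concrete, here's a rough "logit lens"-style sketch of what I mean (my own toy code, assuming GPT-2 via the Hugging Face transformers library - the module names are specific to that implementation): since only the unembedding matrix and a softmax sit after the final block, you can project any layer's residual vector into vocabulary space and see which token it is closest to.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    unembed = model.lm_head.weight   # (vocab, d_model); tied to the input token embeddings in GPT-2
    ln_f = model.transformer.ln_f    # final layer norm applied just before the unembedding

    for layer, h in enumerate(out.hidden_states):    # embedding output plus one state per block
        logits = ln_f(h[0, -1]) @ unembed.T          # project the last position into vocab space
        top = tok.decode([int(logits.argmax())])
        print(f"layer {layer:2d}: closest next token = {top!r}")

In GPT-2-style models the later layers' vectors tend to already "point at" the final prediction under this projection, which is at least a partial answer to what the representation at the last block looks like.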



This is a field I find fascinating. It generally falls under the research field of Machine Learning Interpretability. The BlackboxNLP workshop is one of the main venues for investigating this and is a very popular academic workshop: https://blackboxnlp.github.io/

One of the most interesting presentations in the last session of the workshop is this talk by David Bau titled "Direct Model Editing and Mechanistic Interpretability". David and his team locate exact pieces of information in the model and edit them - for example, they edit the location of the Eiffel Tower to be Rome. Whenever the model then generates anything involving that location (e.g., the view from the top of the tower), it actually describes Rome. (A toy sketch of what such an edit looks like follows the links below.)

Talk: https://www.youtube.com/watch?v=I1ELSZNFeHc

Paper: https://rome.baulab.info/

Follow-up work: https://memit.baulab.info/
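
To get a rough intuition for the editing idea, here's a heavily simplified toy sketch (mine, not the actual ROME update - the real method also uses a covariance statistic over keys and causal tracing to choose the layer and key): treat an MLP projection matrix as a linear key->value store and apply a rank-one update so that one particular key retrieves a new value, while directions orthogonal to it are untouched.

    import numpy as np

    rng = np.random.default_rng(0)
    d_mlp, d_model = 3072, 768                    # GPT-2-small-like sizes, purely illustrative
    W = rng.normal(size=(d_model, d_mlp)) * 0.02  # stand-in for an MLP down-projection weight

    k_star = rng.normal(size=d_mlp)               # "key": activation pattern for the edited subject
    v_star = rng.normal(size=d_model)             # "value": representation we want it to map to

    # Rank-one update: change W only along the direction of k_star.
    delta = np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)
    W_edited = W + delta

    print(np.allclose(W_edited @ k_star, v_star))               # True: the key now retrieves v_star
    k_other = rng.normal(size=d_mlp)
    k_other -= (k_other @ k_star) / (k_star @ k_star) * k_star  # make it orthogonal to k_star
    print(np.allclose(W_edited @ k_other, W @ k_other))         # True: unrelated keys are unchanged

Much of the actual work in ROME is in locating which layer and which key to edit, and in choosing the update so the rest of the model's behaviour is preserved.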

There is also work on "Probing" the representation vectors inside the model and investigating what information is encoded at the various layers. One early Transformer Explainability paper (BERT Rediscovers the Classical NLP Pipeline https://arxiv.org/abs/1905.05950) found that "the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way: POS tagging, parsing, NER, semantic roles, then coreference". Meaning that the representations in the earlier layers encode things like whether a token is a verb or noun, and later layers encode other, higher-level information. I've made an intro to these probing methods here: https://www.youtube.com/watch?v=HJn-OTNLnoE
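
A toy version of such a probe might look like this (my own sketch, assuming BERT via Hugging Face transformers and scikit-learn, with a tiny hand-labelled noun/verb set standing in for a real tagged corpus): freeze the model, take each layer's hidden state for a word of interest, and train a small linear classifier per layer - if the probe does well at layer k, that property is linearly recoverable there.

    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    # Tiny toy data: (sentence, index of the word of interest, label).
    examples = [
        ("the dog runs fast", 2, "VERB"),
        ("the dog is brown", 1, "NOUN"),
        ("cats sleep all day", 1, "VERB"),
        ("cats like warm milk", 0, "NOUN"),
    ]

    n_states = model.config.num_hidden_layers + 1   # embedding output + 12 layers for BERT-base
    feats = [[] for _ in range(n_states)]
    labels = []
    for sent, word_idx, label in examples:
        enc = tok(sent, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, output_hidden_states=True)
        tok_idx = enc.word_to_tokens(word_idx).start   # first subword token of that word
        for layer in range(n_states):
            feats[layer].append(out.hidden_states[layer][0, tok_idx].numpy())
        labels.append(label)

    # With real data you'd evaluate on held-out tokens and add control tasks;
    # here we just fit and report training accuracy per layer.
    for layer in range(n_states):
        probe = LogisticRegression(max_iter=1000).fit(feats[layer], labels)
        print(f"layer {layer:2d}: train accuracy {probe.score(feats[layer], labels):.2f}")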

A lot of applied work doesn't require interpretability and explainability at the moment, but I suspect the interest will continue to increase.


Thanks, Jay!

I wasn't aware of that BERT explainability paper - will be reading it, and watching your video.

Are there any more recent Transformer Explainability papers that you would recommend - maybe ones that build on this and look at what's going on in later layers?


Additional ones that come to mind now are:

Transformer Feed-Forward Layers Are Key-Value Memories https://arxiv.org/abs/2012.14913

The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention https://arxiv.org/abs/2202.05798

https://github.com/neelnanda-io/TransformerLens


Another piece of the puzzle seems to be transformer "induction heads", where attention heads in consecutive layers work together to provide a mechanism believed to be responsible for much of in-context learning. The idea is that earlier instances of a token pattern/sequence in the context are used to predict the continuation of a similar pattern later on.

In the simplest case this is a copying operation: an early occurrence of AB predicts that a later A should be followed by B. In the more general case this becomes A'B' => AB, which is more of an analogy-type relationship; the toy sketch below the links shows one way to measure this in a real model.

https://arxiv.org/abs/2209.11895

https://youtu.be/Vea4cfn6TOA
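
Here's the toy measurement I mentioned (my own simplification of the kind of score used in that work, using plain GPT-2 from Hugging Face transformers rather than their codebase): repeat a random token sequence and check, for each attention head, how much attention a token in the second copy pays to the token right after its earlier occurrence.

    import torch
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    seq_len = 50
    rand = torch.randint(0, model.config.vocab_size, (1, seq_len))
    tokens = torch.cat([rand, rand], dim=1)           # ...AB ... AB: second half repeats the first

    with torch.no_grad():
        out = model(tokens, output_attentions=True)   # one (1, heads, query, key) tensor per layer

    q_pos = torch.arange(seq_len, 2 * seq_len)        # query positions in the repeated second half
    k_pos = q_pos - (seq_len - 1)                     # the token just after each query's earlier copy

    for layer, attn in enumerate(out.attentions):
        score = attn[0][:, q_pos, k_pos].mean(dim=-1)   # per-head average attention to that offset
        best = int(score.argmax())
        print(f"layer {layer:2d}: head {best:2d} induction score {score[best].item():.2f}")

In GPT-2 small this typically singles out a few heads in the middle layers with scores far above the rest - those are the induction heads.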

This is still only a low-level, mechanistic type of operation, but it's at least a glimpse into how transformers operate at inference time.


That's great - thank you!


No one can tell you much about that. Interpretability is still very poor.

You don't know what they learn beforehand (else deep learning wouldn't be necessary) so you have to try and figure it out afterwards.

But artificial parameters aren't beholden to any sort of "explainability rule". There's no guarantee anything is wired in a way humans can comprehend. And even if it were, you're potentially looking at hundreds of billions of parameters.


> not at the mechanistic level of the architecture, but in terms of what they learn (some type of world model ? details, not hand waving!)

https://imgs.xkcd.com/comics/tasks.png


Sure - but it's still the interesting part!

I'm sure some of the key players know at least a little, but they don't seem inclined to share. In his Lex Fridman interview Sam Altman said something along the lines of "a LOT of knowledge went into designing GPT-4", and there's a gap between GPT-3 (2020) and GPT-4 (2023) where it seems they spent a lot of time, probably trying to understand it among other things.

It seems the way values are looked up via query/key matching and added back in must constrain the representations quite a bit, and comparing internal activations for closely related types of input might be one way to start to understand what's going on.
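
As a very rough sketch of the kind of comparison I mean (toy code, assuming GPT-2 via Hugging Face transformers and prompts of my own choosing): run two closely related prompts and one unrelated prompt through the model and compare the hidden state at the last position, layer by layer.

    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def last_token_states(prompt):
        enc = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, output_hidden_states=True)
        return [h[0, -1] for h in out.hidden_states]   # one vector per layer (plus embeddings)

    a = last_token_states("The capital of France is")
    b = last_token_states("The capital city of France is")
    c = last_token_states("My favourite ice cream flavour is")

    for layer, (ha, hb, hc) in enumerate(zip(a, b, c)):
        related = F.cosine_similarity(ha, hb, dim=0).item()
        unrelated = F.cosine_similarity(ha, hc, dim=0).item()
        print(f"layer {layer:2d}: related {related:.2f} vs unrelated {unrelated:.2f}")

Where the related and unrelated curves diverge across layers gives at least a crude picture of which layers track surface wording versus underlying content.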

A high level understanding of what the model has learnt may be the last thing to fall, but understanding the internal representations would go a long way towards that.


Are you saying no one really knows how these things work? I am very curious about whether you can "peer into the weights". I have seen simple examples of that with image recognition, but only for early layers.




