I'm the author of https://jalammar.github.io/illustrated-transformer/ and have spent the years since introducing people to Transformers and thinking about how best to communicate those concepts. I've found that different people need different kinds of introductions, and the thread here includes some often-cited resources, including:
For me, the main thing I've gained from ChatGPT (3.5) is that I no longer dread that wall of not knowing, where you know so little that you don't even know where to start and the task seems dreadful and insurmountable. Just a couple of questions later you've got an intro, some jargon, and some sample code. That makes it much easier to put the pieces together and ask follow-up questions.
https://peterbloem.nl/blog/transformers
https://e2eml.school/transformers.html
I would also add Luis Serrano's article here: https://txt.cohere.com/what-are-transformer-models/ (HN discussion: https://news.ycombinator.com/item?id=35576918).
Looking back at The Illustrated Transformer, when I introduce people to the topic now, I find I can hide some complexity by omitting the encoder-decoder architecture and focusing on only one of the two. Decoders are a great choice because a lot of people now come to Transformers having heard of GPT models (which are decoder-only). So my canonical intro to Transformers now only touches on a decoder model. You can see this narrative here: https://www.youtube.com/watch?v=MQnJZuBGmSQ