Decoding transformers with efficient translation
This article explains the Transformer architecture and the Swin Transformer using a simple real-life analogy. The original paper described the Transformer only as an encoder-decoder model, so let’s start there, even though much progress has been made since then.
Encoder-Decoder Transformer Architecture
Imagine you’re a translator, and your friend wants you to help them translate a story from one language to another. The story is quite long, so you decide to divide the work into two parts. First, you read and understand the story (encoding), and then you start translating it into the desired language (decoding).
Encoding (Unraveling)
— You read the story carefully and break it down into smaller parts. Each part represents a different scene or idea in the story.
— To understand each part better, you read it several times in parallel, each time focusing on a different aspect, like the characters, actions, and emotions (multi-head attention; the original paper uses eight such parallel readings, or heads, and the sketch after this list shows how they fit into an encoder layer).
— After that, you take notes and summarize the main points in simpler language (feed-forward network).
— You do this for all the parts, making sure to connect the main ideas from one part to the next (residual connections and layer normalization).
— The result of this process is a concise summary of the story, capturing all the essential information (the encoder outputs one vector per part, each of dimension d_model = 512 in the original paper).
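To make this concrete, here is a minimal sketch of one encoder layer in PyTorch. It is an illustration rather than a reference implementation: the sizes (d_model = 512, 8 attention heads, a 2048-wide feed-forward network, 6 stacked layers) follow the original paper, but the class and variable names are my own.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One 'reading pass' over the story: self-attention, then note-taking."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head attention: several parallel "readings" of the same text
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Feed-forward network: "taking notes" on what each part means
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # Residual connection + layer norm keep ideas flowing from one step to the next
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# The original paper stacks six of these layers
encoder = nn.ModuleList([EncoderLayer() for _ in range(6)])
```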
Decoding (Translation)
— Armed with your summary, you start translating into the desired language (decoding).
— You also have access to the original story while translating, which helps you cross-reference the details (cross-attention).
— You take it one part at a time and carefully choose the right words and phrases to express the meaning accurately, with each new word building on the ones you have already written (autoregressive generation).
— As you translate each part, you also make sure it fits well with the parts you have already translated, maintaining the flow of the story (residual connections and layer normalization, again 😅; a decoder sketch follows this list).
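The decoder layer looks similar, with two additions: a causal mask that enforces the one-word-at-a-time (autoregressive) behavior, and a cross-attention step that consults the encoder output (the "summary of the original story", passed in here as memory). Again, this is an illustrative sketch, not the canonical implementation.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Translate while cross-referencing the encoder's summary of the original story."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, y, memory):
        # Causal mask: each position may only look at the words already translated (autoregressive)
        T = y.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=y.device), diagonal=1)
        out, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.norm1(y + self.dropout(out))
        # Cross-attention: look back at the original story (the encoder output)
        out, _ = self.cross_attn(y, memory, memory)
        y = self.norm2(y + self.dropout(out))
        y = self.norm3(y + self.dropout(self.ff(y)))
        return y
```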
Encoder Pretraining
Before you start translating stories, you want to make sure you really understand the language. So, you practice reading and understanding different stories in the original language. To do this, you mask some words in each story and try to reconstruct the missing parts. This forces you to pay attention to the context rather than just memorizing words, making your understanding more genuine (this is the masked language modeling objective popularized by BERT-style encoders).
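A minimal sketch of this practice routine in code, assuming token IDs already come from some tokenizer; mask_tokens and the 15% masking rate are illustrative choices (real recipes also sometimes keep or randomly replace the selected words):

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """BERT-style masking: hide some words and learn to reconstruct them from context."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob   # pick ~15% of positions to predict
    labels[~masked] = -100                              # ignore unmasked positions in the loss
    inputs = token_ids.clone()
    inputs[masked] = mask_token_id                      # replace the chosen words with [MASK]
    return inputs, labels

# Example: two "stories" of 16 tokens each, with token ID 4 acting as [MASK]
inputs, labels = mask_tokens(torch.randint(5, 1000, (2, 16)), mask_token_id=4)
```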
Decoder Pretraining
To become an expert translator, you also need to practice translating from scratch. So, you take stories and try to write them out in the desired language, producing one word at a time and seeing only the words you have already written. This trains you to predict the next word based on the words that came before it, helping you become proficient in generating translations.
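This one-word-at-a-time practice is causal (next-word) language modeling. A minimal sketch of the training loss, assuming the decoder already produces a tensor of vocabulary logits; next_word_loss is an illustrative helper, not a library function:

```python
import torch
import torch.nn.functional as F

def next_word_loss(logits, token_ids):
    """Causal language modeling: each position predicts the word that comes next.

    logits:    (batch, seq_len, vocab_size) decoder outputs
    token_ids: (batch, seq_len) the target sentence
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))   # position t ...
    target = token_ids[:, 1:].reshape(-1)                   # ... predicts token t+1
    return F.cross_entropy(pred, target)
```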
No Activation After Multi-head Attention Layer
As you work on understanding and translating stories, you notice an interesting thing. During your parallel readings (multi-head attention), you can easily tell which parts are closely related and which are not, because attention already scores each pair of words by how similar they are and turns those scores into weights with a softmax. So you don’t need an extra step of processing (an activation function) right after each reading to capture these relationships; the attention output flows straight into the residual connection and layer normalization, and the non-linearity is left to the feed-forward network that follows.
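A small self-contained snippet makes this visible (the module names are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

x = torch.randn(1, 10, d_model)   # one "story" with 10 parts
attn_out, _ = attn(x, x, x)       # softmax weights already capture which words relate to which
x = norm1(x + attn_out)           # no ReLU/GELU here: residual + layer norm only
x = norm2(x + ff(x))              # the single non-linearity lives inside the feed-forward block
```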
Swin Transformer
One day, you realize that you can work more efficiently and handle longer stories by breaking them down into smaller chunks. Instead of reading the entire story at once, you read it section by section, paying attention only within each section before combining the key points (window-based attention). This way, you can handle larger stories without feeling overwhelmed.
Additionally, you find a smarter way to let neighboring sections share context without rereading everything. Between readings, you shift your section boundaries by half a section, so that words near the edge of one section end up in the same window as words from the next (the Shifted Window mechanism). This lets information flow between sections while keeping the work local, saving you time and energy during your translation work.
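In code, these two tricks are window-based attention and the shifted-window step. A minimal sketch follows, with one caveat: the Swin Transformer was designed for images, so the "story" below is a grid of patch features rather than text. The sizes (a 56x56 patch grid, 7x7 windows, 96 channels) follow the Swin-T configuration, while the helper names are my own.

```python
import torch

def window_partition(x, window_size):
    """Split the feature map into non-overlapping windows so attention stays local.

    x: (batch, height, width, channels) patch features
    returns: (num_windows * batch, window_size, window_size, channels)
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shift_windows(x, window_size):
    """Shifted-window step: roll the feature map by half a window so that, in the next
    attention layer, neighboring windows end up sharing context."""
    shift = window_size // 2
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# Example: 56x56 patch grid with 96-dim features, 7x7 attention windows
x = torch.randn(1, 56, 56, 96)
windows = window_partition(x, window_size=7)          # (64, 7, 7, 96): 64 local windows
shifted = window_partition(shift_windows(x, 7), 7)    # the same grid, offset by 3 patches
```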
Overall, the Swin Transformer makes you a more efficient translator, allowing you to handle bigger translation tasks with ease, and it improves your accuracy by helping you focus on the right context while translating.
That’s all for this introduction to Transformers, I hope you liked it! Be sure to follow me on Medium, as I will be writing more articles like this. Stay tuned for more exciting insights and updates in the world of AI and language magic!