How translation happens in the Transformer architecture

Encoder:

Tokenize the input French phrase using the trained tokenizer.
Feed the resulting tokens to the input on the encoder side of the network.
Pass the tokens through the embedding layer, which maps each token to a vector.
Feed the embedded tokens into the multi-headed attention layers.
Outputs of the multi-headed attention layers are processed by a feed-forward network.
The encoder output represents the structure and meaning of the input sequence.
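
The encoder steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real Transformer: the vocabulary, the `D_MODEL` width, and the random embedding table are all hypothetical stand-ins for trained components, the attention is single-headed with no learned projections, and positional encodings, residual connections, and layer normalization are left out for brevity.

```python
import math
import random

# Hypothetical toy vocabulary; a real tokenizer would be trained on a corpus.
VOCAB = {"<pad>": 0, "j'aime": 1, "l'apprentissage": 2, "automatique": 3}
D_MODEL = 4  # embedding width, kept tiny for readability (assumption)

random.seed(0)
# Randomly initialised embedding table standing in for learned weights.
EMBEDDINGS = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]


def tokenize(text):
    """Map whitespace-separated words to token IDs (toy tokenizer)."""
    return [VOCAB[word] for word in text.lower().split()]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product attention with Q = K = V = vectors."""
    scale = math.sqrt(len(vectors[0]))
    output = []
    for q in vectors:
        # Similarity of this position to every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        weights = softmax(scores)
        # Weighted sum of all value vectors.
        output.append([
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(q))
        ])
    return output


def feed_forward(vectors):
    """Position-wise ReLU, standing in for a learned two-layer network."""
    return [[max(0.0, x) for x in vec] for vec in vectors]


def encode(text):
    token_ids = tokenize(text)                      # tokenize
    embedded = [EMBEDDINGS[t] for t in token_ids]   # embedding layer
    attended = self_attention(embedded)             # attention layer
    return feed_forward(attended)                   # feed-forward network


encoder_output = encode("J'aime l'apprentissage automatique")
```

The result holds one vector per input token, which is the representation the decoder later consumes.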

Decoder:

Insert the encoder output into the middle of the decoder, where the decoder's attention layers attend to it (encoder-decoder attention, also called cross-attention).
Add a start-of-sequence token to the decoder input.
The decoder uses contextual understanding from the encoder to predict the next token.
The output of the decoder's attention layers is processed by the decoder feed-forward network.
Pass the output through a final softmax output layer, which yields a probability distribution over the vocabulary, and select the first token from it.
Continue the loop, passing the output token back to the decoder input to predict the next token.
Repeat until the model predicts an end-of-sequence token.
The final sequence of tokens can be detokenized into words to obtain the output translation.
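
The decoding loop above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_decoder_logits` is a hypothetical stub that replaces a trained decoder (it ignores the encoder output and follows a fixed script), the vocabulary is made up, and the next token is chosen greedily, which is only one of several possible selection strategies.

```python
import math

# Hypothetical target vocabulary with start- and end-of-sequence tokens.
ID_TO_WORD = {0: "<sos>", 1: "<eos>", 2: "I", 3: "love", 4: "machine", 5: "learning"}
SOS, EOS = 0, 1


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_decoder_logits(encoder_output, decoder_tokens):
    """Stand-in for a trained decoder: one logit per vocabulary entry.

    A real decoder would attend over encoder_output and decoder_tokens.
    This stub ignores encoder_output and follows a fixed script so that
    the loop below is deterministic.
    """
    scripted = [2, 3, 4, 5, EOS]
    step = len(decoder_tokens) - 1  # number of tokens generated so far
    logits = [0.0] * len(ID_TO_WORD)
    logits[scripted[min(step, len(scripted) - 1)]] = 5.0
    return logits


def greedy_translate(encoder_output, max_len=10):
    tokens = [SOS]  # start-of-sequence token triggers the first prediction
    for _ in range(max_len):
        probs = softmax(toy_decoder_logits(encoder_output, tokens))
        # Greedy choice: take the most probable token.
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)  # feed the output back as input
        if next_token == EOS:      # stop at end-of-sequence
            break
    return tokens


def detokenize(tokens):
    """Convert token IDs back to words, dropping the special tokens."""
    return " ".join(ID_TO_WORD[t] for t in tokens if t not in (SOS, EOS))


output_tokens = greedy_translate(encoder_output=None)
translation = detokenize(output_tokens)
```

The `max_len` cap is a safeguard added here so the loop terminates even if the model never emits an end-of-sequence token.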

An Architect's vision