DeepL is very limited in context; it only looks at the sentence being translated and a small window around it. Additionally, when you're trying to optimize for a more concise or more verbose translation, you really need a generalist model.
The transformer model was invented to attend to context over the entire sequence length. Look at how the original authors used the Transformer for NMT in the Vaswani et al. paper: https://github.com/jadore801120/attention-is-all-you-need-py...
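To make that concrete, here's a minimal PyTorch sketch of the encoder-decoder setup (all the sizes are made up for illustration, and positional encodings are left out for brevity):

    import torch
    import torch.nn as nn

    # Toy sizes made up for illustration; real NMT models are much larger.
    SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1000, 64

    class ToyNMT(nn.Module):
        def __init__(self):
            super().__init__()
            self.src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
            self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
            # Encoder-decoder stack as in Vaswani et al.: every encoder
            # position can attend to every other source position.
            self.transformer = nn.Transformer(
                d_model=D_MODEL, nhead=4,
                num_encoder_layers=2, num_decoder_layers=2,
                batch_first=True)
            self.out = nn.Linear(D_MODEL, TGT_VOCAB)

        def forward(self, src, tgt):
            # The causal mask applies only to the decoder's self-attention;
            # cross-attention sees all encoder outputs, unmasked.
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
            h = self.transformer(self.src_emb(src), self.tgt_emb(tgt),
                                 tgt_mask=tgt_mask)
            return self.out(h)

    model = ToyNMT()
    src = torch.randint(0, SRC_VOCAB, (1, 12))  # source token ids
    tgt = torch.randint(0, TGT_VOCAB, (1, 7))   # shifted target prefix
    print(model(src, tgt).shape)                # torch.Size([1, 7, 1000])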
See how the source sentence is run through the encoder, and then the decoder is run auto-regressively. It's been a while, so I can't recall whether the encoder embeddings are masked to the same index as the decoder input sequence (the decoding sketch at the end of this comment shows where the mask actually sits), and in theory I don't see why this particular task even needs to be done auto-regressively. I found this survey, https://ar5iv.labs.arxiv.org/html/2204.09269, which outlines the pros and cons.

A cool feature of transformers is their ability to pull in a great deal of surrounding context; look at how BERT was trained. Yeah, things are a bit more boring with decoder-only GPT types, but they're winning the day with their easy scaling.
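Coming back to the masking question: here's a minimal greedy-decoding sketch against the toy model above, just to show that the causal mask sits on the decoder's self-attention while the source is visible in full at every step. BOS_ID and EOS_ID are placeholder ids I made up, not from any real vocabulary:

    # With the untrained toy model above the output is of course gibberish.
    BOS_ID, EOS_ID, MAX_LEN = 1, 2, 20

    @torch.no_grad()
    def greedy_translate(model, src):
        tgt = torch.tensor([[BOS_ID]])
        for _ in range(MAX_LEN):
            logits = model(src, tgt)       # decoder re-run over the whole prefix
            next_tok = logits[:, -1].argmax(-1, keepdim=True)
            tgt = torch.cat([tgt, next_tok], dim=1)
            if next_tok.item() == EOS_ID:  # stop once the model emits EOS
                break
        return tgt

    print(greedy_translate(model, src))

Note this re-runs the encoder on every step to keep the sketch short; real implementations encode the source once and cache the decoder's keys/values.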