Transformer (machine learning)
Neural network architecture using self-attention to process sequences in parallel, replacing recurrence with learned relationships between all positions.
For decades, neural networks processed sequences the way humans read text: one token at a time, left to right, maintaining memory of what came before. Recurrent neural networks (RNNs) and their improved variants—LSTMs and GRUs—dominated natural language processing by the mid-2010s. But this sequential approach created a fundamental bottleneck: each token had to wait for all previous tokens to be processed, making parallelization across a sequence impossible. Long documents meant long training times. Worse, gradients had to flow back through every intermediate step, so the signal from early tokens faded as sequences grew longer; this vanishing-gradient problem limited the long-range dependencies these networks could learn.
The attention mechanism, introduced by Bahdanau, Cho, and Bengio in 2014, offered a partial solution—networks could learn to focus on relevant earlier tokens rather than relying solely on compressed memory states. But attention was still layered atop sequential RNNs. In June 2017, eight researchers at Google—Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin—published a paper with a provocative title: 'Attention Is All You Need.' They proposed removing recurrence entirely, building networks from stacked attention and feed-forward layers. The architecture they called the Transformer could process every position in a sequence simultaneously.
The key innovation was 'self-attention'—a mechanism in which each token in a sequence computes weighted relationships to every other token. In a sentence like 'The animal didn't cross the road because it was too tired,' the model can learn to associate 'it' with 'animal' rather than 'road' through these attention weights. Multi-head attention let the model track different types of relationships simultaneously—syntactic structure in one head, semantic meaning in another, long-range dependencies in a third. Positional encodings preserved word order without requiring sequential processing.
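The core computation is compact. The sketch below is an illustrative, single-head NumPy version of scaled dot-product self-attention together with a sinusoidal positional encoding; the toy sizes, random weights, and single-head simplification are assumptions made for the example rather than the paper's full multi-head implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """One attention head: every position attends to every other position."""
    Q = X @ W_q                          # queries, shape (seq_len, d_k)
    K = X @ W_k                          # keys,    shape (seq_len, d_k)
    V = X @ W_v                          # values,  shape (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): relevance of each token to every other
    weights = softmax(scores, axis=-1)   # each row is an attention distribution summing to 1
    return weights @ V                   # weighted mixture of value vectors

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings that inject word order without recurrence."""
    pos = np.arange(seq_len)[:, None]    # (seq_len, 1)
    i = np.arange(d_model)[None, :]      # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy usage (sizes chosen only for illustration): 6 tokens, model width 8.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (6, 8): one contextualized vector per input token
```

Scaling the scores by the square root of the key dimension keeps the dot products from growing with dimensionality, which would otherwise push the softmax into regions with vanishingly small gradients; in the full architecture, several such heads run in parallel and their outputs are concatenated and projected back to the model width.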
The adjacent possible was perfectly configured for this innovation. GPU computing had made massively parallel matrix operations practical. The attention mechanism had proven its value. Large text datasets (Common Crawl, Wikipedia) were freely available for training. Perhaps most crucially, Google's internal infrastructure—tensor processing units, distributed training frameworks, and vast computational resources—made it feasible to experiment with architectures at unprecedented scale. Trained on the WMT 2014 English-German and English-French translation benchmarks, the original Transformer achieved state-of-the-art results while training significantly faster than recurrent models.
The cascade from Transformers was revolutionary. In 2018, OpenAI introduced GPT (Generative Pre-trained Transformer), applying the decoder portion to language generation, and Google's BERT used the encoder for bidirectional understanding. Vision Transformers extended the architecture to images. By 2020, GPT-3's 175 billion parameters demonstrated that scaling Transformers yielded emergent capabilities—few-shot learning, reasoning, code generation—that smaller models lacked. The original eight authors all eventually left Google to found or join companies that would define the AI era, among them Cohere, Character.AI, and Adept. The 2017 paper has become one of the most cited in computer science history, its architecture now underlying virtually every major AI system.
What Had To Exist First
Preceding Inventions
- Recurrent neural networks (LSTMs, GRUs)
- Attention mechanism for neural machine translation (Bahdanau et al., 2014)
- GPU-accelerated parallel matrix computation
Required Knowledge
- Neural network architectures
- Attention mechanisms and alignment
- Distributed training at scale
- Natural language processing fundamentals
Enabling Materials
- NVIDIA GPUs (the original model was trained on P100s)
- Google TPUs
What This Enabled
Inventions that became possible because of Transformer (machine learning):
- Generative Pre-trained Transformers (GPT, GPT-3)
- BERT and bidirectional language understanding
- Vision Transformers for images
Biological Patterns
Mechanisms that explain how this invention emerged and spread:
- Adjacent possible: the enabling components (proven attention mechanisms, GPU/TPU compute, large open text corpora) were all in place by 2017