Transformer (machine learning)
Neural network architecture using self-attention to process sequences in parallel, replacing recurrence with learned relationships between all positions.
For decades, neural networks processed sequences the way humans read text: one token at a time, left to right, maintaining memory of what came before. Recurrent neural networks (RNNs) and their improved variants—LSTMs and GRUs—dominated natural language processing by the mid-2010s. But this sequential approach created a fundamental bottleneck: each token had to wait for all previous tokens to be processed, making parallelization across a sequence impossible. Long documents meant long training times. Worse, gradients had to flow back through every intermediate step, so the signal from early tokens faded as sequences grew longer; this vanishing-gradient problem limited the long-range dependencies these networks could learn.
The attention mechanism, introduced by Bahdanau, Cho, and Bengio in 2014, offered a partial solution—networks could learn to focus on relevant earlier tokens rather than relying solely on compressed memory states. But attention was still layered atop sequential RNNs. In June 2017, eight researchers at Google—Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin—published a paper with a provocative title: 'Attention Is All You Need.' They proposed removing recurrence entirely, building networks from stacked attention and feed-forward layers. The architecture they called the Transformer could process every position in a sequence simultaneously.
The key innovation was 'self-attention'—a mechanism in which each token in a sequence computes weighted relationships to every other token. In a sentence like 'The animal didn't cross the road because it was too tired,' the model can learn to associate 'it' with 'animal' rather than 'road' through these attention weights. Multi-head attention let the model track different types of relationships simultaneously—syntactic structure in one head, semantic meaning in another, long-range dependencies in a third. Positional encodings preserved word order without requiring sequential processing.
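The core computation is compact. The sketch below is an illustrative, single-head NumPy version of scaled dot-product self-attention together with a sinusoidal positional encoding; the toy sizes, random weights, and single-head simplification are assumptions made for the example rather than the paper's full multi-head implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """One attention head: every position attends to every other position."""
    Q = X @ W_q                          # queries, shape (seq_len, d_k)
    K = X @ W_k                          # keys,    shape (seq_len, d_k)
    V = X @ W_v                          # values,  shape (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): relevance of each token to every other
    weights = softmax(scores, axis=-1)   # each row is an attention distribution summing to 1
    return weights @ V                   # weighted mixture of value vectors

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings that inject word order without recurrence."""
    pos = np.arange(seq_len)[:, None]    # (seq_len, 1)
    i = np.arange(d_model)[None, :]      # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy usage (sizes chosen only for illustration): 6 tokens, model width 8.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (6, 8): one contextualized vector per input token
```

Scaling the scores by the square root of the key dimension keeps the dot products from growing with dimensionality, which would otherwise push the softmax into regions with vanishingly small gradients; in the full architecture, several such heads run in parallel and their outputs are concatenated and projected back to the model width.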
The adjacent possible was perfectly configured for this innovation. GPU computing had made massively parallel matrix operations practical. The attention mechanism had proven its value. Large text datasets (Common Crawl, Wikipedia) were freely available for training. Perhaps most crucially, Google's internal infrastructure—tensor processing units, distributed training frameworks, and vast computational resources—made it feasible to experiment with architectures at unprecedented scale. Trained on the WMT 2014 English-German and English-French translation benchmarks, the original Transformer achieved state-of-the-art results while training significantly faster than recurrent models.
The cascade from Transformers was revolutionary. In 2018, OpenAI introduced GPT (Generative Pre-trained Transformer), applying the decoder portion to language generation, and Google's BERT used the encoder for bidirectional understanding. Vision Transformers extended the architecture to images. By 2020, GPT-3's 175 billion parameters demonstrated that scaling Transformers yielded emergent capabilities—few-shot learning, reasoning, code generation—that smaller models lacked. The original eight authors all eventually left Google to found or join companies that would define the AI era, among them Cohere, Character.AI, and Adept. The 2017 paper has become one of the most cited in computer science history, its architecture now underlying virtually every major AI system.
What Had To Exist First
Preceding Inventions
- Recurrent neural networks (LSTMs, GRUs)
- Attention mechanism for neural machine translation (Bahdanau et al., 2014)
- GPU-accelerated parallel matrix computation
Required Knowledge
- Neural network architectures
- Attention mechanisms and alignment
- Distributed training at scale
- Natural language processing fundamentals
Enabling Materials
- NVIDIA GPUs (the original model was trained on P100s)
- Google TPUs
What This Enabled
Inventions that became possible because of Transformer (machine learning):
- Generative Pre-trained Transformers (GPT, GPT-3)
- BERT and bidirectional language understanding
- Vision Transformers for images
Biological Patterns
Mechanisms that explain how this invention emerged and spread:
- Adjacent possible: the enabling components (proven attention mechanisms, GPU/TPU compute, large open text corpora) were all in place by 2017