Long short-term memory
LSTM emerged when Hochreiter and Schmidhuber solved the vanishing gradient problem with gated memory cells in 1997—creating the architecture that enabled speech recognition, machine translation, and virtual assistants for two decades.
Long short-term memory emerged from a young German researcher's frustration with a fundamental flaw in neural networks. In 1991, Sepp Hochreiter was a diploma student at the Technical University of Munich working under Jürgen Schmidhuber when he formalized what became known as the vanishing gradient problem: when training recurrent neural networks on sequences, the gradients propagated back to earlier timesteps shrank exponentially, effectively erasing the network's ability to learn long-range dependencies. A network that couldn't remember couldn't learn language, speech, or any temporal pattern spanning more than a few timesteps.
The adjacent possible required understanding why existing solutions failed. Standard recurrent neural networks, invented in the 1980s, processed sequences by maintaining a hidden state that theoretically encoded all previous inputs. In practice, the gradient signals used for learning decayed exponentially through time—a single error signal might have to traverse hundreds of timesteps, losing information at each step like a message whispered through a long chain of people.
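A rough sketch of the mechanics makes the decay concrete (the notation here is illustrative, not taken from the thesis): in a simple recurrent network, the gradient reaching a hidden state k steps in the past is a product of k Jacobians, and if each factor contracts, the product shrinks exponentially.

```latex
% Simple recurrent network h_t = tanh(W h_{t-1} + U x_t), with pre-activation a_t.
% The gradient reaching a state k steps back is a product of k Jacobians:
\[
  \frac{\partial h_t}{\partial h_{t-k}}
    = \prod_{i=t-k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
    = \prod_{i=t-k+1}^{t} \operatorname{diag}\bigl(\tanh'(a_i)\bigr)\, W .
\]
% If each factor has operator norm at most \gamma < 1, then
\[
  \Bigl\lVert \frac{\partial h_t}{\partial h_{t-k}} \Bigr\rVert \le \gamma^{k},
\]
% which shrinks exponentially as the gap k grows.
```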
Hochreiter's 1991 diploma thesis proved this mathematically but offered no solution. Six years of refinement followed. Working with Schmidhuber, Hochreiter developed an architecture that enforced constant error flow through what they called Constant Error Carousels. The key insight was architectural: instead of hoping gradients would propagate through arbitrary computations, they designed explicit memory cells where information could flow unchanged indefinitely.
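A one-line way to see the Constant Error Carousel, in modern notation rather than the paper's own: the cell state is updated additively, so the Jacobian along the cell path is the identity and error signals pass through unchanged.

```latex
% Original LSTM cell update (no forget gate yet), where i_t is the input gate
% and g_t the candidate content: new information is added to the previous cell
% state rather than transformed through it.
\[
  c_t = c_{t-1} + i_t \odot g_t,
  \qquad
  \frac{\partial c_t}{\partial c_{t-1}} = I ,
\]
% so error carried along the cell path neither vanishes nor explodes,
% no matter how many timesteps it traverses.
```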
The 1997 paper in Neural Computation introduced the LSTM architecture. An LSTM unit contains a cell (the actual memory) and multiplicative gates controlling information flow. In the original design, the input gate determines what new information to store and the output gate controls what to expose to downstream computations; the forget gate, which decides what to discard, was added by Gers, Schmidhuber, and Cummins in 2000 and became part of the standard architecture. Crucially, the cell state itself passes through time with only linear transformations, allowing gradients to flow backward across thousands of timesteps without vanishing.
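A minimal NumPy sketch of a single LSTM step in the now-standard gated form (names and shapes are illustrative, not tied to any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U, b stack the parameters for the input (i),
    forget (f), and output (o) gates and the candidate cell content (g)."""
    z = W @ x + U @ h_prev + b                    # one matrix multiply for all four blocks
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates squash to (0, 1)
    g = np.tanh(g)                                # candidate values to write into the cell
    c = f * c_prev + i * g                        # cell state: forget some old, add some new
    h = o * np.tanh(c)                            # hidden state exposed downstream
    return h, c

# Tiny usage example: 4 input features, 8 hidden units, a sequence of 10 steps.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The line `c = f * c_prev + i * g` is the Constant Error Carousel in its modern form: when the forget gate stays near one, gradients pass through the cell state nearly unchanged, which is what lets the architecture bridge long time lags.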
The cascade from LSTM was slow but ultimately transformative. For a decade, the architecture remained obscure, a solution waiting for the computational power to exploit it. Then, around 2014, deep learning met abundant GPU computation and massive datasets. Suddenly, LSTMs became the default for speech recognition at Google, Apple's Siri, machine translation systems, and language models everywhere. The 1997 architecture processed the audio behind "Hey Siri" and "OK Google," powered neural machine translation, and generated captions for photos.
Recognition came late. Hochreiter received the IEEE CIS Neural Networks Pioneer Award in 2021 for work published 24 years earlier. By then, the LSTM paper had accumulated over 98,000 citations, making it one of the most influential papers in computer science history.
Path dependence favored LSTMs over alternative memory architectures like Neural Turing Machines or differentiable memory networks. The installed base of LSTM implementations in TensorFlow, PyTorch, and production systems created enormous switching costs. Even when transformer architectures emerged in 2017 and began displacing LSTMs for language tasks, the older architecture persisted in time-series forecasting, speech recognition, and embedded systems where transformers' computational demands were prohibitive.
By 2026, LSTM remains foundational even as transformers dominate language. Hochreiter himself published xLSTM in 2024, updating the architecture for the modern era. The memory problem he identified as a diploma student in Munich—how can a network remember?—proved to be the central question of artificial intelligence. His answer enabled machines to process time.
What Had To Exist First
Preceding Inventions
- Recurrent neural networks
Required Knowledge
- Vanishing gradient problem analysis
- Backpropagation through time
- Gated architectures
Enabling Materials
- GPU computing infrastructure
- Large sequential datasets
What This Enabled
Inventions that became possible because of Long short-term memory:
- Modern speech recognition systems
- Neural machine translation
- Virtual assistants
Biological Patterns
Mechanisms that explain how this invention emerged and spread:
- Adjacent possible: the architecture required prior analysis of why gradients vanish in recurrent networks
- Path dependence: the installed base of LSTM implementations created switching costs that kept the architecture entrenched even after transformers emerged