Large language model
Transformer-based neural networks trained on massive text datasets that exhibit emergent reasoning and generation capabilities at scale.
The idea that scaling neural networks would unlock qualitatively new capabilities was counterintuitive. For decades, the machine learning community focused on architectural innovation—inventing new types of layers, connections, and training procedures. Larger models required more data, more compute, and more engineering effort, and the returns seemed to diminish quickly. Then came the Transformer architecture in 2017, and with it a discovery that would reshape the field: Transformers exhibited what researchers called 'scaling laws'—predictable improvements in capability as model size, training data, and compute increased together.
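As a rough illustration of what such a scaling law looks like, the sketch below evaluates a power-law fit of loss against parameter count. The functional form follows published scaling-law work, but the constants here should be read as illustrative placeholders rather than authoritative fits.

```python
# Illustrative sketch of a neural scaling law: test loss falls as a power law
# in parameter count when data and compute are scaled in tandem.
# The constants below are placeholders for illustration, not published fits.

def scaling_law_loss(num_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted loss L(N) = (N_c / N) ** alpha for a model with N parameters."""
    return (n_c / num_params) ** alpha

for n in [117e6, 1.5e9, 175e9]:  # roughly GPT-1, GPT-2, GPT-3 parameter counts
    print(f"N = {n:.0e} params -> predicted loss ~ {scaling_law_loss(n):.2f}")
```

The point of the curve is not the exact numbers but its predictability: each order-of-magnitude jump in parameters buys a further, forecastable drop in loss.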
In June 2018, OpenAI released GPT (Generative Pre-trained Transformer), a 117-million parameter model trained on the BookCorpus dataset of roughly 7,000 unpublished books. The 'pre-training' approach was crucial: rather than training on task-specific data, GPT learned from massive amounts of unlabeled text, developing general language understanding that could be 'fine-tuned' for specific applications. The model showed surprising zero-shot capabilities—performing tasks it was never explicitly trained for simply by framing them as text completion problems. GPT-2 in 2019 scaled to 1.5 billion parameters, trained on 8 million web pages. GPT-3 in 2020 reached 175 billion. Each jump revealed emergent abilities that smaller models completely lacked.
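To make the completion-as-interface idea concrete, here is a minimal sketch using the Hugging Face `transformers` library and the small public `gpt2` checkpoint. Both are assumptions of this example: the checkpoint is not the original 2018 model, and a model this small illustrates the interface more than it demonstrates strong zero-shot performance.

```python
# Sketch of the zero-shot, completion-as-interface idea: a task is posed as a
# text prefix and the model simply continues it. Assumes the Hugging Face
# `transformers` package and the public `gpt2` checkpoint are available.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The task (translation) is never named as a training objective;
# it is implied entirely by the framing of the prompt.
prompt = "English: cheese\nFrench:"
completion = generator(prompt, max_new_tokens=5, do_sample=False)
print(completion[0]["generated_text"])
```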
The adjacent possible for LLMs was uniquely assembled. Transformers provided an architecture that could efficiently use parallel computation. Web-scale datasets—Common Crawl, Wikipedia, books, code repositories—offered trillions of tokens of training text. Cloud computing made massive distributed training feasible. Crucially, the lesson from ImageNet and deep learning had diffused: scale could substitute for clever engineering. The compute became available just as the architecture arrived that could use it productively.
GPT-3's 2020 release demonstrated capabilities that seemed to require understanding: analogical reasoning, code generation, mathematical problem-solving, creative writing. The API-based access model created an immediate commercial ecosystem. Researchers discovered that prompting—carefully crafted text input—could steer these models to perform specific tasks without any additional training. 'In-context learning' emerged as models showed they could follow instructions and examples provided in the prompt itself.
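A minimal sketch of what such a prompt looks like in practice: the classification task, example reviews, and wording below are invented for illustration, and the assembled string would be sent to whichever model API is in use.

```python
# Sketch of in-context ("few-shot") learning: the worked examples live in the
# prompt itself, and no model weights are updated. Task and data are invented.

examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "A surprisingly moving story, beautifully shot."

prompt = "Classify the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # This string is sent to an LLM, which completes the final label.
```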
The cascade was immediate and industry-transforming. Google, which had released BERT back in 2018, scaled up with PaLM. Anthropic, founded by former OpenAI researchers, introduced Claude. Meta released LLaMA as an open-weight model, democratizing access to powerful architectures. Microsoft invested a reported $10 billion in OpenAI and integrated GPT into Office products and Bing. The race to scale continued: Google's Gemini, OpenAI's GPT-4, Anthropic's Claude series. By 2024, these models had become infrastructure—embedded in search, productivity tools, coding environments, and customer service. The technology that seemed exotic in 2018 had become, within six years, the foundation for a new era of human-computer interaction.
What Had To Exist First
Preceding Inventions
Required Knowledge
- Transformer architecture (see the attention sketch after this list)
- Scaling laws for neural networks
- Distributed training across GPU clusters
- Pre-training and fine-tuning methodology
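As a pointer to what the first item refers to, here is a minimal, NumPy-only sketch of single-head scaled dot-product self-attention, the core operation of the Transformer. Real models add learned query/key/value projections, multiple heads, masking, and feed-forward layers; the shapes and values below are illustrative.

```python
# Minimal sketch of scaled dot-product self-attention (single head, no masking,
# no learned projections), using NumPy only.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model) token embeddings; returns attended representations."""
    d_model = x.shape[-1]
    q, k, v = x, x, x                       # real models use learned Q/K/V projections
    scores = q @ k.T / np.sqrt(d_model)     # pairwise similarity, scaled for stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                      # each token becomes a weighted mix of all tokens

tokens = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
print(self_attention(tokens).shape)         # (4, 8)
```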
Enabling Materials
- NVIDIA A100 and H100 GPUs
- Google TPU clusters
- High-bandwidth interconnects (NVLink, InfiniBand)
What This Enabled
Inventions that became possible because of the large language model:
Biological Patterns
Mechanisms that explain how this invention emerged and spread: