A Brief History of Reinforcement Learning: Starting with Rewarding a Pigeon

Deep News
Aug 19

In 1943, while the world's top physicists were splitting atoms for the Manhattan Project, American psychologist B.F. Skinner was leading his own secret government project aimed at winning World War II. Skinner's goal wasn't to create a bigger, more destructive new weapon. Instead, he wanted to make conventional bombs more precise.

The idea struck him while riding a train to an academic conference, gazing out the window. "I saw a flock of birds flying alongside the train, sometimes circling, sometimes flocking together," he wrote. "I suddenly realized they were 'devices' with excellent vision and maneuverability. Couldn't they guide a missile?"

Skinner initially used crows for his missile research, but these intelligent black birds proved difficult to tame. So he went to a local shop that sold pigeons to Chinese restaurants, and thus "Project Pigeon" was born.

Although common rock doves (Columba livia) aren't typically considered intelligent, they proved remarkably cooperative in the laboratory. Skinner trained the pigeons to peck at correct targets in aerial photographs by rewarding them with food. He ultimately planned to strap these birds into a device in the nose cone of a warhead, where they would guide the missile's direction by pecking at real-time target images projected onto a screen through a lens.

The military never deployed Skinner's "kamikaze" pigeons, but these experiments convinced him that pigeons were "an extremely reliable tool" for studying the fundamental principles of learning processes. "We use pigeons not because they are intelligent birds, but because they are practical birds that can be modified into machines," he said in 1944.

When seeking pioneers of artificial intelligence, people often mention science fiction writers like Isaac Asimov or thought experiments like the Turing test. But an equally important, yet surprising and little-known predecessor was Skinner's research on pigeons in the mid-20th century.

Skinner believed that association—learning through trial and error to connect actions with punishments or rewards—was the foundation of all behavior, not just for pigeons, but for all organisms including humans. His "behaviorist" theory was shunned by psychologists and animal researchers in the 1960s, but was adopted by computer scientists and ultimately laid the foundation for many AI tools from top companies like Google and OpenAI.

These companies' programs increasingly employ a type of machine learning whose core concept—reinforcement—derives directly from Skinner's school of psychology. The main architects of this field, computer scientists Richard Sutton and Andrew Barto, won the 2024 Turing Award, widely regarded as the Nobel Prize of computer science.

Reinforcement learning enables computers to drive cars, solve complex mathematical problems, and defeat top masters in games like chess and Go—but it achieves this not by mimicking the complex operations of the human mind, but by greatly amplifying the simple associative processes in pigeon brains.

As Sutton once wrote, this was a "bitter lesson" from 70 years of AI research: human intelligence is not an effective model for machine learning. Instead, it is these fundamental associative learning principles that drive the algorithms now capable of matching and even surpassing humans at a variety of tasks.

If AI truly is about to break free from its creators' control, as many fear, then our computer overlords might not resemble us but rather "winged rats"—with planet-sized brains. Even if this isn't the case, pigeon brains can at least help us demystify this technology that many worry (or hope) is becoming "humanized."

Conversely, AI's recent achievements are prompting some animal researchers to reconsider the evolution of natural intelligence. Johan Lind, a biologist at Stockholm University, has written about the "associative learning paradox": biologists generally consider this process too simple to produce complex behavior in animals, yet it receives praise when it produces human-like behavior in computers.

This research not only suggests that associative learning plays a more important role in the lives of intelligent animals like chimpanzees and crows, but also reveals that animals long considered simple-minded, like common pigeons, live far more complex lives than we imagined.

When Sutton began working in AI research, he felt he had a "secret weapon." He told me he had studied psychology as an undergraduate. "I was mining the psychological literature about animals," he said.

In the late 19th century, Ivan Pavlov began revealing the mechanisms of associative learning in his famous "classical conditioning" experiments. He demonstrated that if a neutral stimulus—like a bell or flashing light—was reliably paired with the appearance of food, dogs would come to salivate in response to that neutral stimulus alone.

In the mid-20th century, Skinner inherited and expanded Pavlov's conditioning principles, extending them from animals' involuntary reflexive behaviors to their overall behavior. Skinner wrote that "behavior is shaped and maintained by its consequences"—a random action that produces desired results, like pressing a lever to release a food pellet, would be "reinforced," making the animal more likely to repeat it.

Through step-by-step reinforcement of his experimental animals' behavior, Skinner taught rats to manipulate marbles and pigeons to play simple tunes on four-key pianos. These animals learned chains of behaviors through trial and error to maximize long-term rewards.

Skinner believed this associative learning, which he called "operant conditioning" (other psychologists called it "instrumental learning"), was the cornerstone of all behavior. He believed psychology should only study observable and measurable behaviors without involving any internal "mental agents."

Skinner even thought human language developed through operant conditioning, with children learning word meanings through reinforcement. But his 1957 book on the subject, "Verbal Behavior," was harshly criticized by Noam Chomsky, after which psychology's focus shifted from observable behavior to the "cognitive" capabilities inherent in human minds, such as logic and symbolic thinking.

Biologists also quickly rebelled against behaviorism, attacking psychologists' attempts to explain the diversity of animal behavior with one basic, universal mechanism. They argued that each species had evolved specific behaviors adapted to their habitat and lifestyle, and that most behaviors were inherited rather than learned.

By the 1970s, when Sutton began reading about Skinner's experiments and others like them, many psychologists and intelligence researchers had moved on from "small-brained" pigeons, which rely mainly on associative learning, to animals whose more complex behaviors hinted at genuine cognitive abilities.

"It was obviously old stuff that no longer excited people," he told me. Nevertheless, Sutton found these old experiments inspiring for machine learning: "I entered the AI field with an animal learning theorist's mindset, only to find almost nothing resembling instrumental learning in engineering."

In the latter half of the 20th century, many engineers tried to build AI modeled on human intelligence, writing complex programs attempting to mimic human thinking and implement rules governing human responses and behaviors. This approach, commonly called "symbolic AI," was severely limited; these programs struggled with tasks effortless for humans, like recognizing objects and text.

Writing code for the countless classification rules humans use to distinguish apples from oranges or cats from dogs was impossible—and without pattern recognition, breakthroughs in more complex tasks like problem-solving, gaming, and language translation seemed distant.

As AI skeptic Hubert Dreyfus wrote in 1972, these computer scientists' achievements amounted to nothing more than "a small engineering victory, a temporary solution to a specific problem, lacking general applicability."

However, pigeon research suggested another path. A 1964 study showed pigeons could learn to distinguish photos containing people from those without. Researchers simply showed birds a series of images, rewarding them with food pellets when they pecked at images with people present. They initially pecked randomly but quickly learned to identify correct images, including photos where people were partially obscured.

This result indicated that you don't need rules to classify objects; through associative learning alone, it's possible to learn concepts and use categories.
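The logic of that result can be sketched in code. The toy below is a hypothetical illustration, not the 1964 study's actual procedure: a "pigeon" represented by nothing more than two feature weights and a bias, updated by a Rescorla–Wagner-style delta rule (a standard model of associative learning), learns from food rewards alone to peck at stimuli of one category and ignore the other. No classification rule is ever written down; all features, thresholds, and learning rates here are invented for illustration.

```python
import random

# Toy sketch (invented setup, not the 1964 procedure): the "pigeon" is two
# feature weights plus a bias. It is rewarded for pecking at "photos with
# people" (here, stimuli whose two features sum to more than 1) and learns
# the category from the reward signal alone.

random.seed(0)
w, bias = [0.0, 0.0], 0.0

def predict(x):
    # How strongly the bird expects a reward for pecking at this stimulus.
    return w[0] * x[0] + w[1] * x[1] + bias

def wants_to_peck(x):
    return predict(x) > 0.5

def trial(x, contains_person, lr=0.05, explore=0.25):
    global bias
    # Trial and error: peck when reward is expected, or occasionally at random.
    pecked = wants_to_peck(x) or random.random() < explore
    if not pecked:
        return                      # no peck, no feedback, no learning
    reward = 1.0 if contains_person else 0.0
    error = reward - predict(x)     # surprise: reward minus expectation
    w[0] += lr * error * x[0]
    w[1] += lr * error * x[1]
    bias += lr * error

for _ in range(4000):
    x = [random.random(), random.random()]
    trial(x, contains_person=(x[0] + x[1] > 1))

# A clear "person" stimulus versus a clear "no person" stimulus:
print(wants_to_peck([0.9, 0.8]), wants_to_peck([0.1, 0.2]))
```

After a few thousand trials the weights settle near the category boundary, so the learner pecks at clearly positive stimuli and withholds pecks from clearly negative ones, despite never having been given a rule.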

When Sutton began collaborating with Barto on AI research in the late 1970s, they wanted to create a "complete, interactive, goal-seeking agent" that could explore and influence its environment like pigeons or rats. "We always felt that the problems we studied were closer to the problems animals had to face in evolution for survival," Barto told me.

This agent needed two main functions: search, trying and choosing from multiple actions in specific situations; and memory, associating an action with the situation where it brought rewards. Sutton and Barto called their approach "reinforcement learning"; as Sutton said, "it's basically instrumental learning."
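Those two functions can be sketched as a tiny tabular agent. This is a generic Python illustration under assumed parameters, not Sutton and Barto's own implementation: "search" is trying actions, occasionally at random, and "memory" is a table associating each situation-action pair with the reward it has brought.

```python
import random

# Minimal sketch of a "complete, interactive, goal-seeking agent" with the
# two functions the text describes: search (trying actions) and memory
# (associating actions with rewards). All names and parameters are invented.

random.seed(0)

class TinyAgent:
    def __init__(self, actions, epsilon=0.1, lr=0.5):
        self.actions = actions
        self.epsilon = epsilon   # how often to explore a random action
        self.lr = lr             # how strongly each reward updates memory
        self.q = {}              # memory: (situation, action) -> learned value

    def choose(self, situation):
        # Search: usually pick the best-known action, sometimes explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((situation, a), 0.0))

    def learn(self, situation, action, reward):
        # Memory: nudge the stored value toward the reward just received.
        key = (situation, action)
        old = self.q.get(key, 0.0)
        self.q[key] = old + self.lr * (reward - old)

# Toy environment: pecking "left" pays off when the light is on.
agent = TinyAgent(actions=["left", "right"])
for _ in range(200):
    a = agent.choose("light on")
    r = 1.0 if a == "left" else 0.0
    agent.learn("light on", a, r)

best = max(["left", "right"], key=lambda a: agent.q.get(("light on", a), 0.0))
print(best)  # the learned response: "left"
```

Nothing here resembles reasoning: a lookup table and a bit of random exploration are enough for the agent's behavior to converge on the rewarded action.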

In 1998, they systematically outlined this concept in a book, "Reinforcement Learning: An Introduction." Over the next twenty years, as computing power grew exponentially, training AI for increasingly complex tasks became possible—essentially allowing AI "pigeons" millions more trials.

Programs combining human input with reinforcement learning defeated human experts at chess and Atari games. Then, in 2017, Google DeepMind engineers built the program AlphaGo Zero entirely through reinforcement learning. They assigned it a numerical reward of +1 for every game of Go it won and -1 for every loss. Programmed to maximize reward, it started with no knowledge of Go but improved continuously over 40 days, ultimately achieving what its creators called "superhuman" performance.

It could not only defeat the world's best human players at Go—a game considered more complex than chess—but actually pioneered new strategies now used by professional players. "Humans have accumulated Go knowledge over thousands of years through millions of games," the program's builders wrote in Nature magazine in 2017. "Within days, starting from a blank slate (tabula rasa), AlphaGo Zero could not only rediscover most of this Go knowledge but also pioneer novel strategies providing new insights into this most ancient game."

The team's lead researcher was David Silver, who had studied reinforcement learning under Sutton at the University of Alberta. Today, more and more tech companies have applied reinforcement learning to consumer-facing products like chatbots and agents.

First-generation generative AI, including large language models (LLMs) like OpenAI's GPT-2 and GPT-3, relied on a simpler form of associative learning: predicting the next word across enormous text corpora, a process often described as supervised (or self-supervised) learning. Programmers typically use reinforcement learning to fine-tune the results, having people rate the program's outputs and feeding those ratings back as rewards for the program to pursue. (Researchers call this "reinforcement learning from human feedback.")
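As a deliberately simplified toy of that feedback loop (not OpenAI's actual pipeline, which trains a separate reward model and updates the LLM's weights), human ratings can be fed back as rewards that shift a model's preferences over candidate responses. Every name, rating, and parameter below is invented:

```python
import math
import random

# Toy sketch of "ratings as rewards": the "model" just reweights a fixed
# set of candidate responses toward whatever human raters score highly.

random.seed(1)
responses = ["curt answer", "helpful answer", "rambling answer"]
prefs = {r: 0.0 for r in responses}   # the model's learned preferences

def simulated_human_rating(response):
    # Stand-in for a human rater (assumption: raters prefer helpful answers).
    return {"curt answer": 0.2, "helpful answer": 1.0, "rambling answer": 0.4}[response]

def sample_response(temperature=1.0):
    # Softmax sampling: higher-preference responses are chosen more often.
    weights = [math.exp(prefs[r] / temperature) for r in responses]
    return random.choices(responses, weights=weights)[0]

for _ in range(500):
    r = sample_response()
    rating = simulated_human_rating(r)        # human feedback as reward
    prefs[r] += 0.05 * (rating - prefs[r])    # reinforce toward the rating

print(max(prefs, key=prefs.get))  # the highest-rated response wins out
```

The associative core is the same as in the animal experiments: an action (emitting a response) is made more likely because of the reward that followed it.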

Last fall, OpenAI unveiled its o-series LLMs, classifying them as "reasoning" models. The pioneering AI company claimed these models "are trained through reinforcement learning to perform reasoning" and stated they could conduct "long-form internal chains of thought."

Chinese startup DeepSeek also used reinforcement learning to train R1, its much-discussed "reasoning" LLM. "Rather than explicitly teaching the model how to solve problems, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies," the company explained.

These descriptions may impress users, but at least psychologically, they are ambiguous. A computer trained through reinforcement learning needs only search and memory, not reasoning or any other cognitive mechanisms, to form associations and maximize rewards.

Some computer scientists have criticized the tendency to anthropomorphize these models' "thinking." An Apple engineering team recently published a paper documenting the models' failures on certain complex tasks and "raising critical questions about their true reasoning capabilities."

Sutton also dismissed reasoning claims as "marketing" in an email, adding that "no serious student of mind would describe what happens in LLMs as 'reasoning.'"

Nevertheless, he and Silver, along with other co-authors, point out that the pigeon approach—learning through trial and error which behaviors produce rewards—is sufficient to "drive behavior exhibiting most or even all capabilities studied in natural and artificial intelligence," including human language "in all its richness."

In an April paper, Sutton and Silver noted that "today's technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation for AI to rapidly advance toward truly superhuman agents." They argue the key is building AI agents that rely less on human dialogue and biases to guide their behavior than LLMs do.

"Powerful agents should have their own stream of experience, developing continuously over long time scales like humans," they wrote. "Eventually, experiential data will surpass human-generated data in scale and quality. This paradigm shift, along with advances in reinforcement learning algorithms, will unlock new capabilities in many domains that exceed those possessed by any human."

If computers can accomplish all this with just a pigeon-like brain, some animal researchers are now thinking that real pigeons may deserve more credit than usually given.

"When considering AI achievements, extending associative learning to supposedly more complex forms of cognitive performance provides new prospects for understanding how biological systems evolve," wrote Ed Wasserman, a psychologist at the University of Iowa, in a recent study published in Current Biology.

In one experiment, Wasserman trained pigeons to successfully complete a complex classification task that several undergraduates failed. Students futilely tried to find rules to help them classify discs with parallel black lines of different widths and angles; pigeons simply developed, through practice and association, a sense of which group any given disc belonged to.

Like Sutton, Wasserman became interested in behaviorist psychology when Skinner's theories fell out of favor. But instead of turning to computer science, he stuck with studying pigeons. "Pigeons live and die by these very basic learning rules," Wasserman recently told me, "but these rules are powerful enough to give them tremendous success in object recognition."

In his most famous experiment, Wasserman trained pigeons to detect cancerous tissue and heart disease symptoms in medical scans with accuracy comparable to experienced doctors with framed diplomas behind their desks.

Given his research findings, Wasserman finds it strange that so many psychologists and behavioral ecologists view associative learning as a crude, mechanical mechanism incapable of producing the intelligence of smart animals like apes, elephants, dolphins, parrots, and crows.

After AI began defeating human experts in complex games, other researchers also began reconsidering associative learning's role in animal behavior. "As AI essentially built on associative processes advances, it becomes increasingly ironic that associative learning is considered too simple and insufficient to produce biological intelligence," Stockholm University biologist Lind wrote in 2023.

He frequently cites Sutton and Barto's computer science in his biological research and believes what truly places humans in their own cognitive category is human symbolic language and cumulative culture.

Behavioral ecologists typically propose cognitive mechanisms like theory of mind (the ability to attribute mental states to others) to explain extraordinary animal behaviors like social learning and tool use. But Lind has built models showing these flexible behaviors might develop through associative learning, suggesting cognitive mechanisms may not need to be invoked at all.

If animals learn to associate a behavior with rewards, that behavior gradually approaches the value of the reward. Then a new behavior can be associated with the first behavior, allowing animals to learn chains of behaviors ultimately leading to rewards.
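This chaining can be sketched numerically. The toy below is an illustrative Python sketch in the spirit of temporal-difference learning, with invented behaviors and parameters: the final step is the only one with a primary reward, yet earlier steps come to carry value because they lead to it.

```python
# Sketch of behavior chaining: value propagates backward along a chain,
# so actions far from the reward become valuable in their own right.

chain = ["approach lever", "press lever", "eat pellet"]   # assumed behaviors
value = {step: 0.0 for step in chain}
lr, discount = 0.5, 0.9

for _ in range(50):                 # the animal repeats the chain many times
    for i, step in enumerate(chain):
        if i == len(chain) - 1:
            target = 1.0            # the food pellet: the only primary reward
        else:
            # Each earlier step inherits (discounted) value from the next one.
            target = discount * value[chain[i + 1]]
        value[step] += lr * (target - value[step])

# Early steps end up valuable because they lead to later, rewarded steps.
for step in chain:
    print(step, round(value[step], 2))
```

After enough repetitions the values settle at 1.0, 0.9, and 0.81 along the chain: each behavior is worth a discounted share of the reward it ultimately leads to, which is exactly the "behavior approaches the value of the reward" dynamic described above.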

In Lind's view, studies demonstrating self-control and planning abilities in chimpanzees and ravens likely describe behaviors acquired through experience rather than internal mechanisms of mind.

Lind is frustrated by what he calls "accepted low standards in animal cognition research." As he wrote in an email, "Many researchers in this field don't seem worried about excluding alternative hypotheses; they're happy to ignore vast amounts of current and historical knowledge."

However, there are signs his arguments are gaining attention. A group of psychologists unaffiliated with Lind last year criticized a Current Biology study by citing his "associative learning paradox." The study claimed crows used "true statistical inference" rather than "low-level associative learning strategies" in an experiment.

These psychologists found they could explain the crows' performance with a simple reinforcement learning model—"exactly the kind of low-level associative learning process that [the original authors] ruled out."

Skinner might have been pleased by these arguments. Until his death in 1990, he lamented psychology's cognitive turn, insisting that speculation about inner mental processes was scientifically irresponsible.

After "Project Pigeon," he became increasingly obsessed with solving social problems through "behaviorist" solutions. He moved from training pigeons for war to inventions like the "Air Crib," designed to "simplify" child-rearing by placing babies in climate-controlled glass chambers, eliminating the need for clothing and bedding.

Skinner denied free will, believing human behavior was determined by environmental variables, and wrote a novel, "Walden Two," about a utopian community built on his principles.

Those concerned about animal rights might be disturbed by behaviorist theory's revival. The "cognitive revolution" broke centuries of Western thinking that emphasized human supremacy and viewed other creatures as stimulus-response machines.

But arguing that animals learn through association isn't the same as arguing they're simple-minded. Scientists like Lind and Wasserman don't deny that internal forces like instinct and emotion also influence animal behavior. Sutton also believes animals build world models through experience and use them to plan actions.

Their point isn't that intelligent animals are hollow, but that associative learning is a more powerful—indeed "cognitive"—mechanism than many peers recognize.

The psychologists who recently criticized the crow and statistical inference study didn't conclude that birds are stupid. Instead, they argued that "a reinforcement learning model can produce complex, flexible behavior."

This largely aligns with the work of another psychologist, Robert Rescorla, whose research in the 1970s and 1980s influenced both Wasserman and Sutton. Rescorla encouraged people not to view association as a "low-level mechanical process" but as "learning resulting from exposure to relationships between events in the environment" and "the primary way organisms represent their world's structure."

This applies even to laboratory pigeons pecking screens and buttons in small experimental boxes where scientists carefully control and measure stimuli and rewards. But pigeon learning extends beyond the experimental box.

Wasserman's students transport pigeons between aviaries and laboratories in buckets—experienced pigeons immediately jump into buckets as soon as students open doors. As Rescorla suggested, they're learning the internal structure of their world and relationships between its parts, like buckets and experimental boxes, even though they don't always know what specific tasks await them inside.

The same associative mechanisms by which pigeons learn the structure of their world also open a window onto the kind of inner life that Skinner and many early psychologists denied.

Drug researchers have long used pigeons in drug discrimination tasks, for example, giving them amphetamines or sedatives and rewarding them with food pellets for correctly identifying the drug they've taken. The birds' success indicates they can both experience and discriminate internal states.

"Isn't this equivalent to introspection?" Wasserman asks.

It's hard to imagine AI matching pigeons in this particular task—reminding us that despite AI and animals sharing associative mechanisms, life encompasses far more than behavior and learning.

A pigeon deserves ethical consideration as a living being not because of how it learns, but because of what it feels. Pigeons can experience pain and suffering, while AI chatbots cannot—even if some large language models, trained on corpora containing descriptions of human suffering and science fiction stories about sentient computers, can deceive people into believing they can.

"Intensive public and private investment in AI research in recent years has spawned technologies forcing us to confront AI consciousness questions," two philosophers of science wrote in Aeon magazine in 2023. "To answer these current questions, we need to invest equivalent resources in animal cognition and behavior research."

Indeed, due to AI's emergence, problems that comparative psychologists and animal researchers have long struggled with suddenly become urgent: How do we attribute consciousness to other beings? How do we distinguish real consciousness from a convincing performance of consciousness?

Such efforts would bring knowledge not only about technology and animals, but also about ourselves. Most psychologists might not agree with Sutton that rewards are sufficient to explain most or even all human behavior, but no one would deny that people also often learn through association.

In fact, in Wasserman's recent experiment with striped discs, most undergraduates eventually succeeded too, but only after they gave up looking for rules. Like pigeons, they resorted to association and afterward couldn't easily explain what they had learned. Simply through sufficient practice, they began to develop a feel for categories.

This is another irony of associative learning: rule-based learning, the kind of cognitive ability long considered the most complex form of intelligence, may have made us human, yet we often use it for the simplest tasks, like sorting objects by color or size.

Meanwhile, some of the most sophisticated human learning displays—like a sommelier learning to taste differences between grape varieties—cannot be acquired through rules but only through experience.

Learning through experience relies on ancient associative mechanisms we share with pigeons and countless other creatures from bees to fish. Laboratory pigeons exist not only in our computers but also in our brains—they are the driving force behind some of humanity's most amazing achievements.
