What is reinforcement learning?
We look at a method of AI development built on the idea of positive and negative feedback
One of the most fascinating subdivisions of artificial intelligence (AI) is reinforcement learning. Itself a subset of machine learning (ML), reinforcement learning technology is widely tested on games, such as Go, but its development might have wider implications for industries and businesses.
This branch of AI aspires to reflect human-like capabilities, and in gaming contexts has even exceeded them. For instance, it has gone toe-to-toe with several world champions in their specialities.
Ke Jie, for example, is a Go world champion who has been humbled by a reinforcement learning system. The Chinese competitor had dominated the game from 2014, but was beaten three times in 2017 by a system developed by Google's DeepMind division.
The previous year, DeepMind’s AlphaGo system had lost to the 18-time world champion Lee Sedol in the fourth game of a five-game series, although it won the other four. Lee retired in 2019, citing the dominance of AI and suggesting it “cannot be defeated”.
Although reinforcement learning has proven itself in the realm of gaming, this technology can also be used in robotics and automation. Further breakthroughs, therefore, can have significant implications for businesses and the wider economy.
What is reinforcement learning?
Reinforcement learning (RL) is a method of training ML systems to find their own way of solving complex problems, rather than making decisions based on preconfigured possibilities that a programmer has set.
Positive and negative reinforcement are used: correct decisions earn rewards, while poor decisions are penalised. Although humans normally think of a reward as a treat of some description, for machines the reward is simply a positive evaluation of an action.
RL also doesn't rely on human involvement during the training process. In classic ML, using what's known as supervised learning, a machine learning algorithm is given a set of decisions to choose from. Using the game of Go as an example, someone training the algorithm could give it a list of moves to make in a given scenario, from which the program could then choose.
The problem with this model is that the algorithm then becomes only as good as the human programming it, which means the machine cannot learn by itself.
The goal of reinforcement learning is to train the algorithm to make sequential decisions that reach an end goal; over time, the algorithm learns, through reinforcement, to reach that goal in the most efficient way. When trained using reinforcement learning, artificial intelligence systems can draw on experience from many more decision trees than humans can, which makes them better at solving complex tasks – at least in gamified environments.
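To make that reward-driven loop concrete, here is a minimal sketch of tabular Q-learning, one of the simplest reinforcement learning methods. The five-state "corridor" environment, the reward of 1 for reaching the goal, and the hyperparameters are all invented for illustration, not a description of any system mentioned in this article:

```python
import random

# Toy 1-D "corridor": states 0..4, goal at state 4.
# Actions: 0 = left, 1 = right. Reaching the goal earns +1; every other
# step earns 0, so once future rewards are discounted, shorter routes win.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3  # learning rate, discount, exploration

q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action] value estimates

def step(state, action):
    """Move left or right; report the new state, reward, and whether done."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

random.seed(0)
for _ in range(500):                 # 500 episodes of trial and error
    s, done = 0, False
    while not done:
        # Mostly exploit the best-known action; sometimes explore at random
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = q[s].index(max(q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
        s = s2

# The learned greedy policy heads right (toward the goal) from every state
policy = [row.index(max(row)) for row in q[:GOAL]]
print(policy)
```

Because future rewards are discounted, the shortest route to the goal scores highest, so repeated trial and error pushes every state's preferred action toward "right" – the efficient path – without anyone ever listing the correct moves.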
Learning to win
Reinforcement learning shares some similarities with supervised learning in a classroom: a framework establishing the ground rules is still required. But the software agent is never told which instructions to follow, nor is it given a dataset to draw upon. Instead, the system creates its own dataset from its actions, built through trial and error, to establish the most efficient route to a reward.
This is all done sequentially – a software agent takes one action at a time until it encounters a state for which it is penalised. For example, a virtual car leaving a road or track produces an error state, ending the attempt and reverting the agent to its starting position. For many applications, such as facial recognition, we don't actually need the system to learn to make new decisions as it develops, only to refine its data processing. For others, however, reinforcement learning is by far the most beneficial form of development.
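The error-state-and-reset cycle can be sketched in a few lines. The one-dimensional "track" below is a hypothetical stand-in for the virtual car example: steering off the edge is a penalised error state that ends the attempt, while reaching the end earns a reward:

```python
# Hypothetical 1-D "track": the car starts at position 0 and should reach
# position 5; any position below 0 means it has left the track.
TRACK_END = 5
CRASH_PENALTY, FINISH_REWARD = -1.0, 1.0

def run_episode(choose_action):
    """Run one trial from the starting position; return its total reward."""
    pos, total = 0, 0.0
    while True:
        pos += choose_action(pos)         # -1 steers left, +1 steers right
        if pos < 0:                       # left the track: penalised error state,
            return total + CRASH_PENALTY  # the episode ends, car reverts to start
        if pos >= TRACK_END:              # reached the finish line
            return total + FINISH_REWARD

print(run_episode(lambda pos: 1))   # always steer right: finishes, reward 1.0
print(run_episode(lambda pos: -1))  # always steer left: off the track, penalty -1.0
```

An agent improving by trial and error would run many such episodes, gradually favouring the actions whose episodes returned the higher totals.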
One of the most famous examples comes from Google's DeepMind, which used a deep Q-learning algorithm to master Atari Breakout, the classic 70s arcade game in which players smash through eight rows of blocks with a ball and paddle. During development, the software agent was provided only with the information that appeared on screen and was tasked simply with maximising its score.
As you might expect, the agent struggled to get to grips with the game early on. Researchers found it was unable to grasp the controls and consistently missed the ball with the paddle. After a great deal of trial and error, the agent eventually figured out that if it angled the ball so that it became stuck between the highest layer of blocks and the top wall, it could break down most of the wall with only a small number of paddle hits. It also learned that each return of the ball to the paddle reduced the efficiency of a run and lengthened the game.
The agent was basing its decisions on a policy network. Every action taken by the agent was recorded by the network, along with the result and what could have been done differently to change that result. The result, also known as a state, could therefore be predicted by the agent.
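A real policy network generalises to states it has never seen, but the record-and-predict idea in the paragraph above can be illustrated with a simple lookup table. The state names and probabilities here are invented for the example:

```python
import random
from collections import Counter, defaultdict

# Log every (state, action) -> result transition the agent experiences,
# then predict the most frequently observed result for a (state, action).
model = defaultdict(Counter)

def record(state, action, result):
    model[(state, action)][result] += 1

def predict(state, action):
    """Return the most commonly seen result, or None if never tried."""
    seen = model[(state, action)]
    return seen.most_common(1)[0][0] if seen else None

# Simulate 100 noisy experiences: "right" from state s0 usually leads to s1
random.seed(1)
for _ in range(100):
    record("s0", "right", "s1" if random.random() < 0.8 else "s0")

print(predict("s0", "right"))  # "s1" -- the result seen most often
```

Whereas this table merely memorises what it has seen, a trained network can estimate the likely result of an action even in situations it has not encountered before.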
Problems with reinforcement learning
The example above is useful for understanding the fundamental principles of reinforcement learning, but gaming environments, no matter how large, only offer limited scope for learning and rarely offer anything meaningful beyond simple testing.
Success is not always easily translated into real-world use cases, particularly as it relies on a system of reward and failure states that are often ambiguous in reality. Tasking an agent with solving a particular challenge within tight parameters is one thing, but creating a realistic simulation that's applicable for everyday use is far harder.
If we take the example of an autonomous vehicle system, creating a simulation for it to learn from can be incredibly challenging. Not only does the simulation need to accurately represent a real-world road, and convey the various laws and restrictions that govern car use, but it also needs to take into account constant changes in traffic volume, the sudden actions of other human drivers (who may not be obeying the highway code themselves), and random obstacles.
There are also a variety of technical challenges that limit the potential of this type of learning. Systems have been known to 'forget' older actions, results and predictions when new knowledge is acquired. Agents have also successfully reached a desired positive state, but in an inefficient or undesired way. For example, in 2018 Deepsense.ai sought to teach an algorithm to run, only to find the agent developed a tendency to jump instead, because jumping reached the rewarded state more quickly.
The future of machine learning?
These gaming environments, however interesting, are really only for testing purposes. Real-world applications require agents to learn far more complicated environments, and depending on how abstract or unknown the challenge is, RL might not be the easiest approach.
RL is best applied to specific, quantifiable goals: for example, teaching self-driving cars how to park, change lanes, overtake other cars, and more.
The tech is also being used in factories, where robots can not only perform tasks more efficiently than humans, but without risk of injury. Google has used RL to control the cooling of its data centres without human intervention, cutting the energy its cooling systems use by up to 40%.
In trading and finance, an RL agent can be trained to decide whether to hold, buy, or sell stocks using market benchmark standards, removing the need for analysts to make every decision.
Other applications in the future include diagnosing medical conditions, smart prosthetic limbs, and fully automated factories. It’s not an easy technology to implement, but with time, it could be the driving force of future technology.