Sometimes too convenient: AI under the microscope
Scientists from Tübingen and Toronto call for stronger testing procedures for algorithms
The achievements of artificial intelligence (AI) are usually presented as success stories: AI translates texts with impressive accuracy and in some cases detects cancer better than doctors do. It is often forgotten, however, that AI also makes mistakes. An international team of scientists has catalogued the various forms these mistakes take in a perspective article. In it, they focus on how AI learns and how things can go wrong even when the algorithms score well in standard testing procedures.
The interdisciplinary team of scientists from the University of Tübingen (Robert Geirhos, Claudio Michaelis, Wieland Brendel, Matthias Bethge, Felix Wichmann) and the University of Toronto, Canada (Jörn-Henrik Jacobsen, Richard Zemel) observed that many AI errors occur covertly, as it were, and initially go unnoticed. At first glance, for example, an algorithm appears to be excellent at recognizing which animals are in a photo. Only on closer inspection does it turn out that the algorithm has found a far more convenient strategy and sometimes simply attends to the background: an empty green hilly landscape is blithely labeled a “herd of grazing animals” by the AI, because animals often appear in front of a green background, while a cow in an unusual setting, on a beach for instance, is not recognized at all. What at first seems like an amusing mistake can in fact have serious consequences, because similar AI methods are also used, for example, in cars with modern driver-assistance systems to recognize pedestrians, and are being tested in medicine for the early detection of cancer.
This “shortcut learning” is not a phenomenon unique to AI systems, though. On the contrary, such learning shortcuts are also observed in nature, humans included. One example: students in history class who merely memorize dates without developing a deeper understanding of historical context, just enough to meet the requirements of a specific class assignment. Or, from the animal world: in an experiment in which rats had to find their way through a maze with differently colored walls to reach a goal, the animals quickly became unerring, finding their way far better than the experimenters had expected, since this type of rat has only a poorly developed ability to distinguish colors by sight. Only when the test results were examined closely did it become clear that the animals were not using their eyes to navigate the maze at all. Instead, they had learned the right route thanks to their highly developed sense of smell, because the differently colored walls also smelled different.
Analogously, AI systems often take whatever route to the goal is most convenient for the system at the time. For pictures of cows in different landscapes, this may be harmless. But the behavior becomes problematic when, for example, banks use algorithms to grant loans. In the U.S., an applicant’s zip code can be used to infer the applicant’s social environment, and from that the likelihood of repayment, i.e., creditworthiness. The shortcut here is that a mere correlation, via the zip code, acquires decisive weight: the algorithm cannot determine whether there is a causal relationship between the zip code and the customer’s actual creditworthiness.
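The mechanism behind such shortcuts can be made concrete in a few lines of code. The sketch below is purely illustrative (a minimal pure-Python logistic regression on made-up data, not the researchers’ actual experiments): a classifier is trained on data in which a spurious cue, standing in for the zip code or the green background, is perfectly correlated with the label, while the “true” signal is noisy. The model latches onto the convenient cue, so its accuracy is high on similar test data but collapses as soon as the correlation is broken.

```python
import math
import random

random.seed(0)

def make_data(n, correlation_holds=True):
    """Toy binary task: feature 0 is a noisy 'true' signal,
    feature 1 is a clean but spurious cue (zip code, background, ...)."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        true_signal = y + random.gauss(0, 1.0)   # causal, but noisy
        cue = y if correlation_holds else 1 - y  # spuriously correlated
        data.append(([true_signal, cue], y))
    return data

def train_logreg(data, lr=0.1, epochs=100):
    """Plain logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            g = p - y  # gradient of the log loss
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def accuracy(model, data):
    w, b = model
    hits = sum((w[0] * x[0] + w[1] * x[1] + b > 0) == (y == 1)
               for x, y in data)
    return hits / len(data)

model = train_logreg(make_data(500))
acc_iid = accuracy(model, make_data(500))                           # high
acc_ood = accuracy(model, make_data(500, correlation_holds=False))  # collapses
print(f"i.i.d. test: {acc_iid:.2f}, cue flipped: {acc_ood:.2f}")
```

In this toy setting the classifier scores near-perfectly as long as the cue holds, and far below chance once it is flipped: exactly the pattern of the cow on the beach.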
In their paper, a perspective article published in the journal Nature Machine Intelligence, the researchers describe the common pattern behind many of these anecdotes reported by other researchers, and they call for greater scrutiny of AI in the future. Good scores in mainstream machine-learning benchmarks are not sufficient to guarantee good everyday usability. Specifically, the authors propose developing and applying stronger test procedures in AI research. “The current standard test procedures are too weak, because algorithms are often only tested on data similar to their training data,” says Robert Geirhos, a doctoral student at the University of Tübingen and lead author of the paper. “We therefore advocate testing AI systems on much more demanding data sets, in which the unexpected is queried and transfer performance is thus required of them.”
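A stronger test procedure of the kind the authors call for can be sketched as follows. Everything here is illustrative: the names are made up, and the one-line “classifier” is a deliberate caricature of background-based shortcut learning. The harness scores a model both on an i.i.d. test set and on a shifted one (the cow on the beach) and flags a large accuracy gap as a possible shortcut.

```python
def evaluate_transfer(predict, datasets, max_gap=0.10):
    """Score a classifier on an i.i.d. test set and on shifted
    (out-of-distribution) test sets; a large accuracy gap between
    the two is a hint of shortcut learning."""
    scores = {name: sum(predict(x) == y for x, y in samples) / len(samples)
              for name, samples in datasets.items()}
    worst_shifted = min(s for name, s in scores.items() if name != "iid")
    return scores, scores["iid"] - worst_shifted > max_gap

def predict(image):
    """Caricature of a shortcut learner: classifies 'cow' purely
    from the background and ignores the animal entirely."""
    return image["background_green"]

iid_test = ([({"background_green": 1}, 1)] * 8 +   # cows on pasture
            [({"background_green": 0}, 0)] * 8)    # empty beaches
beach_cows = [({"background_green": 0}, 1)] * 8    # cows on the beach

scores, suspicious = evaluate_transfer(predict, {"iid": iid_test,
                                                 "beach": beach_cows})
print(scores, suspicious)
```

The shortcut learner is flawless on the i.i.d. test set yet scores zero on the beach set, so the harness flags it; a model that had actually learned to recognize cows would show no such gap.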
After all, until one has checked whether an algorithm can handle unexpected images, such as a cow on the beach, one has to at least consider the possibility that it took a shortcut on the way to the answer. “Fortunately, however, ‘taking a shortcut’ is not the only option for an algorithm, just the most convenient one,” Geirhos points out. “If you challenge AI appropriately, it can certainly learn highly complex relationships.”