
What Do Chihuahuas and Muffins Have to Do with AI?

This article by Gary M. Shiffman is published as a series with accompanying video shorts. View them here.


I used a popular internet meme of “chihuahuas and muffins” a few years ago in a series of lectures to explain Artificial Intelligence and Machine Learning (AI/ML). This meme-as-teaching-aid has become widely referenced. I am revising and posting the content here to serve as an easy reference.




[Karen Zach, March 9, 2016. https://twitter.com/teenybiscuit/status/707727863571582978]



My initial motivation was to empower regulators and those in regulated industries like financial services to understand AI/ML, so that the benefits of innovation could improve performance in the public safety missions they perform. But this content can benefit anyone in any technology-lagging sector of the economy. 


Small groups of creatives, coders, and developers invent amazing technologies all the time. But the world only changes when ordinary people – not just the math, design, and coding people – trust, adopt, and use these technologies. One of the first steps to getting people to trust and adopt a new technology is to teach them how to measure its performance. 


When you type the word “chihuahua” into an internet search bar, you might take for granted the countless images of chihuahuas. It seems easy for the search algorithm, and it is easy for you to evaluate the results. You see all chihuahuas and almost no “not-chihuahuas”. But you may also see a bias in the results (all the chihuahuas are dark-haired, or light-haired, or long-haired, etc.).  


Replace “chihuahua” with “human trafficker” or “money launderer,” and imagine the importance of innovation in industries which fight crime and exploitation, such as financial crimes compliance. 


But measuring algorithm performance in complex areas of human behavior is not as easy as looking at images of chihuahuas. This is why we have methods for measuring accuracy and bias:  


Accuracy is a simple numbers game; no complex math required. Accuracy allows us to talk about the combined efficiency and effectiveness of a tool.  


Bias is what most people refer to when they talk about the dangers of moving into a fully automated world – for example, when Amazon designed a recruiting algorithm that preferred men over women, or when a criminal justice program incorrectly identified Black defendants as higher risk for recidivism and incorrectly identified white defendants as lower risk. 


These are examples of machine bias. Human bias occurs constantly, and one bias often perpetuates the other. As a human, when making a decision without the certainty of facts, you default to bias. When unsure of the accuracy or biases of a new AI/ML system, you make decisions about which technologies to deploy based on what your community of peers is doing.  


In this series, I will try to light the way toward the facts. I will open up the black box and explain what machine learning really means; the role of training and testing data; the role of the human in establishing thresholds to calculate accuracy; and what blueberry muffins have to do with all of this.




The Data is the Algorithm 


Look again at the chihuahua-muffin images from the previous section; can you identify the chihuahuas? Easy enough. But what if you had to identify all the chihuahuas out of an array of 200,000 or 20 million images? Assume for a moment that your employer has an important reason for this task. For the purposes of this thought experiment, assume “chihuahua” represents a searched-for behavior, such as human trafficking, risky correspondence, or sanctions violations.  






For a human, searching across entire customer populations would be possible but not feasible. Nobody would want this job, and error rates would be high. In the real world of financial crimes compliance, banks have historically avoided these searches across entire populations and relied upon rules-based alerts instead. But enter Machine Learning (ML), and the overwhelming, or seemingly impossible, task becomes feasible.


Machine Learning is learning by example (“inductive”). People building ML algorithms need examples. Want to identify chihuahua images? Feed an algorithm many examples of chihuahuas. More specifically, task humans with identifying a large population of chihuahuas, label these as such, and then identify a large sample of images of “not-chihuahua”, and label them as such. That’s it. The data is the algorithm. If your sample size is large and properly labeled, you’ve got a good algorithm. The very best algorithms are those trained on the “most best data” – the training data.  
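To make the point concrete, here is a minimal Python sketch of learning by labeled example. The features, labels, and use of scikit-learn are illustrative assumptions for this article, not the tooling behind any real system described here.

```python
# A minimal sketch of "the data is the algorithm": the classifier is built
# almost entirely from human-labeled examples. The data here is random and
# purely illustrative; in practice X would hold real image features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 64))     # 1,000 labeled "images", 64 features each
y_train = rng.integers(0, 2, size=1000)   # human-assigned labels: 1 = chihuahua, 0 = not

# "Training" just fits parameters to the labeled examples --
# change the examples (or their labels) and you change the algorithm.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The trained model scores any new image by how "chihuahua-like" it looks.
new_images = rng.normal(size=(5, 64))
print(model.predict_proba(new_images)[:, 1])  # confidence each image is a chihuahua
```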


Why muffins? Because muffins challenge the algorithm. Muffins represent the “not-chihuahuas”, and the food-dog meme entertains because of the similarities. The training data of chihuahua images creates the chihuahua discovery tool – an algorithm able to distinguish chihuahua images from across the internet index. 


What could go wrong? Not enough training data or bad training data. The algorithm is the data, so if you only have 10 chihuahua images, your algorithm will likely miss most target images across a large population of possible chihuahuas. In a large training sample, if another dog breed such as “pug” is improperly labeled as “chihuahua”, then any algorithm trained on this large set of pug and chihuahua images will learn the error. An algorithm has no consciousness, like a child might have; teach the computer that “chihuahua” = pug or chihuahua, and the algorithm will work and identify both. In this instance, the algorithm has picked up a “pug” bias. 


Consider the consequences of these training data errors when working in high-consequence fields. A chihuahua mistaken for a muffin means a lot more when a crime-fighting team uses an algorithm which identifies a “not-drug-trafficker” as a likely “drug trafficker.” 


If you work in the financial crimes and compliance world, then to identify a human trafficker or an elder fraud scammer, you need to build an algorithm using the “most best” training data – properly labeled examples of known criminals. The more known criminal data available for training, the better the algorithm, because the data is the algorithm. 


What does “better” mean, and can this be measured? In the next section, I will discuss the importance of testing data in building and evaluating AI/ML algorithms. 




Testing Data and Drawing the Threshold 


Previously, I introduced Machine Learning (ML), training data, and the source of accuracy and bias, and I made assertions about building “better” algorithms. Now, let’s unpack “better” and how to measure algorithmic performance.  


Remember,  in this series, “chihuahua” can stand in for anything you seek to discover. You created a large sample of properly labeled data, the training data, and fed that to an algorithm, creating a chihuahua algorithm.  


The output of any ML algorithm is a distribution. Along the x- or horizontal axis, you have a measure of chihuahua-ness, sometimes referred to as the algorithm’s confidence in “predicting” the entity to be chihuahua. Along the y- or vertical axis, you have the count of entities. 
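A small sketch of what that output looks like, using simulated scores on a 0-to-10 scale (the same scale used for the threshold example later in this article):

```python
# The algorithm's output is a distribution: a "chihuahua-ness" score per entity,
# and a count of entities at each score. Scores here are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
scores = np.clip(rng.normal(loc=4, scale=2.5, size=500), 0, 10)  # 500 hypothetical entities

counts, edges = np.histogram(scores, bins=np.arange(0, 11))
for left, count in zip(edges[:-1], counts):
    print(f"score {int(left)}-{int(left) + 1}: {'#' * (count // 5)} ({count})")
```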





Once the algorithm creates the distribution, the human must perform the single most important task: draw the threshold. In my decade-plus of working with ML systems, this is perhaps the most misrepresented aspect of the art of deploying AI/ML technologies into high-consequence operational environments.


Machines have no conscious awareness of right and wrong; humans must supply that judgment. How many images must be treated as “alerts” and sent for human review? A data scientist might say that the algorithm “predicted” which entities are of interest to the operator. But the prediction requires a threshold, and a threshold depends upon particular risk profiles and risk preferences. The machine only creates the distribution using training data provided by the humans. The human makes the next move: drawing the threshold.
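In a sketch (again with simulated scores rather than real output), the threshold is nothing more than a human-chosen cut-off applied to the machine's scores:

```python
# The machine produces scores; the human chooses the threshold that turns scores
# into alerts for review. Scores are simulated here purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
scores = np.clip(rng.normal(loc=4, scale=2.5, size=500), 0, 10)  # hypothetical scores

for threshold in (6, 8):             # two different risk appetites
    alerts = scores >= threshold     # "predicted positive" = sent for human review
    print(f"threshold {threshold}: {int(alerts.sum())} of {len(scores)} entities alerted")
```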


In a small population, a person can easily identify the chihuahuas from the not-chihuahuas. But to find the sought-after pattern across a large data set, the ML goes back to work, using the labeled data to create test data.





In the image here of 500 entities in a post-algorithm distribution (the test data), only the labeled chihuahua images appear in color for the purposes of this article; the computer can “see” the labels. The human-drawn threshold tells the system to treat scores of eight and above as if chihuahua, and seven and below as if not-chihuahua. Now we can measure performance.


First, we count True Positives, False Positives, True Negatives, and False Negatives.  


Above (right of) the threshold = Predicted Positive

Chihuahuas above the threshold = True Positive

Not-chihuahuas above the threshold = False Positive


Below (left of) the threshold = Predicted Negative

Chihuahuas below the threshold = False Negative

Not-Chihuahuas below the threshold = True Negative


Looking at the image, every chihuahua above the threshold is a true positive – the algorithm-human team got it right. Everything above the threshold that isn’t a chihuahua is a false positive.


Similarly, “not-chihuahuas” below the threshold are true negatives – a win for team algorithm-human. All chihuahuas below the threshold are false negatives – human traffickers and money launderers that evaded us, again. Counting and some simple math gets us to the measurement of accuracy:  effectiveness and efficiency.
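Those four counts require nothing more than comparisons against the threshold. A short sketch, with simulated labels and scores on the same 0-to-10 scale and the threshold drawn at eight:

```python
# Counting True/False Positives/Negatives on labeled test data, given a threshold.
import numpy as np

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=500)                             # 1 = chihuahua (human label)
scores = np.clip(labels * 3 + rng.normal(4, 2, size=500), 0, 10)  # chihuahuas tend to score higher

threshold = 8
predicted_positive = scores >= threshold

tp = int(np.sum(predicted_positive & (labels == 1)))    # chihuahuas above the threshold
fp = int(np.sum(predicted_positive & (labels == 0)))    # not-chihuahuas above the threshold
fn = int(np.sum(~predicted_positive & (labels == 1)))   # chihuahuas below the threshold
tn = int(np.sum(~predicted_positive & (labels == 0)))   # not-chihuahuas below the threshold
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```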



How Do You Know If It Is Working? Measuring the Accuracy of AI


Accuracy is a measurement of both effectiveness and efficiency. Effectiveness measures performance of the task – for example, finding drug traffickers. Efficiency measures the amount of work needed for a level of performance – how hard one works to find the next trafficker.   


Regulators usually demand effectiveness, and workers and corporate leadership usually seek ever-increasing efficiency. People on the front lines usually talk in terms of the number of “false positives” they must deal with each day. 


Think of “efficiency” as the number of accurate predictions above the threshold as a percentage of all entities above the threshold. Of everything predicted chihuahua, what percent were actually chihuahuas (true positives)?


Efficiency = True Positives ÷ All Positives (True Positives + False Positives)


In an average bank compliance department, one might expect to see efficiency of about 5%. This means that in the financial crimes space, about 5% of the cases predicted positive are true positives, and 95% are false positives: five chihuahuas found for every 100 cases reviewed. This is a difficult job that ML will improve, for sure.


Think of “effectiveness” as the number of accurate predictions as a percentage of all possible cases of interest. Of all of the chihuahuas, how many did the algorithm move to the right of the threshold? 20 chihuahuas found out of 25 is 80% effective. 


Effectiveness = True Positives ÷ (True Positives + False Negatives)
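Both measures reduce to a few lines of arithmetic. A sketch using hypothetical counts chosen to match this article's illustrative numbers (20 chihuahuas found among 400 alerts reviewed, with 5 missed):

```python
# Efficiency and effectiveness from the four counts -- counting and division only.
def efficiency(tp, fp):
    """Of everything predicted chihuahua, what share really were chihuahuas?"""
    return tp / (tp + fp)

def effectiveness(tp, fn):
    """Of all the chihuahuas out there, what share did the algorithm find?"""
    return tp / (tp + fn)

# Hypothetical counts: 20 chihuahuas found among 400 alerts reviewed, 5 chihuahuas missed.
tp, fp, fn = 20, 380, 5
print(f"efficiency:    {efficiency(tp, fp):.0%}")    # 20 / 400 = 5%
print(f"effectiveness: {effectiveness(tp, fn):.0%}")  # 20 / 25  = 80%
```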


Two factors impact efficiency and effectiveness, one controlled by the operators and the other by the model trainers:


First, where the operators draw the threshold for a given algorithm can increase effectiveness at the cost of efficiency, or increase efficiency at the cost of effectiveness. This decision must be made by humans, based upon the risk appetite and risk profile of the institution’s leadership.


Second, improving the algorithm will move more chihuahuas to the right and more not-chihuahuas to the left. More and better training data will help the modelers do this. With better separation like this, both efficiency and effectiveness improve, making work life better for the operator and buying down risk at a lower cost for the institution.


Efficiency and effectiveness can be computed with some counting, addition, and division; it’s not a difficult topic. Anyone can speak with confidence on AI/ML performance, even without knowing the details of deep learning and neural networks. 


But one challenge remains to be understood:  how to think about bias.



Bias Isn’t a “Given” in AI


Thinking about measurement of AI/ML in terms of chihuahuas and muffins, it turns out, is pretty easy and easy to remember. Efficiency and effectiveness described in numbers enable comparison. How does the challenger system compare to the system in use? One might say the existing system identifies 10 out of 20 criminals for every 100 reviewed, or 50% effectiveness and 10% efficiency. 


Existing System

10/20 = 50% effective

10/100 = 10% efficient


AI/ML Challenger System

18/20 = 90% effective

18/50 = 36% efficient
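A quick check of those illustrative numbers in code (all counts hypothetical):

```python
# Comparing the incumbent process and the AI/ML challenger on the same two measures.
systems = {
    "existing":   {"found": 10, "criminals": 20, "reviewed": 100},
    "challenger": {"found": 18, "criminals": 20, "reviewed": 50},
}
for name, s in systems.items():
    effectiveness = s["found"] / s["criminals"]  # share of all criminals identified
    efficiency = s["found"] / s["reviewed"]      # share of reviewed cases that were criminals
    print(f"{name:10s}  effectiveness={effectiveness:.0%}  efficiency={efficiency:.0%}")
```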


Based upon accuracy alone, go with the challenger. Easy decision. But how do we know if the new AI/ML-based system reflects some bias that we cannot accept? Using the same analogy, what if all the chihuahuas found were dark-haired? Or old? 


With all of the media attention given to bias in AI, a lot of people assume bias is a “given”. It’s one of the most frequently cited arguments by those who fear AI technology. But bias doesn’t have to be a standard part of all AI/ML output; it all goes back to the data. 


How do you check for bias? Here is a question to start the dialogue: what are the age, gender, and color of the chihuahuas found?

 

If the algorithm found 50 chihuahuas and missed 50 out of 100 possible chihuahuas, and the 50 it found were all dark-haired chihuahuas, then the output has a bias. This again seems like a role for humans – reviewing the output. 
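A minimal sketch of that human review, breaking effectiveness out by subgroup; the records below are invented to mirror the example above:

```python
# A simple bias check: compute effectiveness separately for each subgroup.
# 'found' means the algorithm scored that chihuahua above the threshold.
chihuahuas = (
    [{"hair": "dark", "found": True}] * 50 +    # the 50 chihuahuas found are all dark-haired
    [{"hair": "light", "found": False}] * 50    # the 50 light-haired chihuahuas were all missed
)

for group in ("dark", "light"):
    members = [c for c in chihuahuas if c["hair"] == group]
    found = sum(c["found"] for c in members)
    print(f"{group}-haired: {found}/{len(members)} found ({found / len(members):.0%} effective)")
```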


Because the algorithm is the data, you will find that your training data were imbalanced, with an over-representation of dark-haired chihuahuas. In the AI/ML world, the bias is not a direct human bias entering the process, but a bias in the training data. However, it’s important to note that the training data may have been generated by biased human processes in the past. Biased algorithms come from the training data. Biased training data comes from biased humans.


Amazon’s doomed hiring algorithm was biased toward hiring men because most of Amazon’s employees were men when the algorithm was built. A criminal justice program incorrectly identified Black defendants as higher risk for recidivism and white defendants as lower risk not because the algorithm miscomputed, but because the algorithm properly learned from training data reflecting biases in the underlying criminal justice system.


Both of these examples show how good intentions and good math but bad training data can produce inappropriately biased, but accurate, algorithms. The lesson is that measuring accuracy alone is not enough. 


It is important to note that incumbent systems have plenty of biases, so choosing not to deploy innovation likely reduces accuracy but does not reduce bias. Eliminating bias in the training data eliminates bias in the output while capturing the accuracy advantages of AI/ML. But this requires human involvement.


Measuring accuracy is easy:  use Test Data, draw a threshold, and using True Positives, True Negatives, False Positives, and False Negatives, calculate Efficiency and Effectiveness. Recognizing bias and making corrections, however, requires humans. 


Create an interdisciplinary group from across your organization, and have a discussion. Ask people with different perspectives if the results appear biased. If someone identifies an unexpected number of older chihuahuas, or long-haired ones, then you’ve identified a bias. The cause will lie in the training data. The standard for innovation should not be perfection, but improvement. AI/ML systems can empower front-line workers to discover threats such as human and drug trafficking, without violating the security, liberty, and privacy of others. It’s as easy as chihuahuas and muffins.

Gary M. Shiffman, PhD, is an economist working to solve problems related to human violence. A Gulf War veteran and former Senate National Security Advisor, Chief of Staff at US Customs and Border Protection, DARPA Principal Investigator, and Georgetown University professor, he founded two technology companies, Giant Oak, Inc, and Consilient, Inc. He is the author of The Economics of Violence (2020), and his essays have appeared in media outlets such as The Hill, the Wall Street Journal, USA Today, TechCrunch, and others.
