Sendhil Mullainathan, "Discrimination by Algorithm and People"

Presented at the Legal Challenges of the Data Economy conference, March 22, 2019.


SENDHIL MULLAINATHAN: Thank you. It's a real delight to be here, just even walking into the building at such a beautiful place. Thank you for having me. I just hope no one's expecting me to play the piano. That was the only intimidating thing.

Let me start with the problem that I think is one of the big issues. It was in Omri's talk: discrimination. I think that if you look at the literature right now, the popular conversation around algorithms, I think it's not an overstatement to say, even bigger than privacy, the biggest issue that seems to dominate the public discussion is issues of machine bias.

This was a ProPublica article that I think had a very large effect. And it wasn't just this article. Many of you may have seen Cathy O'Neil's book. The White House put out a report in 2016. You'll notice "big data." The big thing at the end of it is "civil rights." It is a striking discussion that we have currently all around algorithms and discrimination.

In fact, recently-- I don't know how many of you saw this-- but there was a news report that got a little bit of press around how Amazon had implemented an algorithm for hiring, and they had to get rid of it because the algorithm turned out to be sexist, was the claim. So they actually stopped the entire thing.

What I want to talk about today is this issue of discrimination, and I want to talk about it in one category of application. I think one of the most difficult things in working in this area is that the word "algorithm" is used for just a wide variety of activities that algorithms do that have nothing to do with each other, really.

So I want to talk about one category that's pretty big but is also pretty specific. And this is a category that we sort of have called "prediction policy problems." Let me walk you through what these are. These are problems where there are some decisions to be made, and that decision has the structure that we're going to predict some outcome using some inputs.

And the prediction of that outcome is what dictates the decision. And we're interested in cases where there's large amounts of data, and this is, I think, the ideal-use case for machine learning and for AI. I'll give you some examples. Hiring fits this.

The Amazon case was a machine-learning case. Because, after all, the decision of who to hire involves looking at some inputs, their résumé, and forecasting some output, how will they do on the job, and you've got lots of résumés.

Pretrial release, which I'll talk about in a second, satisfies this condition. I look at the defendant, I have to make a decision, what is their flight risk? What is their public safety risk? And those predictions are what dictate my decision, release or not release.

Education decisions often satisfy this condition. So as an example, who do we admit to a college? Same problem. There's a sheer, large array-- once you realize this category of problem, you realize so many things have this structure. Credit decisions. We can just keep going. That category is what I'm going to focus on today.

And I want to flag what this does not include. There is a huge set of things you might encounter that this is not about. It's not exhaustive. An example is representational harm. Let me give you this example in the case of discrimination.

If you went on Google and you did a search for men on Google Images, you can do this-- they might have fixed it in the last six months, but this, as recently as 2018, was true. You can search for women. I don't know if anyone notices something odd about all the images here. It's not a representative feature of the United States, much less the world.

Here's another one, which is more striking. If you search for "unprofessional hairstyles" on Google Images, and if you search for "professional hairstyles," again, you might notice an interesting pattern. These are representational harms. Algorithms do many things in collecting data together and replaying them to us. This is not a prediction problem. So I'm not going to talk about these. But in this category of representational harms is where a lot of things start to happen. We're not gonna talk about that. We'll focus on the prediction policy case.

In that case, here's what I'd like to do. I want to just give you a sense of where I'm coming from and how I'm struggling with this issue that's underneath everything I just showed you. Are algorithms a force for bad? That's what I'm struggling with a little bit. And "bad" in this very specific and socially important sense of exaggerating or contributing to the inequalities that we see in the world.

So the way I'm going to do that is just by giving you a sense of where I come from. I've worked on two projects in my life that have given me a tension on this issue, so I want to give you the same tension-- that over the last two, three years, I'd say a team of us have been struggling with these issues and three insights have arisen, and I'll try and convey those three insights. They're not the insights that I think you get-- at least we didn't get them when we started this project, and I still think they're underrepresented in discourse. So let me start with where I'm coming from.

The first place where I'm coming from takes me back-- I don't want to list the number because it's a long number-- many years, let's just say-- to one of the earliest projects I worked on was this project on discrimination, where-- here we go-- what we did was, we took résumés such as this, and we sent out thousands of these résumés to employers. But before we did that, for half of them, we put a name on like "Brendan Miller."

And for the other half, we put a name on like "Jamal Jones." For those who are not familiar, "Jamal" is a very clearly and distinctively black name in the United States. "Brendan" is a distinctively and clearly white name. And we had many such names.

What we were interested in was, is there discrimination in the United States workplace? Because these résumés are the same. All we've changed is the name. And we measured how many callbacks for interviews people got. So the résumés on the left, if Brendan gets more callbacks than Jamal, what could be the cause other than their name? After all, it was truly just randomized.

Somewhat disturbingly, what we found in this paper was there was a pretty big gap. White names elicited about 8.5% callback. Black names elicited about 6.19% callback. The gap gets much bigger if you look at résumés that have higher skill.

It's about 10.7% versus 6.7%. Just to put this in perspective, it means whites get called back 50% more, armed with the same résumé. So if you just translated that into unemployment lengths, you would see blacks have about a third to 50% difference in unemployment.
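
The size of the relative gap can be checked directly from the callback rates as quoted in this transcript. The sketch below is only arithmetic on those quoted figures; the helper name `relative_gap` is hypothetical. The two pairs of rates give relative gaps on either side of the "50% more" figure.

```python
def relative_gap(rate_a, rate_b):
    """How much larger rate_a is than rate_b, as a fraction of rate_b."""
    return rate_a / rate_b - 1.0


# Callback rates in percent, as quoted in the talk (white names, black names).
overall = relative_gap(8.5, 6.19)      # all résumés
high_skill = relative_gap(10.7, 6.7)   # higher-skill résumés

print(f"overall: {overall:.0%} more callbacks")
print(f"higher-skill: {high_skill:.0%} more callbacks")
```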

This is an old paper by now. There has been more recent work done in the same vein, quite a bit of work, like extremely interesting work. And I would encourage you all to read up on it. These audit studies have become an amazing technology for auditing discrimination. Here's one. These researchers simply went onto eBay and sold a bunch of iPods, OK? So that's the first condition. They just posted those photos.

In another condition, they just changed one small thing-- the hand of the person holding the iPods from a white hand to a black hand. There was a third condition that is somewhat obscure: for whatever reason, the hand had tattoos. I've never understood this condition, but I sort of get it. And all they measured off of these thousands of iPod sales was what was the rate at which these iPods would sell.

And what you found was about 3.8% and 3%. Don't get a tattoo. That's the first lesson. Second lesson is, don't be black in the United States, much less. And there are many such studies like this. It is disturbing, and it is a litany of disturbing facts. So that's one hat I wear. Let me tell you what I took away from that. Human bias is rampant.

So let me try another study we did much more recently. This is on pretrial. I learned a ton about the law in this thing, partly because I knew very little to begin with. But it's easy to learn when you start at zero. One of the things I learned was that, in the United States, there are about 12 million arrests, which is, for me, a mind-boggling number-- every year, 12 million.

The other thing I learned was that one of the most important decisions that happens happens within 24 hours of arrest. The person who's arrested, a judge has to decide, will they be sent home while they wait for trial, or will they have to wait in jail? In the US judicial system, it is a very consequential decision, whether you're sent home or wait in jail, and here's why.

About 3/4 of a million people are in jail at any point in time. That's about a third of the incarcerated population in the United States, which means a third of people who are incarcerated are simply waiting for trial. They're not people who've been found guilty of anything. They're just waiting. And they're going to wait a pretty long time. In most jurisdictions, the average wait is two to three months. In some jurisdictions, the average wait is 9 to 12 months, which is why, for many crimes, even a guilty finding is simply time served.

And I just want you to look at this number. It's very easy to brush past this two to three months. Remember, you're put in jail not because of the guilt of the crime. The law says you are put in jail to wait because you pose a flight risk or a public safety risk.

Now imagine having to call your employer and say, uh, I won't be back. I won't be back for two to three months but hold my job. Even a week of jail is pretty much enough to lose your job. Think of the consequences of this decision if a person is jailed.

Conversely, think of the consequences of this decision if we have somebody in custody whom we release, and they go on to commit a crime, a murder or rape, when we had them right there. Both sides of the equation are loaded, and it's a complex decision, which is why we have judges taking their time to make this decision.

So let me tell you what we did. It is also a case that this decision is perfectly made for machine learning algorithms. Because the judge is asked to do one task, not to adjudicate guilt or innocence, not to make decisions of justice and what would be fair. Instead, it is by law to look at the person and assess their flight risk in some states; and in other states, their flight risk and their public safety risk. It's a pure prediction problem by construction and by law.

It is also a prediction problem for which we've got, thanks to the United States' tendency to incarcerate a lot of people, a lot of data. We've got millions of observations of what happens to people when we release them.

So we trained an algorithm to form a prediction of who poses the biggest risk. And then using that algorithm, we simulated what would the algorithm do on any one person. The algorithm, by itself, is just a predictor. Society has to make the judgment of where to draw the threshold. So here's what I've done in this graph.

I've taken everybody in the data set. I've rank-ordered them by predicted risk-- let's say public safety risk for a minute. And on the left is the outcome if you release nobody. That is, of course, obviously, if I release nobody, I get no crime.

As I go to the right, I'm releasing more and more people starting with the lowest risk people and walking my way to the right, which is why the crime rate keeps going up. So this is the algorithm's performance curve. Where we choose to be on this curve is a moral choice about how we trade crime and release rates. The algorithm gives us the curve.

What's important, though, is that we're already on a particular point. We have a status quo. The status quo is that we release 74% of people-- this is data from New York-- 74% of people with a crime rate of 11.3%. So what would happen if the algorithm also released 74% of people? We can read off the curve and realize it would produce far less crime. Put differently, we could keep jail populations the same and reduce crime rates by about a quarter.

To me, the more interesting case is, well, you were willing to put up with 11.3% crime in the current status quo. So why don't we go from there all the way to the right and release people until we get to 11.3%? In that case, we can get the same crime rate but have about 40% fewer people in jail, which is an astonishingly large number.
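
The curve and the two comparisons above (same release rate with less crime, same crime rate with more release) can be sketched on toy data. This is an illustration under stated assumptions, not the method of the paper: the ten defendants, their risk scores, and their outcomes are invented; the crime rate is assumed to be crimes by released people divided by all defendants; and the sketch pretends outcomes are known for everyone, which real data does not allow for people who were jailed. The function names are hypothetical.

```python
def release_curve(predicted_risk, would_offend):
    """Points (release_rate, crime_rate) when releasing lowest-risk people first.

    Assumption: crime_rate = crimes by released people / all defendants,
    so releasing nobody gives a crime rate of zero.
    """
    n = len(predicted_risk)
    order = sorted(range(n), key=lambda i: predicted_risk[i])
    curve = [(0.0, 0.0)]
    crimes = 0
    for k, i in enumerate(order, start=1):
        crimes += would_offend[i]
        curve.append((k / n, crimes / n))
    return curve


def max_release_rate_within(curve, crime_cap):
    """Largest release rate on the curve whose crime rate stays within crime_cap."""
    return max(rate for rate, crime in curve if crime <= crime_cap + 1e-12)


# Invented toy data: 10 defendants, 1 = would offend if released.
risk = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.80, 0.90]
offend = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

# Hypothetical status quo: the judge releases these 7 people.
judge_released = [0, 1, 2, 4, 5, 6, 8]
judge_release_rate = len(judge_released) / len(risk)
judge_crime_rate = sum(offend[i] for i in judge_released) / len(risk)

curve = release_curve(risk, offend)
# Same release rate as the judge: read the crime rate off the curve.
same_release_crime = curve[len(judge_released)][1]
# Same crime rate as the judge: release as many people as the cap allows.
same_crime_release = max_release_rate_within(curve, judge_crime_rate)

print("status quo:", judge_release_rate, judge_crime_rate)
print("ranked, same release rate -> crime rate:", same_release_crime)
print("ranked, same crime rate -> release rate:", same_crime_release)
```

On this toy data the ranked policy halves crime at the judge's release rate, or releases one more person at the judge's crime rate. The numbers carry no empirical weight; only the shape of the comparison matters.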

My co-author in this, Jens Ludwig, he always points out that when you arrive in LaGuardia Airport in New York-- many people don't realize this, but when you arrive at LaGuardia and you drive into Manhattan, if you ever do that, you should look out to the right-- you will see Rikers, a massive facility, which is our entire population.

Jens points out that if we took this point in the curve, that is equivalent to closing Rikers in July. It's unfathomable how large the difference is. So there you have it. I want to just point out, there's some crucial issues, in the interest of time, I'm not going to be able to get to. There are major statistical issues, which is what we have to solve. For example, we don't know what the jailed would have done had they been released. And we have to address that.

These are not just petty crimes on the y-axis. If we focus on murder, rape, and robbery, the effects are bigger, which makes sense. It is harder for humans to make forecasts of rare events. And I also want to point out there are many related problems that are terrible use cases for algorithms. Parole decisions look like this problem; but what's key is that in this problem, only flight risk and public safety risks matter by law.

For parole, recidivism is one tiny input into a much bigger calculus that society must take. And that's going to be important in everything else I show you. For machine learning algorithms to work, our objectives need to be funneled down to one variable that we think is measured and captures nearly the totality of our objectives. The biggest problems in this field come in this last-- the failure mode is the last fact not being true.

OK, having said that, I came into this work thinking human bias is rampant in such decisions, and I walk away thinking, well, the human mind is not perfect at large statistical activities and not great at prediction. So the second paper makes me feel like algorithms can help a lot.

But this is where the tension begins. The data the algorithm uses to train itself is generated by human beings. So if you look at, for example, the inputs, it's prior arrests, yet we know arrests have human bias built into them. So a danger is when algorithms inherit this bias.

So the questions I think I want to talk about in the rest of this talk are, should we fear, given the last point, or should we promote the use of algorithms? Should we design them differently if we're worried about inequities? And who's the "we"? Private actors are going to go about using these algorithms, so how are we to regulate the use of these algorithms vis-à-vis these inequities?

And that's the tension that I sort of find myself in. And that's what I want to explore in the rest of this talk, so let me just give you, in this exploration, three big insights that have come up. And to leave room for questions, I'll try to be fairly quick.

So the first insight comes from the original paper. If this is like closing Rikers in July, how should we think about the inequities created by this algorithm? The first observation I want to make is so banal that somehow people fail to make it. If I were to close Rikers in July, who's going to benefit the most? The populations that are disproportionately in Rikers.

We spend so much of our time thinking about the relative inequities of algorithms, in this case, the first thing most people want to ask is, oh, at the margin, what is the racial disparity the algorithm has? But I think the first question we need to ask is, actually, who's in Rikers? When in fact, blacks are disproportionately there.

And if you include Hispanics, 85% of the population of Rikers come from minority populations. So if the first-order effect of these algorithms is to reduce jail populations, that first-order absolute effect benefits minorities.

I say this because there's a disparate benefit of applying these algorithms in some cases, and we forget the scale impact of these activities. And often, the first-order effect of these things is the scale impact. And this effect shows up in many cases. In credit scoring, we think about the relative impact, but credit scoring has dramatically increased access to credit. And the people who didn't have access to credit actually were disenfranchised populations.

So I'm not saying it's all good. I'm simply just saying this is an effect that we have to consider. We have to consider the absolute impact before we think about the relative impact. And in this case, the absolute benefits are pretty large because the effect is to reduce jail populations.

What about the relative impact? By "relative impact," what I mean is, let's look at any given jailing population. Is the algorithm disproportionately jailing minorities? I told you how the algorithm did on efficiency, the crime rates it produced. But how does it do on equity? Is it more inequitable than the judge? Less inequitable than the judge? So let's look at that. Sorry if these numbers are not readable. I'll walk you through them.

So the first row I have is the judge, holding constant the release rate. The second row is the algorithm, holding constant the judge's release rate. And what you see is that the judge detains 31% of the people in front of them. The algorithm detains a hair more at about 32.38%. So it's a little bit more, and if you're in that 1.3%, it's small consolation that it is a small number. It is still a number.

So that suggests, oh, wow, these algorithms are-- this algorithm, in particular, does discriminate a bit more than the judge. But to this, I want to point out something. Algorithms are inherently literal objects. I think there's a logical error people make when they take these algorithms and then they say, oh, they've discriminated.

Algorithms only optimize the objectives they were given. We said, predict crime rate. We didn't say, predict crime rate and be equitable about it. So we can't build an algorithm to do one thing and judge it on another activity. If we care about equity, which we ought to, why not incorporate that objective into the building of the algorithm? So that's what we did.

Suppose we said, I want equal release rate for all races. And notice, that's not what the judge does. The judge detains a lot more blacks than they do whites. But if I enforce equal release rate across things, that's a great deal more equity than we had before.

What you'll notice is, if I do that, that algorithm produces about the same gain. So we can get effectively the same gains that I showed you before, but also huge equity gains, moving from an equilibrium when African-Americans are much more likely to be detained to an equilibrium in which we have equal detention rates across all races.
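
One simple way to impose an equal release rate is to rank people by predicted risk within each group and release the same fraction of each. The sketch below is an assumption about how such a constraint could be written, not a description of the method used in the paper; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def release_lowest_risk(predicted_risk, release_rate):
    """Unconstrained policy: release the lowest-risk fraction overall."""
    n = len(predicted_risk)
    order = sorted(range(n), key=lambda i: predicted_risk[i])
    k = int(release_rate * n + 1e-9)  # round down to a whole number of people
    return set(order[:k])


def release_equal_rates(predicted_risk, group, release_rate):
    """Constrained policy: release the same fraction within every group."""
    members_by_group = defaultdict(list)
    for i, g in enumerate(group):
        members_by_group[g].append(i)
    released = set()
    for members in members_by_group.values():
        members.sort(key=lambda i: predicted_risk[i])
        k = int(release_rate * len(members) + 1e-9)
        released.update(members[:k])
    return released


def release_rate_by_group(released, group):
    """Share of each group that is released."""
    totals, freed = defaultdict(int), defaultdict(int)
    for i, g in enumerate(group):
        totals[g] += 1
        freed[g] += i in released  # True counts as 1, False as 0
    return {g: freed[g] / totals[g] for g in totals}


# Invented toy data: group "b" has systematically higher predicted risk.
risk = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

unconstrained = release_lowest_risk(risk, 0.5)
constrained = release_equal_rates(risk, group, 0.5)
print(release_rate_by_group(unconstrained, group))
print(release_rate_by_group(constrained, group))
```

Both policies release half of the people; only the second releases half of each group. Whether the constraint costs much in crime is an empirical question that this toy data does not address.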

This second fact is something that I want to highlight, which is that we must explicitly incorporate the objectives we have into the construction of the algorithm. We can have a great deal of positive relative impacts.

In fact, this theme is something we see again and again. We can basically have both efficiency gains and equity gains. And that ought not to surprise anybody. To the extent that the current system is discriminatory, algorithms, in looking for efficiency, it will not cost them much to generate equity because there are a lot of people of minority groups that are being jailed for no particular good reason. So here, we get big gains in both.

So lesson one that I take away is that algorithms can reduce both absolute and relative harm, but it will not happen on its own. If you don't state and build the objective, algorithms can do a lot of relative harm. OK. So we can't simply take equity as a collateral objective, something that we get and look at after the fact. It must be built in. To the extent that we think about regulating it-- I'll end on this-- this must be a regulated objective, if that's what we want.

OK, let me move to the second lesson. This is a paper we just finished. This is on a topic that many of you may be less familiar with, so I'll try and walk you through it. These are called "care coordination programs." It's going to seem a little obscure why I'm talking about health care, but you'll see in a second why it's actually a pretty big deal.

So here's the issue. If you look at health care in the United States, but really everywhere in the world, the people who cost the system a lot are the people who have many chronic conditions. So when you've got diabetes and heart disease, you're going to go through the hospital system a lot, so much so that, in the United States, and this is spreading to other countries, people have introduced these care coordination programs, which are programs that target these people with extra resources.

For example, before going to the hospital, there's a number you can call that's just for you. And there's a person at the other end who knows you and knows your case, and you can tell them what your problem is, and they'll say, you better come in now, or they'll say, no, don't worry about that. Just up your medication.

Or as another example, when you show up at that hospital, there's actually a separate place you go to to manage you. So it's kind of like quite a good thing to get on this program. Now why is this relevant for us? Well, this is a lot of resources. You can't give this resource to everybody. What you want to do is you want to target the patients who are at highest risk. So you want to find the complex patients ahead of time.

So in fact, this is one of those algorithms that is hidden at scale. There's several vendors of these algorithms, but they're already being applied to over 120 million people in the United States, where health records are uploaded and these algorithms make predictions about who is at risk of being the most complex cases.

So what we did was we got access to one of these algorithms. It's very rare that you're able to study an algorithm from the inside. Typically, like with the Google Images, the men and women, we have to audit them from the outside. But here, we're able to study the algorithm from the inside for a variety of reasons. And so we've got access to this live-scaled private sector algorithm, where now, what I want to do is to not just look at, is it biased, but look inside as to what generates any problems we find.

And the way I'm going to look at that is I want to look at the consequences. I want to look at the algorithm's predictions. What does it do? It takes patients and ranks them, predicts how risky they are. And I want to look at the consequences of that risk score for who gets put in these programs. So what I'm going to do is I'm going to ask what kinds of whites and blacks would be chosen for the program and how that relates to their health.

So since the program is allocated by a risk score that the algorithm produces, what I'm going to look at is I'm going to look at the expected risk, the expected health of people at different levels of the risk score for whites and blacks. So let's start with that.

So on the x-axis is the algorithm's risk score. On the y-axis, is a measure of health in the following year. And I've done it for two populations, blacks and whites. And what you see here is something we found for every measure of health that we've looked at, or nearly every measure, that at every level of risk score, blacks are much sicker than whites.

And that's very problematic because what do we do? We draw a threshold and admit people based on that threshold. So if we zoom in on this, what we'd see is, at this auto-enrollment threshold, blacks have about 28% more chronic illnesses than whites.

Now suppose you said, no, no, no. I want to admit so that people who are admitted into the program have equal health risk. Then what you want to do is, let's simulate replacing "healthier whites" with "less healthy blacks." So I want to zoom into this part here. This is what the curve looks like.

So I want to add the sicker blacks and substitute out the far less sick whites that are included. If I engage in that sort of arbitrage-- so as to get to the same level of health, basically-- the fraction of auto-enrolled blacks would go from 17% to about 36%. There's a huge racial disparity in this program. And this is an algorithm that's being implemented-- and not just this algorithm, but this category of algorithm is implemented, as I said, for hundreds of million people.

So how did we get here? How did this algorithm produce such a big inequity? So where's the algorithm going wrong? One way to understand it is to figure out where the algorithm is going right. Here's the same figure, but on the y-axis is not health but is costs.

And now, you'll notice no disparity. What you see is that the algorithm is very well calibrated between races for dollars. In other words, the problem had to do with the variable that we gave the algorithm to optimize. It was asked to optimize costs.

And the problem is that blacks and whites have different costs at the same level of health. Whites have better access to health care. So for the same sickness level, they actually end up costing more in dollars. So we have this perverse outcome where a very particular choice-- it wasn't the innards of the algorithm. It wasn't what variables it used. It was actually the objective that it was given.

And I hope you see a theme that's emerging. The objective that it was given, or the label-- so in the machine learning literature, there's sort of a label, which is the thing the algorithm is predicting. The algorithm just optimized for the label it was given. It was told, here's costs. Find the sickest people. Sickness equals cost. And it did that. But that's not our full objective.
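
The audit described above amounts to comparing groups at the same risk score, once for the label the algorithm was trained on and once for the outcome we care about. A minimal sketch of that comparison follows. The records are invented, and the field names (`score_bin`, `cost`, `chronic_conditions`) and the function name are hypothetical; the toy data is constructed so that the score looks even-handed on cost and not on health.

```python
from collections import defaultdict


def mean_by_bin_and_group(records, field):
    """Average `field` for each (score_bin, group) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        key = (r["score_bin"], r["group"])
        sums[key] += r[field]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Invented records: same costs across groups at each score, different health.
records = [
    {"score_bin": "high", "group": "white", "cost": 1000, "chronic_conditions": 2},
    {"score_bin": "high", "group": "white", "cost": 1200, "chronic_conditions": 2},
    {"score_bin": "high", "group": "black", "cost": 1000, "chronic_conditions": 3},
    {"score_bin": "high", "group": "black", "cost": 1200, "chronic_conditions": 3},
    {"score_bin": "low", "group": "white", "cost": 200, "chronic_conditions": 0},
    {"score_bin": "low", "group": "white", "cost": 400, "chronic_conditions": 1},
    {"score_bin": "low", "group": "black", "cost": 200, "chronic_conditions": 1},
    {"score_bin": "low", "group": "black", "cost": 400, "chronic_conditions": 1},
]

cost = mean_by_bin_and_group(records, "cost")
health = mean_by_bin_and_group(records, "chronic_conditions")

for score_bin in ("low", "high"):
    cost_gap = cost[(score_bin, "black")] - cost[(score_bin, "white")]
    health_gap = health[(score_bin, "black")] - health[(score_bin, "white")]
    print(score_bin, "cost gap:", cost_gap, "conditions gap:", health_gap)
```

A zero cost gap alongside a positive conditions gap at the same score is the pattern the talk describes: the score is well calibrated for the label it was given and miscalibrated for the thing that label was standing in for.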

So the question I want to ask is, why was cost chosen and not health? That's the source of the problem. Why did that happen? And I think there are two theories here. As with all things in economics, one theory is follow the money. So, hey, hospitals and insurers care about dollars. As a society, we care about health. But if you've ever followed health care, all of health care is about the difference between health and costs. So if there are incentive problems everywhere else in health care and there are externalities, maybe the externalities in the system are algorithmically propagating. So that's one view.

Another view, which is somehow often underappreciated as a potential explanation, is misunderstanding. We have this new tool. Algorithms making predictions. Everybody is new to this game. And, you know, mistakes were made.

And here's one way to think about why this might be true. Even in policy conversations, not amongst health care providers, but health care policy researchers, you can find papers that talk about health, and you look in the actual tables, and they're talking about dollars spent. Conversations fluidly move between these two categories.

And here, in the second view, the cost label was chosen through a combination of convenience, it's easy to get, and not understanding the big consequences it would have. And the difference between these views matters because a larger policy objective and how we think about how algorithms will have an impact depends on how you think about these two views.

For me, this is really important because so much of what we do in the algorithms literature, even the bail paper I told you earlier, it's all guesswork about how things will scale. This is not guesswork. This is a thing that's already scaled. So if we want to understand the problems that are going to happen at scale, I think we have to understand problems like this and study them.

I think this is a little bit the canary in the coal mine. So how do we differentiate these views? Now I have some technical empirical work that we did that started to give us an answer. But near the end of this project, we had kind of an insight, which we decided to-- no, "insight" is a strong word. I'd say, a lark.

We said, why not try something. It's far less technical, and that's what I'm going to focus on now. We said, hey, why don't we just contact the manufacturer of the algorithm? That'd be revealing. Let's just tell him, look what we found. And you can imagine they're-- you know-- you can imagine what's going to happen next. No, you made this mistake. And, this is why it's not right.

But we figured, we don't have anything to lose. They can't stop us from publishing this paper. Well, let's just try. And you know what we found? They were super responsive. They went, they got their own data, they worked with us to understand the problem, but they've got 10x the data that we do, so they first revalidated and said, oh, yeah, this is a problem. And then they said, well, how are we going to fix it?

And so, then, we spent a little bit of time on a 3.7-million-person data set nationwide, in which we replicated it. And lo and behold, that is a problem not just in our 200,000-person data set-- and isn't it great that you can say things like, "just our 200,000-person data set"? But anyway, we replicated it nationwide on this 10% of theirs, 3.7 million.

And then, they said, great. Let's develop this method to correct. And then, they found that they can actually-- at scale, they can go from 48,000 excess chronic conditions in blacks versus whites to an adjusted algorithm that they're going to start rolling out, which is about an 84% reduction in that gap, which, to me, is very optimistic and suggests that the misunderstanding view is, at least in this case, a big part of the story.

And I want to pause on that because the misunderstanding view is the kind of view that doesn't get enough credence in the world. We imagine actors are optimizing everything and everything is intentional. The reality is, we're all just learning how to use these new tools. And there's going to be a period where there are going to be errors such as this. And that has implications, implications such as, when we see a problem, let's try to fix it. Let's not scale too fast. Let's try things, investigate, probe, and continue to do things.

OK, but the other lesson I would like to take away from this is actually the original source of the problem is something we should focus a lot on-- the poor choice of labels. That's because, as human beings, we talk in concepts. We say, I have an algorithm that predicts health. Health is not a variable. Algorithms can only predict the literal variables they are given.

So when someone's talking to you about an algorithm, they haven't told you the exact variable in the data set, they actually haven't told you anything. Because algorithms are very literal. And in that gap between the literal variable and the variable you think it's predicting lies a ton of problems.

Go back to this Amazon example. What was this algorithm predicting? There's no such thing as a hiring algorithm. There's an algorithm that predicted something. What do you think this algorithm was predicting? Now I don't work at Amazon. There are only news reports. But do you know what it was predicting? It was a résumé screening algorithm that, given a résumé, it was predicting what would humans do with this résumé? Is this the kind of résumé that led to an interview?

Notice that's very different than predicting productivity. I hope you can all see what happens if you predict what would a human do with this résumé. That's an algorithm built to literally repeat human biases. Imagine if we had built an algorithm that said, I want to predict what the judge would do. That's different than saying, I want to predict ground truth. And this problem of label bias comes in many ways, but I think this is the biggest example.

Now I really do not like the American sport of baseball, but there is a metaphor in the American sport of baseball. If there was a football analogy here, I would have made it. But the analogy is, when players go to swing at the ball, the ball has to come into a small square, rectangle, on the left that is considered a strike if it goes into that region.

And there's a human who looks and guesses at this 90-mile-per-hour, 100-mile-per-hour object, whether it's in that square. Imagine if I was going to automate that activity. What I would want to automate is on the left, come up with an algorithm that produces that rectangle.

Imagine instead I automated it by taking all of the calls that umpires have made as to what was a ball and what was a strike. Well, that's what the heat map on the right is. Umpires are themselves insanely biased. If we simply automate human judgment, we would repeat all the biases of whatever the human had, which is what the Amazon algorithm was doing.

So poor choice of labels, I'll give you one more example of this. If you were building a self-driving car that's trying to decide when to brake, we can all agree that it would be a terrible idea to do it based on when a human would brake. But think of how many algorithms are of that variety. The label, the exact label, matters a tremendous amount.

When you read that Amazon article, and when I read it, we didn't immediately say, wait, what exactly was this thing predicting? It was called, "a hiring algorithm." What the hell does that mean? It doesn't mean anything. OK, lesson two, the misspecification of labels can be huge.
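A minimal sketch of the label point, on made-up data. Everything here is an assumption for illustration: the `Resume` fields, the biased `human_interview` rule, the `productive` rule standing in for ground truth, and the toy `fit_cutoffs` learner. It is not a description of the Amazon system. The same learner is trained twice, and the only thing that changes is the label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resume:
    skill: int   # hypothetical 0-9 quality score
    group: str   # hypothetical demographic marker, "A" or "B"


def human_interview(r: Resume) -> bool:
    # Assumed biased screener: group B has to clear a higher bar.
    bar = 5 if r.group == "A" else 7
    return r.skill >= bar


def productive(r: Resume) -> bool:
    # Assumed ground truth: productivity depends on skill only.
    return r.skill >= 5


def fit_cutoffs(resumes, label):
    """Toy learner: per group, pick the skill cutoff that best reproduces `label`."""
    cutoffs = {}
    for group in {r.group for r in resumes}:
        rows = [r for r in resumes if r.group == group]

        def errors(cutoff):
            return sum((r.skill >= cutoff) != label(r) for r in rows)

        cutoffs[group] = min(range(11), key=errors)
    return cutoffs


resumes = [Resume(skill, group) for group in "AB" for skill in range(10)]

# Same data, same learner, different label.
model_on_human_label = fit_cutoffs(resumes, human_interview)
model_on_ground_truth = fit_cutoffs(resumes, productive)
```

Trained on the human decision, the learner recovers the higher bar for group B; trained on the productivity label, it applies the same bar to both groups.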

The last one is very small, so let me just do this one quickly. Here's an observation. I've shown you two audit studies-- the résumé audit, where we sent out a bunch of résumés and we saw what people did, and the care coordination audit, where I audited the algorithm for bias. In a way, I've audited a human, and I've audited an algorithm.

Let me point out something. Just look how different these two were. In one case, what did I do? I sent out tons of résumés, and I got an indirect statistical guess as to whether there was discrimination in the world at large. In the other case, I got the algorithm. I could do everything I wanted with it. It was much easier to audit the algorithm. And that is a point that I want to conclude on.

I'll say that human-decision processes are pretty opaque objects. If I wanted to go and do the equivalent of what I did with the label and I said, hey, in the résumé study, what were HR managers optimizing? What was their objective function? I don't know. They might not know.

They might tell me something that's not the objective they're maximizing. They might tell me what they think is the objective they're maximizing, but it's not. The entire literature on the psychology of discrimination says that people can discriminate without knowing it. People can also just lie to us.

Human objectives are unknown to us. They're mysterious objects. On the other hand, algorithmic objectives must be specified. You cannot build an algorithm without writing down your label and the objective function in a piece of code that is there for, potentially, the world to see.

Here's another one. Why did I have to take a résumé, and change the name, and submit it in this statistical way? Why couldn't I just go to the HR manager and say, here's Brendan Miller's résumé. What would you have done had it been "Jamal?"

Again, they can't tell us. They won't tell us. We can't trust this type of-- It's just not knowable. You know what's trivial with an algorithm? Take the input, change what you want, and see what it does. I don't need to wait and do statistical studies on real people. The algorithm is the rule.
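A minimal sketch of that kind of audit, assuming the model is available as a plain function from a résumé to a score. The function names, the résumé fields, and both toy models are hypothetical; only the two names come from the example in the talk.

```python
def audit_name_swap(model, resumes, name_a, name_b):
    """Score each résumé twice, changing only the name, and return the ones that differ."""
    flagged = []
    for resume in resumes:
        score_a = model({**resume, "name": name_a})
        score_b = model({**resume, "name": name_b})
        if score_a != score_b:
            flagged.append((resume, score_a, score_b))
    return flagged


def biased_model(resume):
    # Hypothetical screener: scores experience but quietly penalizes one name.
    penalty = 1 if resume["name"] == "Jamal" else 0
    return resume["years_experience"] - penalty


def name_blind_model(resume):
    # Hypothetical screener that ignores the name entirely.
    return resume["years_experience"]


resumes = [{"name": "", "years_experience": years} for years in (1, 4, 8)]

biased_flags = audit_name_swap(biased_model, resumes, "Brendan Miller", "Jamal")
blind_flags = audit_name_swap(name_blind_model, resumes, "Brendan Miller", "Jamal")
```

Every résumé is flagged for the first model and none for the second, with no need to send anything out into the world and wait for callbacks.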

Finally, I found the problem in the résumé study. But guess what? I can't say, let's go talk to a bunch of HR managers and fix their brain. But it was pretty easy to fix the algorithm. Now, it's being scaled. Human-decision processes are fundamentally different than algorithmic decision processes.

And so this is the third lesson. Regulation should force storage of all inputs. Once we've done that, detection becomes so much easier. Think of all of the legal wrangling that we have to decide whether sex discrimination actually took place in a particular hiring when humans are involved. When an algorithm is stored, the problem doesn't go away; but wow does it become much easier. In fact, if I was going to discriminate actively, the first thing I would do is not use an algorithm if I had to store everything.

Moreover, it's easier to audit, it's easier to diagnose problems, and it's easier to scale fixes. So that's the third lesson I want to end on here, which is that algorithms make detection of discrimination much easier. We've focused so much on the level, but not thought about the process.
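A rough sketch of what stored inputs buy you, under stated assumptions: `LoggedModel`, `selection_rates`, the applicant fields, and the toy scoring rule are all hypothetical, and the in-memory list stands in for whatever durable storage a real system would use.

```python
import json


class LoggedModel:
    """Wraps a scoring function and stores every input alongside its decision."""

    def __init__(self, score_fn, threshold):
        self.score_fn = score_fn
        self.threshold = threshold
        self.log = []  # stand-in for durable storage

    def decide(self, features):
        score = self.score_fn(features)
        decision = score >= self.threshold
        record = {"input": features, "score": score, "decision": decision}
        self.log.append(json.dumps(record, sort_keys=True))
        return decision


def selection_rates(log, group_key):
    """Replay the stored records and compute the share of positive decisions per group."""
    totals, positives = {}, {}
    for line in log:
        record = json.loads(line)
        group = record["input"][group_key]
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(record["decision"])
    return {group: positives[group] / totals[group] for group in totals}


def toy_score(features):
    # Assumed scoring rule with a built-in penalty for group B.
    return features["years"] - (2 if features["group"] == "B" else 0)


model = LoggedModel(toy_score, threshold=3)
for group in ("A", "B"):
    for years in (2, 3, 4, 5):
        model.decide({"group": group, "years": years})

rates = selection_rates(model.log, "group")
```

With identical applicants in both groups, the replayed log shows three of four selected in group A and one of four in group B, which is the kind of check that is hard to run on a human process that left no record.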

I want to come back to the Amazon thing, because it's a good place to end this talk. And it's related to the detection point. Here's an article. This Seattle company ultimately disbanded the team by the start of last year that built this algorithm and got rid of the algorithm. Now many viewed this as a great victory because we got rid of a sexist algorithm. But guess what was left behind? The sexist hiring process that generated the algorithm in the first place.

When algorithms discriminate, they are not just an algorithm that discriminates that's to be fixed. They are the sign that the original process that generated it was itself discriminatory. So simply getting rid of the algorithm does not get rid of the problem. It actually rolls us back to a much bigger problem. So let me end there. Thank you.


Feels like we've got about a few minutes for questions. Yeah? If there are questions, I think there's a mic coming around.

AUDIENCE: My question will be, you described a situation where we have statistical machine learning on a very fixed data set, but what will happen with digital data that are evolving with this new loop of changing objective of data?

SENDHIL MULLAINATHAN: Yeah, great point. I think that there's-- most of these algorithms are trained and retrained. But there's an extreme end of that distribution where there's so much data and so much retraining, Google search for example, where storage becomes a little bit hard. But outside of that extreme, storage in the more intermediate cases is typically very easy.

One thing I should say, even in Google search, even though the algorithm is retraining itself, the objective is fixed. So the function that's produced changes, because there are better ways to optimize that objective, but the objective stays fixed. So the storage of the underlying data could get just expensive because there's lots of hard-drive space you've got to use. But outside of that, it's important that even as these algorithms learn, it's all to a fixed, known, written objective function. Other questions?
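A toy illustration of that distinction, not a model of any real system: the `objective` below (mean squared error) is written once and never changes, while the prediction chosen by `retrain` moves as data arrives. The candidate values and the data are made up.

```python
def objective(prediction, outcomes):
    """The fixed, written-down objective: mean squared error."""
    return sum((prediction - y) ** 2 for y in outcomes) / len(outcomes)


def retrain(outcomes, candidates):
    """Pick the candidate prediction that minimizes the fixed objective."""
    return min(candidates, key=lambda p: objective(p, outcomes))


candidates = [0, 1, 2, 3, 4, 5]

data = [1, 1, 2]
first_fit = retrain(data, candidates)

data = data + [5, 5, 5, 5]          # new data arrives
second_fit = retrain(data, candidates)
```

The fitted prediction shifts from the first round to the second, but anyone auditing the system can still read the one objective it was optimizing throughout.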

AUDIENCE: I'm in the automotive industry, and I was wondering if there will be objectives that were that they-- if they will be ruled out. For example, what happens if my car makes a decision to kill a person or not to kill the other person? So will there be an objective objective? And will this be legalized? Will this be ruled out?

SENDHIL MULLAINATHAN: I didn't talk about it much here, but it shows up in discrimination a lot, the following effect. By forcing us to write down our objectives, there is a shroud of ambiguity that human-decision processes and social processes benefit from that's now gone.

So the example you give is exactly that. We try to avoid situations where we make ethical trade-offs and write down the value of this life versus the other life. There are a few places where we can't avoid that-- you know, health and safety regulation, we've just got to put a cost on human life, on human limbs, on other things, but nobody likes that part of anything. Algorithms, because of their literality, force us to state such things. And I think that's going to be an uncomfortable conversation. In the case of race-- let me give an example.

We had the situation where we had a win-win. Equity goes up, and efficiency goes up. Suppose, in some instances, there is a trade-off between equity and efficiency. Now, we have to have a conversation. How much equity? At what price? That's not a conversation that the system currently wants to have. And I think there's going to be a lot more unfortunate consequences that come from the literality of algorithm objectives. Should we take one more question? Oh, great. Yeah.

AUDIENCE: Thank you very much for this very interesting talk. My question is, so for example, for the trials, how do you have policy-makers accept using algorithms? These are people who are not necessarily educated in machine learning. So I'd like to know how you have them accept these algorithms, which can seem like boxes to them sometimes.

SENDHIL MULLAINATHAN: Yeah, I think that there is a belief out there that algorithm adoption will be very low and very slow. And I think my read is that we should just all remind ourselves, we're very early in a process. And most of the guesses we have are just that-- guesses. I don't think many people realize, for example, in health care, this algorithm's already adopted-- 100 million people, that hasn't been hard.

Or as another example, 15 years ago, you might have thought, oh, people have overconfidence and hubris in their ability to pick the best routes. And 15 years ago, when Google Maps first came out, I'm sure there were a bunch of people who were like, I'll just do it on my own.

But now, the only people who are saying, I'll just do it on my own are probably your dad, not even my dad. My dad uses Google Maps. So I'm just saying that people's comfort with these algorithms is-- I would be more optimistic. I don't think it's as negative as we might guess. So thank you.
