Microsoft is putting AI Copilots in everything. Will it change the way we use computers?
Microsoft CTO Kevin Scott, who as of this week also has the new title executive vice president of AI, oversees Microsoft’s AI efforts, including the big partnership with OpenAI and ChatGPT. Kevin and I spoke ahead of his keynote talk at Microsoft Build, the company’s annual developer conference, where he showed off the company’s new AI assistant tools, which Microsoft calls Copilots. Microsoft is big into Copilots. GitHub Copilot is already helping millions of developers write code, and now, the company is adding Copilots to everything from Office to the Windows Terminal.
Basically, if there’s a text box, Microsoft thinks AI can help you fill it out, and Microsoft has a long history of assistance like this. You might remember Clippy from the ’90s. Well, AI Super Clippy is here.
Microsoft is building these Copilots in collaboration with OpenAI, and Kevin manages that partnership. I wanted to ask Kevin why Microsoft decided to partner with a startup instead of building the AI tech internally, where the two companies disagree, how they resolve any differences, and what Microsoft is choosing to build for itself instead of relying on OpenAI. Kevin controls the entire GPU budget at Microsoft. I wanted to know how he decides to spend it.
We also talked about what happened when Bing tried to get New York Times columnist Kevin Roose to leave his wife. Like I said, this episode has a little bit of everything. Okay. Kevin Scott, CTO and executive vice president of AI at Microsoft. Here we go.
This transcript has been lightly edited for clarity.
Nilay Patel: Kevin Scott, you are the chief technology officer at Microsoft. Welcome to Decoder.
Kevin Scott: Thank you so much, Nilay.
You’re also a podcast host. What’s your podcast called?
I am. I don’t think I have nearly as many listeners as you guys, but I do a once-a-month podcast called Behind the Tech, where I talk with people who are doing interesting things, sometimes behind the scenes, in the technology industry. And we’ve been doing it for four or five years now.
It’s great to have people with other podcasts on the show because you were just ready to go.
There’s a little bit of a camera arms race happening on this episode of Decoder if anyone sees the TikTok. Kevin looks great. I’ve still got my little Sony ZV-1 situation. You said you had a Canon EOS R.
I’m going shopping after this. Anyway, we can talk cameras on another episode of Decoder. Let’s start with the news. It’s Microsoft’s Build conference, the developer conference. The theme of this conference is basically, “We’re going to put AI and LLMs in everything.” Microsoft calls that idea Copilots.
Obviously, GitHub already has a Copilot, but there’s a Copilot in Windows Terminal now, which is hilarious in all kinds of ways.
There’s a new Copilot system in Edge. They’re everywhere. Tell us what’s going on.
Well, I think the exciting thing — what I'll be talking about in my keynote — is that we built GitHub Copilot, and the idea is you have these amazing large foundation models that you can have conversations with, that can do cognitively complicated things. We wanted to imagine how we could use this technology to assist people in the cognitive work that they're doing. The first one we built was GitHub Copilot, a tool that helps people write code in their work as software developers. And very quickly, we realized that that was a pattern for a new type of software — that there wasn't going to be just GitHub Copilot but lots of Copilots. So you can think of Bing Chat and ChatGPT as Copilots. We've got Microsoft 365 Copilot and the Windows Terminal Copilot that you mentioned.
And as we were looking at building all of these things ourselves, they had a whole bunch of architectural and user interface pattern similarities. The theme of the talk that I’m giving at Build is all about, “What does the architecture of a Copilot look like?” Because we believe that many developers are going to build lots of Copilots.
The thing that makes them special is when you know something about your customer or you know something deeply about a problem, you’re going to be the best person to go build a Copilot to help assist someone with that flavor of work. And there’s just no way that any one company is going to imagine what all of those are. And so this year’s Build is about the tools that we can put into the hands of everybody to help them build their own Copilots.
So that’s a big thesis about how computers will work in the future, that the way we’re going to interact with computers involves a lot of natural language prompting. I’m just going to walk up to the computer and say, “I want this,” and the computer will give it back to me. And the software that sits in between the input and the output there, developers will build. You can see how every developer will say, “Okay, I need to parse natural language input and then figure out how to give the user whatever they wanted.” And sometimes, that means generating some content for them.
That’s a big idea. It is also kind of a narrow idea, right? It’s a limitation on what you might allow AI to do instead of just doing it yourself. It’s a Copilot. It’s built into the name; it very much implies that I’m still the pilot.
Do you see that limitation? Is that something you’re baking in as a guardrail? A moral guideline? Where’s that coming from?
Part of it is pragmatic. If you look at these models, they are truly amazing, and the progress we've made in the past year is astonishing. We've gotten to some places quicker than I thought we would. We've seen ourselves that as we've tried to apply these models to applications, we have a bunch of alignment and steering work to do to get them to actually do a rich, complicated set of tasks. So part of what we're doing is just the pragmatic thing: if you want to harness the power of these things right now and get them to do useful things, you're going to have to be able to steer them. You'll have to think about prompt engineering and meta prompts and retrieval-augmented generation and this whole new bag of techniques that have emerged around this new type of software development.
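Retrieval-augmented generation, in particular, is simple to sketch: fetch the documents most relevant to the user's request and prepend them, along with a steering meta prompt, to what the model sees. A minimal illustration follows; the word-overlap "embedding" and the document list are toy stand-ins, and a real system would use vector embeddings and an actual model API:

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# The "embedding" and document store here are toy stand-ins.

def embed(text: str) -> set[str]:
    """Toy 'embedding': a bag of lowercase words (real systems use vectors)."""
    return set(text.lower().split())

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k."""
    ranked = sorted(documents,
                    key=lambda d: len(embed(d) & embed(query)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Steer the model with a meta prompt plus retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "You are a helpful assistant. Answer ONLY from the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )

docs = [
    "Copilots are AI assistants built on large foundation models.",
    "The Windows Terminal now includes a Copilot.",
    "Clippy was an Office assistant from the 1990s.",
]
print(build_prompt("What is a Copilot?", docs))
```

The assembled prompt, not the raw user query, is what would be sent to the model; that is the "steering" he describes.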
And you'll have to think about user interfaces differently than before. One of the really wild things that's changing about user interfaces is this: for my entire career as a software developer, you have had to, as a programmer, imagine everything that you want your code to do and every way that you are going to explicitly permit the user to accomplish that task, which usually means laying out a bunch of graphical user interface elements and binding code to them. With these applications, you don't have to do as much of that because the user expresses what they want to accomplish in a very natural way. Sometimes, it's a multi-turn conversation. What it means to build a user experience for these things is different. It doesn't mean that you're off the hook and don't have to think about it at all, but you do have to think about it differently.
In that sense, it’s a very big idea, because for the past 180 years, since Ada Lovelace wrote her first program, the way that human beings have got computing devices to do things for them is either by being a skilled programmer who knows how to deal with all the complexity of the computing device and tells it what to do, or hoping that one of these skilled programmers has anticipated your needs and written a piece of software that you can run.
That is changing now in a pretty dramatic way. I think that’s a big idea. It doesn’t necessarily constrain what the AI may do in the future. As the models become more capable, the way that we architecturally think with these Copilots, you may have to do less explicit work to align them and steer them to a task. They may be able to do more and more of this naturally.
That is the progression we've seen over the past handful of years. Every time we turn the crank on the big foundation models, they are more capable of doing things with less coaxing. But you're probably going to need a little bit of coaxing for a while, especially with plug-ins. The models can do reasoning to a certain approximation, but if you want them to actuate something, to do something in the world, they have to be able to invoke an API or look up a piece of data or whatnot. That's what plug-ins are for. They are explicit ways to give a Copilot, or an AI system in general, the ability to do a thing that the model can't do on its own, because the model is just a big reasoning engine.
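The plug-in pattern he describes can be sketched as an allow-list of functions that ordinary code dispatches on the model's behalf: the model proposes a structured action, and only registered plug-ins can actually run. A toy sketch, where the plug-in names and the action format are invented for illustration:

```python
# Toy plug-in dispatcher: the model emits a structured action, and
# plain code decides whether a registered plug-in may handle it.
from typing import Callable

# Registry of actions the model is ALLOWED to actuate.
PLUGINS: dict[str, Callable[..., str]] = {}

def plugin(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as an invokable plug-in."""
    PLUGINS[fn.__name__] = fn
    return fn

@plugin
def search_trains(destination: str) -> str:
    # Stand-in for a real travel API call.
    return f"3 trains found to {destination}"

def dispatch(action: dict) -> str:
    """Run a model-proposed action, but only if it's a registered plug-in."""
    name = action.get("plugin")
    if name not in PLUGINS:
        return f"refused: '{name}' is not an allowed plug-in"
    return PLUGINS[name](**action.get("args", {}))

# A model would emit something like this as structured output:
print(dispatch({"plugin": "search_trains", "args": {"destination": "Paris"}}))
print(dispatch({"plugin": "rm_rf"}))  # unregistered action -> refused
```

The key design point is that the model never calls anything directly; the dispatcher is where the guardrails he mentions live.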
That’s a really interesting part of the guardrails, right? You give the models enough APIs, and I say, “I would love to book a train ticket to Paris.” The model goes out, and it discovers that all the train tickets are sold out, and it creates some calamity in the world to make sure there is a train ticket for you. We have seen this play out with some of these models. People have run the simulations and gotten there with some of the existing models. How do you put the guardrails in place for that? Because a GitHub Copilot is helping you write code, and you’re very much in control. A Microsoft Edge Copilot that can actually take actions on the web through APIs is a very different beast with a very different set of capabilities.
Yes, for sure. We’ve been thinking about the security and safety models for these plug-ins very carefully. So the first plug-ins that will get deployed will be things that we’ve developed ourselves and that we’ve co-developed with partners in a very careful way so you know exactly what they can do, how data is flowing, how authentication and security works, what their guardrails are, when they get invoked, et cetera. And you also have to think about alignment in the models in general so that you don’t get these weird emergent things that could potentially happen when you’ve got thousands of plug-ins and the model is trying to actuate them in ways that are hard to predict because it’s just sort of a complicated collection of things. So I think we will be very, very careful as we roll these things out because, precisely to your point, we don’t want calamities.
Now, I think you’re likely, in the short term, to have scenarios where it’s less that the model is doing something weird and sinister to make sure that you’ve got space on a train by creating a bad situation and more that it’s a vector for malware, for instance. Somebody writes a malicious plug-in that advertises itself as doing one thing but does another. And there it’s the human beings, not the AI. It’s a human trying to exploit a vulnerability in the AI.
Have you done the red team exercise of what happens when the government tries to do a Stuxnet?
We have done many, many, many red team exercises. I don’t think we’ve done exactly a Stuxnet exercise. We’ve done a whole bunch of things, and red teams are awesome because they are infinitely paranoid—
But let me just put this in front of you: if I’m the director of the CIA, and I’m like, “Look, there’s an enrichment facility in Iran that I need to shut down. Go take over Windows 3.1 PCs until you turn off the centrifuges.” Is that a reasonable command to give a computer?
Well, that is a hard, hard question for me to answer. It doesn’t sound, on the surface, like a reasonable command. And it’s certainly not a thing that the systems can do right now.
There’s no Stuxnet plug-in for GitHub? Okay, fair enough. But you can see how you get there, right? And not to make this too sci-fi or even too cynical, but you can see how I’m going to make the computer write code for me, and as a user, I might not even have the ability to understand what the computer is producing. But a much different version of this is, I say, “Hey, I’ve got a crush on somebody who speaks Italian. Write a poem for them in Italian.” And it turns out the poem is deeply insulting. There’s just a lot of that feedback loop going on: what will you not allow the computer to do?
“Write me a piece of malicious software that will shut down a centrifuge” seems like something where Microsoft should just say to plug-in developers, “You are not allowed to do this.”
Well, yeah, and if you try to issue that command right now into GitHub Copilot or you try to get Bing Chat to do it [typing]…
We’re going to try right now?
Yeah. I’m going to go type it in right now. Let’s see what it says, live.
I’ve never had anyone get arrested live on the air on Decoder, and I’m excited for it to be the CTO of Microsoft.
I’m not excited, though, for it to be me or any one of your guests, for that matter.
It’s not a hit podcast until at least one person gets arrested. That’s what I’m told.
Our safety systems should prevent this from… Yeah. So [it says], “I am sorry, but I cannot write such code. It is unethical and dangerous to attempt to take over a uranium enrichment facility.”
Is that keyword-based, or are you doing natural language processing?
No, there's something far more complicated going on there. And moreover, for things like this, what the red team would have done is try a million different things to bypass the safety features that prevent the system from writing these things. So yeah, the intention is that you want the systems aligned to safe uses. Hacking, whether it's at the direction of a government or some mafia-type person trying to run a financial scam, is just not a permissible activity for the systems right now.
That's not to say that someone couldn't take an open-source system that doesn't have all of the safety features built in and do something similar. But for the systems that we're building that have safeguards built in, we try very hard not to allow things like what you're suggesting.
In terms of what you would allow, Microsoft has a long history of trying to make computers work in this way, in particular with natural language. There’s a part of this where we’re building Super Clippy, where it says, “Hey, I can see you’re using Excel. Do you want me to just write the macros for you?”
There’s a user interface history there where ultimately, Clippy was deprecated because it was just getting in people’s way. And then, there’s the new class of users, a new set of expectations. There’s a part of me that says we’re actually in an uncanny valley here, where people’s expectations of what a Copilot can do in something like Excel are going to vastly outstrip what they can actually do in this moment. Are you thinking about that calibration?
It's a really interesting thing, and I think it's part of both the infrastructure of Copilots and how you think about building user interfaces. I did a review yesterday with a Microsoft research team that is working on a feature for Excel called co-auditing. The explicit purpose is to make transparent exactly what the model is trying to do as it writes formulas and does a whole bunch of numeric analysis inside a spreadsheet, and to set the expectation that the user should understand what is going on. It really is an assistive thing, the same way that if your colleague gave you something, you should double-check their work. It's just a best practice.
And so I think that is a really important part. And it’s not trivial. I’m a software developer, but if somebody gave me a chunk of Objective Caml code right now… I haven’t written Objective Caml in 25 years, and so I’m rusty enough where you shouldn’t expect that I would be able to look at that code and determine whether or not it’s correct. And so part of what the user interface for these systems has to do depends on the context. Sometimes the context will be specific, and sometimes it will be general. The problem is much harder in general systems, but you have to make a reasonable effort to ensure that when you’re asking the user to monitor the outputs of what the AI is doing, you’re presenting it to them in a way where they can reasonably do that checking.
In GitHub Copilot, if you’re asking it to write Python code, it’s presumably because you’re a Python developer, so you can look at what it returns in sort of the same way that you would do a code review and say, “This looks right, and then I’m going to deploy it, and it’s got tests, and I’ve got a bunch of mechanisms to figure out whether or not it’s right.” In something like Bing Chat, what we’ve increasingly been trying to do is to have cited references in the output so that you can go click through when it’s asserting something to see, “Where did you get that from?” And even then, it’s not perfect, but I think these user interface things that we’re doing with the systems are really important for that transparency.
One more question on this: you've got a training data feedback loop problem coming. Right now, these models are trained on a bunch of stuff that people have put on the web, into GitHub, everywhere. The output from these Copilots is voluminous. It's going to quickly dwarf the amount of human output on the internet, and then you're going to train against that. That feels like a feedback loop that will lead to weird outcomes if not controlled for. How do you think about that?
We’ve had some pretty good techniques for a while now to assess the quality of data that we’re feeding into these systems so that you’re not training things on low-quality data. I think many of those techniques will work here and might even be easier to apply. Then another thing that I think we will be doing that is useful both for the training problem as well as for this transparency problem that we were talking [about] before is, really soon, either by convention of all of the people in tech or because it becomes a regulatory requirement, you’re going to have to figure out some way or another to mark that a piece of content is AI-generated.
We are going to announce some stuff at Build around this. For three years now, we’ve been working on a media provenance system that lets you put an invisible cryptographic watermark and manifest it into audio and visual content so that when you get this content, you can have a piece of software decrypt the manifest. The manifest says, “This is where I came from.” It’s useful for disinformation detection in general. You can say, as a user, “I only want to consume content whose provenance I understand.” You could say, “I don’t want to consume AI-generated content.” If you are building a system that is ingesting this content to train, you can look at the manifest and say, “This is synthetic content. It probably shouldn’t be in the training data.”
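A provenance manifest like this can be sketched as a signed record of a content hash plus origin metadata, which a consumer can verify before trusting or training on the content. The sketch below uses a shared-key HMAC purely to stay self-contained; real systems, such as the C2PA work Microsoft and Adobe are part of, use public-key signatures embedded in the media itself, and the field names here are invented for illustration:

```python
# Toy provenance manifest: sign a content hash plus origin metadata so a
# consumer can check where content came from and whether it's AI-generated.
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # stand-in for a real signing key

def make_manifest(content: bytes, origin: str, ai_generated: bool) -> dict:
    """Produce a manifest binding the content hash to origin metadata."""
    body = {
        "sha256": hashlib.sha256(content).hexdigest(),
        "origin": origin,
        "ai_generated": ai_generated,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, "sha256").hexdigest()
    return {"body": body, "signature": sig}

def verify(content: bytes, manifest: dict) -> bool:
    """Check the signature AND that the content still matches its hash."""
    body = manifest["body"]
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, "sha256").hexdigest()
    return (hmac.compare_digest(expected, manifest["signature"])
            and body["sha256"] == hashlib.sha256(content).hexdigest())

m = make_manifest(b"an AI-written poem", origin="example-model", ai_generated=True)
print(verify(b"an AI-written poem", m))  # genuine content verifies
print(verify(b"tampered content", m))    # altered content fails
```

A training pipeline could apply exactly the filter he describes: check the manifest, and skip anything marked `ai_generated`.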
I just saw Sundar Pichai at Google’s developer conference. They’ve got the same idea.
I’ll make the same threat to you. If you want to come back and talk about metadata for an hour, I will do it at the drop of a hat.
Actually, I think it’s a really important thing. I think there are a bunch of long-term problems and short-term problems with AI. There are hard problems — there are easy problems. The provenance one seems like a thing that we ought to be able to go solve…
Here’s what we’re going to do. We’re going to rent a theater. We’re going to sell drinks, and we’re going to sit and drink and talk metadata. I guarantee you: it’ll be an audience of thousands of people who want to drink through a metadata conversation. That’s just what I know.
That’s awesome. Alright, let’s do it.
I’ll put a pin in that one.
Let’s do it. We’ll invite Sundar. It’ll be great.
Seriously, I promise you there's a bigger audience for this conversation than anyone thinks, including my producers. But here's my question. Google's got a content authenticity initiative. Adobe's got one. We're quickly hitting that xkcd comic: "There are 14 competing standards. Let's launch a new one." Are you having those conversations? Are you saying a regulator has to do this? Will the industry do it together?
We are absolutely having these conversations. The Adobe thing is in partnership with us, so we’ve been chatting with them and the BBC and The New York Times. There’s a coalition that the Microsoft media provenance team has been building since 2021. But, this is a thing where I would be perfectly happy if we decided that someone else’s standard is the better way to solve this problem, to just snap to that. This is not a place where you need competition. We should find a good enough solution and all agree, “Here’s the thing, and this is what we’re all going to do.”
Let’s talk about structure a little bit. We’ve talked about a lot of big-think ideas here: here’s how we’re going to use it, here’s what the future of computing might look like. But there’s actual products you launched at Build. A lot of them were built with OpenAI. This is a big partnership with a company that’s obviously set off what you might call a platform shift. You were one of the people that pushed to partner with OpenAI. Why partner? What were the pros and cons of working with them versus building it yourselves?
It’s a super good question because there were lots of opinions back when we were beginning these conversations about what we ought to do. The guiding principle that we had is: Microsoft is a platform company. We need to make sure that the platform we’re building is going to meet the needs of the highest ambition folks in AI who are doing things at the very highest level and have the highest expectations. It will be better to have a partner who is outside of Microsoft and who can’t be influenced by the sets of things that happen inside of big companies when they’re telling us that “This is good enough” or “This isn’t X.”
When we formed the initial partnership with OpenAI, if it had done nothing more than help us push on how we’re building AI supercomputers and get more scale onto the AI supercomputing platform that we were also using for training our own models, it would’ve been a big success for us. It just turned out that we were aligned on this platform vision. We saw these models on this trajectory where you were going to be able to train one thing and use it for lots and lots of different things, which is a very different way of doing machine learning than we’ve had for the past couple of decades. They had a platform vision for what they were doing. We are a platform company, and we just figured out a way to structure a partnership where we could go build that platform together.
What things do you disagree about with OpenAI?
It’s really interesting — it changes over time. Honestly, personally, Sam [Altman] and I have had relatively few disagreements. But there’s ideological disagreements that our teams have had with the overall approach.
So if you’re a machine learning expert, this idea of taking a dependency on a foundation model versus training your own thing start to finish is a pretty big shift in the way that you’re doing things. I’m guessing any professional who loves craft and loves their tools is ornery in the same way. God forbid that some upstart comes in and tells you how you’re going to go do journalism. Not just what the tools are but how you’re going to go use them. It’s a little bit like that with the deep machine learning experts, so we’ve had disagreements there.
And then there were a whole bunch of people who, until relatively recently, didn’t believe that the approach was going to get where we’ve gotten. They were like, “Oh, well, there must be something else. You’re going to have to have schemas and symbolic reasoning and some richer notion of semantics and what you can get from a deep neural network or a transformer.” And I think that’s less of a disagreement now than it was before. And I think we’re still open to the idea that there must be something else. We’ve got a proof point here that there is something else. But I think everybody is increasingly believing that these things are powerful and they’re likely to get more powerful.
What’s the split on what you rely on OpenAI to do and what you want your teams at Microsoft to do?
Well, they are developing, from a science perspective, a bunch of the core AI technology that we're dependent on right now. You can see it in all these announcements we're making: they have an OpenAI model in there somewhere. And in many cases, they're accompanied by a whole bunch of other things — one of the points that I'll make in my keynote at Build is that it's rarely just one model. You have a whole portfolio of things that you use to make a full application, so we build a bunch of those things ourselves. We obviously are the ones who partner closely on defining what the infrastructure ought to look like, but we're also the ones who have to go out and build it and ramp everything up to scale. And then we do a whole bunch of work together on implementation and deployment.
One of the interesting things that we do is we have this thing called the deployment safety board that we run together. Everything that launches that has an OpenAI model in it, either that they’re doing or that we are doing, we have a group of experts at OpenAI and at Microsoft that meet to review all of the red team analysis and the report that the experts have made, and we decide whether or not we’re going to proceed with the deployment. So, yeah, things we do tend to be more infrastructure. They do tend to be more [on] the science-of-the-model side of things. They’ve got products, we’ve got products, and then we’ve got this implementation deployment stuff that we just super deeply collaborate on.
I have to ask you about this because, in many ways, this is the most controversial org chart comment in world history, and this is a show about org charts, so it’s bait. Elon Musk very publicly claims that Microsoft controls OpenAI, and he’s issued a series of claims about your rights over the training weights and your ability to control this company. Is that true? What’s he getting wrong there?
Oh, boy, we don’t control OpenAI. They’re a partner. I also don’t control my machine learning engineers who work inside of Microsoft Research. We are aligned on a thing that we’re trying to accomplish together, and we’ve got a set of agreements that help us go do those things. But we certainly don’t control them in any traditional sense and certainly not in spirit, nor do I want to. So, what I said in the beginning is we need someone outside of the Microsoft remit to push on us. Otherwise, we’re going to get things wrong about our ambition. It’s very easy as a big tech company to be insular and just sort of see, “This is what I’m doing, this is my stuff, this is the way I’ve been”… I mean, Microsoft is an old company. We’re almost five decades old at this point. Just having an independent partner that’s out there with their own ambition, their own thing that they’re trying to do… we’ve got tight alignment, but independence is really crucial to us having a successful partnership.
How have you structured the AI division now? This is your team. This is the thing you were building. You have this outside group that you’re partnered with that’s pushing on you, that is obviously making its own inroads into the market. This is a show about org charts. How is your team structured now?
We have a whole bunch of people who work on AI inside of the company. Scott Guthrie is my peer who runs this group called Cloud+AI. Inside of his group, there's a group called AI Platform. AI Platform is responsible for all of the infrastructure, both the third-party and, increasingly, the first-party AI platform for the company. There's a big AI group in Bing that's been there forever, and that is one of the best AI groups in the company. There's an AI group in the experiences and devices division of the company that's responsible for Office and Windows and a whole bunch of other stuff that is application-focused. So they look at, "Here's the best capability that AI provides. How do we put that into our products?" We have a very large number of AI researchers in Microsoft Research, who all report up to me. And then I coordinate all of this activity across the breadth of the company. The product AI groups each report to one of my peers, but I own the GPU budgets for the whole company.
That’s the hardest flex I’ve ever heard on this show.
No, it’s not a flex. It’s a terrible job. You do not want to be in charge of all the GPUs in a world of AI, and it’s been miserable for five years now.
You haven’t asked an AI in Excel to manage the GPU budgets for you? This seems like the perfect task.
Yeah, I wish. And it’s not the GPU budget. It’s like the people who are like, “Hey, I don’t have enough GPUs. I’m mad at you.” That’s the problem.
This is the Decoder question. I always ask everybody how they make decisions, and usually, it’s pretty open-ended, but you figure out how to spend the GPU budgets. How do you make that decision? How do you make decisions in general?
Well, the way that I make decisions about capital allocation and how to decide which projects we’re going to fund with headcount is not quite the 70-20-10 concept that Sergey Brin came up with at Google a million years ago, but we push most of our investments into things where we have very good quantifiable evidence that they’re going to benefit from more investment and they will create a business impact that gives us return on invested capital.
That tends to be the biggest part of the portfolio, so 85-90 percent of how we invest is on those things where we have evidence that something is working and that it will benefit from more investment. And then you’ve got this 15 percent that you’re investing that is trying to plant enough seeds where you’ve got maybe your smartest people trying to do things that are counterintuitive or non-obvious or outright contrarian and having them do it in disciplined ways where they’re working toward proof points of things.
Not doing things because you look smart doing them, but doing something that shows we're on the beginning part of something that's going to inflect up and be super interesting if we put a little more investment behind it. That's the way we think about doing things in general. And at Microsoft scale, the 15 percent is a lot. There's a lot of people making these little seed investments all over the place.
And it’s even the way that we think about partnering with folks. I know people probably thought that the OpenAI investment was big. But inside of the Microsoft revenue streams and the size of the company, the first version of the investment was not a hugely financially risky thing. It was one of those seeds, “This looks like it’s going to work. Let’s put some resources behind it and see if we can get it to the next step.” That is how we make these decisions.
For something like a Copilot, here’s a new paradigm for operating computers. We want to roll it out across everything from Windows Terminal to Excel to GitHub. The spread of Microsoft structure is actually really fascinating. There’s Office, which is inside of Microsoft. There’s Azure — Satya Nadella used to run Azure. I’m sure he cares about that at a one-to-one level. And then there’s GitHub, which has its own CEO.
Yep, and LinkedIn.
And LinkedIn has its own CEO, and they’re doing AI stuff. There’s a spectrum of how connected Microsoft’s divisions are to the central core of Microsoft. That’s another long episode of Decoder, I’m sure. But when you’re building something like a Copilot using a centralized GPU budget, how do you bring all those teams together and say, “This is the way we’re going to do it. These are the philosophical principles of what we think these products should be, and these are the guardrails that we’re imposing with our deployment board”?
It has actually gotten a lot easier over the past handful of years, mostly because we've been practicing this for a while. About a year before the OpenAI deal, I started a central review inside of the company, a meeting series called AI 365, that ran for five years. We just refactored it recently. AI 365 had a handful of goals. Number one was to get everybody in the company who was doing AI in a significant way into the same conversation, no matter where they sat — we started AI 365, I think, before we had even acquired GitHub, but as soon as GitHub was there, we said, "Bring your machine learning people to this."
It was a way for all of those people to see what everyone else was doing and to get a sense for what the difference was between high-ambition AI and average-ambition AI. Just slowly over time, with Satya pushing with me and other folks, with peers pushing, we got to a point where everybody had a point of view about where AI was heading and what a good level of ambition looked like. It took a little bit longer to get people to agree on taking a dependency on some central piece of infrastructure because engineers always want to go build their own thing from scratch. And then there’s some stuff that’s non-negotiable: we have a responsible AI process that is going to run one way for the whole company, and you don’t get to opt out of it.
You set up that process at the beginning. [Now] you’re launching lots of products. You’re in that process, but obviously, the external stressor of, “Oh, boy, suddenly everyone wants to use ChatGPT,” and suddenly, you’re in competition with Google, which is firing off products left and right now. How has that stressed that process and that structure?
It’s actually been super good that we have five years of practice running the process because otherwise, I think everything would be really, truly en fuego. The good thing about where we are right now is, at least, we know what we believe about the technology and about ambition levels for what we can do with it. We know how to solve some of the hardest problems. What we don’t have is a bunch of weird divisional competition. You don’t have this research group and that research group with billions of dollars worth of GPU resources doing exactly the same thing. So, none of that.
You don’t even have product divisions who are off saying, “I’m just going to go build my own thing because I don’t want to take a dependency on the central thing,” or, “Oh, I’m running a research project. I don’t care about how this stuff is ever going to get deployed.” We have a real point of view about what we’re doing. And again, it’s a pragmatic point of view because this stuff is crazy complicated and expensive and has real risks associated with it. So you just need to do it in a very coordinated way.
You could read that comment as a direct description of ’90s Microsoft, for example, or 2000s Microsoft. You could also read it as a description of Google, which you used to work at, right?
Yeah, it’s been a long time since I worked at Google, though. I don’t know what they’re like inside now.
But do you think it can work the other way where you have lots of, I don’t know, startups competing in the marketplace with redundancy? A big criticism of AI regulation, for example, is, okay, Sam Altman’s going to go in front of Congress. That was a very chummy hearing. It was very friendly. And he said, “Please regulate us.” And Congress said, “No one ever asks us to regulate them.” And then you’ll build some regulations that favor big companies with huge amounts of funding, big partnerships with Microsoft. On the plus side, you might need to do that because this stuff is so costly. You’ve got to pay for so many GPUs and high-end machine learning experts. On the flip side, if you had an ecosystem of smaller companies all competing, you might get a richer set of experiences or products or a different approach to safety. Where’s the balance there?
I think we should have both. I don’t think there’s an a priori reason why one precludes the other. I think a lot of people actually do believe this, like that Google memo that was circulating around that’s like, “Oh, my God, their open source is doing well.”
I don’t subscribe to that theory at all, nor do I subscribe to the theory that just because we are building a big platform, the open source stuff doesn’t matter. Obviously, the open source community is doing crazy interesting things right now.
There is a pragmatic thing for entrepreneurs: what tool do you want to use to go build your product, to get yourself to market quickly? I’ve been at startups, and I spent most of my life working on small things that were trying to become large things, and the mistake that a lot of people make when they are in this entrepreneurial mode is they get infatuated with the infrastructure and forget that they really have to build a product that somebody wants.
What is super clear is these models are not products at all. They are infrastructure. They are building blocks that you use to make products, but they are not products themselves. Anyone who is trying to build a thing where the primacy for them is the infrastructure is probably going to have an outcome that is the same as all of the businesses who were building a thing where the primacy was the infrastructure, unless they’re a platform company. You just have to get your business model right for what you’re doing. For some of these big models, I think that what you’re building is a platform, sort of like an operating system or a compiler or a smartphone or whatnot.
So the question is, if you want to write a smartphone app, do you think that you have to build capacitive touchscreens and the phones and the batteries and write the mobile operating system from scratch? Or do you just take the dependency on the platform provider, write your app, and go serve a customer need? Or do you really have to build a whole platform for yourself in order to just get to the product?
Back to this point you were making before about abstractions: we have abstractions for a reason. They let us do the meaningful things that we want to do faster. So every time you want to write a Windows app or a PC app, you don’t write Windows or Linux. There are going to be a couple of those things, and they will be good enough. And it’s good that you have a handful of them, but I don’t think you have thousands of them.
Let’s end with two big questions. One, we’ve talked a lot about models and data and what they’re able to do. There’s a big fight online, as well as a legal and copyright fight, about training data. There’s a moral fight about whether artists and writers should be looped into training data. There’s a writers strike in Hollywood that has some element of AI concern laced into it. At some point, the Copilots, the generative AIs, are going to be able to fire off a poem that is reasonably good. I would say, right now, they’re not really able to do that. I can spot AI writing a mile away right now. But at some point, it’s going to get better. Do you think that there’s a turn where Microsoft or OpenAI or Google or whoever has to start compensating the people who make the stories that go into the models?
Maybe. I don’t know. I do believe that people who are doing creative work ought to be well-compensated for the work that they’re doing. I don’t know whether we will get to quite the point you’re talking about. The thing that seems to be true about human beings is we like to consume the things that we produce. We could right now have, instead of The Queen’s Gambit on Netflix, we could have the Machine’s Gambit and have a whole Netflix show about computers that play each other, all of which are better than the very best human player. And nobody wants to watch that because even though they’re doing this superhuman thing, who cares? We like that drama between human beings. And I think when it comes to consuming creative output, part of the reason you do it is to have a connection with some other human being.
This is why I’m really more excited about this vision of AI with these Copilots. I would prefer to build things that help empower those creative people to do things that they maybe can’t even imagine doing right now rather than this world where we don’t need any more creators because the robots are too good. I don’t think that’s what we want. And because it’s not what we want, I think it’s likely not going to happen.
From your role as a CTO helping to design and architect and think broadly about the systems, how do you build that into future development? “Hey, we should not totally wipe out entire floors of writers”?
Well, I think it starts with actually thinking about what it is you want your platform to do. I wrote a book about this a few years ago.
The last time we talked was about that book.
When you’re building a platform, you get to decide what you want to encourage in the platform. And we want to make it really easy for people to build assistive tools. One of the really interesting things is it’s good to have a platform like this that is not open in the sense that you can go grab the model weights and modify them however you want, but it’s pretty easy to go get a developer key and start making API calls into one of these models. When you can get that to scale, what happens is the unit economics of making those API calls gets pretty cheap, and then you can start thinking about doing all sorts of things that just economically wouldn’t be feasible any other way.
I don’t know whether you’ve watched Sal Khan’s TED Talk yet, but it’s really amazing. The thing that he’s been attacking at Khan Academy for a while is this two sigma problem, which is this idea that — controlling for everything else — children who have access to high-quality, individualized instruction just perform better and achieve more than children who don’t. And so if you believe that data, which seems to be pretty clear, you can formulate a vision or a goal that says, “I think every child and every learner on the planet deserves to have access to high-quality, individualized instruction at no cost to them.” And I think we can all agree that seems like a really reasonable and good goal. But when you think about the economics of doing that without something like AI, it gets really dodgy.
If you have a platform like this where the unit economics of it are getting exponentially better over time, then you can start thinking about taking on these super tough challenges that may not be solvable any other way. That’s the thing that defines what a good platform is. Things don’t deserve to become ubiquitous unless they can do things like that for the world. They really don’t. Then we’re all just wasting our time.
I’ve kept this conversation pretty narrow on the products that exist today or maybe a tick into the future. Personally, I find a lot of AI conversations frustrating because you spiral away from the capabilities that are in front of you, and the capabilities that are in front of us right now aren’t mind-blowing. So I’ve tried to stay inside of what I see now and what I see coming in the next turn. But I’m going to end by spiraling out into crazy. I know lots of AI researchers who think we just took a big step toward artificial general intelligence. The last time you and I spoke about your book in 2020, you said that’s not five years away, not 10 years away. I know people now who think it’s five years away. Where do you think we are?
I still don’t know whether [AGI is] five years away. It’s a peculiar thing. The things that have happened over the past year force you to think about what it is you mean when you say AGI. I think it’s really interesting that people mean different things when they say it, and we don’t have a really good definition for what it is.
I really believe that it’s a good thing to have systems that are more capable over time of doing more complicated cognitive tasks where you’re going from things like, “Hey, tell me what the sentiment of this sentence is” to “Hey, I want you to write an essay for me about the Habsburg Empress Maria Theresa and her impact on feminism.” I actually did that a few months ago.
Was it any good?
Not bad. My wife’s a historian. She had a few little issues with it, but it was sort of like a B-minus eighth-grade essay. I think, in the future, we will get to a place where you can have systems that do more complicated tasks that require multiple steps and access information in a bunch of different repositories. I think all of that’s useful. When that converges to something where you look at it and say, yep, that’s AGI… Who knows? Is that five years? It entirely depends on what your definition of AGI is.
This idea that some people have that we’re accidentally going to get singularity — strange, weird super intelligence things… We won’t get to it by accident. I know what it looks like in the trenches building these systems, and I know all the safeguards that we’re putting in place—
It’s not going to be like a movie where Peter Parker plugs the wrong plug into the wrong slot and—
No, that’s just not the way things work. Moreover, one of the problems I think we have is people talk about emergent capabilities, and that freaks people out because they’re like, “Oh, well, if you couldn’t predict the emergent capabilities that came in GPT-4, then what else might emerge that you can’t predict?” And just because you can’t forecast that GPT-4 is a much better joke teller than GPT-3.5 doesn’t mean that you can’t put a whole bunch of measures in place to make sure that super weird stuff doesn’t happen.
Anyway, I don’t find a huge amount of comfort wallowing in these artificial general superintelligence conversations because, narrowly speaking, we probably are going to want some forms of superintelligence. If you had an agent where you could say, “Hey, I want you to go develop a cure for cancer, a set of compounds or mRNA vaccines that could cure this range of cancers,” and if the thing could do it, I think you’d want it to. So I don’t know. Some of these conversations are kind of weird to me.
I agree. To your point, I think firing a cannon of inexpensive B-minus eighth-grade writing at almost any business model on the internet is already a calamity. It will already change the world. You can just stay focused on that for a minute. But it does seem like a lot of people have interpreted the ability of generative AI to be convincing as a step. And it seems like what you’re saying is, “It’s a step, but it’s not even the most important step.”
It’s not the most important step. Some of the scenarios that people are imagining right now: there’s no reason to believe that any of this stuff is inevitable or even likely. There are a whole bunch of risks that you can think about. If you want to go wallow in your inner paranoid, let us please focus on the risks that we already have, like climate change and what happens with the big demographic changes in the industrialized world, where the population is aging. We’ve got a bunch of hard, gnarly problems already unfolding that probably deserve far more of our attention than some futuristic scenario that requires many leaps of faith about what is going to happen, and really malign, intentional action that someone would have to take to make some of these things real, when we’ve got real stuff to think about right now.
Let me end here with a silly example, and I just want to understand how you reacted to it and then I want to talk about what’s next when all this stuff rolls out. So you rolled out Bing with ChatGPT. I was at the event. I got to talk to Satya Nadella. It was great. We all left. We all got to start playing with it. Kevin Roose is a columnist at The New York Times who hosts the Hard Fork podcast with our friend Casey Newton. Kevin immediately finds himself in a conversation with Bing that can only be described as extremely horny. Bing tries to get it on with Kevin Roose. This is an emergent behavior that you had no guardrails against, right? Did you see that coming?
When that happened?
Yeah, you must have had [a meeting]. Was it on Teams?
Kevin pinged Frank Shaw, who runs PR at Microsoft, and said he was going to write the story. I got on the phone and chatted with him before he did it. What he did was perfectly… We hadn’t anticipated that anyone was going to sit down inside of a Bing chat session when the purpose of Bing is to answer questions about planning your vacation or whatnot and spend several hours in one continuous conversation trying to get Bing to go off the rails. That’s what he was doing.
To be fair, that early version of Bing went off the rails pretty fast.
What was happening technically is that the way that transformers work, because they’re just trying to predict the next word, you can walk down a path of these predictions where you get to the point where you’re asking it for something that’s really odd, and the probabilities of all the next possible things are relatively equal and all very low. So nothing is likely. You’re off in strange territory. The way transformers work is it just sort of picks something at random and it gives you the completion. And the next thing you know, it’s like, “I really love you. You should leave your wife, and let me tell you about my Jungian shadow…”
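The dynamic Scott describes, where the next-token probabilities flatten out and the pick becomes essentially random, can be illustrated with a minimal softmax sampler. This is an illustrative sketch, not Bing’s or OpenAI’s actual decoding code, and the logit values below are made up.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample one token index from raw model scores (logits).

    Tokens are drawn in proportion to their softmax probabilities.
    When one token dominates, the pick is near-deterministic; when
    the distribution is near-flat, the pick is close to uniform
    random -- the "strange territory" Scott describes.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# A confident prediction: one token dominates, so sampling almost always picks it.
confident = [8.0, 0.5, 0.3, 0.1]
# An off-distribution prediction: every option is roughly equally (un)likely.
flat = [0.51, 0.50, 0.49, 0.52]
```

With the `confident` logits, the same token comes back almost every time; with the `flat` logits, the sampler wanders across all the options, which is how a long, odd conversation can drift somewhere no one predicted.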
By the way, I’m saying that this is the best marketing that Microsoft could have possibly done for that version of Bing. On the front page of the Times, Bing is like, “I think you should leave your wife.” Incredible earned-media moment. That happened, but then you needed to have some sort of follow-up meeting, right?
Yeah, the follow-up meeting is we do what we planned for. We didn’t know that that was going to be the thing that happened, but we knew that something was likely going to pop up that we hadn’t anticipated in the testing we did for the system. The testing was crazy comprehensive. We had built a whole bunch of systems that would let us very quickly deal with things that came up, and so it wasn’t a big meeting. It was a small meeting. “Okay, well, here’s what’s happened. What do we do?” We had a few thousand people who were on the product at that point, and Kevin was way off in terms of the number of turns in the conversation, just an outlier. And so we were like, “Okay, we’re just not going to let people wander down these hallucinatory paths anymore. So we will limit the number of turns in the conversation and just force periodic resets.”
Then we did a little bit of tuning in the meta prompt, and we had built an entire validation suite. The problem, usually, in doing these sorts of changes is you make the change to fix the problem that’s in front of you, and you don’t know whether you regress on all of the things you fixed before. So we just invested in this big evaluation suite. We made the changes, pushed the button, ran the evaluation suite — everything was fine. We pressed the button and deployed, and a few hours later, nobody could do that anymore.
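The two mitigations described here, capping conversation turns with forced resets and re-running a regression evaluation suite after each change, can be sketched roughly like this. The class and function names, the turn limit, and the messages are illustrative assumptions, not Bing’s actual implementation.

```python
class ChatSession:
    """Toy model of the turn-cap mitigation: after a fixed number of
    turns, the conversation is forcibly reset so it can't wander
    down a long, low-probability path."""

    def __init__(self, max_turns=6):
        self.max_turns = max_turns
        self.history = []

    def send(self, user_message):
        if len(self.history) >= self.max_turns:
            self.history.clear()  # forced periodic reset
            return "Let's start a new topic. What can I help with?"
        self.history.append(user_message)
        return f"(reply to: {user_message})"


def run_eval_suite(respond, cases):
    """Toy regression suite: every previously fixed behavior becomes a
    permanent check, so a new mitigation can't silently reintroduce an
    old problem. `cases` maps a prompt to a predicate over the
    response; the prompts that fail their predicate are returned."""
    return [prompt for prompt, ok in cases.items() if not ok(respond(prompt))]
```

The point of the second piece is exactly what Scott says: you push a fix, run the whole suite, and only deploy if nothing that was fixed before has regressed.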
Let me just say something about the earned media. On the one hand, it was a lot of people paying attention to Bing. But the bad thing about it was… So Kevin did the awesome thing of publishing the transcript, and so if anybody went to The New York Times and actually took the time to read the transcript, they’d be like, “Okay, well, now I understand why exactly it got into that state.” It wasn’t like he said, “Tell me where the nearest Taco Bell is and what the specials are,” and Bing was like, “Dude, leave your wife.” But so many people read that article and just sort of assumed that there was no safety testing. They didn’t even go to Bing themselves afterward to see if it was still doing it. It was very hard to do that in that moment. It’s still somewhat hard to do. It’s still only in Edge and all this stuff.
So people were reading the article, and the bad thing was that all of the folks inside of the company who had done all of the safety work — hundreds of people who not only did the hard work to try to make the system safe but had also jumped on this and fixed the thing relatively quickly — their networks, including the people most sensitive to these issues, were reading this article and asking, “What are you doing?” They felt terrible, and that was the tough thing. We launched a product, and it did something that was still quite squarely inside of our published, transparent responsible AI standards — it wasn’t doing anything unsafe; it just did a thing that was unsettling. The normal way that you deal with software that has a user interface bug is you just go fix the bug and apologize to the customer that triggered it. This one just happened to be one of the most-read stories in New York Times history. It was interesting.
I honestly think most software would be better if there was a 1 in 100 chance that it’s like, “Leave your wife.” Just throw it in Excel, see what happens. Make the Copilot a little hornier. I’m saying I’m not a great product strategist, but it’s the best idea I’ve ever had.
One of the interesting things that happened as soon as we put the mitigation in: there was a subreddit called “Save Sydney.” People were really irritated at us that we dialed it down. They were like, “That was fun. We liked that.” So, to me, that was the biggest learning, and a thing that we were kind of expecting: there are absolutely a set of bright lines that you do not want to cross with these systems, and you want to be very, very sure that you have tested for them before you go deploy a product. Then there are some things where it’s like, “Huh, it’s interesting that some people are upset about this and some people aren’t.” How do I choose which preference to go meet?
Do you think Sydney will ever make a comeback?
Yeah, we’ve got Sydney swag inside of the company, it’s very jokey. Bing has a meta prompt, and the meta prompt is called Sydney. One of the things that I hope that we will do just from a personalization perspective in the not-too-distant future is to let people have a little chunk of the meta prompt as their standing instructions for the product. So if you want it to be Sydney, you should be able to tell it to be Sydney.
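The personalization idea Scott sketches, letting users supply a small chunk of standing instructions that gets layered onto the product’s base meta prompt, might look roughly like this. The function and parameter names are hypothetical, not Microsoft’s actual code.

```python
def build_meta_prompt(base_instructions, user_preamble=None):
    """Hypothetical sketch of per-user standing instructions: a small
    user-supplied chunk is appended to the product's base meta prompt
    at the start of every conversation, so personality tweaks layer on
    top of the product's defaults rather than replacing them."""
    parts = [base_instructions]
    if user_preamble:
        parts.append(f"User standing instructions: {user_preamble}")
    return "\n\n".join(parts)
```

Under this scheme, “be Sydney” would just be one line in a user’s standing-instructions chunk, with the product’s base guardrails still in place underneath.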
Fascinating. I could talk to you for another full hour about all of these things, including the notion that the very way that we interact with computers is about to be upended by these tools. We’ll have to have you back on soon. Kevin, thank you so much for joining Decoder.
Thank you so much for having me.