Season 1 • Episode 3
The media is plenty freaked out about “deepfakes”: Computer-generated videos of famous people saying things they never actually said. But only the video is faked; the audio parts, the voices of those fake celebrities, were supplied by human impersonators. But now, software exists to mimic anyone’s voice, opening a Pandora’s Box of fraud, deception, and what one expert calls “the end of trust.” Fortunately, a new coalition of 60 news organizations and software companies think they have a way to shut down the nightmare before it begins.
Guests: Ragavan Thurairatnam, Dessa. Nina Schick, author and deepfakes expert. Joan Donovan, Harvard Kennedy School. Charlie Choi, CEO of Lovo. Dana Rao, chief counsel, Adobe.
Deepfakes are the latest in computer-generated imagery: they’re videos of people doing and saying things that they never actually did or said. Like, there’s a video of Obama saying,
“Obama:” “President Trump is a total and complete dipshit.”
But what’s weird is the voices in those videos are still done by human beings. Impressionists. Impersonators. The technology to simulate their voices still wasn’t good enough to fool anyone.
I’m David Pogue—And this…is “Unsung Science.”
Season 1, Episode 3: Voice Deepfakes
Every fall, Adobe hosts a conference called Adobe Max. It’s a chance for the engineers to strut their stuff, show what they’ve been working on, and make announcements to a captive audience of customers and press.
The conference focuses, of course, on creative software—for photos, videos, music, and so on—because that’s Adobe’s thing, right? They make Photoshop for editing photos, Premiere for editing videos, and so on.
One session every year is called Adobe Max Sneak. Here’s how Adobe describes this presentation:
Faux 1: The Max Sneaks session invites our engineers out of the lab and onto the stage. Many Sneaks from previous years have later been incorporated into our products.
In 2016, the Sneak session featured the usual sorts of Adobe experiments. There was a prototype app that replaces the sky in a photo with a different sky, with one click.
There was an app that could adjust the colors in a bunch of photos to match the color scheme of an existing document.
And then…there was Project Voco, which was described as Photoshop for voice.
Speaker Let’s hear from Zeyu about Photoshop voiceovers. Please welcome to the stage… Zeyu. (applause)
Zeyu: Hello, everyone! Let’s do something to human speech. I have obtained this piece of audio where there’s Michael Key talking to Peele about his feeling after getting nominated.
He’s referring to Key and Peele, the comedy duo.
I forgot to mention that Jordan Peele, half of that duo, was sitting right there on the stage. He’d been hired as the cohost for this event. That’s the Jordan Peele who went on to write and direct movies like “Get Out” and “Us.”
Anyway, the Adobe researcher now played a recording of Peele’s partner, Keegan-Michael Key. In the clip, Key is describing his reaction at learning that he’d been nominated for an Emmy.
Key I jumped on the bed and— and I kissed my dogs and my wife, in that order. (laughter)
Zeyu: So how about we mess with who he actually kissed? Project Voco allows you to edit speech in text, so let’s bring it up.
The Voco window shows audio waveforms across the top—and lined up beneath them, the corresponding words.
Zeyu And when we play back, the text and the audio should play back at the same time. So let’s try that.
Key And I kissed my dogs and my wife.
Zeyu OK, so what do we do? Easily, copy paste. Let’s do it.
Using his cursor, Zeyu copied and pasted the word “wife” to make it come earlier in the sentence—
Key And I kissed my wife and my wife. (crowd)
—and then typed over the second occurrence of the word “wife” with the word “dogs.”
Zeyu We can just type the word “dogs” here.
Crowd No, no!
Key And I kissed my wife and my dogs.
But that was just rearranging recorded words. Now came the really nutty stuff.
Zeyu Wait, here’s more, here’s more; we can actually type something that’s not here, so.
Using his keyboard, he deleted the word “wife,” and he typed the word Jordan, as in Jordan Peele.
Zeyu: And here we go:
Key And I kissed Jordan and my dogs. (crowd reacts)
At this point, Jordan Peele leaps out of his chair in mock horror. He’s stomping across the stage, like, “I’m outta here.”
Peele You—you a witch! You a demon!
Zeyu I’m magic. We’re not just going to do with words, we can actually type small phrases. So we do “three times.”
Zeyu: And, playback!
Key And I kissed Jordan three times.
Crowd Ohhhhh!! (Crowd cheering)
So yeah. They had fed 20 minutes of Key’s voice recordings into Project Voco, and now, just by typing, they could make him say things that he had never actually said. And there was absolutely no way to tell that it wasn’t real.
The crowd seemed to love it. The only person who seemed at all troubled—was Jordan Peele.
Peele I, I’m, I’m blown away. I can’t believe that’s possible. You just type it in, and it interprets the person’s voice. If this technology gets into the wrong hands… (laughter)
But Zeyu Jin was quick with reassurance.
Zeyu Don’t worry. We actually have researched how to, like, prevent forgery. We have, like think about like a watermarking detection.
Later, the Adobe blog described the event like this.
Faux 2: Project VoCo allows you to change words in a voiceover simply by typing new words. As always, we’d love your feedback.
And boy, did Adobe get feedback. From the BBC:
Faux 3: It seems that Adobe’s programmers ignored the ethical dilemmas brought up by its potential misuse.
From the Creative Bloq blog:
Faux 4: This raises ethical alarm bells about the ability to change facts after the event.
From Affinity Magazine:
Faux 5: The ethical issues associated with its misuse are endless.
Adobe soon began issuing this statement to reporters:
Faux 6: Project Voco was shown at Adobe Max as a first look of forward looking technologies from Adobe’s research labs, and may or may not be released as a product or product feature.
Well—surprise, surprise: It was not released as a product or product feature. In fact, it was never heard from again.
Now, meanwhile, in the rest of the world, the tech media was abuzz with stories about the rise of deepfakes.
>>TV & NPR audio clips about “deepfakes”
Someone on Reddit first coined that term—deepfakes—to describe videos where the computer has replaced one person’s face with another’s. In the beginning, most deepfakes were made by amateurs grafting popular actresses’ faces into porn videos.
But there was also a hilarious-slash-creepy trend of people putting Nicolas Cage’s face onto other actors in famous movie scenes.
By 2018, deepfakes had gotten good enough that one video of President Obama was convincing in every way—except for the words coming out of his mouth:
Peele: They could have me say things like, I don’t know, “President Trump is a total and complete dipshit.”
Now, you see, I would never say these things. But someone else would.
Buzzfeed had made that video as a sort of public-service announcement about video deepfakes.
Oh, it made the point, alright. But 8.5 million views later, hardly anyone has commented on its one glaring flaw: The computer algorithm did a great job of generating the video of Obama. But they had to use an Obama impersonator—a human being—to do the voice. Guess who they got to do the impression?
Yup—same guy who’d been on the stage two years earlier witnessing the unveiling of Project Voco.
Now, usually, audio technology comes before video. There was radio before there was TV. There were cassette tapes before there were videotapes. There was streaming audio before there was streaming video.
But for some reason, in deepfakes, video came first. Audio deepfakes came along only later—and they took their time.
Let me play you the state of the art in voice deepfakes as of 2017. This is supposed to sound like Donald Trump:
Trump: I am not a robot. My intonation is always different.
By the beginning of 2019, the state of the Trump deepfake had reached this level:
Trump: With this technology, it can make me say anything. Such as the following: Barack Obama is a wonderful man. Do you think this sounds like me? We are working hard to improve these results. That is all for now. See you later, alligator.
Yeah. Maybe much later, alligator.
But then, later in 2019, the world got a load of Fake Joe Rogan.
Rogan: It’s me, Joe Rogan. Friends, I’ve got something new to tell all of you. I’ve decided to sponsor a hockey team made up entirely of chimps. Chimps are just superior athletes. And these chimps have been working out hard. I’ve got them on a strict diet of bone broth and elk meat. See you on the ice, folks.
That is not, in fact, the voice of comedian and top podcaster Joe Rogan. That…is an audio deepfake. It’s not Joe Rogan; it’s Faux Rogan. Let’s meet the guy who made it.
David For my pronunciation pleasure, Ragavan, will you pronounce your name so I can get it right?
Ragavan Yeah, it’s uh— it’s yeah, you could say Ragavan. The last name, even I don’t attempt to pronounce.
RAGAVAN: It’s Thurairatnam, but I’m pretty sure I’m saying it wrong.
Ragavan is the cofounder of Dessa, a Toronto company specializing in machine learning.
David And what is Dessa’s actual business? What –what did you found it to do?
Ragavan We kind of started looking at banks as potential customers. And it kind of we— we ended up making like AI software for these sort of big, big, boring companies — But we also want to do crazy stuff because, like, it’s— we just saw that, you know, this this technology can do so many things. And we —we really wanted to show the world what it could do and also just have some fun.
David So the RealTalk project was one of these side —side hustles.
Ragavan That’s right. Yeah. The Real Talk Project was one of those side projects.
Dessa’s dive into AI speech synthesis began at a company dinner in the summer of 2018.
Ragavan I asked the team like, what can we do to, like, really show people, like deep learning can do amazing things, and also get a lot of attention.
So one of the engineers who ended up working on the project, his name is Hashim Kadem. He said, like, you know, “there’s this podcast that looks like the most popular podcast in the world, Joe Rogan’s podcast, like, if we could get on there, get noticed on there, something, that could be that could be really good.”
And that was the sort of seed of it.
The Dessa team figured they had plenty of source material to use for their Rogan voice clone. After all, Joe Rogan has made over 1600 episodes of his podcast—and they tend to be long. Sometimes five hours long.
Ragavan On the surface it’s like, “oh, there’s hours of podcast recording. This should be easy, right?”
But Joe Rogan, it’s just —like he’s just crazy. He puts his mouth like right on the microphone. It has all these weird things in it. And also, it’s like a conversation, which is just completely different. It’s like —one person talks, the other person talks in the middle of him talking, there’s laughing, there’s like coffee drinking and, you know, all sorts of things. And— and that makes it a lot harder.
So what the team ended up doing was, they ended up using just his ad reads. So like, you know, whenever he’s reading an ad for his podcast—because we knew it’s just him, you know, he’s not going to be doing anything weird. It’s a lot easier.
<<<Clip of Joe Rogan reading an ad >>>
Meanwhile, the team was also working through the tedious process of producing a perfect typed transcript of everything in those Rogan recordings.
Now, the next part of the story requires a gentle understanding of artificial intelligence, machine learning, and deep learning. It’s technical, and I debated just cutting this whole section. But hey—you’ve put on a podcast to learn something, right?
Ragavan? Take it away.
David Are you able to do a layman-friendly distinction between deep learning, and machine learning?
Ragavan Yeah, let’s try that. Normally, software, you have to write all the rules for it. So, like, you know, if you think of an app, like you have to say exactly, like “when the user does this, I want this to happen.”
But with machine learning, what we do is, we, we make software kind of— learn how to do things just by showing it data.
So, for example, let’s say I wanted to recognize something in an image. In machine learning, I would write by hand, like these sort of mathematical things that say, like, “oh, look for straight lines, you know, count how many straight lines there are. Count how many, you know, blobs of yellow there are or red there are.”
OK, I got it, sort of. Then what’s deep learning?
Ragavan With deep learning, we take —take it a bit further, and it kind of learns directly from the data. With deep learning, it’s just like, “give me the data and give me the answer; I will figure out the rest. I will learn everything in order to make this happen.”
I think that’s what’s really powerful about deep learning. And that’s— that’s one of the reasons why, you know, in the past few years, we’ve seen so many crazy things come out of it.
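For listeners who like to see it in code, here is a minimal Python sketch of the first two ideas Ragavan describes: a rule written by hand versus a rule learned from labeled data. It is a toy, not anything Dessa built. The "yellow blob count" feature, the banana task, the function names, and the numbers are all made up for illustration. Deep learning, which would also learn the features from raw pixels, is not shown.

```python
# Toy illustration only: the feature ("yellow blob count"), the task,
# and the training data are invented for this example.


def rule_based_is_banana(yellow_blobs: int) -> bool:
    """Ordinary software: a programmer hard-codes the rule."""
    return yellow_blobs >= 5


def learn_threshold(examples: list[tuple[int, bool]]) -> int:
    """Machine learning in miniature: pick the threshold from labeled data.

    Each example is (yellow_blobs, is_banana). Every observed blob count is
    tried as a candidate threshold, and the one that misclassifies the
    fewest examples is kept.
    """
    candidates = sorted({blobs for blobs, _ in examples})

    def errors(threshold: int) -> int:
        return sum((blobs >= threshold) != label for blobs, label in examples)

    return min(candidates, key=errors)


training_data = [(0, False), (1, False), (2, False), (3, True), (4, True), (6, True)]
learned = learn_threshold(training_data)
print(learned)  # 3
```

The hand-written rule would call a photo with three yellow blobs "not a banana"; the learned threshold gets it right because the data said so, not because a programmer guessed the number.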
I’m beating this dead horse about how hard it is to create a deepfake voice to emphasize… how hard it is to create a deepfake voice! I mean, the early incarnations of the Faux Rogan voice were not that convincing. Here’s an example:
Rogan: Rogan on AlexNet (RealTalk early clip)
Yeah, OK, it’s…something. But even after further work and hand-tweaking, there were still weird gaps and unnatural emphasis. Like this:
Rogan: (Anyone else clip)
But eventually, the Rogan voice got really good. At fakeJoeRogan.com, for example, you can take a little quiz. You can listen to sentences, and try to figure out if they’re Joe Rogan…or Faux Rogan.
I’ll play three examples, and you can test your deepfake radar. Ready? First one:
Rogan: What was the person thinking when they discovered cow’s milk was fine for human consumption? And why did they do it in the first place?
Real or fake? Remember your vote. I’ll give you the answers in a second. OK, second one:
Rogan: Some of you just need to improve the quality of your existence on earth. You gotta do the right things.
And finally, example number 3:
Rogan: Fantastic old-world craftsmanship that you just don’t see any more.
OK, remember your answers. After the break—we’ll see how you did.
Welcome back. Let’s see how you did with your deepfake detection skills.
The first sentence—
Rogan: What was the person thinking when they discovered cow’s milk was fine for human consumption? And why did they do it in the first place?
That one’s fake.
Rogan: Some of you just need to improve the quality of your existence on earth. You gotta do the right things.
Fake again. And finally, example 3:
Rogan: Fantastic old-world craftsmanship that you just don’t see any more.
That’s an actual recording of Joe Rogan.
So how’d you do? Pretty soon, it’s going to matter.
Nina By the end of the decade, you’re looking at a future where one YouTuber with limited resources or skills can kind of produce something that’s better than what the best Hollywood studio can produce today for millions of dollars and with teams of special effects artists. Don’t you worry, that is coming!
This is Nina Schick, author of a book called Deepfakes: The Coming Infocalypse. Like, “apocalypse” but with “info.” “Infocalypse.”
Nina When I first started to come to deepfakes, you know, it was as they were emerging at the end of 2017 in the form of nonconsensual pornography on Reddit. And I immediately realized that deepfakes could become the most powerful weapon of political disinformation known to humanity.
Nina may be one of the world’s most informed experts on why audio deepfakes are dangerous.
Nina Number one, you can fake media of anyone saying or doing anything. So you can imagine how, for instance, if you take the context of the United States after the George Floyd video came out, imagine there was a leaked recording of Donald Trump uttering a racial slur. You can see how that leaked audiotape could, in that incendiary kind of political environment, really kick off something far more dangerous.
But I should note that this also has a very real risk to businesses. Imagine a business leader is caught on tape saying something that they didn’t actually say. It could be potentially devastating.
And, of course, the opportunities for scammers are delicious.
Nina Ultimately, it’s something that can affect every individual, right? One of the classic frauds that is perpetrated against millions of us every day worldwide, is the desperate phone call from a loved one. Right? “Dad, I’ve been in an accident. I need money now. I’m in jail.”
Now, imagine fraudsters can use AI to scrape social media to find a video of your son, your wife, your daughter, and then use that AI to basically emulate their voice with just a few seconds of training data— and now you get the call and it’s literally your son. It is absolutely terrifying, to say the least, that this technology can be deployed by malicious actors without control.
Anyway, deepfakes purporting to show people saying things they never said are only half the problem. The other half is the opposite situation—people blaming deepfakes for things they actually did say!
David I remember that one of Trump’s first responses to the “grab them by the pussy” video was, “I never said that! Software created that.”
Joan That kind of reaction is called the Liar’s Dividend, which is that people can come out and say, “well, I didn’t do that. I didn’t say that —that wasn’t me.”
Meet Joan Donovan, research director at the Harvard Kennedy School’s Shorenstein Center on Media, Politics and Public Policy.
David And that fits on a business card?
Joan Hey, when you’re me, you don’t want anyone to have your email or your phone number. I don’t even have business cards, I don’t want people to know how to get in touch.
She’s spent a lot of time studying misinformation. And she says that the antidote for the liar’s dividend—is other people as witnesses.
Joan You don’t build a court case based on a single shred of evidence. Everything adds up. Right? We have to kind of build or weave a story here.
And then also if it is an interaction that is being that is being faked, like, is there a way to legitimate those claims, just as we would as any good journalists would, you know, verify. But it’s going to require people talking to people to make sense of the thing.
Now, Dessa, the company that created Faux Rogan, did it, as they say, to get attention. And they got it. The whole company was soon thereafter bought by Square, the digital payments company.
But why did Adobe do it? Why did they make Project Voco in the first place?
It wasn’t to torpedo our public trust in anything anybody ever says again. Here’s what Adobe’s blog post said:
Faux 7: When recording voiceovers, dialogue, and narration, wouldn’t you love the option to edit or insert a few words without the hassle of recreating the recording environment or bringing the voiceover artist in for another session?
Voco was created to make life easier for creative people. To fix stumbles in podcasts, audiobooks, and narration. To clean up dialogue in movies, TV shows, and games, when you need to edit lines after the actors are no longer available. To dub movies into other languages with the original actor’s voice.
Here’s Nina Schick again:
Nina You can see how this is going to basically change the future of the movies, change the future of advertising, I mean, change entire industries.
But another really compelling example, is using synthetic voice to give those people who’ve lost the ability to speak, for instance, through a neurodegenerative disease or a stroke, being able to give them their voice back, literally give them their voice back. And there’s already a team of researchers working on this.
And that’s why there are now a bunch of companies that can turn your voice into a deepfake—a voice clone—so that you can type whatever you want to have read aloud in your voice.
To make a voice clone, you need to feed the machine-learning algorithm a lot of clean audio. You’re usually asked to read 20 or 50 sentences into the mic.
DP reading: Sentence number 4. The rainbow is composed of many bands of white light.
That’s partly to teach the AI—and partly to prevent you from cloning the voice of somebody else without their awareness. You’d have to put a gun to their head, sit them down, and make them read those exact sentences.
So: How good is the result? I tried all of the voice-cloning services I could find.
Here’s a voice I generated for free at a site called Resemble.ai:
Resemble: Hello, and welcome to the brilliant new podcast called Unsung Science. I’m David Pogue. Or not.
Wow…Well, Resemble does offer sliders that let you change the pitch, emphasis, and emotion of each word. That “or not” at the end really sounded wrong–
Resemble: Or not.
—so I’m going to make the pitch lower, and change the emotion to annoyed.
Resemble: Or not.
Much better! Or not.
Well, let’s see if I could use it to pull off the phone scam that Nina Schick described:
Resemble: Hi Dad, it’s David, as you can obviously tell by the sound of my voice. I’ve been in an accident. I need money now. I’m in jail. Can you send me some money right away?
Yeah, probably not.
Well, how about its competitor, ReplicaStudios.com?
Replica: Hi Dad, it’s me again. David. I have some bad news. I’ve been brutally mugged in the streets of Paris! I need you to send me money. Lots of money. Please please please.
Nope. Not sold. Without a lot of hand work by engineers, the state of the art is just lame.
Now, with hand work by engineers, the state of the art is really good. This is the David Pogue voice clone made for me by a company called Lovo.ai:
DP Lovo: Now I’m in business. This fake Pogue is much more convincing than those free ones.
To get something that good, I had to read 20 minutes of text. And if I were an actual customer, I would have had to pay a thousand dollars.
You know the voices you’ve been hearing in this episode, reading statements by Adobe, and quotes from various news outlets? They’re all AI voices generated by Lovo.
Gotcha! Yeah—I like my podcasts with a twist.
Charlie So a lot of the AI systems out there, if you feed it in gold, it will output gold. But if you feed it in garbage, it will output garbage.
Meet Charlie Choi. He’s the CEO of Lovo, speaking to me from Korea.
David I tried a bunch of the free voice cloning services and they were not good. Why is it that you can make ones that could actually fool someone and they can’t?
Charlie We have a team of data scientists who, after receiving the recording data, we go in and really try to understand if this person has spoken every single word. And we try to annotate every single emphasis or maybe breathing patterns or laughs, so that the AI voice sounds more natural and more human. And for us, we can even simulate stuttering or, all of these imperfect artifacts which make human voice so real. Because humans aren’t perfect.
Charlie We’re also teaching it where the emphasis goes in, or which part of it is a laugh or which part of it is a sigh. We’re also feeding it, for example, pitch information, so that the model learns how to change around the pitch.
By the way: Remember how Adobe’s Project Voco was meant to make it easier to edit podcasts and audiobooks? Well—that idea was too good to stay down. Today, you can have that freedom by paying for a service called Descript.com. It’s a suite of tools for podcasters to make it easier to edit recordings. Here’s their ad:
Ad: Meet Descript. It’s a powerful new tool that makes editing easy. So easy that you’ll want to edit videos.
And if you’re willing to pay $24 a month, you get this:
Ad: Get this. Descript can turn your text back into audio. It’s called Overdub. Just type what you meant to say right into Descript.
Wait, what? Isn’t that exactly what Project Voco was supposed to do—five years ago?
I tried it out. (Open parenthesis: Descript and the other companies mentioned here didn’t pay me to talk about them; most of ’em didn’t even know I was doing this. Close paren.)
First, I had to teach Descript my voice by reading 15 minutes’ worth of prepared text:
DP: “The penguins stay when all other creatures have fled, because each guards a treasure.”
…and then, 24 hours later, Descript was ready to do the Project Voco thing. Let’s recreate the same Key and Peele joke that Adobe used, but using my own voice. Here’s what I actually recorded:
DP: I jumped on the bed and— and I kissed my dogs and my wife in that order.
And then, I edited the sentence just the way the Adobe guy did onstage, to produce this hilarious result:
Faux DP: I jumped on the bed and I kissed Jordan three times.
OK, that’s pretty amazing.
Voco and Descript are meant to fix a word or two in a legitimate recording. You can’t use them to generate a whole paragraph, or a whole speech.
That is a bigger challenge, and that’s the purpose of services like Lovo—to make a full-scale voice clone that can say anything of any length and sound convincingly human. Right now, they take a lot of work and a lot of money. But Harvard’s Joan Donovan says that technology will march on soon enough.
Joan As deep fakes require fewer and fewer images of people, and audio fakes require fewer and fewer sound bites, it’s pushing us into a future of forgery that is going to it’s– it’s going to be confusing for a while.
So—is that it? Society is doomed? Nobody will ever be able to trust any photo, video, or audio clip again?
Well—maybe not. It may be that you already know about the solution to the deepfakes problem—you heard it described 20 minutes ago.
Remember Adobe’s 2016 demo of Project Voco? In that session, the presenter promised that the company was also working on fraud-detection technology, so we’d know the difference between real and phony recordings. Remember?
Zeyu Don’t worry. We have, like think about like a watermarking detection.
Well, Adobe hasn’t forgotten.
Dana: One of the early experiments we— we were working on was something we called Project VoCo which is a voice editing, synthesizing software. But we actually ended up deciding not to release it yet, because we actually didn’t know how to protect it.
This is Dana Rao, who’s Adobe’s chief counsel. Ever since that Voco demo, he and Adobe’s engineers have been trying to figure out how to prevent a deepfake-ageddon.
Or an Infocalypse.
Dana: I was talking to our chief product officer, I said, you know, we’re probably at the point where, where this is going to be really hard, as we said, tell fact from fiction. In a world where you don’t believe anything anymore, there are two big problems. One is, you believe a lie. And the other big problem is you no longer believe the truth. Right? And once you lose both of those things, if you’re in a democracy, you’ve sort of lost the ability to govern.
Their first thought was to use artificial intelligence to detect if some photo or recording is fake or not.
Dana: So we took the question back to our research team. The first question is, can we use A.I. to detect fakes? Like, that would be the easiest answer, right? And the response we got back from our researchers was, the technology to do the editing, which is what we do, is always going to be at par or step ahead of any technology to detect it. It’s just like the security in the arms race where you like, you’re always— you’re improving your security, but the bad guys are out there improving their attacks. And sooner or later, you’re going to lose that battle, or at least something’s going to get through.
But then—a eureka moment.
We don’t necessarily need technology that can identify a fake. What would be just as good is a way to prove that something is real. That would solve the trust problem. If there’s some leaked recording of the president saying, you know, “I like to run over baby animals,” knowing if it’s authentic would be just as good as knowing if it’s a fake.
Dana: And so we said, “all right, what is another way to talk about this problem?” Let’s flip the problem on its head. And what we meant by that was, why don’t we give a place for good actors to go to be trusted, instead of trying to catch all the bad actors, which we think is a losing proposition? And that’s what CAI is designed to do.
CAI is the Content Authenticity Initiative. Five years after the Project Voco demonstration, Zeyu Jin’s reference to watermarking—
Zeyu Think about, like, a watermarking detection.
—has blossomed into a full-blown—I don’t know, program? Feature? Technology? Campaign? Consortium? All of the above.
I’ll let Dana Rao describe how it works.
Dana: It occurred to us that we’re in this unique position to help the consumers understand, like what happened to an image? I’m gonna, you know, enhance the image. I’m going to make it sharper. I’m going to make it clearer.
You make all the edits, and then you publish it. Once you publish it on the social media platform or wherever it is, the people can see it, they can see a little icon and they’re like, “oh, I wonder if the president really did go there,” and they can click on it and they can say, “well, it was David who took it.” They can see the location of the image, where it was taken. They can see the edits that were made if they want to. They can actually see the original, they can go to our website, see the original image and see edited image and decide for themselves.
Now you have the facts. You decide for yourself. We empower the user to do it. That’s sort of the end to end system that we’re working on with a bunch of different partners to build out and hopefully change the conversation around how you consume content.
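Here is a minimal Python sketch of the general idea Dana describes: attach a record of who made the content and what edits were applied, sign that record, and let anyone check later that neither the record nor the published file has been altered. This is not the CAI's actual format or API. The function names, the record fields, and the shared-secret HMAC signature are assumptions chosen to keep the example inside Python's standard library; a real system would rely on public-key certificates instead of a shared key.

```python
import hashlib
import hmac
import json

# Hypothetical signing key, for illustration only. A real provenance system
# would use public-key certificates rather than a shared secret.
SIGNING_KEY = b"demo-key-not-for-real-use"


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 fingerprint of the content bytes."""
    return hashlib.sha256(data).hexdigest()


def make_credentials(original: bytes, published: bytes,
                     author: str, edits: list[str]) -> dict:
    """Build a signed record: who made it, what it started as, what changed."""
    record = {
        "author": author,
        "original_sha256": fingerprint(original),
        "published_sha256": fingerprint(published),
        "edits": edits,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"record": record, "signature": signature}


def verify(published: bytes, credentials: dict) -> bool:
    """Check that the record is unaltered and matches the published bytes."""
    record = credentials["record"]
    payload = json.dumps(record, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, credentials["signature"])
    content_ok = fingerprint(published) == record["published_sha256"]
    return signature_ok and content_ok


original = b"raw pixels"
published = b"raw pixels, sharpened"
creds = make_credentials(original, published, "David", ["sharpen"])
print(verify(published, creds))           # True
print(verify(b"doctored pixels", creds))  # False
```

Note what the sketch does and does not do: it cannot say whether a photo is fake, only whether the photo in front of you is the one the signer vouched for.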
Obviously, this idea can work only if every link of the chain preserves that cryptographically signed metadata that’s embedded in the picture or recording. The phone camera that takes it. The software that edits it. The social-media network that posts it. Every step of the way.
Dana: And that’s why this is an open standard. It’s not an Adobe tool, it’s not proprietary. We’re building it with a bunch of partners. We want everyone to use it, we want every news media outlet to use it. We want every social platform. We want everyone, whoever does this. This is not a not a play for us to get money. We’re not charging for it. So if you want your story to be told, you can do it.
Already, a bunch of companies are on board, including chip makers like Intel, ARM, and Qualcomm; software makers like Adobe and Microsoft; news outlets like the New York Times, the BBC, and the CBC; websites like Twitter, Facebook, and Getty Images; and 55 other companies.
Here’s an ad from the CAI website, which gives you an idea of how these companies will explain CAI to the public:
Ad: I am photographing with a CAI-enabled prototype. It’s saying, “Don’t take my word for it.” There’s literally software that can prove, like, I didn’t mess with this photo. This is where it was taken, this is when it was taken, and this is the certification that it’s me who’s made that content.
The feature that the CAI companies are adopting has a name, too. It shall be known as “Content Credentials.” When you see something suspicious online, you’ll click a Content Credentials icon to see that content’s credentials. And the path that it took to your eyeballs.
And now, the big punch line: after years of work, Adobe has finally introduced this feature to the public. Just this week—assuming you’re listening to this podcast when it’s hot off the servers—Adobe unveiled the Content Credentials at Adobe Max.
Yeah, that’s right: the story that began at the Adobe Max conference five years ago…ended with the Adobe Max conference this week. This episode has bookends! Now that’s what you call an ingeniously structured podcast.
Of course, the Content Credentials technology isn’t a silver bullet. For one thing, the version just released works only on photos. Adobe hopes to have video and audio authentication maybe next year. Meanwhile, Harvard’s Joan Donovan says we’ll still have a lot of work to do—in policy, law, and public awareness:
Joan: People have figured out how to wield this technology for serious, serious and grave consequences. We have a duty to the future to say that we’re not going to allow it. We’re not going to let it proliferate.
And so as we think about the future of technology policy, I believe we need a whole of society approach. What is our responsibility to one another? What is technology companies’ responsibility for that distribution and that exposure? And then how do we as a society, like, figure out what the true costs of misinformation are, so that we can do something about it?
You know, throughout all of these interviews, I kept thinking: A new technology. Capable of editing a record of actual events. Experts predicting the erosion of public trust…Where have I heard all this before?
Deborah: A picture may no longer be worth a thousand words. These days, the picture that the camera takes may well not be the picture that we end up seeing in newspapers and magazines. Technology makes it difficult, maybe even impossible, to tell what’s real and what’s not.
That’s Deborah Norville, the host of “The Today Show,” in February 1990. Her guest that day was Russell Brown, from Adobe, demonstrating version 1.0 of a brand-new program called… Photoshop.
Russell: We’ll take this shot of Nancy and Ron. I’m gonna place myself into this photograph. Based upon the skill of the artist using the program, they can give the illusion that photograph was quite real.
There was also another guest, a cautionary voice:
Norville: Fred Ritchin is an author who has written a book. You warn against the dangers of what people like Russell do.
Fred: Well, the thing is, when you see a photograph, you really tend to believe that something happened and when people start monkeying with photographs, you don’t know which photographs are real, which ones happened, and which didn’t. My concern is that if the media takes to doing what Russell is demonstrating now, that people, the public, will begin to disbelieve photographs generally, and it won’t be as effective and powerful a document of social communication as it has been for the last 150 years.
Of course, these days, nobody worries about Photoshop bringing down civilization. We’re totally blasé about edited photos. We just go, “oh, that must have been Photoshopped,” and we go on with our lives.
I asked Nina Schick if these audio and video deepfakes are really any different.
David: Is there a newness to audio and video deepfakes that makes it more terrifying? And maybe we’ll just get to a place where everyone’s like, “oh, that’s probably a deepfake?”
Nina: Photo and image manipulation has a long history. The difference now is that it is not just images. You are talking about video—video manipulation, which until now has only been in the realm of Hollywood studios.
Still, she does acknowledge that there’s more to it than the dawn of the Infocalypse.
Nina: Like all powerful technologies of the exponential age, this is going to be an amplifier of human intention. It will be used for bad, just as it will be used for good. So just as they will be used by malicious actors, there are going to be many commercially valid, legitimate applications.
Now, I wanted to end this episode with a twist: I thought I’d let my own voice clone from Lovo speak the final paragraph. But when I got the results back from Charlie Choi, it sounded so much like me that I didn’t think you’d be able to tell when I stopped and the deepfake voice started, and the gag would lose all impact. So I’m going to make it super clear. From the end of this sentence until the credits, you’re going to hear nothing but software, starting…now.
Clone: I thought I’d give the last word to—my clone. My voice clone, the one that Charlie Choi’s team at Lovo made for me. You’re listening to him right now.
And what I’d like my voice to say is that: Well, in the end, voice synthesis is just another technology. What happens from here isn’t about the tool; it’s about whoever’s wielding it.
I’m David Pogue—or a synthetic version thereof. And this…is “Unsung Science.”