The dream of converting podcasts into text

Note: This story has not been updated since 2021.

I love podcasts. But you can’t use Google to search for something that was said during a podcast. And if you can’t listen—because of physical disability, personal preference, learning style, or any other reason—you’ll never know what’s being said. An entire vibrant, conversational, fun corner of the digital media world is closed to you.

The solution is clear: Create a text transcript for every podcast! I’m pretty sure that this will eventually happen, but we’re not there yet. Speech-to-text technology just isn’t good enough, and human-created transcripts are more expensive than most podcasts can afford.

There are services that offer human transcriptions of podcasts (I’ve used both CastingWords and Rev), but they aren’t cheap. The cheapest I’ve seen is $1 per minute. That’s not unreasonable if you’re a highly capitalized commercial podcast with a big budget, but I’d wager that 98 percent of podcasts would lose money if they had to pay $1 per minute for transcripts.
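To make that arithmetic concrete, here is a rough sketch in Python. The $1-per-minute rate is the figure quoted above; the weekly, hour-long show is a hypothetical example, not any particular podcast.

```python
def transcript_cost(minutes, rate_per_minute=1.00):
    """Cost in dollars of a human transcript for one episode."""
    return minutes * rate_per_minute


def yearly_cost(episodes_per_year, minutes_per_episode, rate_per_minute=1.00):
    """Cost in dollars of transcribing every episode for a year."""
    return episodes_per_year * transcript_cost(minutes_per_episode, rate_per_minute)


# Hypothetical show: one 60-minute episode every week.
print(f"Per episode: ${transcript_cost(60):,.2f}")
print(f"Per year:    ${yearly_cost(52, 60):,.2f}")
```

And that is before counting the time spent correcting the transcript afterward.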

But beyond that, these human-based transcription services still generate transcripts that are full of errors, misunderstandings, and nonsensical statements. The more arcane or technical the discussion—or the more voices on a podcast—the worse it can get. If you really want your transcript to be good, you have to go over it yourself, preferably by listening along—and that takes time. The cost just went up even more.

The great hope lies in software transcriptions, which can either ease the burden of human transcriptionists or replace them entirely. There are a few platforms currently offering speech-to-text transcriptions—I used Google’s API via the Auphonic service—and they cost a lot less than paying a human to transcribe them. But as you might expect, the results are comical at best¹, unintelligible at worst.

I ran last week’s Six Colors Secret Subscriber Podcast through the Google engine, just as a sample. Here’s something Dan said, which I transcribed and edited:

Right, and I think the hugest win here is this idea that Workflow succeeded in an environment where Apple did very little to foster anything in that area. From the scripting side and the Automator side, I think they were always kind of awkward because you could be very good at automating or scripting, but it always felt to me like a weird middle ground where people who are not technical… there was just no chance that they were going to sit down to write an AppleScript. And then for a lot of people who are very technical—we know many programmers and I’m one of these people who did work in programming—I have the hardest time grokking an AppleScript.

And here’s what the machine presented to me:

Right and I think that’s an amazing and the sort of like you just win here is this idea that you know workflow succeed in an environment where Apple did very little to sort of foster anything in that area from the scripting side in like the automator side I think they were always kind of awkward because, you could be very good at automated or automating or scripting put it always to me felt like a weird Middle Ground where people who are not technical like there was just no chance like, we’re going to sit down to write an apple script and then for a lot of people who are very we know many programmers who and I will I’m one of these people who did work in programming I have the hardest time cracking an apple script

And that was one of the cleaner passages. To get from the machine’s version to the edited one, I had to go over the audio and correct all of the mistakes (and make some editorial judgments to remove filler words and false starts).

This also points out another problem with text transcripts of talking, namely that we don’t talk the way we write. Even the most conversational of writers 👋🏻 will be more direct than a transcript of how people speak. The way our brains process speech is very different from the way they process writing. If I were to “translate” Dan’s statement into writing, it might look like this:

Perhaps most impressive is that Workflow succeeded despite Apple doing very little to foster automation on iOS. On the Mac, AppleScript and Automator always seemed awkward to approach if you weren’t already a fairly technical person. I used to work as a programmer, and even I had trouble figuring out how to use AppleScript.

Here’s the good news: While these machine transcriptions aren’t readable, they are getting good enough to fuel search engines. A great proof of concept is this one from David Smith, which covers seven different podcasts.

I’m a little baffled why Google hasn’t just indexed the contents of every podcast on the Internet and poured it into the Google search engine. David’s engine works well because the computerized transcript is attached to a time code for that podcast episode, so when you find a search result you can click to hear what was really said, rather than relying on a baffling transcript.
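The idea of tying each chunk of transcript to a time code can be sketched in a few lines of Python. This is only an illustration of the general approach, not how David’s engine actually works; the `Segment` type, the function names, the episode label, and the time codes are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    episode: str          # hypothetical episode label
    start_seconds: float  # where this chunk of speech begins in the audio
    text: str             # the (messy) machine transcript for the chunk


def build_index(segments):
    """Map each lowercased word to the positions of segments containing it."""
    index = {}
    for position, segment in enumerate(segments):
        for word in segment.text.lower().split():
            index.setdefault(word, set()).add(position)
    return index


def search(segments, index, query):
    """Return the segments that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return [segments[i] for i in sorted(matches)]


def timecode(seconds):
    """Format seconds as H:MM:SS so a result can link to the audio."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"


# Sample data: text borrowed from the machine transcript above,
# with invented time codes.
segments = [
    Segment("subscriber-podcast", 75.0, "workflow succeed in an environment"),
    Segment("subscriber-podcast", 3725.0, "sit down to write an apple script"),
]
index = build_index(segments)
for hit in search(segments, index, "apple script"):
    print(hit.episode, timecode(hit.start_seconds), hit.text)
```

Note that a search for “AppleScript” would find nothing here, because the machine heard “apple script.” A real search engine would need some fuzzier matching, but even this crude version gets you to the right moment in the audio.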

This could go a long way to addressing the searchability of podcasts, which is why I’m hoping to (slowly) add automatic transcripts to all my podcasts. They won’t be great reading—which is why in the long term this technology needs to get much better in order to support people who are unable to listen at all—but they will help feed search engines and make it easier to find that moment when I first had Matt Fraction’s “Hawkeye” recommended to me.

¹ “Goodnight everybody for listening to be uncomfortable I’ve been your Hostess and smell but really I Batman.”


By Jason Snell
