six colors

by Jason Snell & Dan Moren

By Jason Snell

A bumpy road to the Apple conference call transcript

So yesterday I generated an enormous transcript of Apple’s conference call with analysts. While I can type fast, I have to admit that fast typing is not what allowed me to generate this transcript—it was magic, resilience, and panic.

Three months ago I wrote about how I experimented with the Trint transcription service, which uses computerized speech-to-text algorithms to generate a transcript that I can then correct on the fly with a convenient browser-based text editor that connects with the underlying audio.

Trint did a great job last time, so I set up my workflow Tuesday to take full advantage. I set Audio Hijack to record the call and split those recordings into new chunks every 10 minutes, and hooked up Trint’s integration with Dropbox so that I could upload files to Trint just by copying them to the right folder. The plan was to listen to the first five or six minutes of the call, upload a first file to Trint, and then just keep working a little bit behind the live call. It was a foolproof way to generate an almost-live call transcript.
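The hand-off of finished chunks to the synced folder could also be scripted. What follows is a minimal sketch in Python, not the setup described above: the folder locations, the `.mp3` extension, the 30-second "finished" threshold, and the function name `copy_finished_chunks` are all assumptions made for illustration. It treats a chunk as finished once its modification time has stopped changing for a while, then copies it into whatever folder the upload service is watching.

```python
import shutil
import time
from pathlib import Path


def copy_finished_chunks(src_dir, dest_dir, min_age_seconds=30.0, now=None):
    """Copy audio chunks that appear finished into a synced folder.

    A chunk counts as finished when its modification time is at least
    min_age_seconds in the past (an assumed heuristic, not something the
    recording app guarantees). Returns the names of the files copied.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    now = time.time() if now is None else now

    copied = []
    for chunk in sorted(src.glob("*.mp3")):  # assumed file extension
        target = dest / chunk.name
        if target.exists():
            continue  # already handed off on an earlier pass
        if now - chunk.stat().st_mtime < min_age_seconds:
            continue  # modified too recently; probably still recording
        shutil.copy2(chunk, target)
        copied.append(chunk.name)
    return copied
```

Calling this every minute or so during the call, with hypothetical paths such as `copy_finished_chunks("Recordings/call", "Dropbox/transcribe")`, would push each 10-minute chunk along shortly after it closes and skip anything already copied.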

Except… during the call, Trint (or, apparently, the third-party speech-to-text engine Trint is based on) crashed and burned. I managed to get two transcripts returned to me, but none of the rest of my files were processed.

What saved my bacon was that a while ago, I set up an experimental feature of the Auphonic podcast postproduction service, which routes audio through Google’s speech-to-text engine and automatically generates a podcast transcript. Like the other machine-generated transcripts I wrote about earlier this year, the results aren’t readable by human beings without a pass by a human editor.

Anyway, I ended up uploading my audio files of the call to Auphonic as if they were podcasts and having the service run them through Google’s speech-to-text engine. I opened the resulting file in BBEdit and played back the call audio in iTunes, correcting as I went. (I use SizzlingKeys by Yellow Mug Software to add keyboard shortcuts that make iTunes jump back a few seconds, which is a huge help in editing a transcript.)

The result is a transcript that’s pretty accurate and was generated far faster than I could’ve typed it, though I definitely would have preferred to use the audio-linked text editor offered by Trint.

Here’s an original chunk from yesterday’s call, as heard by Google:

Mike that is a great question. Since we I and I could not be more excited about a are and what we’re seeing what they are kid in the early going in to answer question about what category it starts in, just take a look at what’s already on the on the web on terms of what people are doing and it is all over the place.

And here’s the cleaned-up version:

Mike, that is a great question. And I could not be more excited about AR and what we’re seeing with ARKit in the early going. And to answer your question about what category it starts in, just take a look at what’s already on the web in terms of what people are doing and it is all over the place.

(I have to say, I was really impressed with the quality of Google’s transcript. It made a lot of dumb mistakes, but it also correctly interpreted stuff that I would have never believed a computer could understand.)

If Trint had been working, I really do think I could’ve had the entire transcript up within 10 minutes of the call ending. Maybe next time.
