Six Colors

by Jason Snell & Dan Moren

By Jason Snell

A bumpy road to the Apple conference call transcript

Note: This story has not been updated for several years.

So yesterday I generated an enormous transcript of Apple’s conference call with analysts. While I can type fast, I have to admit that fast typing is not what allowed me to generate this transcript—it was magic, resilience, and panic.

Three months ago I wrote about how I experimented with the Trint transcription service, which uses computerized speech-to-text algorithms to generate a transcript that I can then correct on the fly with a convenient browser-based text editor that connects with the underlying audio.

Trint did a great job last time, so I set up my workflow Tuesday to take full advantage. I set Audio Hijack to record the call and split those recordings into new chunks every 10 minutes, and hooked up Trint’s integration with Dropbox so that I could upload files to Trint just by copying them to the right folder. The plan was to listen to the first five or six minutes of the call, upload a first file to Trint, and then just keep working a little bit behind the live call. It was a foolproof way to generate an almost-live call transcript.
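The copy-to-the-right-folder step could also be scripted. What follows is a minimal Python sketch, not the setup described above: the folder locations, the `.mp3` extension, and the assumption that chunk filenames sort in recording order are all hypothetical. It copies every finished chunk into the synced upload folder and skips the newest file, on the assumption that the recorder is still writing it.

```python
import shutil
from pathlib import Path


def copy_finished_chunks(recordings: Path, upload_folder: Path) -> list[str]:
    """Copy finished audio chunks that are not yet in the upload folder.

    Assumptions (hypothetical, not from the article): chunks are .mp3 files
    whose names sort chronologically, and the newest one is still being
    written by the recorder, so it is left alone.
    """
    upload_folder.mkdir(parents=True, exist_ok=True)
    chunks = sorted(recordings.glob("*.mp3"))
    finished = chunks[:-1]  # skip the newest chunk; it may be incomplete
    copied = []
    for chunk in finished:
        target = upload_folder / chunk.name
        if not target.exists():  # don't re-upload a chunk already copied
            shutil.copy2(chunk, target)
            copied.append(chunk.name)
    return copied
```

Run every minute or so during a call, a function like this would keep the upload folder a chunk behind the live recording, which matches the "a little bit behind the live call" plan.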

Except… during the call, Trint (or, apparently, the third-party speech-to-text engine Trint is based on) crashed and burned. I managed to get two transcripts returned to me, but none of the rest of my files were processed.

What saved my bacon was that a while ago, I set up an experimental feature of the Auphonic podcast postproduction service, which routes audio through Google’s speech-to-text engine and automatically generates a podcast transcript. Like the other machine-generated transcripts I wrote about earlier this year, the results aren’t readable by human beings without a pass by a human editor.

Anyway, what I ended up doing was uploading my audio files of the call to Auphonic, as if they were podcasts, and having the service process them and run them through Google’s service. I opened the resulting file in BBEdit and played back the call audio in iTunes, correcting as I went. (I use SizzlingKeys by Yellow Mug Software to add keyboard shortcuts that make iTunes jump back a few seconds, which is a huge help in editing a transcript.)

The result is a transcript that’s pretty accurate and was generated far faster than I could’ve typed it, though I definitely would have preferred to use the audio-linked text editor offered by Trint.

Here’s an original chunk from yesterday’s call, as heard by Google:

Mike that is a great question. Since we I and I could not be more excited about a are and what we’re seeing what they are kid in the early going in to answer question about what category it starts in, just take a look at what’s already on the on the web on terms of what people are doing and it is all over the place.

And here’s the cleaned-up version:

Mike, that is a great question. And I could not be more excited about AR and what we’re seeing with ARKit in the early going. And to answer your question about what category it starts in, just take a look at what’s already on the web in terms of what people are doing and it is all over the place.

(I have to say, I was really impressed with the quality of Google’s transcript. It made a lot of dumb mistakes, but it also correctly interpreted stuff that I would have never believed a computer could understand.)

If Trint had been working, I really do think I could’ve had the entire transcript up within 10 minutes of the call ending. Maybe next time.
