By Jason Snell
August 2, 2017 2:25 PM PT
A bumpy road to the Apple conference call transcript
So yesterday I generated an enormous transcript of Apple’s conference call with analysts. While I can type fast, I have to admit that fast typing is not what allowed me to generate this transcript—it was magic, resilience, and panic.
Three months ago I wrote about how I experimented with the Trint transcription service, which uses computerized speech-to-text algorithms to generate a transcript that I can then correct on the fly with a convenient browser-based text editor that connects with the underlying audio.
Trint did a great job last time, so I set up my workflow Tuesday to take full advantage. I set Audio Hijack to record the call and split those recordings into new chunks every 10 minutes, and hooked up Trint’s integration with Dropbox so that I could upload files to Trint just by copying them to the right folder. The plan was to listen to the first five or six minutes of the call, upload a first file to Trint, and then just keep working a little bit behind the live call. It was a foolproof way to generate an almost-live call transcript.
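For anyone who wants to script the hand-off step rather than drag files by hand, here is a minimal Python sketch. It rests on a few assumptions that are not from the setup described above: the chunk files land in one folder, their names sort in recording order, and the newest file is the one still being written. The function name `copy_finished_chunks` and the folder paths are hypothetical, and the upload folder is simply whichever Dropbox-synced folder the transcription service watches.

```python
import shutil
from pathlib import Path


def copy_finished_chunks(recordings_dir, upload_dir, suffix=".mp3"):
    """Copy every chunk except the newest one into the upload folder.

    Assumption: file names sort in recording order, so the last name in
    sorted order is the chunk still being recorded and must be skipped.
    Returns the names of the files copied on this pass.
    """
    recordings = Path(recordings_dir)
    upload = Path(upload_dir)
    upload.mkdir(parents=True, exist_ok=True)

    chunks = sorted(recordings.glob(f"*{suffix}"), key=lambda p: p.name)
    copied = []
    for chunk in chunks[:-1]:  # leave the in-progress chunk alone
        target = upload / chunk.name
        if not target.exists():  # skip anything copied on an earlier pass
            shutil.copy2(chunk, target)
            copied.append(target.name)
    return copied
```

Run every minute or so during a call, this would push each completed chunk exactly once. It does nothing about the service on the other end failing, which is the part that actually went wrong here.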
Except… during the call, Trint (or, apparently, the third-party speech-to-text engine Trint is based on) crashed and burned. I managed to get two transcripts returned to me, but none of the rest of my files were processed.
What saved my bacon was that a while ago, I set up an experimental feature of the Auphonic podcast postproduction service, which routes audio through Google’s speech-to-text engine and automatically generates a podcast transcript. Like the other machine-generated transcripts I wrote about earlier this year, the results aren’t readable by human beings without a pass by a human editor.
Anyway, what I ended up doing was uploading my audio files of the call to Auphonic, as if they were podcasts, and having the service process them and run them through Google’s service. I opened the resulting file in BBEdit and played back the call audio in iTunes, correcting as I went. (I use SizzlingKeys by Yellow Mug Software to add keyboard shortcuts to make iTunes jump back a few seconds, which is a huge help in editing a transcript.)
The result is a transcript that’s pretty accurate and was generated far faster than I could’ve typed it, though I definitely would have preferred to use the audio-linked text editor offered by Trint.
Here’s an original chunk from yesterday’s call, as heard by Google:
Mike that is a great question. Since we I and I could not be more excited about a are and what we’re seeing what they are kid in the early going in to answer question about what category it starts in, just take a look at what’s already on the on the web on terms of what people are doing and it is all over the place.
And here’s the cleaned-up version:
Mike, that is a great question. And I could not be more excited about AR and what we’re seeing with ARKit in the early going. And to answer your question about what category it starts in, just take a look at what’s already on the web in terms of what people are doing and it is all over the place.
(I have to say, I was really impressed with the quality of Google’s transcript. It made a lot of dumb mistakes, but it also correctly interpreted stuff that I would have never believed a computer could understand.)
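To put a rough number on how much cleanup a passage like that one needed, a word-level comparison is one option. The sketch below uses Python's standard `difflib` module on the two passages quoted above. The helper names `word_similarity` and `changed_spans` are made up for illustration, and the similarity ratio is only a crude proxy for editing effort, not a standard transcription accuracy metric.

```python
import difflib

raw = (
    "Mike that is a great question. Since we I and I could not be more "
    "excited about a are and what we’re seeing what they are kid in the "
    "early going in to answer question about what category it starts in, "
    "just take a look at what’s already on the on the web on terms of "
    "what people are doing and it is all over the place."
)
clean = (
    "Mike, that is a great question. And I could not be more excited "
    "about AR and what we’re seeing with ARKit in the early going. And to "
    "answer your question about what category it starts in, just take a "
    "look at what’s already on the web in terms of what people are doing "
    "and it is all over the place."
)


def word_similarity(a, b):
    """Return a 0..1 similarity ratio over lowercased, whitespace-split words."""
    return difflib.SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def changed_spans(a, b):
    """Return (before, after) pairs for every run of words that differs."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    return [
        (" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    print(f"similarity: {word_similarity(raw, clean):.2f}")
    for before, after in changed_spans(raw, clean):
        print(f"{before!r} -> {after!r}")
```

Because the comparison splits on whitespace, punctuation stays attached to words, so "Mike" and "Mike," count as different. That makes the ratio slightly pessimistic about the machine transcript.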
If Trint had been working, I really do think I could’ve had the entire transcript up within 10 minutes of the call ending. Maybe next time.