After comparing various speech-to-text engines and staring at transcripts, I got intrigued by how much more metadata about the speech I was getting back from Watson. With both timings and confidence levels available, I built a little visualizer for the transcript that colors words based on confidence and attempts to insert some punctuation:
This is a talk by Neil Gaiman, given at the Long Now Foundation, about how stories last.
Words are colored on a red -> yellow scale based on how uncertain the recognizer was about them.
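To give a rough idea of how a coloring like this can work, here is a minimal sketch. It is not the visualizer's actual code. It assumes confidence is a number between 0 and 1, and it assumes that the least certain words should be red and the more certain ones yellow; the function names `confidence_to_color` and `render_word` are made up for illustration.

```python
import html


def confidence_to_color(confidence: float) -> str:
    """Map a confidence in [0, 1] to a hex color between red and yellow.

    Assumption: 0.0 (least certain) is pure red, 1.0 (most certain) is yellow.
    """
    # Clamp so out-of-range values still give a valid color.
    confidence = max(0.0, min(1.0, confidence))
    # Red stays at full strength; the green channel grows with confidence.
    green = round(255 * confidence)
    return f"#ff{green:02x}00"


def render_word(word: str, confidence: float) -> str:
    """Wrap a word in an HTML span whose background reflects its confidence."""
    color = confidence_to_color(confidence)
    return f'<span style="background:{color}">{html.escape(word)}</span>'
```

A real visualizer would probably leave high-confidence words uncolored so that only the doubtful ones stand out; that threshold is left out here to keep the sketch short.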
A few things I learned along the way with this. Reverse-engineering punctuation into transcriptions of speech is hard. Originally I was trying to figure out whether there was some speech delay that would let me guess a comma vs. a period, and very quickly that just turned into mush. The rule I came up with, which wasn't terrible, is to put in a comma for delays of 0.1 to 0.3s and, for longer pauses, one period of an ellipsis for every 0.1s of delay. That gives a sense of the dramatic pauses, and does make it mentally easier to read along.
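The pause rule is easy to sketch in code. The version below is an illustration under a few assumptions, not the visualizer's actual source: each word is taken to be a `(text, start_seconds, end_seconds)` tuple, the pause is measured from the end of one word to the start of the next, and the names `pause_mark` and `punctuate` are hypothetical. Gaps are converted to whole milliseconds first so that floating-point noise does not shift a pause across a 0.1s boundary.

```python
from typing import List, Tuple

# Assumed shape of one recognized word: (text, start_seconds, end_seconds).
Word = Tuple[str, float, float]


def pause_mark(gap_ms: int) -> str:
    """Return the punctuation to insert for a pause of gap_ms milliseconds."""
    if gap_ms < 100:
        return ""  # too short to count as a pause
    if gap_ms <= 300:
        return ","  # short pause: comma
    return "." * (gap_ms // 100)  # long pause: one period per 0.1s


def punctuate(words: List[Word]) -> str:
    """Join words, inserting pause punctuation between consecutive words."""
    pieces = []
    for i, (text, _start, end) in enumerate(words):
        if i + 1 < len(words):
            next_start = words[i + 1][1]
            gap_ms = round((next_start - end) * 1000)
            text += pause_mark(gap_ms)
        pieces.append(text)
    return " ".join(pieces)


if __name__ == "__main__":
    sample = [
        ("so", 0.0, 0.2),
        ("I", 0.4, 0.5),
        ("was", 0.55, 0.8),
        ("thinking", 0.8, 1.3),
        ("about", 1.8, 2.1),
    ]
    print(punctuate(sample))
```

With the sample timings, the 0.2s gap after "so" becomes a comma and the 0.5s gap after "thinking" becomes five periods, which is the "dramatic pause" effect described above.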
It definitely shows how the metadata around speech-to-text output can make human understanding of the content a lot easier. It's nice that you can get that out of Watson, and it would be great if more environments supported that.