All posts by Sean Dague

Visualizing Watson Speech Transcripts

After comparing various speech to text engines, and staring at transcripts, I got intrigued by how much more metadata I was getting back from Watson about the speech. With both timings and confidence levels, I built a little visualizer for the transcript that colors words based on confidence, and attempts to insert some punctuation:

This is a talk by Neil Gaiman about how stories last at the Long Now Foundation.

Words are shaded more toward red or yellow based on how uncertain they are.

A few things I learned along the way with this. Reversing punctuation into transcriptions of speech is hard. Originally I was trying to figure out if there was some speech delay I could use to distinguish a comma from a period, and very quickly that just turned into mush. The rule I came up with, which wasn't terrible, is to put in a comma for 0.1 - 0.3s delays, and to put in one period of an ellipsis for every 0.1s of delay for longer pauses. That gives a sense of the dramatic pauses, and makes it mentally easier to read along.
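For the curious, the heuristic fits in a few lines. This is only a sketch of the rule above, not the actual visualizer code; the function name is made up, and it assumes the per-word (word, start, end) timing tuples that Watson hands back:

def punctuate(timings):
    # timings is a list of (word, start_seconds, end_seconds) tuples,
    # the shape of the per-word timestamps in the Watson response
    if not timings:
        return ""
    words = []
    for (word, start, end), nxt in zip(timings, timings[1:]):
        gap = nxt[1] - end
        if 0.1 <= gap <= 0.3:
            word += ","
        elif gap > 0.3:
            # one dot of ellipsis for every 0.1s of silence
            word += "." * int(gap / 0.1)
        words.append(word)
    words.append(timings[-1][0])
    return " ".join(words)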

It definitely shows how the metadata around speech to text can make human understanding of the content a lot easier. It's nice that you can get that out of Watson, and it would be great if more environments supported that.

Comparing Speech Recognition for Transcripts

I listen to a lot of podcasts. Often months later something about one I listened to really strikes a chord, enough that I want to share it with others through Facebook or my blog. I'd like to quote the relevant section, but also link to about where it was in the audio.

Listening back through one or more hours of podcast just to find the right 60 seconds and transcribe them is enough extra work that I often just don't share. But now that I've got access to the Watson Speech to Text service I decided to try to find out how effectively I could use software to solve this. And, just to get a sense of the world, compare the Watson engine with Google and CMU Sphinx.

Input Data

The input in question was a lecture from the Commonwealth Club of California - Zip Code, not Genetic Code: The California Endowment's 10 year, $1 Billion Initiative. There was a really interesting bit in there about spending and outcome comparisons between different countries that I wanted to quote. The Commonwealth Club makes all these files available as mp3, which none of the speech engines handle. Watson and Google can both do FLAC, and Sphinx needs a wav file. Also it appears that all speech models are trained around the assumption of 16kHz sampling, so I needed to downsample the mp3 file and convert it. Fortunately, ffmpeg to the rescue.
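For reference, the conversion is a one-liner per format. A minimal sketch, with placeholder filenames, driven from python to match the rest of the scripts here (the underlying ffmpeg flag is just the standard sample-rate one):

import subprocess

# Equivalent to: ffmpeg -i lecture.mp3 -ar 16000 lecture.flac
# (and the same again for the wav that Sphinx wants)
for ext in ("flac", "wav"):
    subprocess.run(
        ["ffmpeg", "-i", "lecture.mp3", "-ar", "16000", "lecture." + ext],
        check=True)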

Watson

The Watson Speech to Text API can work either over websocket streaming or with bulk HTTP. While I had some python code to use the websocket streaming for live transcription, I was consistently getting SSL errors after 30 - 90 seconds. A bit of googling hints that this might actually be bugs on the python side. So I reverted to the bulk HTTP upload interface using example code from the watson-developer-cloud python package. The script I used to do it is up on github.
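The shape of the bulk call is roughly the following. Treat it as a sketch from memory of the watson-developer-cloud package of the time (credentials elided, and the exact parameter names may have drifted); the real, working version is the github script:

import json
from watson_developer_cloud import SpeechToTextV1

stt = SpeechToTextV1(username="...", password="...")

with open("lecture.flac", "rb") as audio:
    # ask for the per-word timestamps and confidences, which are the
    # interesting part of the Watson response
    result = stt.recognize(audio, content_type="audio/flac",
                           timestamps=True, word_confidence=True,
                           continuous=True)

with open("watson.json", "w") as out:
    json.dump(result, out, indent=2)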

The first 1000 minutes of transcription are free, so this is something you could reasonably do pretty regularly. After that it is $0.02 / minute of transcription.

When doing this over the bulk interface things are just going to seem to have "hung" for about 30 minutes, but it will eventually return data. Watson seems like it's operating no faster than 2x real time for processing audio data. The bulk processing time surprised me, but then I realized that with the general focus on real time processing most speech recognition systems just need to be faster than real time, and optimizing past that has very diminishing returns, especially if there is an accuracy trade off in the process.

The returned raw data is highly verbose, and has the advantage of per-word timestamps, which makes finding passages in the audio really convenient.
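Trimmed way down, the response looks something like this (the values here are made up for illustration, not taken from the real transcript):

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "one of the things that they ... ",
          "confidence": 0.92,
          "timestamps": [["one", 1494.98, 1495.11], ["of", 1495.11, 1495.18]],
          "word_confidence": [["one", 0.99], ["of", 0.87]]
        }
      ],
      "final": true
    }
  ]
}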

So 30 minutes in I had my answer.

Google

I was also curious to see what the Google experience was like; I originally tried it through their API console, which worked quite nicely. Google is clearly more focused on short bits of audio. There are 3 interfaces: sync, async, and streaming. Only async allows for more than 60 seconds of audio.

In the async model you have to upload your content to Google Storage first, then reference it as a gs:// url. That's all fine, and the Google storage interface is stable and well documented, but it is an extra step in the process, especially for content I'm only going to care about once.
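The upload step itself is small; a sketch using the google-cloud-storage client, with placeholder bucket and object names:

from google.cloud import storage

# Push the FLAC up so the async recognize call can reference it as
# gs://my-stt-scratch/lecture.flac (names are placeholders)
client = storage.Client()
bucket = client.bucket("my-stt-scratch")
bucket.blob("lecture.flac").upload_from_filename("lecture.flac")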

Things did get a little tricky translating my console experience to python... 3 different examples listed in the official documentation (and code comments) were wrong. The official SDK no longer seems to implement long_running_recognize on anything except the grpc interface. And the google auth system doesn't play great with python virtualenvs, because it's python code that needs a custom path, but it's not packaged on pypi. So you need to venv, then manually add more paths to your env, then gauth login. It's all doable, but it definitely felt clunky.

I did eventually work through all of these, and have a working example up on github.

The returned format looks pretty similar to the Watson structure (there are only so many ways to skin this cat), though a lot more compact, as there aren't per-word confidence levels or per-word timings.

For my particular problem that makes Google less useful, because the best I can do is dump all the text to a file, search for my phrase, see that it's 44% of the way through the file, and jump to around there in the audio. It's all doable, just not quite as nice.
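Put another way, the lookup degrades into plain arithmetic on the text dump; a quick sketch (the phrase, filename, and podcast length here are placeholders):

# fraction of the way through the transcript text, scaled by the audio
# length, gets you close enough to scrub to the right spot
with open("google.txt") as f:
    text = f.read()

offset = text.find("accent on the wrong syllable")
fraction = offset / float(len(text))

audio_seconds = 67 * 60  # podcast length, looked up by hand
minutes, seconds = divmod(int(fraction * audio_seconds), 60)
print("jump to about %d:%02d" % (minutes, seconds))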

CMU Sphinx

Being on Linux, it made sense to try out CMU Sphinx as well, which took some googling to figure out how to do.

Then run it with the following:
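A sketch of that invocation (run through subprocess to match the other snippets here; the -infile flag is the standard pocketsphinx_continuous option, and the filenames are placeholders):

import subprocess

# Roughly: pocketsphinx_continuous -infile lecture.wav 2>/dev/null > sphinx.txt
with open("sphinx.txt", "w") as out:
    subprocess.run(["pocketsphinx_continuous", "-infile", "lecture.wav"],
                   stdout=out, stderr=subprocess.DEVNULL, check=True)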

Sphinx prints out a ton of debug output on stderr, which you want to get out of the way, and the transcription itself should be sent to a file. Like with Watson, it's really only going a bit faster than real time, so this is going to take a while.

Converting JSON to snippets

To try to compare results I needed to start with comparable formats. I had 2 JSON blobs, and one giant text dump. A little jq magic can extract all the text:
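(The jq one-liner is equivalent to the few lines of python below, which assume the usual results / alternatives layout in both the Watson and Google responses; filenames are placeholders.)

import json

# Same idea as: jq -r '.results[].alternatives[0].transcript' watson.json
for name in ("watson", "google"):
    with open(name + ".json") as f:
        data = json.load(f)
    text = " ".join(r["alternatives"][0]["transcript"]
                    for r in data["results"])
    with open(name + ".txt", "w") as out:
        out.write(text)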

Comparison: Watson vs. Google

For the purpose of comparisons, I dug out the chunk that I was expecting to quote, which shows up about halfway through the podcast, at second 1494.98 (24:54.98) according to Watson.

The best way I could think of to compare all of these is to start / end at the same place, word wrap the texts, and then use wdiff to compare them. Here is Watson (-) vs. Google (+) for this passage:

one of the things that they [-it you've-] probably all [-seen all-]
{+seem you'll+} know that [-we're the big spenders-] {+where The Big
Spenders+} on [-health care-] {+Healthcare+} so this is per capita
spending of [-so called OECD-] {+so-called oecd+} countries developed
countries around the world and whenever you put [-U. S.-] {+us+} on
the graphic with everybody else you have to change the [-axis-]
{+access+} to fit the [-U. S.-] {+US+} on with everybody else
[-because-] {+cuz+} we spend twice as much as {+he always see+} the
[-OECD-] average [-and-] {+on+} the basis on [-health care-]
{+Healthcare+} the result of all that spending we don't get a lot of
bang for our [-Buck-] {+buck+} we should be up here [-we're-] {+or+}
down there [-%HESITATION-] so we don't get a lot [-health-] {+of
Health+} for all the money that we're spending we all know that that's
most of us know that [-I'm-] it's fairly well [-known-] {+know+}
what's not as [-well known-] {+well-known+} is this these are two
women [-when Cologne take-] {+one killoran+} the other one Elizabeth
Bradley at Yale and Harvard respectively who actually [-our health
services-] {+are Health Services+} researchers who did an analysis
[-it-] {+that+} took the per capita spending on health care which is
in the blue look at [-all OECD-] {+Alloa CD+} countries but then added
to that per capita spending on social services and social benefits and
what they found is that when you do that [-the U. S.-] {+to us+} is no
longer the big [-Spender were-] {+spender or+} actually kind of smack
dab in the middle of the pack what they also found is that spending on
social services and benefits [-gets you better health-] {+Gets You
Better Health+} so we literally have the accent on the wrong syllable
and that red spending is our social [-country-] {+contract+} so they
found that in [-OECD-] {+OCD+} countries every [-two dollars-] {+$2+}
spent on [-social services-] {+Social Services+} as [-opposed to
dollars-] {+a post $2+} to [-one-] {+1+} ratio [-in social service-]
{+and Social Service+} spending to [-health-] {+help+} spending is the
recipe for [-better health-] {+Better Health+} outcomes [-US-] {+us+}
ratio [-is fifty five cents-] {+was $0.55+} for every dollar [-it
helps me-] {+of houseman+} so this is we know this if you want better
health don't spend it on [-healthcare-] {+Healthcare+} spend it on
prevention spend it on those things that anticipate people's needs and
provide them the platform that they need to be able to pursue
[-opportunities-] {+opportunity+} the whole world is telling us that
[-yet-] {+yeah+} we're having the current debate that we're having
right at this moment in this country about [-healthcare-] {+Healthcare
there's+} something wrong with our critical thinking [-so-] {+skills+}

Both are pretty good. Watson feels a little more on target, with getting axis/access right, and being more consistent on understanding when U.S. is supposed to be a proper noun. When Google decides to capitalize things seems pretty random, though that's really minor. From a content perspective both were good enough. But as I said previously, the per word timestamps on Watson still made it the winner for me.

Comparison: Watson vs. Sphinx

When I first tried to read the Sphinx transcript it felt so scrambled that I wasn't even going to bother with it. However, using wdiff was a bit enlightening:

one of the things that they [-it you've-] {+found that you+} probably
all seen [-all-] {+don't+} know that [-we're the-] {+with a+} big
spenders on health care [-so this is-] {+services+} per capita
spending of so called [-OECD countries-] {+all we see the country's+}
developed countries {+were+} around the world and whenever you put
[-U. S.-] {+us+} on the graphic with everybody else [-you have-] {+get
back+} to change the [-axis-] {+access+} to fit the [-U. S.-]
{+u. s.+} on [-with everybody else because-] {+the third best as+} we
spend twice as much as {+you would see+} the [-OECD-] average [-and-]
the basis on health care the result of all [-that spending-] {+let
spinning+} we don't [-get-] {+have+} a lot of bang for [-our Buck-]
{+but+} we should be up here [-we're-] {+were+} down [-there
%HESITATION-] {+and+} so we don't [-get a lot-] {+allow+} health [-for
all the-] {+problem+} money that we're spending we all know that
that's {+the+} most [-of us know that I'm-] {+was the bum+} it's
fairly well known what's not as well known is this these [-are-]
{+were+} two women [-when Cologne take-] {+one call wanted+} the other
one [-Elizabeth Bradley-] {+was with that way+} at [-Yale-] {+yale+}
and [-Harvard respectively who actually our health-] {+harvard
perspective we whack sheer hell+} services researchers who did an
analysis it took the per capita spending on health care which is in
the blue look at all [-OECD-] {+always see the+} countries [-but
then-] {+that it+} added to that [-per capita-] {+for capital+}
spending on social services [-and-] {+as+} social benefits and what
they found is that when you do that the [-U. S.-] {+u. s.+} is no
longer the big [-Spender-] {+spender+} were actually kind of smack dab
in the middle [-of-] the [-pack-] {+pact+} what they also found is
that spending on social services and benefits [-gets-] {+did+} you
better health so we literally [-have the-] {+heavy+} accent on the
wrong [-syllable-] {+so wobble+} and that red spending is our social
[-country-] {+contract+} so they found that [-in OECD countries-]
{+can only see the country's+} every two dollars spent on social
services as opposed to [-dollars to one ratio in-] {+know someone
shone+} social service [-spending to-] {+bennington+} health spending
is the recipe for better health outcomes [-US ratio is-] {+u. s. ray
shows+} fifty five cents for every dollar [-it helps me-] {+houseman+}
so this is we know this if you want better health don't spend [-it-]
on [-healthcare spend it-] {+health care spending+} on prevention
[-spend it-] {+expanded+} on those things that anticipate people's
needs and provide them the platform that they need to be able to
pursue [-opportunities-] {+opportunity+} the whole world is [-telling
us that-] {+telecast and+} yet we're having [-the current debate
that-] {+a good they did+} we're having right at this moment in this
country [-about healthcare-] {+but doctor there's+} something wrong
with our critical thinking [-so-] {+skills+}

There was a pretty interesting blog post a few months back comparing similar speech to text services. His analysis used raw misses to judge accuracy. While that's a very objective measure, language isn't binary. Language is the lossy compression of a set of thoughts/words/shapes/smells/pictures in our minds, sent over a shared audio channel, and reconstructed in real time in another mind. As such, language, and especially conversation, has checksums and redundancies built in.

The effort required to understand something isn't just about how many words are wrong, but which words they were, and what the alternative was. Axis vs. access you could probably have figured out. "Spending to" vs. "bennington" takes a lot more mental energy to work out; maybe you can reverse it, maybe not. "Harvard respectively who actually our health" (which isn't even quite right) vs. "harvard perspective we whack sheer hell" is so far off the deep end you aren't ever getting back.

So while its mathematical accuracy might not be much worse, the rabbit holes it takes you down pretty much scramble things beyond the point of no return. Which is unfortunate, as it would be great if there was an open solution in this space. But it does drive home the point that for good speech to text you not only need good algorithms, but tons of training data.

Playing with this more

I encapsulated all the code I used for this in a github project, some of it nicer than the rest. When it gets to signing up for accounts and setting up auth I'm pretty hand wavy, because there is enough documentation on those sites to do it.

Given the word level confidence and timestamps, I'm probably going to build something that makes an HTML transcript that's marked up reasonably with those. I do wonder if it would be easier to read if you knew which words it was mumbling through. I was actually a little surprised that Google doesn't expose that part of their API, as I remember the Google Voice UI exposing per word confidence levels graphically in the past.

I'd also love to know if there were ways to get Sphinx working a little better. As an open source guy, I'd love for there to be a good offline and open solution to this problem as well.

This is an ongoing exploration, so if you have any follow on thoughts or questions, please leave a comment. I would love to know better ways to do any of this.


James Bessen: "Learning by Doing: The Real Connection between Innovation, Wages, and Wealth"

Interesting video by the author of "Learning by Doing: The Real Connection between Innovation, Wages, and Wealth", which largely comes down to "it's complicated". Sometimes automation replaces jobs, but sometimes it increases jobs, especially when there was pent up demand.

ATMs actually increased the number of bank teller jobs, because they meant fewer people were needed per branch, and banks opened up new branches to meet pent up demand. It's also why manufacturing jobs are never coming back: we've met the demand on consumption, and most industries making goods are in the optimizing phase.

What's also really interesting is the idea that new skills are always undervalued, because there is no reliable basis to understand how valuable they are. The transition from typesetting to digital publishing was a huge skill shift, but was pretty stagnant on wages.

Fluidity of Language

Having a toddler definitely makes you realize how fluid our brains are for mapping and adapting to language changes. And, how quickly those language changes can become a dialect that breaks up understanding.

Things I never considered before being a parent: the difficulty level of your child's name for them to say. Because it turns out that if you give your child a name with both a W and an R, those are pretty late on the sound acquisition timeline. So they are not going to use their name as a token for themselves during early speech development, because they physically can't pronounce it.

So, my daughter latched on to the other token that was constantly being used in her direction: "you". When I first saw that emerging it was completely confusing until I figured out the logic of how she got there. The first week I even tried to stamp it out. But, you know what, language is organic.

So, we're living with pronoun inversion for the moment. After two weeks of it, my brain rewired to make it normal, and I don't miss a beat anymore. The only time I really realize it's a thing is when friends come over that haven't seen her in a while and she talks to them. And a "You have mama bear" is interpreted as gift giving instead of a statement of fact. And the misinterpretation brings scowls.

A lot hinges on a single word sometimes, along with the assumption that we are all using these tokens the same way. But even in normal adult interactions, we aren't. It gives me a finer appreciation of how, even if you think you understand people, you need to double check.

Microsoft's Inclusive Design Manual

From Microsoft's Inclusive Design Manual.

Microsoft's Inclusive Design website is pretty amazing. There is an overview manual, as well as exercises to help train yourself in inclusive design situations. However, even just reading the short overview manual gave me a few aha moments. It's worth the 30 minutes to give it a read through.

Kudos to Microsoft for both doing this work, and making it publicly available.

Over communicating

I once had a college class where the instructor was in the process of writing a text book for the class. So we were being taught out of photocopies of the draft textbook. It wasn't a very good class.

It wasn't that he wasn't a good writer. He was. The previous semester I'd had a great class using one of his textbooks. But it was taught by a different professor. There were some places where the text made a lot of sense to me, and some places where the different approach of the non-author professor made far more sense. With two points of view it's about synthesizing an understanding. If something from the book didn't really stick, something from in class might. And each aha moment made everything you'd read or heard before make a bit more sense.

I was reminded of this this morning, reading through some technical documentation that was light on content and heavy on references. It was written that way so as to not repeat itself (oh, this part is explained over here instead). And while that may make sense to the authors, it doesn't make for an easy on-ramp for people trying to learn.

It's fine to over communicate. It's good to say the same thing a few different ways over the course of a document, and even be repetitive at times. Because human brains aren't disk drives, we don't fully load everything into working memory, and then we're there. We pick up small bits of understanding in every pass. We slowly build a mental approximation of what's there. And people have different experiences that resonate with them, and that make more sense to them.

There isn't one true way to explain topics. So, when in doubt, over communicate.


In praise of systemd

There was definitely a lot of hate around systemd in the Linux community over the last few years as all the Linux distros moved over to it. Things change all the time, and I personally never understood this one. It wasn't like the random collection of just barely working shell scripts that may or may not have implemented "status" was really a technically superior solution.

Recently I've been plumbing more native systemd support into DevStack (OpenStack's development environment setup tool). To me, systemd is just another way to start services; it does a few things better at the cost of having to learn a new interface. But some huge wins in a developer context come from the logging system that comes with it, journald.

Systemd has its own internal journaling system, which can reflect out to syslog and friends. However, it's inherently much richer. Journal messages can include not only text, but arbitrary metadata. You can then query the journal with this arbitrary metadata.

A concrete example of this is the request-id in OpenStack. On any inbound REST API call a request-id is generated, then passed around between workers in that service. This lets you, theoretically, trace through multiple systems to see what is going on. In practice, you end up having to build some other system to collect and search these logs. But with some changes to OpenStack's logging subsystem to emit journal native messages, you can do the following:

journalctl REQUEST_ID=req-3e4fa169-80ec-4635-8a7e-b60c16eddbcb

And get the following:

That is the flow of creating a server across all the Nova processes.

That's the kind of analysis that most people added Elastic Search into their environments to get. And while this clearly doesn't replicate all the very cool things you can do with Elastic Search, it does give you a bunch of new tools as a developer.

Note: the enablement for everything above is still very much work in progress. The DevStack patches have landed, but are off by default. The oslo.log patch is in the proof of concept phase for review.

This idea of programs being able to send not just log messages, but other metadata around them, is really powerful. It means, for instance, you can always send the code function / line number where a message was emitted. It's not clogging up the log message itself, but if you want it later you can go ask for it. Or even if it is in the log message, by extracting in advance what the critical metadata is, you can give your consumers structured fields instead of relying on things like grok to parse the logs. Which simplifies ingest by things like Elastic Search.
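From python, for instance, the systemd journal bindings take arbitrary key/value fields alongside the message; a minimal sketch (the REQUEST_ID value is the one from the query above, everything else is illustrative):

from systemd import journal

# Extra uppercase fields ride along with the message and become
# queryable later with journalctl FIELD=value
journal.send("Creating server instance",
             REQUEST_ID="req-3e4fa169-80ec-4635-8a7e-b60c16eddbcb",
             CODE_FUNC="create_server")

Which is exactly the sort of field the journalctl query earlier pulls back out.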

There is a lot to explore here to provide a better developer and operator experience, which is only possible because we have systemd. I for one look forward to exploring the space.

Universal Design Problems

Credit: Amy Nguyen

A great slide came across twitter the other day, which rang really true after having a heated conversation with someone at the OpenStack PTG. They were convinced certain API behavior would not be confusing because the users would have carefully read all the API documentation and understood a set of caveats buried in there. They were also astonished by the idea that people (including those in the room) write software against APIs by skimming, smashing bits into a thing, getting one successful response, and shipping it.

The theme of the slide is really Empathy. You have to have empathy for your users. They know much less about your software than you do. And they have a different lived experience, so even the way they approach whatever you put out there might be radically different from what you expected.

Why Do Americans Refrigerate Their Eggs?

Americans love refrigeration, and eggs are high on the list of items we rush to get into the refrigerator after a trip to the grocery store. Meanwhile, our culinary compatriots in Europe, Asia and other parts of the world happily leave beautiful bowls of eggs on their kitchen counters.

So what gives?

Mostly, it’s about washing. In the U.S., egg producers with 3,000 or more laying hens must wash their eggs. Methods include using soap, enzymes or chlorine.

...

But — and here is the big piece of the puzzle — washing the eggs also cleans off a thin, protective cuticle devised by nature to keep bacteria from getting inside the egg in the first place. (The cuticle also helps keep moisture in the egg.)

With the cuticle gone, it is essential — and, in the United States, the law — that eggs stay chilled from the moment they are washed until you are ready to cook them. Japan also standardized a system of egg washing and refrigeration after a serious salmonella outbreak in the 1990s.

In Europe and Britain, the opposite is true. European Union regulations prohibit the washing of eggs. The idea is that preserving the protective cuticle is more important than washing the gunk off.

Source: ‘Why Do Americans Refrigerate Their Eggs?’ - The New York Times

It’s about 50 degrees warmer than normal near the North Pole, yet again - The Washington Post

Extreme temperature spikes such as this one have occurred multiple times in the past two winters, whereas they only previously occurred once or twice per decade in historical records according to research published in the journal Nature.

As Mashable science writer Andrew Freedman put it: “Something is very, very wrong with the Arctic climate.”

Source: It’s about 50 degrees warmer than normal near the North Pole, yet again - The Washington Post

As someone who follows the science, I definitely understand the difference between weather and climate. However, it takes climate change to create aberrations this extreme, this often.

This is all very real. It is unfortunate that many of our elected representatives don't agree with the science.