Tag Archives: google

Comparing Speech Recognition for Transcripts

I listen to a lot of podcasts. Often, months later, something about one I listened to really strikes a chord, enough that I want to share it with others through Facebook or my blog. I'd like to quote the relevant section, but also link to roughly where it was in the audio.

Listening back through one or more hours of podcast just to find the right 60 seconds and transcribe them is enough extra work that I often just don't share. But now that I've got access to the Watson Speech to Text service I decided to try to find out how effectively I could use software to solve this. And, just to get a sense of the world, compare the Watson engine with Google and CMU Sphinx.

Input Data

The input in question was a lecture from the Commonwealth Club of California - Zip Code, not Genetic Code: The California Endowment's 10 year, $1 Billion Initiative. There was a really interesting bit in there about spending and outcome comparisons between different countries that I wanted to quote. The Commonwealth Club makes all these files available as mp3, which none of the speech engines handle. Watson and Google can both do FLAC, and Sphinx needs a wav file. It also appears that all of the speech models are trained on the assumption of a 16 kHz sampling rate, so I needed to downsample the mp3 file and convert it. Fortunately, ffmpeg to the rescue.
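As a rough sketch of that conversion step, here is a thin Python wrapper around ffmpeg. The 16 kHz rate comes from the paragraph above; the mono downmix (`-ac 1`) and the function names are my assumptions for illustration, not necessarily what was originally run.

```python
import subprocess


def ffmpeg_command(src, dst, rate_hz=16000):
    """Build an ffmpeg command line that resamples src and writes dst.

    ffmpeg chooses the output format from the dst extension (.flac or .wav).
    """
    return [
        "ffmpeg",
        "-i", src,            # input file, e.g. the downloaded mp3
        "-ar", str(rate_hz),  # output sample rate in Hz
        "-ac", "1",           # assumption: downmix to a single channel
        dst,
    ]


def convert(src, dst, rate_hz=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_command(src, dst, rate_hz), check=True)
```

Calling `convert("lecture.mp3", "lecture.flac")` would produce the file for Watson and Google, and `convert("lecture.mp3", "lecture.wav")` the one for Sphinx; both file names are placeholders.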


The Watson Speech to Text API can work either over websocket streaming or with bulk HTTP. While I had some python code to use the websocket streaming for live transcription, I was consistently getting SSL errors after 30 - 90 seconds. A bit of googling hints that this might actually be bugs on the python side. So I reverted to the bulk HTTP upload interface using example code from the watson-developer-cloud python package. The script I used to do it is up on github.

The first 1000 minutes of transcription are free, so this is something you could reasonably do pretty regularly. After that it is $0.02 / minute for transcription.
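To make the pricing concrete, here is a small sketch of the arithmetic. It assumes a single free allowance of 1000 minutes and a flat rate afterwards; whether the allowance resets per billing period is not covered here, and the names are made up for the example.

```python
FREE_MINUTES = 1000      # free allowance quoted above
CENTS_PER_MINUTE = 2     # $0.02 per minute after the allowance


def transcription_cost_dollars(total_minutes):
    """Cost in dollars for total_minutes of audio under the assumed pricing."""
    billable = max(0, total_minutes - FREE_MINUTES)
    # Work in whole cents to avoid floating point drift, then convert.
    return billable * CENTS_PER_MINUTE / 100
```

Under these assumptions, about sixteen hour-long podcasts fit inside the free allowance.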

When doing this over the bulk interface things are just going to seem to have "hung" for about 30 minutes, but it will eventually return data. Watson seems like it's operating no faster than 2x real time for processing audio data. The bulk processing time surprised me, but then I realized that with the general focus on real time processing most speech recognition systems just need to be faster than real time, and optimizing past that has very diminishing returns, especially if there is an accuracy trade off in the process.

The returned raw data is highly verbose, and has the advantage of per word timestamps, which makes finding passages in the audio really convenient.
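Here is a minimal sketch of how those per word timestamps can be used to locate a phrase. It assumes the response has been loaded into a dict shaped like `{"results": [{"alternatives": [{"timestamps": [[word, start, end], ...]}]}]}`; treat that layout, and the function names, as assumptions to check against your own output.

```python
def find_phrase(response, phrase):
    """Return the start time in seconds of the first match of phrase, or None."""
    target = phrase.lower().split()
    if not target:
        return None

    # Flatten every result into one list of (word, start_seconds) pairs.
    words = []
    for result in response.get("results", []):
        best = result["alternatives"][0]
        for word, start, _end in best.get("timestamps", []):
            words.append((word.lower(), start))

    # Slide a window the size of the phrase across the word list.
    for i in range(len(words) - len(target) + 1):
        window = [w for w, _ in words[i:i + len(target)]]
        if window == target:
            return words[i][1]
    return None


def as_clock(seconds):
    """Format a number of seconds as minutes:seconds, e.g. for a player link."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}:{secs:05.2f}"
```

The match is exact on whole words, so it only works when you search for the words as the engine heard them, not as they were spoken.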

So 30 minutes in I had my answer.


I was curious to also see what the Google experience was like, which I originally did through their API console quite nicely. Google is clearly more focused on short bits of audio. There are 3 interfaces: sync, async, and streaming. Only async allows for greater than 60 seconds of audio.

In the async model you have to upload your content to Google Storage first, then reference it as a gs:// url. That's all fine, and the Google storage interface is stable and well documented, but it is an extra step in the process. Especially for content I'm only going to have to care about once.

Things did get a little tricky translating my console experience to python... 3 different examples listed in the official documentation (and code comments) were wrong. The official SDK no longer seems to implement long_running_recognize on anything except the grpc interface. And the google auth system doesn't play great with python virtualenvs, because it's python code that needs a custom path, but it's not packaged on pypi. So you need to venv, then manually add more paths to your env, then gauth login. It's all doable, but it definitely felt clunky.

I did eventually work through all of these, and have a working example up on github.

The returned format looks pretty similar to the Watson structure (there are only so many ways to skin this cat), though a lot more compact, as there aren't per word confidence levels or per word timings.

For my particular problem that makes Google less useful, because the best I can do is dump all the text to the file, search for my phrase, see that it's 44% of the way through the file, and jump to around there in the audio. It's all doable, just not quite as nice.
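That fallback can be sketched in a few lines. It assumes the speaking rate is roughly constant, so a position in the text maps linearly to a position in the audio; the function name is made up for the example.

```python
def estimate_offset_seconds(transcript, phrase, audio_seconds):
    """Guess where phrase occurs in the audio from its position in the text.

    Returns None when the phrase is not in the transcript.
    """
    position = transcript.find(phrase)
    if position < 0:
        return None
    fraction = position / len(transcript)  # e.g. 0.44 for 44% of the way in
    return fraction * audio_seconds
```

The result is only a starting point for scrubbing around in the audio, since pauses, applause and speaker changes all break the linear assumption.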

CMU Sphinx

Being on Linux it made sense to try out CMU Sphinx as well, which took some googling on how to do it.

Then run it with the following:

Sphinx prints out a ton of debug output on stderr, which you want to get out of the way, then the transcription should be sent to a file. Like with Watson, it's really going only a bit faster than real time, so this is going to take a minute.

Converting JSON to snippets

To try to compare results I needed to start with comparable formats. I had 2 JSON blobs, and one giant text dump. A little jq magic can extract all the text:
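For readers without jq, here is a Python counterpart as a sketch. It assumes both JSON blobs share the layout `{"results": [{"alternatives": [{"transcript": ...}]}]}` and takes the first alternative of each result; that shared layout is my assumption, so check it against your own files.

```python
import json


def extract_transcript(response):
    """Join the top transcript of every result into one block of text."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)


def extract_transcript_from_file(path):
    """Load a saved JSON response from disk and return its plain text."""
    with open(path, encoding="utf-8") as handle:
        return extract_transcript(json.load(handle))
```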

Comparison: Watson vs. Google

For the purpose of comparisons, I dug out the chunk that I was expecting to quote, which shows up about half way through the podcast, at second 1494.98 (24:54.98) according to Watson.

The best way I could think to compare all of these is start / end at the same place, word wrap the texts, and then use wdiff to compare them. Here is watson (-) vs. google (+) for this passage:

one of the things that they [-it you've-] probably all [-seen all-]
{+seem you'll+} know that [-we're the big spenders-] {+where The Big
Spenders+} on [-health care-] {+Healthcare+} so this is per capita
spending of [-so called OECD-] {+so-called oecd+} countries developed
countries around the world and whenever you put [-U. S.-] {+us+} on
the graphic with everybody else you have to change the [-axis-]
{+access+} to fit the [-U. S.-] {+US+} on with everybody else
[-because-] {+cuz+} we spend twice as much as {+he always see+} the
[-OECD-] average [-and-] {+on+} the basis on [-health care-]
{+Healthcare+} the result of all that spending we don't get a lot of
bang for our [-Buck-] {+buck+} we should be up here [-we're-] {+or+}
down there [-%HESITATION-] so we don't get a lot [-health-] {+of
Health+} for all the money that we're spending we all know that that's
most of us know that [-I'm-] it's fairly well [-known-] {+know+}
what's not as [-well known-] {+well-known+} is this these are two
women [-when Cologne take-] {+one killoran+} the other one Elizabeth
Bradley at Yale and Harvard respectively who actually [-our health
services-] {+are Health Services+} researchers who did an analysis
[-it-] {+that+} took the per capita spending on health care which is
in the blue look at [-all OECD-] {+Alloa CD+} countries but then added
to that per capita spending on social services and social benefits and
what they found is that when you do that [-the U. S.-] {+to us+} is no
longer the big [-Spender were-] {+spender or+} actually kind of smack
dab in the middle of the pack what they also found is that spending on
social services and benefits [-gets you better health-] {+Gets You
Better Health+} so we literally have the accent on the wrong syllable
and that red spending is our social [-country-] {+contract+} so they
found that in [-OECD-] {+OCD+} countries every [-two dollars-] {+$2+}
spent on [-social services-] {+Social Services+} as [-opposed to
dollars-] {+a post $2+} to [-one-] {+1+} ratio [-in social service-]
{+and Social Service+} spending to [-health-] {+help+} spending is the
recipe for [-better health-] {+Better Health+} outcomes [-US-] {+us+}
ratio [-is fifty five cents-] {+was $0.55+} for every dollar [-it
helps me-] {+of houseman+} so this is we know this if you want better
health don't spend it on [-healthcare-] {+Healthcare+} spend it on
prevention spend it on those things that anticipate people's needs and
provide them the platform that they need to be able to pursue
[-opportunities-] {+opportunity+} the whole world is telling us that
[-yet-] {+yeah+} we're having the current debate that we're having
right at this moment in this country about [-healthcare-] {+Healthcare
there's+} something wrong with our critical thinking [-so-] {+skills+}

Both are pretty good. Watson feels a little more on target, with getting axis/access right, and being more consistent on understanding when U.S. is supposed to be a proper noun. When Google decides to capitalize things seems pretty random, though that's really minor. From a content perspective both were good enough. But as I said previously, the per word timestamps on Watson still made it the winner for me.
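If wdiff isn't available, the same style of markup can be approximated with Python's difflib. This is a sketch rather than a reimplementation of wdiff: it does no word wrapping, and its alignment may differ from wdiff's on longer passages.

```python
import difflib


def word_diff(old, new):
    """Mark words only in old as [-...-] and words only in new as {+...+}."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)
```

For example, `word_diff("change the axis to fit", "change the access to fit")` returns `change the [-axis-] {+access+} to fit`.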

Comparison: Watson vs. Sphinx

When I first tried to read the Sphinx transcript it felt so scrambled that I wasn't even going to bother with it. However, using wdiff was a bit enlightening:

one of the things that they [-it you've-] {+found that you+} probably
all seen [-all-] {+don't+} know that [-we're the-] {+with a+} big
spenders on health care [-so this is-] {+services+} per capita
spending of so called [-OECD countries-] {+all we see the country's+}
developed countries {+were+} around the world and whenever you put
[-U. S.-] {+us+} on the graphic with everybody else [-you have-] {+get
back+} to change the [-axis-] {+access+} to fit the [-U. S.-]
{+u. s.+} on [-with everybody else because-] {+the third best as+} we
spend twice as much as {+you would see+} the [-OECD-] average [-and-]
the basis on health care the result of all [-that spending-] {+let
spinning+} we don't [-get-] {+have+} a lot of bang for [-our Buck-]
{+but+} we should be up here [-we're-] {+were+} down [-there
%HESITATION-] {+and+} so we don't [-get a lot-] {+allow+} health [-for
all the-] {+problem+} money that we're spending we all know that
that's {+the+} most [-of us know that I'm-] {+was the bum+} it's
fairly well known what's not as well known is this these [-are-]
{+were+} two women [-when Cologne take-] {+one call wanted+} the other
one [-Elizabeth Bradley-] {+was with that way+} at [-Yale-] {+yale+}
and [-Harvard respectively who actually our health-] {+harvard
perspective we whack sheer hell+} services researchers who did an
analysis it took the per capita spending on health care which is in
the blue look at all [-OECD-] {+always see the+} countries [-but
then-] {+that it+} added to that [-per capita-] {+for capital+}
spending on social services [-and-] {+as+} social benefits and what
they found is that when you do that the [-U. S.-] {+u. s.+} is no
longer the big [-Spender-] {+spender+} were actually kind of smack dab
in the middle [-of-] the [-pack-] {+pact+} what they also found is
that spending on social services and benefits [-gets-] {+did+} you
better health so we literally [-have the-] {+heavy+} accent on the
wrong [-syllable-] {+so wobble+} and that red spending is our social
[-country-] {+contract+} so they found that [-in OECD countries-]
{+can only see the country's+} every two dollars spent on social
services as opposed to [-dollars to one ratio in-] {+know someone
shone+} social service [-spending to-] {+bennington+} health spending
is the recipe for better health outcomes [-US ratio is-] {+u. s. ray
shows+} fifty five cents for every dollar [-it helps me-] {+houseman+}
so this is we know this if you want better health don't spend [-it-]
on [-healthcare spend it-] {+health care spending+} on prevention
[-spend it-] {+expanded+} on those things that anticipate people's
needs and provide them the platform that they need to be able to
pursue [-opportunities-] {+opportunity+} the whole world is [-telling
us that-] {+telecast and+} yet we're having [-the current debate
that-] {+a good they did+} we're having right at this moment in this
country [-about healthcare-] {+but doctor there's+} something wrong
with our critical thinking [-so-] {+skills+}

There was a pretty interesting blog post a few months back comparing similar Speech to Text services. His analysis used raw misses to judge accuracy. While that's a very objective measure, language isn't binary. Language is the lossy compression of a set of thoughts/words/shapes/smells/pictures in our mind, sent over a shared audio channel, which another mind then attempts to reconstruct in real time. As such language, and especially conversation, has checksums and redundancies.

The effort required to understand something isn't just about how many words are wrong, but what words they were, and what the alternative was. Axis vs. access, you could probably have figured out. "Spending to" vs. "bennington", takes a lot more mental energy to work out, maybe you can reverse it. "Harvard respectively who actually our health" (which isn't even quite right) vs. "harvard perspective we whack sheer hell" is so far off the deep end you aren't ever getting back.

So while its mathematical accuracy might not be much worse, the rabbit holes it takes you down pretty much scramble things beyond the point of no return. Which is unfortunate, as it would be great if there was an open solution in this space. But it does get to the point that for good speech to text you not only need good algorithms, but tons of training data.
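To make the "raw misses" idea concrete, here is a sketch of word error rate, the usual count-based accuracy measure. It is my stand-in for whatever metric that post actually used. It shows the limitation described above: every wrong word costs the same, whether it is axis/access or something unrecoverable.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A score built this way says nothing about how hard each error is to read past, which is the gap the wdiff view exposes.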

Playing with this more

I encapsulated all the code I used for this in a github project, some of it nicer than others. When it gets to signing up for accounts and setting up auth I'm pretty hand wavy, because there is enough documentation on those sites to do it.

Given the word level confidence and timestamps, I'm probably going to build something that makes an HTML transcript that's marked up reasonably with those. I do wonder if it would be easier to read if you knew which words it was mumbling through. I was actually a little surprised that Google doesn't expose that part of their API, as I remember the Google Voice UI exposing per word confidence levels graphically in the past.
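A first cut of that idea might look like the sketch below. It assumes the per word confidence arrives as `[word, confidence]` pairs; the 0.5 threshold and the `unsure` CSS class are placeholders I picked for illustration.

```python
import html


def render_transcript(word_confidence, threshold=0.5):
    """Render words as one HTML paragraph, wrapping low confidence words."""
    spans = []
    for word, confidence in word_confidence:
        text = html.escape(word)
        if confidence < threshold:
            # The title attribute shows the score on hover.
            spans.append(
                f'<span class="unsure" title="{confidence:.2f}">{text}</span>'
            )
        else:
            spans.append(text)
    return "<p>" + " ".join(spans) + "</p>"
```

A stylesheet rule for `.unsure` (grey text, say) would then make the mumbled stretches visible at a glance.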

I'd also love to know if there were ways to get Sphinx working a little better. As an open source guy, I'd love for there to be a good offline and open solution to this problem as well.

This is an ongoing exploration, so if you have any follow on thoughts or questions, please leave a comment. I would love to know better ways to do any of this.



Moving off GMail

In early December I finally decided it was time to move my primary email out of google. There were a few reasons to do it, though the practical (reaching the limits on their filtering) largely outweighed the ideological.

Movable Email

If email is important to you, you should really register your own domain name, so you have a permanent address. I got dague.net back in 1999 to create a permanent home for my identity. This has meant that over the years the backend for dague.net has changed at least 5 times, including me hosting it myself for a large number of years.

My Requirements

  • Can host email on my own domain - as I'd be moving dague.net
  • Web UI - because sometimes I want to access my email via Chromebook
  • Good Search - because there are times I fall back to full text search to find things
  • IMAP - because most of the time I'll be accessing via Thunderbird or Kaiten Mail
  • Good spam filtering
  • Good generic filtering on the server side - My daily mail volume is north of 1000 messages (40% spam), I need good filtering otherwise I drown


I eventually landed on Fastmail.fm, who I've been watching for a lot of years. They are fairly priced ($40/yr), their company contributes back to the open source software they run their business on, and because they are actually an Australian company, you'll get disclosure if some agency is accessing your accounts. They also give you a 60 day free trial, so you can do a slow migration over, and see if it will meet your needs.


Once I was sure I was going to do this, I created my fastmail.fm account, and then pulled and configured imapsync to sync my existing gmail content over. I have a couple of GB of email, which means an imap sync takes a good 24 hours at this point. Imapsync is rerunnable, so run it once, wait until it finishes, then run it a second time, and pick up the changes. Once it seems like you've basically closed the gaps between the two accounts, you can change MX records, and start getting email at the new service provider.

For safety the first thing I do once this has happened is build a forward rule from the new provider to the old one. Then if something goes horribly wrong, all my email remains in both locations for a while. A month later I'm still running that forward, though will be disconnecting it soon.

So far so good

The webmail for fastmail is really solid. Honestly, I like it better than gmail's web ui, which has become incredibly cluttered over the years. This is just email, which is good. It also has a search facility which is on par with google's. Search is also available as part of the IMAP protocol, which means real searching from Kaiten Mail on Android. Switching from the GMail App to Kaiten Mail on my phone took about 10 minutes. And it means I can actually customize the things I get alerted to, which gmail broke at some point. The Thunderbird transition was simple.

I had gotten used to Rapportive on gmail, which would show me people's pictures next to their email. I found the Ldap Info Show extension for Thunderbird, which looks people up on various social networks, and gives you the pictures they have made public.

Lacking APIs

The one complaint I have with fastmail is that it's lacking APIs to handle your data. For instance, my filtering rules are complex: 342 lines of sieve and counting at this point. This is managed via a web form, but copy / paste on every change is something I'm not really into. I solved this by writing a python mechanize sync script so I can manage the rules locally, version controlled, then sync them up afterwards.

The address book has some issues, and I've not built a workaround. The sieve rules they give you whitelist everyone in your address book so their mail is never treated as spam, so it's something I'd like to keep in sync. However, without an API it's not really worth it.

Overall: Very Good

Overall I'm very happy with the move. My biggest complaints are around the API issue, which I hope they correct in the future.

Google Goggles Magic

Last night a friend complained about a curry recipe gone wrong, so I decided to offer up the one I used to make with a certain amount of frequency. It's from a 1970s Time Life cookbook that I vaguely remember swiping from my friend Jehan in college. I took a picture on my cell phone to send it along.

Chicken Curry Recipe


The page is sufficiently stained with turmeric to show how often it was made.

A little while later I noticed a Goggles Alert on my cell phone, it had scanned the image, and returned the following URL as a hit: http://littlechefapp.com/recipes/144571-chicken-curry-authentic#.UOGjR2JQCoM

Dead on. The future is pretty awesome sometimes.

The Nexus program - lead by example

2 years ago Google created the Nexus program for their Android phones. Vendors were off screwing around with things like dual screen android phones, 3D android phones, all the weird gimmicks that marketing people loved, and actual people hated (especially on a 2 year contract). The point of Nexus was to lead by example, and get out ahead of the vendors enough that they'd realize there was a better way to do this, and stop screwing around on egregious differentiation that had no real value.

I consider the Samsung S3 to be the natural child of the Nexus program. A non-Nexus device from Samsung which is gorgeous, wonderful to use, and has only minimal tweaks off the base UI. My wife and I just flipped to Verizon and got a pair of these last week, and my love for this device only gets stronger by the day.

I really think of Google's new fiber project as a Nexus program. Google is demonstrating that there is a better way to do broadband, and that the economics are there for fiber to the home if you look at it systematically. A rethinking of how to roll out a network. I'm pretty sure Google doesn't actually want to be an ISP, any more than they want to be a phone manufacturer. But by leading by example, they are going to change the nature of home connectivity.

Copyright in APIs

The Jury in the Oracle vs. Google case has decided that Google violated Oracle's copyright in implementing the Java APIs. Now, that's actually not too bad of news, because the Judge in the case told the jury to "assume APIs are copyrightable for this decision" but that he would eventually decide that independently. Given that the EU just ruled they are not, I'm hoping the judge in this case comes to the same conclusion.

If APIs are ruled copyrightable, this would break all kinds of interoperability that we take for granted today. As always, groklaw has the best coverage of this legal action.

What's your Google footprint

Last night, after the Drupal Meetup, we were having many interesting conversations at the bar. One started as a question about why I did so much open source activity. There are a lot of answers there, though mostly at this point open source is just in my DNA. If I do something, I open it, because that's what I now do.

But I posed a return question for everyone to think about: what is your Google footprint? If you search for your name, what comes up? How much of that is you?

"Sean Dague" in google returns: About 652,000 results (0.14 seconds). I am sure 99% of that is me.

Page 1 is (in order): my blog, my twitter account, my LinkedIn profile, my directory entry in android market, a comment I wrote on greenmonk blog, my github, my old (long dead) blog, my quora account, my meetup profile, my CPAN account. Some of that is current and used, some of it isn't so current, but because google ranks those communities as important, it bubbles up.

If you start going through pages you'll see contributions to projects I've done, bugs filed, mailing list posts, presentations at conferences, retreads of my listings in twitter and android market on 3rd party sites. A public life on the internet that dates back to about 2001 (there may be earlier stuff, but that's when I started being consciously active in the open source world).

I can live with that, it's a reasonable picture of who I am, that future friends, associates, employers can all use and see for background. The amount of content I put out on my blog means that it will remain hit one for my name. It also means that I'm always on the front page of "Dague" in google as well. Having an uncommon name is actually an incredible boon in the 21st century if you want to build a reputation. Something that I hated as a kid, is something I'm very pleased about now.

Much like your credit score, your google footprint isn't ever completely in your control. But you can be very deliberate about putting out content, in code, comments, emails, blog posts, public social network artifacts, which will shape that footprint to be some representation of you.

Take a minute today and look at your Google footprint, and see what picture the internet gets of you.

I'd love to hear stories, challenges, or completely new ideas in comments, so please post. And just think, that will also add to and shape your Google footprint.

Tip of the Day: Google Maps styling

Tip of the day: If you are trying to embed Google Maps in a website, and things look horribly wrong, make sure that the following css style rule exists for your map div:

Google maps makes really interesting abuse of width for its layout. I've got img max-width at 98% for the rest of the site so that images scale down correctly in the responsive design, but for google maps, that just causes chaos.

Is Google+ just another Chrome?

I've been really frustrated with Google+ slowly consuming all the rest of Google services, because I find it so deficient compared to Twitter, and even Facebook. My long form content lives here, on my own server, in my own blog. Both Twitter and Facebook make it easy to also have that content live a life in their platform.

Google+... not so much. We're more than 6 months after launch, and still no API besides scraping public posts. As such, I spend little time over there, and largely disdain the system, which doesn't lose much, because there are so few people generating content there anyway. With the launch of their "Google+ your world" search yesterday, I was even more frustrated. G+, still with no API, is now infiltrating the search rankings. Grrrr.

But this morning, I read this, and it occurred to me, what if G+ is another Chrome. By that I mean a project that isn't meant to be a market leader by itself, but one that's meant to shape a market to keep it fluid. Twitter and Facebook have a pretty epic duopoly on content right now, and they are both working to make it harder to consume outside of their bubble. This summer they both quietly killed RSS feeds off. You can still consume via their API, but even in that front Twitter's been waging a bit of a war on their API consumers, retaking the Mobile UI.

So maybe G+ was really a reaction to a trend Google was seeing, that the gated communities were throwing up more and more restrictions to making their content searchable in Google. Instead of bringing lawyers, bring technology. Make a competitor that is searchable, and get the gated communities to now really want to be included in the results. Make the market fluid again.

Maybe. I'm not sure I've even convinced myself of this. But it would explain some of the areas of focus in G+. It would also explain why the public posts API is the only one they've released so far. At the end of the day, the social giant fight matters little to me, as long as I can syndicate into them, which is why the lack of a G+ write API (and associated WordPress plugin) is my biggest concern. So while this softens my feelings on G+ a little, I really do wish they'd actually make the platform way more open. Then I might feel it was worth investing in content and discussions there. Until then, you can find my quick bits over on Twitter, and the long form ideas here, with Disqus, which makes it really easy to comment or converse outside the duopoly bubble.

Steve Yegge and the Google Platform issue

Steve Yegge is one of the most insightful people on the internet. I was really bummed when he stopped blogging, because his posts were always well thought out, funny, and really got to the heart of some key issues in software development.

Last night he posted publicly, by accident, about Google's current biggest issue: a complete lack of a platform. He's really dead on.

I hope Google internalizes that post and does something about it.

Small Wins

Really interesting post about Google Wave from the inside. My favorite passage is this:

And this is the essential broader point--as a programmer you must have a series of wins, every single day. It is the Deus Ex Machina of hacker success. It is what makes you eager for the next feature, and the next after that. And a large team is poison to small wins. The nature of large teams is such that even when you do have wins, they come after long, tiresome and disproportionately many hurdles. And this takes all the wind out of them.

That matches up quite well with my experience. A series of small wins keeps the team momentum running strong. Nothing breeds success like success.