Building Thumbnails for PDFs

Thursday, June 18th, 2009

<Note: thanks to Buffy Miller for giving me a much slicker solution here>

Often I want to post a PDF on a web page, but would like the cover of the PDF to be the clickable element.  This just makes things look slicker.  This can easily be done with the ImageMagick set of commands.  Using ImageMagick, you can select the page you want to convert out of the PDF (using the [] notation, where [0] is the first page), get the right size (using -resize) and format (based on the file extension).  The following command does the conversion in a single go:

convert -resize 150 inputfile.pdf[0] outputfile.png
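If you want to call this from a script rather than typing it by hand, a minimal Python sketch could look like the following.  The function names (thumbnail_command, make_thumbnail) are made up for illustration, and it assumes the convert binary is on your PATH (newer ImageMagick installs may call it magick instead).

```python
import subprocess


def thumbnail_command(pdf_path, out_path, page=0, width=150):
    """Build the ImageMagick argument list for a thumbnail of one PDF page."""
    # "file.pdf[0]" selects the first page; page numbers start at zero.
    # The output format is picked from the extension of out_path.
    return ["convert", "-resize", str(width), f"{pdf_path}[{page}]", out_path]


def make_thumbnail(pdf_path, out_path, page=0, width=150):
    """Run the conversion; raises CalledProcessError if convert fails."""
    # Passing a list (no shell) means the [] is never treated as a glob.
    subprocess.run(thumbnail_command(pdf_path, out_path, page, width), check=True)
```

Calling make_thumbnail("inputfile.pdf", "outputfile.png") should run the same command as above.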

One of these days I’ll get around to making a MediaWiki extension that does that by default for PDF media attachments.

The Power of Perl: Converting an A4 PDF to Letter with Margins

Friday, June 12th, 2009

Because I’ve moved into the more elegant waters of Ruby and Mono, I sometimes forget just how powerful Perl can be.  Sometimes 8 lines of Perl is all you need to solve a problem.

The Problem

So, I’ve gotten back into Blood Bowl with my friend Pyg, but that’s mostly for another blog post.  As we’ve been relearning the rules, Pyg found a much better consolidated rule set on the web as a PDF, in A4, as it was created by Brits.  A4 is the far more logical way to make paper that’s roughly normal page size.  However, it is slightly longer and slightly narrower than our Letter paper standard.

The real challenge here was that I wanted to create something that was bindable at our local Office Max. “Printing” the A4 PDF to a Letter PDF in evince gave me something with about 3/4 inch of white margin on the right side of every page.  If I could get that to alternate between the right and left sides of the page, then I’d be golden.

As you can see, left as is, binding double sided would both look silly, and actually punch through some of the text on the right side pages.

The Solution – Hack the PDF

PDF is just a document standard.  That means a lot of it is in plain text, for a gracious definition of that word.  I opened the file up in emacs and started searching for words that might represent this.  Eventually I found the following snippet in the PDF:

<< /Type /Page
   /Parent 1 0 R
   /MediaBox [ 0 0 611.999983 791.999983 ]
   /Contents 196 0 R
   /Group <<
      /Type /Group
      /S /Transparency
      /CS /DeviceRGB
   >>
   /Resources 195 0 R
>>

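This is part of the Page definition, and what’s important is that MediaBox entry.  The 4 numbers there are the X and Y of the lower left corner followed by the X and Y of the upper right corner, in points; with the lower left corner at 0 0 the last two double as the width and height of the content.  After some experimentation I determined that the values I needed for “right side pages” were: -52 0 559.999983 791.999983.  That keeps the page the same size but slides it 52 points (a bit under 3/4 inch) to the left, which moves the white margin to the other side.  I need to set every other page (of 84 pages) to that.  There are MediaBox definitions that have nothing to do with Page, so I can’t just look for them.  It has to be a MediaBox in that Page definition.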
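To sanity check those numbers, here is a small Python sketch of the arithmetic.  Strictly speaking the PDF format defines MediaBox as two corners (lower left, upper right) in units of 1/72 inch, so moving the margin is just a matter of sliding both X values by the same amount.  The helper name shift_media_box is my own, not anything from a PDF library.

```python
POINTS_PER_INCH = 72  # default PDF user space: 72 units per inch


def shift_media_box(box, dx):
    """Slide a MediaBox [llx, lly, urx, ury] horizontally by dx points."""
    llx, lly, urx, ury = box
    return [llx + dx, lly, urx + dx, ury]


letter = [0, 0, 611.999983, 791.999983]   # the box found in the PDF
shifted = shift_media_box(letter, -52)    # the "right side page" box

# The page keeps its width; only its position changes.
width_before = letter[2] - letter[0]
width_after = shifted[2] - shifted[0]
margin_inches = 52 / POINTS_PER_INCH      # roughly 0.72 inch
```

The shifted box comes out as -52 0 559.999983 791.999983 (give or take floating point noise), and 52 points is about 0.72 inch, which matches the roughly 3/4 inch margin that needed to move.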

Changing your line break

Perl has a lot of operations that are line oriented, so you get 1 line at a time.  But one of the greatest powers of Perl is that it’s really easy to change what it considers a line break.  This is done with the line ending special variable $/.  A common trick is to $/ = undef; which means the first read of a file will read the entire thing into a string.  For this problem I decided that if I made << my separator, I’d get the Page definition and MediaBox on the same line, making life much easier.  But enough of the details, here is the code:

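#!/usr/bin/perl

use strict;
use warnings;

# Treat '<<' (the PDF dictionary opener) as the "line" ending, so each
# read returns one chunk holding /Type /Page together with its MediaBox.
local $/ = '<<';

my $count = 0;

while (<>) {
    # \b keeps the /Type /Pages tree node from being counted as a page
    if ($_ =~ m{/Type /Page\b}) {
        if (($count % 2) == 0) {
            # every other page gets the shifted box
            $_ =~ s{MediaBox \[.*?\]}{MediaBox [ -52 0 559.999983 791.999983 ]}gs;
        } else {
            # the rest keep the normal Letter box
            $_ =~ s{MediaBox \[.*?\]}{MediaBox [ 0 0 611.999983 791.999983 ]}gs;
        }
        $count++;
    }
    print $_;
}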

This reads the original PDF from standard in, and writes the modified one to standard out.  Because I could change the line separator I don’t have to keep track of whether I’ve seen a Page separator, and whether I’m still in that block (i.e. proper formal parsing).  That lets me do it in 2 matches.
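For anyone who doesn’t speak Perl, the same chunking trick can be sketched in Python by splitting the raw bytes on << instead of changing a line separator.  This is only an illustration of the idea, not the script I actually ran; the function name alternate_margins is made up, and it works on bytes because PDFs contain binary streams.

```python
import re

PAGE = re.compile(rb"/Type /Page\b")          # \b skips the /Type /Pages node
MEDIABOX = re.compile(rb"MediaBox \[.*?\]", re.S)
SHIFTED = b"MediaBox [ -52 0 559.999983 791.999983 ]"
NORMAL = b"MediaBox [ 0 0 611.999983 791.999983 ]"


def alternate_margins(pdf_bytes):
    """Give every other Page chunk the shifted MediaBox."""
    count = 0
    chunks = []
    # Splitting on b"<<" plays the role of $/ = '<<': each chunk that
    # mentions /Type /Page also carries that page's MediaBox.
    for chunk in pdf_bytes.split(b"<<"):
        if PAGE.search(chunk):
            box = SHIFTED if count % 2 == 0 else NORMAL
            chunk = MEDIABOX.sub(box, chunk)
            count += 1
        chunks.append(chunk)
    # join() puts back the separators that split() removed
    return b"<<".join(chunks)
```

To use it as a filter you would read sys.stdin.buffer, pass the bytes through alternate_margins, and write the result to sys.stdout.buffer.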

The results are just what you would like, and made for some nice printouts for binding.

Sometimes you just need to roll up your sleeves and bang out some perl code. :)

Software in the era of drive by contribution

Tuesday, April 28th, 2009

I love git.  I’ll state that up front.  I also love github, which I’ve expressed in the past.  Both are making me look at software in a new way.  I also think the pair of them are changing some of the rules we know for how open source projects emerge and move forward.

Recently I was working on building a Rails based Event Calendar for MHVLUG.  This gave me a chance to dig in on ical, which has fascinated me since a set of talks at YAPC a decade ago.  There were 2 ruby ical libraries out there (icalendar.rb and vpim.rb), neither did quite what I wanted, and both projects were more or less dormant (the mailing lists were lots of “is anyone alive?” posts).  Ug, I was stuck, and if I had to start from scratch on ical, that was all I’d end up doing, never getting to my application.

I googled some more… and lo and behold found a github.com fork of icalendar.rb, and forks of that.  Those forks implemented about 50% of the fixes I needed to get ical generation with timezones to work.  So I forked from one of those and 6 changesets later, had what I needed.  I then built my application, and life was good.

A few days later I decided to collect up all the changes in all the github icalendar trees, and merge them into my tree.  While git itself can be somewhat confusing, github adds this really slick web interface on top of git trees, that makes the merge process pretty painless.  This is one of their key innovations, and it’s just incredible.  I selected all the outstanding changes that would merge cleanly, pulled them in, and now had a tree which largely encompassed the 8 existing forks on github.com.  I posted back to the dead mailing list and let people know there was this now living github tree where the project had seemed dead.  I got a couple of new patches people wanted in, and 2 months later the maintainer actually showed up again and gave me admin access to the icalendar project so I could publish official versions.

This pattern repeated a few more times on the project.  I found a piece of code on github that did 90% of what I needed, but I needed a change.  I created my fork, added my feature, and pushed it back out (with a pull request).  A few days later the maintainer pulled them back in, and now they are officially part of the project.  I’m not vested in those projects, but I had relevant fixes, and because we were all using a tool that makes it easy to be a casual contributor, they are now part of the open source projects in the sky.

Casual Contributions

If you haven’t seen the paper on participation inequality, go and read it… now!  Previously most of the studies on open source community participation focussed on big projects like the Linux Kernel, or Apache.  That’s sort of like trying to understand patterns of home construction by looking at Frank Lloyd Wright’s houses.  Those projects are outliers in how communities work.  This study did a much broader look at online communities and found the striking 1-9-90 pattern:

This is how communities work.  1% of the population does most of the work, 9% are casual contributors, and 90% are just consumers.  Your user base is a silent majority.  In an open source world the 1% are the core contributors, and possibly the heavy power users.  The 9% are the people that file a bug now and then, maybe a patch or two; everyone else just downloads your code and you never hear from them.  This pattern more or less holds true for all volunteer efforts.

In open source we’ve got an issue, which is that getting code from the 9% is hard.  The 1% typically has access to a central source management repository, and can merge code fixes as soon as they see them.  The 9% has to follow a completely different process, posting patches to trackers or mailing lists, many of which get lost because there are a bunch more manual steps to pull them into the main tree.  If any process requires more effort by the 1%, it typically won’t happen, they are full up on time as it is.

And this is where git and github start making things interesting.  While I run a number of open source efforts, I end up in the 9% all the time.  If you are now using git for your main tree the 9% and the 1% are now using the same tools, which allow seamless inclusion of code.  The merge algorithm on git is really wondrous.  I’ve had instances of massive renaming of files while trying to integrate external fixes in those files, and everything just worked.  It actually surprised the hell out of me.

The 9% just want to casually contribute something; they aren’t signing up for a lifestyle.  Get my fix out there, if other people want it, great, if not, so be it.  The fact that integration is 2 mouse clicks and 10 seconds of effort makes the chance of capturing those changes much more likely.

Recovering the Brown Field

Ever look at sourceforge.net?  or any of the clones?  50% of those projects never got off the ground.  Another 40% have died out for other reasons, the contributors: had a family, started working for a company that doesn’t let them work in OSS, got bored with the project, died, or became inactive for any number of other reasons.  When open source software exploded in 2000, there was a lot of greenfield.  Everyone was out there building new stuff that no one had done before.  But now we have a lot of brown field.  A lot of 1/2 planned, 1/2 finished pieces of code that have useful bits in them, but have been abandoned by their original creators.

Tools like git and github help you recover that brown field.  In the last couple of months I’ve run into project after project that petered out in 2006, but has a bunch of good code.  That means they are about 2 critical bug fixes away from being useful on modern systems.  It’s really not much work, but in the old system, with the projects locked up in a forge with an SVN or CVS source management system, they were dead.  You had to start over.  With github you can import that tree and keep working.

It’s a new pattern for how the open source community is going to function.  While it could be built on any distributed SCM, the fact that git has a really good 2 way svn bridge, and that github made itself “person oriented” vs. “project oriented”, really makes me believe that it’s creating a new pattern for both recovering the brown field of open source, and enabling the 9% to be much more effective with their output.

Software in the era of drive by contribution

Now that we’ve got a set of tools that really were designed for helping the 1% and the 9% work together, I think we’re going to see a whole new blossoming of open source software.  The rules of what it means to be a project contributor are changing, in really exciting ways.  Forking used to be cheap, and merging expensive, which is why forking was considered an insult.  But with tools like git merging is cheap, so the offensiveness of forking goes away.  It opens up for more experimentation, and more complex contributions happening outside the 1% group.  All this increases the velocity of contribution, and thus the volume of open source software out there.

I really think distributed source control is changing a lot of assumptions for how software gets developed.  So if you haven’t yet dug into the space, do it.

Software Debt

Monday, March 2nd, 2009

I’m glad to know that I’m not alone in thinking about Software Debt.  Ward Cunningham has a great video up about it, which is starting to filter through the internets.

Maybe if we get people thinking about Software Debt a bit more, we can make the software world a bit more sane.

OpenID and Gravatars

Sunday, February 15th, 2009

… also known as – please don’t make me fill out those same 6 fields to get into your website!

A few weeks ago I gave an MHVLUG talk on Ruby on Rails.  At the normal dinner outing afterwards one of our members was talking about maybe creating a small rails application where people could share and publish the podcasts they listen to, which I think is a great idea.  (Hopefully they’ll work on it at our web-hack-a-thon.)  But that led into the inevitable issue of “user accounts”.

Man, I hate having more user accounts.  And if we are going to do this project, I really didn’t want everyone in the LUG to have to have another one.  So I resolved to see what I could do about reusing the MHVLUG accounts in an external way.  It’s actually pretty easy as there is a Mediawiki OpenID extension which lets you go both ways.  You can enable OpenID logins to the wiki, and make people’s user pages OpenID providers.  Rails has a very good openid plugin (plus it’s integrated as part of the restful-authentication-tutorial) so that would make it trivial to write an application that people can log in using their MHVLUG password (the id will be a bit different, but that’s explainable).  While Facebook and Google are still dragging their feet a bit here, Yahoo, AOL, and Wordpress.com are all on the bandwagon, so many people already have these ids, they just don’t know it.

That got me following a few threads on OpenID, and looking at Wordpress.  It turns out that Wordpress also has a good OpenID plugin.  What’s quite interesting about that plugin is that it can make a wordpress instance the OpenID provider for 1 of the Wordpress users.  So if you have a personal blog, it means you can now very easily be your own OpenID.  Being able to log in as http://dague.net is quite convenient.

Lastly, I wanted to throw something in about gravatars.  You know how everyone wants you to upload your picture to their website?  Stop the madness!  Gravatars are just keyed off your email address, so if a site has that, they can look you up, and get your profile pic from the gravatar folks.  Newer wordpress templates automatically integrate this.  I made a minor adjustment to my template to get this support in there.  I’ve sworn now that Meetup.com is the last people I’m ever uploading a picture for, and that’s just because it’s hard to find complete strangers at a dinner without photos.  Again, there is a good rails plugin for this, so it’s pretty trivial to integrate if you are doing a Ruby on Rails application.
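To show how little there is to the “keyed off your email address” part, here is a rough Python sketch of the lookup, assuming the classic Gravatar scheme of an MD5 hash of the trimmed, lowercased address.  The function name gravatar_url is made up for illustration; a real app would more likely use an existing plugin.

```python
import hashlib


def gravatar_url(email, size=80):
    """Build an avatar URL from an email address (classic MD5 scheme)."""
    # Gravatar hashes the address after trimming whitespace and lowercasing.
    normalized = email.strip().lower()
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    # s= asks for a square image of that many pixels
    return f"https://www.gravatar.com/avatar/{digest}?s={size}"
```

The site never needs a copy of your picture, just the address it already has; it drops the resulting URL into an img tag.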

So, if you are a web2.0 hipster, and thinking about making a new service, please don’t make me create a new account, because, honestly, that’s getting close to being a deal breaker for me at this point.  And if you want my picture, the gravatar people have it.  I’m not uploading it for you again. :)

In praise of github

Tuesday, December 30th, 2008

A few years ago I became sold on distributed source control.  Being able to do offline work, try out new ideas cheaply, and throw them away, all were great things.  I started with mercurial, but over the summer started using git.  A couple of things pushed me over the edge.

  • git appeared more modular, though at the end of the day this wasn’t really true.  The lack of a libgit was actually very disappointing (especially after I had sworn there was one), as I’ve got a number of interesting ideas stalled behind that one.
  • the git-svn plugin, which provides really good 2 way integration between svn and git trees.  I’ve stopped making anon svn clones, I now do a git-svn clone.  If I want to fix something locally, I can now version that fix.
  • github – free social hosting of git trees

Github helps you over the hump in publicly hosting git trees.  Honestly, the hump isn’t very high, but the documentation out there could be a bit more straightforward.  I’d been chugging along using github for all my random open source projects, some that are active, some which are stalled.  But the source code is out there for others to take a look at.  Github provides nice instructions for people to clone the work, and run with it.  It’s definitely a prettier interface.

Github really started to shine for me this past weekend though.  I was looking for ical generation code for ruby to replace an email tool that I wrote in perl for our MHVLUG monthly meeting emails.  There exist 2 ruby ical projects, vpim and icalendar, neither of which supports timezones in the ical generation, and both with pretty inactive mailing lists.  Once it became clear that the problem was not solved, I decided to dig in and see if I could come up with something workable.

But once you go social, github really shines

There had been a post on the icalendar devel list a few months back that said he had fixed a couple of timezone issues and provided a github url.  I cloned that project, and realized that while it got closer to what I needed, it still didn’t quite do what I needed.  So I clicked the fork button.

I was now given my own fork of the icalendar source.  But more importantly, it also showed me all the other forks on github, of which there were 5 others.  I made my fixes, pushed them back public, and then proceeded to start to accumulate up some of the other changes out there.  There is even a fork queue which shows all the outstanding changes in other forks out there, as well as odds on whether or not the patches will apply.

While you could figure all this out on your own with the command line, that kind of discovery and view is really a help and a timesaver.

And it’s even better if you are doing ruby

Github is written in ruby, though I’m not sure on the framework behind it.  As an added bonus to people hosting ruby code on the site, the team created a gem build service into github.  You add a specially formatted gem spec file to your github tree, and you’ll get a gem built on each checkin.  My 2 ruby libraries that are there now are configured to build gems, easy for all to install.

If you haven’t checked out git, or github, you should.  While I found the learning curve on git to be higher than I really wanted to deal with, the community is very active, and the number of things that support git now is quite high.  Rails generators even support git now, automatically source managing via git or svn if you ask them to.  Github popped out of nowhere in 2008, and I can’t wait to see where they are going to go in 2009.

A new “law” for information

Tuesday, November 25th, 2008

Many people know Metcalfe’s law:

the value of a telecommunications network is proportional to the square of the number of connected users of the system (n²)

but it occurs to me we need a new law. Something that goes along the lines of:

the value of a piece of information increases as the square of the number of interfaces that information is exposed via

This thought has been kicking around in my head for the last couple of months after trying out Hiveminder, an online todo list.  Unlike other todo list applications I’ve tried, Hiveminder strives to provide the same information in as many different ways as possible.  There is a website, a mail interface, an IM interface, a twitter interface, an IMAP interface, an iCal interface, etc, etc.  I think I sorted out that there are about 12 ways to get access to data in hiveminder.  It makes the information so much more useful, because how I need to interact with that information changes depending on the task at hand.  I actively use 3 interfaces for Hiveminder on any given day, because how I want to glance at things is different from how I want to edit things.

That todo list is definitely far more valuable to me now than any other todo list that I’ve tried to keep.  When you are building some system that is primarily about data, think about this approach to it.  The value of the data goes way up if there are more ways to interact with it.  Maybe it’s not quite as the square, but I do know that a piece of data available via 2 interfaces is at least twice as valuable as one exposed in only one way.
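To put rough numbers on the difference between “twice as valuable” and “as the square”, here is a toy Python sketch.  The function interface_value and the idea of normalizing a single interface to a value of 1 are purely my assumptions for illustration; nothing here is measured.

```python
def interface_value(interfaces, exponent=2):
    """Relative value of data exposed via N interfaces (1 interface = 1.0)."""
    # exponent=1 is the conservative linear guess, exponent=2 the proposed law
    return float(interfaces) ** exponent


# interfaces, linear value, squared value
table = [(k, interface_value(k, 1), interface_value(k)) for k in (1, 2, 3, 12)]
for row in table:
    print(*row)
```

For the 3 interfaces I actually use every day that works out to 3 times the value on the linear guess versus 9 times on the square, and for all 12 it is 12 versus 144.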

Thinking about Debt

Wednesday, September 17th, 2008

I’ve been thinking about a lot of things in terms of debt recently, and the world looks a bit different if you do that. Debt is borrowing against the future, be that in time, money, energy, health, etc. Debt is what you get when you take short cuts, as you are borrowing from the future.

When your debt is money, it’s somewhat easy to understand. You take money from the future you, which you have to pay back at some point. It’s a little harder to understand in areas that aren’t money.

If you create a new piece of code you are creating both value and debt. Debt is created by taking shortcuts, as the software will need to be reworked to reasonably extend it in the future. You take a short cut now to pay for it later, with interest. Every future feature will take longer until you pay back your debt. Refactoring is really all about paying down debt in a responsible way in software.

Most of the time the right approach is to pay off your debt. The other option is bankruptcy (which we are seeing a lot of this week in the financial world). Software bankruptcy is throwing the whole thing out and starting from scratch.

When I started thinking about software development in terms of debt in the last few weeks, lots of things started to make a lot more sense. Shortcuts are debt. Inconsistent interfaces are debt. Inconsistent coding style is debt. Bad or wrong abstractions are debt. Missing documentation is debt. Confusing APIs are debt. If you want a project to move forward more productively you need to eliminate some of your debt, as it’s what slows people down (green field code is easy, brownfield is hard).

I’d love to hear about other concepts of debt, and what debt looks like in other media besides software. Please post comments if you are so inclined.

Hello Thunderbird

Monday, September 8th, 2008

Push finally came to shove, and I’ve now entered the 21st century by making Thunderbird my email client (I actually tried Evolution for a day, but after 20 crashes gave up. But that’s a different story.) Previously I was using mutt. There were a bunch of reasons to do this, though the biggest one for me was getting to turn off a box at home that was my IRC proxy, gateway to my home network, and ssh point for reading my email in mutt. That should save us at least a few hundred watts.

The New Configuration – Server Side

I’ve moved to using dovecot as my imap server. This has the advantage of being able to handle a home directory full of mboxes nicely (which courier could not). This means I can keep my perl based dynamic mail filtering working on the server until I manage to rewrite it as a thunderbird extension. I was using IMAPS before just as a secure POP, but now I’m actually taking full advantage of having imap remote folders.

My IRC proxy moved to my linode, which was probably a better place for it to be anyway. I even bothered to package it as a ppa for ubuntu, which means you can easily install as well.

Lastly, my gateway box is now a kvm guest running on my big home media / backup server. I was quite impressed by how nice virt-manager made the system install and setup from an ubuntu 8.04 iso. I had to do a little manual effort to configure bridge networking correctly, and deal with conflicting dnsmasq instances, but after that all was good.

The New Configuration – Client Side

I’ve now got thunderbird set up for 3 IMAP accounts (dague.net, gmail, and work), plus news groups (all work ones). This gives me a really nice consolidated view of my email. I was pretty impressed by how well thunderbird handles the 4 identities, and routes outbound mail correctly. For dague.net and gmail, email is filtered server side; I have client side filters for work because the server side there is sieve, and I really don’t want to learn another filtering specification language.

On top of that I’ve got a ton of extensions. I found that thunderbird out of the box was ok, but I lost a lot of mutt functionality. After a hunt through the extensions I got most, if not all of that back. For the record here are the extensions I currently have installed:

  • Attachment Reminder – this fires off a warning and prompt if you hit its heuristic rules for an email that might need an attachment but doesn’t have one. I’ve seen the warning 4 times now, though they were all false positives. I do like the idea though, so I’ll keep it around.
  • Colored Diffs – brilliant if you are on mailing lists where patches are sent around
  • Display Mail User Agent – because I’m curious on who uses what. I always had this header visible in my mutt configs.
  • Display mailing list header – way more useful than I thought. It basically puts a set of links across the top of the email for Subscribe, Unsubscribe, Archive, etc. It makes it a lot easier to get off lists that you realize you don’t really care about any more.
  • Enigmail (from ubuntu package) – there was no way I was giving up pgp. It also has the advantage of making pgp policy setting much simpler.
  • Extension Developer – more on this later
  • Import Export Tools – because I had a lot of saved off mbox files that I needed to get back into thunderbird.
  • keyconfig – actually works on all mozilla base tools, but I needed it to redo a few key bindings
  • Lightning – this is the Sunbird calendar program as an embedded addon. It’s actually quite nice for calendaring and task lists.
  • Mnenhy – this gave me more control over mail headers. IIRC display mailing list header needs it to function.
  • Mutt Keys – my own extension, more on that in a bit
  • Nostalgy – gives you a set of nice key bindings and input field for save & copy of email. Very handy.
  • Provider for Google Calendar – a lightning plugin that lets you have good 2-way google calendar support. This is something evolution promised, but it didn’t work. It works great on thunderbird with this extension.
  • Quote Colors – if people bother to follow standard quoting models for email this does a really nice job of coloring the different posters to make it much easier to read.
  • Track Package – gives you highlight + right click to track packages based on emails. While it’s not everything I want, it is pretty useful.

But it could be better…

Thunderbird is now very useful to me, but I have found ways in which I could make the whole thing better. Mutt keys was a quick dive into making my own thunderbird extension that was nothing much more than key bindings (based on the now unmaintained mouseless extension). It’s rough, but it let me figure out some of the basic structure of writing thunderbird extensions.

Since then I installed extension developer, which has a great tab completable javascript shell, and have been exploring making an extension that lets me quickly make a calendar task out of an email. I have a bunch of ideas queued up behind this, but that is a short term useful one to dig into. I actually quite like the component interface model that thunderbird has, though I wish there were a few more API docs or examples to figure out what possibilities exist.

As I figure out more, I’m sure I’ll post it here. I have definitely found that developing thunderbird extensions is pretty tall grass, as very few folks have really written down much on it. I’m going to try to be a good citizen and stick stuff in the mozilla wiki as I figure it out.

The switch from xemacs -> emacs

Monday, May 19th, 2008

I’ve always felt the root of the emacs vs. vi holy war (which is one of the longest standing holy wars in free software) basically came down to the following key point:

When you first were learning Linux / Unix, did your mentor use vi or emacs?

The answer to that has at least a 90% correlation with your preferences. Much like most people share the politics of their parents, most people share the editor preferences of their mentors. Switches don’t tend to happen unless mentors switch as well.

And in that camp, I’m an emacs guy. I learned it in college when I took my first programming class (which was in lisp). Our professor gave us a starting .emacs file, pointed us at the tutorial, and built macros that helped us out in our efforts.

A near decade with XEmacs

Then I graduated from college, and my first mentor at IBM was also an emacs guy, except he was an xemacs guy. It was emacs, but prettier. So I piled on, and have been there ever since. Over the years I tried a couple of times to go back to emacs, but their font handling was never as good. I love programming in Arial, as it’s just really pleasant on the eyes (this shocks and horrifies people that line up = signs in declarations, but I don’t much care. :) ).

A few years ago, just as emacs was getting reasonable variable width font support, xemacs integrated anti-aliased fonts into their CVS tree, and now I had another reason to stay on xemacs, because now everything looks pretty. Using xemacs was sometimes a pain, as a number of modes didn’t really work right on it. I never had a reasonable html mode working that did indentation like I wanted.

Steve Yegge’s Rant

Last month Steve Yegge had a post entitled Xemacs is Dead, Long Live Xemacs which was basically a call to unify around emacs because it had finally caught up, and it is being very actively maintained. I was skeptical, but decided to try again. Using the Ubuntu packages I lost my anti-aliasing, which meant this was a failed experiment.

But, after some research, I realized that emacs cvs not only has xft support in the tree, but that since March it’s been the default. This is what will be emacs 23. I was already running xemacs out of cvs, so taking the same leap with emacs cvs wasn’t such a big deal.

It’s taken a couple of weeks to tweak my configuration to get me the same, or better, results with emacs as I had with xemacs. Last night I finally understood what I needed to get nxhtml to do my html.erb files correctly (ruby and html bits independently highlighted, and mode switching automatically when moving between code blocks). Minus 1 font issue with planner, I can definitely say I’m fully converted.

I’m also enjoying diving into elisp again. For whatever reason, life seems a bit more stable on emacs than it did on xemacs. And once emacs 23 actually makes it to distros, I won’t even need to have my own binary builds. :)