The Power of Perl: Converting an A4 PDF to Letter with Margins

Because I've moved into the more elegant waters of Ruby and Mono, I sometimes forget just how powerful Perl can be.  Sometimes 8 lines of Perl are all you need to solve a problem.

The Problem

So, I've gotten back into Blood Bowl with my friend Pyg, but that's mostly for another blog post.  As we've been relearning the rules, Pyg found a much better consolidated rule set on the web as a PDF, in A4, as it was created by Brits.  A4 is the far more logical way to make paper that's roughly normal page size.  However, it is slightly longer and slightly narrower than our Letter paper standard.

The real challenge here was that I wanted to create something that was bindable at our local Office Max. "Printing" the A4 PDF to a Letter PDF in evince gave me something with about 3/4 inch of white margin on the right side of every page.  If I could get that to alternate between the right and left sides of the page, then I'd be golden.

As you can see, left as is, binding double sided would both look silly and actually punch through some of the text on the right side pages.

The Solution - Hack the PDF

PDF is just a document standard.  That means a lot of it is in plain text, for a generous definition of that word.  I opened the file up in emacs and started searching for words that might describe the page size.  Eventually I found the following snippet in the PDF:

<< /Type /Page
   /Parent 1 0 R
   /MediaBox [ 0 0 611.999983 791.999983 ]
   /Contents 196 0 R
   /Group <<
      /Type /Group
      /S /Transparency
      /CS /DeviceRGB
   >>
   /Resources 195 0 R
>>

This is part of the Page definition, and what's important is that MediaBox entry.  The 4 numbers there are the X and Y of the lower-left corner followed by the X and Y of the upper-right corner, measured in points (1/72 inch).  After some experimentation I determined that the values I needed for "right side pages" were: -52 0 559.999983 791.999983, which keeps the page Letter sized but moves the white margin to the left.  I needed to set every other page (of 84) to that.  There are MediaBox definitions that have nothing to do with Page, so I can't just search for MediaBox.  It has to be a MediaBox in a Page definition.
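To make the arithmetic concrete, here is a small sketch that treats the four MediaBox numbers as lower-left and upper-right corners in points, which is how the PDF specification defines them.  The box_size helper is a name I made up for illustration.  Both boxes come out at Letter size, and the 52 point shift is about 0.72 inch, close to the 3/4 inch margin mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (width, height) in points for a MediaBox given as
# lower-left (llx, lly) and upper-right (urx, ury) corners.
# box_size is a hypothetical helper, not part of the original script.
sub box_size {
    my ($llx, $lly, $urx, $ury) = @_;
    return ($urx - $llx, $ury - $lly);
}

my @original = (0, 0, 611.999983, 791.999983);
my @shifted  = (-52, 0, 559.999983, 791.999983);

for my $box (\@original, \@shifted) {
    my ($w, $h) = box_size(@$box);
    # 72 points per inch
    printf "[%s] is %.2f x %.2f pt (%.2f x %.2f in)\n",
        join(' ', @$box), $w, $h, $w / 72, $h / 72;
}
```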

Changing your line break

Perl has a lot of operations that are line oriented, so you get 1 line at a time.  But one of the greatest powers of Perl is that it's really easy to change what it considers a line break.  This is done with the input record separator special variable $/.  A common trick is $/ = undef; which means the first read of a file will read the entire thing into a string.  For this problem I decided that if I made << my separator, I'd get the Page definition and MediaBox on the same "line", making life much easier.  But enough of the details, here is the code:
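If the effect of $/ is hard to picture, this sketch may help.  It reads a toy fragment (not a valid PDF) from an in-memory filehandle with '<<' as the separator and prints each record with its newlines made visible.  The records_for helper and the sample text are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a string into records the way <> would with $/ set to '<<'.
# records_for is a hypothetical helper for this demonstration only.
sub records_for {
    my ($text) = @_;
    local $/ = '<<';    # restored automatically when the sub returns
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Toy fragment shaped like a Page dictionary.
my $sample = <<'END';
<< /Type /Page
   /MediaBox [ 0 0 612 792 ]
   /Group <<
      /S /Transparency
   >>
>>
END

my @records = records_for($sample);
for my $i (0 .. $#records) {
    (my $shown = $records[$i]) =~ s/\n/\\n/g;    # make newlines visible
    print "record $i: $shown\n";
}
```

Each record keeps its trailing '<<', just as a normal line keeps its newline.  The /Type /Page and /MediaBox entries end up together in one record, which is the whole point.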

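#!/usr/bin/perl

use strict;
use warnings;

# Read "lines" that end at each '<<', so a Page's /Type and /MediaBox
# arrive in the same record.
local $/ = '<<';

my $count = 0;

while (<>) {
    # \b keeps this from also matching the /Type /Pages tree node.
    if (m{/Type /Page\b}) {
        if (($count % 2) == 0) {
            # Start the box 52 points further left so the white margin
            # lands on the left side of the page.
            s{MediaBox \[.*?\]}{MediaBox [ -52 0 559.999983 791.999983 ]}gs;
        } else {
            s{MediaBox \[.*?\]}{MediaBox [ 0 0 611.999983 791.999983 ]}gs;
        }
        $count++;
    }
    print $_;
}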

This reads the original PDF from standard in and writes the modified PDF to standard out.  Because I could change the line separator, I don't have to keep track of whether I've seen a Page definition and whether I'm still inside that block (i.e. proper formal parsing).  That lets me do it in 2 matches.
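One way to check the result before heading to the print shop is to list the MediaBox of every page and confirm the values alternate.  This is only a sketch using the same record separator trick.  The page_boxes helper and the sample fragment are made up for illustration, and it assumes each page's MediaBox appears before the next '<<', as in the snippet above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the MediaBox of every record that looks like a single Page.
# Takes an open filehandle and returns a list of MediaBox strings.
# page_boxes is a hypothetical helper, not part of the original script.
sub page_boxes {
    my ($fh) = @_;
    local $/ = '<<';
    my @boxes;
    while (my $record = <$fh>) {
        next unless $record =~ m{/Type /Page\b};    # skips /Type /Pages too
        push @boxes, $1 if $record =~ m{/MediaBox \[\s*(.*?)\s*\]}s;
    }
    return @boxes;
}

# Toy two-page fragment standing in for the rewritten PDF.
my $sample = <<'END';
<< /Type /Page
   /MediaBox [ -52 0 559.999983 791.999983 ]
>>
<< /Type /Page
   /MediaBox [ 0 0 611.999983 791.999983 ]
>>
END

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my @boxes = page_boxes($fh);
close $fh;

for my $i (0 .. $#boxes) {
    printf "page %d: %s\n", $i + 1, $boxes[$i];
}
```

To run it against a real file, pass \*STDIN to page_boxes instead of the in-memory handle and redirect the PDF into the script.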

The results are just what you would like, and made for some nice printouts for binding.

Sometimes you just need to roll up your sleeves and bang out some Perl code. 🙂
