Searching for the Holy Grail

I recently came to the conclusion that, when it comes to programming languages, there is no Holy Grail, just a bunch of cups, and perhaps we should just get on with drinking.

I've worked with a lot of languages over the years. I started as a 10-year-old writing in Commodore BASIC on the Commodore 64, progressed to AmigaBASIC, AMOS and finally 68000 assembly on the Amiga 500, and briefly dipped into x86 assembly during my short stint as the owner of a 486SX33. At college I did programming courses that used Ada.

Sadly, I don't have any records of any of the programs that I wrote back in those days, although I fondly recall some. Whatever remains of them, fragilely stored on flimsy floppy disks or inscribed on dot-matrix printouts, is probably decomposing under countless layers of landfill by now.

I got my first Mac in the mid-90s. My earliest Mac programs were in REALbasic, but with the birth of the web I wrote "discussion boards" with flat-file "databases" in Perl, "weblogs" in PHP, and I learned HTML, CSS and JS, as you do. Almost all of that early stuff is lost too.

It's not until the 2000s that my personal fossil record has some surviving evidence. I used the CVS, Subversion, and SVK version control systems, and some of the artifacts that I created in those systems survived into the Git era, where I expect they'll enjoy a much longer digital half-life.

I wrote large projects in C and Objective-C on the desktop, and for the web I learned Ruby and went deep into JS. These were my professional languages that paid the bills for me, but along the way I explored and wrote side-projects in Haskell, Go and Lua. At work I sometimes have to dip into Hack. I've written my fair share of Vimscript and Bash (etc) stuff. Sometimes, I write snippets of Python. My latest experiments have been with Reason.

Looking back over that history, I envy the simplicity of the constrained choices that I used to make. I wrote BASIC because it came with the computer. I learned assembly because it was the only tool for the job (if by "the job" you meant writing games with fast graphics). Some of the choices were totally serendipitous and not really choices at all; I learned AMOS because someone gave me a copy of it. I learned Perl because it was basically "the way" that CGI was done back in the day. I learned C and Objective-C because those were the languages with which you could access Apple's APIs and build desktop apps.

Starting with Rails, the pace of innovation and renewal seemed to really take off. Languages and frameworks were spawned en masse, duking it out with one another to achieve domination and establish a "new paradigm". The choices became less clear, the list of contenders impractically long. For all the languages I did learn, there are many more that I didn't have time to even dabble in (not just the older languages — Erlang, Java, C++ — but countless others that they have spawned or influenced: Dart, Rust, Elm, PureScript, CoffeeScript, Elixir, Scala, Kotlin, D and many others).

I think there are some pretty clear conclusions to draw here: either we as language designers aren't very good at what we do, or as human beings we're afflicted with a crippling case of not-invented-here syndrome and an evolved inability to be happy and productive with the tools that we're given. Either way, the costs are immense: we spend a disproportionate amount of time and energy reinventing our tooling relative to how much we invest in actually building things with those tools. It's easy to mistake motion for progress, I guess.

Why am I writing about this now? I think it's because I've felt more and more restless about JavaScript. JavaScript is an increasingly multi-paradigm language that wants to be all things to all people. It has "Feature Envy". Its aggressively forward-looking development model (the whole TC39 process with its keen involvement by significant industry players with the power to effect meaningful change in the deployed ecosystem, and the availability and widespread use of tools like Babel that enable rapid and aggressive experimentation) together with the need to maintain backwards compatibility forever (you can't "break the web") means that it accrues an ever-growing, never-shrinking set of functionality and syntax. It has — or will have — almost every feature available in any other language, up to and including a Kitchen Sink capability, but will probably stop short of the one thing that I really wish it had (a real, not-bolted-on type system). JavaScript is simultaneously the best language to teach to beginners (there is no cheaper way to build something with a UI that will run anywhere) and the worst (welcome to an ecosystem with a rapidly churning set of tooling that's metastasizing fast enough that it may well collapse into a black hole one day).

Yet, I've come to the conclusion that all this searching for something better is a fool's errand. You don't make something better by combining the best bits of other languages. The most you can do is to take an opinionated stance about something that you consider to be really important, focus on getting that one thing really right, and then get on with the business of building useful things with it. Note that it doesn't suffice to be just opinionated; you also have to be focused. Here are some examples of opinionated, focused stances:

  • Haskell:
    • Core thesis: Fully unlock the power of abstraction with a purely functional, lazily evaluated core.
    • Advantage: You get an expressive, sophisticated type system that allows you to succinctly materialize ideas with a high degree of machine-assisted verification.
    • Trade-off: Some things, like modifying deeply nested immutable data structures, are hard.
  • Go:
    • Core thesis: Simplicity is paramount.
    • Advantage: Out of simple primitives you can build robust, highly-performant concurrent solutions.
    • Trade-off: Code is "boring", "verbose", "pedestrian".
  • C:
    • Core thesis: Abstraction is overrated.
    • Advantage: Speak to the Von Neumann architecture in its native tongue (almost) to build fast things, without needing to learn processor-specific assembly language.
    • Trade-off: When you build stuff out of gun powder, wire, and spark-plugs, you just may singe off your eyebrows.

If you try to make these languages better by blending together their best elements you wind up with behemoths like JavaScript and C++. These are not bad languages; they've been extremely successful, and many great things get built using them. And yet, people can't resist somehow trying to "fix" them, either by augmentation or outright replacement. Something is rotten in the state of programming languages.

Inevitably with the good stuff comes some baggage. Sometimes the elements don't combine well. You can't make a better Ruby, for example, by adding Haskell's strong static typing to it, because what you'd get wouldn't be the loose, fluid, pleasant, expressiveness of Ruby: you'd just get Haskell with an awkward syntax. I've previously remarked that programming in Ruby is like driving a rubber car without a seatbelt; that sounds like fun in a weird kind of way, and it is. By the time you've added static typing the language is no longer a car, nor is it made of rubber, and I guess it doesn't really make sense to ask whether it has a seatbelt or not any more (whatever "it" is). Large, growing languages inevitably tend towards resembling Frankenstein's monster over time.

All of this may strike you as being perilously close to plain old obvious common sense, and you may wonder why I'm bothering to write it at all. Why would I ever have thought that there was a Holy Grail in the first place?

It crept up on me insidiously at first. About 10 years ago I first got exposed to the idea that you should learn a new programming language every year, not necessarily to add to your practical tool-kit, but to expand your mind. Haskell was a popular choice for this purpose back then. The notion was that you should seek out "novel" ideas — note this means novel to you and not necessarily something universally recent — and train your mind by grappling with them. At the very least you kept your mental axe sharp by exercising it, and at best you might stumble across something that subtly (or even dramatically) changed your world view and in some nebulous, hard-to-articulate way, "made you a better developer".

Fast-forward ten years into this practice and I haven't learned ten new languages, but I have made significant incursions into about five. Even though I never lost sight of the reason why I was going through this whole exercise, there was a primitive, subconscious part of my brain that was wondering if I would end up finding "The One": the language that would somehow feel so "right" and enable me to be so effective that I'd be able to stop searching and settle down for a decade, or two, or three. I'd forgotten that this was never supposed to be a search at all. This is what reading too many blog posts with titles like "Why we're rewriting everything from X to Y" will do to you, given enough time. You start to think like a believer.

This year I was looking around for a new Language of the Year to dive into, but was struggling to find something that met my novelty criterion. Reason/OCaml felt too similar to Haskell. Rust definitely had some novel ideas, but it failed my other criterion: practical applicability (in the sense that I needed to find a well-suited, useful, motivating side-project in which to try it out).

I finally decided to go with Reason despite the relative lack of novelty, and this meant honestly confronting myself with the fact that at least part of me — the irrational part — had been holding out, hoping to find a Holy Grail candidate. I had to let go of that. I went in knowing Reason would have some real strengths (eg. fast compile times, great developer experience, solid performance etc) and some down-sides (eg. spartan documentation, few examples, and so on). There'd also be some mixed-bag stuff, like the small community, which you can consider a blessing or a curse depending on how you look at it.

All languages are going to suck in some way. But the bright side is that we have so many choices available to us now that we can choose languages that suck in the ways that we can tolerate, and conversely, excel in the areas that really matter to us. For me, Reason is interesting because two of the things that it gets really right (having a solid type system, and language-level support for immutable records) happen to represent a couple of the things whose absence really annoys me about JavaScript. I can compromise a lot on the rest — take the good, and the bad — because ultimately it doesn't matter that much to me. I'm looking for a cup to drink out of, not a fountain of eternal youth.

In closing, I want to make sure that this post doesn't end up being yet another "Why I'm rewriting everything from X to Y". I'll continue to use the other "cups" I have in my cupboard. The cup I choose at any particular moment will depend on the circumstances. I am sure that every now and again I'll want to try out a new vessel or two. But if you ever see me with a distant, glazed-over look in my eyes, and I look like I'm about to ride off towards the horizon in search of some mythical language that doesn't exist — or worse still, I start showing signs of wanting to design my own — please try to snap me out of it. A slap in the face and, if that doesn't work, a bucket of cold water should do the trick.


Building Relay Modern

There's a long backstory about the development of Relay Modern that's been bubbling around in the back of my head for a while. As I write this, version 1.0.0 is out, we've published an official blog post introducing the new version, and people out in the community have had time to write some useful introductory posts about it. There are already quadrillions of Facebook users getting their data delivered to them via Relay Modern, and even more importantly, I've ported my blog over to it. Seems like as good a time as any to tell this story, or at least part of it.

If you're a GraphQL aficionado, recovering JS framework author (or user), or are simply interested in the question of how best to manage data flow in complex server/client applications, then I'm writing this for you.

Hello, Relay

I started working on Relay back in early 2014 when it wasn't open source, wasn't called Relay, and had only recently decided to be in the business of bridging the gap between React and GraphQL (it originally started off as a new routing solution, or so the legend has it). GraphQL was still a pretty young technology at that point, but it had seen rapid uptake and was used extensively across our native apps.

Like any emergent technology, GraphQL had some growing pains. Because of this, Relay set out not just to bring GraphQL to JavaScript — and note that that meant not just the web but also mobile, via the still-secret React Native project — but to rethink some of the assumptions that had been made in the native apps up to that point.

  • One of the big ideas was query colocation — the notion that you should be able to specify your data requirements for each view component inside the view itself and that the framework should transparently handle aggregation and efficient fetching.
  • Another was that we could totally eliminate overfetching by dynamically constructing queries at runtime based on a comparison of what data the developer had asked for in their component and the data that the framework had already stashed away in the cache as the result of previous queries.
  • Finally, we figured that GraphQL fragments — the basic unit of re-use that allows developers to assemble queries out of a bunch of otherwise redundant parts — should be parameterizable; that is, fragments should be able to take arguments, just like functions do, so that they could be used flexibly in multiple places without duplication.
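
To make the colocation and parameterized-fragment ideas concrete, here is a rough sketch of what a component looked like under the original, pre-Modern Relay API. The schema fields and variable names are hypothetical, and in practice this also required Relay's Babel plugin to compile the Relay.QL template, but Relay.createContainer, Relay.QL and initialVariables are the real Classic primitives:

import React from 'react';
import Relay from 'react-relay';

// The view declares exactly the data it needs, right next to where it uses it.
class FriendsList extends React.Component {
  render() {
    return (
      <ul>
        {this.props.user.friends.edges.map(edge => (
          <li key={edge.node.id}>{edge.node.name}</li>
        ))}
      </ul>
    );
  }
}

// The fragment is colocated with the component and parameterized via $count;
// Relay aggregates fragments like this one into queries and fetches only the
// data that isn't already in its cache.
export default Relay.createContainer(FriendsList, {
  initialVariables: {count: 10},
  fragments: {
    user: () => Relay.QL`
      fragment on User {
        friends(first: $count) {
          edges {
            node {
              id
              name
            }
          }
        }
      }
    `,
  },
});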

This was a super exciting time to be at Facebook. React was taking off, and React Native, Relay, Flow and GraphQL were all angling towards open source release. There was a real sense that we had something awesome to share with the world.

The (First) Great Rewrite

As we approached the open source release, we realized it was time to rewrite a substantial part of Relay. GraphQL was going to come out in open source with a minimal but pretty rigorous spec, a new syntax and some subtle corrections and improvements over the organically grown internal implementation. We had some long-standing bugs that we wanted to fix, and a bunch of ideas on how to improve performance by making use of immutable data structures. Not truly immutable ones, mind you, as JavaScript doesn't have those: but ones that we'd build out of standard old mutable JS objects, and with which we'd carefully implement structural sharing and copy-on-write semantics, with Flow providing some assurances that we weren't mutating things we didn't own.
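
The copy-on-write idea is easy to sketch, even though this is a simplification rather than Relay's actual code: never mutate a record you didn't just create, and when something changes, shallow-copy only the objects along the path you're touching, sharing every other reference (with Flow's read-only types helping to police the convention in the real thing).

// Return a new record with `key` updated, sharing all other values by reference.
// If nothing actually changes, return the original object untouched so that
// consumers can use cheap identity checks (===) to detect changes.
function set(record, key, value) {
  if (record[key] === value) {
    return record;
  }
  return {...record, [key]: value};
}

// Same idea one level up: copy only the records along the path being modified.
function setIn(store, id, key, value) {
  const record = store[id];
  const nextRecord = set(record, key, value);
  if (nextRecord === record) {
    return store;
  }
  return {...store, [id]: nextRecord};
}

const store = {
  'user:1': {id: 'user:1', name: 'Alice', bestFriend: 'user:2'},
  'user:2': {id: 'user:2', name: 'Bob'},
};

const next = setIn(store, 'user:1', 'name', 'Alicia');

// Only the touched objects were copied; everything else is shared.
console.log(next !== store);                     // true
console.log(next['user:2'] === store['user:2']); // true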

This is where I must introduce Joe Savona. Joe was pretty new at Facebook at the time, but he joined the Relay team and dived into tackling some of the hardest problems we had to solve in the rewrite. In fact, his continual production of new ideas was one of the things that fueled the desire to actually go ahead and do the rewrite, for real. We had always lived with a long backlog of stuff that we'd love to get to some day, some of it quite "moonshotty", but Joe had a talent for translating those ideas into a series of ordered, achievable steps. We came up with some pleasant APIs for traversing and transforming trees (query ASTs, data trees), and set about rewriting the guts, heart, brain, and peripheral appendages of Relay. I presented a deep dive on some of this stuff back in 2015 that you can watch if you want to learn more.

This was some of the most intellectually interesting work I've done, working on hard problems among talented, inspiring, hard-working peers. My favorite part was adding support for nested "deferred" queries. For the first part of this, I adapted somebody else's very clever code for splitting apart a heterogeneous tree into a version that could do so recursively. Tied my brain in knots doing it. I then got to rewrite it on top of our new APIs and the result was satisfyingly simple compared to the old version. The same was true for all the other traversals that we had to reimplement. We finished the rewrite, open sourced Relay, and rode off into the sunset.

The (Second) Great Rewrite

Not quite. The sunset bit. Releasing the project was only the beginning. We had an ever-growing internal user-base at FB with increasingly demanding and diverse workloads to fulfill. We were faced with a critical problem: despite the fact that Relay was recently re-written and much better architected, it was crumpling under the weight of its own complexity. As we added features such as query persistence (the ability to reduce query upload sizes by saving the query on the server and sending up an identifier instead of the full query text), garbage collection, integration with offline disk caches on native platforms, and sophisticated new APIs for dealing with "connections" (paginated collections), we found ourselves frustrated with the speed at which we could make progress. This thing was intricate and complicated, hard to modify, stupefyingly magical and unpredictable.

We still had that long backlog of ideas, but we knew that we were adding to the tail of the queue faster than we were shifting from the head of it. It was a scary prospect, but we came to the conclusion that it was time to burn it all down and rewrite the thing from scratch. We knew we had to unlock performance wins that would require drastic changes, and rewriting was the only way we were going to be able to do it before old age, senility and burn-out took us out of the game. A risky move — big rewrites are often warned against for a reason — but we felt like we had to take the gamble. In doing the rewrite, we knew that the risk of failure (in typical "Second System Syndrome" fashion) was real, but inaction would have led to certain failure.

Everything old is new again

I can still remember the day in early 2016 when Joe and I grabbed a room in MPK 20, that fancy, Frank Gehry-designed thing with a park on the roof, and stood in front of a whiteboard wall to try and imagine what "Relay 2" would look like if we let go of all our previously held assumptions.

What if every query in Relay were statically known?

Woah, that's crazy talk, Joe. What are you talking about? I'd been spending so long inside the bubble of the Relay philosophy — the one with the tenets about query colocation, dynamic query construction, and fragment parametrization — that I'd never really considered this. Those tenets were already in place before I joined the team, and I assumed — perhaps naively — that they must have been there for a good reason; people who'd been at Facebook much longer than I and had witnessed the birth and evolution of GraphQL had decided that there must be a better way, and something new should be tried. It never occurred to me that embracing the static, the rigid, the "inflexible" could be a step forward. Funny that I hadn't, seeing as I had just prior to this built a new static API for writing Relay mutations (data updates) that aimed to replace the magical dynamism of the existing Relay mutations with something more predictable, debuggable and teachable.

But Joe hadn't just considered it; he'd had the idea circling around in his conscious and unconscious mind for possibly weeks or months. He'd given it deep, painstaking thought, and he was nearing the conclusion that it just might work. Fully static queries, known ahead of time, would unlock new kinds of performance optimization by allowing us to burn cycles at build time precomputing optimal structures that would allow us to go faster at run time. And with static queries, we'd get query persistence effectively "for free", just like the native apps.
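
The query-persistence win falls out almost for free once queries are static. Here is a minimal sketch of the idea rather than Relay's actual implementation, with a hypothetical id and endpoint: because the full query text is known at build time, a build step can register it with the server and emit a compact artifact, and at runtime the client sends only that identifier plus variables.

// Build time (conceptually): the compiler sees the full, static query text,
// registers it with the server, and writes out an artifact containing an id.
const artifact = {
  id: 'q_3f7a9c', // hypothetical persisted-query id
  // text: null   // the query text itself never needs to ship to the client
};

// Run time: the client sends the id and variables instead of kilobytes of text.
async function fetchQuery({id}, variables) {
  const response = await fetch('/graphql', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({queryId: id, variables}),
  });
  return response.json();
}

// fetchQuery(artifact, {count: 10}).then(data => console.log(data));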

So, back to that question.

What if every query in Relay were statically known?

It was heresy, but we went through the exercise anyway, figuring out what each of the existing APIs would look like if we wiped the slate clean and started from scratch without dynamic, runtime query construction. It meant giving up some features, jettisoning some magic. In return, users would get predictability, performance, and an execution model that mere mortals could understand. There would be a cost though: instead of having Relay figure out a minimal set of data to refetch when parameters change, we'd require users to specify a static query ahead of time. And we'd have to rewrite everything, again, in order to implement this.

On the flip side, rewriting would mean the ability to scratch some long-felt itches, like:

  • Switching to a purely POJO-based representation for cache data.
  • Abstracting all low-level record access behind a thin facade API that would allow us to plug in different kinds of underlying storage (including native data structures, mediated by a JavaScript-to-native bridge); see the sketch after this list.
  • Aligning our terminology, API shape, and data-flow with the latest thinking on the iOS and Android side (for better interoperability and communication).
  • Dropping support for legacy GraphQL (pre-open source) syntax.
  • Splitting the code up into separate "compiler", generic "core/runtime" and "React" packages.
  • Implementing deterministic, performant garbage collection.
  • And many others.
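
To illustrate the facade idea from the second bullet (this is a hypothetical sketch, not the real Relay API), the surface area can be as small as a handful of methods; anything that implements this shape, whether a plain-JS map, an offline disk cache, or native storage behind a bridge, can back the core without the core knowing the difference.

// A minimal record-source facade: the core only ever talks to this interface.
class POJORecordSource {
  constructor() {
    this._records = {}; // plain objects all the way down
  }
  get(dataID) {
    return this._records[dataID];
  }
  set(dataID, record) {
    this._records[dataID] = record;
  }
  delete(dataID) {
    delete this._records[dataID];
  }
  has(dataID) {
    return Object.prototype.hasOwnProperty.call(this._records, dataID);
  }
}

// A different backing store (eg. native data structures reached via a bridge)
// only needs to implement the same four methods.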

Relay Modern

As a tiny hat-tip to risk management, we decided to build a toy prototype before fully committing to the rewrite. Joe spent about two weeks building a little React Native app that could render and paginate through a list of friends, and navigate to a simple "permalink" view using two or three static queries. "I think it's going to work", he said. So Joe and I started again, this time for real. It took us about 3 months to implement the new core, while in the meantime other Relay team members continued adding features to the existing codebase.

We knew perf was going to matter, so I built a microbenchmarking framework that uses the Wilcoxon signed-rank test to give us an accurate picture of whether any given change made things better or worse. We maintained great test coverage and made sure everything was thoroughly Flow-typed. I built a "golden" test runner (this predated Jest's "snapshots" feature) to enable us to maintain a large body of tests easily even as we made frequent changes to our internal query representation. I made a sample React Native app so that we could run on-device benchmarks. Basically, I was scrambling as fast as I could to lay down most of the supporting work while Joe built core abstractions on top. It was amazing to work with such a motivated, talented collaborator. Striving to keep up, providing deep code review on his diffs (which he knocked out at a humbling clip and quality), and the countless stimulating discussions around whiteboards: I know that the experience made me a better developer. It was deeply rewarding.
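
For the curious, the statistical heart of such a harness is small. Below is a sketch of the Wilcoxon signed-rank calculation for paired before/after timings (not the framework's actual code, and it omits the tie correction to the variance); the appeal of a paired, non-parametric test is that noisy timing samples are rarely normally distributed, but measuring the same scenarios under both versions still lets you say with some confidence whether a change helped or hurt.

// Paired, non-parametric comparison of two sets of timings for the same
// scenarios: before[i] and after[i] must measure the same benchmark case.
function wilcoxonSignedRank(before, after) {
  // Signed differences, discarding exact ties.
  const diffs = before.map((b, i) => b - after[i]).filter(d => d !== 0);
  const n = diffs.length;

  // Rank the differences by absolute value, averaging ranks across ties.
  const sorted = diffs.map(d => ({d, abs: Math.abs(d)})).sort((a, b) => a.abs - b.abs);
  const ranks = new Array(n);
  for (let i = 0; i < n; ) {
    let j = i;
    while (j + 1 < n && sorted[j + 1].abs === sorted[i].abs) {
      j++;
    }
    for (let k = i; k <= j; k++) {
      ranks[k] = (i + j) / 2 + 1; // ranks are 1-based
    }
    i = j + 1;
  }

  // Sum the ranks of positive differences (after is faster) and negative ones
  // (after is slower).
  let wPlus = 0;
  let wMinus = 0;
  sorted.forEach(({d}, i) => (d > 0 ? (wPlus += ranks[i]) : (wMinus += ranks[i])));

  // Normal approximation: |z| above ~1.96 suggests a real difference (p < 0.05).
  const w = Math.min(wPlus, wMinus);
  const mean = (n * (n + 1)) / 4;
  const sd = Math.sqrt((n * (n + 1) * (2 * n + 1)) / 24);
  return {wPlus, wMinus, z: (w - mean) / sd};
}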

The moment of truth came when we were finally able to run an on-device normalization benchmark — normalization is the term we use for processing a query response from the network, transforming it from a hierarchical form into a flattened, "normalized" representation for storage in the on-device cache. We knew Relay Modern should be faster because it was drastically simpler, we'd taken great care to avoid performance anti-patterns, and we were simply doing much less work at run-time. When the benchmarks came in we were a little stunned. We ran them again. We sanity-checked them on multiple devices. The results were consistent: normalization in Relay Modern was about 10 times faster. It's true that normalization is just one of the things that Relay has to do, but it was clear that we were onto something. Relay Modern was going to be great for mobile devices and spotty networks. It would perform great on desktop environments too, but we'd aimed to solve a harder problem and it looked like that's exactly what we'd done. The bet had paid off.
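
Concretely, the transformation looks something like this simplified sketch (not Relay's implementation, which is driven by the query's structure rather than by inspecting the shape of the data): the nested response becomes a flat map of records keyed by id, with nested objects replaced by references.

// Flatten a nested response into {id: record} form, replacing nested
// objects with {__ref: id} pointers. Assumes every object carries an `id`.
function normalize(node, records = {}) {
  const record = {};
  for (const [key, value] of Object.entries(node)) {
    if (Array.isArray(value)) {
      record[key] = value.map(item => ({__ref: normalize(item, records)}));
    } else if (value !== null && typeof value === 'object') {
      record[key] = {__ref: normalize(value, records)};
    } else {
      record[key] = value;
    }
  }
  records[node.id] = record;
  return node.id;
}

const response = {
  id: 'user:1',
  name: 'Alice',
  bestFriend: {id: 'user:2', name: 'Bob'},
  friends: [{id: 'user:2', name: 'Bob'}, {id: 'user:3', name: 'Carol'}],
};

const records = {};
normalize(response, records);
// records is now flat:
// {
//   'user:2': {id: 'user:2', name: 'Bob'},
//   'user:3': {id: 'user:3', name: 'Carol'},
//   'user:1': {id: 'user:1', name: 'Alice',
//              bestFriend: {__ref: 'user:2'},
//              friends: [{__ref: 'user:2'}, {__ref: 'user:3'}]},
// }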

The happy ending

All this happened in the first half of 2016. We actually thought we were on the brink of shipping it. I spoke about it publicly for the first time in August — ill-advisedly calling it "Relay 2" because we didn't have a better name for it yet — and Joe followed soon after. We had a few road bumps on the way which led us to delay shipping; I'm sorry that it took so long, but I'm really happy to say that the product is finally out the door.

Between finishing the new core and shipping 1.0.0 there was a lot of thankless grunt work done by a bunch of people on the team. It was a group effort, but in particular:

  • Yuzhi led an amazing effort to migrate thousands of Classic components and educate teams.
  • Jennifer built out prefetching (the ability to have native code on a mobile device start fetching a query for a React Native app before the JavaScript VM has even finished booting).
  • Jan did a fantastic job of making sure we had a great migration strategy and compatibility API for moving existing apps over from Relay Classic to Relay Modern.
  • Lee helped us prepare and package everything for an open source release.
  • Our manager Alex was a roving support agent who tirelessly helped out with anything and everything.

But this post is in large part a tribute to Joe Savona. Neither of us is working directly on Relay any more, but the experience will forever loom as an indelible and transformative part of my Facebook story. As a colleague, erstwhile neighbor, and friend, working on Relay with Joe was a once-in-a-lifetime experience. I'm sure that Relay will continue to be an important building block for teams at Facebook, and I hope that it's useful to teams in the external community as well, but no matter what direction the framework ends up evolving towards in the future, I know that the design and architecture will retain elements of Joe's brilliant touch for a long time to come. Thank you, Joe, and keep on hacking.


A tale of three filter-branches

TL;DR: I used git-filter-branch to rewrite the history of the repo containing this website's files, processing 4,980 commits and transforming 3,702 wikitext files to Markdown along the way. I wrote three separate versions: the first would have taken as long as 42 days to complete, the second perhaps 3 to 4 days, and the third and final version completed in about an hour.

This is the tale of how I spent a few hours implementing and re-implementing a slow transformation, reducing a projected runtime of 42 days to an actual runtime of about an hour, and experiencing a joyful reminder of the thrill of problem-solving that makes me love programming so much.

The background

For a bunch of self-indulgent reasons, the material for this website lives as a collection of Markdown documents in the content branch of a Git repo. That "Markdown" bit wasn't always true though. As a result of an unfortunate choice, I'd actually accumulated thousands of wikitext documents over the years, and I'd been serving them with a Rails app and a very fast wikitext translator gem.

When I got rid of the Rails app and moved the content out of a database and into Git, I wanted to preserve exactly the same output when rendered to HTML, so I made a "wikiserve" microservice in Ruby that simply wrapped the wikitext gem, and I had my Node GraphQL server hit the microservice whenever it needed to turn wikitext into HTML.

This worked remarkably well, but after leaving the whole thing untouched for a year or more I found that I couldn't run the microservice locally on my new laptop. You see, frustrated by the bitrot and flakiness that comes with the added complexity of having multiple installed Rubies (one from Apple, one via Homebrew, a bunch installed by ruby-install and switched between using chruby), I decided to just say "screw it" and uninstall all but the stock Apple one that comes baked into the system. There is simply no reason for me to muck about chasing anything like the bleeding edge of the Ruby ecosystem: I want it to be a dependable, stable, cross-platform workhorse that I can count on to get shit done. I rely on Ruby to power Command-T inside Vim (which I use many dozens of times a day), to process outbound email for me from mutt, and to handle things like encrypted content in my Git repos. In short, I am seldom actively iterating on any of this stuff, but I want it to just work, just like I do when I plug a fridge into the wall and expect it to dependably keep things cold for me.

The catch is that Apple introduced a thing called "System Integrity Protection" (SIP) with El Capitan which effectively breaks installation of any Ruby gem that has an executable component. These gems try to install their binaries under /usr/bin/, but that is off limits under SIP, even for the root user.

Ugh. There are workarounds, of course — disabling SIP temporarily or permanently, explicitly setting GEM_HOME and GEM_PATH to install into another location, or specifying -n /usr/local/bin when running gem install — but I was just left with the feeling that I wanted to accelerate my plan to reduce my dependency on Ruby on macOS by getting my website content out of wikitext and into Markdown.

Modeling content using Git primitives

This is a terrible but fun idea. When I created Masochist I took all of my content out of a database — with its indices, associations, efficient searches and cheap ORDER BY clauses — and shoved it into a bunch of plain text files sitting in a Git repo. There are some really nice properties of this design:

  • You get content versioning "for free".
  • Minimal vendor lock-in: the only real lock-in is the format itself, and Markdown is easily translated into other formats.
  • Easy, distributed backups: just push to another (free) Git host, such as GitHub, GitLab, Bitbucket etc or your own server.

Nevertheless, there are some complications too. The first one is that you have to decide how you are going to model the content metadata that previously lived in your neat little DB tables: things like "created at" and "deleted at" timestamps, or which tags apply to which pieces of content.

Let's just consider the timestamps angle for a second. Some of your alternatives include:

  • Ignore it: Lose all the history and only ever look at the current state of the content tree. This is the dumbest and simplest thing that could possibly work, but it is a bit sad to lose or obscure the richness that an immutable history record can provide.
  • Shove it in headers inside the content documents: This would be relatively easy to access and index, but updating it would be a pain: Do you set up editor hooks to update the timestamps when you save a file? Do you have a separate pre-commit tool that enforces timestamps are correctly updated?
  • Treat the Git history itself as the source of metadata: That is, infer the "created at" timestamp by detecting when an item first entered the repository and the "updated at" timestamp by looking at when it was last modified. Note that you have to do it this way because Git doesn't manage timestamps on files, so you can't rely on the filesystem timestamps to tell you anything useful; even if they were always correct and consistent on your laptop (a big "if") there's not a chance they'll be right when you start push-ing, pull-ing and clone-ing as part of your deploy process.

That last one sure sounds the most elegant, doesn't it? But it also obliges us to accept a reality about Git's object database: it's made to be blazingly fast for certain common operations (git status, git commit etc) but not others. For example, answering that question of "detecting when an item first entered the repository" could require you to traverse from the current HEAD all the way back to the root commit of the repository, which could mean examining a thousands-long commit chain. And note, even if you know how Git works and seek to minimize the number of git processes that you need to fork and the number of commits that you actually need to examine (eg. by limiting git log with a pathspec), Git's internals will still need to traverse that thousands-long chain in the worst case.
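
As a concrete (if naive) illustration of that third approach from a Node process, shelling out to git log with a pathspec (the path here is hypothetical, and this is not Masochist's actual code): each call looks innocent, but as just described, Git may still have to walk a very long commit chain to answer the "created at" question.

const {execFileSync} = require('child_process');

function git(...args) {
  return execFileSync('git', args, {encoding: 'utf8'}).trim();
}

// "Updated at": the author date of the most recent commit touching the path.
function updatedAt(path) {
  return new Date(git('log', '-1', '--format=%aI', '--', path));
}

// "Created at": the author date of the oldest commit touching the path. Note
// that answering this can force Git to walk all the way back towards the root
// commit, which is exactly the cost discussed above.
function createdAt(path) {
  const dates = git('log', '--follow', '--format=%aI', '--', path).split('\n');
  return new Date(dates[dates.length - 1]);
}

console.log(createdAt('content/blog/example.md'), updatedAt('content/blog/example.md'));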

We're going to need a secondary index then. For that, we'll use Redis, storing in it whatever we need to make the operations that would be slow in Git fast. Updating this index is complicated (you know the saying about cache invalidation and this is no exception).
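
A sketch of what goes into that index, using the ioredis client and hypothetical key names (the real code is gnarlier precisely because of the invalidation workflows described next): sorted sets make "most recently updated" queries cheap, and plain sets handle tag membership.

const Redis = require('ioredis');
const redis = new Redis(); // assumes a local Redis instance

// Index a single content item under hypothetical key names.
async function indexItem(item) {
  // ZSETs keyed by timestamp make "latest N items" a cheap ZREVRANGE.
  await redis.zadd('masochist:index:updated-at', item.updatedAt, item.id);
  await redis.zadd('masochist:index:created-at', item.createdAt, item.id);

  // Plain sets for tag membership; "which items have tag X" becomes SMEMBERS.
  for (const tag of item.tags) {
    await redis.sadd(`masochist:tag:${tag}`, item.id);
  }
}

// The hard part is keeping this consistent as content changes: removing an
// item means cleaning it out of every set it was ever added to.
async function removeItem(item) {
  await redis.zrem('masochist:index:updated-at', item.id);
  await redis.zrem('masochist:index:created-at', item.id);
  for (const tag of item.tags) {
    await redis.srem(`masochist:tag:${tag}`, item.id);
  }
}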

And that's just timestamps. We haven't even talked about tags indices yet. If we stick the tags in the content document headers, we'll need to index those too. And we'll need to make sure that we carefully handle all sorts of workflows, some of them quite edge-case-y: adding tags, removing tags, removing the last instance of a tag anywhere in the repo, and so on.

The code to update these indices is hideous and gnarly. Note: the project is called Masochist for good reason. At least I can console myself that nobody else has to understand it...

Back to filter-branch

Given this architecture, if we want to port all our wikitext files to Markdown, we can't just do it in place. That will blow away our precious timestamp information. Moving "foo.wikitext" to "foo.markdown" would effectively look like the "foo" content coming into existence today.

We have to go back and re-write the history. This is why the README on the "content" branch has a big disclaimer in it:

WARNING

The content and history of this branch may be re-written at any time.

We're going to use git-filter-branch to go back over the entire history. For every commit that deals with a ".wikitext" file in any way, we're going to edit that file in place, renaming and porting it into an ".md" file, and we're going to recreate the commit with all the other metadata intact, timestamps and all. It will be as if the whole wikitext mistake never happened...

The first thing is figuring out how to turn wikitext into Markdown as cheaply as possible. I wrote the wikitext parser in the wikitext gem; it is about 2,600 lines of hand-tweaked, performance-oriented C code. I don't want to have to turn all that into something that emits Markdown instead of HTML. It certainly could be done, but I am too impatient. I just want to bang something out as cheaply as possible while maintaining correctness.

The result is this abomination of a script. It's a Ruby script that forks out to node to read in files and make use of the unpack-content package (so that I wouldn't have to rewrite the error-prone logic for dealing with document metadata headers). It gets back JSON, parses it, preprocesses the markup with a nasty regex (and a copy-pasta'd one at that), translates the wikitext to HTML using the wikitext gem, turns the HTML into Markdown using pandoc, applies another layer of ugly postprocessing, and finally glues back on the metadata headers before spitting the result back out. Definitely an inglorious hack, but I felt delighted at the quality of the result I was able to get with such a quick little ditty of a script. Additionally, the way I could inject my custom pre- and post-processing so easily into the flow — among other things, fixing a number of long-broken links — meant that the quality of the results is arguably better than if I had done it the "right" way and taught the wikitext gem how to produce Markdown instead of HTML. I would say "worse is better", but that would be distorting the intent of the originator of the phrase.

My favorite function from the script is this 180-column monstrosity, which will surely break the layout on whatever device you happen to be reading this on:

# Prerequisite:
#
#     npm install -g unpack-content
#
require 'json'
require 'shellwords'

def read(file)
  if file == '-'
    safe_file = '/dev/stdin'
  else
    safe_file = Shellwords.shellescape(file)
  end
  JSON[%x{NODE_PATH=/usr/local/lib/node_modules node -e "process.stdout.write(JSON.stringify(require('unpack-content').default(require('fs').readFileSync('#{safe_file}').toString())))"}]
end

This really exemplifies the spirit of the script. Hard-coded to assume a global install in a specific location, a single long line because I was too lazy to break it up, with only two small winks in the direction of software development professionalism: care taken to escape unsafe characters in the shell, and the flexibility to read from standard input rather than from a file on the disk.

With this in place and tested, we can move on to the real business: calling this thing from inside git-filter-branch.

Take 1: Render it like it's React, baby

My first try used the --tree-filter option, which basically allows you to mutate the filesystem any way you want, and which will drop you into a directory with a checked out work tree corresponding to each and every revision in the entire history, in order, one at a time.

The code looks more or less like this:

require 'fileutils'

# `escape` (defined elsewhere in the script) shell-escapes its argument, and
# `wiki-to-markdown` is the conversion script described above.
paths = Dir['content/**/*.wikitext']
paths.each do |path|
  basename = File.basename(path)
  basename_without_extension = File.basename(basename, File.extname(basename))
  dirname = File.dirname(path)
  safe_src = escape(path)
  safe_dest = escape("#{dirname}/#{basename_without_extension}.md")
  %x{wiki-to-markdown #{safe_src} > #{safe_dest}}
  FileUtils.rm path
end

Basically, it's going to find every file with a ".wikitext" extension and turn it into a transformed one with an ".md" extension. Couldn't be simpler. (When I first wrote it, I also had this copy each pre-transform file into an "archive" directory, because I was worried about losing it forever once I'd done a forced update of the "content" branch. I later realized that I could just create a separate "archive" branch and rely on its immutable history to keep the original state around for me forever.)

When I ran this with a bunch of logging in place to see what was happening (and also inspecting the .git-rewrite directory in which these temporary work trees were being manipulated), I realized something disturbing. If commit A added a wikitext file and I created a rewritten commit A' with a markdown file instead, when I came to the next commit — B, which also added a wikitext file — I'd find that the changes I'd made in A' weren't visible in B so I had to redo them. In other words, if the first commit transformed 1 file, then the 100th commit ended up transforming 100 files, 99 of which it had already previously transformed (and not just once, either).

Of course, this should have been obvious to me. If you rewrite a chain of commits A → B → C → D to A' → B' → C' → D', you are operating at each step on A, then B, then C and so on, rather than A', B', C' etc. Evidently, it has to be this way. Git isn't magically going to merge changes from A → A' into B before handing the work tree over to you. On repos above a trivial size threshold, --tree-filter techniques are only going to scale if you can run your side-effects in basically constant time.

The verdict

Faced with the choice between going to bed with the intention of seeing if the run had finished in the morning, and doing a back-of-the-napkin calculation right there to see whether the design was viable, I did the math.

The worst-case runtime of this approach is O(nm), where n = 4,980 (commits) and m = 3,702 (wikitext files). That's in the ballpark of 18 million transforms, and at about 5 transforms per second (the speed of my shitty conversion script), we're looking at 42 days to complete the filter run. Clearly unacceptable, and even more so when you consider that any bug discovered later in the run would require it to be restarted all the way from the beginning again.
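
The napkin math, spelled out:

const commits = 4980;
const wikitextFiles = 3702;
const transformsPerSecond = 5;

const worstCaseTransforms = commits * wikitextFiles;       // 18,435,960
const seconds = worstCaseTransforms / transformsPerSecond;
console.log(seconds / 86400);                              // ≈ 42.7 days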

This approach is pretty much like re-rendering the world for every frame from the top down. Super simple to reason about, but doesn't necessarily scale so well. Like in React, right? And just like in React, we need to find our way to short-circuit in shouldComponentUpdate and do less work. (Apologies to any non-JS-programming readers who might find this analogy non-illuminating. All should become clear in the next section.)

So, instead of going to bed, I decided to try another strategy.

Take 2: In which the author embraces stateful computation

Clearly we only want to run each transform once per unique wikitext input. That is, most files don't have edits, so they should get transformed exactly once, and the others should get transformed only as many times as they were edited.

To effect this magic, we need to maintain state about what things we've previously transformed. This is tricky though, because our script is going to be called separately, 4,980 times; once for each commit. Where to store the state in between invocations? Dump it to YAML or JSON and read it in and out? Sounds slow.

The approach I took is much more complicated than my initial attempt. The core idea is to use git ls-tree -r to get a list of files:

def ls(rev)
  # GitObject is a simple value object (eg. Struct.new(:mode, :type, :hash, :name))
  # defined elsewhere in the script.
  tree = `git ls-tree -r -z #{rev}`.split("\u0000")
  tree.each_with_object({}) do |line, memo|
    mode, type, hash, name = line.match(/^(\d+) (\w+) ([a-f0-9]+)\t(.+)/).captures
    memo[name] = GitObject.new(mode, type, hash, name)
  end
end

And we do that for the current commit (the one being rewritten), the parent of the current commit, and the last commit that we actually rewrote. We know that if nothing changed about a file between the parent and the current revision, then we don't have to retransform it; we can just grab the blob from the last-rewritten commit.

The logic to do this kind of update is satisfyingly low-level and looks something like this (paraphrasing):

# Remove the ".wikitext" file from the index.
%x{git rm --cached -- #{escape(name)}}

# Shove the pre-existing blob into the index at the right place.
info = escape("#{mode},#{hash},#{markdown_file(name)}")
%x{git update-index --add --cacheinfo #{info}}

On those occasions when we do need to perform a transform (because a new file was added, or an existing one was edited) we do this, more or less:

# Get the contents of the previous source blob,
# transform it on the fly (no temporary files, just stdin/stdout and pipes),
# then add it to Git's object database.
hash = %x{
  git cat-file blob #{hash} |
  wiki-to-markdown - |
  git hash-object -w --stdin
}.chomp

Once we have obtained a hash in this way, we can perform an update like we did above using git update-index and --cacheinfo.

All of this is happening in the index, not in the worktree (there is no worktree now, as we're using --index-filter instead of --tree-filter) so this is much faster.

The verdict

Again we have the choice: go to bed and see whether this thing is done when I wake up, or do some estimation and make a judgment call.

This approach is much better, but still not fast enough. We now get through about 50 files per second because most of the files we look at don't need to be transformed. At that rate, we're now looking at just over 4 days in the worst case. Significantly better than our initial 42-day estimate, but still not good enough. Even if we note that the real total of files that we'll have to look at is closer to n(n + 1) / 2 (ie. the sum of consecutive integers between 1 and the number of commits, seeing as most commits add just one new file), we're still in the vicinity of 12 million operations, which adds up to nearly 3 days at our rate of 50 operations per second.

Back to the ol' drawing board. Bed can wait.

Take 3: In which we burn the haystack

One of my former colleagues said this:

Optimization problems are like finding a needle in a haystack: you can either try to find a way to sort through the hay faster, or you can burn the fucking haystack down.

I don't know where he originally got this from, but I like how visually evocative it is of one of the fundamental insights of performance optimization: rather than trying to work faster, it's better to not do the work in the first place. Avoided work is infinitely fast. You can't do better than that.

Here we're going to drastically reduce the number of operations we need to do from about 12 million down to around n (4,980, or the number of commits). What we want to do, in essence, is look only at files which actually got touched by a particular commit, and have our changes to all the other files carry forward automatically.

There is some hand-waving here as I'm essentially writing off the cost that git diff-tree incurs when we ask it to compare two tree objects and tell us what changed. It is a non-zero cost, but it ends up being insignificant in our use case because there are a number of tricks Git can and does employ to prune the search space (there you go, burning that haystack): specifically, Git can do things like short-circuit traversal any time it encounters identical subtrees, and it can do that very fast indeed thanks to the use of SHA-1 digests to identify objects. If tree a/b has the same digest in two revisions then we know we need not recurse into it: no matter how large or deep it is, we know that the entire thing is the same in both revs just by looking at the hash. We could try implementing similar tricks in our user-land code, but the truth is that delegating to Git's highly optimized C implementation here is going to be as good as it gets and we can consider it effectively free for the purposes of evaluating our algorithm.

Again, we'll be using --index-filter for speed, and again we'll rely on access to the last-rewritten commit (seeing as we need to perform a bulk copy of it as cheaply as possible, then apply only our modifications on top of it, we need to know what to copy). The key trick here is making use of the map function that is available in the scope of the script text that git-filter-branch will eval each time it visits a commit. Unlike our first --tree-filter-based attempt which was as easily invoked as this:

git filter-branch \
  --tree-filter ~/bin/wiki-to-markdown-filter \
  HEAD

With the second and third --index-filter-based strategies our invocation is going to look like this:

git filter-branch --index-filter \
  'PREV=$(map $(git rev-parse -q --verify $GIT_COMMIT^)); ~/bin/wiki-to-markdown-filter $PREV' \
  HEAD

# Or, more naturally written without explicit $PREV:
git filter-branch --index-filter \
  '~/bin/wiki-to-markdown-filter $(map $(git rev-parse -q --verify $GIT_COMMIT^))' \
  HEAD

Our goal here is to calculate the value for this PREV variable: the hash of the last-rewritten commit. $GIT_COMMIT is the commit that's currently being rewritten and $GIT_COMMIT^ is its parent (we can use ^ here with impunity because the content branch has no multi-parent — ie. merge — commits). If there is no parent, git rev-parse -q --verify will just return an empty string, and our filter script — which will be performing the same check — won't bother with looking at that $PREV argument.

The map function is the secret sauce here. It takes that parent commit hash and, if the parent was rewritten, gives us the rewritten hash, which is exactly what we want. If it was not rewritten, it gives us the hash back as-is. Because commit hashes are invalidated by any change, changing the parent commit's hash by rewriting it will cause all of its descendant commits to be rewritten as well. This means that map will return the parent hash as-is only at the beginning of the filter run; as soon as we rewrite one commit, map will start returning rewritten parent commit hashes from there on.

From this point, it's easy. With $PREV, we can use git reset to make the index look just like the last-rewritten commit. We then use git diff-tree to get the changed files, apply our transformation to anything that needs it, and let anything else that changed but doesn't need transformation go through unmodified. We are effectively doing something like a rebase with an additional, optional, automated transform along the way.
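
Condensed, the body of that loop looks something like the following. This is a sketch in JavaScript purely for illustration (the real filter is a Ruby script), reusing the wiki-to-markdown conversion script from earlier, and it glosses over quoting, error handling and the edge cases mentioned below.

const {execSync} = require('child_process');

const sh = command => execSync(command, {encoding: 'utf8'});

// $GIT_COMMIT is the commit currently being rewritten; the rewritten parent
// hash (the output of `map`, called $PREV above) arrives as the first argument.
const commit = process.env.GIT_COMMIT;
const prev = process.argv[2];

if (prev) {
  // Make the index identical to the last-rewritten commit...
  sh(`git reset -q ${prev} -- .`);

  // ...then replay only the paths this commit actually touched.
  const changes = sh(`git diff-tree -r ${commit}^ ${commit}`)
    .trim()
    .split('\n')
    .filter(Boolean);

  for (const line of changes) {
    // Raw diff-tree output: ":srcmode dstmode srcsha dstsha status<TAB>path"
    const [meta, path] = line.split('\t');
    const [, dstMode, , dstSha, status] = meta.slice(1).split(' ');

    if (path.endsWith('.wikitext')) {
      const markdown = path.replace(/\.wikitext$/, '.md');
      if (status === 'D') {
        sh(`git update-index --force-remove -- "${markdown}"`);
      } else {
        // Transform the new blob in a pipeline and stage it under the .md name.
        const sha = sh(
          `git cat-file blob ${dstSha} | wiki-to-markdown - | git hash-object -w --stdin`
        ).trim();
        sh(`git update-index --add --cacheinfo "${dstMode},${sha},${markdown}"`);
      }
    } else if (status === 'D') {
      sh(`git update-index --force-remove -- "${path}"`);
    } else {
      // Anything else that changed goes through unmodified.
      sh(`git update-index --add --cacheinfo "${dstMode},${dstSha},${path}"`);
    }
  }
} else {
  // Root commit: no rewritten parent to copy from, so every .wikitext entry
  // already in the index gets transformed (essentially the "take 2" path).
}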

Just to be sure I wasn't going to be bitten by edge cases, I used git whatchanged --diff-filter=RD to find all commits that moved or deleted files. It turned out that there were only a few, noted in the comments, and that some simple logic would handle all cases. Perhaps the only remaining "gotcha" here is that git diff-tree doesn't report similarity statuses (eg. R91) like git whatchanged does, but rather shows them as paired A (add) and D (delete) operations. In the end, it all comes out in the wash.

The verdict

Version 3 gets us down to about 1 hour to rewrite the whole history. You can see the results on the current content branch (although if I ever rewrite the history again — something which is likely — you can expect that link to be broken; this is the evergreen link to the branch as it currently exists). I expect the next big filter-branch rewrite will be when I go back and scrape my oldest blog posts — which currently only exist as mirrored PHP/HTML files — and turn those into Markdown too, at which point I'll want to "prepend" them to my content branch, invalidating all the existing commits in the process.

Concluding remarks: in which our author revels in the joy of programming

So, we got a projected 42-day runtime down to an hour using nothing but simple techniques and reasoning. There's no genius in this, but damn is it rewarding. Not bad for a Friday night's work. This is a very strong reminder of why I enjoy programming in the first place: the ability to transport the intractable into the realm of the practical, and seemingly effortlessly. It doesn't always work out so easily — and I'm definitely spoilt here by the power and flexibility of the Git tools, which allowed me to solve the same problem in three significantly different ways — but when it does, it is magic.

And now, the icing on the cake: I get to rip out the "wikiserve" microservice. Deleting code is almost as much fun as making code run faster.
