Building Relay Modern

There's a long backstory about the development of Relay Modern that's been bubbling around in the back of my head for a while. As I write this, version 1.0.0 is out, we've published an official blog post introducing the new version, and people out in the community have had time to write some useful introductory posts about it. There are already quadrillions of Facebook users getting their data delivered to them via Relay Modern, and even more importantly, I've ported my blog over to it. Seems like as good a time as any to tell this story, or at least part of it.

If you're a GraphQL aficionado, recovering JS framework author (or user), or are simply interested in the question of how best to manage data flow in complex server/client applications, then I'm writing this for you.

Hello, Relay

I started working on Relay back in early 2014 when it wasn't open source, wasn't called Relay, and had only recently decided to be in the business of bridging the gap between React and GraphQL (it originally started off as a new routing solution, or so the legend has it). GraphQL was still a pretty young technology at that point, but it had seen rapid uptake and was used extensively across our native apps.

Like any emergent technology, GraphQL had some growing pains. Because of this, Relay set out not just to bring GraphQL to JavaScript — and note that that meant not just the web but also mobile, via the still-secret React Native project — but to rethink some of the assumptions that had been made in the native apps up to that point.

  • One of the big ideas was query colocation — the notion that you should be able to specify your data requirements for each view component inside the view itself and that the framework should transparently handle aggregation and efficient fetching.
  • Another was that we could totally eliminate overfetching by dynamically constructing queries at runtime based on a comparison of what data the developer had asked for in their component and the data that the framework had already stashed away in the cache as the result of previous queries.
  • Finally, we figured that GraphQL fragments — the basic unit of re-use that allows developers to assemble queries out of a bunch of otherwise redundant parts — should be parameterizable; that is, fragments should be able to take arguments, just like functions do, so that they could be used flexibly in multiple places without duplication.

This was a super exciting time to be at Facebook. React was taking off, and React Native, Relay, Flow and GraphQL were all angling towards open source release. There was a real sense that we had something awesome to share with the world.

The (First) Great Rewrite

As we approached the open source release, we realized it was time to rewrite a substantial part of Relay. GraphQL was going to come out in open source with a minimal but pretty rigorous spec, a new syntax and some subtle corrections and improvements from the organically grown internal implementation. We had some long standing bugs that we wanted to fix, and a bunch of ideas on how to improve performance by making use of immutable data structures. Not truly immutable ones, mind you, as JavaScript doesn't have those: but ones that we'd build out of standard old mutable JS objects, and with which we'd carefully implement structural sharing and copy-on-write semantics, with Flow providing some assurances that we weren't mutating things we didn't own.

This is where I must introduce Joe Savona. Joe was pretty new at Facebook at the time, but he joined the Relay team and dived into tackling some of the hardest problems we had to solve in the rewrite. In fact, his continual production of new ideas was one of the things that fueled the desire to actually go ahead and do the rewrite, for real. We had always lived with a long backlog of stuff that we'd love to get to some day, some of it quite "moonshotty", but Joe had a talent for translating those ideas into a series of ordered, achievable steps. We came up with some pleasant APIs for traversing and transforming trees (query ASTs, data trees), and set about rewriting the guts, heart, brain, and peripheral appendages of Relay. I presented a deep dive on some of this stuff back in 2015 that you can watch if you want to learn more.

This was some of the most intellectually interesting work I've done, working on hard problems among talented, inspiring, hard-working peers. My favorite part was adding support for nested "deferred" queries. For the first part of this, I adapted somebody else's very clever code for splitting apart a heterogenous tree into a version that could do so recursively. Tied my brain in knots doing it. I then got to rewrite it on top of our new APIs and the result was satisfyingly simple compared to the old version. The same was true for all the other traversals that we had to reimplement. We finished the rewrite, open sourced Relay, and rode off into the sunset.

The (Second) Great Rewrite

Not quite. The sunset bit. Releasing the project was only the beginning. We had an ever-growing internal user-base at FB with increasingly demanding and diverse workloads to fulfill. We were faced with a critical problem: despite the fact that Relay was recently re-written and much better architected, it was crumpling under the weight of its own complexity. As we added features such as query persistence (the ability to reduce query upload sizes by saving the query on the server and sending up an identifier instead of the full query text), garbage collection, integration with offline disk caches on native platforms, and sophisticated new APIs for dealing with "connections" (paginated collections), we found ourselves frustrated with the speed at which we could make progress. This thing was intricate and complicated, hard to modify, stupefyingly magical and unpredictable.

We still had that long backlog of ideas, but we knew that we were adding to the tail of the queue faster than we were shifting from the head of it. It was a scary prospect, but we came to the conclusion that it was time to burn it all down and rewrite the thing from scratch. We knew we had to unlock performance wins that would require drastic changes, and rewriting was the only way we were going to be able to do it before old age, senility and burn-out took us out of the game. A risky move — big rewrites are often warned against for a reason — but we felt like we had to take the gamble. In doing the rewrite, we knew that the risk of failure (in typical "Second System Syndrome" fashion) was real, but inaction would have led to certain failure.

Everything old is new again

I can still remember the day in early 2016 when Joe and I grabbed a room in MPK 20, that fancy, Frank Gehry-designed thing with a park on the roof, and stood in front of a whiteboard wall to try and imagine what "Relay 2" would look like if we let go of all our previously held assumptions.

What if every query in Relay were statically known?

Woah, that's crazy talk, Joe. What are you talking about? I'd been spending too long inside the bubble of the Relay philosophy — the one with the tenets about query colocation, dynamic query construction, and fragment parametrization — that I'd never really considered this. Those tenets were already in place before I joined the team, and I assumed — perhaps naively — that they must have been there for a good reason; people who'd been at Facebook much longer than I and had witnessed the birth and evolution of GraphQL had decided that there must be a better way, and something new should be tried. It never occurred to me that embracing the static, the rigid, the "inflexible" could be a step forward. Funny that I hadn't, seeing as I had just prior to this built a new static API for writing Relay mutations (data updates) that aimed to replace the magical dynamism of the existing Relay mutations with something more predictable, debuggable and teachable.

But Joe hadn't just considered it; he'd had the idea circling around in his concious and unconscious mind for possibly weeks or months. He'd given it deep, painstaking thought, and he was nearing the conclusion that it just might work. Fully static queries, known ahead of time, would unlock new kinds of performance optimization by allowing us to burn cycles at build time precomputing optimal structures that would allow us to go faster at run time. And with static queries, we'd get query persistence effectively "for free", just like the native apps.

So, back to that question.

What if every query in Relay were statically known?

It was heresy, but we went through the exercise anyway, figuring out what each of the existing APIs would look like if we wiped the slate clean and started from scratch without dynamic, runtime query construction. It meant giving up some features, jettisoning some magic. In return, users would get predictability, performance, and an execution model that mere mortals could understand. There would be a cost though: instead of having Relay figure out a minimal set of data to refetch when parameters change, we'd require users to specify a static query ahead of time. And we'd have to rewrite everything, again, in order to implement this.

On the flip side, rewriting would mean the ability to scratch some long-felt itches, like:

  • Switching to a purely POJO-based representation for cache data.
  • Abstracting all low-level record access behind a thin facade API that would allow us to plug-in different kinds of underlying storage (including native data structures, mediated by a JavaScript-to-native bridge).
  • Aligning our terminology, API shape, and data-flow with the latest thinking on the iOS and Android side (for better interoperability and communication).
  • Dropping support for legacy GraphQL (pre-open source) syntax.
  • Splitting the code up into separate "compiler", generic "core/runtime" and "React" packages.
  • Implementing deterministic, performant garbage collection.
  • And many others.

Relay Modern

As a tiny hat-tip to risk management, we decided to build a toy prototype before fully committing to the rewrite. Joe spent about two weeks building a little React Native app that could render and paginate through a list of friends, and navigate to a simple "permalink" view using two or three static queries. "I think it's going to work", he said. So Joe and I started again, this time for real. It took us about 3 months to implement the new core, while in the meantime other Relay team members continued adding features to the existing codebase.

We knew perf was going to matter, so I built a microbenchmarking framework that uses the Wilcoxon Signed Rank test to give us an accurate picture of whether any given change made things better or worse. We maintained great test coverage and made sure everything was thoroughly Flow-typed. I built a "golden" test runner (this predated Jest's "snapshots" feature) to enable us to maintain a large body of tests easily even as we made frequent changes to our internal query representation. I made a sample React Native app so that we could run on-device benchmarks. Basically, I was scrambling as fast as I could to lay down most of the supporting work while Joe built core abstractions on top. It was amazing to work with such a motivated, talented collaborator. Striving to keep up, providing deep code review on his diffs (that he knocked out at an humbling clip and quality), and the countless stimulating discussions around whiteboards: I know that the experience made me a better developer. It was deeply rewarding.

The moment of truth came when we were finally able to run an on-device normalization benchmark — normalization is the term we use for processing a query response from the network, transforming it from a hierarchical form into a flattened, "normalized" representation for storage in the on-device cache. We knew Relay Modern should be faster because it was drastically simpler, we'd taken great care to avoid performance anti-patterns, and we were simply doing much less work at run-time. When the benchmarks came in we were a little stunned. We ran them again. We sanity-checked them on multiple devices. The results were consistent: normalization in Relay modern was about 10 times faster. It's true that normalization is just one of the things that Relay has to do, but it was clear that we were onto something. Relay Modern was going to be great for mobile devices and spotty networks. It would perform great on desktop environments too, but we'd aimed to solve a harder problem and it looked like that's exactly what we'd done. The bet had paid off.

The happy ending

All this happened in the first half of 2016. We actually thought we were on the brink of shipping it. I spoke about it publicly for the first time in August — ill-advisedly calling it "Relay 2" because we didn't have a better name for it yet — and Joe followed soon after. We had a few road bumps on the way which led us to delay shipping; I'm sorry that it took so long, but I'm really happy to say that the product is finally out the door.

Between finishing the new core and shipping 1.0.0 there has been a lot of thankless grunt work done by a bunch of people on the team. It was a group effort, but in particular:

  • Yuzhi lead an amazing effort to migrate thousands of Classic components and educate teams.
  • Jennifer built out prefetching (the ability to have native code on a mobile start fetching a query for a React Native app before the JavaScript VM has even finished booting).
  • Jan did a fantastic job of making sure we had a great migration strategy and compatibility API for moving existing apps over from Relay Classic to Relay Modern.
  • Lee helped us prepare and package everything for an open source release.
  • Our manager Alex was a roving support agent who tirelessly helped out with anything and everything.

But this post is in large part a tribute to Joe Savona. Neither of us is working directly on Relay any more, but the experience will forever loom as an indelible and transformative part of my Facebook story. As a colleague, erstwhile neighbor, and friend, working on Relay with Joe was a once-in-a-lifetime experience. I'm sure that Relay will continue to be an important building block for teams at Facebook, and I hope that it's useful to teams in the external community as well, but no matter what direction the framework ends up evolving towards in the future, I know that the design and architecture will retain elements of Joe's brilliant touch for a long time to come. Thank you, Joe, and keep on hacking.

Discuss: FacebookTwitter

A tale of three filter-branches

TL;DR: I used git-filter-branch to rewrite the history of the repo containing this website's files, processing 4,980 commits and transforming 3,702 wikitext files to Markdown along the way. I wrote three separate versions: the first would have taken as long as 42 days to complete, the second perhaps 3 to 4 days, and the third and final version completed in about an hour.

This is the tale of how I spent a few hours implementing and re-implementing a slow transformation, reducing the runtime from a projected 42 days to an actual runtime of an hour, experiencing a joyful reminder of the thrill of problem-solving that makes me love programming so much.

The background

For a bunch of self-indulgent reasons, the material for this website lives as a collection of Markdown documents in the content branch of a Git Repo. That "Markdown" bit wasn't always true though. As a result of an unfortunate choice, I'd actually accumulated thousands of wikitext documents over the years, and I'd been serving them with a Rails app and a very fast wikitext translator gem.

When I got rid of the Rails app and moved the content out of a database and into Git, I wanted to preserve exactly the same output when rendered to HTML, so I made a "wikiserve" microservice in Ruby that simply wrapped the wikitext gem, and I had my Node GraphQL server hit the microservice whenever it needed to turn wikitext into HTML.

This worked remarkably well, but after leaving the whole thing untouched for a year or more I found that I couldn't run the microservice locally on my new laptop. You see, frustrated by the bitrot and flakiness that comes with the added complexity of having multiple installed Rubies (one from Apple, one via Homebrew, a bunch installed by ruby-install and switched between using chruby), I decided to just say "screw it" and uninstall all but the stock Apple one that comes baked into the system. There is simply no reason for me to muck about chasing anything like the bleeding edge of the Ruby ecosystem: I want it to be a dependable, stable, cross-platform workhorse that I can count on to get shit done. I rely on Ruby to power Command-T inside Vim (which I use many dozens of times a day), to process outbound email for me from mutt, and to handle things like encrypted content in my Git repos. In short, I am seldom actively iterating on any of this stuff, but I want it to just work, just like I do when I plug a fridge into the wall and expect it to dependably keep things cold for me.

The catch is that Apple introduced a thing called "System Integrity Protection" (SIP) with El Capitan which effectively breaks installation of any Ruby gem that has an executable component. These gems try to install their binaries under /usr/bin/, but that is off limits under SIP, even for the root user.

Ugh. There are workarounds, of course — disabling SIP temporarily or permanently, explicitly setting GEM_HOME and GEM_PATH to install into another location, or specifying -n /usr/local/bin when running gem install — but I was just left with the feeling that I wanted to accelerate my plan to reduce my dependency on Ruby on macOS by getting my website content out of wikitext and into Markdown.

Modeling content using Git primitives

This is a terrible but fun idea. When I created Masochist I took all of my content out of a database — with its indices, associations, efficient searches and cheap ORDER BY clauses — and shoved it into a bunch of plain text files sitting in a Git repo. There are some really nice properties of this design:

  • You get content versioning "for free".
  • Minimal vendor lock-in effects, only minor ones in terms of format (seeing as Markdown is easily translated into other formats).
  • Easy, distributed backups: just push to another (free) Git host, such as GitHub, GitLab, Bitbucket etc or your own server.

Nevertheless, there are some complications too. The first one is that you have to decide how you are going to model the content metadata that previously lived in your neat little DB tables: things like "created at" and "deleted at" timestamps, or which tags apply to which pieces of content.

Let's just consider the timestamps angle for a second. Some of your alternatives include:

  • Ignore it: Lose all the history and only ever look at the current state of the content tree. This is the dumbest and simplest thing that could possibly work, but it is a bit sad to lose or obscure the richness that an immutable history record can provide.
  • Shove it in headers inside the content documents: This would be relatively easy to access and index, but updating it would be a pain: Do you set up editor hooks to update the timestamps when you save a file? Do you have a separate pre-commit tool that enforces timestamps are correctly updated?
  • Treat the Git history itself as the source of metadata: That is, infer the "created at" timestamp by detecting when an item first entered the repository and the "updated at" timestamp by looking at when it was last modified. Note that you have to do it this way because Git doesn't manage timestamps on files, so you can't rely on the filesystem timestamps to tell you anything useful; even if they were always correct and consistent on your laptop (a big "if") there's not a chance they'll be right when you start push-ing, pull-ing and clone-ing as part of your deploy process.

That last one sure sounds the most elegant, doesn't it? But it also obliges us to accept a reality about Git's object database: it's made to be blazingly fast for certain common operations (git status, git commit etc) but not others. For example, answering that question of "detecting when an item first entered the repository" could require you to traverse back from the current HEAD all the way back to the root commit of the repository, which could mean examining a thousands-long commit chain. And note, even if you know how Git works and seek to minimize the number of git processes that you need fork and the number of commits that you actually need to examine (eg. by limiting git log with a pathspec), Git's internals will still need to traverse that thousands-long chain in the worst case.

We're going to need a secondary index then. For that, we'll use Redis, shoving in there all the things needed to make operations that Git would be slow at fast. Updating this index is complicated (you know the saying about cache invalidation and this is no exception).

And that's just timestamps. We haven't even talked about tags indices yet. If we stick the tags in the content document headers, we'll need to index those too. And we'll need to make sure that we carefully handle all sorts of workflows, some of them quite edge-case-y: adding tags, removing tags, removing the last instance of a tag anywhere in the repo, and so on.

The code to update these indices is hideous and gnarly. Note: the project is called Masochist for good reason. At least I can console myself that nobody else has to understand it...

Back to filter-branch

Given this architecture, if we want to port all our wikitext files to Markdown, we can't just do it in place. That will blow away our precious timestamp information. Moving "foo.wikitext" to "foo.markdown" would effectively look like the "foo" content coming into existence today.

We have to go back and re-write the history. This is why the README on the "content" branch has a big disclaimer in it:


The content and history of this branch may be re-written at any time.

We're going to use git-filter-branch to go back over the entire history. For every commit that deals with a ".wikitext" file in any way, we're going to edit that file in place, renaming and porting it into an ".md" file, and we're going to recreate the commit with all the other metadata intact, timestamps and all. It will be as if the whole wikitext mistake never happened...

First thing is figuring out how to turn wikitext into Markdown as cheaply as possible. I wrote the wikitext parser in the wikitext gem; it is about 2,600 lines of hand-tweaked, performance-oriented C code. I don't want to have to turn all that into something that emits Markdown instead of HTML. It certainly could be done, but I am too impatient. I just want to bang something out as cheaply as possible while maintaining correctness.

The result is this abomination of a script. It's a Ruby script that forks out to node to read in files and make use of the unpack-content package (so that I wouldn't have to rewrite the error-prone logic for dealing with document metadata headers). It gets back JSON, parses it, preprocesses the markup with a nasty regex (and a copy-pasta'd one at that), translates the wikitext to HTML using the wikitext gem, turns the HTML into Markdown using pandoc, applies another layer of ugly postprocessing, and finally glues back on the metadata headers before spitting the result back out. Definitely an inglorious hack, but I felt delighted at the quality of the result I was able to get with such a quick little ditty of a script. Additionally, the way I could inject my custom pre-and-post processing so easily into the flow — among other things, fixing a number of long-broken links — meant that the quality of the results is arguably better than if I had done it the "right" way and taught the wikitext gem how to produce Markdown instead of HTML. I would say "worse is better", but that would be distorting the intent of the originator of the phrase.

My favorite function from the script is this 180-column monstrosity, which will surely break the layout on whatever device you happen to be reading this on:

# Prerequisite:
#     npm install -g unpack-content
def read(file)
  if file == '-'
    safe_file = '/dev/stdin'
    safe_file = Shellwords.shellescape(file)
  JSON[%x{NODE_PATH=/usr/local/lib/node_modules node -e "process.stdout.write(JSON.stringify(require('unpack-content').default(require('fs').readFileSync('#{safe_file}').toString())))"}]

This really exemplifies the spirit of the script. Hard-coded to assume a global install in a specific location, a single long line because I was too lazy to break it up, with only two small winks in the direction of software development professionalism: care taken to escape unsafe characters in the shell, and the flexibility to read from standard input rather than from a file on the disk.

With this in place and tested, we can move on to the real business: calling this thing from inside git-filter-branch.

Take 1: Render it like it's React, baby

My first try used the --tree-filter option, which basically allows you to mutate the filesystem any way you want, and which will drop you into a directory with a checked out work tree corresponding to each and every revision in the entire history, in order, one at a time.

The code looks more or less like this:

paths = Dir['content/**/*.wikitext']
paths.each_with_index do |path, idx|
  basename = File.basename(path)
  basename_without_extension = File.basename(basename, File.extname(basename))
  dirname = File.dirname(path)
  type = File.basename(dirname)
  safe_src = escape(path)
  safe_dest = escape("#{dirname}/#{basename_without_extension}.md")
  %x{wiki-to-markdown #{safe_src} > #{safe_dest}}
  FileUtils.rm path

Basically, it's going to find every file with a ".wikitext" extension and turn it into a transformed one with an ".md" extension. Couldn't be simpler. (When I first wrote it, I also had this copy each pre-transform file into an "archive" directory, because I was worried about losing it forever once I'd done a forced update of the "content" branch. I later realized that I could just create a separate "archive" branch and rely on its immutable history to keep the original state around for me forever.)

When I ran this with a bunch of logging in place to see what was happening — and also inspecting the .git-rewrite directory in which these temporary work trees were being manipulated, I realized something disturbing. If commit A added a wikitext file and I created a rewritten commit A' with a markdown file instead, when I came to the next commit — B, which also added a wikitext file — I'd find that the changes I'd made in A' weren't visible in B so I had to redo them. In other words, if the first commit transformed 1 file, then 100th commit ended up transforming 100 files, 99 of which it had already previously transformed (and not just once, either).

Of course, this should have been obvious to me. If you rewrite a chain of commits A → B → C → D to A' → B' → C' → D', you are operating at each step on A, then B, then C and so on, rather than A', B', C' etc. Evidently, it has to be this way. Git isn't magically going to merge changes from A → A' into B before handing the work tree over to you. On repos above a trivial size threshold, --tree-filter techniques are only going to scale if you can run your side-effects in basically constant time.

The verdict

Faced between the options of going to bed with the intention of seeing if the run had finished in the morning, or doing a back-of-the-napkin calculation right there to see whether the design was viable, I did the math.

The worst-case runtime of this approach is O(nm), where n = 4,980 (commits) and m = 3,702 (wikitext files). That's in the ballpark of 18 million transforms, and at about 5 transforms per second (the speed of my shitty conversion script), we're looking at 42 days to complete the filter run. Clearly unacceptable, and even more so when you consider that any bug discovered later in the run would require it to be restarted all the way from the beginning again.

This approach is pretty much like re-rendering the world for every frame from the top down. Super simple to reason about, but doesn't necessarily scale so well. Like in React, right? And just like in React, we need to find our way to short-circuit in shouldComponentUpdate and do less work. (Apologies to any non-JS-programming readers who might find this analogy non-illuminating. All should become clear in the next section.)

So, instead of going to bed, I decided to try another strategy.

Take 2: In which the author embraces stateful computation

Clearly we only want to run each transform once per unique wikitext input. That is, most files don't have edits, so they should get transformed exactly once, and the others should get transformed only as many times as they were edited.

To effect this magic, we need to maintain state about what things we've previously transformed. This is tricky though, because our script is going to be called separately, 4,980 times; once for each commit. Where to store the state in between invocations? Dump it to YAML or JSON and read it in and out? Sounds slow.

The approach I took is much more complicated than my initial attempt. The core idea is to use git ls-tree -r to get a list of files:

def ls(rev)
  tree = `git ls-tree -r -z #{rev}`.split("\u0000")
  tree.each_with_object({}) do |line, memo|
    mode, type, hash, name = line.match(/^(\d+) (\w+) ([a-f0-9]+)\t(.+)/).captures
    memo[name] =, type, hash, name)

And we do that for the current commit (the one being rewritten), the parent of the current commit, and the last commit that we actually rewrote. We know that if nothing changed about a file between the parent and the current revision, then we don't have to retransform it; we can just grab the blob from the last-rewritten commit.

The logic to do this kind of update is satisfyingly low-level and looks something like this (paraphrasing):

# Remove the ".wikitext" file from the index.
%x{git rm --cached -- #{escape(name)}}

# Shove the pre-existing blob into the index at the right place.
info = escape("#{mode},#{hash},#{markdown_file(name)}")
%x{git update-index --add --cacheinfo #{info}}

On those occasions when we do need to perform a transform (because a new file was added, or an existing one was edited) we do this, more or less:

# Get the contents of the previous source blob,
# transform it on the fly (no temporary files, just stdin/stdout and pipes),
# then add it to Git's object database.
hash = %x{
  git cat-file blob #{hash} |
  wiki-to-markdown - |
  git hash-object -w --stdin

Once we have obtained a hash in this way, we can perform an update like we did above using git update-index and --cacheinfo.

All of this is happening in the index, not in the worktree (there is no worktree now, as we're using --index-filter instead of --tree-filter) so this is much faster.

The verdict

Again we have the choice: go to bed and see whether this thing is done when I wake up, or do some estimation and make a judgment call.

This approach is much better, but still not fast enough. We now get through about 50 files per second because most of the files we look at don't need to be transformed. At that rate, we're now looking at just over 4 days in the worst case. Significantly better than our initial 42-day estimate, but still not good enough. Even if we note that the real total of files that we'll have to look at is closer to n(n + 1) / 2 (ie. the sum of consecutive integers between 1 and the number of commits, seeing as most commits add just one new file), we're still in the vicinity of 12 million operations, which adds up to nearly 3 days at our rate of 50 operations per second.

Back to the ol' drawing board. Bed can wait.

Take 3: In which we burn the haystack

One of my former colleagues said this:

Optimization problems are like finding a needle in a haystack: you can either try to find a way to sort through the hay faster, or you can burn the fucking haystack down.

I don't know where he originally got this from, but I like how visually evocative it is of one of the fundamental insights of performance optimization: rather than trying to work faster, it's better to not do the work in the first place. Avoided work is infinitely fast. You can't do better than that.

Here we're going to drastically reduce the number of operations we need to do from about 14 million down to around n (4,980, or the number of commits). What we want to do, in essence, is look only at files which actually got touched by a particular commit, and have our changes to all the other files carry forward automatically.

There is some hand-waving here as I'm essentially writing off the cost that git diff-tree incurs when we ask it to compare two tree objects and tell us what changed. It is a non-zero cost, but it ends up being insignificant in our use case because there are a number of tricks Git can and does employ to prune the search space (there you go, burning that haystack): specifically, Git can do things like shortcircuit traversal any time it encounters identical subtrees, and it can do that very fast indeed thanks to the use of SHA-1 digests to identify objects. If tree a/b has the same digest in two revisions then we know we need not recurse into it: no matter how large or deep it is, we know that the entire thing is the same in both revs just by looking at the hash. We could try implementing similar tricks in our user-land code, but the truth is that delegating to Git's highly optimized C implementation here is going to be as good as it gets and we can consider it effectively free for the purposes of evaluating our algorithm.

Again, we'll be using --index-filter for speed, and again we'll rely on access to the last-rewritten commit (seeing as we need to perform a bulk copy of it as cheaply as possible, then apply only our modifications on top of it, we need to know what to copy). The key trick here is making use of the map function that is available in the scope of the script text that git-filter-branch will eval each time it visits a commit. Unlike our first --tree-filter-based attempt which was as easily invoked as this:

git filter-branch \
  --tree-filter ~/bin/wiki-to-markdown-filter \

With the second and third --index-filter-based strategies our invocation is going to look like this:

git filter-branch --index-filter \
  'PREV=$(map $(git rev-parse -q --verify $GIT_COMMIT^)); ~/bin/wiki-to-markdown-filter $PREV' \

# Or, more naturally written without explicit $PREV:
git filter-branch --index-filter \
  '~/bin/wiki-to-markdown-filter $(map $(git rev-parse -q --verify $GIT_COMMIT^))' \

Our goal here is to calculate the value for this PREV variable: the hash of the last-rewritten commit. $GIT_COMMIT is the commit that's currently being rewritten and $GIT_COMMIT^ is its parent (we can use ^ here with impunity because the content branch has no multi-parent — ie. merge — commits). If there is no parent, git rev-parse -q --verify will just return an empty string, and our filter script — which will be performing the same check — won't bother with looking at that $PREV argument.

The map function is the secret sauce here. It takes that parent commit hash and, if the parent was rewritten, gives us the rewritten hash, which is exactly what we want. If it was not rewritten, it gives us the hash back as-is. Because commit hashes are invalidated by any change, changing the parent commit's hash by rewriting it will cause all of its descendent commits to by rewritten as well. This means that map will return the parent hash as-is only at the beginning of the filter run; as soon as we rewrite one commit, map will start returning rewritten parent commit hashes from there on.

From this point, it's easy. With $PREV, we can use git reset to set the index to look just like the last-rewritten commit looked like. We then use git diff-tree to get the changed files, and we apply our transformation to anything that needs it, and let anything else that changed but which does not need transformation go through unmodified. We are effectively doing something like a rebase with an additional, optional, automated transform along the way.

Just to be sure I wasn't going to be bitten by edge cases, I used git whatchanged --diff-filter=RD to find all commits that moved or deleted files. It turned out that there were only a few, noted in the comments, and that some simple logic would handle all cases. Perhaps the only remaining "gotcha" here is that git diff-tree doesn't report similarity statuses (eg. R91) like git whatchanged does, but rather shows them as paired A (add) and D (delete) operations. In the end, it all comes out in the wash.

The verdict

Version 3 gets us down to about 1 hour to rewrite the whole history. You can see the results on the current content branch (although if I ever rewrite the history again — something which is likely — you can expect that link to be broken; this is the evergreen link to the branch as it currently exists). I expect the next big filter-branch rewrite will be when I go back and scrape my oldest blog posts — which currently only exist as mirrored PHP/HTML files — and turn those into Markdown too, at which point I'll want to "prepend" them to my content branch, invalidating all the existing commits in the process.

Concluding remarks: in which our author revels in the joy of programming

So, we got a projected 42-day runtime down to an hour using nothing but simple techniques and reasoning. There's no genius in this, but damn is it rewarding. Not bad for a Friday night's work. This is a very strong reminder of why I enjoy programming in the first place: the ability to transport the intractable into the realm of the practical, and seemingly effortlessly. It doesn't always work out so easily — and I'm definitely spoilt here by the power and flexibility of the Git tools, which allowed me to solve the same problem in three significantly different ways — but when it does, it is magic.

And now, the icing on the cake: I get to rip out the "wikiserve" microservice. Deleting code is almost as much fun as making code run faster.

Discuss: Facebook - Twitter

Goodbye Wunderlist, hello Todoist

I've been pretty happy with Wunderlist over the past year or so, but visible signs of development have been few and far between since they were acquired by Microsoft in 2015. Today, they dropped a bombshell:

Wunderlist and Wunderlist Pro will eventually be retired ... There won't be any new features brought to Wunderlist and Wunderlist Pro.

Bummer. The Wunderlist team has been working on Microsoft To-Do instead. It maintains elements of the distinctively attractive Wunderlist UI, but it doesn't have a macOS app yet and it doesn't have sub-tasks yet, which is a pretty big backwards step. (Especially for me, seeing as improvements to the sub-task handling in Wunderlist were one of the main reasons I was anxiously watching them for signs of progress.)

I signed up for the To-Do preview, ran the Wunderlist importer, watched it churn for about 15 minutes before failing with an informationless error message, and decided it was time to evaluate my other options.

I opened up 2Do again, the app I'd used a couple years back before switching to Wunderlist. It is still the same, immensely powerful but quirky app that it was then. It still has polished macOS and iOS apps, and it still has that subtly un-Apple UI and UX that made me give up on it back then. It has a very complicated mental model involving groups, projects, lists and tasks. Even as I type that, I'm not sure I've got the names right, and I sure as hell couldn't explain the difference between them.

This was what drove me to Wunderlist in the first place — the not-quite-right UX/UI and the complex model — even though it meant going to something less powerful, and also forgoing the sync with Apple's built in iCloud/Reminders functionality (in its place, handing my data over to yet another private cloud). So, I knew I couldn't go back to 2Do.

I looked at the OmniFocus website and knew that that was out of the running too. I had played with it a couple years back but found it to be too complicated and opinionated. After the simplicity of Wunderlist, OmniFocus was always going to be too heavyweight.

That brings me to Todoist. It's free to try, so I ran the Wunderlist importer (which worked flawlessly) and took it for a spin. Unlike Wunderlist, which offered most of its functionality for free, or 2Do, which had a one-time license fee, Todoist has most of the interesting features behind an (admittedly low) subscription paywall. Things like labels, reminders and so on, which are pretty much essential for all but the most trifling use cases.

It has a lot of integrations with other services and is available on every platform you're likely to care about. The iOS app is a bit ugly for my tastes, but it probably serves well enough for capturing todos on the go and getting notifications; all real editing is likely to be done on the desktop or web. I don't know what the tech is, but the desktop app feels like a webview, so does lack some polish, but it is pretty serviceable. There's a pretty useful Chrome extension, by the looks of things.

Unlike 2Do, the model is simple and unconstrained. You have projects and tasks and that's it. You can arbitrarily nest projects inside other projects, and likewise you can nest tasks within tasks. With this simple set of primitives you can put together pretty much any organizational structure because you additionally have a couple of tools for looking at different views of your todos: "labels" are effectively tags that you can apply to any task to create arbitrary collections, and "filters" are basically saved searches that allow you to query for things based on keywords, project membership, creation date and so on. There is a lot of power here, but it's not in your face like it is in 2Do, and there's also a pleasing absence of arbitrary restrictions.

There are some delightful little features, like natural language metadata processing (you can type #foo to target a project, @bar to target a label, p1/p2 etc to assign a priority level, as well as natural language date specifications like "every friday start apr 28" and so on), well-integrated emoji support, styled text capability, some handy keyboard shortcuts and a fast, minimal, globally available quick-task-creation window.

There are also some annoying limitations and oddities, but I haven't found any yet that I'd call deal-breakers. Examples include the lack of Vim-like keybindings for movement, the way clicking and dragging on todo text will cause a displaced edit field to appear and end up selecting the wrong text, a small drag target area for moving todos, no ability to disclose collapsed subprojects while dragging, a clunky UI for operating on multiple selected items, some inconvenient overloading of standard text-editing motions (which instead of making the cursor jump a word or move to the start or end may indent/dedent an item or do nothing at all), and some unseemly pauses when you access parts of the app — such as the settings screen — and there is a lengthy delay while it (apparently) loads something like a webview. But like I said, none of these bad enough to be considered deal-breakers, so I am going to give it a shot for a while.

My only doubt now is how to actually organize my work. For a start, I've mostly kept things organized into projects and subprojects that correspond to the groupings that I had in Wunderlist. Todoist provides an "Inbox", and I've got everything else grouped under "Work", "Shopping", "Travel" and so on. I additionally created a "Next" project into which I am dragging stuff that I want to work on today (there is a "Today" folder provided by the app too, but the intention of the "Next" project is that it be a hand-curated list of stuff that I explicitly want to make a priority in the short term). Still not sure if this is the right™ thing to do, but we'll see how it goes.

I've also created a couple of labels, "someday" (for low-priority "wish list" items) and "easy" (simple/short filler tasks that I can grab when I need a break), and will wait to see if any more useful ones emerge organically in the coming days. I have some filters too, but I'm not sure how often I'll use them. Probably the most interesting ones are "Newish" an "Oldish", which show me tasks created less and more than 30 days ago, respectively. Finally, I've set a custom default view that shows "Next" tasks, high-priority (p1, p2, p3) tasks, and "Inbox" tasks, in that order. We'll see how that goes too.

Let's hope that this solution works out and can provide me with some utility for at least a few months/years. I think if I have to migrate away from another task list, I might just bite the bullet and start keeping my own todo lists in text files on a computer. I'm not an Emacs user, so I don't have a good org mode, but at some point I'm just going to have to give up and forgo the fancy web UIs, the slick iOS and macOS apps, and the seamless (proprietary) cloud sync, and just stick it all in a plain-text file that I edit using Vim.

Discuss: Twitter