File system organization guidelines

The other day I was lamenting on Twitter how tricky it can be to do something as simple as organize your Dropbox folder:

So speaking of organizing my Dropbox folder… I’m facing again the whole taxonomy vs folksonomy thing. Taxonomies are seductive hierarchical classification systems that lure you with their siren song of logic, order, and sense-making.

The Yahoo! directory (composed of hierarchical categories) was an early famous example of an attempt at ordering the world’s knowledge (well, really just the world’s Internet at that time) in a taxonomy that provided intuitive navigability, discoverability, rapid look-up.

Trouble was, it wasn’t super scalable, and lots of things refused to be neatly pigeon-holed in a single place with[in] the hierarchy. (Right now I can’t remember if the same content could appear in multiple places). File-systems (generally hierarchical) face the same problem.

This has been the case ever since you could put a folder or directory inside another, and it exists today in the age of Dropbox. You can, of course, have something "live" in more than one place at a time if you create a symbolic link (or a hard link, or an alias, or a shortcut).

I found that when I used to do that on HFS filesystems, the aliases weren’t super reliable and could easily rot over time [I expect APFS is no different]. Layer a cloud-based synchronizing abstraction like Dropbox and the probability of breakage gets much worse.

There have been lots of attempts at overcoming the pain points of working with large taxonomies by approaching the problems from a different angle. Google aims not to make the world’s knowledge conform to a hierarchy, but rather to make it searchable.

Microsoft made a "database" filesystem. BeOS kind of did too, with its rich metadata. Apple did as well with Spotlight (long before APFS). All of these attempt to present you with an abstract view of the volume that elides the structural location [of things] on disk.

Gmail is another great example. You can make "folders" and even nest them inside each other, but they are called labels, and you can apply multiple labels to a single message, causing it to show up in multiple places.

The idea is that the search should be so good — and it generally is — that you can find what you are looking for based on what it is and not where it is. This idea of tagging first came up on sites like Digg (I think?), AFAIK one of the early great examples of a "folksonomy".

These tag-based classification systems work great IMO. Unlike hierarchies, which you have to pretty much get right from the get-go (because they are hard to change, because major conceptual flaws are difficult to correct, because some stuff is intrinsically ambiguous) …

…tag-based schemes are easily modified and lend themselves well to incremental improvement. Major conceptual deadlocks are rare. You can invest in a tag-based system and it will become more valuable over time. The investment still makes sense even if the search technology…

…is weak, because the latent value of the tagging can be capitalized upon "for free" whenever the technology improves. They’re great. Alas, none of this helps much with the question of how to organize your Dropbox, because you still have to decide where to put your stuff.

Some illustrative examples: should I create folders for my kids and nest their stuff inside it? (eg. a folder for health insurance, a folder for doctor visits) or should I create top level folders for insurance and per-person folders within those?

What if I have a major medical episode? Does that warrant its own top-level folder, and what should go inside it?

How much of this should I redundantly encode in the name (eg. "John annual medical checkup 2018-09-10.pdf") and how much in the path (eg. "Medical/John/Annual checkup 2018-09-10.pdf")? Which is going to be more searchable?

Also, do you notice how I don’t really trust the OS filesystem to preserve the dates or make them usefully exposed when searching or browsing? When should the date go in the file name (if ever), and when should you encode it in folder structure (eg. "2018/10/…")?

Unlike many other areas in my life where I have relentlessly optimized and refined my practices and processes to a point where I am relatively happy with them (no in the sense that they are perfect, but in the sense that they have solid rationales and are reasonably effective)…

…when it comes to file organization, all I have is a massive corpus of artifacts manifesting a many-year style drift where you can see how my patterns and practices mutated over time, never really settling. At this point, the only thing I haven’t fully tried is throwing it…

…all in one massive directory and periodically trashing or archiving the oldest files into subdirectories, relying on search and times winged chariot to sort out all the rest. I was kind of hoping that I’d have an epiphany while writing this thread, but sadly I did not.

I decided that I’m not going to let a lack of an epiphany stop me from trying to improve my situation in a definitive, lasting way (you see, when you’re between jobs like I am you need to manufacture challenges to keep yourself engaged). Let me warn you though: if you are the sort of person who allows files to pile up on your desktop by the hundred until the icons resemble so many layers of carpet, then this post is probably not for you. On the other hand, if you’re a geek who’s deeply into structure and order then there might be something in here that you can relate to.

I did some searching for prior art on the subject and found a couple of articles (here and here) that confirmed my recollection that Dropbox didn’t cope well with symbolic links or aliases. I went through my Dropbox folder to see if I had any symlinks (I didn’t) or aliases (I did, and they were broken), confirming that this is still a problem in September 2018.

So this leads me to my first recommendation:

Recommendation: Never use symlinks or aliases inside Dropbox folders.

You could make the argument for using symlinks from your Dropbox folder to items outside it, but I’d much rather just have a simple rule that is easily expressed and simple to follow.

If you’ve made the mistake of adding symlinks or aliases in the past, you can find them all with:

find ~/Dropbox -type l

and:

mdfind -onlyin ~/Dropbox kMDItemKind=Alias

respectively.

A corollary to this first recommendation is this:

Corollary: Files and folders must exist in one place and one place only in your Dropbox folder.

As such, this means that your unavoidable choice in the whole taxonomy vs folksonomy question is taxonomy. Hierarchy. A place for every thing, and every thing in its place. A bunch of rules and guidelines for deciding which place applies will follow.

An exception to the "must exist in one place" rule is that you may wish to place a copy of something in another place. Clearly, doing this with anything that you might edit later is a recipe for madness due to divergence. This leads to a second recommendation:

Recommendation: Only make duplicates of files that are immutable in nature, to avoid the headaches caused by divergence.

Examples of files that you could make redundant copies of without harm: long-lived documents like your birth certificate; documents which you will never edit (eg. scans of inbound email); and so on… Of course, the question remains, why would you want to make redundant copies of any of these things? After all, they’ll eat into your quota, and if they proliferate, they may lead to a counterproductive sense of clutter and make things harder to find.

Nevertheless, I can see that you might want to create copies under some circumstances: for instance, you might want to have a folder (or a series of date-based folders) that constitute a kind of "append-only log" where you drag copies of all the things you’ve printed, or processed in a given period. Or maybe you have a document that you tend to need to use over and over in combination with other documents, so you allow multiple copies of it to exist where needed (a good example of this one for me would be the proof of my government-issued ID number which I end up needing to include in almost every official application or bureaucratic process that I go through).

Antipatterns

Sadly, most of the top search results I landed on in my search for best practices contained some really terrible advice, including tips such as:

  • Put the last modified date in file name.
  • Put the name of the last user to edit the file in file name.
  • Number folders and files to enforce sort order.

But perhaps there is some good that we can spin out from this web of bad advice. Let’s talk about dates and sorting.

Dates

We certainly don’t want to put the last modified date in the file name. Fundamentally, we want to preserve the ability to sort the files within a folder by multiple criteria: for example, by the information that we encode in the name, by the modification date recorded on the filesystem, or by the creation date (again recorded by the filesystem).

When should you use dates?

Sometimes the date is so fundamental to the identity and nature of the document that omitting it from the name itself wouldn’t make any sense. For example, there is no way you’d save a document called "Tax return.pdf" and leave out the year to which it corresponds.

Likewise, you might have multiple documents that would all have the same name if it weren’t for a distinguishing date. Examples include things like utility bills or other documents related to periodic or recurring events. If these ever live in the same folder, they are going to need dates.

Recommendation: Use dates only when they are intrinsic to the nature of the document, or necessary to disambiguate multiple items which would otherwise have the same name.

When should dates go in the folder name and when in the file name itself?

We want to keep noise and clutter out of our file names, so if we have a group of related documents which all share a common date, it is better to put them in a folder containing the date rather than put the date in the names of each of the individual files. To return to our taxation example again, it makes sense to have a top-level folder "Tax" containing folders for each year (eg. "2017", "2018" etc), and inside each folder we can put the documents for that year. It’s a judgment call as to whether you encode the date redundantly in both the file name and the folder name in a situation like this; that is, you could either go with "Tax/2018/Tax return.pdf" or "Tax/2018/2018 Tax return.pdf". The benefit of the latter is that when you perform a search for "2018 tax", all the critical information can be seen in the file name alone; conversely, looking at a search result like "Tax return.pdf", you can only fully make sense of it if you can see the enclosing path. On the other hand, a more obscure document like some esoteric tax form (say, a 1099-G), probably isn’t something you’ll ever go looking for my name, so it would be fine for it to be called "Tax/2018/1099-G.pdf" instead of "Tax/2018/2018 1099-G.pdf". More on the subject of whether dates should be placed at the beginning or at the end below.

Recommendation: When a date applies to a group of items, put the date in the containing folder name rather than on each individual item. Only encode the date redundantly in the file name if you expect to be searching for the document by date and by name.

How should the date be formatted?

Recommendation: Base all dates on the YYYY-MM-DD format.

This avoids ambiguity — eg. does "4/5" refer to April the 5th (US style) or the 4th of May (everywhere else) — and makes a lexicographical sort equivalent to a chronological sort.

Recommendation: Use the briefest date format that provides sufficient specificity for a given need (eg. prefer YYYY over YYYY-MM, and YYYY-MM over YYYY-MM-DD).

Documents that turn up infrequently, like tax returns, can easily make do with YYYY date information. But if you scan all incoming post, you might want to batch it in folders for each month (eg. "YYYY-MM" or "YYYY/MM") or even each day (eg. "YYYY-MM-DD" or "YYYY/MM/DD") if the volume is large enough.

The decision on whether to embed the date in a single folder name with hyphens, or whether to build it up with nested folders, likely depends on the total number of documents. For example, if you have a lot of items concentrated into a small time interval, you can probably use hyphens without the total number of items in a single file becoming unmanageably large. On the other hand, if you have a sustained, high concentration of items spread over a long period of time, you’re probably going to be forced to use nested folders.

Recommendation: Break dates up into nested folders (eg. "YYYY/MM/DD") only when the total volume of contained items makes the alternative (eg. "YYYY-MM-DD") impractical.

Where in the file name should the data appear?

The obvious options here are to put the date at the beginning or the end of the file name.

Recommendation: If multiple items of the same type, differentiated only by their date, are grouped together in a folder, put the date at the beginning. Otherwise put the date at the end, to preserve sortability by name as well as by date.

One exception to this guideline would be items where the year is usually a prefix as a matter of convention; here again we recur to the example of the tax return, which we would unconditionally prefer to store as "2018 Tax return.pdf" rather than "Tax return 2018.pdf", because the former is how people would generally talk or write about the document.

When to use capital letters

Recommendation: File and folder names should start with capital letters, but subsequent words need not start with capital letters unless required by grammar or convention.

We employ the initial capital for consistency (the "Dropbox" folder itself starts with a capital, as do all the default folder names visible in the macOS Finder), but we don’t go so far as demanding "title case" (capitalization of all words in the file or folder names) because it is more cumbersome to type and more visually heavy (for the same reason that writing in all-caps is considered tantamount to "shouting" on the internet).

This last guideline does put us into conflict with Apple’s conventions (where folders may have names like "Application Support", for example), but as the Apple marketing stuff used to say, think different.

When to use spaces

Recommendation: Use spaces freely in file and folder names.

It is true that many scripts over the years have had bugs in them because they assumed that they would only ever run from paths without spaces, or operate on files without spaces in their paths, so for a long time I use to use hyphens instead of spaces in my names out of sheer paranoia; it was a way to ensure that buggy scripts would work, email attachments could be opened, autocompletion and command-line editing would work without any surprises in the terminal.

But the world has moved on, and both Dropbox and macOS handle spaces with aplomb. Autocompletion in the terminal works just fine with items with spaces in their names, and we are talking about organizing our Dropbox folder for goodness sake, not writing some secure build system for updating the nuclear launch codes.

Is Unicode safe?

It seems that it is. In the past I have seen some weirdness with Dropbox mangling file names that had non-ASCII characters in their names, but that was in another country, and besides, the wench is dead, by which I mean, I haven’t run into problems with Unicode for a long time now. Unicode displays fine in the Finder, works in the terminal via typed input or autocomplete, and doesn’t see to cause any issues with Dropbox.

Recommendation: Use Unicode in file names as required.

What about language?

Recommendation: Use whatever language you want, but be consistent.

For me, living in Spain, that means a lot of the subject matter of daily life is in Spanish, and that means my folders and file names are in Spanish. My "Tax" folder, for example, is actually "Impuestos", because that’s what they’re generally referred to around here.

Separators

Once I overcame my fear of using Unicode in file names I began to use em dashes to separate the main components of file names (eg. "Company — Topic.pdf"). But since then I’ve seen the light:

Recommendation: Use commas as separators within file names. They occupy less space than an em dash, and they require only a trailing space rather than a space before and after.

So, "Company — Topic.pdf" becomes "Company, topic.pdf". In fact in many cases, you might find that the file name reads just fine without the separator (eg. "Company topic.pdf"):

Recommendation: If you can, drop the separator: it’s cleaner, and you might not really need it after all.

Subordination

Why did we talk about "Company, topic" above and not "Topic, company"? In what sense is the thing on the right subordinate to the thing on the left? The truth is, it could have been either way, depending on the context. In general, we want the thing that is most discriminating to go on the left, because that means we can leverage a lexicographical sort to most quickly find the item we’re looking for.

Recommendation: Place the information with the most discriminating power at the beginning. This applies to both files and paths.

Let’s make this a bit more concrete with an example. Imagine we have items with color and size. There are 100 colors (red, blue, beige, salmon pink etc) but only 2 sizes (small, large). Assuming an even distribution of colors and sizes, if we have a thousand items, there will be about 10 of each color and 500 of each size. We put the color on the left and the size on the right (eg. "Red small buttons.jpg"). By putting the information that most decisively narrows the search space to the left, we can reduce our set of 1,000 possible items down to 10 candidates in a single step; if we’d put the least-discriminating datum on the left then we’d only have narrowed our search space down from 1,000 to 500.

Of course, even this is a guideline. You may have a frequent task that you have to do that requires you to group items by size and go from there. In that case, maybe grouping into subfolders named by size is the way to go, or maybe you can consider changing the order. This is just a guideline, so use your best judgment.

People vs subjects

This is a specific case of subordination. Should people be subordinate to topics (ie. nested inside them) or the other way around? In other words, should I have a top-level folder for each family member and nest stuff inside that, or should I instead have top-level folders for "School", "Taxes", "Insurance", "Health" etc and nest inside those?

According to our discussion of discriminating power above, we should put people towards the right of our file names. If we have 4 people in our family, their names won’t have a lot of discriminating power at all, compared to, say, the universe of appointment types, bureaucratic procedures, and other subject matters.

Recommendation: Don’t organize by people; organize by topic.

Note that this doesn’t mean that we can’t have top-level folders for people, just that those top-level folders shouldn’t be the root of some large hierarchy. For example, we might have a top-level "School" folder with subfolders for each child (and top-level folders for other topics), but we can also have top-level folders for each child where we put things that are universal in nature and not topic-specific (things like birth certificates, for instance).

And while we’re talking about top-level folders:

Recommendation: Keep the top-level topic list short.

In general, we don’t want to "fan out" too aggressively if we can help it. For mere mortals, about 20 separate topics is about as much as we would want to deal with at any given level.

Preserving original file names

Often, you’ll get a document from another source and it will, naturally, be named according to conventions of its own. You might get a bill titled "050800947920180823210421132394.pdf", or an informational leaflet titled "2018 OE Medical Plan Chart_FINAL1.pdf".

Recommendation: Preserve original file names by placing documents from other sources inside a folder that you name according to your own conventions.

In this way, if you ever need to find a document by its original name — for example, because it is referenced by name in an email — you can still look it up, but you can also find it by using words of your own that describe it, or by looking for it in the location within the hierarchy that these conventions indicate it should occupy.

Keep it simple

When I set out to put these guidelines in writing I wanted it to be simple. Alas, it turns out that there is a lot of nuance (bummer, that, huh). Nevertheless, as a general principle I’m going to echo my point above about keeping the number of top-level folders (and sub-folders within each other folder) manageably small and make the following recommendation:

Recommendation: Keep things as simple as possible: don’t go too far with structure.

We don’t want too much nesting. It is probably better to have an "Invoices/2018" folder than a complex nested mess of folders organizing invoices by type, by provider, and so on. In general, don’t be seduced by the desire to hierarchize everything: introduce structure only when and where it is required to make the quantity of information manageable.

Be consistent

There are a lot of rules here and a lot of subtlety. It can be easy to forget what rules should apply in a given situation, and sometimes a judgment call may be required. The point is not to apply the guidelines with flawless perfection, but to make your best effort. Try to be consistent, and while it won’t be perfect, it will be good enough.

Recommendation: Be as consistent as possible.

Summary

So, with all that out the way, here’s what we have:

  • Never use symlinks or aliases inside Dropbox folders.
  • Items should exist in exactly one place.
  • If you must duplicate a file redundantly, make sure it is an immutable one.
  • Use dates only when they are intrinsic to the nature of the document, or necessary to disambiguate multiple items which would otherwise have the same name.
  • When a date applies to a group of items, put the date in the containing folder.
  • Only encode the date redundantly in the file name if you expect to be searching for the document by date and by name.
  • Base all dates on the YYYY-MM-DD format.
  • Use the briefest date format that provides sufficient specificity for a given need (eg. prefer YYYY over YYYY-MM, and YYYY-MM over YYYY-MM-DD).
  • Break dates up into nested folders (eg. "YYYY/MM/DD") only when the total volume of contained items makes the alternative (eg. "YYYY-MM-DD") impractical.
  • If multiple items of the same type, differentiated only by their date, are grouped together in a folder, put the date at the beginning. Otherwise put the date at the end, to preserve sortability by name as well as by date.
  • File and folder names should start with capital letters, but subsequent words need not start with capital letters unless required by grammar or convention.
  • Use spaces freely in file and folder names.
  • Using Unicode in file names as required.
  • Use whatever language you want, but be consistent.
  • If you must use a separator, use a comma.
  • Don’t use a separator unless you must.
  • Place the information with the most discriminating power at the beginning of files and paths.
  • Don’t organize by people; organize by topic.
  • Keep the top-level topic list short.
  • Preserve original file names by placing documents from other sources inside folders.
  • Keep things as simple as possible: don’t go too far with structure.
  • Be as consistent as possible.

And this will be my last word on the subject. Until, that is, I change my mind or forget.

Discuss: Twitter

Passphrase entropy

A while back I was interested in coming up with a passphrase that would result in the same keypresses when typed on Colemak and Qwerty keyboard layouts. I concluded that it would be too hard to get a reasonable amount of entropy because there are only 13 keys that hold the same position in both layouts.

Tonight on a whim I went back to perform a more precise calculation using this quick and dirty script:

#!/usr/bin/env node

const {readFileSync} = require('fs');

const qwerty = `
  qwertyuiop
  asdfghjkl;
  zxcvbnm,./
`;

const colemak = `
  qwfpgjluy;
  arstdhneio
  zxcvbkm,./
`;

const normalize = layout => layout.replace(/\s+/g, '').split('');
const getCommon = (a, b) => a.filter((char, index) => char === b[index]);
const escape = char => char === '.' ? '\\.' : char;

const chars = getCommon(normalize(qwerty), normalize(colemak));
const regexp = new RegExp('^[' + chars.map(escape).join('') + ']+$');

const words = readFileSync('/usr/share/dict/words')
  .toString()
  .split(/\s+/)
  .map(s => s.toLowerCase());

const filtered = words.filter(word => word.match(regexp));

console.log(
  `Of ${words.length} words,\n` +
  `${filtered.length} words contain only ${chars.length} common keys\n` +
  `(${chars.join(', ')}).\n`
);

filtered.forEach(word => {
  console.log(`  ${word}`);
});

console.log('\nEntropy (bits) for an n-word passphrase:\n');
const bitsPerWord = Math.floor(Math.log2(filtered.length));
for (let i = 1; i < 10; i++) {
  console.log(`${i} word${i > 1 ? 's': ''}: ${bitsPerWord * i} bits`);
}

console.log(
  '\nFor comparison, dictionary words each have about 14 bits of entropy\n' +
  '(source: https://security.stackexchange.com/a/62911/151988).'
);

What’s this script doing? It’s scanning through the 235,887 words in /usr/share/dict/words and collecting the pool of just 132 words that contain only characters common to both Colemak and Qwerty, then printing out some entropy info at the end for passphrases of different word lengths.

For those too lazy to run it, here’s the output:

Of 235887 words,
132 words contain only 13 common keys
(q, w, a, h, z, x, c, v, b, m, ,, ., /).

  a
  a
  aa
  aam
  ab
  aba
  abac
  abaca
  abama
  abb
  abba
  abwab
  acca
  ach
  ah
  ah
  aha
  am
  ama
  amah
  amba
  amma
  amma
  ava
  aw
  awa
  ax
  azha
  b
  b
  ba
  baa
  bab
  baba
  bac
  bacaba
  bacach
  bacca
  bach
  bah
  baham
  bahama
  bam
  baw
  bhava
  c
  c
  ca
  caam
  caama
  cab
  caba
  caca
  cacam
  cachaza
  cam
  camaca
  camb
  cava
  caw
  caza
  cha
  chaa
  chab
  chac
  chacma
  cham
  cham
  chama
  chamma
  chaw
  cwm
  h
  h
  ha
  haab
  habab
  hah
  ham
  hammam
  hamza
  haw
  hawm
  hwa
  m
  m
  ma
  ma
  maam
  mab
  maba
  mac
  mac
  macaca
  macaw
  mah
  maha
  mam
  mamba
  mamma
  maw
  max
  maza
  mazama
  mwa
  q
  q
  v
  v
  w
  w
  wa
  wa
  waac
  wab
  wac
  wah
  waw
  wawa
  wawah
  wax
  waxhaw
  wha
  wham
  x
  x
  z
  z
  za
  zac
  zach
  zax

Entropy (bits) for an n-word passphrase:

1 word: 7 bits
2 words: 14 bits
3 words: 21 bits
4 words: 28 bits
5 words: 35 bits
6 words: 42 bits
7 words: 49 bits
8 words: 56 bits
9 words: 63 bits

For comparison, dictionary words each have about 14 bits of entropy
(source: https://security.stackexchange.com/a/62911/151988).

What can we conclude from all this?

  • If we drew words directly from /usr/share/dict/words without regard to layout, we could get an excellent 17 bits of entropy per word (for comparison, the word list used by 1Password is apparently only large enough to deliver about 14 bits per word). Unfortunately, many of the words in this list aren’t practical to use (consider an early example like "abdominohysterectomy", which nobody is ever going to accept), so we’re not really claiming 17 bits of entropy in the real world.
  • Our Colemak/Qwerty hybrid words have about half the entropy per word (a measly 7 bits), meaning that you need your passphrase to be twice as long to match the entropy you’d get with standard dictionary words: for 56 bits of entropy, for example, you’d need an 8-word passphrase instead of a 4-word one. It’s not going to be particularly memorable either, as it will end up being something like "mamba waxhaw zax macaca habab cachaza wab azha". I’ll grant that that’s pretty fun to say out loud, but that’s not a redeeming quality for a passphrase.
  • I guess we could inject more entropy by adding numbers and symbols, but the base set of words to draw from is still sucky.

I’m going to stick to my boring existing passphrase for now. For reference, and in case I forget it, it is "rosemary horde shotgun portrait".

Discuss: Twitter

Searching for the Holy Grail

I recently came to the conclusion that, when it comes to programming languages, there is no Holy Grail, just a bunch of cups, and perhaps we should just get on with drinking.

I’ve worked with a lot of languages over the years. I started as a 10-year-old writing in Commodore BASIC on the Commodore 64, progressed to AmigaBASIC, AMOS and finally 68000 assembly on the Amiga 500, and briefly dipped into x86 assembly during my short stint as the owner of a 486SX33. At college I did programming courses that used Ada.

Sadly, I don’t have any records of any of the programs that I wrote back in those days, although I fondly recall some. Whatever remains of them, fragilely stored on flimsy floppy disks or inscribed on dot-matrix printouts, is probably decomposing under countless layers of land-fill by now.

I got my first Mac in the mid-90s. My earliest Mac programs were in REALbasic, but with the birth of the web I wrote "discussion boards" with flat-file "databases" in Perl, "weblogs" in PHP, and I learned HTML, CSS and JS, as you do. Almost all of that early stuff is lost too.

It’s not until the 2000s that my personal fossil record has some surviving evidence. I used the CVS, Subversion, and SVK version control systems, and some of the artifacts that I created in those systems survived into the Git era, where I expect they’ll enjoy a much longer digital half-life.

I wrote large projects in C and Objective-C on the desktop, and for the web I learned Ruby and went deep into JS. These were my professional languages that paid the bills for me, but along the way I explored and wrote side-projects in Haskell, Go and Lua. At work I sometimes have to dip into Hack. I’ve written my fair share of Vimscript and Bash (etc) stuff. Sometimes, I write snippets of Python. My latest experiments have been with Reason.

Looking back over that history, I envy the simplicity of the constrained choices that I used to make. I wrote BASIC because it came with the computer. I learned assembly because it was the only tool for the job (if by "the job" you meant writing games with fast graphics). Some of the choices were totally serendipitous and not really choices at all; I learned AMOS because someone gave me a copy of it. I learned Perl because it was basically "the way" that CGI was done back in the day. I learned C and Objective-C because those were the languages with which you could access Apple’s APIs and build desktop apps.

Starting with Rails, the pace of innovation and renewal seemed to really take off. Languages and frameworks were spawned en masse, duking it out with one another to achieve domination and establish a "new paradigm". The choices became less clear, the list of contenders impractically long. For all the languages I did learn, there are many more that I didn’t have time to even dabble in (not just the older languages — Erlang, Java, C++ — but countless others that they have spawned or influenced: Dart, Rust, Elm, PureScript, CoffeeScript, Elixir, Scala, Kotlin, D and many others).

I think there are some pretty clear conclusions to draw here: either we as language designers aren’t very good at what we do, or as human beings we’re afflicted with a crippling case of not-invented-here syndrome and an evolved inability to be happy and productive with the tools that we’re given. Either way, the costs are immense: we spend a disproportionate amount of time and energy reinventing our tooling relative to how much we invest in actually building things with those tools. It’s easy to mistake motion for progress, I guess.

Why am I writing about this now? I think it’s because I’ve felt more and more restless about JavaScript. JavaScript is an increasingly multi-paradigm language that wants to be all things to all people. It has "Feature Envy". Its aggressively forward-looking development model (the whole TC39 process with its keen involvement by significant industry players with the power to effect meaningful change in the deployed ecosystem, and the availability and widespread use of tools like Babel that enable rapid and aggressive experimentation) together with the need to maintain backwards compatibility forever (you can’t "break the web") mean that it accrues an ever-growing, never-shrinking set of functionality and syntax. It has — or will have — almost every feature available in any other language, up to and including a Kitchen Sink capability, but will probably stop short of the one thing that I really wish it had (a real, not-bolted-on type system). JavaScript is simultaneously the best language to teach to beginners (there is no cheaper way to build something with a UI that will run anywhere) and the worst (welcome to an ecosystem with a rapidly churning set of tooling that’s metastisizing fast enough that it may well collapse into a black hole one day).

Yet, I’ve come to the conclusion that all this searching for something better is a fool’s errand. You don’t make something better by combining the best bits of other languages. The most you can do is to take an opinionated stance about something that you consider to be really important, focus on getting that one thing really right, and then get on with the business of building useful things with it. Note that it doesn’t suffice to be just opinionated; you also have to be focused. Here are some examples of opinionated, focused stances:

  • Haskell:
    • Core thesis: Fully unlock the power of abstraction with a purely functional, lazily evaluated core.
    • Advantage: You get an expressive, sophisticated type system that allows you to succinctly materialize ideas with a high degree of machine-assisted verification.
    • Trade-off: Some things, like modifying deeply nested immutable data structures, are hard.
  • Go:
    • Core thesis: Simplicity is paramount.
    • Advantage: Out of simple primitives you can build robust, highly-performant concurrent solutions.
    • Trade-off: Code is "boring", "verbose", "pedestrian".
  • C:
    • Core thesis: Abstraction is overrated.
    • Advantage: Speak to the Von Neumann architecture in its native tongue (almost) to build fast things, without needing to learn processor-specific assembly language.
    • Trade-off: When you build stuff out of gun powder, wire, and spark-plugs, you just may singe off your eyebrows.

If you try to make these languages better by blending together their best elements you wind up with behemoths like JavaScript and C++. These are not bad languages; they’ve been extremely successful, and many great things get built using them. And yet, people can’t resist somehow trying to "fix" them, either by augmentation or outright replacement. Something is rotten in the state of programming languages.

Inevitably with the good stuff comes some baggage. Sometimes the elements don’t combine well. You can’t make a better Ruby, for example, by adding Haskell’s strong static typing to it, because what you’d get wouldn’t be the loose, fluid, pleasant, expressiveness of Ruby: you’d just get Haskell with an awkward syntax. I’ve previously remarked that programming in Ruby is like driving a rubber car without a seatbelt; that sounds like fun in a weird kind of way, and it is. By the time you’ve added static typing the language is no longer a car, nor is it made of rubber, and I guess it doesn’t really make sense to ask whether it has a seatbelt or not any more (whatever "it" is). Large, growing languages inevitably tend towards resembling Frankenstein’s monster over time.

All of this may strike you as being perilously close to just plain old obvious common-sense, and you may wonder, why I am bothering to write it? Why would I ever have thought that there was a Holy Grail in the first place?

It crept up on me insidiously at first. About 10 years ago I first got exposed to the idea that you should learn a new programming language every year, not necessarily to add to your practical tool-kit, but to expand your mind. Haskell was a popular choice for this purpose back then. The notion was that you should seek out "novel" ideas — note this means novel to you and not necessarily something universally recent — and train your mind by grappling with them. At the very least you kept your mental axe sharp by exercising it, and at best you might stumble across something that subtly (or even dramatically) changed your world view and in some nebulous, hard-to-articulate way, "made you a better developer".

Fast forward ten years into this practice, I haven’t learned ten new languages, but I have made significant incursions into about five. Even though I never lost sight of the reason why I was going through this whole exercise, there was a primitive, subconscious part of my brain that was wondering if I would end up finding "The One": the language that would somehow feel so "right" and enable me to be so effective that I’d be able to stop searching and settle down for a decade, or two, or three. I’d forgotten that I wasn’t engaged in a search at all. This is what reading too many blog posts with titles like "Why we’re rewriting everything from X to Y" will do to you, given enough time. You start to think like a believer.

This year I was looking around for a new Language of the Year to dive into, but was struggling to find something that met my novelty criterion. Reason/OCaml felt too similar to Haskell. Rust definitely had some novel ideas, but it failed my other criterion: practical applicability (in the sense that I needed to find a well-suited, useful, motivating side-project in which to try it out).

I finally decided to go with Reason despite the relative lack of novelty, and this meant honestly confronting myself with the fact that at least part of me — the irrational part — had been holding out, hoping to find a Holy Grail candidate. I had to let go of that. I went in knowing Reason would have some real strengths (eg. fast compile times, great developer experience, solid performance etc) and some down-sides (eg. spartan documentation, few examples, and so on). There’d also be some mixed-bag stuff, like the small community, which you can consider a blessing or a curse depending on how you look at it.

All languages are going to suck in some way. But the bright side is that we have so many choices available to us now that we can choose languages that suck in the ways that we can tolerate, and conversely, excel in the areas that really matter to us. For me, Reason is interesting because two of the things that it gets really right (having a solid type system, and language-level support for immutable records) happen to represent a couple of the things whose absence really annoys me about JavaScript. I can compromise a lot on the rest — take the good, and the bad — because ultimately it doesn’t matter that much to me. I’m looking for a cup to drink out of, not a fountain of eternal youth.

In closing, I want to make sure that this post doesn’t end up being yet another "Why I’m rewriting everything from X to Y". I’ll continue to use the other "cups" I have in my cupboard. The cup I choose at any particular moment will depend on the circumstances. I am sure that every now and again I’ll want to try out a new vessel or two. But if you ever see me with a distant, glazed-over look in my eyes, and I look like I’m about to ride off towards the horizon in search of some mythical language that doesn’t exist — or worse still, I start showing signs of wanting to design my own — please try to snap me out of it. A slap in the face and, if that doesn’t work, a bucket of cold water should do the trick.

Discuss: Twitter