Simplifying my Ansible set-up

Ansible is the worst automation platform out there, except for all others.

— Sir Winston Churchill

You could say I have a "love-hate" relationship with Ansible. After using Puppet and Chef in work environments, I found them to be utter overkill for any personal projects. In contrast, Ansible promised to be a "radically simple IT automation platform" (filthy lies!), and compared to the others, it is. For my use case (maintaining a couple of EC2 instances), it works pretty well. There is no "Chef server" or "Puppet master" orchestrating things at the heart of the system: there is just a Git repo with some configuration files in it (just on my local laptop) and an Ansible executable that I can run directly and which will ssh up into EC2 to do the work.
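In concrete terms, the whole workflow is just this (file names hypothetical):

```bash
# The entire "infrastructure": a local Git repo full of config, plus the
# ansible-playbook executable, which connects to the EC2 hosts over SSH.
ansible-playbook -i hosts.ini playbook.yml
```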

But it is still pretty complicated. The project itself is huge, and its dependency footprint is big too. The whole thing is in Python, limiting my ability to debug or modify it when things go wrong (seeing as I am not a "Pythonista"). And it is pretty slow: every little command you run requires a new SSH connection to the server (even if you reduce the overhead by using SSH’s ControlMaster functionality, it’s still slow). In the end I’ve had to implement cumbersome workarounds to address the performance issues, like telling Ansible to upload a Bash script to the server that does something to 40 different Git repos all at once, instead of telling Ansible itself to do the work. It kind of feels like having a fancy mesh WiFi network in your home, but then running ethernet cables all over the floor connecting all the rooms together.
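To give a flavour of that workaround (the repo layout and the actual Git operation here are hypothetical), the trick is a single uploaded script and a single SSH round-trip in place of ~40 per-repo tasks:

```bash
#!/bin/sh
# Hypothetical sketch of the kind of script Ansible uploads (via the
# ansible.builtin.script module) and runs once, instead of issuing a
# separate SSH-backed task for each of ~40 repos.
set -e
for repo in /data/git/*.git; do
  # One server-side pass over every bare repo:
  git --git-dir="$repo" fetch --all --prune
done
```

Ansible’s ansible.builtin.script module copies a local script to the remote host and runs it in one shot, which is what makes this a one-connection operation.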

The sheer amount of code involved in Ansible makes upgrades scary. Last time I looked, a clean copy of the Ansible repo clocked in at well over 200 megabytes. For a while I was even using Ansible to set up my local laptop, but my trepidation about its footprint and the fear of things breaking on updates eventually led me to throw it out and write my own framework instead. All I need to do on my local machine is edit a file here and there, set up some links, maybe install some things or run some scripts, so my tiny home-grown tool suffices.
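That tool is nothing to write home about; a minimal sketch of the kind of thing it has to do (file names hypothetical, and not its actual code):

```bash
#!/bin/sh
# Hypothetical sketch: symlink some dotfiles into place and run the odd
# install script; that is the entire job of my home-grown replacement.
set -e
for file in gitconfig vimrc zshrc; do
  ln -sf "$PWD/$file" "$HOME/.$file"
done
./install-extras.sh   # hypothetical "maybe install some things" step
```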

For my EC2 use case, however, I’m still not ready to throw out Ansible. I don’t want to have to deal with platform differences and network communications, which are two of the things that Ansible basically has totally figured out at this point.

Amazon has "Amazon Linux 2" now, and the "Amazon Linux" machines that I’ve been using for many years need to be migrated. You can’t just upgrade; you have to set up everything again. There have been some reasonably important changes between versions (like switching to systemd), which mean I may as well start from scratch and take the opportunity to redo, update and simplify things as much as possible. This is an opportunity to pay off technical debt, do some upgrades, and set things up "The Right Way™".

Before starting, I sought to simplify my arrangements on the instances as much as possible. For example, I had some static sites hosted on one of these machines which could be offloaded to GitHub Pages. And I had some private Git repos that I was backing up by taking EBS snapshots of their volumes, which I could also just mirror off to GitHub as private repos (and once I had that offsite backup, I could stop doing the EBS snapshots). And this in turn meant that I could simplify the volume structure: instead of having a separate XFS-formatted /data volume, I could just have everything on the root filesystem (XFS is now the default format, and I don’t even care about keeping things separate as I can now recreate any instance from scratch based on data available elsewhere).
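Mirroring one of those server-side repos off to GitHub is cheap insurance; a sketch (paths, names and URL hypothetical):

```bash
# Push a full mirror (all refs, including branches and tags) of a
# server-side bare repo up to a private GitHub repo as the offsite backup.
cd /data/git/some-private-repo.git
git remote add github git@github.com:example/some-private-repo.git
git push --mirror github
```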

I’ve always been skeptical of putting too many eggs into corporate baskets, taking great pains to minimize my dependence on Google, for example. For the longest time I didn’t push anything private to GitHub for this reason, even though their servers are most certainly safer and better maintained than my "lone wolf" amateur EC2 instances. But over the years, I’ve also realized that the real value of a lot of this private data that I’ve been pushing to my secret repos isn’t actually so great after all. It could be irrecoverably lost to virtually no consequence, and it could be leaked or exposed with only a little discomfort and inconvenience. Added to that, I actually started working for GitHub last month and I figure that if a company with a multi-trillion-dollar market cap like Microsoft is prepared to place a bet on GitHub, then little old me shouldn’t have any qualms about it — I have much less to lose, after all.

One of these EC2 instances hosts this blog, and I was able to simplify that too. When I set up the old instance (back in 2015) a large chunk of the content in the blog was written in "wikitext" format, and that was turned into HTML using a Sinatra (Ruby) microservice. Since then, I migrated all the wikitext to Markdown (a fun story in itself) and spun down the microservice. That means the instance no longer needs Ruby or RubyGems.

The other EC2 instance was running PHP for a couple of domains. I simplified my set-up on that instance by making a static mirror of all the files and folding them into wincent.dev itself (running on the other instance). This is the 5,000-file/1,000,000-line commit where I brought all that content across. The follow-up commit where I ran all the static HTML/"PHP" through Prettier is pretty epic, clocking in at over 3,000,000 lines. I also updated a quarter of a million links in this commit. Fun times.
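I’m not claiming this is exactly how I captured the snapshot, but wget can do this kind of static mirroring (URL hypothetical):

```bash
# One way to snapshot a dynamic (PHP) site as static files: recurse over
# the whole site, rewrite internal links to work from the local copy, and
# give pages proper .html extensions.
wget --mirror --convert-links --adjust-extension --page-requisites \
    https://example.com/
```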

The great thing about all these simplifications and migrations is that my instances are now close to being, effectively, "stateless". That is, I don’t really have to worry about backing them up any more because I can recreate them from scratch by a combination of bootstrapping with Ansible, and git push-ing data to them to seed them with content. If I lose my laptop and GitHub destroys my data then I’m in trouble, but I feel reasonably safe with three-fold redundancy (ie. the instance + my local copy + GitHub’s). It’s not infallible by any means, but it definitely meets the bar of "good enough"; at least, good enough that I’m not going to lose any sleep over all this.

Moving to Amazon Linux 2 was a pain in some ways (ie. having to rewrite Upstart scripts as systemd units) and great in others (eg. having access to recent versions of Monit, Redis and other software without having to build from source; in the end, the only software I had to actually build was a recent version of NodeJS on one of the hosts). Along the way, I also moved from acme.sh (which recently sold out to commercial interests) to acme-tiny (which sounds like my personal Let’s Encrypt spirit animal, being "a tiny, auditable script … currently less than 200 lines"), and made numerous improvements to make the certificate renewal process more robust. I even went so far as to finally set up a proper "non-root" IAM user for doing my admin work in the AWS console. Key pairs were rotated, security groups cleaned up, Subject Alternative Names trimmed, and so on. Basically, I took the opportunity to pay off as much tech debt as I could as I went.
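The Upstart-to-systemd rewrites mentioned above are mechanical once you’ve done one; a hypothetical unit of the shape involved (service name and paths invented):

```ini
# /etc/systemd/system/example.service (hypothetical): the systemd
# equivalent of what used to be an Upstart script.
[Unit]
Description=Example daemon
After=network.target

[Service]
ExecStart=/usr/local/bin/example --config /etc/example.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After dropping a unit like that in place, it’s `systemctl daemon-reload` and `systemctl enable --now example.service` and you’re done.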

The above simplifications meant that my overall requirements were now basic enough that I could dispense with most of the abstractions that Ansible provides (like group variables, roles, and so on) and just put everything in a single playbook. This is really marvellous for maintenance: it is a 1.5k-line YAML file, but that includes everything (tasks, handlers and variables for two hosts), and it all reads 100% linearly with no abstraction clouding or concealing what’s actually happening — you can just read it from top to bottom and see exactly what is going to happen on both hosts. Now, there is some repetition in there that could be factored out, but the repetition in this case is what keeps the whole thing simple. I’m probably not going to touch it. Additionally, getting rid of roles means that all of my templates and files are consolidated in a single location in the repo root instead of being dispersed over a dozen or so subdirectories hidden three levels deep.
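A skeleton of that single-file shape (hosts, tasks and variables all hypothetical), just to show how linearly it reads:

```yaml
# playbook.yml (hypothetical skeleton): two plays, one per host, with
# vars, tasks and handlers inline; no roles, no group_vars, and everything
# in top-to-bottom order.
- hosts: blog
  vars:
    sites_root: /var/www
  tasks:
    - name: Install nginx
      ansible.builtin.yum:
        name: nginx
        state: present
      notify: restart nginx
  handlers:
    - name: restart nginx
      ansible.builtin.systemd:
        name: nginx
        state: restarted

- hosts: git
  tasks:
    - name: Install git
      ansible.builtin.yum:
        name: git
        state: present
```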

I was a bit worried that in moving from Ansible 2 to Ansible 4 I was going to have to deal with a huge amount of breakage, but in the end it wasn’t too bad at all. Most stuff still works, and I was able to do almost everything I need using the ansible.builtin collection alone (only dipping into the ansible.posix collection for one task on each host, concretely, using the ansible.posix.authorized_key module). I do find the whole collections thing to be unpleasantly over-engineered, and I wish I didn’t have to know that "Ansible Galaxy" was a thing, but in the end I was able to mostly pretend that Galaxy doesn’t exist, by adding the ansible.posix repo as a Git submodule checked out at vendor/ansible_collections/ansible/posix, and setting collections_paths = ./vendor in my ansible.cfg.
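Concretely, the Galaxy-avoidance is just a submodule (the URL is the upstream ansible.posix repo on GitHub; the checkout path is the layout Ansible expects given the collections_paths setting described above):

```bash
# Vendor the one collection I need as a Git submodule, at exactly the
# directory layout Ansible looks for under collections_paths:
git submodule add https://github.com/ansible-collections/ansible.posix \
    vendor/ansible_collections/ansible/posix
```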

A similar dance with Python, moving from virtualenv (a separate tool) to venv (bundled with Python) for creating a sandbox environment, allowed me to use the aws-cli tool from a submodule without having to reach out over the network with pip every time I wanted to do something. I still wish that isolation and reproducibility were easier to achieve in the Python ecosystem (and maybe it is, for experts), but I was able to get done what I needed to do, in the end.
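Roughly speaking, and glossing over dependency handling, the dance looks like this (paths hypothetical, with the aws-cli repo vendored as a submodule):

```bash
# Create the sandbox with the stdlib venv module (no separate virtualenv
# install needed), then install aws-cli from the local submodule rather
# than pulling it down from PyPI on every run.
python3 -m venv .venv
. .venv/bin/activate
pip install ./vendor/aws-cli
```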

And with that, my migration away from a pair of trusty EC2 instances that had been launched all the way back in 2015 comes to a close. We’ll see whether their 2021 successors also last nearly 6 years, and whether the move to "Amazon Linux 3" ends up being any more straightforward thanks to the simplification and updates I’ve undertaken now. Hopefully, major system components like systemd and yum will still be there, so the next update will be a breeze.