Maintenance Postmortem: systemd-resolved

My first emergency maintenance postmortem! Oh, how fun!


Here's a quick one for ya. If you didn't realize already, my Pi cluster crashed tonight and I had to do some emergency maintenance!

What Happened

All I know is that I heard my cluster's fans spin up real high, and typically when that happens, something really CPU-intensive just kicked off, like pulling an image from Docker. When I went to investigate, I couldn't. Any interaction with my cluster via kubectl resulted in net/http: TLS handshake timeout errors, so I assumed my cluster's control plane was locked up. I decided to restart the master Pi, in hopes that "turning it off and on again" would fix the issue. Alas, my problem was not that simple.
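For reference, the failure mode looked something like this (reconstructed after the fact, so take the exact output with a grain of salt):

```console
$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

$ kubectl get pods -A
Unable to connect to the server: net/http: TLS handshake timeout
```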

After restarting, I found that kubectl worked once again, so I tried doing my usual routine of using kubectl proxy to get a Kubernetes dashboard session up and running. To my surprise, I was only met with a JSON error stating that the kubernetes-dashboard pod was unavailable. I was deep in it this time, but it was time to dig even deeper.
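That routine, for the record, is just the standard proxy-and-browse dance. A rough sketch, assuming the stock dashboard manifest running in the kubernetes-dashboard namespace:

```console
$ kubectl proxy
Starting to serve on 127.0.0.1:8001

# then, in a browser, the dashboard is reachable through the proxy at:
# http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
```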

I did a good ol' kubectl get pods -A to see the status of all of my pods, and pretty much any pod that used the "Always" image pull policy was unable to start up due to an ImagePullBackOff error.
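It looked roughly like this (pod names and ages are illustrative, not a verbatim capture):

```console
$ kubectl get pods -A
NAMESPACE              NAME                                    READY   STATUS             RESTARTS   AGE
kube-system            coredns-64897985d-x7kkp                 1/1     Running            1          41d
kubernetes-dashboard   kubernetes-dashboard-576cb95f94-k2m8q   0/1     ImagePullBackOff   0          41d
...
```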

Now I've experienced this before. Most of the time this means that DNS isn't resolving. Digging into one of the unavailable pods reveals that this is in fact the case.
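"Digging in" here just means describing one of the stuck pods and reading its events. The output below is a reconstruction of the kind of error I was seeing (pod name, image tag, and addresses are illustrative):

```console
$ kubectl describe pod kubernetes-dashboard-576cb95f94-k2m8q -n kubernetes-dashboard
...
Events:
  Type     Reason   Age   From     Message
  ----     ------   ----  ----     -------
  Warning  Failed   2m    kubelet  Failed to pull image "kubernetesui/dashboard:v2.5.1":
                                   rpc error: code = Unknown desc = failed to pull and unpack image
                                   "docker.io/kubernetesui/dashboard:v2.5.1": failed to resolve reference:
                                   dial tcp: lookup registry-1.docker.io on 127.0.0.53:53:
                                   read udp 127.0.0.1:50416->127.0.0.53:53: i/o timeout
  Warning  Failed   2m    kubelet  Error: ImagePullBackOff
```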

Briefly looking at that output, we can see some text about a lookup failing on port 53, which is the port reserved for DNS. So, let's start with the usual culprits: what does /etc/resolv.conf look like?
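On Ubuntu the answer is pretty predictable, but for completeness (abbreviated, with the long comment block trimmed):

```console
$ cat /etc/resolv.conf
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
...
nameserver 127.0.0.53
options edns0 trust-ad
```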

It's using the loopback address 127.0.0.53 as the DNS server. This is no real surprise, as Ubuntu has used systemd-resolved for a while now, and in my case the network (and therefore DNS) configuration is fed to it by cloud-init. Let's check those files next. Typically they live on the FAT32 partition of the Pi's SD card, but that partition is also mounted at /boot/firmware on the running system.
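On the Ubuntu Raspberry Pi images, the relevant file is the cloud-init network-config sitting in that boot partition. Mine looked more or less like this (contents approximate, from memory):

```console
$ cat /boot/firmware/network-config   # contents approximate, from memory
version: 2
ethernets:
  eth0:
    dhcp4: true
    optional: true
    nameservers:
      addresses: [1.1.1.1, 1.0.0.1]
```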

Those DNS servers seem to be fine. These Pis are hard-coded to Cloudflare DNS at the moment, but while I'm here I might as well remove the nameservers section so I can control DNS via DHCP in the future. I have a local DNS server that's already advertised via DHCP, so this change effectively lets the Pis pick it up on their own.
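After the trim, the file is just plain DHCP:

```console
$ cat /boot/firmware/network-config
version: 2
ethernets:
  eth0:
    dhcp4: true
    optional: true
```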

Regardless, now that we know the setup, let's literally try dig-ing a bit.
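Something along these lines (output abridged; the addresses that come back aren't important here):

```console
# asking the local stub listener: hangs, then gives up
$ dig registry-1.docker.io @127.0.0.53
;; connection timed out; no servers could be reached

# asking the upstream server directly: answers come back immediately
$ dig registry-1.docker.io @1.1.1.1 +short
# (several A records returned; actual addresses omitted)
```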

Interesting: going through systemd-resolved doesn't seem to work, while contacting our upstream servers directly does. Let's confirm that resolved isn't functional by asking it to do a domain lookup directly:
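One way to do that is resolvectl, which talks to the daemon over its D-Bus API rather than through the port-53 stub listener (the address and timing below are illustrative):

```console
$ resolvectl query registry-1.docker.io   # output below is illustrative
registry-1.docker.io: 34.205.13.154       -- link: eth0

-- Information acquired via protocol DNS in 4.2ms.
-- Data is authenticated: no
```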

Interesting. When looking up via the loopback address, resolved doesn't want to respond. But when asking the daemon to look up an address directly, it works. Maybe the server isn't binding to port 53 for some reason?
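A quick way to check that is to see what's actually sitting on port 53. For reference, this is what a healthy box looks like (pid and fd numbers are illustrative):

```console
$ sudo ss -lunp 'sport = :53'
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port  Process
UNCONN  0       0       127.0.0.53%lo:53     0.0.0.0:*          users:(("systemd-resolve",pid=619,fd=12))
```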

I tried restarting systemd-resolved a bunch of times and combing through its logs with journalctl, but nothing looked out of order. Even turning on debug logs via the override.conf trick didn't prove fruitful. Something was preventing domain resolution on port 53 from working, and honestly? At this point it seemed like resolved was causing more problems than it was solving.
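For the curious, that "trick" is just a systemd drop-in that bumps resolved's log level:

```console
$ sudo systemctl edit systemd-resolved
# this opens /etc/systemd/system/systemd-resolved.service.d/override.conf; add:
#   [Service]
#   Environment=SYSTEMD_LOG_LEVEL=debug

$ sudo systemctl restart systemd-resolved
$ journalctl -u systemd-resolved -b -f
```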

I spent about another hour scouring the internet for more potential solutions when I finally found this post on Stack Overflow. I pieced it together with this other answer I saw earlier to compose what I thought was a solution.

At this point, /etc/resolv.conf was symlinked to /run/systemd/resolve/stub-resolv.conf, which is the default configuration on Ubuntu that tells apps to forward DNS requests to systemd-resolved. If I were to replace this symlink with one that points to /run/systemd/resolve/resolv.conf, the generated file that holds my desired configuration, then I could let systemd keep updating DNS server addresses for me from DHCP while bypassing the systemd-resolved DNS proxy altogether. Worth a shot.
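Concretely, the swap looks like this on a stock Ubuntu install:

```console
# what the symlink points to by default
$ readlink -f /etc/resolv.conf
/run/systemd/resolve/stub-resolv.conf

# point it at the full resolv.conf that resolved generates from DHCP/link settings,
# bypassing the 127.0.0.53 stub listener entirely
$ sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
$ readlink -f /etc/resolv.conf
/run/systemd/resolve/resolv.conf
```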

We're back in business! After applying the "patch" to all of the Pis in the cluster, then rebooting and looking back at my watch kubectl command, we can see that pods are slowly starting to come back online.
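That watch command, for the record, is nothing fancier than:

```console
$ watch kubectl get pods -A
```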

This totally isn't the uncropped version of the first image in the article, btw. Definitely took plenty of screenshots in the moment.

Lessons Learned

At this point, honestly... I don't know? Don't use systemd-resolved? That doesn't seem right.

Earlier in the article, I put the word "patch" in quotes because I don't feel that the solution I implemented is a true solution, but rather a bodge. I still need to look into exactly what went wrong with resolved and why it wasn't answering on 127.0.0.53:53 as expected. There is more research and poking around to be done before this issue is truly closed. But it's 3:37 AM. I'll do it tomorrow.

For now, if any of you have thoughts or experience with systemd that may be relevant, please leave a comment below! Any help is welcome at this point, since Stack Overflow doesn't seem to be much help at the moment. I'll update this post as I discover more and hopefully reveal the real issue at play here.

Anyways, back to writing the next blog post. Maybe sleep. Who knows? I'm pretty much an insomniac at this point. I'll see you all in the next one.