It’s been a really good day for me.
I’ve been programming since the morning, and all the code has now started to look like mush… but I’ve been able to implement features, improve performance and testing, and feel good about the code I’ve written.
At work, I fixed a critical bug by deleting 2 lines and changing a word on another. Piece of cake, right? As they say, knowing how to use a hammer takes a second, but figuring out where to hit… that takes forever.
So what did I even do?
I’ve been working on a project to speed up deployments. We’re going from 4 hours to 2 minutes. This makes me really happy. The hard part is that we’re using Windows. On Windows you don’t have your familiar tools for testing, automating and debugging. So we’re writing an automation tool using Deno with a sprinkle of BusyBox and a few unrelenting lines of PowerShell.
Today, I was dealing with a bug that prevented every deployment after the first one from going through. Clearly a problem if the whole point of the tool is to make deployments :). The first showstopper was that iterations took 40 minutes: each time I modified my tool I had to wait 40 minutes before I could even run it. This is because we are integrating with existing infrastructure that runs on virtual machines, and these VMs can take a while to build and start. I realized we could use containers to speed up testing. Windows containers aren’t perfect, but they’re far enough along that you can patch through the rest (Chocolatey, BusyBox and vim help). We’ve had a way to test the containers locally for a while, but only one container at a time.
Using Windows containers, I was able to reproduce my bug and iterate every 5 minutes. This was much better, but we still had a problem. My implementation of integration tests using a Docker container used ssh. If you’ve tried this before then you may be surprised it even works… Chocolatey again helps with getting ssh running inside a Windows container, along with a healthy dose of praying to the Docker gods. But how did we know what port to reach? Initially, we could only run one container because we used a fixed port for ssh. Today, I found some time to update the system to run multiple containers on randomised (available) ports.
Needless over-engineering? Not this time! We needed this change to allow us to run tests of both scenarios.
Scenario 1: The machine is empty. Run the deployment script to create deployment A.
Scenario 2: The machine is not empty, and has deployments from before. Run a new deployment and make sure it replaces the old one.
Previously, we could have hacked together a serial test plan where we run scenario 1 and then scenario 2. This would double our iteration time from 5 minutes to 10 minutes… Not the worst thing in the world, but even 5 minutes was difficult when debugging our complex PowerShell scripts. There’s no running away from PowerShell on Windows!
What we ended up implementing was a way to launch a new container for each scenario, in parallel. This mostly just meant figuring out which ssh port to assign to each container. We tried PowerShell’s Get-NetTCPConnection and netstat, but neither would show us the ports Docker was using. By noon I was running out of time: I still needed to write the actual test scenario… and fix that bug that was holding up the project. I hacked it together by assuming a certain port range would be used only by my test runs. A reasonable guess, as we control the device where the tests are running.
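Here is a rough sketch of that hack in TypeScript, not the real tool. Every name in it (Scenario, ScenarioRunner, assignSshPorts, runInParallel) and the 2200-2299 range are made up for illustration. It picks a distinct random ssh port per scenario from a range assumed to be reserved for test runs, then starts all scenarios at once. It does not check whether a port is actually free; it leans on the same "nobody else uses this range" assumption.

```typescript
// Hypothetical names and values throughout: the real tool is not shown in this post.
interface Scenario {
  name: string;
}

interface PortAssignment {
  scenario: Scenario;
  sshPort: number;
}

interface ScenarioResult {
  name: string;
  sshPort: number;
  passed: boolean;
}

// Stand-in for "start a container, ssh in on sshPort, run the checks".
type ScenarioRunner = (scenario: Scenario, sshPort: number) => Promise<boolean>;

// Give each scenario its own random port from a range that is assumed
// to be used only by test runs. No availability probing is done here.
function assignSshPorts(
  scenarios: Scenario[],
  minPort: number,
  maxPort: number,
): PortAssignment[] {
  const size = maxPort - minPort + 1;
  if (scenarios.length > size) {
    throw new Error(`cannot assign ${scenarios.length} ports from a range of ${size}`);
  }
  const used = new Set<number>();
  return scenarios.map((scenario) => {
    let sshPort = minPort + Math.floor(Math.random() * size);
    // Re-roll until we land on a port no other scenario in this run has taken.
    while (used.has(sshPort)) {
      sshPort = minPort + Math.floor(Math.random() * size);
    }
    used.add(sshPort);
    return { scenario, sshPort };
  });
}

// Start every scenario at once, so the wall-clock time is roughly that of
// the slowest scenario rather than the sum of all of them.
async function runInParallel(
  scenarios: Scenario[],
  run: ScenarioRunner,
  minPort: number = 2200,
  maxPort: number = 2299,
): Promise<ScenarioResult[]> {
  const assignments = assignSshPorts(scenarios, minPort, maxPort);
  return Promise.all(
    assignments.map(async ({ scenario, sshPort }) => ({
      name: scenario.name,
      sshPort,
      passed: await run(scenario, sshPort),
    })),
  );
}
```

With the two scenarios above, you would call runInParallel with a runner that starts the container and runs the checks over ssh, then fail the test run if any result has passed set to false.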
By 4pm:
- I had upgraded our integration tests to run in parallel
- written a test to detect the current bug, which doubles as a regression test moving forward
- almost fixed the actual bug
Now that I had a test to check my work, I could put all my energy into the bug. By this point, my colleague had joined me for some pair programming. With two pairs of eyes we quickly zeroed in on the three lines that needed changing. After a few iterations we committed a fix, backed up by our test. It took ~40 minutes to find and fix the actual bug because in that time we were able to iterate ~20 times. A partial iteration took ~1 minute while a full iteration took 5 minutes. This is compared to our initial time of ~40 minutes for a full iteration (there was no partial iteration). And then you add in pipelines, release processes… It would have taken a week to reproduce the bug, forget fixing it.
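To put rough numbers on that, here is the back-of-the-envelope arithmetic as a small TypeScript sketch. The split of 15 partial and 5 full iterations is my assumption: I didn’t record the real split, it is just one combination that adds up to ~20 iterations in ~40 minutes. The helper name sessionMinutes is made up for this example.

```typescript
// Total waiting time for a debugging session, in minutes.
function sessionMinutes(iterations: number, minutesPerIteration: number): number {
  return iterations * minutesPerIteration;
}

// On the VMs every iteration is a full one at ~40 minutes.
const vmMinutes = sessionMinutes(20, 40);

// In containers: an assumed split of 15 partial (~1 min) and 5 full (5 min) iterations.
const containerMinutes = sessionMinutes(15, 1) + sessionMinutes(5, 5);

console.log(`VMs: ${vmMinutes} minutes (about ${(vmMinutes / 60).toFixed(1)} hours)`);
console.log(`Containers: ${containerMinutes} minutes`);
```

That is 800 minutes of pure waiting on the VMs versus 40 in containers, and the VM number doesn’t yet include the pipelines and release processes.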