Mon, 12 Jun 2006

There are no harmless race conditions [11:01]

I spent most of Friday tracking down a problem we had been seeing with the Mugshot Windows client for months, where important parts of the client (Mugshot.exe) would mysteriously vanish from the system. This happened to both Bryan and Havoc when we pushed a new client release last week. Using information from that, combined with uninstalling and installing things about 200 times, I was able to track the problem down to the following sequence:

  1. When the upgrade starts, we tell the old client to exit.
  2. The Windows Installer goes to remove the old version, finds that Mugshot.exe is still in use, and so schedules it for deletion after the next reboot.
  3. The old client actually finishes exiting.
  4. The Windows Installer installs the new files, and since Mugshot.exe is no longer in use, immediately replaces it with the new version, instead of scheduling that action to happen after reboot.
  5. The user is prompted to reboot the system, and after reboot, Mugshot.exe is deleted.
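To make the interaction concrete, here is a toy model of that sequence. It is a hypothetical sketch, not the Windows Installer's real logic: the `ToySystem` class, its methods, and the `upgrade` helper are invented for illustration. The only rules it encodes are the two from the list: a locked file gets its delete deferred to reboot, and a free file is replaced immediately, with nothing cancelling the earlier deferred delete.

```python
# Hypothetical toy model of the upgrade race; NOT how the Windows Installer is implemented.
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    files: dict = field(default_factory=dict)            # path -> installed version
    in_use: set = field(default_factory=set)             # paths held by a running process
    pending_deletes: list = field(default_factory=list)  # deletes deferred to reboot

    def remove(self, path):
        if path in self.in_use:
            # File is locked: defer the delete until after the next reboot.
            self.pending_deletes.append(path)
        else:
            self.files.pop(path, None)

    def install(self, path, version):
        # In this trace the file is always free by the time we install, so only
        # the immediate-replace path is modeled. Note that it does not touch
        # pending_deletes, which is exactly the problem.
        self.files[path] = version

    def reboot(self):
        for path in self.pending_deletes:
            self.files.pop(path, None)
        self.pending_deletes.clear()
        self.in_use.clear()


def upgrade(wait_for_exit):
    box = ToySystem(files={"Mugshot.exe": "old"}, in_use={"Mugshot.exe"})
    # Step 1: tell the old client to exit. Optionally block until it really has.
    if wait_for_exit:
        box.in_use.discard("Mugshot.exe")
    # Step 2: remove the old version (deferred if the file is still in use).
    box.remove("Mugshot.exe")
    # Step 3: the old client finishes exiting.
    box.in_use.discard("Mugshot.exe")
    # Step 4: install the new version; the file is free, so it is replaced now.
    box.install("Mugshot.exe", "new")
    # Step 5: reboot, running any deferred deletes.
    box.reboot()
    return box.files.get("Mugshot.exe")
```

Running `upgrade(wait_for_exit=False)` returns `None` (the freshly installed file is deleted at reboot), while `upgrade(wait_for_exit=True)` returns `"new"`.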

The fix, once tracked down, was pretty simple ... just actually wait for the old client to finish exiting before proceeding. Now, I knew when I originally wrote the code that the race condition could happen, but I convinced myself that It Will Be Very Rare: the Windows Installer has all sorts of other work to do as well, and the client will exit quickly. And also that It Will Be Harmless: maybe the user will be told to reboot unnecessarily, but that's all. As is almost always the case, neither was in fact true.
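The post doesn't say how the wait was implemented; on Win32 the natural equivalent is blocking on the process handle with `WaitForSingleObject`. As a portable sketch of the same idea, here is a Python version using `subprocess`. The helper name `stop_old_client` and the 10-second timeout are assumptions, and `terminate()` is only a stand-in for however the real client was asked to exit. The point is the `wait()` that follows it: the caller doesn't proceed until the process is really gone.

```python
import subprocess
import sys


def stop_old_client(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask the running client to exit, then block until it really has.

    Hypothetical helper: terminate() stands in for the real "please exit"
    request. Returns the process's exit code.
    """
    proc.terminate()
    try:
        # Don't race ahead: wait for the process to actually finish exiting.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # It ignored the request; force it, and still wait before proceeding.
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    # Demo: a long-running child process stands in for the old client.
    old_client = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    stop_old_client(old_client)
    print("old client gone:", old_client.poll() is not None)
```

Only once `stop_old_client` returns is it safe to start removing files, because nothing is holding them open anymore.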

Another thing I should have known better about this weekend: when attending an outdoor music festival in the rain, waterproof footwear will make the experience vastly more enjoyable, even if the music is worth it.