The second Sunday of March has come to pass, which means if you’re a North American reader, you’re getting this an hour earlier than normal. What a bonus! That’s right, we all got to experience the mandatory clock-changing event known as Daylight Saving Time. While the sun, farm animals, toddlers, etc. don’t care about an arbitrary changing of the clock, computers definitely do.
Early in my QA career, I had the great (dis)pleasure of fully regression testing electronic punch clocks on every possible software version every time a DST change was looming. It was every bit as miserable as it sounds but was necessary because if punches were an hour off for thousands of employees, it would wreak havoc on our clients’ payroll processing.
Submitter Iain would know this all too well after the financial services company he worked for experienced a DST-related disaster. As a network engineer, Iain was in charge of the monitoring systems. Since their financial transactions were very dependent on accurate time, he created a monitor that would send him an alert if any of the servers drifted three or more seconds from what the domain controllers said the time should be. It rarely ever went off since the magic of NTP was in use to keep all the server clocks correct.
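For the curious, a drift check like Iain's can be sketched in a few lines of Python. This is only a guess at the shape of it: the story doesn't say how his monitor was built, so the function name, the server names, and the way the clock readings get collected are all assumptions made for illustration. The only detail taken from the story is the three-second threshold.

```python
from datetime import datetime, timedelta

# Threshold from the story: alert at three or more seconds of drift.
DRIFT_THRESHOLD = timedelta(seconds=3)


def find_drifted(reference, server_times, threshold=DRIFT_THRESHOLD):
    """Return {server: signed drift} for every server at least `threshold` off.

    `reference` is the time the domain controller reports; `server_times`
    maps a (hypothetical) server name to the time that server reports.
    """
    drifted = {}
    for name, reported in server_times.items():
        drift = reported - reference
        if abs(drift) >= threshold:
            drifted[name] = drift
    return drifted


if __name__ == "__main__":
    # Illustrative readings only; the date and names are made up.
    dc_time = datetime(2024, 3, 10, 3, 5, 0)
    readings = {
        "web01": dc_time + timedelta(seconds=1),  # within tolerance
        "web02": dc_time - timedelta(hours=1),    # a full hour behind
    }
    for server, drift in find_drifted(dc_time, readings).items():
        print(f"ALERT: {server} is off by {drift.total_seconds():+.0f}s")
```

Keeping the drift signed makes it obvious at a glance whether a server is ahead or behind, which matters when the number in the alert is suspiciously close to 3600.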
Early in the morning on one fateful second Sunday in March, Iain’s phone exploded with alerts from the monitor. Two load-balanced web servers were alternately complaining about being an entire hour off from the actual time. The servers in question had been added in recent months and had never caused an issue before.
He rolled out of bed and grabbed his laptop to begin troubleshooting. The servers were supposed to sync their time with their domain controller, which in turn synced over NTP with an external stratum 1 time server. He figured one or more of the servers had been having network connectivity issues when the time change occurred and were now confused as to who had the right time.
Iain sent an NTP packet to each of the troubled servers expecting to see the domain controller as the reference server. Instead, he saw the IP addresses of TroublesomeServer1 and TroublesomeServer2. Thinking he did something wrong in an early morning fog, he ran it again only to get the same result. It seemed that the two servers were pointed to each other for NTP.
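The story doesn't say which tool Iain used, but the check itself is simple enough to sketch. An NTP reply carries a 48-byte header in which byte 1 is the stratum and bytes 12 to 15 are the reference ID. For a server at stratum 2 or higher that syncs over IPv4, that field holds the IPv4 address of its upstream source. The function names and the timeout below are illustrative choices, not anything from the original setup.

```python
import ipaddress
import socket

NTP_PORT = 123
# First byte 0x23 = leap 0, version 4, mode 3 (client); the rest is zeroed.
CLIENT_REQUEST = b"\x23" + 47 * b"\x00"


def describe_reference(packet: bytes):
    """Return (stratum, reference) decoded from a raw NTP reply."""
    if len(packet) < 48:
        raise ValueError("an NTP header is 48 bytes long")
    stratum = packet[1]
    refid = packet[12:16]
    if stratum <= 1:
        # Stratum 0/1 replies carry a short ASCII code such as "GPS".
        return stratum, refid.rstrip(b"\x00").decode("ascii", errors="replace")
    # Stratum 2+: IPv4 address of the upstream server. (With an IPv6
    # upstream the field is a hash instead, so treat this as IPv4-only.)
    return stratum, str(ipaddress.IPv4Address(refid))


def query_reference(host: str, timeout: float = 2.0):
    """Send one client request to `host` and decode who it syncs from."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(CLIENT_REQUEST, (host, NTP_PORT))
        reply, _ = sock.recvfrom(512)
    return describe_reference(reply)
```

Calling `query_reference` against each troubled server and seeing the other one's address come back is the moment described above.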
While that was a ridiculous setup, it wouldn’t explain why they were off by an entire hour and kept switching their times. Iain noticed that the old-fashioned clock on his desk showed the time was a bit after 2 AM, while the time on his laptop was a bit after 3 AM. It dawned on him that the time issues had to be related to the Daylight Saving Time change. The settings for that were kept in the load balancer, which he had read-only access to.
In the load balancer console, he found that TroublesomeServer1 was correctly set to update its time for Daylight Saving, while TroublesomeServer2 was not. Since they were incorrectly set to each other for NTP, when TroublesomeServer1 jumped ahead an hour, TroublesomeServer2 would follow. But then TroublesomeServer2 would realize it wasn’t supposed to adjust for DST, so it would jump back an hour, bringing TroublesomeServer1 with it. This kept repeating itself, which explained the volume of alerts Iain got.
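To see why the alerts never stopped, here is a toy model of the loop as the story tells it. It is deliberately simplified and is not how real NTP behaves: each round, one server "corrects" itself according to its own DST setting, and its peer dutifully copies the result. The names `s1` and `s2` and the round-based timing are assumptions made for illustration.

```python
def simulate(rounds):
    """Return the (s1, s2) clock offsets, in hours, after each round.

    Offsets are relative to standard time. s1 is configured to apply DST
    (+1 hour); s2 is not (0 hours). The servers take turns leading.
    """
    clock = {"s1": 0, "s2": 0}
    wants = {"s1": 1, "s2": 0}  # what each server's DST setting demands
    history = []
    for i in range(rounds):
        leader, follower = ("s1", "s2") if i % 2 == 0 else ("s2", "s1")
        clock[leader] = wants[leader]    # leader applies its own DST rule
        clock[follower] = clock[leader]  # follower syncs to its peer
        history.append((clock["s1"], clock["s2"]))
    return history


if __name__ == "__main__":
    for step, (s1, s2) in enumerate(simulate(6), start=1):
        print(f"round {step}: s1={s1:+d}h  s2={s2:+d}h")
```

The pair flips between "both an hour ahead" and "both an hour back" forever, because neither server ever consults a source that is actually right.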
Since he was powerless to correct the setting on the load balancer, he made a call to his manager, who escalated to another manager and so on until they tracked down who had access to make the setting change. Three hours later, the servers were on the correct time. But the mess of correcting all the overnight transactions that happened during this window was just beginning. The theoretical extra hour of daylight was negated by everyone spending hours in a windowless conference room adjusting financial data by hand.