Follow-up on the Sept 8 service outage

This blog post is a follow-up on the outage that occurred on September 8th. Just before 8:00 PM PDT that day, we became aware of a Domain Name System (DNS) issue that was causing a service interruption for some Microsoft services, including Windows Live services such as Hotmail and SkyDrive. No customer data was lost or compromised during this outage. The team has investigated the root cause and has taken immediate steps to improve.

So, what happened? A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption.

At 10:23 PM PDT we began to see service restoration. We confirmed that the incident was resolved by 11:35 PM PDT, although it took some time for the changes to replicate around the world and reach all our customers.

We determined the cause to be a corrupted file in Microsoft’s DNS service. The file corruption was the result of two rare conditions occurring at the same time. The first condition was related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was traced to the networking device firmware used in the Microsoft DNS service.
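To make the first condition more concrete, here is a minimal Python sketch of one common safeguard: parsing every line of a configuration file before it is synchronized, and rejecting the whole file if any line is malformed. The `name = address` line format and the names `ConfigError`, `parse_line`, and `validate_config` are assumptions made purely for illustration; they do not describe the actual device firmware or the tooling used in the Microsoft DNS service.

```python
import ipaddress


class ConfigError(ValueError):
    """Raised when a configuration line cannot be parsed (hypothetical)."""


def parse_line(line, lineno):
    """Parse one entry in the assumed 'name = address' format."""
    name, sep, address = line.partition("=")
    name, address = name.strip(), address.strip()
    if not sep or not name or not address:
        raise ConfigError(f"line {lineno}: expected 'name = address', got {line!r}")
    try:
        ipaddress.ip_address(address)  # accepts IPv4 and IPv6 literals
    except ValueError as exc:
        raise ConfigError(f"line {lineno}: bad address {address!r}") from exc
    return name, address


def validate_config(text):
    """Return all parsed entries, or raise before anything is synchronized."""
    entries = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, address = parse_line(line, lineno)
        entries[name] = address
    return entries
```

In a design like this, a single incorrectly constructed line stops the update before it can be copied anywhere else, so a parsing problem and a synchronization step cannot combine to spread a bad file.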

Since restoring service, we have identified two streams of work to drive specific service improvements around monitoring, problem identification, and recovery. Along with these service improvements, Microsoft is focused on further hardening the DNS service to improve its overall redundancy and fail-over capability.

We are also developing an additional recovery process that will allow a specific property to fail over to restore service and then fail back when the DNS service is restored. In addition, we are reviewing the recovery tools to see if we can make more improvements that will decrease the time it takes to resolve outages.
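As a rough illustration of the fail-over and fail-back idea, the Python sketch below switches a property to a fallback path after several consecutive failed health checks on the primary path, and switches back after several consecutive healthy checks. The class name `PropertyRouter`, its thresholds, and the path labels are all hypothetical; the recovery process described above has not been published in this form.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyRouter:
    """Tracks which path a single property should use (hypothetical model)."""
    primary: str
    fallback: str
    failures_before_failover: int = 3
    successes_before_failback: int = 3
    active: str = field(default="", init=False)
    _failures: int = field(default=0, init=False)
    _successes: int = field(default=0, init=False)

    def __post_init__(self):
        self.active = self.primary  # start on the primary path

    def record_primary_check(self, healthy):
        """Record one health check of the primary path; return the active path."""
        if healthy:
            self._failures = 0
            self._successes += 1
            if (self.active == self.fallback
                    and self._successes >= self.successes_before_failback):
                self.active = self.primary  # fail back
        else:
            self._successes = 0
            self._failures += 1
            if (self.active == self.primary
                    and self._failures >= self.failures_before_failover):
                self.active = self.fallback  # fail over
        return self.active
```

Requiring several consecutive results in each direction is one way to avoid flapping between paths while the primary service is still recovering.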

We are determined to deliver the very best possible service to our customers and regret any inconvenience caused by this outage.

Arthur de Haan
Vice President, Windows Live Test and Service Engineering

3 Comments
  • I am still trying to delete three old dns sites. They all interact: with updates, all three sites use the hard disk space, and if I delete old files, it does so to all sites. If I no longer use Outlook Express version 6 and Internet Explorer 6, can I delete some of the old dll files in my IBM Winx pro spk3 preload space? My R40 ThinkPad feels like it has a mind of its own, jumping all over the place. Sept 8th, Hurricane Irene hovered over our coastline; due to our wonderful progress in technology I knew I was in its direct path. On Sept 9th, the slow-moving eye remained over us. My ThinkPad worked perfectly; the outer bands of the storm were so wide that we had perfect weather and reception locally, none of the oil spill waters came ashore, and the Mississippi River did not break through the levee. I started getting a message stating I was running out of hard drive space. My WL Hotmail Plus and SkyDrive are running fine now. Can anyone help me with this problem? Merci!! CADjReine

  • Will the problem with Zune not loading profiles be fixed soon?

  • The Windows Live team has to establish their credibility now by delivering very good service with no outages (or at least very few), because in the past few weeks there have been many, many outages for all MS cloud services, including Office 365.