Microsoft

Microsoft has revealed that Thursday's worldwide outage was caused by a code defect that allowed the Azure DNS service to become overwhelmed and not respond to DNS queries.

At approximately 5:21 PM EST on Thursday, Microsoft experienced a global outage that prevented users from accessing or signing into numerous services, including Xbox Live, Microsoft Office, SharePoint Online, Microsoft Intune, Dynamics 365, Microsoft Teams, Skype, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, OneNote, Microsoft Managed Desktop, and Microsoft Stream.

The outage was so widespread within Microsoft's infrastructure that even the Azure status page, which is used to provide outage information, was inaccessible.

Azure status page unreachable
Source: Twitter

Microsoft eventually resolved the outage at approximately 6:30 PM EST, with some services taking a bit longer to function properly again.

At the time, Microsoft stated that the outage was caused by a DNS issue but did not provide further information.

Azure DNS service became overloaded

Last night, Microsoft published a root cause analysis (RCA) for this week's outage and explained that it was caused by their Azure DNS service becoming overloaded.

Microsoft's Azure DNS is a global network of redundant name servers that provides high availability and fast DNS services.

According to Microsoft, the Azure DNS service began receiving an "anomalous surge" of DNS queries from all over the world that were targeting certain domains hosted on Azure. While Microsoft does not explain what this anomalous surge was, it may have been a DDoS attack targeting certain domains.

Microsoft states that their DNS service can typically handle a large number of requests through DNS caches and traffic shaping. However, a code defect prevented their DNS Edge caches from working correctly.

"Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches."

"As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service," Microsoft explained in the RCA for this week's outage.
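The feedback loop described in the RCA, in which a less efficient cache and client retries both add load, can be illustrated with a rough back-of-the-envelope model. The sketch below is not Microsoft's model: the function name `backend_qps`, its parameters, and all of the numbers are assumptions chosen purely for illustration, and it assumes every failed attempt is retried independently up to a fixed limit.

```python
def backend_qps(client_qps, cache_hit_ratio, failure_rate, max_retries):
    """Estimate queries per second that reach the DNS backend (toy model).

    client_qps      -- fresh queries per second sent by clients (assumed)
    cache_hit_ratio -- fraction of queries answered by the edge cache (assumed)
    failure_rate    -- probability that any single attempt fails (assumed)
    max_retries     -- retries a client makes after the first attempt (assumed)
    """
    # Expected attempts per query: 1 + p + p^2 + ... + p^max_retries,
    # because each retry only happens if the previous attempt failed.
    attempts_per_query = sum(failure_rate ** k for k in range(max_retries + 1))
    total_qps = client_qps * attempts_per_query
    # Only cache misses are passed on to the backend.
    return total_qps * (1.0 - cache_hit_ratio)


if __name__ == "__main__":
    # Illustrative numbers only; none of these come from the RCA.
    healthy = backend_qps(1000, cache_hit_ratio=0.75, failure_rate=0.0, max_retries=3)
    degraded = backend_qps(1000, cache_hit_ratio=0.5, failure_rate=0.5, max_retries=3)
    print(f"healthy backend load:  {healthy:.1f} qps")
    print(f"degraded backend load: {degraded:.1f} qps")
```

In this toy model, lowering the cache hit ratio and raising the failure rate multiply each other's effect, which is consistent with the RCA's description of retries adding workload to an already overloaded service.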

As almost all Microsoft domains are resolved through Azure DNS, it was no longer possible to resolve hostnames on these domains and access associated services when the DNS service became overloaded.

For example, the xboxlive.com domain uses the following Azure DNS name servers to resolve hostnames on this domain.

NS1-205.AZURE-DNS.COM
NS2-205.AZURE-DNS.NET
NS3-205.AZURE-DNS.ORG
NS4-205.AZURE-DNS.INFO

Since xboxlive.com is hosted on Azure DNS, and that service became unavailable, users were no longer able to log in to Xbox Live.
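The dependency described above can be checked mechanically: if every name server for a domain belongs to Azure DNS, the domain has no fallback when that service is overloaded. The following is a minimal sketch that only inspects the name server list shown in this article. The helper names and the regular expression are assumptions based on the naming pattern visible in that list, and the code performs no live DNS lookups.

```python
import re

# Assumed pattern, inferred from the name servers listed in the article:
# ns<1-4>-<number>.azure-dns.<com|net|org|info>, with an optional trailing dot.
AZURE_DNS_PATTERN = re.compile(
    r"ns[1-4]-\d+\.azure-dns\.(?:com|net|org|info)\.?",
    re.IGNORECASE,
)

# Name servers for xboxlive.com as listed in the article.
XBOXLIVE_NS = [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
]


def is_azure_dns(name_server):
    """Return True if the host name looks like an Azure DNS name server."""
    return AZURE_DNS_PATTERN.fullmatch(name_server.strip()) is not None


def depends_only_on_azure_dns(name_servers):
    """Return True if the list is non-empty and every entry is Azure DNS."""
    return bool(name_servers) and all(is_azure_dns(ns) for ns in name_servers)


if __name__ == "__main__":
    print(depends_only_on_azure_dns(XBOXLIVE_NS))
```

Although the four name servers sit under four different top-level domains, they are all operated by the same service, so they became unavailable together.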

To prevent this type of outage in the future, Microsoft states that they are repairing the code defect in Azure DNS so that the DNS cache can adequately handle large volumes of requests. They also plan to improve the monitoring and mitigation of anomalous traffic.

In response to our queries regarding the anomalous surge of DNS traffic, Microsoft stated they had nothing further to share at this time.

Update 4/4/21: Added response from Microsoft.
