A subset of users in Europe, the Middle East and the Americas may be experiencing degraded performance and issues accessing RMS

Incident Report for RMS Cloud

Postmortem

The issue affecting RMS users was caused by a global issue affecting Microsoft services, including the Azure data centres where the RMS applications are hosted and deployed from.

See Microsoft's incident response below.

https://status.azure.com/en-au/status

Starting at 07:05 UTC on 25 January 2023, customers may experience issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365, PowerBI.
We've determined the network connectivity issue is occurring with devices across the Microsoft Wide Area Network (WAN). This impacts connectivity between clients on the internet to Azure, as well as connectivity between services in datacenters, as well as ExpressRoute connections. The issue is causing impact in waves, peaking approximately every 30 minutes.
We have identified a recent WAN update as the likely underlying cause, and have taken steps to roll back this update. Our latest telemetry shows signs of recovery across multiple regions and services, and we are continuing to actively monitor the situation.

This message was last updated at 09:43 UTC on 25 January 2023

25/1

Azure Networking - Multiple regions - Mitigated (Tracking ID VSG1-B90)

Summary of Impact: Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365 and PowerBI.

Preliminary Root Cause: We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity between services within regions, as well as ExpressRoute connections.

Mitigation: We identified a recent change to WAN as the underlying cause and have rolled back this change. Networking telemetry shows recovery from 09:00 UTC onwards across all regions and services, with the final networking equipment recovering at 09:35 UTC. Most impacted Microsoft services automatically recovered once network connectivity was restored, and we worked to recover the remaining impacted services.

Next Steps: We will follow up in 3 days with a preliminary Post Incident Report (PIR), which will cover the initial root cause and repair items. We'll follow that up 14 days later with a final PIR where we will share a deep dive into the incident.

You can stay informed about Azure service issues, maintenance events, or advisories by creating custom service health alerts (https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation) and you will be notified via your preferred communication channel(s).

25/1

Azure Databricks - West Europe - Mitigated (Tracking ID QS45-B80)

Summary of impact: Between 01:35 UTC and 04:06 UTC on January 25, users may have experienced failures to render the account console page, notebooks, or may have experienced failures creating new workspaces, users, and Databricks user interfaces. Cluster CRUD operations and workspace authentication might have timed out or failed. The running jobs, along with the jobs submitted through APIs and schedulers, might have failed as well.

Preliminary Root Cause: Azure has identified a power event that caused an outage to a portion of the storage system in the West Europe region. The outage led to failures in database systems backing the aforementioned Databricks services in that region.

Mitigation: The storage service was recovered as soon as the power maintenance event had been completed, mitigating the downstream impact for the Azure Databricks service.

Next Steps: Databricks will follow up with Azure Engineering to establish the full root cause and prevent further occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Posted Jan 25, 2023 - 19:23 UTC

Resolved

This incident has been resolved.

Posted Jan 25, 2023 - 09:22 UTC

Monitoring

Services are returning to normal. We will continue to monitor.

Posted Jan 25, 2023 - 09:20 UTC

Update

Update - Azure Networking - Multiple regions - Investigating

Starting at 07:05 UTC on 25 January 2023, customers may experience issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in multiple regions, as well as other Microsoft services.

We are actively investigating and will share updates as soon as more is known.

Posted Jan 25, 2023 - 09:02 UTC

Update

Azure Networking - Multiple regions - Investigating

Starting at 07:05 UTC on 25 January 2023, customers may experience issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in multiple regions, as well as other Microsoft services.

We are actively investigating and will share updates as soon as more is known.

This message was last updated at 08:53 UTC on 25 January 2023

Posted Jan 25, 2023 - 08:57 UTC

Update

We are awaiting updates from Microsoft regarding the global issues affecting all Microsoft services.

Posted Jan 25, 2023 - 08:42 UTC

Update

Update from Microsoft:

Azure Networking - Multiple regions - Investigating

Starting at 07:30 UTC, we're aware of a networking issue impacting connectivity to Azure for a subset of users. We are actively investigating and will share updates as soon as more is known.

This message was last updated at 08:29 UTC on 25 January 2023

Posted Jan 25, 2023 - 08:33 UTC

Identified

Issue identified as relating to a global issue affecting all Microsoft and Azure systems.

Posted Jan 25, 2023 - 08:22 UTC

Investigating

A subset of users are experiencing degraded performance within RMS, which is related to issues affecting all Microsoft systems globally.

Posted Jan 25, 2023 - 08:18 UTC

This incident affected: North America (RMS9+ (Live), RMS 9+ (Release Candidate), RMS9+ (Beta)) and Europe (RMS9+ (Live), RMS 9+ (Release Candidate), RMS9+ (Beta)).