User Impact
During the incident, users were unable to access the portal, log in, or release print jobs at the device. User activities either presented an immediate error or were delayed and eventually timed out.
Scope of Impact:
US Deployment
Incident Start Date and Time
September 18, 2024, 10:30 UTC
Incident End Date and Time
September 18, 2024, 16:15 UTC
Root Cause:
Preliminary Findings: September 30, 2024
During a planned scaling operation triggered by uniFLOW Online, an issue was encountered on the Microsoft Azure platform. The event was not presented as a hard error, but it stalled the scale-out of resources. NT-ware Operations and Microsoft engineers are conducting an ongoing investigation to identify the cause of the platform failure.
NT-ware will update this Postmortem with the results of that investigation. In the meantime, we are putting in place changes to mitigate such issues, as listed below under ‘Next Steps’.
Updated RCA Findings: October 17, 2024
Microsoft performed an exhaustive investigation into the issue experienced on the Azure platform but was unable to provide a conclusive RCA. While it is accepted that there was a faulting component, further investigation is not possible because the recovery efforts on the day required the faulting cloud services to be deallocated.
Together with Microsoft engineers, NT-ware has identified improvements to provide resilience against such an issue in the future. By spanning multiple cloud service instances, we will no longer be dependent on a single cloud service cluster. Azure cloud services are very reliable, but as this case showed, they can falter. These improvements are already with our planning and development departments and are listed below under ‘Next Steps’.
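As a rough illustration of this resilience improvement, the sketch below distributes requests across several clusters and temporarily skips a cluster that has recently failed. The cluster endpoints, the send_request helper, and the failover rules are assumptions for illustration only and do not represent the actual uniFLOW Online design.

```python
import random
import time

# Illustrative cluster endpoints; the real deployment topology is not shown here.
CLUSTERS = [
    "https://cluster-a.example.invalid",
    "https://cluster-b.example.invalid",
    "https://cluster-c.example.invalid",
]

# Tracks until when a cluster should be skipped after a failure.
_unhealthy_until: dict[str, float] = {}

# Hypothetical helper: sends one request to one cluster, raising on failure or timeout.
def send_request(endpoint: str, payload: dict) -> dict:
    raise NotImplementedError("replace with the real service call")

def send_with_failover(payload: dict, retry_cooldown: float = 300.0) -> dict:
    """Try clusters in random order, skipping ones that recently failed."""
    now = time.monotonic()
    candidates = [c for c in CLUSTERS if _unhealthy_until.get(c, 0.0) <= now]
    random.shuffle(candidates)
    last_error = None
    for endpoint in candidates or CLUSTERS:
        try:
            return send_request(endpoint, payload)
        except Exception as error:
            # Mark the cluster unhealthy for a cooldown period and try the next one.
            _unhealthy_until[endpoint] = now + retry_cooldown
            last_error = error
    raise RuntimeError("all clusters failed") from last_error
```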
How did we respond:
10:30 UTC: Time-of-day scaling ran as planned to scale out and accommodate the upcoming load. Some of the new instances became stuck in a “Starting” state during the scale event. While instances remained in this “Starting” state, no additional scaling could start further instances.
Alerting also did not trigger at this point, as the current alerting was configured to fire only on Fail or Stop events. Because this was a standard transition from the stopped state to the started state, it was not covered by our alert platform. This will be treated as an improvement and is listed below.
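To illustrate the planned alerting improvement, the following sketch treats a prolonged transitional state as an alert condition in addition to hard Fail or Stop events. The get_instance_states helper, the thresholds, and the alert channel are hypothetical and are not part of the actual uniFLOW Online alerting platform.

```python
import time

# Hypothetical helper: returns a mapping of instance name -> lifecycle state
# (e.g. "Running", "Starting", "Stopped", "Failed"). In a real deployment this
# would query the cloud platform's management API.
def get_instance_states() -> dict[str, str]:
    raise NotImplementedError("replace with a call to the platform management API")

STUCK_THRESHOLD_SECONDS = 600   # assumed limit: alert if "Starting" lasts over 10 minutes
POLL_INTERVAL_SECONDS = 60

def raise_alert(message: str) -> None:
    # Placeholder for the real alerting channel (pager, ticket, e-mail).
    print(f"ALERT: {message}")

def watch_for_stuck_instances() -> None:
    first_seen_starting: dict[str, float] = {}
    while True:
        now = time.monotonic()
        for name, state in get_instance_states().items():
            if state in ("Failed", "Stopped"):
                # Existing behaviour: hard Fail/Stop events already alert.
                raise_alert(f"{name} entered state {state}")
                first_seen_starting.pop(name, None)
            elif state == "Starting":
                # New behaviour: track how long the transitional state lasts.
                first_seen_starting.setdefault(name, now)
                if now - first_seen_starting[name] > STUCK_THRESHOLD_SECONDS:
                    raise_alert(f"{name} stuck in 'Starting' for more than "
                                f"{STUCK_THRESHOLD_SECONDS} seconds")
            else:
                # Instance reached a steady state; clear any tracking entry.
                first_seen_starting.pop(name, None)
        time.sleep(POLL_INTERVAL_SECONDS)
```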
11:40 UTC: An additional scale-out took place that created new instances, but they did not start because of the instances that were already stuck. Autoscaling based on CPU load likewise failed to start instances for the same reason.
12:00 UTC: Webrole CPU usage started growing due to the start of the business day (08:00 US Eastern Time).
12:10 UTC: The running Webrole instances reached 100% CPU usage, and delays started to become noticeable.
12:48 UTC: The Operations team was notified by Support that there might be an ongoing issue, and an investigation started immediately.
13:36 UTC: A high-priority incident was raised with Azure Support for additional assistance, as the problem appeared to be on the Azure side.
14:20 UTC: A process was started to allow us to execute our disaster recovery strategy to recover services.
15:45 UTC: Recovery of services was observed following the execution of our disaster recovery strategy.
16:15 UTC: Services had fully recovered.
Emergency Mode
It should be noted that Emergency Mode was not enabled during this incident. Emergency Mode detects a hard failure of the web service, not a delayed outage. In this case, the availability check returned that the service was available.
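To illustrate why a hard-failure check can pass during a slow degradation, the sketch below contrasts a reachability-only probe with one that also treats an excessively slow response as unhealthy. The endpoint URL and thresholds are illustrative assumptions and do not represent the actual Emergency Mode implementation.

```python
import time
import urllib.error
import urllib.request

# Illustrative endpoint; not the real uniFLOW Online health check URL.
HEALTH_URL = "https://example.invalid/health"

def service_reachable(timeout: float = 30.0) -> bool:
    """Hard-failure style check: only asks whether the service answered at all."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The service responded, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        return False

def service_healthy(max_latency_seconds: float = 2.0) -> bool:
    """Degradation-aware check: a response that is too slow also counts as unhealthy."""
    start = time.monotonic()
    if not service_reachable(timeout=max_latency_seconds * 5):
        return False
    return (time.monotonic() - start) <= max_latency_seconds
```

During a delayed outage like this one, a reachability-only probe keeps reporting the service as available because requests eventually complete, while a latency-aware probe would flag the degradation.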
Next Steps:
We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
Was this incident related to previous incidents?
No, this is not related to previous incidents.