User Impact
Devices switched to emergency mode, and printing and scanning actions were slow to respond or timed out. The general tenant UI would also have been slow or unresponsive.
Scope of Impact
Singapore Deployment
Incident Start Date and Time
11 March 2024 00:30 UTC
Incident End Date and Time
11 March 2024 02:15 UTC
Root Cause
Following an investigation with Microsoft, it was confirmed that the uniFLOW Online Cloud Services were hosted on a physical node that was experiencing disk faults.
10 March 2024 22:25 UTC uniFLOW Online preemptive scaling was initiated for the morning load. This started early (06:25 SGT on 11 March) but failed due to an infrastructure issue.
On detection, Microsoft automatically initiated a self-healing process. According to Microsoft engineers, this required all resources to be moved from the affected node to a new node. The issue was not isolated to NT-ware; it affected all Microsoft customers on the impacted node, which lengthened the transfer and recovery time.
11 March 2024 00:30 UTC the load in the deployment grew beyond the available resources. From this time, user requests to our services were delayed in processing, and the delays continued to grow as the morning load increased.
11 March 2024 01:30 UTC resource allocation from the self-healing process initiated by Microsoft began restoring our services, and we saw immediate scaling and a reduction in queued jobs.
11 March 2024 02:15 UTC We continued to monitor the deployment until scaling had returned to normal and the request queues were back to nominal levels.
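For customers who want to map the incident window above to local Singapore time, the following is a minimal Python sketch. It uses only the incident start, recovery and end timestamps stated in this report. The dictionary labels and the helper names `to_sgt` and `minutes_between` are illustrative assumptions and are not part of any uniFLOW Online tooling. Singapore Time is treated as a fixed UTC+8 offset.

```python
from datetime import datetime, timedelta, timezone

# Singapore Time is a fixed UTC+8 offset with no daylight saving.
SGT = timezone(timedelta(hours=8), name="SGT")

# Timestamps taken from this report (all UTC); the labels are illustrative.
TIMELINE_UTC = {
    "incident start": datetime(2024, 3, 11, 0, 30, tzinfo=timezone.utc),
    "recovery begins": datetime(2024, 3, 11, 1, 30, tzinfo=timezone.utc),
    "incident end": datetime(2024, 3, 11, 2, 15, tzinfo=timezone.utc),
}


def to_sgt(moment):
    """Return the same instant expressed in Singapore local time."""
    return moment.astimezone(SGT)


def minutes_between(start, end):
    """Return the whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


for label, moment in TIMELINE_UTC.items():
    print(f"{label}: {moment:%d %B %Y %H:%M} UTC = {to_sgt(moment):%H:%M} SGT")

total = minutes_between(TIMELINE_UTC["incident start"], TIMELINE_UTC["incident end"])
print(f"total impact: {total} minutes")
```

Under these assumptions, the incident window of 00:30 to 02:15 UTC corresponds to 08:30 to 10:15 SGT, a total of 105 minutes.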
Next Steps
We apologize for the impact on affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):