User Impact
Users may have experienced login failures on their devices and SmartClients.
Scope of Impact
UK Deployment
Incident Start Date and Time
21 November 2023 09:20 UTC
Incident End Date and Time
21 November 2023 11:30 UTC
Root Cause
Investigation showed that this incident was caused by a failure on the Microsoft Azure platform. During a scale-out of services, Azure locks the resource. In this case, the lock process caused an internal exception that blocked the scale-out from completing. In this condition, the service’s remaining resources were consumed, so requests were not handled in a timely manner and ultimately failed. The NT-ware Operations team attempted a manual scale-out, but this was also blocked by the failed lock process, which was eventually released at 10:45 UTC.
The impacted Azure service is responsible for uniFLOW Online authentication for devices and SmartClients, so the failure degraded our ability to accept and process authentication requests.
The incident was identified through internal telemetry, which flagged the failed components, and through field reports.
Resolution
At 10:45 UTC the service began accepting scaling requests, and the NT-ware Operations team manually scaled it to the required number of instances. This started the recovery process, with authentication requests being handled successfully. The service was monitored further and confirmed to be fully restored at 11:30 UTC.
Overnight, the NT-ware Operations team redeployed the service to separate resources to move it away from the Microsoft Azure resources affected by the failure.
Next Steps
We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):