UK: Notification of new incident. Authentication failures affecting logins.
Incident Report for uniFLOW Online
Postmortem

User Impact 

Users experienced login failures on their devices and SmartClients. 

Scope of Impact 

UK Deployment

Incident Start Date and Time 

21 November 2023 09:20 UTC 

Incident End Date and Time 

21 November 2023 11:30 UTC 

Root Cause 

Investigation showed that this incident was caused by a failure on the Microsoft Azure platform. During a scale-out of services, Azure locks the resource. In this case, the locking process raised an internal exception that blocked the scale-out from completing. In this condition, the service’s remaining resources were consumed, so requests were not handled in a timely manner and ultimately failed. The operations team attempted a manual scale-out, but this was also blocked by the failed lock, which was eventually released at 10:45 UTC. 

The impacted Azure service is responsible for uniFLOW Online authentication for devices and SmartClients, so the failure degraded our ability to accept and process authentication requests. 

The incident was identified through internal telemetry, which flagged the failed components, and through the field reports that were raised. 

Resolution 

At 10:45 UTC the service began accepting scaling requests again, and the NT-ware Operations team manually scaled the service to the required number of instances. This started the recovery process, and authentication requests were handled successfully from that point. Recovery was monitored further and the service was confirmed to be fully restored at 11:30 UTC. 

The NT-ware Operations team redeployed the service to separate resources overnight to move away from the Microsoft Azure resources that were impacted by the failure. 
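
For readers who want the timeline as durations, the timestamps reported above can be turned into minutes with a short calculation. This is only an illustrative sketch: the variable and function names are hypothetical, and the three timestamps are the only values taken from this report.

```python
from datetime import datetime, timezone

# Timestamps as stated in this report (all UTC, 21 November 2023).
incident_start = datetime(2023, 11, 21, 9, 20, tzinfo=timezone.utc)
lock_released = datetime(2023, 11, 21, 10, 45, tzinfo=timezone.utc)
fully_restored = datetime(2023, 11, 21, 11, 30, tzinfo=timezone.utc)


def minutes_between(earlier: datetime, later: datetime) -> int:
    """Return the whole number of minutes from `earlier` to `later`."""
    return int((later - earlier).total_seconds() // 60)


# Time during which scale-out was blocked by the failed lock.
blocked_minutes = minutes_between(incident_start, lock_released)

# Time from the manual scale-out becoming possible to full restoration.
recovery_minutes = minutes_between(lock_released, fully_restored)

# Total incident duration.
total_minutes = minutes_between(incident_start, fully_restored)

print(blocked_minutes, recovery_minutes, total_minutes)
```

Under these timestamps, the scale-out was blocked for 85 minutes, recovery took a further 45 minutes, and the incident lasted 130 minutes in total.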

Next Steps 

We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

  • We are reviewing the service failure with Microsoft engineers. 
  • NT-ware Operations are further reviewing our service architecture usage.
  • The incident was internally reviewed by Operations and Management.
Posted Nov 23, 2023 - 10:24 UTC

Resolved
Hello Everyone,

Update: Incident Resolved.

Date/Time:
21 November 2023
11:30 UTC

We will review the findings and information collected from this incident to further improve our online services. For Major incidents, a postmortem will be published (within a maximum of 20 business days) once a thorough investigation has been completed.

We are sorry for the inconvenience this has caused.

Kind Regards
Online Operations Team
Posted Nov 21, 2023 - 11:43 UTC
Monitoring
Update:
The issue has been resolved and authentication services are now recovering.
We are monitoring the recovery process and will provide an update once the services are fully recovered.

Next Update:
The next update will be in 1 hour or once services have fully recovered.
Posted Nov 21, 2023 - 11:14 UTC
Identified
Incident details:

Start Time:
21 November 2023
09:20 UTC

Incident Scope: 
UK Deployment

Description:
Customers may have issues logging in to their tenants and printers.

Next Update:
The next update will be in 1 hour.
Posted Nov 21, 2023 - 10:45 UTC
This incident affected: UK Deployment (Identification).