Notification of incident. Device latency and tenants unavailable: Now Resolved: US
Incident Report for uniFLOW Online
Postmortem

User Impact

During the incident, users were unable to access the portal, log in, or release print jobs at the device. User activities either returned an immediate error or were delayed and eventually timed out.

Scope of Impact:

US Deployment

Incident Start Date and Time

September 18, 2024, 10:30 AM UTC

Incident End Date and Time

September 18, 2024, 4:15 PM UTC

Root Cause:

Preliminary Findings: 30-09-2024

During a planned scaling operation triggered by uniFLOW Online, an issue was encountered on the Microsoft Azure platform. The event was not presented as a hard error but stalled the scale-out of resources. NT-ware Operations and Microsoft engineers are continuing to investigate the cause of the platform failure.

NT-ware will update this Postmortem with the results of this investigation. In the meantime, we are putting changes in place to mitigate such issues, as listed below under ‘Next Steps’.

Updated RCA Findings: 17-10-2024 

Microsoft performed an exhaustive investigation into the issue experienced on the Azure platform but was unable to provide a conclusive RCA. While it is accepted that there was a faulting component, further investigation is not possible because the recovery efforts on the day required the faulting cloud services to be deallocated.

Together with Microsoft engineers, NT-ware has identified improvements to provide resilience against such an issue in the future. By spanning multiple cloud service instances, we will no longer depend on a single cloud service cluster. Azure cloud services are very reliable, but as this case showed, they can falter. These improvements are already with our planning and development departments and are listed below under ‘Next Steps’.
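To illustrate the idea behind spanning multiple cloud service instances, the minimal sketch below distributes requests across several clusters and skips any cluster that is currently unhealthy. The cluster names and the is_healthy() helper are hypothetical placeholders, not part of uniFLOW Online or the Azure SDK.

```python
# Hypothetical sketch: spread requests across several cloud service clusters
# and skip any cluster that is currently unhealthy. Cluster names and
# is_healthy() are illustrative placeholders, not uniFLOW Online internals.
import itertools

CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]  # placeholder names
_rotation = itertools.cycle(CLUSTERS)

def is_healthy(cluster: str) -> bool:
    """Placeholder: query platform health for the given cluster."""
    raise NotImplementedError

def pick_cluster() -> str:
    """Round-robin over clusters, skipping any that report unhealthy."""
    for _ in range(len(CLUSTERS)):
        candidate = next(_rotation)
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("No healthy cluster available")
```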

How did we respond:

10:30 UTC: Time-of-day scaling occurred to scale out as planned to accommodate the upcoming load. Some of these instances became stuck in a “Starting” state during the scale event. Because they remained in this “Starting” state, further scaling could not attempt to start additional instances.

Alerting also did not trigger at this point, as the current alerting was set to trigger on Fail or Stop events. Because this was a standard transition event from the stopped state to the started state, it was not covered by our alert platform. This will be treated as an improvement and is listed below.
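As a minimal sketch of the kind of check that was missing, the snippet below polls instance states and raises an alert when a transition state such as “Starting” persists beyond a threshold. The get_instance_states() and send_alert() helpers, the threshold, and the polling interval are assumptions for illustration, not part of uniFLOW Online or the Azure tooling.

```python
# Hypothetical sketch: alert when an instance lingers in a transition state
# such as "Starting", instead of alerting only on Fail or Stop events.
# get_instance_states() and send_alert() are placeholders, not real APIs.
import time
from datetime import datetime, timedelta

TRANSITION_STATES = {"Starting", "Stopping"}
MAX_TRANSITION_TIME = timedelta(minutes=10)  # assumed threshold
POLL_INTERVAL = 60  # seconds between polls

def get_instance_states() -> dict:
    """Placeholder: return {instance_id: state} from the platform API."""
    raise NotImplementedError

def send_alert(message: str) -> None:
    """Placeholder: forward the message to the alerting platform."""
    print(f"ALERT: {message}")

def watch_transitions() -> None:
    first_seen = {}  # instance_id -> time the transition state was first seen
    while True:
        now = datetime.utcnow()
        for instance_id, state in get_instance_states().items():
            if state in TRANSITION_STATES:
                started = first_seen.setdefault(instance_id, now)
                if now - started > MAX_TRANSITION_TIME:
                    send_alert(f"{instance_id} stuck in '{state}' since {started:%H:%M} UTC")
            else:
                first_seen.pop(instance_id, None)  # transition completed normally
        time.sleep(POLL_INTERVAL)
```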

11:40 UTC: An additional scale-out took place that created the instances, but they did not start because of the above-mentioned instances being stuck. Additional autoscaling based on CPU load also failed to start instances for the same reason.

12:00 UTC: Webrole CPU usage started growing due to the start of the business day (8 AM Eastern Time).

12:10 UTC: The running Webrole instances reached 100% CPU usage and delays started to become noticeable.

12:48 UTC: The Operations team was notified by Support that there might be an ongoing issue, and an investigation started immediately.

13:36 UTC: A high-priority incident was raised with Azure Support for additional assistance, as the problem appeared to be on the Azure side.

14:20 UTC: A process was started to allow us to execute our disaster recovery strategy and recover services.

15:45 UTC: We could see that recovery of services had started following the execution of our disaster recovery strategy.

16:15 UTC: Services fully recovered.

Emergency Mode

It should be noted that Emergency Mode did not activate during this incident. Emergency Mode detects a hard failure of the web service, not a delayed outage. In this case, the availability check reported that the service was available.
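As a minimal sketch of why a hard-failure check misses this kind of outage, the probe below treats a slow response as unhealthy as well as a failed one. This illustrates the planned improvement in general terms; the endpoint URL and latency budget are assumptions, and this is not the actual Emergency Mode detection logic.

```python
# Hypothetical sketch: a health probe that treats a slow response as unhealthy,
# not only a hard failure. This is not the actual Emergency Mode logic; the
# endpoint URL and the latency budget are assumptions for illustration.
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.invalid/health"  # placeholder endpoint
LATENCY_BUDGET = 5.0  # seconds; assumed acceptable response time

def probe() -> bool:
    """Return True only if the endpoint answers correctly *and* quickly."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=LATENCY_BUDGET) as resp:
            healthy = resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False  # hard failure or timeout
    elapsed = time.monotonic() - start
    return healthy and elapsed <= LATENCY_BUDGET
```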

Next Steps:

We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Operations will review our current alerting, as it was not triggered. There were alerts targeting scaling, but they looked for a failure in the scaling. Azure never reported an error, only that the scaling was ‘Starting’. The alerting will be revised to alert on transition states if they take an abnormal time to complete (i.e. a stalled state).
  • We have also prioritised endpoint measurements to additionally alert on the user experience, supplementing the platform alerting (a minimal sketch follows this list).
  • While Emergency Mode is a very robust feature and is recommended to all customers, in this specific fail condition it did not work as required. An improvement to the detection methods and triggering of Emergency Mode has been ticketed and will be introduced in early 2025.
  • We are prioritizing improvements to advance our load balancing and fault tolerance methodology. This will allow the uniFLOW Online platform to draw on resources in cases where services enter a faulty state outside of our control.
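As a rough sketch of the endpoint/user-experience measurement mentioned in the list above, the snippet below samples a user-facing request repeatedly and alerts when the 95th-percentile latency over a rolling window exceeds a budget. The URL, window size, and budget are assumptions for illustration only, not actual uniFLOW Online values.

```python
# Hypothetical sketch of an endpoint / user-experience measurement: sample a
# user-facing request repeatedly and alert when the 95th-percentile latency
# over a rolling window exceeds a budget. URL, window size and budget are
# assumptions, not actual uniFLOW Online values.
import statistics
import time
import urllib.request
from collections import deque

PORTAL_URL = "https://example.invalid/portal"  # placeholder endpoint
P95_BUDGET = 3.0                               # seconds, assumed budget
WINDOW = deque(maxlen=20)                      # keep the last 20 samples

def sample_latency() -> float:
    """Time a single user-facing request end to end."""
    start = time.monotonic()
    with urllib.request.urlopen(PORTAL_URL, timeout=30) as resp:
        resp.read()
    return time.monotonic() - start

def check_user_experience() -> None:
    WINDOW.append(sample_latency())
    if len(WINDOW) >= 10:
        p95 = statistics.quantiles(WINDOW, n=20)[18]  # ~95th percentile cut point
        if p95 > P95_BUDGET:
            print(f"ALERT: p95 portal latency {p95:.1f}s exceeds {P95_BUDGET}s budget")
```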

Was this incident related to previous incidents?

No, this is not related to previous incidents.

Posted Oct 17, 2024 - 05:12 UTC

Resolved
Hello Everyone,

Update: Incident Resolved.

Date/Time:
Start = September 18th, 12:45 PM UTC
End = September 18th, 4:16 PM UTC

We are sorry for the inconvenience this has caused.

Additional update:
There will be a post-mortem published once we have concluded our investigation. We will publish a preliminary report within 24 hours, and a full post-mortem will be published in no longer than 10 business days.

Preliminary report:
Our initial investigation pointed to a potential Azure infrastructure degradation. During the incident we reached out to Microsoft for assistance in resolving the issue. Due to the impact and the duration of the incident it was decided to execute our disaster recovery strategy in order to recover services.
We have now started a full investigation into the incident, and we have also involved Microsoft for further assistance. A full post-mortem will be published within the next 10 business days.

Kind Regards
Online Operations Team
Posted Sep 18, 2024 - 16:33 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 18, 2024 - 16:10 UTC
Update
Update:
Recovery is in progress; we continue to monitor, and system metrics show improvement.

Next Update:
The next update will be in 60 minutes.
Posted Sep 18, 2024 - 16:09 UTC
Update
Hello Everyone,

we're sorry for the time this is taking. We are still working diligently to recover the system to normal operation.

The next update will be in ~60 minutes.

Kind regards,
Online Operations Team
Posted Sep 18, 2024 - 15:07 UTC
Update
Hello Everyone,

recovery is still ongoing and Operations continue to help speed up the process.

The next update will be in ~30 minutes.

Kind regards,
Online Operations Team
Posted Sep 18, 2024 - 14:33 UTC
Update
Hello Everyone,

recovery is still ongoing and Operations are actively working to help speed up the process.

The next update will be in ~30 minutes.

Kind regards,
Online Operations Team
Posted Sep 18, 2024 - 14:02 UTC
Monitoring
Incident details:

Recovering: The issue has been identified and mitigations have been applied. The system is currently recovering and we are monitoring it closely.

Start Time:
Sept 18th, 12:45 PM UTC

Incident Scope: 
US Deployment

Description:
Monitoring has shown increased latency which was also confirmed by several customers on the US deployment.
As a consequence, affected users could not log into devices or their tenants.

Next Update:
The next update will be in 30 minutes.
Posted Sep 18, 2024 - 13:27 UTC
This incident affected: US Deployment (Identification, Printing, Email print, Scanning, Reporting, Other services).