Notification of new incident: Performance degradation, device connectivity issues: SG Deployment
Incident Report for uniFLOW Online
Postmortem

User Impact 

Devices switched to emergency mode, and printing and scanning actions responded slowly or timed out. The general tenant UI may also have been slow or unresponsive.

Scope of Impact 

Singapore Deployment 

Incident Start Date and Time 

11 March 2024 00:30 UTC 

Incident End Date and Time 

11 March 2024 02:15 UTC 

Root Cause 

Following an investigation with Microsoft, it was confirmed that the uniFLOW Online Cloud Services were hosted on a physical node that was experiencing disk faults.

10 March 2024 22:25 UTC (06:25 SGT on 11 March) uniFLOW Online preemptive scaling was initiated early for the morning load, but it failed due to an infrastructure issue.

On detection, Microsoft automatically initiated a self-healing process. Feedback from Microsoft engineers was that this required all resources to be moved from the affected node to a new node. The issue was not isolated to NT-ware but affected all Microsoft customers on the impacted node, which lengthened the transfer and recovery time.

11 March 2024 00:30 UTC the load in the deployment grew beyond the available resources. From this time, user requests to our services were delayed in processing, and the backlog continued to grow as the morning load increased.

11 March 2024 01:30 UTC resources allocated by the Microsoft-initiated self-healing process began restoring our services, and we saw immediate scaling and a reduction in queued jobs.

11 March 2024 02:15 UTC We continued monitoring until all scaling had returned to normal and the request queues were back to nominal levels.
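
For readers working in local time, the following is a minimal sketch (not part of NT-ware's tooling) that converts the UTC timestamps above to Singapore time and derives the durations they imply. It assumes SGT is a fixed UTC+8 offset; the event labels and the helper names `to_sgt` and `minutes_between` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: SGT is a fixed UTC+8 offset (Singapore observes no daylight saving time).
SGT = timezone(timedelta(hours=8), "SGT")

# Timestamps taken from the timeline above, all in UTC. Labels are paraphrased.
events = {
    "load exceeded available resources": datetime(2024, 3, 11, 0, 30, tzinfo=timezone.utc),
    "self-healing began restoring services": datetime(2024, 3, 11, 1, 30, tzinfo=timezone.utc),
    "scaling and queues back to normal": datetime(2024, 3, 11, 2, 15, tzinfo=timezone.utc),
}


def to_sgt(ts):
    """Convert a timezone-aware timestamp to Singapore time."""
    return ts.astimezone(SGT)


def minutes_between(start, end):
    """Whole minutes elapsed between two timezone-aware timestamps."""
    return int((end - start).total_seconds() // 60)


# Print each event in both UTC and SGT.
for label, ts in events.items():
    print(f"{ts:%H:%M} UTC = {to_sgt(ts):%H:%M} SGT  {label}")

start, recovering, end = events.values()
print("minutes until recovery began:", minutes_between(start, recovering))
print("total incident minutes:", minutes_between(start, end))
```

Under these assumptions, the impact window of 00:30 to 02:15 UTC corresponds to 08:30 to 10:15 SGT, a total of 105 minutes, with recovery beginning 60 minutes in.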

Next Steps 

We apologize for the impact on affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

  • Discuss hardware availability and resource utilization with Microsoft so that NT-ware and uniFLOW Online are not impacted by platform issues of this type in the future.
  • Review and discuss improvements to our monitoring and alerting to provide faster indication of issues of this type in the future.
Posted Mar 17, 2024 - 21:51 UTC

Resolved
Hello Everyone,

Update: Incident Resolved.

Date/Time: March 11th, 2:15 AM UTC

This incident is now resolved, with all services back online and all resources showing as normal. Preliminary investigation by NT-ware Operations and Microsoft engineers points to a transient issue in the web role scaling. This is currently being reviewed by the Microsoft back-end engineering team.

We will review the findings and information collected from this incident to further improve our online services. As this incident is rated 'Major', a postmortem will be published within a maximum of 20 business days, once a thorough investigation has been completed.

We are sorry for the inconvenience this has caused.

Kind Regards
Online Operations Team
Posted Mar 11, 2024 - 02:24 UTC
Monitoring
Update:
The Operations team are seeing a strong recovery across our services, and devices should already be reconnecting.
Monitoring will continue as we validate and confirm the recovery process.

Next Update:
The next update will be in 30 minutes.
Posted Mar 11, 2024 - 02:06 UTC
Investigating
Incident details: Our monitoring systems have alerted us to a performance degradation.

Start Time:
0:40 UTC 11-03-2024, Monday

Incident Scope: 
Singapore Deployment.

Description:
The NT-ware Online Operations team are investigating this issue and its cause.
Recovery and remediation processes have started automatically.
The team is monitoring this closely to see if manual intervention is required.

Next Update:
The next update will be in 30 minutes.
Posted Mar 11, 2024 - 01:43 UTC
This incident affected: SG Deployment (Identification, Printing, Email print, Other services).