Notification of new incident. Failure and Delay of Print and Scan Jobs : EU
Incident Report for uniFLOW Online
Postmortem

User Impact 

Users were unable to release print jobs on the device. Jobs were visible on the device, but releasing them would time out. Scanning was also impacted: scanned documents did not reach the selected destination.

Scope of Impact 

Europe (EU) Deployment 

Incident Start Date and Time 

August 12, 2024, 7:25 UTC 

Incident End Date and Time 

August 12, 2024, 8:50 UTC 

Root Cause 

uniFLOW Online worker roles reached an operational limit, so requests were queued while waiting for resources. The operational limit was the result of an architectural migration made a week earlier.

Following Microsoft EOL directives, we had scheduled maintenance to move to a new architecture as part of the standard Azure component lifecycle. This was completed a week earlier, tested without issue, and ran in production for the following week (starting 5th August). However, the migration did not carry across a specific scaling limit that had previously been set, and this omission was not identified during our testing and validation. The new value was below the operational limit that uniFLOW Online requires.
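
As an illustration of how a dropped setting of this kind could be caught, the following is a minimal sketch of a pre/post-migration configuration comparison. The setting names and values are hypothetical and are not taken from the uniFLOW Online deployment; the sketch only assumes that the relevant settings can be exported as key/value pairs before and after a migration.

```python
def find_config_drift(before: dict, after: dict) -> dict:
    """Return settings whose value changed or went missing after a migration.

    Maps each drifted setting name to a (old_value, new_value) tuple.
    new_value is None when the setting was not carried across at all.
    """
    drift = {}
    for key, old_value in before.items():
        new_value = after.get(key)
        if new_value != old_value:
            drift[key] = (old_value, new_value)
    return drift


# Hypothetical exported settings, for illustration only.
before_migration = {"max_worker_instances": 40, "min_worker_instances": 4}
after_migration = {"max_worker_instances": 10, "min_worker_instances": 4}

for name, (old, new) in find_config_drift(before_migration, after_migration).items():
    print(f"{name}: {old} -> {new}")
```

A check like this flags any limit that silently reverted to a default, independent of whether the load during testing is high enough to hit it.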

On the 12th we saw a very high load as many people in Europe returned to work from summer holidays. The issue was not seen in the week of the 5th because load was much lower during the holiday period.

How did we respond 

8:02 UTC: Support and Operations team members collaborated on validation and remediation actions.

8:23 UTC: The Operations team raised a public status page. Resources supporting the worker roles were over-provisioned to quickly work through the queued requests and return the system to normal operation.

8:50 UTC: On full-service recovery the operational limits were re-established. 
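
The effect of over-provisioning on a backlog can be estimated with simple arithmetic: the queue only shrinks at the rate by which processing capacity exceeds incoming requests. The sketch below works through that calculation. All figures are hypothetical and are not measurements from this incident.

```python
def drain_time_minutes(backlog: int, arrival_per_min: float, capacity_per_min: float) -> float:
    """Estimate the minutes needed to clear a backlog of queued requests.

    The backlog shrinks at (capacity - arrival) requests per minute.
    If capacity does not exceed the arrival rate, the queue never drains.
    """
    if capacity_per_min <= arrival_per_min:
        return float("inf")
    return backlog / (capacity_per_min - arrival_per_min)


# Hypothetical figures: 6000 queued requests, 200 new requests per minute.
print(drain_time_minutes(6000, 200, 150))  # capacity below demand: never drains
print(drain_time_minutes(6000, 200, 500))  # over-provisioned capacity
```

With these assumed numbers, capacity below the arrival rate leaves the queue growing indefinitely, while raising capacity to 500 requests per minute clears it in 20 minutes.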

Next Steps 

We apologize for the impact on affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

  • We have reviewed this incident across the global operations team and key management members. 

  • Monitoring and alerting improvements have been identified and scheduled for implementation.

Posted Aug 14, 2024 - 07:03 UTC

Resolved
Hello Everyone,

Update: Incident Resolved.

Date/Time: August 12th, 8:50 am UTC

Printing and Scanning have returned to normal.

We will review the findings and collected information from this incident to further improve our online services. There will be a postmortem published within 10 business days once a thorough investigation has been completed.

Preliminary finding:
The scaling of resources within the EU deployment stopped prematurely, resulting in a large number of jobs queueing to be processed. Immediate mitigation actions were started to meet the job demand and process the backlog quickly.

We are sorry for the inconvenience this has caused.

Kind Regards
Online Operations Team
Posted Aug 12, 2024 - 08:52 UTC
Investigating
Incident details:
Our telemetry and customer field reports have confirmed extended delays or failures with print and scan jobs.

Start Time:
6:00 am Monday the 12th (UTC)

Incident Scope: 
All tenants on the Europe (EU) Deployment.

Description:
Print jobs and scan jobs are not processing. You can see them on the device UI, but the release process fails.
This is being treated with the highest priority by the NT-ware Operations Team.

Next Update:
The next update will be in 30 minutes.
Posted Aug 12, 2024 - 08:23 UTC
This incident affected: EU Deployment (Printing, Scanning).