The issue could affect any interaction with our web role service. User requests such as loading a Secure Print job list or starting a print or scan action timed out. If the action was cancelled and attempted again a few moments later, it would succeed. This intermittent behavior complicated the investigation and prolonged the eventual resolution.
Scope of Impact
The issue occurred only on the US deployment. It affected a small number of customers and occurred intermittently.
Incident Start Date and Time:
May 10th, 2022, 15:30 UTC
Incident End Date and Time:
May 21st, 2022, 07:00 UTC
Although this incident was reported and logged on the 10th of May, our investigation found that it began around the 2nd of May and gradually became more frequent. The first validated report on the 10th provided insight into the outage, and subsequent reports from the 11th to the 13th helped us identify the issue as a faulting web role.
However, the issue was intermittent, and in working with Microsoft we needed to capture and log a faulting event to help identify the cause. Several days in the week of the 16th were invested in this. To mitigate the impact on customers as much as possible, we automated the recovery of the faulting web role the moment a fault was detected. We also took steps to reduce the load on the web roles, but as load was not the root cause, this change had no overall benefit and was reverted.
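The automated recovery described above can be illustrated with a minimal sketch. This is not the tooling that was actually used, and this report does not describe how detection or recovery was implemented. The function `watch_once` and its `probe` and `recover` parameters are hypothetical names chosen for illustration: `probe` stands for any health check against a web role, and `recover` stands for whatever action restores it.

```python
from typing import Callable


def watch_once(
    probe: Callable[[], bool],
    recover: Callable[[], None],
    failures_so_far: int,
    threshold: int = 1,
) -> int:
    """Run one health probe and trigger recovery if needed.

    probe:    hypothetical health check; returns True when the role responds.
    recover:  hypothetical recovery action (for example, restarting the role).
    Returns the updated count of consecutive failures.
    """
    try:
        healthy = probe()
    except Exception:
        # A timeout or connection error counts as a failed probe.
        healthy = False

    if healthy:
        return 0  # any success resets the failure streak

    failures = failures_so_far + 1
    if failures >= threshold:
        recover()
        return 0  # start counting afresh after a recovery attempt
    return failures


if __name__ == "__main__":
    # Simulated probe results: healthy, faulting, healthy.
    results = iter([True, False, True])
    actions = []
    count = 0
    for _ in range(3):
        count = watch_once(lambda: next(results), lambda: actions.append("recover"), count)
    print(actions)
```

With the default `threshold` of 1, recovery is triggered on the first failed probe, which matches recovering "the moment a fault was detected". A higher threshold would trade recovery speed for fewer unnecessary restarts.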
The automated recovery also allowed us to keep our release date of the 21st, upgrading the US deployment to 2022.2 in line with our global deployment. The upgrade showed that the issue was not related to code or to uniFLOW Online itself, as it remained visible after the upgrade and only in the US.
On the 22nd, the NT-ware team decided to perform another deployment, this time also moving to a new underlying hardware set within Azure. This action ultimately resolved the incident and showed the fault to be within the Azure infrastructure. This was confirmed with the great support of Microsoft and their continued investigation of the issue. Microsoft has since informed us that the identified problem is now with the relevant engineering teams and is being worked on.
We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
As always, any incident is an opportunity to learn. This incident has been reviewed by the Operations teams and Management. We thank Microsoft for their strong support and in-depth investigation of this matter with our Development and Operations teams.