Intermittent failures in performing some requests
Incident Report for uniFLOW Online
Postmortem

User Impact

The issue could be seen in any interaction with our web role service. User requests such as loading a Secure Print job list or performing a print / scan action would time out. If the action was cancelled and attempted again a few moments later, it would work. This intermittent behavior complicated the investigation and sadly prolonged the eventual resolution.
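
Because a timed-out action usually succeeded when attempted again, a client-side retry with a short, growing delay is one way to ride out this kind of intermittent fault. The following is only a minimal sketch under stated assumptions: `call_with_retry`, its parameters, and the use of `TimeoutError` are illustrative and are not part of uniFLOW Online or any of its clients.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on TimeoutError with exponential backoff.

    Hypothetical helper: `action` stands in for any request that may time out,
    such as loading a job list. `sleep` is injectable so the wait can be
    replaced in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except TimeoutError:
            # Give up after the final attempt and surface the timeout.
            if attempt == attempts - 1:
                raise
            # Wait 0.5 s, 1 s, 2 s, ... before trying again.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Limiting the number of attempts keeps a persistent outage from turning into an endless loop, and the growing delay avoids adding load to a web role that is already struggling.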

Scope of Impact

This was only happening on the US deployment. Unfortunately, a small number of customers were affected, and the failures occurred intermittently.

Incident Start Date and Time:

May 10th, 2022, 15:30 UTC

Incident End Date and Time:

May 21st, 2022, 07:00 UTC

Root Cause

While this incident was reported and logged on the 10th of May, our investigation found that it began around the 2nd of May and gradually occurred more frequently from then on. The first validated report on the 10th provided insight into the outage, and subsequent reports from the 11th to the 13th helped us identify the issue as a faulting web role.

The issues, however, were intermittent, and in working with Microsoft we needed to capture and log the event to help identify the cause. Several days were invested in this during the week of the 16th. To mitigate the impact to customers as much as possible, we automated the recovery of the faulting web role the moment a fault was detected. We also took steps to reduce the load on the web roles, but as load was not the root cause, this change had no overall benefit and was reverted.
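
The automated recovery described above can be pictured as a small watchdog that probes each web role instance and triggers a recovery action for any instance that keeps failing. This is a hedged sketch only: the function name, the `probe` and `recover` callbacks, the instance names, and the consecutive-failure threshold are all assumptions made for illustration, and the actual tooling used by the Operations team is not described in this report.

```python
from typing import Callable, Dict, Iterable, List


def watchdog_pass(
    instances: Iterable[str],
    probe: Callable[[str], bool],
    recover: Callable[[str], None],
    failures: Dict[str, int],
    threshold: int = 3,
) -> List[str]:
    """Probe every instance once and recover those that keep failing.

    `probe` returns True when an instance answers correctly. `failures`
    carries the consecutive-failure count between passes. All names here
    are hypothetical placeholders.
    """
    recovered = []
    for name in instances:
        if probe(name):
            # A healthy answer resets the streak for this instance.
            failures[name] = 0
            continue
        failures[name] = failures.get(name, 0) + 1
        if failures[name] >= threshold:
            # Trigger the recovery action (for example a restart).
            recover(name)
            failures[name] = 0
            recovered.append(name)
    return recovered
```

Setting `threshold=1` corresponds to recovering an instance the moment a fault is detected; a higher value trades reaction time for fewer unnecessary restarts.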

The automated recovery also allowed us to keep our release date of the 21st, upgrading the US deployment to 2022.2 in line with our global deployment. The upgrade confirmed that the issue was not related to code or to uniFLOW Online itself, as it was still visible after the upgrade and only in the US.

The decision was made on the 21st by the NT-ware team to perform another deployment, this time also moving to a new underlying hardware set within Azure. This action ultimately resolved the incident and confirmed that the fault lay within the Azure infrastructure. This was confirmed with the great support of Microsoft and their continued investigation of the issue. Microsoft has since informed us that the identified problem is now with the relevant engineering teams and is being worked on.

Next Steps

We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Now that we have identified the exact error messages and failure conditions, we have early alerting in place and can take more decisive action should this happen again.
  • Microsoft engineers are still looking into further improvements and working on the underlying issues (which no longer affect our deployment). Any corrective actions that we can take from this to improve our service will be evaluated.
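
As an illustration of the early alerting mentioned in the first point, a check of this kind can scan recent log lines for known failure signatures and raise an alert once enough of them appear. This is a minimal sketch under stated assumptions: the patterns below are placeholders and not the actual error messages from this incident, and `should_alert` and its threshold are hypothetical.

```python
import re
from typing import Iterable, List, Pattern

# Placeholder signatures only; the real error messages are not published here.
KNOWN_SIGNATURES: List[Pattern[str]] = [
    re.compile(r"request timed out", re.IGNORECASE),
    re.compile(r"web role .* not responding", re.IGNORECASE),
]


def should_alert(log_lines: Iterable[str], min_matches: int = 5) -> bool:
    """Return True when at least `min_matches` lines match a known signature.

    `log_lines` is assumed to hold the lines from one recent time window.
    """
    matches = sum(
        1
        for line in log_lines
        if any(pattern.search(line) for pattern in KNOWN_SIGNATURES)
    )
    return matches >= min_matches
```

Requiring several matches within a window keeps a single slow request from paging the on-call team while still catching a pattern that is growing more frequent.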

As always, any incident is an opportunity to learn.  This incident has been reviewed by the Operations teams and Management. We thank Microsoft for their strong support and in-depth investigation of this matter with our Development and Operations teams.

Posted May 30, 2022 - 20:44 UTC

Resolved
Hello All,

We are pleased to report that this issue has now been confirmed as resolved.

After 4 days and multiple attempts to recreate the issue, we can no longer reproduce it. The problem was caused by a faulting infrastructure layer within the Microsoft Azure service. We greatly appreciate the strong support from Microsoft during this investigation. We have feedback that the Microsoft team is continuing to look at the underlying problem, but the uniFLOW Online service is no longer impacted.

We will be preparing a Post Mortem, which will be published here within 20 working days.

Kind Regards
NT-ware uniFLOW Online Operations Team
Posted May 25, 2022 - 01:05 UTC
Monitoring
Hello All

Following the actions taken earlier today, we are seeing positive results, and our testing and telemetry are not showing an issue. We are moving forward cautiously and will progress the status to 'Monitoring'. NT-ware Support and Operations teams will continue to monitor and will provide a further update early next week.

Kind Regards
uniFLOW Online Operations Team
Posted May 21, 2022 - 10:01 UTC
Update
Hello All

This notification is to inform you that there will be a brief outage of between 1 and 5 minutes. This outage will begin at 7:50 UTC (2 minutes from the sending of this notification) and is necessary to further resolve this issue.

Sorry for the inconvenience.

Kind Regards
NT-ware uniFLOW Online Operations.
Posted May 21, 2022 - 07:48 UTC
Update
Hello All

Just a short update to keep you informed. We have established automated corrective actions and hope this results in as little interruption as possible to our customers. The investigation is ongoing between Microsoft and NT-ware with resources working hard on both sides.

While the deployment update took place on Friday, we are continuing with further post-deployment steps as part of this case.

Thanks for your patience during this long and difficult case.

Kind Regards
NT-ware uniFLOW Online Operations Team
Posted May 21, 2022 - 04:37 UTC
Identified
Hello Everyone,

We are moving this issue up to 'Identified' at this time, but we are still working with Microsoft on a root cause. The issue is presenting itself as a web resource failing to respond. However, it only impacts one out of many web roles, depending on the time of day and the load the system is automatically scaled for. We are also only seeing this on this one role in the US and not on our 8+ other deployments globally.

We can detect the issue very quickly and are working to reduce the reaction time to minimize the impact in the field.

While this is looking very much like a Microsoft hardware / environment issue, we cannot yet pinpoint it and will continue to investigate with Microsoft. We have a planned deployment update this Friday for the US. We will take this opportunity to re-deploy into a new environment within the maintenance window.

We greatly appreciate your patience during this investigation. Our Dev and Operations teams are actively working to reduce the impact in the field.

Kind Regards,
uniFLOW Online Operations Team
Posted May 17, 2022 - 12:32 UTC
Update
Dear all,

Please note that the root cause analysis is ongoing between NT-ware and Microsoft. We have implemented a proven method of monitoring and detecting known symptoms at this stage, which allows us to mitigate the issue and keep any interruptions to a minimum. We apologize for the inconvenience caused and will keep you updated about any progress via this page.

Kind Regards
uniFLOW Online Operations Team
Posted May 16, 2022 - 14:57 UTC
Update
Hello All,

During our continued investigation we have found a potential cause for this incident. Measures have been put in place to mitigate the issue for now and we are monitoring further. So far, the results are positive and we will keep you posted about our next steps.

Kind Regards
uniFLOW Online Operations Team
Posted May 12, 2022 - 21:44 UTC
Update
Hello All

We are continuing to investigate this issue.

At this time we have monitoring and mitigation actions in place to reduce the impact to customers. Our teams are working on this problem with high priority and have additionally raised support requests with Microsoft while we work to identify the root cause.

Further updates will be provided as information becomes available.

Kind Regards
uniFLOW Online Operations Team.
Posted May 12, 2022 - 00:46 UTC
Investigating
Hello All

Through our telemetry and field feedback we are seeing intermittent failures of some requests to our US deployment. This is only affecting a very small number of tenants and is not a permanent condition.

Impact:
This can result in resources or job lists not loading or presenting an error. In most cases performing the function a second time will resolve the problem.

The Operations team will continue to investigate this issue, and we will provide an update as further information becomes available.


Regards,
uniFLOW Online Operations Team
Posted May 11, 2022 - 07:18 UTC
This incident affected: US Deployment (Identification, General Printing, Mobile Printing, Scanning, Reporting, Other Services).