Performance Degradation
Incident Report for uniFLOW Online
Postmortem

User Impact

Users on the EU deployment experienced severe service degradation throughout the system. They were unable to log in to their uniFLOW Online tenant or to devices connected to uniFLOW Online. Due to the nature of the incident, Emergency Mode on MEAP devices did not activate, even where it was configured and available.

Scope of Impact

  • Device login was not possible
  • Login to the tenant website was not possible
  • Submission and release of print jobs were heavily impacted
  • Important: No jobs were lost. Our telemetry indicates that jobs sent prior to and during the outage were processed and delivered for release.

Incident Start Date and Time

Feb 02, 2021 - 09:04 UTC

Incident End Date and Time

Feb 02, 2021 - 19:31 UTC

Root Cause

The uniFLOW Online data storage received an abnormally high number of requests, which pushed it beyond its predefined operational limits. These limits are never reached under normal conditions, but a series of events led to a ‘runaway’ condition that consumed the defined maximum throughput. A combination of the number of uniFLOW Online SmartClients in the field, the load on the system, and a registration retry mechanism flooded uniFLOW Online with registration requests. It is important to note that the number of SmartClients is not an issue in itself; the issue is how the registration process behaves under this failure condition.

This high load subsequently slowed down the web roles, so the uniFLOW Online SmartClients did not receive a successful response and retried. The retries created additional load until the data storage hit its limit, which in turn affected other services such as device login.

Although protective measures, backend configuration improvements, and scaling adjustments were put in place within the first hour of the incident being reported, the system took a long time to recover because the number of incoming requests remained high. All of these requests had to be processed and handled by the system.
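
For illustration, the retry feedback loop described above is commonly dampened on the client side with exponential backoff and random jitter, so that clients which fail at the same moment do not all retry at the same moment. The following is a minimal sketch of that general technique only. It does not show how the uniFLOW Online SmartClient actually retries, and the names `backoff_ceiling`, `register_with_backoff` and `send_registration`, as well as the default delays, are hypothetical.

```python
import random
import time
from typing import Callable, Optional


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Upper bound in seconds for the wait after failed attempt `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def register_with_backoff(
    send_registration: Callable[[], bool],  # hypothetical: returns True on success
    max_attempts: int = 8,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Try to register, waiting a random, exponentially growing time between tries."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        if send_registration():
            return True
        if attempt < max_attempts - 1:
            # "Full jitter": pick a random wait between 0 and the current ceiling,
            # which spreads the retries of many clients over time.
            sleep(rng.uniform(0.0, backoff_ceiling(attempt)))
    return False
```

With these assumed defaults, the ceiling for the wait doubles after every failed attempt (1 s, 2 s, 4 s, ...) and is capped at 300 s, so a slow backend sees fewer retries per client the longer it stays slow.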

Incident Response

Feb 02, 2021 - 09:04 UTC

The issue was reported from the field via our support channel and confirmed via our telemetry.

Feb 02, 2021 - 09:30 UTC

NT-ware Operations Team confirmed field reports, collected metrics and escalated to DevOps.

Feb 02, 2021 - 10:00 UTC

First corrective actions were taken to manage the load and number of requests. These initial actions did not correct the issues as expected.

A deeper investigation highlighted the nature of the issue and the combination of events that led to the outage. The initial action, scaling up our capacity to accept SmartClient and device requests, only placed more strain on the data store.

Feb 02, 2021 - 12:00 UTC

The balance of the system had been restored but, by this stage, we had a large number of requests to process, and this continued until early evening. Due to the nature of the issue, we could not risk dropping these requests and unbalancing the system further, so we decided to allow our web service to recover naturally. This lengthened the recovery but was the safest option available at the time.

Feb 02, 2021 - 19:31 UTC

Full operational control was restored and, with mitigations in place, we received no further reports of the issue in the days following the incident.

Lessons Learnt

  1. Extend alerting metrics so that we can react faster and prevent the affected components from entering a state of abnormally high load.
  2. 2021.2 will introduce a mechanism allowing us to throttle incoming requests to ensure the backend remains responsive.
  3. The uniFLOW Online SmartClient and device applets will receive further improvements to reduce the overall communication footprint.
  4. Emergency Mode behavior will be reviewed to more reliably detect performance degradation and treat it as an emergency.
  5. Devices connected to a local uniFLOW Server in combination with uniFLOW Online (hybrid setup) will receive improved fallback handling for cases where uniFLOW Online is unresponsive.
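
For illustration, one common way to throttle incoming requests, as mentioned in item 2 above, is a token bucket: each request spends one token, tokens refill at a fixed rate, and requests that arrive when the bucket is empty are rejected instead of being passed on to the backend. This is a minimal sketch of that general technique under assumed parameters. It is not a description of the mechanism planned for 2021.2, and the `TokenBucket` class and its `rate` and `capacity` values are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A front end using such a limiter would typically answer a rejected request with an explicit "try again later" response, so that the caller backs off rather than retrying immediately.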

Customer Post Actions

  1. Although outdated components were not a cause of this outage, we encourage customers to keep their on-premise components for uniFLOW Online, such as the uniFLOW Online SmartClient, up to date to benefit from the most recent and future improvements.
Posted Feb 10, 2021 - 12:50 UTC

Resolved
This incident has been resolved. All systems are back to normal operation.
Posted Feb 02, 2021 - 19:34 UTC
Update
Systems are still processing and recovery is going well. However, the overall system performance is still impacted.

We'll continue to monitor and update you later today when we're back to normal.
Posted Feb 02, 2021 - 18:07 UTC
Update
The system is still recovering and we're continuing to monitor the recovery.

We'll post another update in about 60 minutes.

We're sorry for the inconvenience caused!
Posted Feb 02, 2021 - 16:40 UTC
Update
The system is still recovering and we're continuing to monitor the recovery.

We'll post another update in about 60 minutes.
Posted Feb 02, 2021 - 15:26 UTC
Monitoring
The system is still recovering and you will currently still experience timeouts and slowness.
We're monitoring the situation to ensure the recovery is continuing as expected.

We'll post another update once we're fully back to normal.
Posted Feb 02, 2021 - 12:53 UTC
Update
We are continuing to work on a fix for this issue.
Posted Feb 02, 2021 - 11:45 UTC
Identified
Incident Scope:
uniFLOW Online portal
Device Login


Description:
We identified the cause of the issue and applied a fix. The systems are now stabilizing and going back to normal.
Web requests are still processed slower than usual but this will also be back to normal shortly.

Next Update:
The next update will be in 60 minutes.

Kind regards,
The uniFLOW Online team
Posted Feb 02, 2021 - 10:38 UTC
Investigating
Incident details:

Identified: Feb 2nd, 2021 - 9:19am

Incident Scope:
uniFLOW Online portal
Device Login


Description:
We have received several reports that device login and login to uniFLOW Online are slow.
We are already investigating the cause and will update you as quickly as possible.


Next Update:
The next update will be in 60 minutes.

Kind regards,
The uniFLOW Online team
Posted Feb 02, 2021 - 09:42 UTC
This incident affected: EU Deployment (Identification).