Users on the EU deployment experienced severe service degradation throughout the system. They were unable to log in to their uniFLOW Online tenant or to devices connected to uniFLOW Online. Due to the nature of the incident, Emergency Mode on MEAP devices did not activate where configured/available.
Incident start: Feb 02, 2021 - 09:04 UTC
Incident resolved: Feb 02, 2021 - 19:31 UTC
The uniFLOW Online data storage was receiving an abnormal number of requests, pushing it beyond its predefined operational limits. While these limits are never reached under normal conditions, a series of events created a 'runaway' condition that consumed the defined maximum throughput. A combination of the number of uniFLOW Online SmartClients in the field, the load on the system, and a registration retry mechanism flooded uniFLOW Online with registration requests. It is important to note that the number of SmartClients is not an issue on its own; the problem lies in the registration process under this failure condition.
This high load slowed down the web roles, so the uniFLOW SmartClients did not receive a successful response and retried their requests. Each retry created additional load, until the data storage hit its limit and other services, such as device login, were affected as well.
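The report does not disclose the SmartClient retry logic, but the feedback loop it describes can be sketched in a few lines. In the Python sketch below, `register`, its endpoint argument, and all timings are hypothetical; the second variant shows capped exponential backoff with jitter, the standard way such a loop is bounded.

```python
import random
import time

def register(endpoint: str) -> bool:
    """Hypothetical stand-in for the SmartClient registration request;
    here it simply simulates a backend failing most of the time."""
    return random.random() < 0.2

def register_naive(endpoint: str) -> None:
    # Failure mode: an immediate, unbounded retry loop. The slower the
    # backend responds, the more traffic it receives, which is exactly
    # the 'runaway' condition described above.
    while not register(endpoint):
        pass  # retry immediately, amplifying the load

def register_with_backoff(endpoint: str, base: float = 1.0,
                          cap: float = 300.0) -> None:
    # Standard remedy: capped exponential backoff with "full jitter".
    # Each failure doubles the ceiling on the wait, and the random
    # jitter spreads clients out so they do not retry in lockstep.
    delay = base
    while not register(endpoint):
        time.sleep(random.uniform(0, delay))
        delay = min(cap, delay * 2)
```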
Although protective measures, backend configuration improvements, and scaling adjustments were put in place within the first hour of the incident being reported, the system took a long time to recover because of the continuing high volume of incoming requests, all of which had to be processed by the system.
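The report does not specify which protective measures were applied. One common protection against this kind of flood is server-side load shedding; the sketch below uses a hypothetical token-bucket limiter, with all names and rates invented for illustration rather than taken from the actual uniFLOW Online implementation.

```python
import time

class TokenBucket:
    """Hypothetical token-bucket limiter: one plausible form of the
    'protective measures' mentioned above (the actual measures are
    not specified in the report)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed this request before it reaches the data store

# Usage sketch: cap registrations at 100 requests/s (an invented figure)
# so the data storage stays inside its provisioned throughput.
limiter = TokenBucket(rate=100.0, capacity=200.0)

def handle_registration(request) -> int:
    if not limiter.allow():
        return 429  # Too Many Requests: tells the client to back off
    # ... forward the registration to the data storage ...
    return 200
```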
Feb 02, 2021 - 09:04 UTC
The issue was reported from the field via our support channel and confirmed through our telemetry.
Feb 02, 2021 - 09:30 UTC
The NT-ware Operations Team confirmed the field reports, collected metrics, and escalated to DevOps.
Feb 02, 2021 - 10:00 UTC
The first corrective actions were taken to manage the load and the number of requests. These initial actions did not resolve the issue as expected.
A deeper investigation revealed the nature of the issue and the combination of events leading to the outage. The initial action of scaling up our capacity to accept SmartClient and device requests only placed more strain on the data store.
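Why scaling up made things worse follows from the fixed throughput of the data store: more web capacity simply forwards more of the flood to the bottleneck. The sketch below uses invented capacities purely for illustration; the report gives no actual figures.

```python
# Invented capacities, for illustration only; the report gives no
# actual figures.
STORE_LIMIT = 1_000   # data-store throughput in requests/s (fixed)
PER_ROLE = 400        # requests/s each web role can accept and forward

for web_roles in (2, 4, 8):
    offered = web_roles * PER_ROLE            # load reaching the data store
    overload = max(0, offered - STORE_LIMIT)  # excess above the fixed limit
    print(f"{web_roles} web roles: {offered} req/s offered, "
          f"{overload} req/s over the store limit")
# e.g. 8 web roles: 3200 req/s offered, 2200 req/s over the store limit
```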
Feb 02, 2021 - 12:00 UTC
The balance of the system had been restored, but by this stage we had a large backlog of requests to process, which continued until early evening. Due to the nature of the issue, we could not risk dropping these requests and unbalancing the system further, so we decided to allow our web service to recover naturally. This lengthened the recovery but was the safest option available at the time.
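The length of the recovery is consistent with simple queueing arithmetic: a backlog shrinks only at the difference between the processing rate and the arrival rate. The figures below are hypothetical, chosen only to show the order of magnitude involved in draining such a queue.

```python
# Invented figures; the report gives no actual rates or queue sizes.
backlog = 3_000_000   # requests queued when balance was restored (~12:00)
arrival_rate = 900    # new requests/s still arriving
service_rate = 1_000  # requests/s the stabilised system can handle

drain_rate = service_rate - arrival_rate      # net 100 req/s of headroom
hours = backlog / drain_rate / 3600
print(f"Backlog drains in about {hours:.1f} hours")  # ~8.3 hours
```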
Feb 02, 2021 - 19:31 UTC
Full operational control was restored and, with the mitigations in place, we saw no recurrence of the event in the days following the incident.