Service Degradation
Incident Report for uniFLOW Online
Postmortem

Issue Date: 06-08-2020

Region Affected: EU

Outage Description: The EU deployment experienced slow response times across all services. The impact included, but was not limited to, delays in logging in to devices or the portal, in device and SmartClient registration, and in job release. Service delivery was impacted by high load on the system.

Root Cause Analysis: The high load was traced to an excessive number of SmartClient registration requests. A series of events and conditions exposed a flaw in the SmartClient registration retry function. The flaw allowed the service to get into a loop, which flooded the service queue and storage accounts beyond their defined scaling limits. This took place during the rollout of a large customer, which multiplied the effect. Our investigation found that the initial creation of one or more locations was corrupted, resulting in orphaned entries that the SmartClient could not resolve. This forced a re-registration across thousands of users trying to load 700+ locations. The registration would often time out and force the SmartClient to keep retrying, resulting in an endlessly growing number of requests.
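
For illustration only: this report does not describe the SmartClient's actual retry code or the configuration change that was deployed. A common way to keep timed-out clients from flooding a service is to cap the number of retries and space them out with exponential backoff and jitter. The following is a minimal Python sketch of that general technique. The names `register`, `register_with_retry` and `backoff_delay`, and all parameter values, are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (0-based).

    The delay grows exponentially, is clamped to `cap`, and is then scaled
    by a random factor so that many clients do not retry at the same instant.
    """
    return rng() * min(cap, base * (2 ** attempt))


def register_with_retry(register, max_attempts=5, base=1.0, cap=60.0,
                        sleep=time.sleep, rng=random.random):
    """Call `register()` until it succeeds or the attempt budget is spent.

    `register` is a hypothetical callable that raises TimeoutError when the
    registration request times out.
    """
    for attempt in range(max_attempts):
        try:
            return register()
        except TimeoutError:
            if attempt == max_attempts - 1:
                # Budget spent: give up instead of retrying forever.
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

With a bounded attempt count, a registration that can never succeed (for example, one that depends on an unresolvable entry) produces a fixed number of requests per client rather than an unbounded stream.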

Mitigation Action Plan: Once the cause was identified, an appropriate mitigation plan was put in place and tested internally. This resulted in a configuration update to address the SmartClient retry method and remove the orphaned location entries.

Future Proof and Monitoring: The incident will be reviewed in detail and any required improvements will be built into the service in the next service deployment. Monitoring will also be reviewed for any improvements that would highlight a similar issue in the future.

Final Status: System is fully operational, and the mitigation steps put in place have addressed this issue and returned the service to normal operation.

Outage Window: 07:20 UTC - 14:30 UTC

Posted Aug 11, 2020 - 12:00 UTC

Resolved
We have performed corrective measures to address the load issue and bring the system response time back to normal. Performance monitoring has confirmed that this has taken immediate effect. We are also performing manual tests, and these are likewise showing a positive result. It is anticipated that accounting data will be delayed for the next hour as the system works through the backlog of queued jobs.

The status is set to resolved, but we will review the telemetry and details of this incident more closely. A postmortem will be posted to this incident when the detailed review is complete.
Posted Aug 06, 2020 - 16:25 UTC
Update
We have identified what is believed to be the cause of the issue, and development is working on a corrective measure to mitigate it. This is planned to be put in place at approximately 5pm (DE). We will monitor this work, perform post-implementation testing, and review performance data.

Further updates will follow once the deployment is in place.
Posted Aug 06, 2020 - 14:24 UTC
Monitoring
We are still working to identify and rectify the root cause of this issue. Currently we are adjusting resource scaling to increase throughput and reduce the delays experienced in the field. Performance data indicates this is having a positive effect, but delays are still being seen for login and web-based actions.

The team is continuing to work on this with high priority. We will keep you updated as we gain further details.
Posted Aug 06, 2020 - 10:50 UTC
Update
We are continuing to work on a fix for this issue.
Posted Aug 06, 2020 - 08:37 UTC
Identified
Incident Information

Start time: 7:20am UTC
Affected deployment(s): EU
Current status: Identified / Mitigating


Description:

Performance monitoring has identified that the EU deployment is experiencing unusually high load. Dynamic scaling is taking place and we are investigating the source of the load and what actions need to be taken.

User Impact:

User logins may be delayed and there may be delays in processing accounting data for reporting purposes.

Special Notes
This is a continuation of the 5-08-2020 Service Degradation incident. This was stabilized late yesterday evening but has reappeared as the EU usage load begins increasing.
Posted Aug 06, 2020 - 08:15 UTC
This incident affected: EU Deployment (Identification, General Printing, Mobile Printing, Scanning, Reporting).