Issue Date: 06-08-2020
Region Affected: EU
Outage Description: The EU deployment experienced slow response across all services. This resulted in but not limited to delays logging into devices or the portal, device and smart client registration and job release. Service delivery was impacted by high load on the system.
Root Cause Analysis: This was traced to an excessive number of SmartClient registration requests. It was found that a series of events and conditions exposed a flaw in the SmartClient registration retry function. The condition allowed the service to get into a loop and as a result flooded the service queue and storage accounts beyond their defined scaling limits. This took place during the rollout of a large customer multiplying the effect. Our investigation found that the initial creation of a location(s) was corrupted resulting in orphaned entries that the smart client could not resolve. This forced a re-registration across thousands of users trying to load 700+ locations. The registration often would time out and force the SmartClient to keep retrying resulting in an endlessly growing number of requests.
Mitigation Action Plan: Once identified an appropriate mitigation plan was put in place and tested internally. This resulted in a configuration update to address the SmartClient retry method and remove orphaned location.
Future Proof and Monitoring: The incident will be reviewed in detail and any required improvements will be built into the service on the next service deployment. Monitoring will also be reviewed for any improvements that will highlight a similar issue in the future.
Final Status: System is fully operational, and the mitigation steps put in place have addressed this issue and returned the service to normal operation.
Outage Window: 07:20 am UTC - 2:30pm UTC