Advisory: Microsoft Azure Singapore Service Bus resource alert.
Incident Report for uniFLOW Online
Postmortem

Description 

On the 8th of February the uniFLOW Online system was seriously impacted by a critical system failure within the Southeast Asia Microsoft Azure Data Centre.

User Impact 

During the affected period, users of uniFLOW Online will have experienced delays and timeouts in core functionality. Print and scanning services were impacted, resulting in print and scan functions failing or needing to be performed multiple times before the action completed.

Scope of Impact 

This incident was limited to our Singapore deployment. 

Incident Start Date and Time 

  • Microsoft incident start: Feb 7th, 2023 – 20:30 UTC 

Incident Resolved Date and Time 

  • uniFLOW Online incident end: Feb 8th, 2023 – 07:30 UTC 

    • uniFLOW Online was operational 21 hours before the datacentre incident was resolved due to mitigation actions. 
  • Microsoft incident end: Feb 9th, 2023 – 04:30 UTC 
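
The 21-hour figure above follows directly from the timestamps in this report. As a minimal sketch (the variable and function names are illustrative and not part of any uniFLOW Online tooling), the arithmetic can be checked as follows:

```python
from datetime import datetime, timezone

# Timestamps exactly as stated in this report (all UTC).
microsoft_start = datetime(2023, 2, 7, 20, 30, tzinfo=timezone.utc)
uniflow_end = datetime(2023, 2, 8, 7, 30, tzinfo=timezone.utc)
microsoft_end = datetime(2023, 2, 9, 4, 30, tzinfo=timezone.utc)


def hours_between(earlier: datetime, later: datetime) -> float:
    """Return the elapsed time between two timestamps in hours."""
    return (later - earlier).total_seconds() / 3600


# Time uniFLOW Online was operational before the datacentre incident ended.
head_start_hours = hours_between(uniflow_end, microsoft_end)

# Total length of the Microsoft incident.
microsoft_duration_hours = hours_between(microsoft_start, microsoft_end)

print(head_start_hours, microsoft_duration_hours)
```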

Root Cause 

Microsoft: 

(This information is derived from the preliminary report for the Microsoft incident “Datacentre Cooling Event – Southeast Asia”, Tracking ID: VN11-JD8.)

  • A power surge impacted some of the datacentre cooling systems. 
  • As a result of the cooling system issue, infrastructure was taken offline, limiting available resources.
  • This resource exhaustion caused significant issues across multiple Azure services affecting all tenants on the degraded infrastructure. 

NOTE: At the time of writing this PM (Post-Mortem), Microsoft has yet to release its own full incident report. NT-ware recommends you follow official Microsoft status updates for a full account of the incident.

NT-ware: 

While multiple Azure resources were impacted across the affected infrastructure, it was the Azure Service Bus component used by NT-ware that affected uniFLOW Online. The Azure Service Bus is vital for the queueing and management of events within uniFLOW Online. Examples of such events are a user requesting the release of a print job or a mobile print job submission.

In the degraded state, the service was not able to handle the number of requests, which resulted in delays and timeouts. It is important to note that the service was NOT offline and many requests were being processed. If a user experienced a timeout, it was very likely that subsequent attempts would work. For example, if a print job was not released at the device, a second or third attempt may have succeeded. This was evident in our testing and could be seen in the Azure metrics.
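
The behaviour described above, where a timed-out request often succeeds on a second or third attempt, is the usual case for retrying with a growing delay. The following is a minimal sketch only: `TransientTimeout`, `call_with_retries` and `make_flaky_release` are hypothetical names invented for illustration and do not describe how uniFLOW Online or the Azure Service Bus SDK actually handle retries.

```python
import time


class TransientTimeout(Exception):
    """Hypothetical error standing in for a request that timed out."""


def call_with_retries(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a timeout, wait and try again.

    The wait doubles after each failed attempt (1s, 2s, 4s, ...). The last
    failure is re-raised once max_attempts has been reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientTimeout:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_release(failures_before_success):
    """Build a fake 'release print job' call that times out a fixed number of times."""
    state = {"calls": 0}

    def release():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientTimeout("queue request timed out")
        return f"released on attempt {state['calls']}"

    return release


# Simulate a job that times out twice and succeeds on the third attempt.
# Delays are recorded instead of slept so the example runs instantly.
recorded_delays = []
outcome = call_with_retries(make_flaky_release(2), sleep=recorded_delays.append)
print(outcome, recorded_delays)
```

Doubling the delay is one common choice for a degraded rather than offline service, because it avoids adding further load to a queue that is already struggling.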

uniFLOW Online Emergency Mode: 

uniFLOW Online Emergency Mode was designed to allow local printing when the desktop SmartClient is not able to contact the cloud service for job submission. As the SmartClient was connecting to unaffected Azure services, it did not meet the fail-over condition, and so Emergency Mode was not enabled automatically by the system.

Manually enforcing Emergency Mode was considered by the NT-ware Operations team and not actioned. Based on the available metrics, it was determined that as many queued tasks were completing successfully as were failing. Enforcing Emergency Mode may well have further impacted tenants that were still working, depending on their configuration. The manual option was designed as a failsafe for cases where the automated detection mentioned above is not triggered but print submission by the SmartClient is nevertheless unavailable due to a cloud service failure. In addition, to meet the required criteria ALL printing would need to be affected, and during this incident this was not the case.
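
The two conditions described in this section can be summarised as a small decision sketch. This is an illustration of the reasoning only, not the actual uniFLOW Online implementation: the `Snapshot` fields, the function names and the task counts are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical view of the deployment at one point in time."""
    smartclient_can_reach_cloud: bool
    succeeded_tasks: int
    failed_tasks: int


def automatic_failover_triggered(s: Snapshot) -> bool:
    # Automatic fail-over applies only when the SmartClient cannot
    # contact the cloud service for job submission.
    return not s.smartclient_can_reach_cloud


def manual_enforcement_justified(s: Snapshot) -> bool:
    # Manual enforcement is a failsafe for when the automatic check has
    # not fired AND all printing is affected (nothing is getting through).
    all_printing_affected = s.failed_tasks > 0 and s.succeeded_tasks == 0
    return not automatic_failover_triggered(s) and all_printing_affected


# Illustrative counts only: the report states successes and errors were
# roughly equal, not what the real numbers were.
during_incident = Snapshot(
    smartclient_can_reach_cloud=True,
    succeeded_tasks=500,
    failed_tasks=500,
)

print(automatic_failover_triggered(during_incident))
print(manual_enforcement_justified(during_incident))
```

Under these assumptions neither condition is met, which matches the decision described above not to enforce Emergency Mode.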

Chronological Events: 

Feb 7th, 2023 – 20:30 UTC 

  • Microsoft Incident Start. 

Feb 7th, 2023 – 21:03 UTC 

  • Microsoft's first notification was received by email by NT-ware Operations, informing us of the service degradation. NT-ware Operations reviewed the report and began active monitoring of the uniFLOW Online Singapore deployment. A minimal number of Service Bus errors was observed at this time. Testing by the operations team found no functional impact to uniFLOW Online or our customers.

Feb 08, 2023 - 01:21 UTC 

  • NT-ware Operations raised a support request with Microsoft as per the agreed process, to ensure we had both a direct and an indirect source of information on the incident.

Feb 08, 2023 - 02:00 UTC (Approx) 

  • NT-ware testing and field escalation confirmed that, from approximately 02:00 UTC, the strain on Azure was impacting uniFLOW Online and causing the above-mentioned delays and timeouts.
  • At this time we evaluated mitigation options such as manually enforcing Emergency Mode. This was not actioned, as detailed above.
  • Operations continued to monitor the situation and followed up with Microsoft for an ETA on their incident resolution.

Feb 08, 2023 - 05:30 UTC 

  • Microsoft was not able to provide an ETA, and the Azure Status Page information made it clear the incident was going into an extended recovery period.
  • NT-ware Operations and Dev Operations began mitigation actions to migrate the Azure Service Bus to new Azure infrastructure.
  • Preparing and performing the required tasks through the Azure UI was very slow. This was very likely due to the number of people connecting to the Azure administration UI to recover their own systems and perform similar actions.

Feb 08, 2023 - 07:00 UTC 

  • Dev Operations deployed a new Azure Service Bus into an alternate (in-region) data centre to bring the system fully online.
  • Service recovery started immediately, but it took approximately 1 hour before full operational functionality returned at 08:00 UTC.
  • NT-ware Operations continued to monitor for several hours as the system stabilised.

Feb 8th, 2023 – 12:00 UTC 

Next Steps 

We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

  • We will review the fail-over procedure for single component failure to reduce the time required to action in the future. 
  • The information and metrics gathered during this incident will be reviewed. Any improvements to availability and recovery will be ticketed and reviewed.
Posted Feb 13, 2023 - 11:42 UTC

Resolved
Hello Everyone

We have been monitoring the metrics and stability of the uniFLOW Online service for several hours without signs of issues. At this time the operations team will move the status to 'Resolved' but continue to monitor.

Microsoft has extended its recovery period and it is likely this will continue into tomorrow. Please monitor the Azure status page to stay current with their recovery efforts.

Once we have completed a full review of the incident we will provide a Post Incident Report (within 20 business days).

Regards
NT-ware uniFLOW Online Operations
Posted Feb 08, 2023 - 12:03 UTC
Update
Hello Everyone,

We are now seeing mobile print returned to normal operation and have placed all services back in an 'Operational' state. We will monitor the stability and usage of the system over the next few hours.

Further information will be provided as we move the issue to resolved.

Regards
NT-ware uniFLOW Online Operations
Posted Feb 08, 2023 - 08:17 UTC
Update
Hello Everyone,

We have successfully transferred affected components into an alternate Azure datacenter. Microsoft has not changed the status on their side, and it might be a long time until services are returned to normal. We will continue to run in the alternate location (within region) until we have a clear confirmation that it is safe to return services.

Operations is monitoring this closely and we can already see the new infrastructure slowly processing jobs. General printing looks to be recovered but mobile printing is still recovering at this time.

Important Note: This advisory was originally raised with US components accidentally selected. This has been rectified and we apologise for any confusion this might have caused. This issue was strictly impacting tenants on our Singapore Azure deployment and not the US.

Regards
uniFLOW Online Operations Team.
Posted Feb 08, 2023 - 07:26 UTC
Update
We are continuing to monitor for any further issues.
Posted Feb 08, 2023 - 05:56 UTC
Update
Hello Everyone,

With the Microsoft incident still impacting our services we are not able to provide further updates at this time.

https://azure.status.microsoft/en-us/status

We are reviewing and evaluating mitigation options, as there is no update or further confirmation from Microsoft as to when the data centre services will be restored.

Regards
NT-ware uniFLOW Operations.
Posted Feb 08, 2023 - 04:41 UTC
Update
Hello Everyone,

Microsoft's efforts to restore services are still ongoing. As more of their infrastructure is affected, we are now seeing this impacting uniFLOW Online with confirmation from the field that tenants are affected.

Current reported issues relate to printing, onboarding of tenants / devices and mobile print.

We appreciate your patience and understand that this is not a desirable situation.

Next update will be in 60 minutes or sooner as information is made available.

Regards
NT-ware uniFLOW Online Operations.
Posted Feb 08, 2023 - 03:12 UTC
Update
Hello Everyone,

To keep you updated, it is now confirmed by Microsoft that they have a major incident occurring within their Southeast Asia data centre. Details can be found directly on the Microsoft Azure Status page below.

https://azure.status.microsoft/en-us/status

From the available telemetry, uniFLOW Online is handling the incident well. Requests to affected infrastructure are requeuing, and subsequent attempts are generally processing successfully. You may see minor delays or timeouts in some functionality.

We will continue to monitor this situation as it develops.

Regards,
NT-ware uniFLOW Online Operations
Posted Feb 08, 2023 - 02:26 UTC
Monitoring
Hello Everyone

Description:
Microsoft raised an automated email alert on February 7, 2023 at 21:03 UTC regarding this detected issue. NT-ware Operations has been closely watching this issue since it was first reported. We can see within our metrics minor increases in errors reported from the Azure Service, but our testing does not indicate an impact to uniFLOW Online at this time.

Due to the increasing usage of our service as the business day starts in the region, we will closely watch for any developments.

Required actions:
No action is needed at this time. If you experience an issue with unusual delays or degraded performance with uniFLOW Online, please contact your Canon or Canon Business Partner support representative.

Further updates will be provided as available from Microsoft.

Regards
NT-ware Operations Team.
Posted Feb 07, 2023 - 23:20 UTC
This incident affected: SG Deployment (Printing, Email print, Scanning).