Service Disruption Reported AU/NZ
Incident Report for uniFLOW Online
Postmortem

User Impact

The uniFLOW SmartClient would not register to uniFLOW Online, and print jobs could not be released from devices.

Scope of Impact

  • Australian deployment
  • New registrations of the uniFLOW SmartClient to uniFLOW Online
  • Release of print jobs where the job information is stored in the cloud (default)

Incident Start Date and Time

April 23rd, 2021 - 11:30am AEST

Incident End Date and Time

April 23rd, 2021 - 4:30pm AEST

Root Cause

Due to a misconfiguration, a resource limit was exhausted prematurely. The uniFLOW SmartClient must successfully register to uniFLOW Online to complete the print release process when job information is stored in the cloud and the SmartClient cannot be reached directly by the device. When a uniFLOW SmartClient failed to connect, it retried within a minute. With the resource pool already consumed, the system began throttling requests, which caused delays and a partial outage of print release capabilities. Onboarding new uniFLOW SmartClients was also not possible during this time.

Note: The standard behavior of the devices, if the connection to uniFLOW Online is lost, is to try reaching the SmartClient using its last known address. Whether this connection succeeds depends on the configuration of the customer’s network.

Incident Response

April 23rd, 2021 - 11:30am AEST

The issue was reported from the field via our support channel and confirmed.

April 23rd, 2021 - 12:30pm AEST

The NT-ware Operations Team confirmed the field reports and began reviewing telemetry. Internal testing was performed to correlate the findings, and the issue was escalated to put a mitigation plan in place.

April 23rd, 2021 - 3:00pm AEST

The mitigation process began with the provisioning of additional resources and management of the web resource role. We over-provisioned resources to address the high number of pending requests and to ensure all requests in the queue were processed.

April 23rd, 2021 - 4:00pm AEST

All resources had returned to normal operation and the system was performing within acceptable parameters. We kept the case open for a further 30 minutes to monitor the status and ensure all mitigation controls were in full effect.

Lessons Learnt

  1. Provision a higher resource limit to allow for ‘burst’ growth on all production deployments
  2. Implement a regular review process and checkpoint
  3. This action will soon be superseded by improvements already scheduled to provide live alerting and monitoring of this growth
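
A regular review checkpoint of the kind listed above could, for example, compare current usage against the configured limit with room for burst growth. The sketch below is hypothetical: the function name, the burst factor of 2.0 and the example figures are assumptions for illustration, not values from this incident or from the uniFLOW Online platform.

```python
def needs_more_headroom(current_usage: int, limit: int, burst_factor: float = 2.0) -> bool:
    """Return True if a burst of `burst_factor` times current usage would exceed the limit.

    Hypothetical review check; the burst factor is an assumed planning value.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current_usage * burst_factor > limit


if __name__ == "__main__":
    # Example figures are made up for illustration.
    print(needs_more_headroom(current_usage=600, limit=1000))  # burst of 1200 exceeds 1000
    print(needs_more_headroom(current_usage=400, limit=1000))  # burst of 800 fits
```

Running such a check at each review flags a deployment before the limit is reached instead of after.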

Customer Post Actions

  • There are no post-incident actions or recommendations.
Posted Apr 28, 2021 - 18:36 UTC

Resolved
This incident has been resolved. All systems are back to normal operation.
Posted Apr 23, 2021 - 09:36 UTC
Monitoring
Update
We are pleased to say the system has returned to normal operation, as confirmed by our telemetry. We will continue to monitor and ensure the mitigation controls perform as expected.

We will investigate the telemetry and information gathered to improve and learn from this situation.

A postmortem of this incident will be posted within 5 working days.
Posted Apr 23, 2021 - 06:27 UTC
Update
Update:
The issue has been identified and we are putting mitigation plans in place. Current information points to a spike in traffic saturating certain services beyond their auto-scaling limit. We are now monitoring the recovery and will keep you updated on the progress.

At this time we cannot estimate the service recovery period but will provide updates as the situation improves.

The Mobile Print service impact reported earlier could possibly be a false positive. We are only seeing an issue with job release when non-MEAP devices are used. This would match our investigation, as these devices require the SmartClient for processing.

Next Update in 30 minutes.
Posted Apr 23, 2021 - 05:40 UTC
Update
We are continuing to work on a fix for this issue.
Posted Apr 23, 2021 - 04:30 UTC
Identified
Update:
Investigations are ongoing. The issue is also expected to impact the release of mobile print jobs.

We are working hard to resolve the issue as quickly as possible.

Next update will be in 30 minutes.
Posted Apr 23, 2021 - 04:30 UTC
Update
Update:
The scope of the incident at this time 'appears' to be isolated to the SmartClient. This is affecting the registration process and job release where SmartClient job processing is taking place.

The investigation is ongoing and has also been escalated to our backend DevOps team.

We will provide the next update in 30 minutes.
Posted Apr 23, 2021 - 03:51 UTC
Update
We are continuing to investigate this issue.
Posted Apr 23, 2021 - 03:04 UTC
Investigating
Identified: 11:30am AEST

Incident Scope:
Field reports of impact to print job release for AU and NZ tenants.

Description:
An issue has been reported but is not yet defined. There is a possible issue releasing print jobs sent to uniFLOW Online. The NT-ware Operations Team is now investigating the issue and checking telemetry.

Next Update:
The next update will be in
Posted Apr 23, 2021 - 02:59 UTC
This incident affected: AU Deployment (Printing, Email print).