Cigo Tracker incident

Unplanned Platform Downtime (Cloud Vendor Outage)

Major · Resolved

Cigo Tracker experienced a major incident on January 21, 2024, affecting Dispatch Web Platform, Public API, and eight other components, lasting 6h 44m. The incident has been resolved; the full update timeline is below.

Started
Jan 21, 2024, 06:40 AM UTC
Resolved
Jan 21, 2024, 01:24 PM UTC
Duration
6h 44m
Detected by Pingoru
Jan 21, 2024, 06:40 AM UTC

Affected components

Dispatch Web Platform, Public API, iOS, Routing and Itinerary Optimization, Operator API, Android, Maps, Customer Tracker, Notifications, Outbound Email Service

Update timeline

  1. investigating Jan 21, 2024, 06:40 AM UTC

    We are currently experiencing extended database downtime stemming from Microsoft Azure's database instances. The disruption has been occurring intermittently since 9:45 PM (EST), and at a higher frequency since 12 AM (EST). Our team is actively investigating the root cause of this service disruption. We appreciate your patience as we work to resolve this issue promptly.

  2. identified Jan 21, 2024, 07:23 AM UTC

    Our ongoing investigation into the connectivity disruption impacting a subset of our database servers and various Azure services has identified an issue on Microsoft's end. The Microsoft Operational Systems Support team has acknowledged the problem and is actively addressing it. We are awaiting further updates from their team and will keep you informed as soon as new information becomes available. Your patience during this time is sincerely appreciated.

  3. monitoring Jan 21, 2024, 07:57 AM UTC

    It appears that Microsoft Azure has successfully implemented a fix, and our database server connections are now operational. However, we are currently awaiting official confirmation from the Microsoft Operational Systems Support team to validate that the issue has been fully mitigated. Thank you for your continued understanding.

  4. monitoring Jan 21, 2024, 08:30 AM UTC

    Our team is actively monitoring our database server instances to confirm service availability. While recent improvements look positive, we are still awaiting official confirmation from the Azure team that the problem has been fully resolved.

  5. resolved Jan 21, 2024, 01:24 PM UTC

    The incident has been successfully resolved. We're now awaiting the Root-Cause Analysis report from Microsoft Azure's team. Once received, we'll compile a post-mortem of the event to provide you with a comprehensive overview.

  6. postmortem Jan 29, 2024, 03:42 PM UTC

    We want to provide you with an update on the January 21st incident that impacted our services. Here's a breakdown of the situation:

    **Incident Summary:** On January 20th, 2024, at around 9 PM EST, an internal maintenance process by the Azure OSS team resulted in a configuration change to Azure Resource Manager. Unfortunately, this led to repeated failures of Azure Resource Manager nodes upon startup.

    **Root Cause:** The configuration change triggered a feedback loop that overwhelmed the remaining Azure Resource Manager nodes, causing a rapid drop in availability. This, in turn, affected our backend storage, leading to random failures on data plane API calls. These failures disrupted our database server, causing intermittent crashes, particularly between 12 AM and 2 AM EST.

    **Resolution:** The Azure engineering team worked to address the issue, and it was fully resolved around 4 AM EST on January 21st, 2024.

    **Preventive Measures:** To prevent similar incidents in the future, we are closely reviewing our internal processes and working collaboratively with the Azure OSS team to implement additional safeguards. We sincerely apologize for any inconvenience this may have caused, and we appreciate your understanding as we continue to enhance our systems to provide you with a more reliable experience.
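As one illustrative example of the kind of safeguard a client of a flaky upstream can apply (this is an editorial sketch, not Cigo Tracker's actual implementation; `with_retries` and its parameters are hypothetical names), transient connection failures like the intermittent database crashes described above are commonly absorbed with retry logic using exponential backoff and jitter:

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Run `operation`, retrying on transient failures with exponential
    backoff plus jitter. `operation` is any zero-argument callable that
    raises ConnectionError on a transient failure (e.g. a wrapper around
    a database connect/query call)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure to the caller
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay,
            # with random jitter so many clients don't retry in lockstep.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))

# Usage: an operation that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient database error")
    return "ok"

result = with_retries(flaky_query, base_delay=0.01)
```

Backoff with jitter smooths out short-lived upstream blips; it does not fix a multi-hour outage like this one, but it keeps brief connection drops from cascading into user-visible failures.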