YOOBIC incident

Major Outage

YOOBIC experienced a major incident on July 17, 2023 affecting YOOBIC applications, Public API, and three other components (Scheduler, Integration Center, and MySQL Reporting Database), lasting 42m. The incident has been resolved; the full update timeline is below.

Started: Jul 17, 2023, 08:13 AM UTC
Resolved: Jul 17, 2023, 08:56 AM UTC
Duration: 42m
Detected by Pingoru: Jul 17, 2023, 08:13 AM UTC

Affected components

YOOBIC applications, Public API, Scheduler, Integration Center, MySQL Reporting Database

Update timeline

investigating Jul 17, 2023, 08:13 AM UTC

We are currently experiencing a major outage that has made our services unavailable. During this outage, users may be unable to access our application. Our engineering team is actively investigating the issue and working to restore full functionality as quickly as possible. We understand the impact this may have on your workflow and are committed to keeping you informed about the progress of the incident. Regular updates will be provided on this page as our team works towards a resolution. We appreciate your patience and understanding during this challenging time, and we thank you for your support.
investigating Jul 17, 2023, 08:23 AM UTC

We are actively investigating the issue you may be experiencing and will provide updates as soon as possible. We appreciate your patience and understanding during this process.
investigating Jul 17, 2023, 08:44 AM UTC

Our engineering team is still actively looking into the ongoing incident which is impacting our services. The following services are affected:

- Mobile application
- Web application
- Public API
- Integrations centre

We understand the impact this has on your experience and want to assure you that resolving the issue is our top priority. We appreciate your patience and support during this time. Further updates will be provided as our team works towards restoring normal service.
resolved Jul 17, 2023, 08:56 AM UTC

The incident has been fully resolved and you should be able to use our services. We understand the impact this incident may have had on your operations, and we apologize for any inconvenience caused. If you have any lingering concerns or encounter any unexpected issues, please do not hesitate to contact our support team. We are here to assist you and ensure a smooth experience moving forward.
postmortem Jul 18, 2023, 04:04 PM UTC

# Incident report 17 July 2023

## Summary

On 17 July 2023, our SaaS application experienced an outage resulting in high response times for our customers. The incident was caused by a query plan rebuild, triggered by a new index created on the mission collection in the preceding days. This incident report provides a detailed account of the timeline, resolution, root cause analysis, impact, corrective actions, and conclusion related to the outage.

## Timeline

Events that happened on 17 July 2023:

* 08:05am (UTC): Our server monitoring system detected higher CPU usage and increased response times.
* 08:15am: The issue was escalated to the CTO, who initiated a crisis conference call with the Head of Backend and other Backend engineers for investigation.
* 08:20am: Initial investigation used analytics tools to determine whether the issue was related to MongoDB or Heroku.
* 08:25am: It was confirmed that the problem was on the MongoDB side, and further analysis revealed the new index on the mission collection as the root cause.
* 08:28am: The team began rebuilding the query plan on the server side.
* 08:30am: Performance progressively returned to normal, and the outage was resolved.

## Root cause analysis

The outage was caused by an unexpected rebuild of the query plan, triggered by the creation of a new index on the mission collection. The index was designed to improve the performance of specific queries on that collection. This rebuild was specific to the production environment and was not triggered during previous tests in staging environments.

The outage was resolved by forcing a ramp-up of queries using the index and rebuilding the query plan on the server side. These actions were taken promptly to restore normal performance.

## Impact

During the outage, our SaaS application experienced high response times, with the 95th percentile exceeding 20 seconds.
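
As an illustration of the resolution described in the root cause analysis, the sketch below shows one way such a ramp-up could be scripted in Python. It is not necessarily the procedure used during the incident. It assumes a PyMongo-style `Database` object; `planCacheClear` is a standard MongoDB command and `hint` a standard cursor method, while the function name `warm_up_index`, the index name, and the sample filters are hypothetical.

```python
from typing import Any, Iterable, Mapping


def warm_up_index(
    db: Any,
    collection_name: str,
    index_name: str,
    sample_filters: Iterable[Mapping[str, Any]],
) -> int:
    """Clear cached plans for a collection, then replay sample queries.

    `db` is assumed to behave like a pymongo `Database`.
    Returns the number of sample filters replayed.
    """
    # Drop the cached query plans for this collection so that later
    # queries are planned again with the new index available.
    db.command("planCacheClear", collection_name)

    collection = db[collection_name]
    replayed = 0
    for flt in sample_filters:
        # First run: force the new index with hint() so it is exercised.
        list(collection.find(flt).hint(index_name).limit(1))
        # Second run: no hint, so the query planner chooses a plan itself.
        list(collection.find(flt).limit(1))
        replayed += 1
    return replayed
```

With PyMongo, this might be called as `warm_up_index(client["mydb"], "mission", "my_new_index", [{"status": "open"}])` right after an index deployment, where the database name, index name, and filter are placeholders for representative production queries.
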
## Corrective actions

To prevent similar incidents in the future, the following corrective action is being put in place:

* A thorough review process will be implemented for any index creation or modification, including a ramp-up of query plan building right after deployment.

## Conclusion

We apologize for the inconvenience caused by the outage and the resulting higher response times. The incident was promptly addressed by our engineering team, who successfully identified the root cause and implemented corrective actions to restore normal performance. We remain committed to delivering a reliable and efficient SaaS application for our frontline worker customers and will continue to take proactive measures to prevent future occurrences.