YOOBIC incident

Degraded Performance

Minor Resolved

YOOBIC experienced a minor incident on July 24, 2023 affecting YOOBIC applications, Public API, and 3 more components, lasting 1h 4m. The incident has been resolved; the full update timeline is below.

Started
Jul 24, 2023, 07:53 AM UTC
Resolved
Jul 24, 2023, 08:57 AM UTC
Duration
1h 4m
Detected by Pingoru
Jul 24, 2023, 07:53 AM UTC

Affected components

YOOBIC applications, Public API, Scheduler, Integration Center, MySQL Reporting Database

Update timeline

  1. investigating Jul 24, 2023, 07:53 AM UTC

    We are currently experiencing degraded performance with our services, which may result in slower response times. Our engineering team is actively investigating the issue and working to improve the performance as quickly as possible. While the degraded performance may impact your experience, rest assured that we are prioritizing the resolution to provide a smoother and more reliable service. Our team is committed to restoring normal performance and ensuring a seamless user experience. We will provide regular updates on this page to keep you informed about the progress of the incident. Please bear with us as we work diligently to address the performance issues and deliver the level of service you expect. Thank you for your understanding and cooperation. We appreciate your continued support as we work towards resolving the degraded performance and enhancing your overall experience.

  2. investigating Jul 24, 2023, 08:19 AM UTC

    Our engineering team is still actively looking into the ongoing incident which is impacting our services. The following services are affected:

    - Mobile application
    - Web application
    - Public API
    - Scheduler
    - Integrations centre

    We understand the impact this has on your experience and want to assure you that resolving the issue is our top priority. We appreciate your patience and support during this time. Further updates will be provided as our team works towards restoring normal service.

  3. identified Jul 24, 2023, 08:44 AM UTC

    Our engineering team has identified the cause of the ongoing issue and is currently implementing a fix. We appreciate your patience as we work towards resolving the situation. Further updates will be provided as our team works towards restoring normal service.

  4. resolved Jul 24, 2023, 08:57 AM UTC

    This incident has been resolved.

  5. postmortem Jul 25, 2023, 04:26 PM UTC

    ## Summary

    On 24 July 2023, our application experienced degraded performance with slower response times for our customers. The incident was caused by elevated server CPU usage from a scheduled task that calculates aggregation metrics. This incident report provides a detailed account of the timeline, resolution, root cause analysis, impact, corrective actions, and conclusion related to the outage.

    ## Timeline

    Events that happened on the 24th of July, 2023:

    * 07:40am (UTC): Our server monitoring system detected higher CPU usage and increased response times.
    * 07:50am: The issue was escalated to the CTO, who initiated a crisis conference call with the Head of Backend and other Backend engineers for investigation.
    * 08:00am: Using analytics tools, the initial investigation focused on determining whether the issue was related to a scheduled task that was increasing the load on the database.
    * 08:05am: The team decided to kill the scheduled task.
    * 08:10am: The load subsided and the performance progressively returned to normal.
    * 08:15am: The performance was back to normal, and the outage was resolved.

    ## Root cause analysis

    The root cause of the issue was traced back to a scheduled task that increased the load on the MongoDB database and caused the outage.

    ## Impact

    During the outage, some customers experienced higher response times. Some were also forced to log out, because their token could not be refreshed and was invalidated. It is probable that the same scheduled task played a role in the outage of the 17th of July, and we will actively monitor that task in the coming weeks.

    ## Corrective actions

    To prevent similar incidents in the future, we have implemented the following corrective actions:

    * Dispatching the load of the scheduled task to the secondary nodes of MongoDB.
    * Adding new indexes to ease the workload of the scheduled task.
    * Evaluating the benefit of adding dedicated secondary nodes to MongoDB for this type of calculation, to offload the workload from the primary nodes.

    ## Conclusion

    We apologize for the inconvenience caused by the outage. Our team worked diligently to identify and resolve the issue promptly. We are taking the necessary steps to prevent similar incidents in the future and improve the overall stability and reliability of our application. We understand the critical nature of our services for frontline workers, and we are committed to continuously improving our infrastructure and processes to ensure the best possible experience for our valued customers. If you have any further questions or concerns, please don't hesitate to contact our support team. Thank you for your understanding.
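
As an illustration of the first corrective action in the postmortem (moving the scheduled task's load onto MongoDB secondary nodes), the sketch below builds a connection string that asks the driver to route reads to secondaries. The postmortem does not say how YOOBIC implemented this, so everything here is an assumption: the function name, host names, database name, and replica set name are hypothetical. Only the `readPreference`, `replicaSet`, and `maxStalenessSeconds` connection string options are standard MongoDB settings.

```python
from urllib.parse import urlencode


def build_reporting_uri(hosts, database, replica_set, max_staleness_seconds=120):
    """Build a MongoDB URI whose reads prefer secondary nodes.

    Hypothetical helper for illustration only; it is not taken from
    the incident report.
    """
    # MongoDB rejects maxStalenessSeconds values below 90.
    if max_staleness_seconds < 90:
        raise ValueError("maxStalenessSeconds must be at least 90")

    options = {
        "replicaSet": replica_set,
        # Read from a secondary when one is available, otherwise fall
        # back to the primary so the task still runs.
        "readPreference": "secondaryPreferred",
        # Skip secondaries that lag too far behind the primary.
        "maxStalenessSeconds": max_staleness_seconds,
    }
    return f"mongodb://{','.join(hosts)}/{database}?{urlencode(options)}"


# Hypothetical hosts and names, used only to show the shape of the result.
uri = build_reporting_uri(
    ["db1.example.internal:27017", "db2.example.internal:27017"],
    "metrics",
    "rs0",
)
print(uri)
```

A dedicated connection like this would apply only to the aggregation task, so interactive traffic keeps reading from the primary. Reads from secondaries can return slightly stale data, which is usually acceptable for periodic aggregation metrics but should be checked against the task's requirements.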