Zomentum experienced a critical incident on July 4, 2023, affecting the Main API server and lasting 14 minutes. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Jul 04, 2023, 06:23 AM UTC
We're experiencing an elevated level of API errors and are currently looking into the issue.
- resolved Jul 04, 2023, 06:37 AM UTC
This incident has been resolved.
- postmortem Jul 04, 2023, 06:48 AM UTC
On 4 July 2023, our server experienced a critical incident in which a false fatal error caused the server to go down. This report covers the root cause, the impact, the actions taken to mitigate the issue, and the preventative measures implemented to avoid similar incidents in the future.
Timeline:
* 1:20 UTC: Incident identification
* 2:00 UTC: Investigation and diagnosis
* 3:00 UTC: Mitigation and server recovery
* 5:30 UTC: Issue resolution and confirmation
* 6:00 UTC: Post-incident analysis and documentation
Incident Details: A code path triggered a false fatal error, which took the server down. The incident temporarily disrupted service availability, but customer data integrity was not affected.
Root Cause Analysis: Our investigation identified the root cause as the false fatal error raised by this code path. The code path had not been triggered before; it began to be triggered at 1:20 UTC.
Impact: No customer data was compromised or affected. Downtime was intermittent, occurring in bursts of 1 minute in every 15-minute interval between 1:20 UTC and 5:30 UTC.
Mitigation and Resolution: To address the incident promptly and minimize its impact, the following actions were taken:
1. Temporary workaround: We increased the redundancy of the server to stabilize it and prevent further recurrence of the false fatal error while a permanent solution was being developed.
2. Permanent fix: A hotfix was rolled out to production to remove the unwanted error.
Preventative Measures: In light of this incident, we have added tests to the code that check that these fatal errors are raised only where appropriate.
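As an illustration of the kind of test described under Preventative Measures, the sketch below shows one way to check that only errors meant to be fatal can stop the server. It is not our production code: the names `RecoverableError`, `FatalError`, `handle_request`, and the sample handlers are hypothetical and exist only for this example.

```python
class RecoverableError(Exception):
    """Hypothetical: should fail a single request, not the whole server."""


class FatalError(Exception):
    """Hypothetical: the only error type allowed to stop the server."""


def handle_request(handler, payload):
    """Run a handler and return (status, body).

    Only FatalError is allowed to propagate; every other exception is
    converted into a 500 response so the server keeps running.
    """
    try:
        return 200, handler(payload)
    except FatalError:
        raise
    except Exception as exc:
        return 500, f"request failed: {exc}"


def bad_input_handler(payload):
    # Hypothetical handler hitting an ordinary, recoverable problem.
    raise RecoverableError("bad input")


def corrupted_state_handler(payload):
    # Hypothetical handler hitting a condition that justifies a shutdown.
    raise FatalError("corrupted state")


def fatal_errors_only_where_appropriate():
    """Return True if recoverable errors are contained and fatal ones escape."""
    status, _ = handle_request(bad_input_handler, {})
    if status != 500:
        return False
    try:
        handle_request(corrupted_state_handler, {})
    except FatalError:
        return True
    return False
```

A check like `fatal_errors_only_where_appropriate()` could run in a test suite, so that a code path raising a fatal error where a recoverable one was intended is caught before release.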
Conclusion: The false fatal error incident on 4 July 2023 resulted in temporary service disruption but had no impact on customer data integrity. Through a thorough investigation, we identified the root cause, implemented a temporary workaround, and developed a permanent fix to prevent similar incidents from occurring in the future. Our commitment to enhancing error handling mechanisms, testing processes, and proactive monitoring will ensure the stability and reliability of our services moving forward. We sincerely apologize for any inconvenience caused by this incident and assure you that we are continuously striving to improve our systems to provide a seamless experience for our valued customers.