Parade experienced a major incident on May 8, 2024, affecting Broker Portal APIs, Aljex Integration, and 1 more component, lasting 7h 2m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating May 08, 2024, 08:04 PM UTC
We are currently investigating an issue that is affecting the whole platform. The incident started around 10 PM PST yesterday and caused downtime for internal and external integrations. We are still investigating slowdowns across the platform that are delaying integrations and degrading portal performance.
- resolved May 09, 2024, 03:07 AM UTC
This incident has been resolved.
- postmortem May 09, 2024, 07:37 PM UTC
**Service Outage Report: Internal APIs**

**Overview:** We experienced a service disruption affecting our internal APIs immediately following the deployment of an update. This incident caused the API to intermittently return a 503 Service Unavailable error, impacting services that rely on this API, such as P4C and integration services.

**Incident Timeline:**

* The issue began at 05:05 UTC with the deployment of a problematic query in version 1.92.1.
* At 12:06 UTC, the issue was identified because carrier pages were not loading, and the problem was officially communicated by the customer experience team.
* Initial attempts to resolve the issue by rolling back to a previous version were unsuccessful.
* By 14:12 UTC, temporary traffic rerouting restored functionality for most services.
* A permanent solution was implemented and rolled out by 17:20 UTC with the deployment of version 1.92.4.

**Root Cause:** The root cause was identified as a performance issue with a SQL query that was part of an upgrade to our backend framework and packages. The upgrade inadvertently changed how email recipients are managed, which significantly increased database query times in our production environment.

**Resolution and Recovery Steps:**

* Traffic was temporarily rerouted to alternative deployments to isolate the problematic component and restore service functionality.
* A permanent fix that optimized the SQL queries involved was developed and deployed without further incident.

**Moving Forward:** We have taken steps to prevent similar incidents by enhancing our testing protocols to better simulate real-world loads in our staging environment.

We appreciate your understanding and apologize for any inconvenience caused. If you have further questions or need assistance, please contact our support team.