Gorilla experienced a major incident on December 17, 2025, affecting the API and Web Application components and lasting 1d 2h. The incident has been resolved; the full update timeline is below.
Affected components
- API
- Web Application
Update timeline
- investigating Dec 17, 2025, 03:02 PM UTC
We are investigating an ongoing issue causing intermittent API errors and slower response times for calculation-related requests. The issue occurs when the API exceeds execution time limits and enters a degraded state, which can cause repeated request failures until the affected service instances are recycled. We are actively mitigating impacted instances and monitoring the platform. In parallel, we are enhancing detection and alerting to ensure a faster response while we work on a longer-term solution to prevent this behavior from recurring. Further updates will be shared as progress is made.
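For illustration, here is a minimal sketch of the kind of degraded-instance detection described above. The threshold and the `record_request` / `alert_and_recycle` hooks are hypothetical stand-ins, not our actual alerting pipeline.

```python
from collections import defaultdict

# Hypothetical threshold for illustration; the update does not state
# the real detection criteria.
CONSECUTIVE_TIMEOUT_LIMIT = 5

_timeout_streaks = defaultdict(int)

def record_request(instance_id: str, timed_out: bool) -> None:
    """Track consecutive timeouts per instance and flag degraded ones."""
    if timed_out:
        _timeout_streaks[instance_id] += 1
        if _timeout_streaks[instance_id] >= CONSECUTIVE_TIMEOUT_LIMIT:
            alert_and_recycle(instance_id)
            _timeout_streaks[instance_id] = 0
    else:
        _timeout_streaks[instance_id] = 0  # a healthy response resets the streak

def alert_and_recycle(instance_id: str) -> None:
    """Placeholder hook: page the on-call and recycle the degraded instance."""
    print(f"ALERT: {instance_id} degraded; recycling")
```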
- identified Dec 17, 2025, 04:13 PM UTC
We are investigating a potential interaction between a recently updated monitoring component and our API runtime. This may, under certain conditions, prevent proper timeout handling and lead to intermittent errors. Fixes are being rolled out, and the investigation is ongoing.
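As a hedged illustration of how a monitoring wrapper can interfere with timeout handling (the update above does not confirm this exact mechanism): in Python, a wrapper that catches `BaseException` will also trap a runtime's timeout or cancellation signal, so the request never fails cleanly.

```python
def overly_broad_wrapper(handler):
    """Illustrative anti-pattern: catching BaseException hides timeouts."""
    def wrapper(event, context):
        try:
            return handler(event, context)
        except BaseException:  # also swallows timeout/cancellation signals
            return {"statusCode": 500, "body": "error hidden from runtime"}
    return wrapper

class FakeTimeout(BaseException):
    """Stands in for a runtime-level timeout signal (illustrative only)."""

@overly_broad_wrapper
def handler(event, context):
    raise FakeTimeout()

print(handler({}, None))  # the timeout never propagates to the runtime
```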
- identified Dec 17, 2025, 04:23 PM UTC
We are continuing to work on a fix for this issue.
- monitoring Dec 17, 2025, 04:42 PM UTC
We have identified a likely contributing factor related to a recent update in a third-party monitoring component that was automatically included in our latest deployment. This update affected our ability to observe the API and may have contributed to intermittent timeouts and 500 errors. In parallel, the monitoring provider has deprecated their legacy Lambda monitoring and recommended migrating to a newer APM-based integration. We have completed this migration, applied a workaround for the identified issue, and rolled out a hotfix across all environments. The remaining work is to update dashboards and alerts to fully use the new monitoring data. We continue to monitor the situation closely and will provide further updates as stability is confirmed.
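For readers curious what an APM-based handler integration of this kind typically looks like, here is a generic sketch. The monitoring provider is not named in this update, so `ApmClient` and `traced` are hypothetical stand-ins, not the provider's SDK.

```python
import functools
import time

class ApmClient:
    """Stand-in for an APM SDK (illustrative only)."""
    def record_span(self, name: str, duration_s: float, error: bool) -> None:
        print(f"span={name} duration={duration_s:.3f}s error={error}")

apm = ApmClient()

def traced(handler):
    """Wrap a Lambda-style handler so latency and errors flow to the APM."""
    @functools.wraps(handler)
    def wrapper(event, context):
        start = time.monotonic()
        try:
            result = handler(event, context)
            apm.record_span(handler.__name__, time.monotonic() - start, error=False)
            return result
        except Exception:
            apm.record_span(handler.__name__, time.monotonic() - start, error=True)
            raise
    return wrapper

@traced
def calculate(event, context):
    return {"statusCode": 200, "body": "ok"}

print(calculate({}, None))
```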
- investigating Dec 17, 2025, 07:01 PM UTC
We are investigating a recurrence of intermittent API errors and slow responses. Previous changes did not fully resolve the issue. Early findings suggest our internal timeout safeguards are not consistently triggering in production, allowing platform-level timeouts to occur and cause repeated service restarts. The exact cause is still under investigation. We are actively working on mitigations and will share updates as we learn more.
- identified Dec 17, 2025, 08:09 PM UTC
We have identified the cause of the ongoing API issues. In some cases, individual requests can take longer than expected, which may prevent our internal safeguards from triggering correctly. This can lead to temporary service degradation and errors. Mitigation in progress: We are applying changes to ensure slow requests fail cleanly without affecting overall service stability, and are identifying additional improvements to prevent this behavior from recurring. We will provide another update once the fix is deployed and move the incident to the Monitoring phase.
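A minimal sketch of the "fail cleanly" safeguard described above, assuming an asyncio-based handler; the constants, the 30-second platform limit, and the `RequestTimeout` type are illustrative assumptions, not our production values.

```python
import asyncio

# Hypothetical limits for illustration; the real platform limit is not
# stated in this update.
PLATFORM_TIMEOUT_S = 30.0    # hard limit after which the platform kills the worker
SAFEGUARD_TIMEOUT_S = 25.0   # internal cutoff, kept safely below the platform limit

class RequestTimeout(Exception):
    """Raised when a single request exceeds the internal cutoff."""

async def guarded_call(handler, request, timeout=SAFEGUARD_TIMEOUT_S):
    """Run `handler`, failing the one slow request before the platform
    limit is reached, so the worker itself stays healthy."""
    try:
        return await asyncio.wait_for(handler(request), timeout=timeout)
    except asyncio.TimeoutError:
        raise RequestTimeout(f"request exceeded {timeout}s cutoff") from None

async def slow_handler(request):
    await asyncio.sleep(60)  # simulates a calculation that runs too long
    return "done"

async def main():
    try:
        # Short timeout here only so the demo finishes quickly.
        await guarded_call(slow_handler, {}, timeout=0.1)
    except RequestTimeout as exc:
        print(f"returned a clean error to the caller: {exc}")

if __name__ == "__main__":
    asyncio.run(main())
```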
- monitoring Dec 17, 2025, 08:24 PM UTC
A fix has been implemented to ensure slow requests fail cleanly without impacting overall service stability. We are actively monitoring to confirm the fix fully resolves the issue. Further updates will be shared if needed.
- resolved Dec 18, 2025, 05:31 PM UTC
This incident has been resolved. Our team will continue to explore long-term solutions and improvements. If you continue to experience issues, please reach out to our support team.