Evidos incident

Issues with creating transactions and accessing the web portal.

Critical · Resolved

Evidos experienced a critical incident on September 3, 2025 affecting the API, UI / View, and Portal components, lasting 1h 12m. The incident has been resolved; the full update timeline is below.

Started
Sep 03, 2025, 03:18 PM UTC
Resolved
Sep 03, 2025, 04:31 PM UTC
Duration
1h 12m
Detected by Pingoru
Sep 03, 2025, 03:18 PM UTC

Affected components

API, UI / View, Portal

Update timeline

  1. investigating Sep 03, 2025, 03:18 PM UTC

    We are experiencing issues with creating transactions and entering the portal. We are investigating the issues.

  2. investigating Sep 03, 2025, 03:38 PM UTC

    We have placed the platform in maintenance as we continue our investigation.

  3. investigating Sep 03, 2025, 03:38 PM UTC

    We are continuing to investigate this issue.

  4. monitoring Sep 03, 2025, 04:08 PM UTC

    A fix has been implemented and we are monitoring the results.

  5. resolved Sep 03, 2025, 04:31 PM UTC

    This incident has been resolved.

  6. postmortem Sep 12, 2025, 01:28 PM UTC

    ### What happened?

    On September 2 and 3, 2025, our platform experienced a major service outage that significantly impacted availability and transaction processing. The disruption was caused by a code deployment that introduced a bug resulting in excessive duplicate message processing. This overwhelmed our database connections and led to downtime across both days.

    ### What we did

    On September 2, we deployed a change and observed an outage shortly after. The team initiated a rollback to restore functionality. After reviewing the change, we re-deployed it later that day, believing the outage was unrelated. However, on September 3, a second major outage occurred around the same time, during which all SQL queries across the Signhost services failed due to database saturation. After analyzing the message logs, we concluded that the code change was the root cause: the deployment intermittently triggered bursts of hundreds of thousands of duplicate messages for the same transaction events. These spikes occurred after hours of normal operation, making the issue difficult to detect early. Once this was confirmed, we permanently rolled back to the previous stable version. The system has remained stable since.

    ### What caused the issue?

    The root cause was a still-unidentified bug in the change deployed on September 2 that sporadically generated massive volumes of duplicate messages. These bursts overwhelmed the database, causing complete service outages. The duplicates were not produced by infinite loops but occurred in sudden, high-volume spikes.

    ### What are we doing next?

    To prevent similar incidents and improve deployment safety for these kinds of changes, we are taking the following steps:

    Deployment safeguards:

    * Gradual rollout strategy: implement phased rollouts with real-time monitoring to detect anomalies before full deployment.

    Monitoring improvements:

    * Monitoring system: we are transitioning to a more robust monitoring system, which will provide improved insights, faster anomaly detection, and more precise alerting across our infrastructure.

    Root cause analysis:

    * Deep dive: continue investigating the specific bug that caused the duplicate message generation, to prevent recurrence in future deployments.
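The duplicate-message failure mode described in the postmortem is commonly mitigated with an idempotency gate that drops repeats before they reach the database. The sketch below is illustrative only, not Evidos's implementation: the key fields (transaction ID and event type) are hypothetical, and a production system would use a shared store such as Redis with a TTL rather than an in-process set.

```python
import hashlib


class DedupGate:
    """Suppress duplicate events before they reach the database.

    Illustrative only: keys events on a hypothetical (transaction_id,
    event_type) pair; a real deployment would use a shared store with a TTL
    instead of this in-process set.
    """

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def _key(self, transaction_id: str, event_type: str) -> str:
        # Hash the identity fields into a compact, fixed-size dedup key.
        return hashlib.sha256(f"{transaction_id}:{event_type}".encode()).hexdigest()

    def accept(self, transaction_id: str, event_type: str) -> bool:
        """Return True the first time an event is seen, False for duplicates."""
        key = self._key(transaction_id, event_type)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


gate = DedupGate()
# A burst of 100,000 identical events, as in the incident: only the first
# passes the gate and touches the database.
processed = sum(gate.accept("tx-42", "signed") for _ in range(100_000))
```

With such a gate in place, a burst of duplicates degrades into a single database write plus cheap set lookups, which would have bounded the connection saturation seen on September 2 and 3.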
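The gradual rollout strategy listed under deployment safeguards can be sketched as a staged traffic gate: widen exposure to the new version only while the observed error rate stays within budget, and roll back otherwise. The stage fractions and error budget below are assumed values for illustration, not Evidos's actual configuration.

```python
def next_stage(
    current_fraction: float,
    error_rate: float,
    stages: tuple[float, ...] = (0.05, 0.25, 1.0),  # assumed stages: 5% -> 25% -> 100%
    budget: float = 0.01,  # assumed error budget of 1%
) -> float:
    """Return the traffic fraction the new version should receive next.

    Advances to the next stage only while the observed error rate is within
    budget; any breach routes all traffic back to the stable version.
    """
    if error_rate > budget:
        return 0.0  # roll back: new version gets no traffic
    for stage in stages:
        if stage > current_fraction:
            return stage
    return current_fraction  # already at full rollout


# A healthy canary advances 5% -> 25% -> 100%; an anomaly triggers rollback.
```

Because the incident's duplicate bursts appeared only after hours of normal operation, a dwell time at each stage (long enough to cover the observed onset delay) would matter as much as the fractions themselves.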