Proxyclick incident

Proxyclick by Eptura - Users Unable to Authenticate to Web Application/Dashboard

Notice, Resolved

Proxyclick experienced a notice incident on February 15, 2023, lasting 11 minutes; the vendor posted its root cause analysis on February 17, 2023. The incident has been resolved; the full update timeline is below.

Started
Feb 15, 2023, 03:33 PM UTC
Resolved
Feb 15, 2023, 03:44 PM UTC
Duration
11 minutes
Detected by Pingoru
Feb 17, 2023, 09:13 PM UTC

Update timeline

  1. resolved Feb 17, 2023, 09:13 PM UTC

    Proxyclick by Eptura Detailed Root Cause Analysis (RCA) – S1 Event 2023-02-15

    On February 15, 2023, at 15:33 UTC, Proxyclick started to receive reports that users were unable to access the application. Engineering and DevOps teams isolated an issue with the Web Application servers and restarted them, restoring service.

    Type of Event: Service Disruption

    Services Impacted: Proxyclick Web Application

    Remediation: DevOps and Engineering restarted the impacted service hosts, restoring normal operation.

    Timeline of Events:
    15:33 UTC - First reports received by Support
    15:35 UTC - DevOps and Engineering begin investigation
    15:42 UTC - DevOps restarts impacted service hosts
    15:44 UTC - Service host restart completes and normal operations resume

    Total Duration: 11 Minutes

    Groups Involved in the Event: Support, DevOps, Engineering

    Root Cause Analysis: A primary service host for the Proxyclick Web Application crashed due to an Out-of-Memory exception and was not automatically restarted. The cause of this OOM exception was identified as a memory leak in the application which had previously escaped notice due to frequent restarting of the service hosts during regular product update deployments. Proxyclick Engineering had a larger than normal gap between releases after the service migration event on January 15th, 2023, which surfaced the conditions for this memory leak to consume all available memory on the service host.

    Preventative Action and Analysis: DevOps has implemented additional health monitoring to Proxyclick Load Balancer infrastructure to detect service hosts failing due to memory limits and remove them from the pool. Additionally, a self-healing trigger has been added to this health check response to bring the failed host back into service automatically to maintain HA and load capacity. Engineering will investigate the memory leak to produce a patch that resolves the root issue permanently in a future release.
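
The preventative action above (detect a host approaching its memory limit, remove it from the pool, restart it, and return it to service) can be illustrated with a minimal Python sketch. This is not Proxyclick's implementation, which the RCA does not describe. The `Host` class, the `restart` helper, the 90% threshold, and the 200 MB post-restart baseline are all hypothetical, chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """Hypothetical model of one web application service host."""
    name: str
    memory_used_mb: float
    memory_limit_mb: float
    in_pool: bool = True   # whether the load balancer routes traffic here
    restarts: int = 0

    def memory_ratio(self) -> float:
        return self.memory_used_mb / self.memory_limit_mb


def restart(host: Host, baseline_mb: float = 200.0) -> None:
    """Hypothetical restart: a fresh process starts back at a baseline footprint."""
    host.memory_used_mb = baseline_mb
    host.restarts += 1


def health_check(pool: List[Host], threshold: float = 0.9) -> List[str]:
    """Recycle every host at or above the (assumed) memory threshold.

    Hosts are handled one at a time, so at most one host is out of the
    pool at any moment. Returns the names of the hosts that were recycled.
    """
    recycled = []
    for host in pool:
        if host.memory_ratio() >= threshold:
            host.in_pool = False   # stop routing traffic to the failing host
            restart(host)          # self-healing step
            host.in_pool = True    # return the host to service
            recycled.append(host.name)
    return recycled


if __name__ == "__main__":
    pool = [Host("web-1", 980, 1024), Host("web-2", 300, 1024)]
    print(health_check(pool))
```

A check like this only masks the leak by restarting hosts before they exhaust memory, much as the frequent deployments did before the incident. The permanent fix remains the patch that Engineering committed to investigating.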