Proxyclick incident
Proxyclick platform authentication failures
Proxyclick experienced a notice incident on August 1, 2024, affecting the Dashboard, the iPad app, and 1 more component, lasting 8h 39m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Aug 01, 2024, 12:53 PM UTC
We are currently investigating reports of user authentication failures across the platform.
- investigating Aug 01, 2024, 01:13 PM UTC
We are continuing to investigate this issue.
- identified Aug 01, 2024, 02:19 PM UTC
A potential root cause of the issue has been identified. Our engineering and infrastructure teams are working towards implementing a solution to restore the service as soon as possible.
- identified Aug 01, 2024, 04:03 PM UTC
We are pleased to report that we have identified the root cause and are implementing mitigations to address the service disruption. Please allow time for these changes to propagate through the infrastructure; as service stabilizes, you may still experience intermittent disruption. We will provide updates as we continue to monitor. The next update will be issued at 19:00 CEST.
- identified Aug 01, 2024, 05:18 PM UTC
While we have addressed the initial root cause of the disruption, performance continues to be affected by the backlog of pending API requests. We have allocated additional server resources to expedite recovery and are redeploying API nodes to initiate this change. We continue to monitor progress and will post our next update at 20:00 CEST.
- monitoring Aug 01, 2024, 06:10 PM UTC
We have allocated additional server resources to expedite recovery and are redeploying API nodes to initiate this change. We will continue to monitor progress until further notice. The next update will be at 21:00 CEST.
- monitoring Aug 01, 2024, 07:14 PM UTC
We are continuing to monitor the processing of API requests. The next update will be at 23:00 CEST.
- resolved Aug 01, 2024, 09:33 PM UTC
We have completed remediation and confirmed that all queues have returned to normal volumes and performance has stabilized. We will continue to closely monitor the status during our root cause analysis and will release our findings as a final update to this incident within 10 business days.
- postmortem Sep 04, 2024, 04:03 PM UTC
We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

**Type of Event:** S1 – Visitor Not accessible

**Services/Modules Impacted:** Visitor API not responsive, login (forms-based and SAML SSO) blocked

**Root Cause:** The storage cluster surpassed its capacity, preventing the creation of new producers. Consequently, the Visitor messaging system experienced a failure, rendering Eptura Visitor inaccessible to all clients.

**Remediation:** Eptura initially increased the disk size to accommodate additional messages within the service. This temporary solution enabled us to conduct a thorough investigation for a more permanent resolution. We pinpointed a particular issue that had been subject to manual cleanup as part of our monitoring efforts. Following the cleanup, we restored the message service to operational status, normalizing activity to restore Visitor services.

**Timeline:** All times listed in U.S. Central Time

- 7:28 a.m.: Monitoring alerts indicated that Eptura exceeded its storage capacity.
- 7:34 a.m.: An incident was reported, prompting Eptura to start investigation.
- 7:53 a.m.: Eptura updated the Visitor status page to reflect the incident and investigation.
- 11:03 a.m.: Eptura identified the root cause and implemented measures to mitigate the service disruption.
- 12:10 p.m.: Eptura redeployed API nodes to allocate additional server resources.
- 12:18 p.m.: System performance remained impacted due to a backlog of pending API requests.
- 1:11 p.m.: Eptura confirmed that the backlog cleared and resumed monitoring.
- 4:33 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.

**Total Duration of Event:** 9 hours, 5 minutes

**Preventive Action:** We have implemented automated storage cluster cleanup processes and enhanced monitoring, and will continue to examine both manual and automated enhancements to optimize cluster sizing and management moving forward.
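
For readers who want a concrete picture of what "enhanced monitoring" of storage capacity could look like, here is a minimal Python sketch. It is illustrative only and does not describe Eptura's actual tooling: the `DiskSample` type, the function names, and the thresholds (70% warning, 85% critical, 24 hours of headroom) are all assumptions chosen for the example. The idea is to alert before the cluster is full, both on the current usage ratio and on the projected time until capacity is exhausted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskSample:
    """One hypothetical capacity reading for a storage cluster."""
    used_bytes: int
    total_bytes: int
    growth_bytes_per_hour: float  # recent average growth (assumed input)


def usage_ratio(sample: DiskSample) -> float:
    """Fraction of capacity currently in use (0.0 to 1.0)."""
    return sample.used_bytes / sample.total_bytes


def hours_until_full(sample: DiskSample) -> Optional[float]:
    """Projected hours until the cluster is full, or None if usage is not growing."""
    if sample.growth_bytes_per_hour <= 0:
        return None
    free_bytes = sample.total_bytes - sample.used_bytes
    return free_bytes / sample.growth_bytes_per_hour


def classify(
    sample: DiskSample,
    warn_ratio: float = 0.70,      # assumed threshold
    critical_ratio: float = 0.85,  # assumed threshold
    min_hours_left: float = 24.0,  # assumed headroom before alerting
) -> str:
    """Return 'ok', 'warning', or 'critical' for a capacity reading."""
    ratio = usage_ratio(sample)
    eta = hours_until_full(sample)
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio or (eta is not None and eta < min_hours_left):
        return "warning"
    return "ok"
```

A check like this would typically run on a schedule and feed an alerting system, so that cleanup or resizing can begin while there is still headroom rather than after producers start failing.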