Proxyclick incident

Elevated response times in Proxyclick admin dashboard

Proxyclick experienced a major incident on October 8, 2024 affecting Dashboard and iPad app and 1 more component, lasting 21h 31m. The incident has been resolved; the full update timeline is below.

Started: Oct 08, 2024, 07:46 AM UTC
Resolved: Oct 09, 2024, 05:18 AM UTC
Duration: 21h 31m
Detected by Pingoru: Oct 08, 2024, 07:46 AM UTC

Affected components

Dashboard, iPad app, Browser-based Kiosk, SMS, Mail, Slack, Skype for Business, API, Webhooks, Box integration

Update timeline

investigating Oct 08, 2024, 07:46 AM UTC

We are currently investigating reports of increased loading times and slow response in the Proxyclick admin dashboard.
investigating Oct 08, 2024, 08:57 AM UTC

Our Infra team is currently investigating the issue to identify the cause and take remedial action. Next update in 2 hours.
investigating Oct 08, 2024, 11:04 AM UTC

Our Infra team is actively working to determine the root cause of the disruption and assess its impact. We will provide the next update in 4 hours. Thank you for your patience as we work to resolve this issue.
investigating Oct 08, 2024, 01:23 PM UTC

We are continuing to investigate this issue.
investigating Oct 08, 2024, 01:27 PM UTC

We believe we have identified the root cause of the issue. The impact is currently intermittent. To resolve the issue, we will be restarting the database service at 15:30 CET. This will result in an approximately 30-minute outage. We expect the service to be fully restored after the database restart. During this period, our services will be temporarily unavailable. We apologize for any inconvenience this may cause and appreciate your understanding and patience as we work to improve our systems.
monitoring Oct 08, 2024, 02:27 PM UTC

The system is operational now. We shall continue to monitor the situation. Next update: 11:00 US Central Time.
monitoring Oct 08, 2024, 04:01 PM UTC

The systems are functioning normally. We shall continue to monitor the situation for an extended period. Next update: 03:30 US Central Time on October 9, 2024.
resolved Oct 09, 2024, 05:18 AM UTC

We are pleased to share that this incident is resolved as we have confirmed full restoration of admin response time performance. We will publish our root cause analysis findings on this incident within 10 business days.
postmortem Oct 25, 2024, 09:31 AM UTC

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

**Type of Event:** S2 – Visitor (PXC) API and Dashboard Performance Issues

**Services/Modules Impacted:** Visitor Dashboard, web application, and API

**Root Cause:** We're always working to keep our systems running smoothly, and part of that includes a weekly cleanup job that removes unused documents. This job efficiently processes up to 10,000 documents at a time and is managed by a scheduler that communicates with our system via an event. Recently, the job completed successfully and was marked as such in our database. However, an event acknowledgment didn't go through, leading to repeated retries. While this caused some temporary database deadlocks, our team quickly identified the issue and resolved it with a database restart.

**Remediation:** To address the immediate problem, the customers were notified, and the database was restarted, which cleared the deadlocks and stopped the retry loop.

**Timeline:** _All times listed in CEST_

**06 Oct 2024**

- 10:00 p.m.: HardDeleteDocuments job executed.

**07 Oct 2024**

- 07:23 a.m.: Deadlock errors were logged in the database. 10 deadlock errors were logged throughout the business day, but the number was not alarming.

**08 Oct 2024**

- 02:18 a.m.: Deadlock errors started to show in the database again.
- 07:40 a.m.: An incident was reported, prompting Eptura to initiate the investigation.
- 09:46 a.m.: Eptura updated the Visitor status page to reflect the incident and investigation.
- 03:00 p.m.: The Eptura Infra team identified the root cause of the issue and suggested restarting the database.
- 03:27 p.m.: The Eptura team updated the status page to notify the customers of the above and also notified of the 30-minute downtime.
- 03:41 p.m.: The Eptura team restarted the database and performed sanity checks on the app.
- 04:00 p.m.: The Eptura team updated cases to ask customers for initial feedback on the issue.
- 04:27 p.m.: The Eptura team updated the status page confirming the issue is resolved and moved to monitoring.

**09 Oct 2024**

- 03:48 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.

**Total Duration of Event:** 6 hours 47 minutes

**Preventive Action:** Eptura is proactively enhancing our monitoring to better track job executions. We're setting up alert monitoring jobs and reviewing our retry mechanism to implement stronger solutions, like limited retries or a dead letter queue, to prevent future issues. Additionally, we're planning a broader re-architecture in early 2025 to ensure smooth scaling of our application. We're committed to continuous improvement and excited to deliver an even better experience for you!
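
The preventive action mentions limited retries and a dead letter queue. As an illustration only, the following minimal Python sketch shows how those two ideas, combined with an idempotency check against the "job already completed" marker, could stop a lost acknowledgment from re-running a job indefinitely. It is not Eptura's implementation: `Event`, `Consumer`, `run_until_settled`, the in-memory `completed` set, and the `max_attempts` value are all hypothetical stand-ins for whatever scheduler, database, and queue are actually in use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Event:
    """Hypothetical scheduler event that triggers a job run."""
    event_id: str
    attempts: int = 0


@dataclass
class Consumer:
    """Hypothetical consumer with bounded retries and a dead letter list."""
    handler: Callable[[Event], None]
    max_attempts: int = 3  # assumed limit; tune per job
    # Stand-in for the "job completed" marker stored in the database.
    completed: Set[str] = field(default_factory=set)
    # Stand-in for a dead letter queue that operators can inspect.
    dead_letters: List[Event] = field(default_factory=list)

    def deliver(self, event: Event) -> str:
        # Idempotency check: if the job already completed (for example the
        # earlier acknowledgment was lost), acknowledge without re-running.
        if event.event_id in self.completed:
            return "acked-duplicate"

        event.attempts += 1
        try:
            self.handler(event)
        except Exception:
            # Bounded retries: park the event after max_attempts failures
            # instead of retrying forever.
            if event.attempts >= self.max_attempts:
                self.dead_letters.append(event)
                return "dead-lettered"
            return "retry"

        self.completed.add(event.event_id)
        return "acked"


def run_until_settled(consumer: Consumer, event: Event, limit: int = 100) -> List[str]:
    """Redeliver the event until it is acknowledged or dead-lettered."""
    outcomes: List[str] = []
    for _ in range(limit):
        outcome = consumer.deliver(event)
        outcomes.append(outcome)
        if outcome != "retry":
            break
    return outcomes
```

In this sketch, redelivering an event whose job has already completed returns `"acked-duplicate"` without invoking the handler again, and a handler that always fails is attempted at most `max_attempts` times before the event is moved to `dead_letters`, where an alert could pick it up.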