Proxyclick incident

S2 - Elevated response times in the admin dashboard

Proxyclick experienced a minor incident on August 13, 2024 affecting Dashboard, lasting 1d 5h. The incident has been resolved; the full update timeline is below.

Started: Aug 13, 2024, 12:33 PM UTC
Resolved: Aug 14, 2024, 05:44 PM UTC
Duration: 1d 5h
Detected by Pingoru: Aug 13, 2024, 12:33 PM UTC

Affected components

Dashboard

Update timeline

investigating Aug 13, 2024, 12:33 PM UTC

We are currently investigating an issue with the admin dashboard.
investigating Aug 13, 2024, 03:47 PM UTC

Our Engineering & Infrastructure teams are investigating an issue related to intermittent gateway timeouts, which is impacting response times in the admin dashboard. Investigation into the underlying root cause of the timeouts is ongoing, with both Eptura and Microsoft teams engaged.
monitoring Aug 13, 2024, 06:44 PM UTC

Eptura Engineering has identified and implemented a fix as of 12:30pm CST. As we have not seen further service disruptions, we are moving to a Monitoring status for the next 2 hours.
monitoring Aug 13, 2024, 09:02 PM UTC

We are continuing to monitor to ensure the fix has mitigated the impact experienced. We will leave the monitoring window open for another 2 hours.
monitoring Aug 13, 2024, 11:25 PM UTC

We will continue to monitor until 8am CST to ensure all logs and processing are clear. Thank you for your understanding.
monitoring Aug 14, 2024, 11:42 AM UTC

Standard application monitoring identified that one of the queues within the application was growing from 5:10 UTC to 7:20 UTC. During this time, some customers may have observed elevated dashboard response times. Eptura Cloud Operations implemented mitigations, and we continue to monitor.
resolved Aug 14, 2024, 05:44 PM UTC

This incident has been resolved.
postmortem Sep 11, 2024, 04:23 PM UTC

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

**Type of Event:** S2 – Visitor (PXC) dashboard intermittent performance issues

**Services/Modules Impacted:** Visitor Dashboard

**Root Cause:** The servers were operating at full capacity when the service job was simultaneously handled by both API servers, leading to an increase in server traffic. This situation temporarily saturated the servers, causing them to drop incoming connections. Additionally, the extra load extended the completion time for existing requests, which intermittently affected dashboard performance.

**Remediation:** Eptura restarted the API services to address the intermittent issue, which provided immediate relief and allowed us to conduct an in-depth investigation and devise a more permanent solution. We identified a specific configuration issue within the API servers, which our CloudOps team promptly addressed by adjusting the service configurations to prevent jobs from being picked up simultaneously by the API servers. This change ensures smoother operations and improved reliability.

**Timeline:** _All times listed in UTC_

_9:40 a.m.: Monitoring alerts indicated that API services were not responding._

_10:15 a.m.: Eptura restarted the API services and restored the functionality._

_12:06 p.m.: An incident was reported, prompting Eptura to start the investigation._

_13:33: Eptura updated the Visitor status page to reflect the incident and investigation._

_15:00: Eptura restarted the services, but without effect. Eptura engaged Microsoft’s Network team to aid the investigation._

_16:00: Eptura deployed additional API nodes and restarted services to resolve the issue._

_16:18: System performance remained impacted due to a backlog of pending API requests._

_18:30: Eptura confirmed that the backlog had cleared and resumed monitoring._

_5:30 a.m.: Monitoring alerts indicated that API services were not responding._

_07:15 a.m.: The Eptura team restarted the services on the API nodes to resolve the issue and resumed monitoring._

_19:44: Following a successful monitoring period, Eptura marked the incident as resolved._

**Total Duration of Event:** 8 hours 35 minutes

**Preventive Action:** We have provisioned additional servers to handle the daily request volume. Furthermore, the Eptura CloudOps team has implemented additional internal monitoring designed to better track errors and dropped connections from a load balancer perspective, enabling us to take swift and effective action whenever necessary. This proactive approach ensures a smoother and more reliable service experience.