CyberQP incident

Intermittent Web Dashboard Slowness

Minor Resolved

CyberQP experienced a minor incident on September 25, 2024 affecting the Web Dashboard, lasting 8d 9h. The incident has been resolved; the full update timeline is below.

Started
Sep 25, 2024, 02:31 PM UTC
Resolved
Oct 04, 2024, 12:03 AM UTC
Duration
8d 9h
Detected by Pingoru
Sep 25, 2024, 02:31 PM UTC

Affected components

Web Dashboard

Update timeline

  1. investigating Sep 25, 2024, 02:31 PM UTC

    Our team is seeing intermittent slowness in the web dashboard (admin.getquickpass.com). We are actively investigating and will provide further updates as more information is made available.

  2. monitoring Sep 26, 2024, 01:45 PM UTC

    Our team has investigated the causes and will be monitoring today.

  3. monitoring Sep 26, 2024, 10:43 PM UTC

    We have seen a return of the Dashboard slowness this afternoon and are continuing to monitor and work on a permanent resolution. We will continue to update the page as we have more details.

  4. monitoring Sep 27, 2024, 09:19 PM UTC

    We have made significant changes to resolve this. We will be making additional changes and will continue to monitor this over the weekend. Additional updates will be posted Monday/Tuesday as we continue to improve the responsiveness of the Dashboard.

  5. monitoring Oct 01, 2024, 11:18 PM UTC

    After a weekend of good results, the slowness resurfaced on Tuesday. Our team is working on this as quickly as possible, and we'll continue to update everyone as we have more to share. Response times are much better in the early morning (6am to noon EST), but the slowness returns as the day progresses.

  6. monitoring Oct 02, 2024, 09:24 PM UTC

    The CyberQP team worked overnight to implement some integral changes to the Dashboard loading process. These changes, coupled with additional changes that we will be implementing over the next 24 hours, should return the Dashboard to normal functionality. A postmortem will be posted with the final resolution of this incident. If the Accounts lists for your customers are still taking longer than 30 seconds to load, please let us know via a support ticket (reply to this email to automatically create the ticket).

  7. resolved Oct 04, 2024, 12:03 AM UTC

    Continued monitoring shows a return to normal operations for the Accounts-related pages, so this incident is resolved. We are aware of slowness on the Agents page, and our team will continue to release fixes to address this in the coming days. If you continue to receive error banners, please clear the cache and cookies for the Dashboard site, relaunch the browser, and log in again. We've heard from a number of users that this resolved the error. A postmortem will be posted tomorrow for this incident. Please reply to this email if you are still experiencing slow responses on the Dashboard.

  8. postmortem Oct 04, 2024, 07:50 PM UTC

    **What happened?**

    The API that powers our Dashboard experienced high latency on a few endpoints, which caused the corresponding pages to have high response times.

    **Why did it happen?**

    Over the past few weeks, we've had an influx of notifications from the agents, which slowed down the service that processes these notifications. The Dashboard depends on this service for communication with the agents, which is why it slowed down.

    **How did we fix it?**

    We have pushed a few fixes to decouple the Dashboard page loads from that notification service, so that the Dashboard does not rely on it and therefore loads more quickly. We also pushed an optimization to the notification processing that offloads the heavy processing to a separate, more scalable, load-balanced service so that notifications can be processed more efficiently.

    **What are the long-term plans to make sure it doesn't happen again?**

    We've added monitoring and alerting to critical components that will notify us when performance is degraded. We're continuing to improve the communication between our agent and our cloud services, including optimizing queries to our database and queuing heavy workloads so they are processed in a timely and efficient manner. We're also improving our incident response process so that we are quicker and more responsive during these incidents.
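
As a rough illustration of the decoupling the postmortem describes, the sketch below queues incoming notifications for a background worker so that a page load never waits on notification processing. This is not CyberQP's implementation: every name here (`notification_queue`, `receive_notification`, `process_notification`, `load_accounts_page`) is hypothetical, and an in-process queue with a single thread stands in for what the postmortem calls a separate, load-balanced service.

```python
import queue
import threading
import time

# All names are hypothetical; nothing here comes from CyberQP's codebase.
notification_queue = queue.Queue()
processed = []  # agent IDs whose notifications have been handled


def process_notification(notification):
    """Stand-in for the heavy per-notification work."""
    time.sleep(0.01)  # simulate slow processing
    processed.append(notification["agent_id"])


def worker():
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        item = notification_queue.get()
        if item is None:
            break
        process_notification(item)


def receive_notification(notification):
    """Ingest path: enqueue the notification and return immediately."""
    notification_queue.put(notification)


def load_accounts_page(accounts):
    """Page-load path: uses only its own data and never touches the queue."""
    return sorted(accounts)
```

To try it, start `worker` in a `threading.Thread`, call `receive_notification` a number of times, and call `load_accounts_page` while the backlog is still draining; putting `None` on the queue stops the worker. The point of the design is that a burst of agent notifications lengthens the queue rather than the page's response time.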