Ashby Unavailable

Major · Resolved

Ashby experienced a major incident on October 22, 2025 affecting Email, Google, and eight other components, lasting about 4 hours. The incident has been resolved; the full update timeline is below.

Started
Oct 22, 2025, 04:06 PM UTC
Resolved
Oct 22, 2025, 08:07 PM UTC
Duration
4h
Detected by Pingoru
Oct 22, 2025, 04:06 PM UTC

Affected components

Email, Google, Ashby API, Recruiting, Slack, Office 365, Reports API, Analytics, Job Post API, Hosted Job Boards

Update timeline

  1. investigating Oct 22, 2025, 04:06 PM UTC

    We are currently investigating this issue.

  2. identified Oct 22, 2025, 04:09 PM UTC

    We've identified the issue and are investigating a fix.

  3. monitoring Oct 22, 2025, 04:19 PM UTC

    We've implemented a fix and are monitoring.

  4. monitoring Oct 22, 2025, 05:29 PM UTC

    We are continuing to monitor. As a precaution we have disabled Workday and Analytics syncs.

  5. monitoring Oct 22, 2025, 07:15 PM UTC

    We're beginning to process the backlog of Analytics and Workday syncs and are continuing to monitor. Until we are caught up, customers of Ashby Analytics will see data that is a couple of hours behind their ATS and Ashby All-in-One customers will see delays in the Workday integration.

  6. resolved Oct 22, 2025, 08:07 PM UTC

    Our systems have caught up on the backlog of Analytics and Workday sync tasks. Sync times have returned to normal. This incident has been resolved.

  7. postmortem Nov 10, 2025, 06:32 PM UTC

    ## Summary

    On October 22, 2025, Ashby was unavailable for approximately 20 minutes (4:00 PM to 4:19 PM UTC). Due to a bug introduced five hours earlier, a critical service in Ashby's infrastructure began to fail, causing our application to become unavailable. Customer job boards remained functional during this period. Once we resolved the availability issue, we disabled the Workday and Analytics syncs and gradually increased their frequency over approximately 1 hour and 45 minutes, from 5:29 PM UTC to 7:15 PM UTC.

    ## Why did this happen?

    Before the incident, a bug was introduced to our application code that ran on an automated schedule. Over the course of five hours, the bug caused an increasing amount of network traffic to a critical service that the application code communicated with. Approximately 10 minutes before the incident, the increased network traffic caused a sudden spike in memory usage on the virtual machine hosting the critical service, resulting in the virtual machine running out of memory and the service failing.

    ## How did we resolve the situation?

    On Wednesday, October 22, 2025, at 10:48 AM UTC, we deployed the application code change containing the bug. At 4:00 PM UTC, various automated monitors alerted our team that Ashby was unavailable, and an incident was initiated by our on-call engineer. At 4:09 PM UTC, we determined that the service had failed because the virtual machine ran out of memory, and we began failing the service over to a backup virtual machine. At 4:19 PM UTC, the service was restored, and we confirmed Ashby was available. At 4:33 PM UTC, we identified the probable cause and began reverting the change that introduced the bug. At 6:09 PM UTC, we verified that the suspected bug was the root cause. At 6:15 PM UTC, the revert was shipped to production.

    At 6:16 PM UTC, out of an abundance of caution, we slowly increased the frequency at which the application code was run while monitoring the critical service it communicated with. At 7:15 PM UTC, we were confident that we had removed the effects introduced by the bug, and we restored the application code to its normal schedule. Because the normal schedule had not run for several hours, it took time for the application code to work through its backlog of tasks. At 8:07 PM UTC, the backlog was completed, and we resolved the incident.

    ## What have we put in place to prevent it from happening in the future?

    Once we identified the root cause and resolved the incident, we immediately implemented two changes to detect or prevent this issue from recurring:

    * **Monitors that detect and alert our team to outliers in network traffic to our critical service.**
    * **A deployed change that removes the network-traffic side effect that caused the incident.** The bug uncovered this side effect.

    Our team has also committed to moving the critical service that failed onto virtual machines that auto-scale. One of the reasons the service failed was that we had explicitly allocated a fixed amount of memory to the virtual machine, and when that limit was reached, the machine failed. We will transition this service to one that allows our cloud provider to automatically scale both the size (including available memory) and the number of machines on which the service runs.
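The first prevention measure above is an outlier monitor on network traffic to the critical service. The postmortem does not describe how Ashby's monitor works; as a rough illustration of the idea, the sketch below (hypothetical names, not Ashby's tooling) flags a sample as an outlier when it sits several standard deviations above a rolling window of recent traffic:

```python
# Hypothetical sketch of an outlier monitor for traffic to a critical
# service: keep a rolling window of samples and alert when the newest
# sample's z-score exceeds a threshold.
from collections import deque
from statistics import mean, stdev


class TrafficOutlierMonitor:
    """Alert when the latest sample deviates sharply from recent history."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples: deque = deque(maxlen=window)  # recent traffic samples
        self.threshold = threshold                  # z-score that triggers an alert

    def observe(self, bytes_per_sec: float) -> bool:
        """Record one sample; return True if it is an outlier vs. the window."""
        is_outlier = False
        if len(self.samples) >= 10:  # require some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (bytes_per_sec - mu) / sigma > self.threshold:
                is_outlier = True
        self.samples.append(bytes_per_sec)
        return is_outlier


monitor = TrafficOutlierMonitor()
# Steady baseline traffic produces no alerts...
alerts = [monitor.observe(100.0 + (i % 5)) for i in range(30)]
# ...while a sudden spike, like the one that preceded this incident, does.
spike_alert = monitor.observe(10_000.0)
```

A gradual ramp-up, as the incident timeline describes with the Workday and Analytics syncs, avoids false alarms from such a monitor: traffic that climbs slowly stays within the rolling window's statistics, while abrupt spikes do not.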