Jobvite incident

Delays with Candidate Syncs

Jobvite experienced a minor incident on April 8, 2026. The incident has been resolved; the full update timeline is below.

Started: Apr 08, 2026, 07:57 PM UTC
Resolved: Apr 08, 2026, 06:00 AM UTC
Duration: —
Detected by Pingoru: Apr 08, 2026, 07:57 PM UTC

Update timeline

resolved Apr 08, 2026, 07:57 PM UTC

Summary: We experienced an incident affecting the API service that syncs candidate data from our CRM to downstream systems.

Impact: Customers may have experienced delays in seeing updated candidate information appear in the Jobvite ATS during the incident window.

Affected Service: Mukmuk API Engine

Timeline:
Start: April 8 at approximately 2:00 AM EDT
Duration: Approximately 8 hours

postmortem Apr 13, 2026, 04:15 PM UTC

**Date:** April 8, 2026

**Duration:** ~11 hours (01:00 am ET – 12:00 pm ET)

We want to share an update on a recent automation issue, including what happened, how it was resolved, and the steps we’ve taken to prevent it from happening again.

## Customer Impact

On April 8, 2026, certain background processing agents responsible for moving data between Talemetry Apply and external Applicant Tracking Systems (ATS) were not running as expected. As a result:

* Job and application data imports/exports were delayed.
* Job status updates were not reflected in near real time.
* Newly submitted applications were not immediately visible in the external ATS.

No data was lost. All affected jobs and applications were successfully processed once service was restored.

## Root Cause

The issue occurred due to a congestion condition in the Mukmuk agent processing system:

* A large number of agents were scheduled to run, exceeding the capacity of the background worker pods, which were actively handling the workload.
* Several non-essential agents consumed capacity, blocking higher-priority production agents and causing the system to enter a stalled state where critical agents could not execute.

In short, the system lacked sufficient safeguards to prevent inactive or non-critical agents from interfering with production workloads during peak processing conditions.

## Resolution

Once the issue was identified, the following actions were taken:

1. Non-essential development agents were disabled.
2. Background processing capacity was temporarily increased to accelerate backlog processing.
3. Queued jobs were monitored until all pending imports and exports completed successfully.

By approximately 12:00 pm ET, all agents had caught up, job processing returned to normal, and customer-facing data reflected up-to-date timestamps confirming restoration.
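
For readers who want a concrete picture of the congestion condition described under Root Cause, the following is a minimal illustrative sketch in Python. It is not the Mukmuk implementation: the `Agent` type, the `admit` function, and the slot counts are all hypothetical names and values chosen for the example. It shows how a fixed pool of worker slots can be filled entirely by non-essential agents, and how reserving part of the pool for critical agents avoids that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Hypothetical stand-in for a scheduled background agent."""
    name: str
    critical: bool  # True for production agents, False for non-essential ones


def admit(queue, total_slots, reserved_for_critical):
    """Choose which queued agents get a worker slot, scanning in queue order.

    Non-critical agents may occupy at most
    total_slots - reserved_for_critical slots, so critical agents
    always have room even when many non-critical agents are queued first.
    """
    non_critical_cap = total_slots - reserved_for_critical
    running = []
    non_critical_running = 0
    for agent in queue:
        if len(running) >= total_slots:
            break  # pool is full
        if agent.critical:
            running.append(agent)
        elif non_critical_running < non_critical_cap:
            running.append(agent)
            non_critical_running += 1
        # otherwise the non-critical agent waits for a later scheduling pass
    return running


# Four non-essential agents are queued ahead of two production agents.
queue = [
    Agent("dev-1", False), Agent("dev-2", False),
    Agent("dev-3", False), Agent("dev-4", False),
    Agent("prod-1", True), Agent("prod-2", True),
]

# No reservation: the pool fills with non-essential agents and production stalls.
print([a.name for a in admit(queue, total_slots=4, reserved_for_critical=0)])
# ['dev-1', 'dev-2', 'dev-3', 'dev-4']

# Two slots reserved: production agents run despite the queue order.
print([a.name for a in admit(queue, total_slots=4, reserved_for_critical=2)])
# ['dev-1', 'dev-2', 'prod-1', 'prod-2']
```

Reserving capacity is only one possible safeguard; disabling or cleaning up unused agents, as was done during resolution, reduces the same contention at the source.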
## Preventative Actions

To reduce the likelihood and impact of similar incidents in the future, the following improvements are underway or completed:

* **Proactive Monitoring**
  * New alerts are being added to detect stalled or non-running agents in near real time.
  * Monitoring schedules are being expanded beyond limited overnight checks.
* **Improved Observability**
  * Additional metrics will track agent execution health and backlog growth.
  * Engineering alerts will trigger automatically when agents fail to run as expected.
* **Operational Safeguards**
  * Cleanup of unused or churned agents to prevent resource contention.

These actions will significantly reduce detection, diagnosis, and recovery times should similar conditions arise again.
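
As a rough illustration of the kind of checks the monitoring items above describe, the sketch below flags agents that have not run recently and detects a steadily growing backlog. It is an assumption-laden example, not the alerting that was actually deployed: the function names `find_stalled` and `backlog_is_growing`, the 30-minute silence threshold, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stalled(last_run_by_agent, now, max_silence):
    """Return the names of agents whose last run is older than max_silence."""
    return sorted(
        name
        for name, last_run in last_run_by_agent.items()
        if now - last_run > max_silence
    )


def backlog_is_growing(samples, window=3):
    """Return True if the last `window` backlog samples strictly increase."""
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough data to judge a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical data: one agent last ran two hours ago, another five minutes ago.
now = datetime(2026, 4, 8, 8, 0, tzinfo=timezone.utc)
last_runs = {
    "ats-export": now - timedelta(hours=2),
    "job-import": now - timedelta(minutes=5),
}

print(find_stalled(last_runs, now, max_silence=timedelta(minutes=30)))
# ['ats-export']

print(backlog_is_growing([120, 340, 910]))
# True
```

Running checks like these on a continuous schedule, rather than only overnight, is what shortens the gap between agents stalling and someone being alerted.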