Chef incident

Elevated connection rate and 500s

Critical Resolved

Chef experienced a critical incident on April 16, 2020, lasting 3h 31m. The incident has been resolved; the full update timeline is below.

Started
Apr 16, 2020, 11:06 PM UTC
Resolved
Apr 17, 2020, 02:37 AM UTC
Duration
3h 31m
Detected by Pingoru
Apr 16, 2020, 11:06 PM UTC

Update timeline

  1. investigating Apr 16, 2020, 11:06 PM UTC

    After upgrading PostgreSQL, we are seeing spikes in database connections and CPU usage. We're resizing the database to give it more system resources and will provide additional updates as we have them.

  2. investigating Apr 17, 2020, 12:00 AM UTC

    We're investigating an unexpected increase in database fetches by the authz service.

  3. investigating Apr 17, 2020, 01:58 AM UTC

    We're isolating the issue with the authz service's database queries that are taking an abnormally long time to complete.

  4. investigating Apr 17, 2020, 02:13 AM UTC

    We've implemented a short-term workaround to restore service. We're monitoring the service.

  5. investigating Apr 17, 2020, 02:24 AM UTC

    Traffic patterns and service have returned to the regular levels observed prior to the maintenance window. We will conduct an incident analysis and publish a blog post about it next week. Thank you for your patience, and I'm sorry that this impacted your workflows.

  6. resolved Apr 17, 2020, 02:37 AM UTC

    This incident has been resolved.