Python Package Index incident

PyPI Outage 2022-07-28 18:46-18:53 UTC


Python Package Index (PyPI) experienced a notice-level incident on July 28, 2022, with an outage from 18:46 to 18:53 UTC. The incident has been resolved; the full update timeline is below.

Started: Jul 28, 2022, 07:00 PM UTC
Resolved: Jul 28, 2022, 07:00 PM UTC
Duration: not reported
Detected by Pingoru: Jul 28, 2022, 07:00 PM UTC

Update timeline

  1. resolved Jul 28, 2022, 07:00 PM UTC

    PyPI experienced an outage in its backends from 18:46-18:53 UTC on 2022-07-28. This outage stemmed from a cascading failure in the backend infrastructure that disrupted the management layer that deploys and secures PyPI's application servers. It is now resolved.

  2. postmortem Jul 28, 2022, 07:00 PM UTC

    PyPI experienced an outage in its backends from 18:46-18:53 UTC on 2022-07-28. This outage stemmed from a cascading failure in the backend infrastructure that disrupted the management layer that deploys and secures PyPI's application servers.

    ## Details

    Beginning at 17:24 UTC, multiple cloud instances that are members of the Kubernetes cluster that PyPI runs on were automatically replaced by the provider due to an unknown failure. By 18:20 UTC all instances had been replaced and were healthy members of the Kubernetes cluster.

    This replacement event caused the system to reshuffle _many_ pods, which impacted both our Consul and Vault clusters inside Kubernetes. Both services were restarted across new nodes.

    At 18:23 UTC the first alert was delivered to PSF Infrastructure, notifying that at least one Vault container was down and required unsealing to come back online. Upon investigation it was determined that all Vault containers in the cluster were unavailable and that the Consul containers were unable to establish a new leader.

    The responding Infrastructure Team member worked to bring Consul back online, then unseal the new Vault containers. Complications arose in bringing Consul back into a healthy state because one of the pods was stuck in the “Terminating” state, which caused the discovery mechanisms for Consul servers to get stuck expecting an additional server to participate in elections. Once this pod was forcefully removed and allowed to be re-created, recovery commenced.

    With these backing services offline, new containers launched to serve PyPI’s applications failed. As a result, containers rescheduled due to node replacements were unable to start. This eventually caused PyPI to become unresponsive at 18:46 UTC as more and more containers were rescheduled by Kubernetes in an attempt to meet the specified instance count.
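
The stuck “Terminating” pod described above is the kind of condition a small check could surface early. The following is a minimal sketch, not the tooling PSF Infrastructure uses. It relies on the fact that Kubernetes marks a pod as terminating by setting `metadata.deletionTimestamp`. The `PodInfo` type, the pod names, and the five-minute threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class PodInfo:
    """Hypothetical, simplified view of a pod (not a Kubernetes client type)."""
    name: str
    # Mirrors metadata.deletionTimestamp; None means no deletion is pending.
    deletion_timestamp: Optional[datetime]


def stuck_terminating(
    pods: List[PodInfo],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),  # assumed threshold
) -> List[str]:
    """Return names of pods whose deletion timestamp lies more than `grace` in the past."""
    return [
        pod.name
        for pod in pods
        if pod.deletion_timestamp is not None
        and now - pod.deletion_timestamp > grace
    ]


if __name__ == "__main__":
    # Illustrative data only; these names and times are not from the incident.
    now = datetime(2022, 7, 28, 18, 30, tzinfo=timezone.utc)
    pods = [
        PodInfo("consul-0", None),
        PodInfo("consul-1", now - timedelta(minutes=12)),
        PodInfo("consul-2", now - timedelta(seconds=20)),
    ]
    print(stuck_terminating(pods, now))
```

In practice the pod list would come from the cluster API rather than hand-built records, and a flagged pod would still need a human decision before forced removal, as happened in this incident.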