Python Package Index experienced a major incident on April 5, 2022 affecting pypi.org - Backends and files.pythonhosted.org - Redirects and 1 more component, lasting 1h 11m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- investigating Apr 05, 2022, 03:25 PM UTC
All backend services for PyPI are down due to a cascading failure in our deployment tooling. We are investigating and working on restoring service.
- identified Apr 05, 2022, 03:38 PM UTC
Core failed service has been identified and is coming back online. Next we'll bring up the ancillary services. Once all our automation services are online we can begin to bring the applications back.
- monitoring Apr 05, 2022, 03:41 PM UTC
Applications are coming back online and we are monitoring for stability.
- resolved Apr 05, 2022, 04:36 PM UTC
This incident is resolved.
- postmortem Apr 05, 2022, 04:36 PM UTC
At approximately 15:00 UTC, the TLS certificates that PyPI’s internal deployment tooling uses to secure access to Vault expired. This led to a cascading failure within the PyPI infrastructure that caused running pods to lose access to secure credentials and stopped new instances from being launched. Under normal circumstances, this would have been resolved as the Vault instances restarted and retrieved a new TLS certificate, but an abnormally large backlog of expired leases caused the new Vault instances to crash on startup and required manual intervention to cleanup extraneous leases. The initial remedy will be to upgrade our Vault instances to a version that resolves the crash on launch issue when the quantity of expired leases is too high, which would allow for this outage to have been recovered in a more automated fashion. Longer term, research and development time will be allocated to improving the automation around detection of instances nearing expiration as well as mechanisms to securely automate the unseal process for our secure storage.