Python Package Index experienced a major incident on October 24, 2022 affecting pypi.org - Backends and files.pythonhosted.org - Redirects and 1 more component, lasting 53m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Oct 24, 2022, 02:51 PM UTC
The backend that hosts PyPI and associated services is experiencing a major outage; we are investigating.
- monitoring Oct 24, 2022, 03:05 PM UTC
We have identified and resolved the reason for the outage and are monitoring to ensure it remains stable.
- resolved Oct 24, 2022, 03:44 PM UTC
This incident has been resolved.
- postmortem Oct 24, 2022, 03:44 PM UTC
## Summary

The cluster that hosts PyPI’s backends, as well as multiple ancillary services, experienced an outage during maintenance that interrupted access to services over HTTPS.

## Details

From 2022-10-24 14:43 UTC until 2022-10-24 15:03 UTC, PyPI’s backends were not accessible over HTTPS. This interfered with our CDN’s ability to fetch pages, made uploads to [uploads.pypi.org](http://uploads.pypi.org) impossible, and interrupted other services such as our legacy file redirect service.

PyPI services run as deployments in a Kubernetes cluster and are exposed via Ingress with the AWS Elastic Load Balancer (ELB) integration. The TLS certificate that this load balancer uses is managed via Amazon’s Certificate Manager (ACM). When our Kubernetes cluster was initially deployed, the Ingress-managed ELB was configured to use an existing ACM TLS certificate.

Earlier today, regular maintenance of our Kubernetes cluster required a rolling restart of all nodes in the cluster to distribute upgrades and new configurations. During this rolling restart, as the Kubernetes API server hosts were deployed, Kubernetes validated and refreshed the Ingress configurations we had previously defined.

Since the previous rolling upgrade, an additional hostname had been needed for PyPI’s Ingress. PyPI administrators had created a new ACM TLS certificate including that hostname and updated the Ingress-managed ELB to use this new certificate. As a result, the new Kubernetes API servers were unable to find the previous ACM TLS certificate and, as they came online, disabled the HTTPS listener for the Ingress configuration that serves PyPI and associated services.

Once the cause was identified, the PyPI admins updated the Ingress configuration to point to the new ACM TLS certificate, and Kubernetes restored the HTTPS listener on the Ingress-managed ELB, restoring access to all services.
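The failure described above comes down to a mismatch between the certificate named in the Ingress configuration and the certificate actually attached to the load balancer. A minimal sketch of a check for that mismatch follows. It makes several assumptions not stated in the postmortem: the certificate is declared through the `alb.ingress.kubernetes.io/certificate-arn` annotation (the key used by the AWS Load Balancer Controller, which may not be what PyPI runs), the manifest is available as a plain dictionary, and the listener's certificate ARNs have already been fetched by some other means. The ARNs and the names `DriftReport`, `declared_cert_arns`, and `check_drift` are hypothetical.

```python
from dataclasses import dataclass

# Assumption: this annotation key is the one read by the AWS Load Balancer
# Controller. The postmortem does not name the controller or key PyPI uses.
CERT_ANNOTATION = "alb.ingress.kubernetes.io/certificate-arn"


@dataclass(frozen=True)
class DriftReport:
    declared: frozenset  # certificate ARNs named in the Ingress manifest
    attached: frozenset  # certificate ARNs actually on the HTTPS listener

    @property
    def in_sync(self) -> bool:
        return self.declared == self.attached

    @property
    def only_declared(self) -> frozenset:
        """ARNs the manifest expects but the listener does not serve."""
        return self.declared - self.attached

    @property
    def only_attached(self) -> frozenset:
        """ARNs present on the listener but unknown to the manifest."""
        return self.attached - self.declared


def declared_cert_arns(ingress: dict, key: str = CERT_ANNOTATION) -> frozenset:
    """Read the comma-separated certificate ARNs from an Ingress manifest dict."""
    annotations = (ingress.get("metadata") or {}).get("annotations") or {}
    return frozenset(a.strip() for a in annotations.get(key, "").split(",") if a.strip())


def check_drift(ingress: dict, listener_cert_arns) -> DriftReport:
    """Compare the manifest's certificates with those found on the listener."""
    return DriftReport(declared_cert_arns(ingress), frozenset(listener_cert_arns))


# Hypothetical ARNs, for illustration only.
OLD = "arn:aws:acm:us-east-1:111111111111:certificate/old-example"
NEW = "arn:aws:acm:us-east-1:111111111111:certificate/new-example"

manifest = {"metadata": {"annotations": {CERT_ANNOTATION: OLD}}}
report = check_drift(manifest, [NEW])
print(report.in_sync, sorted(report.only_declared), sorted(report.only_attached))
```

Run before a rolling restart, a check like this would flag that the manifest still names the old certificate while the listener serves the new one, which is the state that surfaced during this maintenance.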
## Mitigation

We will investigate mechanisms by which the hostnames needed on ACM TLS certificates for PyPI’s Ingress configurations can be managed via Kubernetes resources rather than manually via the AWS console. By managing all of these resources via the Kubernetes API, drift between desired state and reality will be less likely to occur and to surface during similar maintenance in the future.
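Until certificate hostnames are managed through Kubernetes resources, one complementary safeguard is to verify that the names on a certificate cover every hostname the Ingress needs. The sketch below is not the mitigation itself, only an illustration of such a check. It assumes the required hostnames and the certificate's subject names are supplied as plain lists, and it treats a leading `*.` as matching exactly one leftmost label. The function names and the example hostname `new-service.example.org` are hypothetical.

```python
def name_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name (optionally '*.'-prefixed) covers hostname."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern.startswith("*."):
        # A wildcard stands in for exactly one leftmost label.
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == hostname


def uncovered_hostnames(required, cert_names):
    """List the required hostnames that no name on the certificate covers."""
    return sorted(
        host for host in required
        if not any(name_matches(name, host) for name in cert_names)
    )


# Hypothetical inputs: hostnames an Ingress serves and names on a certificate.
required = ["pypi.org", "uploads.pypi.org", "new-service.example.org"]
cert_names = ["pypi.org", "*.pypi.org"]
print(uncovered_hostnames(required, cert_names))
```

An empty result means the certificate covers every required hostname. Any hostname listed would need to be added to the certificate before the Ingress is pointed at it.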