Python Package Index incident
Redirect Loops on JSON API endpoints.
Python Package Index experienced a minor incident on June 10, 2022 affecting pypi.org - CDN and pypi.org - Backends and 1 more component, lasting 5h 23m. The incident has been resolved; the full update timeline is below.
Affected components
Update timeline
- identified Jun 10, 2022, 11:08 AM UTC
Some cached responses are causing redirect loops for endpoints on the JSON API. We are working to determine how to clear these cached values without impacting the overall health of PyPI.
- identified Jun 10, 2022, 11:34 AM UTC
We have started a task which will iterate over all projects and purge the cache for each individually. This will keep the PyPI backends from being overloaded by a completely bare cache. This process will take some time to complete; the current estimate is 1-2 hours.
- identified Jun 10, 2022, 12:38 PM UTC
Our mass purge operation is continuing. Based on the rate at which we are currently able to process purges, the processes should be complete in 45-60 minutes.
- identified Jun 10, 2022, 01:54 PM UTC
The cache purge has cleared everything except projects starting with the letter `p`. Our estimates failed to take into consideration the popularity of project names on PyPI starting with `p` 🙃
- monitoring Jun 10, 2022, 04:01 PM UTC
All purges of JSON API documents have completed. Our backends are recovering from the added load of repopulating the entire cache. Any failed purges may result in latent redirect loops for specific projects or releases in the JSON API; these will self-resolve within 24 hours as caches expire.
- resolved Jun 10, 2022, 04:32 PM UTC
This incident has been resolved.
- postmortem Jun 10, 2022, 04:32 PM UTC
## Summary

PyPI's JSON API experienced an incident that caused redirect loops for all clients when requesting many of the endpoints available on the service. These endpoints provide JSON documents describing projects and specific releases of projects hosted on [pypi.org](http://pypi.org).

This incident was initiated by changes intended to make the JSON API more performant and less of a burden on the PyPI backends. Specifically, the combination of [this change to the warehouse codebase](https://github.com/pypa/warehouse/pull/11546) and [this change to the CDN configuration for PyPI](https://github.com/python/pypi-infra/pull/87) makes the canonical location for accessing the JSON documents deterministic. This allows for deeper cache efficiency and the ability to move redirects to the canonical URL out to the CDN edge. The result is faster response times for clients and reduced load on the PyPI backends.

The redirect updates specifically had an unintended consequence that led to this outage. Namely, the existing elements in cache for the "normalized" project names redirected to the "verbatim" names of projects.

Example:

* Before
  * `/pypi/pyOpenSSL/json` "verbatim" project name URL returns a `200` with the document and is canonical.
  * `/pypi/pyopenssl/json` "normalized" project name URL returns a `301` redirect to the canonical URL.
* After
  * `/pypi/pyOpenSSL/json` "verbatim" project name URL returns a `301` redirect to the canonical URL.
  * `/pypi/pyopenssl/json` "normalized" project name URL returns a `200` with the document and is canonical.

The issue at hand is that the URL intended to be the new canonical one, using the "normalized" project name, was already cached for _many_ projects, and that cache included a redirect to the "verbatim" project name, which... redirected back.

## Impact

This impacted any client of the PyPI JSON API.
These include the `poetry` tool for installing and managing Python dependencies, mirroring tools such as `bandersnatch`, and even our own internal service that redirects legacy file URLs on [files.pythonhosted.org](http://files.pythonhosted.org) to their new locations.

The outage began with the deployment of [this change to the warehouse codebase](https://github.com/pypa/warehouse/pull/11546) and [this change to the CDN configuration for PyPI](https://github.com/python/pypi-infra/pull/87) at approximately 2022-06-10T10:50 UTC and continued to impact some URLs in a "long-tail" fashion through to 2022-06-10T16:00 UTC. Most notably, project names beginning with `p` remained affected the longest, as it is the most common first letter for projects uploaded to PyPI.

## Mitigation

The changes that led to this outage had been actively attempted for over 6 weeks, after more than a year of being aware of the problem. When faced with a system that was functioning correctly but required a cache purge to be fully established, the PyPI administrator managing the deployment chose to roll forward rather than roll back, knowingly extending the impact of this incident in favor of getting PyPI to a more maintainable state for serving the JSON API into the future.

Because of the massive scale of PyPI's caches, it was untenable to purge only the bad URLs, leading to a need to clear the entire PyPI cache. This was undertaken by kicking off pools of processes that would iterate over individual first letters of project names to purge the entire project cache in parallel over a 1-2 hour duration. This duration was required because purging the entire cache at once would have overloaded the backends for PyPI in such a way that, even with massive temporary scale up/out, the load would have been too much for our backend and led to _many more hours_ of outage across the entire service.
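
The per-letter split described above can be sketched roughly as follows. This is a minimal illustration, not the tooling that was actually run: the project list is a made-up sample, `purge_project` is a hypothetical placeholder that only records the name instead of calling a CDN, and threads stand in for the pools of processes to keep the example short. It shows that a bucket-per-first-letter purge finishes only as fast as its largest bucket.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample of project names; the real index is far larger.
PROJECTS = ["pyOpenSSL", "pytest", "pip", "poetry",
            "requests", "bandersnatch", "Django"]


def bucket_by_first_letter(names):
    """Group project names by their lowercased first character."""
    buckets = defaultdict(list)
    for name in names:
        buckets[name[0].lower()].append(name)
    return dict(buckets)


def purge_project(name, purged):
    """Hypothetical stand-in for a per-project cache purge."""
    purged.append(name)


def purge_bucket(names):
    """Purge one bucket sequentially, returning the names processed."""
    purged = []
    for name in names:
        purge_project(name, purged)
    return purged


def purge_all(names, max_workers=4):
    """Purge every bucket in parallel, one worker task per first letter."""
    buckets = bucket_by_first_letter(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(purge_bucket, buckets.values())
        return dict(zip(buckets.keys(), results))


buckets = bucket_by_first_letter(PROJECTS)
# The letter with the most projects bounds the total wall-clock time.
slowest = max(buckets, key=lambda letter: len(buckets[letter]))
```

With this sample, `slowest` is `"p"`, mirroring the skew toward `p` seen during the incident. Splitting oversized buckets further, for example by a second character, would be one way to even out the work, though that is a suggestion rather than something the report describes.
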
While most of the purges completed within the two-hour estimate, the letter `p` was the final segment, taking nearly 4 hours due to the popularity of the `py` prefix on the index. While the purges were ongoing, PyPI's backends, even in a scaled-out state, provided lackluster response times and performance as the caches were slowly refilled.

## Future work

Issues have been filed to create better tooling to safely and expediently purge the caches and to limit the blast radius of purges in the future. We will also begin to research ways to build more confidence and expose these kinds of errors in our review process for changes to the configuration of our CDN.
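
One way such errors might be exposed ahead of a deployment is to model the responses a client would see, including stale cache entries, and follow redirects until either a document or a cycle is reached. The sketch below is an illustration under stated assumptions, not PyPI's review tooling: `find_redirect_loop` and the response tables are hypothetical, built from the `pyOpenSSL` example in the summary.

```python
def find_redirect_loop(start, responses, max_hops=10):
    """Follow redirects from `start` through `responses`.

    `responses` maps a URL to a (status, redirect_target) pair.
    Returns the list of URLs forming a cycle, or None if a 200 is reached.
    """
    seen = []
    url = start
    while len(seen) <= max_hops:
        if url in seen:
            # We have come back to a URL already visited: report the cycle.
            return seen[seen.index(url):]
        seen.append(url)
        status, target = responses[url]
        if status == 200:
            return None
        url = target
    raise RuntimeError("too many redirects without reaching a document")


VERBATIM = "/pypi/pyOpenSSL/json"
NORMALIZED = "/pypi/pyopenssl/json"

# Behavior before the change: the verbatim URL is canonical.
before = {VERBATIM: (200, None), NORMALIZED: (301, VERBATIM)}
# Intended behavior after the change: the normalized URL is canonical.
after = {VERBATIM: (301, NORMALIZED), NORMALIZED: (200, None)}
# Mixed state during the incident: the new redirect for the verbatim URL
# combined with the stale cached redirect for the normalized URL.
during = {VERBATIM: after[VERBATIM], NORMALIZED: before[NORMALIZED]}

loop = find_redirect_loop(VERBATIM, during)
```

Here `loop` contains both URLs, while the `before` and `after` tables each resolve to a document. The point of the sketch is that neither configuration is wrong on its own; the loop only appears when old cached responses are evaluated together with the new rules.
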