Porter incident

Deployments Outage

Major Resolved

Porter experienced a major incident on March 12, 2025 affecting Porter UI and Porter API, lasting 5h 55m. The incident has been resolved; the full update timeline is below.

Started
Mar 12, 2025, 05:45 AM UTC
Resolved
Mar 12, 2025, 11:40 AM UTC
Duration
5h 55m
Detected by Pingoru
Mar 12, 2025, 05:45 AM UTC

Affected components

Porter UI, Porter API

Update timeline

  1. investigating Mar 12, 2025, 06:59 AM UTC

    We're investigating failures with deployments on customer workloads, and our chart/app template repositories.

  2. identified Mar 12, 2025, 07:04 AM UTC

    The issue has been identified as a problem with an upstream DNS provider. A fix is currently being worked on, and we hope to have this resolved ASAP.

  3. identified Mar 12, 2025, 07:49 AM UTC

    We're continuing to work on fixing this, and are in touch with the upstream DNS provider to help resolve this speedily.

  4. identified Mar 12, 2025, 10:18 AM UTC

    We are rolling out a fix for this at the moment, and we'll be able to resume deployments across the board soon.

  5. monitoring Mar 12, 2025, 11:16 AM UTC

    We're deploying a fix for this, and are monitoring its rollout. Any failed deployments can be re-run via GitHub Actions or the Porter dashboard.

  6. resolved Mar 12, 2025, 11:40 AM UTC

    This incident has now been resolved. User workload deployments are proceeding normally; any failed deployments can be re-run via GitHub Actions or the Porter dashboard. A detailed RCA is included below.

  7. postmortem Mar 17, 2025, 12:10 PM UTC

    This RCA addresses the downtime our platform faced on the 12th of March 2025, specifically with deployments to user workloads.

    Some context: earlier, Porter's dashboard and build/deployment endpoints were on https://dashboard.getporter.dev. Around May 2024, we migrated to a new domain, dashboard.porter.run, in line with our recent branding efforts. While we notified users that we'd now be available on dashboard.porter.run, we also set up a DNS redirect on the older domain, getporter.dev, so that nobody would be unable to access the dashboard, even through the older domain.

    On the 12th of March, our upstream DNS provider mistakenly detected what they felt was potentially fraudulent activity on our account, and froze our DNS zone file. This meant that any requests for dashboard.getporter.dev would fail. Unfortunately, it also meant that we were locked out of our credentials, which delayed our efforts to bring the domain back up.

    This didn't affect user workloads in any way, since user workloads run on users' infrastructure with their own domains. What this did affect was the ability to push new deployments: since the Porter CLI and the GitHub Actions, as well as our app chart repositories, were on getporter.dev, users would have found themselves unable to deploy new changes to their existing apps.

    The issue was finally resolved once we were able to prove to the DNS provider that we were solely in control of our DNS zone file, which led to the blocks being removed.

    We take this incident very seriously. We're an infrastructure company, and we see ourselves as partners to our customers and the products they've built. To that end, we've started instituting a number of changes:

    1. While we've been using Cloudflare for most of our domains, we're now going to migrate all domains, including getporter.dev, to Cloudflare.
    2. In addition, we're going to migrate all our app template repos to porter.run and fully deprecate getporter.dev going forward, in order to remove any potential gaps.

    We'd like to sincerely apologise for this incident, and we truly appreciate your patience through this.
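
The failure mode described in the RCA, where one domain stops resolving while another stays healthy, can be checked from a client machine with a simple DNS resolution probe. The following is a minimal sketch under stated assumptions, not Porter tooling: the function name `check_hosts` and its injectable `resolve` parameter are hypothetical, and the only hostnames assumed are the two named in the RCA.

```python
import socket
from typing import Callable, Dict, Iterable


def check_hosts(
    hosts: Iterable[str],
    resolve: Callable[..., object] = socket.getaddrinfo,
) -> Dict[str, bool]:
    """Return {hostname: True if DNS resolution succeeds, False otherwise}.

    `resolve` is a hypothetical injection point (defaulting to the standard
    library's socket.getaddrinfo) so the check can be exercised offline.
    """
    results: Dict[str, bool] = {}
    for host in hosts:
        try:
            # Port 443 because the endpoints in question are served over HTTPS.
            resolve(host, 443)
            results[host] = True
        except OSError:
            # socket.gaierror (raised on lookup failure) subclasses OSError.
            results[host] = False
    return results
```

Calling `check_hosts(["dashboard.getporter.dev", "dashboard.porter.run"])` returns a mapping from each hostname to whether it currently resolves. A probe like this only distinguishes "resolves" from "does not resolve"; it does not say why a lookup failed.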