Proton incident

Technical difficulties

Proton experienced a notice-level incident on January 9, 2025, affecting Web Application, Free servers, and 1 more component, and lasting 3h 12m. The incident has been resolved; the full update timeline is below.

Started: Jan 09, 2025, 05:37 PM UTC
Resolved: Jan 09, 2025, 08:49 PM UTC
Duration: 3h 12m
Detected by Pingoru: Jan 09, 2025, 05:37 PM UTC
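
The duration above follows directly from the start and resolved timestamps. As a minimal sketch, the arithmetic can be checked in Python; the helper names `parse_utc` and `format_duration` are illustrative and not part of any Proton or Pingoru tooling. The last lines convert a time quoted in CET in the timeline to UTC, assuming the standard CET offset of UTC+1.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout used by the header fields, e.g. "Jan 09, 2025, 05:37 PM UTC".
FMT = "%b %d, %Y, %I:%M %p"


def parse_utc(text: str) -> datetime:
    """Parse a header timestamp and attach the UTC time zone."""
    return datetime.strptime(text.removesuffix(" UTC"), FMT).replace(tzinfo=timezone.utc)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta in the 'Xh Ym' style used by the status page."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60}m"


started = parse_utc("Jan 09, 2025, 05:37 PM UTC")
resolved = parse_utc("Jan 09, 2025, 08:49 PM UTC")
duration = format_duration(resolved - started)
print(duration)  # 3h 12m

# CET is UTC+1, so a time quoted as 16:15 CET corresponds to 15:15 UTC.
CET = timezone(timedelta(hours=1), "CET")
as_of_utc = datetime(2025, 1, 9, 16, 15, tzinfo=CET).astimezone(timezone.utc)
print(as_of_utc.strftime("%H:%M UTC"))  # 15:15 UTC
```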

Affected components

Web Application, Free servers, Mobile Apps, Web Application, Mobile Apps, Incoming Mail, Regular Servers, Web Application, Mobile and Desktop Apps, Browser Extensions

Update timeline

investigating Jan 09, 2025, 03:10 PM UTC

We are currently experiencing intermittent network issues affecting some of our users. We are working to fully restore services as soon as possible. We apologize for the inconvenience.

investigating Jan 09, 2025, 03:27 PM UTC

We are continuing to investigate this issue.

identified Jan 09, 2025, 04:42 PM UTC

As of 16:15 CET, all services other than Mail and Calendar are operating normally. We are still working on fixing the issue and restoring the rest of the affected services. We'll come back with more information in the next update. Thank you for your patience.

identified Jan 09, 2025, 05:37 PM UTC

Access to Proton Mail has been fully restored, and we can confirm that it is now operating normally. We are working on a solution for Calendar and will be back soon with more information.

identified Jan 09, 2025, 05:39 PM UTC

We are continuing to work on a fix for this issue.

monitoring Jan 09, 2025, 06:27 PM UTC

We have resolved all service outages, and the situation has been stable for some time. We have identified the root cause of the problem, implemented a fix, and are now monitoring the results.

resolved Jan 09, 2025, 08:49 PM UTC

Incident report: Earlier today at around 4 PM Zurich time, the number of new connections to Proton's database servers increased sharply across Proton's infrastructure worldwide. This overloaded the infrastructure and made it impossible for us to serve all customer connections. While Proton VPN, Proton Pass, Proton Drive/Docs, and Proton Wallet recovered quickly, issues persisted for longer on Proton Mail and Proton Calendar. For those services, approximately 50% of requests failed during the incident, leading to intermittent unavailability for some users (the service would appear to alternate between up and down from minute to minute).

Normally, Proton would have sufficient extra capacity to absorb this load while we debug the problem, but in recent months we have been migrating our entire infrastructure to a new one based on Kubernetes. This requires us to run two parallel infrastructures at the same time, without the ability to easily move load between the two very different infrastructures. While all other services have been migrated to the new infrastructure, Proton Mail is still in the middle of the migration process. Because of this, we were not able to automatically scale capacity to handle the massive increase in load.

In total, it took us approximately 2 hours to get back to the state where we could service 100% of requests, with users experiencing degraded performance until then. The service was available, but only intermittently; performance improved substantially during the second hour of the incident, but the issue required an additional hour to fully resolve.

A parallel investigation by our site reliability engineering team identified a software change that we suspected was responsible for the initial load spike. After this change was rolled back, database load returned to normal. This change was not initially suspected because a long period of time had elapsed between its introduction and the onset of the problem, and an initial analysis of the code suggested that it should have no impact on the number of database connections. A deeper analysis will be done as part of our post-mortem process to understand this better.

The completion of ongoing infrastructure migrations will make Proton's infrastructure more resilient to unexpected incidents like this by restoring the higher level of redundancy that we typically run, and we are working to complete this work as quickly as possible.