Proemion incident

Full Outage on PSP

Proemion experienced a critical incident on February 13, 2017, lasting 21m. The incident has been resolved; the full update timeline is below.

Started: Feb 13, 2017, 12:55 PM UTC
Resolved: Feb 13, 2017, 01:16 PM UTC
Duration: 21m
Detected by Pingoru: Feb 13, 2017, 12:55 PM UTC

Update timeline

identified Feb 13, 2017, 12:55 PM UTC

We encountered a major hardware failure during a hardware replacement. Our technical teams are currently restoring the infrastructure.

resolved Feb 13, 2017, 01:16 PM UTC

All services are back up and running. The outage started at 13:40 and was resolved at 13:57 CET. No data was lost, and no customer action is required. We will follow up on the incident with a postmortem soon.

postmortem Aug 02, 2018, 04:40 PM UTC

# Outage on February 13th

Last week we had a series of problems caused by a failing hardware system, and we decided to replace it. During the replacement process we ran into an issue that caused yesterday's outage. We know that our customers rely on our service daily. We're continually improving our setup, and we're sorry that we went offline.

# Facts

On Feb 13, we experienced a full outage from 13:40 to 13:57 CET. All PROEMION Data Platform services were affected and were restored after 17 minutes. No data was lost, and no customer action was required.

# What happened

The PROEMION Data Platform is hosted on a set of hardware clusters, one of which has been experiencing hardware problems. We are currently replacing that system with a new one. One of the key components of this system is a cluster manager, which is responsible for resource placement. After we added the replacement node, the cluster manager ran into an issue and failed to allocate resources. This erroneously caused a full service shutdown. Because we were actively working on those components, we recognized the outage immediately and started restore procedures.

# Mitigation

We are addressing the hardware issues by completing the system replacement. An analysis of the cluster manager's failure and an evaluation of alternatives are scheduled for this week.

# In Conclusion

We're aware of how important the availability of our data platform is to our customers. We take these issues very seriously and are committed not only to resolving them quickly, but also to preventing them in the future.