Evidos incident

Issue creating transactions

Evidos experienced a critical incident on September 26, 2024 affecting API and Portal, lasting 4h 28m. The incident has been resolved; the full update timeline is below.

Started: Sep 26, 2024, 08:35 AM UTC
Resolved: Sep 26, 2024, 01:03 PM UTC
Duration: 4h 28m
Detected by Pingoru: Sep 26, 2024, 08:35 AM UTC

Affected components

API, Portal

Update timeline

investigating Sep 26, 2024, 08:35 AM UTC

We are currently experiencing issues when creating transactions. We are investigating the issue.
investigating Sep 26, 2024, 09:01 AM UTC

We have put the complete platform in maintenance mode while we investigate the issue. Our services are unavailable for the time being.
investigating Sep 26, 2024, 09:45 AM UTC

Unfortunately, we have not yet been able to identify the problem. We are in close contact with our providers and continue to investigate this issue with the highest priority.
monitoring Sep 26, 2024, 10:01 AM UTC

A fix has been implemented and we are monitoring the results.
investigating Sep 26, 2024, 10:09 AM UTC

The implemented fix did not solve the issue. Please bear with us as we treat this issue with utmost priority.
investigating Sep 26, 2024, 11:00 AM UTC

Our API is operational, but we are holding off on bringing our portal back online. We continue to investigate the issue.
monitoring Sep 26, 2024, 11:23 AM UTC

We have identified an issue with our hosting provider and have implemented a fix. We are monitoring the results closely.
resolved Sep 26, 2024, 01:03 PM UTC

The incident has been resolved. We are gathering and evaluating all the information; a post-mortem will be posted once this process is done.
postmortem Oct 01, 2024, 07:27 AM UTC

## What happened?

On Thursday, September 26, starting around 10:25 CEST, users were unable to create transactions because of an issue with a database node.

## What did we do?

As soon as we noticed the issue, we started analyzing the root cause and contacted our hosting party so they could check their logging for any issues. At the same time, we put our platform in maintenance mode to prevent overloading while we continued investigating.

Our hosting provider discovered a problem with one of the database servers. We moved the first database to a different server, which improved things. After bringing our platform back online, we noticed the delay occurring again and immediately put the platform back in maintenance mode.

We decided to also migrate the second database node. However, when we tried to move the second database, our hosting party ran into issues. As a solution, we switched the commits database to the first server. After this change, the platform stabilized. We gradually started bringing the servers online, and after about half an hour we were fully operational again.

## What was the cause of the downtime?

Our hosting party identified high traffic on the server where our database node was running, which led to performance problems. To fix this, we tried moving our database servers to different hardware. Moving the first server improved the situation somewhat, but we encountered problems when trying to move the second server. Our cloud hosting provider found an issue with this second server, which prevented the migration. This caused the delay and the downtime of the platform.

We will work to gain better insight into the status of our database servers, including monitoring their performance and load more effectively. This will help us detect issues earlier and take corrective action faster. We will keep in close contact with our hosting partner to prevent this from happening again.