gaiia software experienced a critical incident on June 13, 2023, affecting the Public GraphQL API and lasting 2h 54m. The incident has been resolved; the full update timeline is below.
Affected components
- Public GraphQL API
Update timeline
- investigating Jun 13, 2023, 07:02 PM UTC
We are currently investigating this issue.
- investigating Jun 13, 2023, 07:03 PM UTC
We are continuing to investigate this issue.
- identified Jun 13, 2023, 07:04 PM UTC
Both the AWS Control Plane and Data Plane are down.
- identified Jun 13, 2023, 07:29 PM UTC
The us-east-1 region is completely down. AWS has identified the root cause of the problem, but we are still preparing to launch the disaster recovery process in another region if need be.
- identified Jun 13, 2023, 07:38 PM UTC
Update from AWS: "We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates."
- identified Jun 13, 2023, 08:15 PM UTC
Update from AWS: "We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors."
- monitoring Jun 13, 2023, 08:44 PM UTC
We have been able to log back into gaiia and are continuing to monitor the resolution. Update from AWS: "We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery."
- resolved Jun 13, 2023, 09:57 PM UTC
Gaiia has fully recovered, and all accumulated events that were pending during the incident have been processed.
- postmortem Jun 14, 2023, 01:55 PM UTC
POST MORTEM:
* **Incident:** The gaiia API was down due to a major AWS outage in the us-east-1 region. Most AWS services were affected, notably Lambda, API Gateway, and CloudWatch.
* **Scope:** This incapacitated all web applications relying on the gaiia API: gaiia users could not use the gaiia web application, and end customers could not place orders or log into the client portals.
* **Potential mitigation:** No direct mitigation of the AWS issue was possible, but a partial disaster recovery process was initiated in case the outage lasted longer.
* **Resolution:** AWS fixed their services.
* **Timeline:**
  * 2023-06-13 14h55 EDT: Issue was first discovered
  * 2023-06-13 16h44 EDT: Issue was partially resolved
  * 2023-06-13 18h37 EDT: Issue was fully resolved
* **Time to discovery:** ~5 minutes according to AWS timeline
* **Time to full resolution:** 3h48 mins
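
The postmortem gives its timeline in Eastern time, while the status updates above are stamped in UTC. As a rough aid for lining the two up, the sketch below converts the three postmortem timestamps to UTC. It assumes a fixed UTC-4 offset (Eastern daylight time, as normally applies in June); the helper name `eastern_to_utc` and the `POSTMORTEM_TIMES` table are illustrative and not part of any gaiia tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the postmortem's local times are US Eastern with daylight
# saving in effect (UTC-4), which is what a June date normally implies.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))


def eastern_to_utc(local: str) -> datetime:
    """Parse 'YYYY-MM-DD HHhMM' as Eastern daylight time and return it in UTC."""
    naive = datetime.strptime(local, "%Y-%m-%d %Hh%M")
    return naive.replace(tzinfo=EASTERN_DAYLIGHT).astimezone(timezone.utc)


# Timestamps copied from the postmortem timeline (hypothetical table name).
POSTMORTEM_TIMES = {
    "first discovered": "2023-06-13 14h55",
    "partially resolved": "2023-06-13 16h44",
    "fully resolved": "2023-06-13 18h37",
}

for label, local in POSTMORTEM_TIMES.items():
    print(f"{label}: {eastern_to_utc(local):%Y-%m-%d %H:%M} UTC")
```

Under that assumption, 16h44 maps to 20:44 UTC, which lines up with the 08:44 PM UTC monitoring update in the timeline.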