Teem experienced a critical incident on February 13, 2024 affecting Web Interface, Mobile Data, and one other component, lasting 6h 24m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Feb 13, 2024, 10:02 AM UTC
We are currently investigating an issue with Teem. We will update you when we have more information.
- investigating Feb 13, 2024, 11:20 AM UTC
We are currently investigating an issue with Teem. Our Engineering team is working to determine the cause of the disruption. The next update will be posted at 7 AM CST.
- investigating Feb 13, 2024, 12:59 PM UTC
We are currently investigating an issue with Teem. Our Engineering team is working to determine the cause of the disruption. The next update will be posted at 9 AM CST.
- monitoring Feb 13, 2024, 02:11 PM UTC
A fix has been implemented. We are moving into the monitoring phase for the next two hours (as of 10:00 AM CST).
- resolved Feb 13, 2024, 04:26 PM UTC
As we have not seen further service disruptions after the fix was implemented, we have moved to the Resolved Phase. An RCA will be posted to this incident within 10 business days. Please stay subscribed to the page to receive the post automatically.
- postmortem Feb 23, 2024, 04:44 PM UTC
**Teem by Eptura Detailed Root Cause Analysis | 2/12/2024**

**S1 – Inability to Access Teem**

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

**Description:** On Monday, February 12, 2024, at around 7:02 PM MST, both internal and external customers experienced an inability to access Teem. At approximately 7:06 PM MST, internal team members were alerted that the Teem login page had failed a monitoring check. Internal teams immediately began investigating the issue.

**Type of Event:** Outage

**Services/Modules Impacted:** All production services.

**Timeline (all times MST):**

- 7:02 PM – We received an alert and our internal Engineering team began working on the issue.
- 7:16 PM – We continued investigating and joined a call with AWS engineering to resolve the issue.
- 8:39 PM – Our internal teams remained on the call with AWS engineering.
- 9:40 PM – The call continued as our Engineering team and AWS engineering worked together to bring the database back online.
- 3:02 AM – Our Engineering team posted a Severity 1 status page to notify customers of the issue.
- 4:20 AM – We continued investigating the issue.
- 5:59 AM – Investigation continued.
- 7:11 AM – The issue was resolved: we moved Virtual Machines to restore hosting for our database. We then notified our customer base and placed the status page in a monitoring state.
- 9:26 AM – The issue was confirmed resolved and the status page was updated to reflect this.

**Total Duration of Event:** 12 hours

**Root Cause:** AWS required an update to the Virtual Machine hosting our database. Our Engineering team was not notified in advance because the notice was sent to an outdated email address. Thanks to failsafes already in place, we were alerted to the outage right away, and investigation and resolution attempts began immediately.
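The RCA notes that a failed monitoring check on the login page alerted internal teams within minutes of the outage. That kind of check can be sketched as a simple HTTP probe; the URL, timeout, and pass/fail rules below are illustrative assumptions, not Teem's actual monitoring tooling:

```python
"""Minimal login-page uptime probe (illustrative sketch, not Teem's tooling)."""
from typing import Optional
import urllib.error
import urllib.request

# Hypothetical endpoint; a real probe would target the production login page.
LOGIN_URL = "https://teem.example.com/login"


def classify(status_code: Optional[int]) -> str:
    """Map an HTTP status (None = no response at all) to a check result."""
    if status_code is None:
        return "fail"  # connection refused / timed out: the page is down
    if 200 <= status_code < 400:
        return "pass"  # success or redirect: the page is reachable
    return "fail"      # 4xx/5xx: the page responded but is unhealthy


def probe(url: str, timeout: float = 5.0) -> str:
    """Fetch the page once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        return classify(err.code)   # server answered with an error status
    except (urllib.error.URLError, OSError):
        return classify(None)       # DNS failure, refused connection, timeout
```

A real monitor would run a probe like this on a schedule and page an on-call engineer after consecutive failures; that alerting layer is omitted here.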
**Remediation:** We have migrated our database to Amazon RDS, which should eliminate downtime caused by server updates. We have also updated all email addresses and notification systems.

**Preventative Action:** Maintaining correct contact emails, and hosting the database on a managed service that supports quick switchover without downtime.
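The remediation describes moving the database to Amazon RDS so that host updates no longer cause downtime. On RDS, that property typically comes from a Multi-AZ deployment, where maintenance fails over to a synchronous standby. A hedged sketch of enabling that setting via the AWS CLI (the instance identifier is hypothetical, and this is a general RDS example rather than Teem's actual configuration):

```shell
# Hypothetical instance name; enabling Multi-AZ provisions a standby replica
# so host/OS maintenance fails over instead of taking the database offline.
aws rds modify-db-instance \
  --db-instance-identifier teem-prod-db \
  --multi-az \
  --apply-immediately
```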