GUIDEcx incident

Issues with logging in affecting several users.

Minor, Resolved

GUIDEcx experienced a minor incident on September 16, 2025, lasting 1 hour and 13 minutes. The incident has been resolved; the full update timeline is below.

Started
Sep 16, 2025, 11:01 PM UTC
Resolved
Sep 17, 2025, 12:14 AM UTC
Duration
1 hour and 13 minutes
Detected by Pingoru
Sep 16, 2025, 11:01 PM UTC

Update timeline

  1. resolved Sep 16, 2025, 11:01 PM UTC

    Type: Incident
    Duration: 1 hour and 13 minutes
    Affected Components: Project Management, Web Application

    Sep 16, 23:01:16 GMT+0 - Investigating - We are currently investigating this incident.

    Sep 16, 23:18:09 GMT+0 - Investigating - We are currently investigating this incident. This is our top priority right now, and our engineering team is actively investigating potential solutions. A root cause has yet to be identified.

    Sep 16, 23:34:01 GMT+0 - Investigating - We are still investigating this incident. Login access has been restored, although it can take up to a minute or so to log in. Once logged in, users can navigate the app as expected at normal speeds. Users should be able to log in now, but will experience a minor delay after entering their password.

    Sep 17, 00:08:08 GMT+0 - Monitoring - We implemented a fix and are currently monitoring the result. The fix has been deployed and users should be able to log in without any delay. Users who were already logged in during the degraded performance did not experience any additional slowness.

    Sep 17, 00:14:19 GMT+0 - Resolved - This incident has been resolved. Thank you for your patience as we restored the login flow.

    Sep 17, 23:25:36 GMT+0 - Postmortem

## **Post-Mortem: Service Provider (Atlas) MongoDB Connection Incident (September 16, 2025)**

**Summary**

On September 16, 2025, the access-audit service experienced widespread connection failures to MongoDB Atlas, causing major latency in the login flow and service disruptions. MongoDB Atlas was taking too long to respond, so requests sent to it timed out. MongoDB was making infrastructure changes that affected us and other clients. The issue was resolved by implementing a workaround that handles MongoDB timeouts more gracefully.

**Resolution**

The issue was resolved by:

* Deploying a change to the access-audit service to gracefully handle the timeouts.

**Incident Timeline**

| **Time (MDT)** | **Date** | **Status** |
| -------------- | -------- | ---------- |
| 4:34 PM | Sep 16 | Automated alerts notified Engineering of unexpected latency. Posts were also made to the engineering channel to alert other engineers. |
| 4:47 PM | Sep 16 | Support raised the alarm in the Product Support channel that they were seeing customer impact. |
| 4:53 PM | Sep 16 | Engineering assembled in a war room and told the org that they were investigating. |
| 5:01 PM | Sep 16 | The situation was declared an incident and the status page was updated to indicate that an investigation was underway. |
| 5:18 PM | Sep 16 | Engineering team actively investigating; root cause not yet identified. |
| 5:34 PM | Sep 16 | Login access restored with \~1 minute delay; users can navigate normally once logged in. |
| 6:08 PM | Sep 16 | Fix implemented and deployed; users should be able to log in without delay. |
| 6:14 PM | Sep 16 | Incident resolved; login flow restored to normal operation. |

**Root Causes**

* Atlas experienced an issue when implementing a feature flag for serverless MongoDB databases, causing latency for us and for other Atlas customers.

**Observed Evidence**:

**Contributing Factors:**

* The service did not gracefully handle MongoDB connection timeouts, causing complete service failures instead of degraded operation.
* Authentication endpoints waited for the response from audit calls, even though neither the success nor the failure of an audit call affects whether a user can log in. A fix was implemented to stop awaiting that response, so that login, logout, and other requests (e.g., SendEmail) continue processing regardless of the audit's response.

**Additional Notes**

To prevent similar issues in the future, we will be:

* Implementing better timeout handling and retry logic for MongoDB connections.
* Adding better logging to indicate connection issues.
* Adjusting the MongoDB code to follow our existing database connection processes.
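
The first two follow-up items (timeout handling with retries, and logging of connection issues) could look roughly like the sketch below. It is not the team's implementation: `call_with_retry` and its parameters are hypothetical, the database call is passed in as a generic async callable rather than a real driver call, and the attempt count and delays are placeholder values.

```python
import asyncio
import logging
import random

logger = logging.getLogger("db_retry")


async def call_with_retry(operation, *, attempts=3, timeout=2.0, base_delay=0.1):
    """Await operation() with a per-attempt timeout, retrying with exponential backoff.

    operation is any zero-argument async callable, for example a wrapper
    around a database query. The last error is re-raised once all attempts fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Log every failed attempt so connection issues show up in the logs.
            logger.warning("attempt %d/%d failed: %r", attempt + 1, attempts, exc)
            if attempt < attempts - 1:
                # Exponential backoff with jitter to avoid synchronized retries.
                delay = base_delay * (2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, delay))
    raise last_error
```

A caller would wrap the database call in a small async function and pass it in, for example `await call_with_retry(fetch_user, attempts=3, timeout=1.0)`, where `fetch_user` is a hypothetical coroutine function. Retries only suit operations that are safe to repeat, so non-idempotent writes would need separate handling.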