GUIDEcx incident
Intermittent latency throughout the application
GUIDEcx experienced a minor incident on May 28, 2025, lasting 2 hours and 57 minutes. The incident has been resolved; the full update timeline is below.
Update timeline
- Resolved May 28, 2025, 03:54 PM UTC
Type: Incident
Duration: 2 hours and 57 minutes
Affected Components: Report Navigator and Report Builder, Resource Management, Project Management, Advanced Time Tracking, Compass Customer Portal

May 28, 15:54:17 GMT+0 - Investigating - We are currently investigating this incident. We have increased resources to reduce the lag but have not yet found the root cause. We are still pursuing a couple of leads.

May 28, 16:24:52 GMT+0 - Investigating - We are continuing to investigate this incident. We have a few ideas about a potential root cause but have not yet identified the culprit.

May 28, 17:16:35 GMT+0 - Investigating - Our team is actively investigating the issue impacting performance. We have identified a likely root cause and are preparing a fix that will be deployed shortly. We are closely monitoring the situation and will provide updates as soon as more information becomes available. Thank you for your patience as we work to resolve this quickly and thoroughly.

May 28, 18:04:52 GMT+0 - Monitoring - We implemented a fix and are currently monitoring the result. Navigation, project creation, task creation, and other typical operations are all responding faster.

May 28, 18:50:50 GMT+0 - Resolved - This incident has been resolved. Thank you for your patience. We will review what caused the issue and share our learnings, along with the steps we have taken to help prevent future disruptions. Please reach out to Support if you experience any further latency.

May 28, 20:18:35 GMT+0 - Postmortem

**Post-Mortem: Messaging Service Incident (May 28, 2025)**

# **Summary**

Following the release of a new messaging feature early on May 28, 2025, users experienced general slowness in messaging-related requests. The new feature increased the number of requests hitting our gateway, triggering rate limit issues that had not been encountered before. Additionally, messaging failed to load in the task drawer due to invalid customer-related feature flag configurations, and channel loading was slow due to an expensive database query.

# **Resolution**

The issues were resolved by taking three key actions:

1. Increasing resources on affected services
2. Temporarily scaling gateway services to handle the increased request volume while implementing a permanent fix that adjusted rate limits to more reasonable values (a rate-limiting sketch follows this list)
3. Optimizing the expensive database query to retrieve only necessary data, reducing contention on the database (a query sketch appears at the end of this post-mortem)
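To illustrate the limit adjustment in item 2, gateway rate limiting is commonly pictured as a token bucket in front of each service. This is a minimal sketch under that assumption; the class, parameters, and numbers below are hypothetical and not GUIDEcx's actual gateway implementation.

```typescript
// Hypothetical token-bucket limiter; the real gateway and its limit
// values are not described in this post-mortem.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly capacity: number,        // maximum burst size
    private readonly refillPerSecond: number, // sustained request rate
  ) {
    this.tokens = capacity;
    this.lastRefillMs = Date.now();
  }

  // Returns true if a request may proceed, false if it should be rejected.
  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSecond,
    );
    this.lastRefillMs = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// "Adjusting rate limits to more reasonable values" amounts to raising
// these numbers so legitimate messaging traffic is no longer rejected.
const messagingLimiter = new TokenBucket(500, 100); // illustrative values
```

A limiter tuned for pre-release traffic will start rejecting legitimate requests the moment a new feature multiplies request volume, which matches the Gateway Rate Limiting root cause below.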
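The task-drawer failure traced to an invalid customer-related feature flag configuration, corrected at 11:35 AM MDT in the timeline below. As a hedged sketch (the flag client, flag name, and API here are assumptions, not GUIDEcx's actual stack), evaluating such a flag defensively keeps a bad configuration from breaking the UI outright:

```typescript
// Hypothetical flag client; the post-mortem does not name the flag system.
interface FlagClient {
  isEnabled(flag: string, context: { customerId: string }): Promise<boolean>;
}

// Fail closed: treat an invalid or missing flag configuration as
// "feature off" instead of letting the error break the task drawer.
async function messagingEnabledFor(
  flags: FlagClient,
  customerId: string,
): Promise<boolean> {
  try {
    return await flags.isEnabled("task-drawer-messaging", { customerId });
  } catch {
    return false; // fall back to the pre-release experience
  }
}
```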
# **Incident Timeline**

| **Time** | **Event** |
| -------------------- | --------------------------------------------------------- |
| 2:30 AM MDT, May 28 | New messaging feature released |
| 8:12 AM MDT, May 28 | Customer reports of issues received |
| 8:30 AM MDT, May 28 | Resources adjusted on affected services |
| 9:00 AM MDT, May 28 | War Room initiated for coordinated response |
| 11:33 AM MDT, May 28 | Optimized query deployed |
| 11:35 AM MDT, May 28 | Adjusted the feature flag configuration |
| 12:34 PM MDT, May 28 | Adjusted rate limit configuration deployed |
| 12:50 PM MDT, May 28 | All issues resolved, system returned to normal operation |

# **Root Causes**

* Resource Constraints: Insufficient resources allocated to select services to handle the increased load from the new messaging feature
* Gateway Rate Limiting: Rate limits on gateway services were too restrictive, causing legitimate requests to be denied when traffic increased
* Inefficient Database Queries: Certain queries were retrieving excessive data, causing database contention and slowing down channel loading

# **Additional Notes**

To prevent similar issues in the future, we will be looking into the following:

* Proactive monitoring for rate limit thresholds, especially during new feature releases (a metrics sketch follows this list)
* Load testing with realistic traffic patterns prior to major feature releases
* Database query optimization reviews as part of the deployment checklist
* Automated scaling policies for critical gateway services
* Enhanced existing monitoring to include coverage for areas primarily affected by the incident, such as messaging request latency and error rates
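For the Inefficient Database Queries root cause above, the fix was to retrieve only necessary data. Here is a minimal sketch of that pattern, assuming a Postgres-backed channel list; the table, columns, and use of the `pg` driver are illustrative and do not reflect the actual schema.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from standard PG* env vars

// Before (illustrative): a broad SELECT * with joins pulled message bodies
// and other unused columns just to render the channel list, creating
// contention under load.

// After: fetch only the fields the channel list renders, with a bound on
// the result size. Table and column names are hypothetical.
async function listChannels(projectId: string) {
  const { rows } = await pool.query(
    `SELECT c.id, c.name, c.last_message_at
       FROM channels c
      WHERE c.project_id = $1
      ORDER BY c.last_message_at DESC
      LIMIT 50`,
    [projectId],
  );
  return rows;
}
```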
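Finally, the monitoring follow-ups above could be instrumented along these lines. This sketch assumes prom-client as the metrics library, and the metric names are invented for illustration; the post-mortem does not say what monitoring stack is in use.

```typescript
import { Counter, Histogram } from "prom-client";

// Count rejections so an alert can fire when rate limiting spikes,
// for example during a new feature release.
const rateLimitedTotal = new Counter({
  name: "gateway_rate_limited_total",
  help: "Requests rejected by gateway rate limiting",
  labelNames: ["service"],
});

// Track messaging request latency, one of the areas this incident hit.
const messagingLatencySeconds = new Histogram({
  name: "messaging_request_duration_seconds",
  help: "Latency of messaging-related requests",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Usage inside a request handler (illustrative):
// rateLimitedTotal.inc({ service: "messaging" });
// const end = messagingLatencySeconds.startTimer();
// ... handle the request ...
// end();
```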