LogDNA incident

Intermittent user session timeouts, requiring periodic re-authentication

Minor, Resolved

LogDNA experienced a minor incident on December 4, 2023 affecting Web App, lasting 1h 12m. The incident has been resolved; the full update timeline is below.

Started
Dec 04, 2023, 12:06 PM UTC
Resolved
Dec 04, 2023, 01:19 PM UTC
Duration
1h 12m
Detected by Pingoru
Dec 04, 2023, 12:06 PM UTC

Affected components

Web App

Update timeline

  1. investigating Dec 04, 2023, 12:06 PM UTC

    The Web UI is currently encountering user session timeouts, prompting customers to log in again every 1-2 minutes. Our team is actively investigating the root cause. The rest of the service remains fully functional.

  2. monitoring Dec 04, 2023, 12:13 PM UTC

    We have implemented a fix for the user session timeouts on the Web UI, but will continue to monitor the situation closely.

  3. resolved Dec 04, 2023, 01:19 PM UTC

    The issue has been resolved, and no further issues have been observed with user sessions.

  4. postmortem Dec 04, 2023, 01:46 PM UTC

    **Dates:**
    Start Time: Monday, December 4, 2023, at 10:29 UTC
    End Time: Monday, December 4, 2023, at 12:01 UTC
    Duration: 92 minutes

    **What happened:** Web UI users were logged out frequently, usually within 1-2 minutes of logging in. Users could log in again without any issues, but the new session would expire shortly afterwards.

    **Why it happened:** Both the Web UI pods and the Redis database pods (which store user sessions) experienced a critical memory shortage, leading to uncontrolled data purging. When this same issue happened in July 2023, our engineering team deployed a fix that enhanced how Redis stores the user session keys. This fix successfully prevented any recurrence of the problem until today. The team is still determining what caused the memory limit to be exceeded this time.

    **How we fixed it:** Initially, the Web UI pods were restarted, but that did not resolve the problem permanently. The engineering team then restarted the Redis database pods, and sessions stopped expiring.

    **What we are doing to prevent it from happening again:** The team will revise the previous fix, including implementing a mechanism for the pod to automatically restart upon reaching its memory limit and setting up alerts to notify an engineer when memory usage approaches that threshold.
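
    The alerting step described in the postmortem could look roughly like the sketch below. It is illustrative only and not LogDNA's actual implementation. It assumes the check reads the text output of the Redis `INFO memory` command and compares the `used_memory` and `maxmemory` fields. The function names and the 80% and 95% thresholds are hypothetical choices made for this example.

```python
from typing import Dict


def parse_info_memory(info_text: str) -> Dict[str, int]:
    """Parse 'key:value' lines from Redis INFO output, keeping integer fields only."""
    fields: Dict[str, int] = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers such as "# Memory", and malformed lines.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.isdigit():
            fields[key] = int(value)
    return fields


def memory_status(info_text: str,
                  warn_ratio: float = 0.80,      # assumed alert threshold
                  restart_ratio: float = 0.95    # assumed restart threshold
                  ) -> str:
    """Classify memory pressure as 'ok', 'alert', 'restart', or 'unknown'."""
    fields = parse_info_memory(info_text)
    used = fields.get("used_memory")
    limit = fields.get("maxmemory")
    # A maxmemory of 0 means no limit is configured, so no ratio can be computed.
    if used is None or not limit:
        return "unknown"
    ratio = used / limit
    if ratio >= restart_ratio:
        return "restart"
    if ratio >= warn_ratio:
        return "alert"
    return "ok"


if __name__ == "__main__":
    sample = "# Memory\r\nused_memory:900\r\nused_memory_human:900B\r\nmaxmemory:1000\r\n"
    print(memory_status(sample))
```

    In a real deployment the `"alert"` result would page an engineer and the `"restart"` result would trigger the automatic pod restart. How those actions are wired up depends on the orchestration and alerting tools in use, which the postmortem does not specify.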