Mindtickle incident

Uploader service down, preventing Admins from uploading media on the Mindtickle platform.


Mindtickle reported a notice-level incident on December 29, 2023. The incident has been resolved; the full update timeline is below.

Started
Dec 29, 2023, 02:33 AM UTC
Resolved
Nov 28, 2023, 01:23 PM UTC
Duration
Detected by Pingoru
Dec 29, 2023, 02:33 AM UTC

Update timeline

  1. resolved Dec 29, 2023, 02:33 AM UTC

    Incident Overview: Issue: Uploader service is down, preventing Admins from uploading media on the Mindtickle platform. Duration: 18:53 PT to 22:18 PT on November 28th, 2023.

  2. postmortem Dec 29, 2023, 02:33 AM UTC

    **Incident Overview:**

    * Issue: Uploader service down, preventing Admins from uploading media on the Mindtickle platform.
    * Duration: 18:53 PT to 22:18 PT on November 28th, 2023.

    **Root Cause:** Memory overflow in the in-memory data store powering the Uploader service.

    * The Uploader service shares the in-memory data store with the llm-gateway service.
    * llm-gateway was adding keys without a Time To Live (TTL), causing a gradual increase in memory usage.
    * The memory cluster reached 100% usage, impacting the Uploader service.
    * Alerting issue: the alert was prioritized as 'medium', which delayed the on-call team's response.

    **Timeline of Events:**

    * 28th Nov 18:55 PT: Alert received for llm-gateway service high error rate.
    * 28th Nov 20:00 PT: Alert acknowledged and acted upon.
    * 28th Nov 21:00 PT: Root cause identified; TTL change deployed.
    * 28th Nov 22:18 PT: Increased memory for in-memory data store; Uploader service restored.
    * 28th Nov 23:30 PT: Old keys (without TTL) deleted from the cluster.

    **Learning and Next Steps:**

    * Challenge: Non-impactful alerts from llm-gateway hindered issue identification.
    * Action Items:
        * Revisit alerts and configurations.
        * Ensure prioritization of alerts for production-impacting services.
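
The root cause described in the postmortem (keys written without a TTL slowly filling a shared store) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the postmortem does not name the data store or show any code, so `TinyStore`, `simulate`, and the `llm-gateway:` key prefix are hypothetical names invented for illustration, and the store is a plain Python dictionary rather than the real cluster.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class TinyStore:
    """Toy in-memory key-value store with optional per-key expiry.

    Hypothetical stand-in for the shared data store; not the vendor's code.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (value, expiry timestamp or None when the key never expires)
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, ttl_seconds: Optional[float] = None) -> None:
        """Store a value; without ttl_seconds the key lives until deleted."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def purge_expired(self) -> int:
        """Drop keys whose expiry has passed and return how many were dropped."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._data.items()
                   if exp is not None and exp <= now]
        for key in expired:
            del self._data[key]
        return len(expired)

    def keys_without_ttl(self, prefix: str = "") -> List[str]:
        """List keys that have no expiry, optionally filtered by prefix."""
        return sorted(k for k, (_, exp) in self._data.items()
                      if exp is None and k.startswith(prefix))

    def delete(self, keys: Iterable[str]) -> int:
        """Delete the given keys and return how many actually existed."""
        removed = 0
        for key in list(keys):
            if self._data.pop(key, None) is not None:
                removed += 1
        return removed

    def __len__(self) -> int:
        return len(self._data)


def simulate(with_ttl: bool, requests: int = 1000, ttl_seconds: float = 60.0) -> int:
    """Write one key per simulated second and return the final key count."""
    now = [0.0]  # mutable cell so the fake clock can be advanced
    store = TinyStore(clock=lambda: now[0])
    for i in range(requests):
        now[0] += 1.0
        store.set(f"llm-gateway:{i}", "payload", ttl_seconds if with_ttl else None)
        store.purge_expired()
    return len(store)


if __name__ == "__main__":
    print("keys without TTL:", simulate(with_ttl=False))
    print("keys with TTL:   ", simulate(with_ttl=True))
```

In this toy model the run without a TTL keeps every key it ever wrote, while the run with a TTL stays bounded by roughly one key per second of TTL. `keys_without_ttl` and `delete` mirror the final cleanup step in the timeline, where old keys without a TTL were removed from the cluster.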