inSided incident

US Communities - Currently experiencing issues with community loading

Major · Resolved

inSided experienced a major incident on June 13, 2025 affecting the Status of our US Community Infrastructure component and lasting 1h 6m. The incident has been resolved; the full update timeline is below.

Started
Jun 13, 2025, 06:56 PM UTC
Resolved
Jun 13, 2025, 08:03 PM UTC
Duration
1h 6m
Detected by Pingoru
Jun 13, 2025, 06:56 PM UTC

Affected components

Status of our US Community Infrastructure

Update timeline

  1. investigating Jun 13, 2025, 03:19 PM UTC

    We are currently investigating this issue.

  2. investigating Jun 13, 2025, 03:45 PM UTC

We are continuing to investigate this issue - we have identified a couple of infrastructure issues thought to be causing it. We appreciate your continued patience.

  3. investigating Jun 13, 2025, 04:46 PM UTC

We are continuing to investigate the source of the problem and have identified a few possible causes. We will post more information here as soon as we have it.

  4. investigating Jun 13, 2025, 05:32 PM UTC

    We are continuing to investigate this issue.

  5. investigating Jun 13, 2025, 05:37 PM UTC

    We are continuing to investigate this issue.

  6. investigating Jun 13, 2025, 06:11 PM UTC

User experience is improving, but we are still investigating.

  7. investigating Jun 13, 2025, 06:44 PM UTC

    Users may experience degraded performance in some areas.

  8. identified Jun 13, 2025, 06:56 PM UTC

The issue has been identified and a fix has been deployed. Performance is steadily improving.

  9. monitoring Jun 13, 2025, 07:05 PM UTC

    We are monitoring the results of our fix, and performance is back to normal.

  10. resolved Jun 13, 2025, 08:03 PM UTC

    This incident has been resolved. Details will be added once available.

  11. postmortem Jun 19, 2025, 10:58 AM UTC

**Redis Cache Degradation and Service Impact - June 13, 2025**

### **Overview**

On **June 13, 2025**, from **17:09 to 21:13 CEST**, some of our services in the **US region** experienced significant disruptions due to a cascading infrastructure failure. The issue began with degraded performance in our Redis caching layer, triggered by underlying **AWS infrastructure instability**, and ultimately led to elevated error rates and service outages across several systems.

### **What Happened**

A sudden degradation in our Redis cache - likely caused by transient **network issues within AWS infrastructure** - resulted in dramatically slower response times and dropped connections. As Redis became unavailable, systems attempted to query the main database directly, which caused that database to hit its connection limits and become overloaded. This caused broader performance issues across multiple services.

### **Customer Impact**

* **High error rates** (HTTP 5XX) across several services in the US region
* **Slower API response times** and degraded system performance
* **Stale or delayed data** due to failed or missed caching
* **Up to 4 hours of partial or full service disruption**
* **No customer data was lost**, but availability and performance were impacted

### **Timeline of Events (CEST)**

* **17:09** - Error rates spike and initial alarms trigger
* **17:13** - Incident response begins
* **17:30-19:00** - Teams work to stabilise the system, including service rollbacks and traffic load reduction
* **18:38** - AWS confirms a related infrastructure/network event affecting Redis connectivity
* **19:30** - Redis instance upgraded to a higher-capacity configuration
* **21:10** - Services fully recover
* **21:13** - Incident officially resolved

### **What Caused the Issue?**

Our Redis cache experienced performance degradation and stopped serving traffic effectively. Based on internal logs and external confirmations, this was **likely due to transient network and infrastructure issues within AWS**, which impacted bandwidth and connection handling for Redis. As a result, traffic was redirected to the main database, which then became overloaded, creating a cascading service failure.

### **How Was It Resolved?**

* Redis was scaled up to handle more traffic and recover performance
* Non-essential services were temporarily paused to reduce system load
* Once Redis recovered, database pressure eased and systems began returning to normal

### **Could This Have Been Prevented?**

Partially. Some safeguards helped limit the impact (such as CDN caching for guest users), but earlier detection of Redis degradation and better isolation between the cache and core systems could have reduced downtime. The unexpected AWS infrastructure event contributed significantly to the incident's scope.

### **Key Lessons and Next Steps**

* **Improve monitoring and alerting** for Redis resource exhaustion and network saturation
* **Isolate caching infrastructure** by service or use case to prevent single points of failure
* **Enhance fallback logic** to reduce stress on primary databases when caching layers fail (see the sketch below)
* **Work closely with AWS** to better understand and mitigate infrastructure-related disruptions
* **Explore more resilient multi-layer caching strategies** to handle future spikes gracefully

We sincerely apologise for the disruption and appreciate your patience. We are taking meaningful steps to strengthen our infrastructure and improve our detection and response capabilities to prevent similar events in the future.
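The "enhance fallback logic" lesson lends itself to a short illustration. The sketch below is a hypothetical, minimal Python example, not inSided's actual implementation: the names (`CircuitBreaker`, `BoundedFallbackCache`, `max_db_fallbacks`) and the in-memory stand-ins for Redis and the database are all assumptions. It shows the general pattern of bounding how many cache-miss reads may fall through to the primary database at once, so a cache outage degrades into controlled load shedding rather than the connection-limit exhaustion this postmortem describes.

```python
import threading
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive cache errors; while open,
    callers skip the cache instead of waiting on dead connections."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let one attempt through.
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None


class BoundedFallbackCache:
    """Cache-aside read path that refuses to stampede the database:
    at most `max_db_fallbacks` concurrent cache-miss reads may hit the
    DB; the rest fail fast instead of exhausting its connection pool."""

    def __init__(self, cache_get, db_get, max_db_fallbacks: int = 10):
        self.cache_get = cache_get  # e.g. a Redis GET wrapper
        self.db_get = db_get        # primary-database read
        self.breaker = CircuitBreaker()
        self.db_slots = threading.BoundedSemaphore(max_db_fallbacks)

    def get(self, key):
        if self.breaker.allow():
            try:
                value = self.cache_get(key)
                self.breaker.record_success()
                if value is not None:
                    return value  # cache hit
            except ConnectionError:
                self.breaker.record_failure()
        # Cache miss or cache down: fall back to the DB, but bounded.
        if not self.db_slots.acquire(blocking=False):
            raise RuntimeError("load shed: DB fallback capacity exhausted")
        try:
            return self.db_get(key)
        finally:
            self.db_slots.release()


# Hypothetical wiring: real code would inject a Redis client and a DB
# session; plain dicts stand in for both here.
cache_store, db_store = {}, {"user:42": "Ada"}
reader = BoundedFallbackCache(cache_store.get, db_store.get, max_db_fallbacks=2)
print(reader.get("user:42"))  # cache miss -> bounded DB read -> "Ada"
```

The design choice is to fail fast: a shed request returns in milliseconds and frees the caller to retry or serve stale data, whereas an unbounded fallback ties up a database connection for the full query duration, which is exactly the overload pattern seen in this incident.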
For more information, or to report issues, please email [[email protected]](mailto:[email protected]).