Mural incident

Mural users are unable to log in

Critical Resolved

Mural experienced a critical incident on November 13, 2023 affecting Authentication, lasting 9h 59m. The incident has been resolved; the full update timeline is below.

Started
Nov 13, 2023, 12:01 PM UTC
Resolved
Nov 13, 2023, 10:00 PM UTC
Duration
9h 59m
Detected by Pingoru
Nov 13, 2023, 12:01 PM UTC

Affected components

Authentication

Update timeline

  1. investigating Nov 13, 2023, 12:01 PM UTC

    We're experiencing a service disruption that is preventing users from logging in to Mural. We're investigating the issue and will restore regular service as soon as possible. Please check our status page for the most up-to-date info 👉 status.mural.co/

  2. identified Nov 13, 2023, 12:58 PM UTC

    The issue has been identified and we are working towards implementing the fix.

  3. identified Nov 13, 2023, 01:06 PM UTC

    We are continuing to work on a fix for this issue.

  4. identified Nov 13, 2023, 01:57 PM UTC

    We are continuing to work on a fix for this issue. We appreciate your patience while we resolve this.

  5. identified Nov 13, 2023, 02:53 PM UTC

    We are continuing to work on a fix for this issue. We appreciate your patience while we resolve this.

  6. identified Nov 13, 2023, 03:48 PM UTC

    We are continuing to work on a fix for this issue. We appreciate your patience while we resolve this.

  7. identified Nov 13, 2023, 04:34 PM UTC

    The issue with logging in to Mural has been largely resolved. Users that were unable to access Mural should be able to log in again. There continues to be an intermittent performance degradation. Our team are investigating and we will continue to update our status page as this develops. Stay up-to-date with the latest info via 👉 status.mural.co

  8. monitoring Nov 13, 2023, 04:58 PM UTC

    The performance degradation issue has been addressed and service has returned to normal. Users can resume logging in and using Mural as normal. We'll continue to monitor the results of our corrections to ensure service remains stable, and will publish a full root cause analysis in the coming days.

  9. resolved Nov 13, 2023, 10:00 PM UTC

    The correction we implemented earlier has been successful in resolving the issue and full service has been restored. Some users reported connectivity issues after the earlier correction. In all cases this has been solved by clearing the browser cache and using the link app.mural.co/bye to clear any previous session data. We apologize for the inconvenience this interruption caused. We will be conducting a full review and will publish a root cause analysis in the coming days.

  10. postmortem Nov 16, 2023, 12:50 PM UTC

    **Summary**: On Saturday, November 11th at 03:00 UTC, Mural performed scheduled maintenance on our production clusters. Post-migration checks indicated all functions were performing as expected. On Monday, November 13th, some Mural customers reported difficulty logging into the Mural web application. Mural’s incident response team was immediately engaged in troubleshooting these reports.

    Initial investigations revealed that the platform upgrade over the preceding weekend had incorrect settings for the DNS infrastructure and for a key backend application's auto-scaling. This resulted in unstable connections for some users. During the course of this investigation, we also discovered that load balancing improvements had altered how our system interpreted the client’s IP address for clients with specific network and application configurations, preventing access for those clients.

    Our incident response team addressed the auto-scaling configuration, resolving DNS-related issues and restoring access for the majority of users. Next, a new load-balancing configuration underwent adjustments and testing to restore stable connections for the previously impacted users. The total time from when our incident response team started working on this incident to deploying the final fix was 9 hours 40 minutes.

    **What we’ve done to prevent this happening again:** As part of Mural’s post-incident procedure, our engineering teams conducted a thorough review to identify the root cause and outline necessary improvements. Eight separate changes have been identified and will be implemented in the coming weeks. These changes cover monitoring to detect this scenario sooner, enhanced post-migration checks to ensure this scenario and others are included in our use cases, and a review of our migration process to reduce the risks.

    We apologize for any inconvenience this incident may have caused and sincerely thank you for your patience whilst we worked through this incident.
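
The postmortem says a load-balancing change altered how the client's IP address was interpreted, but it does not describe the mechanism. As an illustration only, the sketch below shows one common way this class of problem arises: a backend derives the client address from an `X-Forwarded-For` header by skipping proxies it trusts, so adding a proxy layer that is missing from the trusted list makes the backend mistake that proxy for the client. The function name `resolve_client_ip`, the header choice, and all addresses and networks are assumptions made for this example, not details from Mural's systems.

```python
import ipaddress


def resolve_client_ip(remote_addr, x_forwarded_for, trusted_proxies):
    """Pick the client IP by scanning X-Forwarded-For right to left.

    Hypothetical helper for illustration; not Mural's implementation.
    Raises ValueError if any address or network string is malformed.
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in trusted)

    # If the direct peer is not a known proxy, the header cannot be trusted.
    if not is_trusted(remote_addr):
        return remote_addr

    hops = [h.strip() for h in (x_forwarded_for or "").split(",") if h.strip()]

    # The rightmost entries were appended by our own proxies; the first
    # untrusted address, scanning from the right, is treated as the client.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop

    # Every hop was a trusted proxy: fall back to the leftmost entry,
    # or to the direct peer when the header is empty.
    return hops[0] if hops else remote_addr


# Trusted list covers the proxy layer: the real client is recovered.
ok = resolve_client_ip("10.0.0.5", "203.0.113.7, 10.0.0.9", ["10.0.0.0/8"])

# A new proxy layer (10.1.0.4) is missing from the trusted list:
# the backend now mistakes the proxy for the client.
wrong = resolve_client_ip("10.0.0.5", "203.0.113.7, 10.1.0.4", ["10.0.0.0/24"])
```

In the second call the returned address belongs to the intermediate proxy rather than the user, which is the kind of mismatch that can cause IP-based access rules to reject otherwise valid clients.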