KnowledgeOwl incident

Site outage & reports of slowness

Critical · Resolved

KnowledgeOwl experienced a critical incident on February 16, 2023, affecting the Knowledge Bases, Web Application, and API components and lasting 2h 11m. The incident has been resolved; the full update timeline is below.

Started
Feb 16, 2023, 02:53 PM UTC
Resolved
Feb 16, 2023, 05:05 PM UTC
Duration
2h 11m
Detected by Pingoru
Feb 16, 2023, 02:53 PM UTC

Affected components

Knowledge Bases, Web Application, API

Update timeline

  1. investigating Feb 16, 2023, 02:53 PM UTC

    We've had several reports of the KnowledgeOwl app and knowledge bases being slow or inaccessible this morning. We're investigating the root cause and will provide updates as we have them.

  2. investigating Feb 16, 2023, 03:08 PM UTC

    We've confirmed a full outage of the app and knowledge bases and are actively working to get it resolved. Sorry for the disruption to your day and we hope to be back online quickly!

  3. identified Feb 16, 2023, 03:44 PM UTC

    We have identified the issue and are testing a fix.

  4. monitoring Feb 16, 2023, 03:49 PM UTC

    It looks like things have stabilized due to our fix, but we are continuing to monitor performance.

  5. identified Feb 16, 2023, 03:54 PM UTC

    The fix we implemented isn't performing as well as we'd hoped. We're taking sites down briefly to implement an additional fix.

  6. monitoring Feb 16, 2023, 04:14 PM UTC

    We've rolled out a new fix and are monitoring its performance. So far we've seen some small performance spikes but nothing that should prevent access. We'll continue to monitor to be sure things are resolved.

  7. resolved Feb 16, 2023, 05:05 PM UTC

    Our monitoring has looked good and we're seeing continued normal performance across the board, so we're marking this as Resolved. We have noticed some issues with knowledge base tables of contents either missing some content or having duplicate articles. If your knowledge base is showing either of these issues, please reach out to our support team and we can get you back to a normal table of contents state. Thank you all for your patience and being so gracious with our team today through this whole outage. We'll post a full postmortem after we've fully fleshed out the root cause and next steps.

  8. postmortem Feb 16, 2023, 06:37 PM UTC

    # Summary

    Last night we released a set of changes to the table of contents that had a bug in it. This bug caused display issues in tables of contents. First, we tried to hotfix the issue. The hotfix caused some processes to eat up more memory than normal. The memory shortage built until it caused a slowdown and an outage. At this point, we rolled the release back completely.

    # Next Steps

    We already have a new fix ready to work into the initial release. Thanks to today's issues, we've identified several opportunities for improvement:

    ## Short-term

    We're updating our testing and release processes for changes to the table of contents. We'll be using these revamped processes to test the fix and the full release before we take it live.

    ## Mid-term

    We're reviewing our enterprise- and business-level account SLAs. We'll be issuing credits to customers whose up-time SLAs weren't met this month. If you're a customer in one of these tiers, you can expect to hear from a member of our team to discuss this in more detail.

    ## Long-term

    We've also identified several possible improvements in our load-testing processes. We'll be making changes to those processes, too.

    # What you can do

    While the bug was live, it may have caused changes to your knowledge base's table of contents. Please review your table of contents. If you see duplicate articles or missing subcategories, please email [[email protected]](mailto:[email protected]) so we can get it fixed.

    # Thank you

    Thank you for your patience and grace with us through this month's issues. Outages are every software provider's worst nightmare and we are very thankful to have such amazing customers.