Site outage & reports of slowness
Incident Report for KnowledgeOwl
Postmortem

Summary

Last night we released a set of changes to the table of contents that contained a bug. This bug caused display issues in tables of contents. We first tried to hotfix the issue, but the hotfix caused some processes to consume more memory than normal. The memory shortage built up until it caused a slowdown and then an outage. At that point, we rolled the release back completely.

Next Steps

We already have a new fix ready to incorporate into the initial release. Today's issues have also helped us identify several opportunities for improvement:

Short-term

We're updating our testing and release processes for changes to the table of contents. We'll be using these revamped processes to test the fix and the full release before we take it live.

Mid-term

We're reviewing our enterprise- and business-level account SLAs. We'll be issuing credits to customers whose uptime SLAs weren't met this month. If you're a customer in one of these tiers, you can expect to hear from a member of our team to discuss this in more detail.

Long-term

We've also identified several possible improvements in our load-testing processes. We'll be making changes to those processes, too.

What you can do

While the bug was live, it may have caused changes to your knowledge base's table of contents. Please review your table of contents. If you see duplicate articles or missing subcategories, please email support@knowledgeowl.com so we can get it fixed.

Thank you

Thank you for your patience and grace with us through this month's issues. Outages are every software provider's worst nightmare, and we are very thankful to have such amazing customers.

Posted Feb 16, 2023 - 13:48 EST

Resolved
Our monitoring has looked good and we're seeing continued normal performance across the board, so we're marking this as Resolved. We have noticed some issues with knowledge base tables of contents either missing some content or having duplicate articles. If your knowledge base is showing either of these issues, please reach out to our support team and we can restore your table of contents to its normal state. Thank you all for your patience and for being so gracious with our team today through this whole outage. We'll post a full postmortem after we've fully fleshed out the root cause and next steps.
Posted Feb 16, 2023 - 12:05 EST
Monitoring
We've rolled out a new fix and are monitoring its performance. So far we've seen some small performance spikes but nothing that should prevent access. We'll continue to monitor to be sure things are resolved.
Posted Feb 16, 2023 - 11:14 EST
Identified
The fix we implemented isn't performing as well as we'd hoped. We're taking sites down briefly to implement an additional fix.
Posted Feb 16, 2023 - 10:54 EST
Monitoring
It looks like things have stabilized due to our fix, but we are continuing to monitor performance.
Posted Feb 16, 2023 - 10:49 EST
Identified
We have identified the issue and are testing a fix.
Posted Feb 16, 2023 - 10:44 EST
Update
We've confirmed a full outage of the app and knowledge bases and are actively working to get it resolved. Sorry for the disruption to your day, and we hope to be back online quickly!
Posted Feb 16, 2023 - 10:08 EST
Investigating
We've had several reports of the KnowledgeOwl app and knowledge bases being slow or inaccessible this morning. We're investigating the root cause and will provide updates as we have them.
Posted Feb 16, 2023 - 09:53 EST
This incident affected: Knowledge Bases, Web Application, and API.