Intermittent Outages and Slowness
Incident Report for KnowledgeOwl
Postmortem

Summary

In the early morning of Friday, March 24th, our systems began to see large spikes of web traffic from various IP addresses. These traffic spikes were targeting our public website (www.knowledgeowl.com). This high traffic volume overwhelmed the reverse proxy system that we use to serve the public website, the KnowledgeOwl application, and customer knowledge bases. As an immediate mitigation step, we changed the configuration of our public website to no longer use our primary proxy. This seemed to stabilize the application and customer knowledge bases. The traffic spikes continued until late morning.

Our investigations indicated that these connections were not initiated by customers. The associated web requests did not appear to be legitimate. While we cannot say with certainty that these connections were malicious, we are treating this incident as a distributed denial of service (DDoS) attack.

Next Steps

The risk from high-volume traffic spikes like these is almost impossible to completely remove. However, we are reviewing our systems and processes to better handle these kinds of traffic spikes. We have already identified some concrete next steps to reduce the overall risk:

Short-term

In the short term, we are taking three steps:

  1. We are exploring ways to further separate our public website from the KnowledgeOwl application and customer knowledge bases.

  2. Friday's incident provided us with data on how to better identify these types of events. We are building that knowledge into our processes moving forward.

  3. We have already begun modifying our proxy and web application firewall configuration to make our traffic management infrastructure a bit more robust. We'll continue to monitor these changes and iterate on them as needed.
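As a purely illustrative sketch of the kind of spike identification described in step 2 (this is not KnowledgeOwl's actual tooling, and the class name, window length, and threshold are all assumptions), a sliding-window counter can flag when requests for a given key exceed a limit. Because the traffic in this incident came from various IP addresses, the key here could be the targeted hostname rather than a single client IP.

```python
from collections import defaultdict, deque


class SpikeDetector:
    """Flag keys whose request count in a sliding time window exceeds a limit.

    Hypothetical illustration: the key may be a client IP or a target hostname,
    and the default window and limit are assumed values, not real settings.
    """

    def __init__(self, window_seconds=10.0, max_requests=100):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def record(self, key, timestamp):
        """Record one request; return True if this key is now over the limit."""
        hits = self._hits[key]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

In a setup like this, a `True` result would only be a signal to investigate or to tighten proxy or firewall rules; choosing the window and limit would depend on normal traffic levels for each site.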

Long-term

This event has highlighted some potential architectural improvements to our infrastructure, mainly around our traffic-management systems. We'll review the feasibility and effectiveness of these changes in the coming months.

Thank you

Above all, we want to thank you for your patience during this incident. We know that we've had a higher number of downtime incidents in the last two months. We know how integral your knowledge base can be to daily operations. Our team is working hard to learn from this experience and to make KnowledgeOwl stronger in the future.

Posted Mar 27, 2023 - 14:18 EDT

Resolved
We are officially marking this incident as resolved. All systems have been operating normally since approximately 12 pm EDT, and we have not seen any more malicious-looking traffic hitting our servers. We are continuing to investigate the source of the potential attack, and we are planning to share a postmortem on Monday.

We are sorry for the trouble this caused and we appreciate the grace everyone has shown as we dealt with this problem. Let us know if we can help with anything in the meantime, and we will be in touch early next week with our postmortem.
Posted Mar 24, 2023 - 16:20 EDT
Update
Thank you everyone for your patience today. Based on our investigation, it appears that KnowledgeOwl was attacked by malicious traffic which caused the intermittent outages and slowness. We have made some changes to mitigate the impact of the spikes, but we are not quite ready to mark this as resolved as we want to ensure that there are no more spikes of illegitimate traffic. We will be providing more information and a postmortem once we feel the issue is fully resolved. Please let us know if you are still having issues.
Posted Mar 24, 2023 - 14:00 EDT
Update
Quick update from our end. The app and knowledge bases have come back online and seem to be stable now. We have a lead on a potential root cause, but we are still working through it. We are so sorry for the disruptions to your day and we will be sharing more information as we work to fully resolve the problem.
Posted Mar 24, 2023 - 12:26 EDT
Update
The system is down again. We are dealing with an abnormal amount of traffic and are working to resolve it ASAP. We are so sorry for the trouble today.
Posted Mar 24, 2023 - 11:17 EDT
Monitoring
Our CTO confirmed that all systems are operational again and we will continue to monitor. We have a possible lead as to the cause of today's issues and will post more information as it becomes available. Sorry again for the trouble and let us know if you are still having any issues!
Posted Mar 24, 2023 - 10:05 EDT
Update
We are continuing to investigate the intermittent outages and slowness. Since just before 7 am EDT, the system seems to have gone down and recovered on its own a few times. While it appears to be back up and working, we will leave this incident active as we continue to investigate and monitor the situation. We will continue to post updates here along with a postmortem once it is resolved. Thanks for your patience this morning and our apologies for the disruption to your day!
Posted Mar 24, 2023 - 09:49 EDT
Investigating
We are currently investigating and are working to get everything back up and running asap. We will be posting updates here!
Posted Mar 24, 2023 - 08:49 EDT
This incident affected: Knowledge Bases, Web Application, and API.