Xink incident

Outlook add-in not accessible or slow

Xink experienced a major incident on May 22, 2023, affecting the Outlook Add-in and lasting 20 minutes. The incident has been resolved; the full update timeline is below.

Started: May 22, 2023, 04:50 PM UTC
Resolved: May 22, 2023, 05:10 PM UTC
Duration: 20m
Detected by Pingoru: May 22, 2023, 04:50 PM UTC

Affected components

Outlook Add-in

Update timeline

investigating May 22, 2023, 07:12 PM UTC

Some clients may experience downtime when loading signatures. We are currently investigating this issue.

resolved May 22, 2023, 07:15 PM UTC

This incident has been resolved.

postmortem May 23, 2023, 11:53 PM UTC

# RCA: Increased Backend Load Leading to Latency and Timeouts

**Summary:** Our system experienced a critical incident characterized by increased load on backend services, resulting in high latency and timeouts. This postmortem aims to outline the causes of the issue, detail the steps taken to mitigate its impact, and highlight future actions to prevent similar incidents from occurring.

**Incident Details:** The incident was primarily caused by a surge in load on our backend services, exceeding the capacity they were designed to handle. This influx of requests led to increased latency and, in some cases, timeouts for our clients. The impact was felt across the system, negatively affecting the overall performance and user experience.

**Mitigation Steps:** In response to the incident, our team took the following steps to address the issue:

1. Scaling Up Add-In Instances: As an initial approach, we scaled up the number of instances for our add-in service. This allowed us to distribute the load more efficiently, accommodating the increased demand. Consequently, we observed a noticeable improvement in performance, with a 10-30 percent increase.
2. Implementing Local Image Caching: To reduce the number of calls to the backend, we implemented a cache storage mechanism for images within the add-in. This solution effectively stored accessed images locally, minimizing the reliance on backend services. As a result, the performance of image-related operations saw a significant improvement.

**Next Steps:** While the above measures have proven effective in alleviating the immediate impact of the incident, we recognize the need for further enhancements to ensure a more robust and resilient system. The following steps will be taken to enhance the system's performance and prevent future occurrences:

1. Cache Storage for Signatures Service: Building upon the success of implementing local image caching, we plan to extend this approach to the signatures service. By storing signature data locally, we expect to observe a substantial increase in overall performance.
2. Load Testing and Capacity Planning: We will conduct comprehensive load testing to identify potential bottlenecks and capacity limitations in our backend infrastructure. This exercise will enable us to better anticipate future scalability requirements and proactively adjust resources as needed.
3. Monitoring and Alerting Improvements: Enhancements will be made to our monitoring and alerting systems to promptly detect and respond to any abnormal spikes in backend load. Real-time visibility into system metrics will aid in identifying issues before they escalate and impact the user experience.
4. Continuous Optimization: Our team remains committed to continuously optimizing our backend services to handle increasing loads more efficiently. This includes fine-tuning algorithms, optimizing database queries, and exploring performance optimizations at various layers of the application stack.

We apologize for any inconvenience caused by this incident and appreciate your understanding and support as we work towards a more robust system. Should you have any further questions or concerns, please do not hesitate to reach out to our support team.
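
For readers who want a concrete picture of the local caching described in the mitigation and next steps, the following is a minimal, illustrative sketch of a cache-aside lookup with a time-to-live. It is not the actual add-in code: `LocalTtlCache`, `getOrFetch`, `fetchFromBackend`, and the TTL value are hypothetical names and parameters chosen for this example, and a real add-in would likely persist entries in browser storage rather than in memory.

```typescript
// Illustrative sketch only: every name below is hypothetical and is not
// taken from the actual add-in implementation.
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

class LocalTtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be exercised without real waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop stale data so it is refetched
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Cache-aside lookup: call the backend only when the local copy is missing
// or stale. `fetchFromBackend` stands in for whatever call loads the asset.
async function getOrFetch<V>(
  cache: LocalTtlCache<V>,
  key: string,
  fetchFromBackend: (key: string) => Promise<V>
): Promise<V> {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await fetchFromBackend(key);
  cache.set(key, fresh);
  return fresh;
}
```

With a sketch like this, repeated requests for the same image or signature within the TTL window are served locally, so only the first request (and the first one after expiry) reaches the backend. The TTL is a trade-off: a longer value reduces backend load further but delays the point at which an updated signature becomes visible.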