Eptura Workplace incident

Unable to Upload Or Download Files

Critical Resolved

Eptura Workplace experienced a critical incident on April 11, 2024, affecting Copy Module and Softspace and lasting 1h 28m. The incident has been resolved; the full update timeline is below.

Started
Apr 11, 2024, 09:04 PM UTC
Resolved
Apr 11, 2024, 10:32 PM UTC
Duration
1h 28m
Detected by Pingoru
Apr 11, 2024, 09:04 PM UTC

Affected components

Copy Module, Softspace

Update timeline

  1. investigating Apr 11, 2024, 09:04 PM UTC

    We are currently investigating an issue in which uploading or downloading files produces an error. We will update the status page by 5:03 PM MST, or sooner if we identify the issue before then.

  2. monitoring Apr 11, 2024, 09:32 PM UTC

    A fix has been implemented. We are moving into the Monitoring Phase for the next 1 hour.

  3. resolved Apr 11, 2024, 10:32 PM UTC

    As we have not seen further service disruptions after the fix was implemented, we have moved to the Resolved Phase.

  4. postmortem May 16, 2024, 04:16 PM UTC

    **Eptura Workplace Detailed Root Cause Analysis – S1 – Unable to Upload or Download Files**

    **Description:** On April 11th, 2024, we began receiving reports of users being unable to access the Copy, Upload, and Download products. Users reported errors when attempting to use these modules, specifically being directed to a page stating that a "Timeout" had occurred.

    **Type of Event:** Service Disruption

    **Service/Modules Impacted:** Copy, AutoCad, SoftSpace

    **Remediation:** Engineering redistributed our load balancers and reset the affected part of the module, restoring access.

    **Timeline: (Times are in MST)**

    - 2:58 PM - Issue identified, Tier 2 pulls the FireAlarm.
    - 3:04 PM - Status update posted by the Manager.
    - 3:36 PM - Engineering reports they are addressing the issue and have implemented a fix.
    - 3:33 PM - Moved to monitoring after internal confirmation that service is operational.
    - 4:33 PM - Status page updated to reflect issue resolution.

    **Total Duration of Event:** 28 minutes

    **Root Cause Analysis:** A server was overwhelmed with data requests, and our load balancers failed to distribute the workload effectively. The server had not been updated for a while, and the retirement of some previous services contributed to the inefficient data distribution, leading to downtime.

    **Preventative Action:** We have implemented monitoring for these servers and established a reminder to update them as needed according to our maintenance cycle.

    Thank you for your patience and understanding during this disruption. We assure you that we are dedicated to continually improving our services to better serve your needs.