Hive experienced a notice incident on October 27, 2022 affecting Web Application and Desktop Application, lasting 5h 12m. The incident has been resolved; the full update timeline is below.
Affected components: Web Application, Desktop Application
Update timeline
- investigating Oct 27, 2022, 12:18 PM UTC
We are currently investigating this issue.
- investigating Oct 27, 2022, 12:18 PM UTC
We are currently working to roll back changes which were deployed earlier and are related to the issue.
- investigating Oct 27, 2022, 12:26 PM UTC
We have failed over to a stable application version while we work to identify the root cause. The application should be available and working now. We will leave this incident open while we investigate the original cause, implement a fix, and monitor before resolving.
- investigating Oct 27, 2022, 12:55 PM UTC
We are continuing to investigate the issue, but systems have now remained stable.
- resolved Oct 27, 2022, 05:30 PM UTC
All systems have remained stable since our earlier updates at 8:26am and 8:55am Eastern. We have gone ahead and unified application states across all environments to ensure no users experience version mismatches. We'll continue to actively monitor stability throughout the day, and do not anticipate any further issues. A post-mortem has been underway since ~9:30am Eastern this morning and will be posted here once finalized.
- postmortem Oct 28, 2022, 06:12 PM UTC
# Context and timeline

On behalf of the team here at Hive, we would like to apologize for the interruptions to service yesterday, and we appreciate your patience as we worked to restore service continuity.

As posted in the incident status updates, the Hive web platform experienced service disruptions which impacted project loading from 8:01am through to 8:26am Eastern. The incident was left open as a partial outage while we monitored the failover from 8:26am through to 8:55am Eastern, and was left in a monitoring state through to incident close-out.

A detailed timeline, including the mitigation steps taken, is listed below (all times stated are in the Eastern timezone):

**8:01am -** Application monitor alarms were raised, notifying our team of issues following the completion of an application deployment.

**8:15am -** Initial investigation confirmed that the issues were widespread, impacting users who had been swapped over to the latest web application refresh.

**8:18am -** Application deployment reversion started. Failover to stable environment initiated.

**8:26am -** Confirmation that all users had been switched over to the failover environment and that the project loading service disruption was resolved.

**8:30am -** Upon review of logs after switching to the failover environment, the team confirmed that one specific scenario, project creation from templates with pre-configured table layout options, failed to fully complete. This specific issue remained until a separate service redeployment, which was initiated at 8:18am. The issue was due to an application version mismatch and impacted just below 2% of the active user population.

# Root cause

In short, a web application deployment (which completed just before 8am) contained a cached version of a pre-production Hive build, leading to mismatched application versions and logic between services. Upon review of the deployment command logs, the team has confirmed that this cached version was previously deployed to a pre-production environment and was not properly cleared out before the production deployment was built.

# Remediation plan

Our deployment scripts already ask for written confirmation before initiating a deployment, and the confirmation shows which version (branch/build) and target environment are involved. However, warnings about potential untracked or cached changes are not shown. To ensure that the root cause, mismatched application versions being deployed, never happens again, the team has taken steps to update the deployment commands and contextual information so that a deployment will automatically fail in the event of untracked or cached changes.
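For illustration only, a pre-deployment guard along the lines described in the remediation plan could look like the sketch below. It is not Hive's actual deployment script. It assumes the build is produced from a Git working tree and treats "cached" changes as changes staged in the Git index. The function names `classify_porcelain` and `check_clean_or_exit` are hypothetical.

```python
import subprocess
import sys


def classify_porcelain(output: str) -> dict[str, list[str]]:
    """Sort `git status --porcelain` (v1) lines into untracked, staged, and unstaged paths."""
    found = {"untracked": [], "staged": [], "unstaged": []}
    for line in output.splitlines():
        if len(line) < 4:
            continue  # not a valid "XY path" status line
        index_state, worktree_state, path = line[0], line[1], line[3:]
        if line.startswith("??"):
            found["untracked"].append(path)
            continue
        if line.startswith("!!"):
            continue  # ignored files only appear when --ignored is passed
        if index_state != " ":
            found["staged"].append(path)  # change recorded in the index ("cached")
        if worktree_state != " ":
            found["unstaged"].append(path)  # change present only in the working tree
    return found


def check_clean_or_exit(repo_dir: str = ".") -> None:
    """Abort with a non-zero exit status unless the working tree is clean."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    dirty = {kind: paths for kind, paths in classify_porcelain(result.stdout).items() if paths}
    if dirty:
        for kind, paths in dirty.items():
            print(f"{kind}: {', '.join(paths)}", file=sys.stderr)
        sys.exit("Deployment aborted: working tree is not clean.")
```

A deployment command would call `check_clean_or_exit()` before building, so that any leftover untracked or staged file stops the deployment instead of only producing a warning.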