TaxJar experienced a critical incident on January 11, 2021 affecting TaxJar Reporting and Tax Calculations API and 1 more component, lasting 1h 30m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jan 11, 2021, 06:30 PM UTC
We are currently investigating this issue.
- investigating Jan 11, 2021, 06:36 PM UTC
We are continuing to investigate this issue.
- investigating Jan 11, 2021, 07:07 PM UTC
We are continuing to investigate this issue.
- identified Jan 11, 2021, 07:30 PM UTC
The issue has been identified and a fix is being implemented.
- monitoring Jan 11, 2021, 07:33 PM UTC
A fix has been implemented and we are monitoring the results.
- monitoring Jan 11, 2021, 07:34 PM UTC
We are continuing to monitor for any further issues.
- monitoring Jan 11, 2021, 07:50 PM UTC
We are continuing to monitor for any further issues.
- resolved Jan 11, 2021, 08:00 PM UTC
This incident has been resolved.
- postmortem Jan 14, 2021, 05:05 AM UTC
During this incident, TaxJar customers were not able to access the TaxJar App or use the TaxJar API. We know this was impactful, and we are truly sorry it happened.

We have already implemented the following operational changes to ensure this type of failure does not happen again:

* We switched to a blue-green deployment pattern so that we can better verify changes to production environments.
* We are conducting a full audit of our vendor-provided managed services that lack an acceptable level of rollback capability.

**Incident Root Cause Analysis**

* The incident started with a routine Kubernetes minor version upgrade using our vendor’s managed Kubernetes service.
* We have completed this upgrade 15 times in the past across 3 accounts and 2 regions. We perform it quarterly in order to keep pace with Kubernetes releases.
* Immediately after the upgrade of our production cluster completed, Kubernetes workers began reporting “Not Ready” status.
* Within a few minutes all nodes were “Not Ready”, which caused our load balancers to mark all workloads as offline.
* Kubernetes upgrades on our vendor’s managed Kubernetes service cannot be rolled back. Furthermore, new deployments and upgrades to the managed Kubernetes service can take 30-50 minutes to complete, which forced us to resolve the immediate issue rather than roll back.
* The vendor’s support team was able to identify the issue:
  * Clusters starting with Kubernetes version 1.14 create a cluster security group when they are created.
  * This security group is designed to allow all traffic to flow freely between the control plane and [managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html).
  * After the upgrade was completed, the vendor identified that the security group no longer had the rules required to allow this traffic to pass, even though those rules had remained in place after every prior minor version upgrade.
* We manually added the missing rule, which restored connectivity to our managed Kubernetes cluster.
* At this point our services started coming back online.
* Several other security groups, managed with CloudFormation, relied on this rule for connectivity between our K8s workloads and other services provided by the vendor (such as memory caches and databases). They were found to have been unexpectedly altered by the upgrade and also had to be repaired before all services could be restored.
* We continue to work with the vendor to understand why the managed service did not operate as documented.
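
The two symptoms described above (nodes reporting “Not Ready” and a cluster security group missing its allow rule) lend themselves to an automated post-upgrade check. The following is a minimal sketch, not TaxJar's tooling. It assumes the node data has the shape of `kubectl get nodes -o json` output, that the security group data has the shape of one entry in an EC2 `DescribeSecurityGroups` response, and that the required rule is an all-protocol ingress rule whose source is the group itself. The function names `not_ready_nodes` and `allows_all_from_self` are hypothetical.

```python
from typing import Any, Dict, List


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return the names of nodes whose Ready condition is not "True".

    Assumption: `node_list` is shaped like `kubectl get nodes -o json`,
    i.e. items[].status.conditions[] entries with "type" and "status".
    """
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated the same as "False"/"Unknown".
        if ready is None or ready.get("status") != "True":
            bad.append(node["metadata"]["name"])
    return bad


def allows_all_from_self(security_group: Dict[str, Any]) -> bool:
    """Return True if the group has an all-protocol ingress rule from itself.

    Assumption: `security_group` is shaped like one SecurityGroups entry
    of an EC2 DescribeSecurityGroups response, where IpProtocol "-1"
    means all protocols and UserIdGroupPairs lists source groups.
    """
    group_id = security_group["GroupId"]
    for permission in security_group.get("IpPermissions", []):
        if permission.get("IpProtocol") != "-1":
            continue
        pairs = permission.get("UserIdGroupPairs", [])
        if any(pair.get("GroupId") == group_id for pair in pairs):
            return True
    return False
```

Run right after an upgrade, a non-empty result from `not_ready_nodes` or a `False` from `allows_all_from_self` would point at the failure mode in this incident before load balancers start marking workloads offline. Both functions take plain dictionaries, so they can be exercised against saved JSON without touching a live cluster.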