MageMojo incident

Increased error rates for internal CoreDNS resolution.

MageMojo experienced a notice incident on December 7, 2020, lasting —. The incident has been resolved; the full update timeline is below.

Started: Dec 07, 2020, 09:07 PM UTC
Resolved: Dec 06, 2020, 02:30 AM UTC
Duration: —
Detected by Pingoru: Dec 07, 2020, 09:07 PM UTC

Update timeline

resolved Dec 07, 2020, 09:07 PM UTC

Internal DNS requests are showing an increase in error rates.

postmortem Dec 07, 2020, 09:07 PM UTC

Our T1 team was alerted to a problem when our HTTP monitor alerts began firing. They investigated, confirmed the alerts were accurate, and escalated to T2. Normally T1 opens an incident on our status page, but the T1 lead had only recently been promoted to shift lead and did not yet have access to the status page. T2 immediately escalated to our development team.

The dev team found that one of the CoreDNS instances on the internal cluster had started to fail internal DNS requests. Internal DNS requests are used to resolve service names for each individual customer, such as “redis”, “varnish”, and “php-fpm”. When these fail to resolve, the page request fails to render.

The team tried to SSH into the CoreDNS instance but initially could not connect. They eventually succeeded, but the console was very laggy and barely responsive. They tried to reboot the instance, but it was hung. They then forcibly stopped and started the instance, which moved it to new physical hardware at AWS. At this point the instance recovered and the HTTP alerts cleared.

We found network errors in the logs confirming a network connectivity problem on the AWS side, although the network was not completely down. Because the failure was only partial, AWS did not see the failed health checks that would have triggered it to move and recover the instance automatically. For the same reason, the internal quorum did not agree that the CoreDNS pod had failed, so the pod was not automatically removed from the internal CoreDNS cluster.

We also found that this CoreDNS pod had landed on a kube master at some point in its life. Internal DNS requests are rerouted by the masters after a request fails. Because both the CoreDNS pod and the master hosting it were having connectivity issues, the masters were unable to automatically reroute failed DNS requests from the failing CoreDNS pod to the healthy CoreDNS pods.

The new T1 shift lead now has access to the status page, and we reviewed the playbook to confirm they have clear instructions and access to post future status page updates.

Masters, while being the default location where CoreDNS pods want to live, have been excluded from running CoreDNS pods, ensuring CoreDNS pods are evenly distributed and isolated. Micro-caching CoreDNS pods are being added to every worker so that, in the event of a larger CoreDNS issue, each worker node has its own local CoreDNS cache to serve requests from.
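
To illustrate why a per-worker micro cache helps, here is a minimal Python sketch of the idea. It is not CoreDNS code and not our production configuration: the `MicroCache` class, its `ttl` parameter, and the choice to serve a stale answer when the upstream lookup fails are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class MicroCache:
    """Hypothetical per-node DNS cache (illustration only).

    Answers from memory while an entry is fresh, and falls back to the
    last known answer if the upstream resolver raises an error.
    """

    def __init__(self, upstream: Callable[[str], str], ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._upstream = upstream   # function that performs the real lookup
        self._ttl = ttl             # seconds an entry counts as fresh
        self._clock = clock         # injectable clock, useful in tests
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]        # fresh hit: no upstream request needed
        try:
            address = self._upstream(name)
        except OSError:
            if cached is not None:
                return cached[0]    # upstream failed: serve the stale answer
            raise                   # nothing cached, so the caller sees the error
        self._entries[name] = (address, now)
        return address
```

In this sketch, a service name such as “redis” that was resolved recently keeps resolving on the worker even while the upstream resolver is unreachable. A name that was never cached still fails, so a local cache reduces the impact of an upstream failure without removing it.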