Eligible incident

Intermittent API failures

Eligible experienced a minor incident on July 27, 2016, lasting 6h 1m. The incident has been resolved; the full update timeline is below.

Started: Jul 27, 2016, 02:26 PM UTC
Resolved: Jul 27, 2016, 08:28 PM UTC
Duration: 6h 1m
Detected by Pingoru: Jul 27, 2016, 02:26 PM UTC

Update timeline

investigating Jul 27, 2016, 02:26 PM UTC

We are experiencing higher than normal error rates. The team is investigating the issue.
investigating Jul 27, 2016, 02:31 PM UTC

We have identified the issue and deployed a fix. Our team is going to continue monitoring the situation.
monitoring Jul 27, 2016, 03:13 PM UTC

A fix has been implemented and we are monitoring the results.
resolved Jul 27, 2016, 08:28 PM UTC

The incident from this morning has been completely resolved. Public postmortem is coming later today.
postmortem Aug 01, 2018, 07:38 PM UTC

## Incident Overview

On the morning of July 27, 2016, Eligible's API had a serious outage, and many of our customers experienced intermittent response timeouts and service reliability issues. Eligible engineering and technical operations teams have identified the problem, fixed the root cause of the incident, and are working hard to apply additional remediations that should prevent this class of issues from happening in the future.

## Details of The Incident

While optimizing the underlying logic that defines our API request routing, Eligible staff introduced a regression to our routing algorithm. In some cases, this regression caused the routing algorithm to prefer backup/failover payer connections, even when the primary connections remained available.

The problematic code change was deployed on the morning of July 27, 2016. After the deployment of the new routing code, no immediate issues were apparent. However, requests for some of our large payer connections were being automatically routed to their backup/failover connections instead of the configured primary connections.

About 20 minutes after the deploy, the secondary/backup routing company suffered an outage due to the increased traffic from Eligible's customers. As a result, the Eligible customers directed to that secondary routing company started experiencing delays and timeouts from Eligible's API.

Eligible monitoring identified this service issue immediately. In response, Eligible engineers rolled back the deployed code, removing the problematic change and restoring full functionality to the API.

With the service restored, focus was directed towards identifying and fixing the cause of the incident:

* New tests were added to stress the problematic routing code and to reproduce the production issue.
* The root cause was identified within the routing layer.
* The problem code was fixed.
* The new and patched code was deployed, both optimizing request routing and eliminating the problem of errant request routing.

## Remediations

Eligible incident response practices include rigorous postmortems. Every customer-facing incident is analyzed by a team of engineers, and a list of remediation steps is produced for every incident. Specific to this incident, Eligible is taking the following steps to prevent this class of issues from happening again:

* Additional mandatory code reviews will be performed for all changes involving our routing logic.
* Additional automated test coverage will validate the routing logic prior to every deploy.
* We will improve our monitoring and alerting systems to provide better visibility into production transaction routing processes.
* We will perform regularly scheduled load testing and capacity planning of our backup connections and 3rd party services, to ensure we retain the ability to handle primary connection outages in the future.
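
For readers who want a concrete picture of the routing rule at the center of this incident (use the primary connection whenever it is available, and fall back to a backup only when it is not), the following is a minimal illustrative sketch of that rule together with the kind of pre-deploy check described in the remediations. It is not Eligible's routing code: `Connection`, `select_connection`, `check_routing_invariant`, and the connection names are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Connection:
    """Hypothetical payer connection; not a real Eligible type."""
    name: str
    is_primary: bool
    available: bool


def select_connection(connections: Sequence[Connection]) -> Optional[Connection]:
    """Return an available primary if one exists, otherwise an available backup.

    Returns None when no connection is available at all.
    """
    available = [c for c in connections if c.available]
    for conn in available:
        if conn.is_primary:
            return conn
    # No primary is up: fall back to the first available backup, if any.
    return available[0] if available else None


def check_routing_invariant() -> None:
    """Pre-deploy style check: a backup is chosen only when the primary is down."""
    primary_up = Connection("payer-primary", is_primary=True, available=True)
    primary_down = Connection("payer-primary", is_primary=True, available=False)
    backup_up = Connection("payer-backup", is_primary=False, available=True)

    # The backup is listed first on purpose: list order must not override
    # the preference for an available primary.
    assert select_connection([backup_up, primary_up]) is primary_up
    # Failover happens only when the primary is unavailable.
    assert select_connection([backup_up, primary_down]) is backup_up
    # With nothing available, no connection is selected.
    assert select_connection([primary_down]) is None


check_routing_invariant()
```

A check of this shape fails if a change makes the selector pick a backup while the primary is still available, which is the behavior described in the incident details.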