Mezmo incident

Ingestion of new logs to our Syslog endpoint – for logs sent using a custom port, only – is intermittently delayed

Mezmo experienced a notice incident on February 26, 2022 affecting Log Ingestion (Syslog), lasting 1d 23h. The incident has been resolved; the full update timeline is below.

Started: Feb 26, 2022, 07:51 PM UTC
Resolved: Feb 28, 2022, 06:58 PM UTC
Duration: 1d 23h
Detected by Pingoru: Feb 26, 2022, 07:51 PM UTC

Affected components

Log Ingestion (Syslog)

Update timeline

identified Feb 26, 2022, 07:51 PM UTC

Ingestion of new logs to our Syslog endpoint using a Custom Port is intermittently failing.

identified Feb 27, 2022, 07:08 PM UTC

Ingestion of new logs to our Syslog endpoint using a Custom Port is still intermittently failing. We are continuing to work on a fix.

monitoring Feb 28, 2022, 02:23 AM UTC

A fix has been implemented for the ingestion of new logs to our Syslog endpoint using a custom port. We will continue to monitor the results.

resolved Feb 28, 2022, 06:58 PM UTC

This incident has been resolved.

postmortem Mar 01, 2022, 08:35 PM UTC

**Dates:**
Start Time: Saturday, February 26, 2022, at 19:51 UTC
End Time: Sunday, February 27, 2022, at 22:13 UTC
Duration: 26:22:00

**What happened:**
Ingestion of new logs to our Syslog endpoint – for logs sent using a custom port, only – was intermittently delayed.

**Why it happened:**
We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. Logs are sent from a range of ports on Syslog Forwarder to a range of ports used by clients running on downstream services. This design worked well in our advance testing, using a limited number of custom ports.

Once running in production, however, the Syslog Forwarder needed to connect to a much larger number of custom ports. We then saw that the ephemeral port ranges of the clients running on downstream services overlapped with the port ranges used by the Syslog Forwarder. This led to occasional port conflicts when services and/or clients tried to start. The services and/or clients would attempt to start again until they found an open port without conflicts. This created delays in ingestion.

**How we fixed it:**
We changed the ephemeral port ranges of the clients running on downstream services so they no longer overlapped with the port ranges used by the Syslog Forwarder.

**What we are doing to prevent it from happening again:**
The new ephemeral port range has been incorporated and proven resilient in production. No further work is needed to prevent this kind of incident from happening again.
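
The port conflict described in the postmortem comes down to two inclusive port ranges sharing at least one port. The sketch below shows one way such an overlap could be checked. It is an illustration only: the postmortem does not state the actual ranges, so `FORWARDER_PORTS` and `EPHEMERAL_AFTER` are hypothetical values, and `EPHEMERAL_BEFORE` uses a common Linux default (the format found in `/proc/sys/net/ipv4/ip_local_port_range`), on the assumption that the downstream hosts run Linux.

```python
from typing import Optional, Tuple

PortRange = Tuple[int, int]  # inclusive (low, high)


def parse_port_range(text: str) -> PortRange:
    """Parse a 'low high' pair, the format of ip_local_port_range on Linux."""
    low, high = (int(part) for part in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {text!r}")
    return low, high


def overlap(a: PortRange, b: PortRange) -> Optional[PortRange]:
    """Return the inclusive range shared by a and b, or None if disjoint."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


# Hypothetical range reserved for the forwarder (not from the postmortem).
FORWARDER_PORTS: PortRange = (40000, 49999)

# A common Linux default ephemeral range, used here as the "before" state.
EPHEMERAL_BEFORE = parse_port_range("32768\t60999")

# Hypothetical narrowed ephemeral range, standing in for the fix.
EPHEMERAL_AFTER = parse_port_range("50000 60999")

if __name__ == "__main__":
    print("before fix:", overlap(FORWARDER_PORTS, EPHEMERAL_BEFORE))
    print("after fix: ", overlap(FORWARDER_PORTS, EPHEMERAL_AFTER))
```

With these illustrative values, the "before" ranges share ports, so a client's outgoing connection could be assigned a port the forwarder path also needs. The "after" ranges are disjoint, which is the property the fix established.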