Alpaca incident

OPRA Message Processing Issue

Notice Resolved

Alpaca experienced a notice-level incident on September 15, 2025, lasting 19h 49m. The incident has been resolved; the full update timeline is below.

Started
Sep 15, 2025, 03:57 PM UTC
Resolved
Sep 16, 2025, 11:46 AM UTC
Duration
19h 49m
Detected by Pingoru
Sep 15, 2025, 03:57 PM UTC

Update timeline

  1. investigating Sep 15, 2025, 03:57 PM UTC

    We experienced an issue with processing OPRA messages between 9:30 AM and 9:42 AM ET.

  2. resolved Sep 16, 2025, 11:46 AM UTC

    The issue has already been mitigated.

  3. postmortem Sep 16, 2025, 11:46 AM UTC

    We use Aeron transport to send preprocessed Exegy data from GCP VMs to our Kubernetes cluster. The aeron-driver component, which is responsible for the UDP message transport, showed many NAKs (retransmission requests), and our application showed Aeron publication backpressure and therefore a lower number of processed messages.

    This happens occasionally, usually after we restart the components, including aeron-driver. An aeron-proxy restart helps, and the problem does not reappear until the next restart; the system can run for weeks without a problem. Note that we had scheduled maintenance this weekend.

    The situation is puzzling, because aeron-proxy communicates with aeron-driver only through memory-mapped files, and restarting it should not cause any NAK-related problems unless there is a bug in aeron-driver itself. We only experience this issue on the production system, so we suspect it is caused by differences between the production and staging cluster network setups (but this is just a guess). Unfortunately, we do not have time to debug the situation when it happens, because we need to restore the service immediately.
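
The symptom pattern described in the postmortem (a burst of NAKs together with publication backpressure and fewer processed messages) could in principle be flagged automatically. The following is only a minimal Python sketch of that idea, not Alpaca's actual monitoring: the `Sample` fields, the `degraded_intervals` helper, and both thresholds are hypothetical, and it assumes the three cumulative counters are already being collected from somewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One poll of three cumulative counters (hypothetical names)."""
    naks: int           # NAKs seen by the transport driver so far
    back_pressure: int  # back-pressured publication attempts so far
    processed: int      # messages processed downstream so far


def degraded_intervals(samples: List[Sample],
                       nak_limit: int = 100,
                       min_processed: int = 1000) -> List[int]:
    """Return the indexes of polls whose interval matches the incident pattern.

    An interval is flagged when, compared with the previous poll, NAKs grew by
    more than nak_limit, back pressure grew at all, and fewer than
    min_processed messages were processed. Both thresholds are illustrative.
    """
    flagged = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        nak_delta = cur.naks - prev.naks
        bp_delta = cur.back_pressure - prev.back_pressure
        processed_delta = cur.processed - prev.processed
        if nak_delta > nak_limit and bp_delta > 0 and processed_delta < min_processed:
            flagged.append(i)
    return flagged


# Made-up numbers: the second interval shows a NAK burst with low throughput.
history = [
    Sample(naks=0, back_pressure=0, processed=0),
    Sample(naks=5, back_pressure=0, processed=5000),
    Sample(naks=900, back_pressure=40, processed=5300),
    Sample(naks=905, back_pressure=40, processed=10300),
]
print(degraded_intervals(history))
```

Requiring all three signals at once is a deliberate choice in this sketch: a few NAKs on a UDP transport are normal, so NAKs alone would be a noisy trigger.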