Alpaca incident
Elevated Error Rates Across Multiple APIs - 500 Errors
Alpaca experienced a minor incident on January 12, 2026 affecting broker.accounts.get, JNLC, and one other component, lasting 3h 51m. The incident has been resolved; the full update timeline is below.
Update timeline
- investigating Jan 12, 2026, 02:58 PM UTC
We are currently investigating elevated error rates across multiple APIs. Some requests may fail with HTTP 500 responses. Our engineering team is actively working to identify and resolve the root cause.
- identified Jan 12, 2026, 03:08 PM UTC
Our team is actively addressing the issue; clients may still experience intermittent timeouts on order placement while we work on a fix.
- monitoring Jan 12, 2026, 03:14 PM UTC
The team has identified and addressed the underlying issue, and we are seeing our APIs recover. The team is continuing to monitor performance. We will continue to provide updates.
- monitoring Jan 12, 2026, 03:20 PM UTC
We are no longer observing 500 errors across our API endpoints. All services are now accessible, and we are continuing to monitor for any further issues.
- monitoring Jan 12, 2026, 03:33 PM UTC
We have identified another increase in 5xx errors. We are actively addressing it and will provide more updates. Clients may still experience intermittent timeouts on order placement while we work on a fix.
- monitoring Jan 12, 2026, 03:50 PM UTC
We are continuing to address the issue. Clients may still experience intermittent 5xx errors.
- monitoring Jan 12, 2026, 04:02 PM UTC
We are continuing to address the issue. Clients may still experience intermittent 5xx errors.
- monitoring Jan 12, 2026, 04:22 PM UTC
The system has largely stabilized and the majority of requests are completing successfully. We are still observing intermittent timeouts (~3-5% of requests) affecting position updates. Our engineering team is actively investigating the root cause related to high-volume data requests and working toward a full resolution. We will continue to provide updates as we make progress.
- monitoring Jan 12, 2026, 04:45 PM UTC
We continue to see intermittent timeouts on an order-handling service, occasionally impacting orders and account/position lookups, though most requests are completing successfully. Our engineers are actively isolating affected pods, capturing diagnostic data, and restarting them to restore stability. We are also investigating database behavior and traffic patterns to identify the underlying root cause.
- monitoring Jan 12, 2026, 04:55 PM UTC
Our investigation is progressing. We have identified several contributing factors and are actively analyzing traffic patterns, system behavior, and recent changes to determine the root cause. The team is working diligently toward a full resolution and will continue to provide updates.
- monitoring Jan 12, 2026, 05:08 PM UTC
Our investigation continues to make progress. We have analyzed system behavior during the affected periods and have narrowed our focus to connection management, which we believe may be contributing to the intermittent issues. We are encouraged that most impacted requests are ultimately completing successfully. The team remains fully engaged and is working toward a resolution. We appreciate your patience and will keep you informed as we learn more.
- monitoring Jan 12, 2026, 05:20 PM UTC
All API endpoints are operating normally. We are continuing to investigate the root cause and will share findings once available. Thank you for your patience.
- monitoring Jan 12, 2026, 05:35 PM UTC
We are still monitoring the system.
- monitoring Jan 12, 2026, 05:48 PM UTC
The system has stabilized and no issues are currently being observed. Our team continues to work on identifying the root cause and will provide an update once we have more information.
- monitoring Jan 12, 2026, 06:05 PM UTC
We have not observed any abnormalities since the system stabilized. We will close this incident after a further 45 minutes without issues. Root cause analysis will continue, and findings will be shared separately. Thank you for your patience.
- resolved Jan 12, 2026, 06:50 PM UTC
Our team has mitigated the incident. All systems are operational and customer impact has ceased. We are continuing our root cause analysis to prevent recurrence.
- postmortem Jan 12, 2026, 10:12 PM UTC
# January 13th update

**Follow-up: Resolution of Monday Market Open Incident**

As a follow-up to our communication regarding Monday's service instability, we are providing a summary of our findings and the corrective actions taken.

**Root Cause Analysis**

Over the weekend, a planned change was implemented which included the rollout of Istio into our production network. Following this deployment, we observed intermittent connectivity issues that resulted in the instability seen on Monday. Our investigation confirmed that the Istio layer was not stable in establishing connections between services and components over the network. This issue was exacerbated during the market open, where the high traffic footprint led to significant latency and slowness. While this configuration had been present in our staging environments for some time, the issue only manifested in production due to the unique load requirements of the live market open.

**System Impact**

We specifically investigated why a connection and memory issue within specific pods impacted critical trading functions. The analysis showed that the connection instability caused an excessive backlog of concurrent queries. This led to a significant memory spike that exceeded typical thresholds, creating a cascading effect on service responsiveness during peak traffic.

**Remediation**

To address this, we have removed the Istio plane across all critical services. All impacted services were restarted following this change to ensure a clean state.

**Current Status**

Since the removal of the Istio layer, system performance has returned to its baseline and connections remain stable. We are continuing to monitor the environment closely and are utilizing enhanced load testing to ensure our infrastructure remains resilient during peak traffic.

---

### **What Happened**

Shortly after market open on January 12, 2026 (approx. 9:36 AM EST), our monitoring systems detected a significant degradation in performance across our core APIs. This resulted in elevated error rates and latency for incoming API requests. Our engineering team identified that the issue was caused by resource contention within our system. A combination of high market-open traffic and underlying system abnormalities triggered a significant memory spike in a single pod. This exhaustion of resources caused internal services to become unresponsive, resulting in connection timeouts and creating a bottleneck between our API gateway and the database services.

### **Impact**

We understand that reliability is paramount for your operations. Below is a summary of the impact observed during the incident window (9:36 AM – 11:20 AM EST):

* **API Availability:** Partners experienced intermittent `500` and `504` error responses on Order, Account, and Position endpoints.
* **Order Processing:** A subset of orders experienced processing delays. In some cases, orders that timed out on the API response were successfully processed in the background.
* **Data Latency:** There were short delays in position updates and trade confirmation events (SSE) for executed orders.
* **Critical Data Integrity:** **No data was lost during this incident.** All funds and positions remain safe and secure. All transactions that appeared to time out but were executed have been reconciled.

### **Resolution**

Our team executed a series of mitigation strategies to restore stability. Immediate action was taken to isolate and restart the affected service instances to clear the connection backlog. We deployed a hotfix to eliminate redundant metrics processing, which reduces unnecessary overhead and helps lower overall memory consumption. The system was fully stabilized by 11:20 AM EST. We have confirmed that error rates have returned to nominal levels and all backlog queues have been processed. We will continue to investigate any abnormalities.
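The note above that some timed-out orders were nonetheless processed in the background describes the classic duplicate-submission hazard. As an illustrative sketch only (not part of the official communication): a client can protect itself during 5xx windows by attaching one client-generated idempotency key, such as the `client_order_id` field Alpaca's order endpoints accept, and reusing that same key across retries. The `send` transport and flaky server below are simulated stand-ins, not real API calls.

```python
import time
import uuid

RETRYABLE = {500, 502, 503, 504}

def submit_order_with_retry(send, order, max_attempts=4, base_delay=0.5):
    """Submit an order idempotently: one client-generated ID is reused
    across all retries, so a request that timed out but still executed
    server-side cannot become a duplicate order.
    `send(order)` is any transport returning (status_code, body)."""
    # One idempotency key for the whole attempt series, not per request.
    order = {**order, "client_order_id": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        status, body = send(order)
        if status not in RETRYABLE:
            return status, body
        # Exponential backoff before retrying a transient 5xx.
        time.sleep(base_delay * (2 ** attempt))
    return status, body

# Simulated flaky server: rejects twice with 503, then accepts, and
# deduplicates on client_order_id so retries are safe.
def make_flaky_server():
    seen = {}
    calls = {"n": 0}
    def send(order):
        calls["n"] += 1
        if calls["n"] <= 2:
            return 503, None
        oid = order["client_order_id"]
        seen.setdefault(oid, {"id": oid, "status": "accepted"})
        return 200, seen[oid]
    return send, calls

send, calls = make_flaky_server()
status, body = submit_order_with_retry(
    send, {"symbol": "AAPL", "qty": 1}, base_delay=0.01)
print(status, calls["n"])  # 200 3
```

The key design point is that the idempotency key is generated once, outside the retry loop; generating a fresh key per attempt would defeat server-side deduplication entirely.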
### **Preventative Measures**

We are committed to learning from every incident to strengthen our platform. We are prioritizing the following actions:

* **Hands-on Monitoring:** Our team will monitor market opens throughout this week to ensure quick intervention.
* **System Capacity Review:** We are auditing our resource allocation thresholds to ensure our services can handle "perfect storm" scenarios where high volume coincides with complex queries.
* **Deployment Process Optimization:** We are revising our release procedures to ensure that non-critical background processes (such as metrics collection) cannot impact core transaction performance during peak market hours.
* **Enhanced Monitoring:** We are implementing stricter alerts on database connection locking to detect and auto-remediate similar contention issues faster in the future.
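Stricter alerting on elevated error rates, as described above, can take many forms; the window size and threshold below are illustrative assumptions, not Alpaca's actual configuration. A minimal sliding-window 5xx error-rate check might look like:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the share of 5xx responses among the last `window`
    requests exceeds `threshold`. Values here are illustrative only."""
    def __init__(self, window=1000, threshold=0.05):
        self.window = window
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # 1 = server error, 0 = ok

    def record(self, status_code):
        self.samples.append(1 if status_code >= 500 else 0)

    def error_rate(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def should_alert(self):
        # Require a full window so a few early failures cannot fire alone.
        return (len(self.samples) == self.window
                and self.error_rate() > self.threshold)

alert = ErrorRateAlert(window=100, threshold=0.05)
for code in [200] * 90 + [503] * 10:   # 10% errors across a full window
    alert.record(code)
print(alert.should_alert(), alert.error_rate())  # True 0.1
```

A request-count window like this reacts at the same speed regardless of traffic volume; a time-based window is the usual alternative when traffic is spiky, such as at market open.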