Alpaca incident
Degraded Performance – Broker API & Live Trading
Alpaca experienced a minor incident on December 15, 2025, affecting the Account API, broker.accounts.get, and 1 more component, lasting 6h 4m. The incident has been resolved; the full update timeline is below.
Affected components: Account API, broker.accounts.get, and 1 more component
Update timeline
- investigating Dec 15, 2025, 07:01 PM UTC
We observed intermittent failures on the Broker API and Live Trading API due to database connection saturation. Services have recovered, but the underlying issue is still being mitigated. Engineering is tuning connection usage and limits to prevent recurrence. Continuous monitoring is in place.
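For illustration only, the snippet below sketches the kind of per-service connection-pool limits being tuned. It uses SQLAlchemy; the DSN, table, and pool numbers are assumptions for the example, not our production configuration.

```python
# Minimal sketch: capping the database connections one service may hold,
# using SQLAlchemy. The DSN and limits below are illustrative assumptions.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://svc_user:***@brokerage-db/prod",  # hypothetical DSN
    pool_size=10,      # steady-state connections this service may hold open
    max_overflow=5,    # short bursts above pool_size, released when idle
    pool_timeout=3,    # fail fast under saturation instead of queueing forever
    pool_recycle=1800, # recycle connections so idle ones do not pile up
)

def fetch_account(account_id: str) -> dict:
    """Read one account row using a pooled connection (returned on exit)."""
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT id, status FROM accounts WHERE id = :id"),
            {"id": account_id},
        ).mappings().first()
        return dict(row) if row else {}
```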
- resolved Dec 16, 2025, 01:05 AM UTC
At 4:01 PM EST, the team deployed a fix resolving the incident.
- postmortem Dec 16, 2025, 04:49 PM UTC
## **Intermittent Broker & Trading API Failures (December 15, 2025)**

We sincerely apologize for the service degradation experienced by our partners today, December 15, 2025. This incident primarily affected the Broker API and Live Trading API, causing intermittent transaction rejections and data processing delays.

### **What Happened**

Our platform experienced a temporary resource exhaustion event within our core data processing system. Specifically, we observed a rapid and unexpected spike in the number of concurrent connections to our primary brokerage database. This increase was caused by a combination of factors:

1. System Scaling Changes: Recent scaling updates to our Order Management System (OMS) unintentionally increased the potential demand for database connections.
2. Account Data Warm-up Issue: A critical internal job responsible for proactively loading (or "warming up") partner account data failed early in the day due to a minor bug related to a system test account.
3. High Traffic Volume: When normal market traffic began, the OMS was forced to load account data on demand for a large volume of requests, with each request opening a new database connection.

This high-volume, on-demand loading quickly exceeded the database's maximum connection capacity, saturating the system. The resulting resource contention caused a cascading failure: key services, such as those responsible for processing new orders and managing account state, could not access the database, leading to service slowdowns and intermittent failures.
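To make the failure mode concrete, the sketch below contrasts the proactive warm-up path with per-request on-demand loading against a bounded connection pool. Everything in it (the pool size, the cache, and names such as `warm_up_accounts` and `handle_order`) is a hypothetical illustration of the pattern described above, not our actual OMS code.

```python
# Hypothetical illustration of the failure mode: when the warm-up job does not
# run, every request loads account data on demand, and each load takes a
# connection from a fixed-size pool, which saturates under market-open traffic.
import threading

MAX_DB_CONNECTIONS = 100                      # illustrative pool limit
db_pool = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)
account_cache: dict[str, dict] = {}           # populated by the warm-up job

def load_account_from_db(account_id: str) -> dict:
    """Each call holds one pooled connection for the duration of the query."""
    if not db_pool.acquire(timeout=3):        # saturated pool -> timeouts/rejections
        raise TimeoutError("database connection pool exhausted")
    try:
        return {"id": account_id, "status": "ACTIVE"}   # stand-in for a real query
    finally:
        db_pool.release()

def warm_up_accounts(account_ids: list[str]) -> None:
    """Proactive job: load accounts once, before market traffic starts."""
    for account_id in account_ids:
        account_cache[account_id] = load_account_from_db(account_id)

def handle_order(account_id: str) -> dict:
    """Request path: a cache hit avoids taking a database connection at all."""
    account = account_cache.get(account_id)
    if account is None:                       # warm-up missed -> on-demand load
        account = load_account_from_db(account_id)
        account_cache[account_id] = account
    return account
```

When the warm-up job runs, `handle_order` is served from the cache and never touches the pool; when it fails, every order takes a connection and the pool exhausts under peak traffic, which matches the saturation described above.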
### **Impact**

The primary impact was experienced during periods of high trading volume on December 15.

* Trading: Partners experienced intermittent rejection of new order submissions and delays in receiving order status updates across several brokerage endpoints.
* Account Operations: There were delays and failures in certain account-related operations, such as creating new accounts or processing large batches of internal jobs (e.g., dividend payments).
* Data Integrity: Crucially, we can confirm that no client funds or trading data were lost or compromised. The system's design ensures that all transactions are safely logged and persisted. The issue was a temporary access and processing delay, not a data loss event.

The issue affected several of our B2B partners, including those relying on real-time order submission and account updates.

### **Resolution**

Our engineering teams were immediately engaged and implemented several mitigating actions to restore full service:

1. Immediate Connection Rebalancing: We temporarily scaled down non-critical services in our test and worker environments to free up database connections for the core trading platform.
2. Job Fix Deployment: We deployed an emergency fix for the bug that prevented the critical account data warm-up job from running, ensuring this process now operates correctly.
3. Connection Limit Tuning: We immediately began reducing the maximum allowed connections for specific, known-idle microservices that were holding an unnecessarily high number of open connections.

As of **4:01 PM EST**, the platform has fully recovered, and all core services are operating within normal performance parameters.

### **Preventative Measures**

We are implementing a robust set of follow-up actions to prevent any recurrence and strengthen the overall resilience of our platform. These efforts are organized into three key themes:

* Database Connection Optimization: We will establish and enforce firm-wide connection budgets for every application that accesses the core brokerage database. This will ensure predictable resource usage and prevent any single service from overwhelming the system (a minimal, hypothetical sketch of such a budget check is included at the end of this postmortem).
* System Resilience & Monitoring: We are enhancing our monitoring and alerting to proactively detect unexpected spikes in database connection usage and account loading activity. This includes a deeper investigation into the specific mechanisms that cause a high volume of connections for processes like “trading controls”, so that we can implement the necessary caching layers.
* Process Optimization: We are reviewing and adjusting the operational schedule for large-scale internal processes (such as cash dividend processing) to ensure they do not conflict with or impact peak trading hours for our partners.

We appreciate your patience and trust as we continue to invest in our platform's reliability. Our commitment to providing a stable, high-performance financial services platform remains our top priority.
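As an illustration of the connection-budget monitoring described above, the sketch below polls PostgreSQL's pg_stat_activity view and flags services that exceed an assumed per-service budget. The budgets, DSN, service names, and alerting hook are hypothetical; this is not our production tooling.

```python
# Hypothetical sketch of a connection-budget check against PostgreSQL's
# pg_stat_activity view. Budgets, DSN, and service names are illustrative.
import psycopg2

CONNECTION_BUDGETS = {          # assumed per-service connection budgets
    "order-management": 50,
    "account-service": 30,
    "trading-controls": 20,
}

def check_connection_budgets(dsn: str) -> list[str]:
    """Return alert messages for services holding more connections than budgeted."""
    alerts = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT application_name, count(*) "
            "FROM pg_stat_activity "
            "GROUP BY application_name"
        )
        for app_name, used in cur.fetchall():
            budget = CONNECTION_BUDGETS.get(app_name)
            if budget is not None and used > budget:
                alerts.append(f"{app_name} holds {used} connections (budget {budget})")
    return alerts

if __name__ == "__main__":
    for alert in check_connection_budgets("dbname=brokerage user=monitor"):
        print("ALERT:", alert)   # in practice this would page on-call, not print
```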