Alpaca experienced a major incident on December 15, 2025, affecting JNLC and lasting 2h 49m. The incident has been resolved; the full update timeline is below.
Affected components
- JNLC
Update timeline
- investigating Dec 15, 2025, 01:08 PM UTC
We are currently investigating an issue with JNLC, as journal processing appears to be stuck. Our engineering team is looking into it, and we will provide an update shortly.
- identified Dec 15, 2025, 01:23 PM UTC
Our engineering team has identified the issue and is implementing mitigations to reduce the impact. We are also monitoring transaction throughput and processing times closely to ensure all pending items continue to move through the system.
- monitoring Dec 15, 2025, 01:30 PM UTC
Journal processing is back to normal. We’re actively monitoring to ensure stability.
- resolved Dec 15, 2025, 03:58 PM UTC
As we have not observed any issues since the last update, we are marking this incident as resolved.
- postmortem Dec 16, 2025, 04:03 PM UTC
## **Transaction Processing Delays and Restoration (December 15, 2025)**

We sincerely apologize for the service degradation experienced on December 15, 2025, which primarily affected the timely processing of transactions and related account updates for our B2B partners. We understand the impact this can have on your operations and appreciate your patience as our teams worked to restore full stability.

### **What Happened**

The incident began when our core ledger system experienced database connection processing issues. The root cause was a code change deployed on Wednesday, December 10: an internal system update intended for performance monitoring. This update required additional database connections, and under high system load it consumed all available transactional database connections, creating a severe bottleneck. Although the release went out on Wednesday, the issue did not surface on Wednesday, Thursday, or Friday; it first manifested as a rapid slowdown in our internal job queue, preventing new account and transaction events (journals and memo posts) from being processed promptly. We deployed an initial quick fix to address the connection issue; however, it did not fully resolve the problem. As a result, we ultimately rolled back to the prior stable version of the core service.

### **Impact**

The degradation resulted in short delays and a build-up of transaction volume within our systems.

* **Journal Processing:** A queue of approximately 2,500 journal entries awaited processing, leading to short delays in real-time account updates. This queue was successfully cleared during the initial mitigation.
* **Memo Post Processing:** A backlog of nearly 9,000 memo posts became stuck in a pending state, causing delays in related services such as instant funding.
* **Data Integrity:** Crucially, we can confirm that no client data was lost, and all funds and transactions remain secure. The system was successfully rolled back, and all backed-up transactions were eventually processed correctly.

### **Resolution**

The system is now fully stable and operational, and all pending transaction volume has been processed.

1. **Immediate Hotfix:** Our team initially deployed a fix to disable the problematic metric collection feature, which cleared the initial journal processing queue.
2. **Emergency Rollback:** Due to a subsequent critical error, we executed an immediate emergency rollback of the core ledger service to a stable version, which successfully restored core system stability.
3. **Backlog Clearance:** Following the rollback, all pending memo posts and other transactions were processed in batches to ensure full and accurate booking.

Normal transaction processing and system throughput have been fully restored.

### **Preventative Measures**

We are committed to preventing recurrence and enhancing the resilience of our platform. To address the core issue of resource exhaustion, our immediate focus is on strengthening our testing and validation process. We will implement rigorous, high-load **stress testing** for all major service deployments, specifically targeting database connection pooling and resource consumption. This will help ensure system stability under maximum transactional volume and prevent future capacity issues.
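To make the failure mode concrete, the sketch below (Python with SQLAlchemy; the DSN, pool sizes, queries, and function names are illustrative assumptions, not our actual code) models how a monitoring feature that checks out its own database connection doubles per-request connection demand and can exhaust a bounded pool under peak load. This is exactly the class of regression the stress testing described above is intended to catch.

```python
# Hypothetical illustration only: not Alpaca's actual code or configuration.
from sqlalchemy import create_engine, text

# A bounded transactional pool: 10 steady connections plus 5 of burst
# headroom, so the hard ceiling is 15 concurrent checkouts.
engine = create_engine(
    "postgresql://ledger:***@db.internal/ledger",  # placeholder DSN
    pool_size=10,
    max_overflow=5,
    pool_timeout=5,  # seconds to wait for a free connection before failing
)

def record_pool_metrics() -> None:
    # The monitoring change behaved like this: sampling metrics on its
    # own connection, in addition to the one serving the transaction.
    with engine.connect() as conn:
        conn.execute(text("SELECT count(*) FROM pg_stat_activity"))

def process_journal(journal_id: str) -> None:
    with engine.begin() as conn:  # 1st checkout: the transactional work
        record_pool_metrics()     # 2nd checkout: the new monitoring
        conn.execute(
            text("UPDATE journals SET status = 'posted' WHERE id = :id"),
            {"id": journal_id},
        )

# Each in-flight journal briefly holds two connections instead of one, so
# the pool saturates at roughly half the previous peak concurrency. Beyond
# that point, checkouts queue for pool_timeout seconds and then fail with
# "QueuePool limit of size 10 overflow 5 reached" errors: the bottleneck
# described above. A load test driving realistic peak concurrency would
# surface this before production traffic does.
```

In this model, disabling the extra sampling or rolling the change back restores one checkout per request, which mirrors the hotfix-then-rollback sequence described in the Resolution section.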
We are proactively enhancing our database observability by implementing granular telemetry and refined alerting thresholds. This framework will provide deeper visibility into connection health, allowing our engineering teams to identify and mitigate potential saturation risks before they impact service availability. We are committed to maintaining a highly reliable and transparent service for our partners and will provide an update once these key follow-up items are completed.
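To illustrate the kind of granular pool telemetry and alerting thresholds described above, here is a minimal sketch using prometheus_client and SQLAlchemy's public pool accessors. The metric names, the scrape port, and the 80% threshold are assumptions for illustration, not our production configuration.

```python
# Hypothetical illustration only: metric names and thresholds are examples.
import time

from prometheus_client import Gauge, start_http_server
from sqlalchemy import create_engine

POOL_IN_USE = Gauge(
    "ledger_db_pool_in_use", "Connections currently checked out of the pool"
)
POOL_SATURATION = Gauge(
    "ledger_db_pool_saturation", "Checked-out connections / hard pool capacity"
)

def watch_pool(engine, capacity: int, interval_s: float = 5.0) -> None:
    """Sample pool health on a fixed interval and expose it for scraping."""
    while True:
        in_use = engine.pool.checkedout()  # public QueuePool accessor
        POOL_IN_USE.set(in_use)
        POOL_SATURATION.set(in_use / capacity)
        time.sleep(interval_s)

if __name__ == "__main__":
    engine = create_engine(
        "postgresql://ledger:***@db.internal/ledger",  # placeholder DSN
        pool_size=10,
        max_overflow=5,
    )
    start_http_server(9100)  # Prometheus scrape endpoint on :9100
    # An alerting rule can then page well before checkouts time out, e.g.:
    #   ledger_db_pool_saturation > 0.8 for 5 minutes
    watch_pool(engine, capacity=10 + 5)
```

Alerting on saturation (a ratio) rather than on raw connection counts keeps the threshold meaningful even if pool sizes change later.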