Phrase incident

Performance Disruption of the Phrase TMS (EU) Project Management Component, January 27, 2026, 10:00 AM to 5:38 PM UTC

Major · Resolved

Phrase experienced a major incident on January 27, 2026 affecting Project management, lasting 7h 16m. The incident has been resolved; the full update timeline is below.

Started
Jan 27, 2026, 12:54 PM UTC
Resolved
Jan 27, 2026, 08:11 PM UTC
Duration
7h 16m
Detected by Pingoru
Jan 27, 2026, 12:54 PM UTC

Affected components

Project management

Update timeline

  1. identified Jan 27, 2026, 12:54 PM UTC

    Our engineering team has identified the root cause as an LQA-related issue and is actively working to resolve it. The project creation and configuration pages may be unavailable during this time.

  2. identified Jan 27, 2026, 01:57 PM UTC

    A fix is still under active development by our engineering team.

  3. identified Jan 27, 2026, 02:34 PM UTC

    A fix is still under active development by our engineering team. The Project Management component is now more stable.

  4. monitoring Jan 27, 2026, 05:57 PM UTC

    A fix has been implemented. Project management is stable and we are monitoring performance.

  5. resolved Jan 27, 2026, 08:11 PM UTC

    The issue has been fixed. Project management is stable.

  6. postmortem Feb 23, 2026, 07:09 AM UTC

## Introduction

We would like to share more details about the events that occurred on January 27, 2026, between approximately 10:00 AM and 5:30 PM UTC (6:30 PM CET), which led to a performance disruption of the Phrase TMS (EU) Project Management component. During this time, some users experienced slow or unresponsive behavior when creating or editing projects. The issue was caused by limitations in the underlying database connections used by the LQA service, which is internally used by the Project Management component.

## Timeline (UTC)

**Jan 27, 2026 @ ~10:00 AM** First customer reports indicate slow or unresponsive project creation and editing in the EU region.

**10:54 AM** Automated alert triggered due to a high number of active backend sessions in the EU production environment.

**11:00–11:30 AM** Engineering investigation begins. Logs indicate that the LQA service is unable to obtain database connections, with repeated timeouts when requesting connections from the application's connection pool.

**11:40 AM** Application logs show "Too many connections" errors from the underlying database. Some service instances restart as a result.

**12:00–3:00 PM** Initial mitigation steps taken:

* Increase of database connection limits.
* Increase of application-side connection pool limits.
* Additional monitoring and logging enabled.

The issue improves, but intermittent errors persist.

**3:30 PM** Decision made to scale the underlying production database instance to a larger size and adjust connection limits accordingly.

**~5:30 PM** Database scaling completed. Error rates drop and no further "Too many connections" errors are observed. System performance stabilizes. Incident marked as resolved after continued monitoring confirmed stable behavior.

## Root Cause

The disruption was caused by exhaustion of available database connections in the LQA service's underlying production database. The LQA service uses a connection pool to communicate with the database.

Under increased load, the configured connection limits on both the application side and the database side were insufficient. As more requests were processed, all available database connections were consumed. Once the limit was reached:

* New requests could not obtain a database connection.
* Requests timed out after waiting for a free connection.
* Some service instances restarted due to repeated failures.
* Project creation and editing operations became slow or temporarily unresponsive.

Although the database itself was operational, the maximum number of allowed concurrent connections was too low for the actual usage patterns in production. Additionally, the size of the database instance limited how many connections could be supported safely. This combination led to a bottleneck in the LQA service, which in turn affected the Project Management component in the EU region.

## Actions to Prevent Recurrence

To reduce the likelihood of similar incidents in the future, we are implementing the following measures:

1. **Database Capacity Increase**

   The production database instance for the affected service has been scaled to a larger size to support higher load and more concurrent connections.

2. **Improved Monitoring and Visibility**

   Additional database performance monitoring has been enabled to provide better insight into:

   * Active connections
   * Slow queries
   * Resource utilization

3. **Resilience Improvements in Project Management**

   We have initiated follow-up work to improve system resilience so that if the LQA service becomes slow or temporarily unavailable, it does not fully block project creation or editing operations.
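The failure mode described in the root cause, every pooled connection checked out and new requests timing out while waiting for one, can be sketched with a toy bounded pool. This is purely illustrative; it is not Phrase's or the LQA service's actual code, and the class and connection names are invented for the example:

```python
import queue

class ConnectionPool:
    """Minimal model of a bounded database connection pool. Purely
    illustrative; not Phrase's or the LQA service's actual implementation."""

    def __init__(self, max_connections, acquire_timeout):
        self._pool = queue.Queue(maxsize=max_connections)
        for i in range(max_connections):
            self._pool.put(f"conn-{i}")  # stand-ins for real DB connections
        self._timeout = acquire_timeout

    def acquire(self):
        try:
            return self._pool.get(timeout=self._timeout)
        except queue.Empty:
            # Mirrors the incident symptom: callers time out while
            # waiting for a free connection.
            raise TimeoutError("timed out waiting for a free connection")

    def release(self, conn):
        self._pool.put(conn)

# With every connection checked out, the next request cannot proceed.
pool = ConnectionPool(max_connections=2, acquire_timeout=0.1)
held = [pool.acquire(), pool.acquire()]
try:
    pool.acquire()
except TimeoutError as exc:
    print(exc)  # timed out waiting for a free connection
for conn in held:
    pool.release(conn)
```

Once the pool is exhausted, each further request burns its full wait timeout before failing, which is why the symptom surfaced as slowness and unresponsiveness rather than immediate errors.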
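The root cause also notes that the application-side and database-side limits were jointly insufficient: per-instance pool sizes must be chosen so the whole fleet stays under the database's connection ceiling. A hedged rule-of-thumb sketch of that arithmetic (the function and its 20% headroom default are assumptions for illustration, not Phrase's actual sizing policy):

```python
def max_pool_size_per_instance(db_max_connections, app_instances, headroom=0.2):
    """Largest per-instance connection pool that keeps the whole fleet
    under the database's connection limit, reserving headroom (default
    20%) for admin sessions, monitoring, and restart overlap.
    Illustrative rule of thumb only."""
    usable = int(db_max_connections * (1 - headroom))
    return usable // app_instances

# e.g. a database allowing 500 connections shared by 10 app instances:
print(max_pool_size_per_instance(500, 10))  # 40
```

Scaling the database instance, as was done here, raises `db_max_connections` and therefore the safe per-instance pool size; raising only the application-side pools without that step would reproduce the "Too many connections" errors.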
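The resilience follow-up in item 3 is commonly implemented as a bounded timeout around the slow dependency with a graceful fallback, so a degraded LQA service no longer blocks project creation outright. A hypothetical sketch under that assumption (all names are invented; Phrase's internal interfaces are not public):

```python
import concurrent.futures
import time

def fetch_lqa_settings_slow():
    """Simulates an LQA service that has become unresponsive."""
    time.sleep(1.0)
    return {"lqa_profile": "default"}

def create_project(lqa_fetcher, lqa_timeout=0.2):
    """Create a project without letting a slow LQA dependency block it:
    if LQA settings do not arrive within lqa_timeout seconds, proceed
    without them instead of hanging the request."""
    project = {"name": "demo-project", "lqa": None}
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(lqa_fetcher)
    try:
        project["lqa"] = future.result(timeout=lqa_timeout)
    except concurrent.futures.TimeoutError:
        project["lqa"] = None  # degrade gracefully; LQA can be attached later
    finally:
        executor.shutdown(wait=False)
    return project
```

With this shape, an LQA outage degrades project creation (projects are created without LQA settings) rather than making the Project Management component unresponsive, which matches the stated goal of the follow-up work.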