Nanonets incident

High Response Times for Instant Learning Models in the US Region

Major · Resolved

Nanonets experienced a major incident on December 3, 2025 affecting its API, lasting 6h 1m. The incident has been resolved; the full update timeline is below.

Started: Dec 03, 2025, 03:08 PM UTC
Resolved: Dec 03, 2025, 09:10 PM UTC
Duration: 6h 1m
Detected by Pingoru: Dec 03, 2025, 03:08 PM UTC

Affected components

API

Update timeline

  1. Investigating · Dec 03, 2025, 03:08 PM UTC

    We are currently investigating this issue.

  2. Identified · Dec 03, 2025, 04:35 PM UTC

    The issue has been identified and a fix is being implemented.

  3. Identified · Dec 03, 2025, 05:08 PM UTC

    We have identified the issue and increased our throughput to process requests faster. However, clearing the backlog may take some additional time. Our team is continuously monitoring the situation. We apologize for the inconvenience and appreciate your patience.

  4. Monitoring · Dec 03, 2025, 08:50 PM UTC

    Our sync prediction API for Instant Learning models is now operating normally. Async results for older files are available for most users; for a few users, a remaining backlog is still being cleared.

  5. Resolved · Dec 03, 2025, 09:10 PM UTC

    This incident has been resolved.

  6. Postmortem · Dec 04, 2025, 05:57 AM UTC

    **Overview**

    One of our secondary databases experienced connection issues following an unexpected spike in load. This sudden surge placed significant pressure on the database engine, causing request latency to increase substantially.

    **Root Cause**

    A rapid increase in incoming traffic caused a large number of concurrent connections to accumulate on a secondary database. This overwhelmed the database's connection-handling capacity, resulting in slow responses and delayed processing for dependent services.

    **Resolution**

    Our engineering team quickly identified the issue, took corrective actions to stabilize the database, and restored normal operation. Once the database recovered, the system began processing the accumulated backlog. This recovery phase took additional time due to the volume of pending requests.

    **Impact**

    * **Affected:** Only Instant Learning models in the **US region**
    * **Unaffected:** All other regions and services continued operating normally throughout the incident
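
The root cause above, unbounded concurrent connections accumulating on a secondary database during a traffic spike, is a failure mode commonly mitigated by capping connections at the application tier so that excess load queues or fails fast instead of reaching the database. Nanonets has not published how its services manage connections, so the Python sketch below is purely illustrative: `connect()`, `BoundedPool`, and the pool sizes are hypothetical, not Nanonets code.

```python
import queue

# Hypothetical stand-in for a real driver's connect(); Nanonets has not
# disclosed its database stack, so this is illustrative only.
def connect():
    return object()

class BoundedPool:
    """Caps concurrent database connections so a traffic spike queues
    (or fails fast) at the application tier instead of piling
    connections onto the database."""

    def __init__(self, max_conns: int, acquire_timeout: float):
        self._conns = queue.Queue(maxsize=max_conns)
        for _ in range(max_conns):
            self._conns.put(connect())
        self._timeout = acquire_timeout

    def acquire(self):
        # Block for at most acquire_timeout; failing fast here sheds
        # load upstream rather than letting connections accumulate.
        try:
            return self._conns.get(timeout=self._timeout)
        except queue.Empty:
            raise RuntimeError("pool exhausted; shed load upstream")

    def release(self, conn):
        self._conns.put(conn)


# Example: every request path goes through the same capped pool.
pool = BoundedPool(max_conns=50, acquire_timeout=2.0)  # illustrative numbers
conn = pool.acquire()
try:
    pass  # run the query here
finally:
    pool.release(conn)
```

A cap like this trades a little request latency under normal load for a hard ceiling on database pressure during spikes; production systems typically get the same effect from a driver-level pool or an external pooler such as PgBouncer.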