Northpass incident
End users experiencing issues when interacting with courses
Northpass experienced an incident on September 11, 2025 affecting Northpass App - AWS, lasting 1h 41m. The incident has been resolved; the full update timeline is below.
Affected components
- Northpass App - AWS
Update timeline
- investigating Sep 11, 2025, 08:24 PM UTC
End users are experiencing issues when interacting with courses
- identified Sep 11, 2025, 08:42 PM UTC
We've identified the issue and are working towards a resolution.
- identified Sep 11, 2025, 09:43 PM UTC
We are continuing to work on resolving this issue.
- monitoring Sep 11, 2025, 10:14 PM UTC
A fix has been implemented and we are monitoring the results.
- resolved Sep 11, 2025, 10:23 PM UTC
This incident has now been resolved; we apologize for the service interruption. In keeping with our commitment to transparency, we will provide a postmortem of this issue within 48 hours.
- postmortem Sep 12, 2025, 02:15 PM UTC
On September 11, 2025, our platform experienced intermittent disruptions impacting our AWS-hosted customers. These disruptions affected our end users' ability to interact with courses. The issue was traced to an automated deployment process that unintentionally updated certain backend services to incompatible versions. This created temporary mismatches between services and led to periodic failures. Azure-hosted customers were not affected.

### Impact

* Duration: 2 hours (16:24 - 18:23 EDT).
* Pattern: Cyclic issues every few minutes with brief automatic recovery periods.
* Symptoms: Users experienced difficulties accessing courses, completing activities, and performing general course interactions.
* Affected users: All AWS-hosted customers during problematic service version combinations.

### Root Cause

Our deployment automation tool was querying the Docker registry for the latest images, but received inconsistent results due to the large number of images in our registry (a suspected registry API limitation). This caused the automation to cycle through different microservice image combinations approximately every 10 minutes, creating incompatible service versions that disrupted the interdependent functionality required for course interactions.

### Resolution

Our team cleaned up the Docker registry by removing old images, significantly reducing the total number of images. This stabilized our deployment automation process and eliminated the cycling behavior.

### Next Steps to Prevent Recurrence

Implement an automated Docker image cleanup policy to maintain registry hygiene.
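The planned cleanup policy can be sketched as a simple retention rule. The following is a minimal illustration only, assuming a hypothetical list of `(tag, push_timestamp)` pairs already fetched from the registry; the function name and the thresholds are our own for this sketch, not Northpass's actual tooling:

```python
from datetime import datetime, timedelta, timezone

def select_images_to_expire(images, keep_newest=10, max_age_days=30):
    """Given (tag, pushed_at) pairs, return the tags eligible for deletion:
    everything beyond the newest `keep_newest` images, plus anything
    pushed more than `max_age_days` ago.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    # Rank images newest-first by push timestamp.
    ranked = sorted(images, key=lambda item: item[1], reverse=True)
    expired = []
    for rank, (tag, pushed_at) in enumerate(ranked):
        if rank >= keep_newest or pushed_at < cutoff:
            expired.append(tag)
    return expired
```

For AWS-hosted registries, Amazon ECR offers this kind of rule natively through lifecycle policies (e.g. `imageCountMoreThan` and `sinceImagePushed` rule types), which would avoid running custom cleanup code at all.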