Incident Title: v22 OVH rollback, Redis load spike
Date & Time of Incident: 13 April 2026, 21:00 - 22:30 CET
Affected Services: Timemachine / Redis Server / All services (indirectly)
Severity Level: High
Note: all times in this document are CET.
The incident began on April 13, 2026, around 22:00, following the deployment of version v22 to
the OVH production environment. Monitoring immediately after deployment showed an
abnormally high load on our Redis server.
This spike impacted the entire platform, as the Redis server is a key component for software
availability and stability. A decision was made to roll back the entire v22 deployment, which was
completed around 22:30.
The v22 deployment contained a bugfix which required the timemachine service to clear the
Redis cache whenever a timetable was updated. To achieve this, the timemachine service
performs specific lookup & filtering operations on keys matching the timetable.
Clients periodically execute stock location imports, and each stock location import interacts with
the timemachine service, as a stock location contains two timetables: an opening hours timetable
and an options timetable. While a single import is harmless, the cumulative effect of many
imports affecting many timetables led to the Redis server becoming overloaded.
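The cumulative effect described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the report does not say which Redis commands the timemachine service uses, so `FakeCache`, the `timetable:<id>:<kind>:*` key naming, and the idea that each clear examines every key in the cache are all hypothetical stand-ins chosen for illustration.

```python
from fnmatch import fnmatchcase

# Two timetables per stock location, as stated in the report.
TIMETABLE_KINDS = ("opening_hours", "options")


class FakeCache:
    """Hypothetical in-memory stand-in for the Redis cache (not a Redis client)."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.keys_examined = 0  # rough proxy for the load caused by clears

    def clear_matching(self, pattern):
        # Assumption: a pattern-based clear has to look at every key
        # in order to filter out the ones that match the timetable.
        self.keys_examined += len(self.keys)
        matched = {k for k in self.keys if fnmatchcase(k, pattern)}
        self.keys -= matched
        return len(matched)


def build_cache(n_locations, n_unrelated_keys):
    """Create a cache holding unrelated keys plus one key per timetable."""
    keys = {f"other:{i}" for i in range(n_unrelated_keys)}
    for loc in range(n_locations):
        for kind in TIMETABLE_KINDS:
            keys.add(f"timetable:{loc}:{kind}:v1")  # hypothetical key scheme
    return FakeCache(keys)


def import_stock_location(cache, location_id):
    """One import updates both timetables, so it triggers two clears."""
    for kind in TIMETABLE_KINDS:
        cache.clear_matching(f"timetable:{location_id}:{kind}:*")


def total_keys_examined(n_locations, n_unrelated_keys):
    """Import every stock location once and return the total keys examined."""
    cache = build_cache(n_locations, n_unrelated_keys)
    for loc in range(n_locations):
        import_stock_location(cache, loc)
    return cache.keys_examined
```

In this model a single import examines roughly twice the cache size, which is negligible on its own, while the total grows with the number of imports multiplied by the size of the cache. That matches the pattern described in the report, where the load only becomes visible once many imports run.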
The OVH rollback allowed us to avoid any product incident related to this behavior and
removed the risk to client operations. The additional QA measures in our release process are a
proactive step to further protect our customers from potential post-release issues.
On April 14 at 11:10 CET, the Redis load spike was successfully reproduced in the qualif
environment by importing five batches of 50 stock locations simultaneously.
At 11:41 on April 14, the fix was tested. It consists of reverting the Redis clear mechanism
triggered by timetable updates. Subsequent imports of five batches of 50 stock locations were
executed without any load spike.
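The reproduction scenario and the effect of the fix can be sketched as a small concurrency model. The batch sizes come from the report; everything else is an assumption made for illustration: `ClearCounter`, `import_batch`, `run_imports`, and the `clear_on_update` flag are hypothetical names, and the model only counts cache clears instead of talking to a real Redis server.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

BATCHES = 5                    # five batches imported simultaneously (from the report)
LOCATIONS_PER_BATCH = 50       # 50 stock locations per batch (from the report)
TIMETABLES_PER_LOCATION = 2    # opening hours + options (from the report)


class ClearCounter:
    """Thread-safe counter of cache-clear requests (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def record(self):
        with self._lock:
            self.count += 1


def import_batch(counter, clear_on_update):
    """Import one batch; each timetable update may trigger one cache clear."""
    for _ in range(LOCATIONS_PER_BATCH):
        for _ in range(TIMETABLES_PER_LOCATION):
            if clear_on_update:
                counter.record()


def run_imports(clear_on_update):
    """Run all batches concurrently and return the number of clears issued."""
    counter = ClearCounter()
    with ThreadPoolExecutor(max_workers=BATCHES) as pool:
        futures = [
            pool.submit(import_batch, counter, clear_on_update)
            for _ in range(BATCHES)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return counter.count
```

With `clear_on_update=True` (the v22 behavior in this model), the scenario issues 5 × 50 × 2 = 500 clears in a short window. With `clear_on_update=False` (the reverted behavior), it issues none.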
A 24-hour observation window confirmed that the load spike did not recur during the
stock location imports configured in the qualif environment on April 15.
● Review & audit release & code freeze practices for qualif changes (pre-MEP cycle)
● Extend Go/NoGo performance validation to review -7D, -30D and cumulative performance
indicators
● Audit tool versions (Redis etc.) in the development environment against the production
environment.
● Enable continuous AI incident detection for qualif & continuous AI-driven performance
analysis in Datadog (under investigation & testing)
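As an illustration of the extended Go/NoGo performance validation listed above, the following is a minimal sketch of a check against the -7D, -30D and cumulative baselines. The function name `go_no_go`, the use of a simple mean, and the `tolerance` factor are assumptions made for this example and do not describe the actual release process or any Datadog feature.

```python
from statistics import mean


def go_no_go(daily_load, candidate_load, tolerance=1.5):
    """Return True (Go) if the candidate load stays within the baselines.

    daily_load: historical daily load values, oldest first (hypothetical input).
    candidate_load: load observed with the release candidate.
    tolerance: allowed multiple of the highest baseline (assumed value).
    """
    if not daily_load:
        raise ValueError("at least one day of history is required")

    baseline_7d = mean(daily_load[-7:])    # last 7 days (-7D)
    baseline_30d = mean(daily_load[-30:])  # last 30 days (-30D)
    cumulative = mean(daily_load)          # whole available history

    limit = tolerance * max(baseline_7d, baseline_30d, cumulative)
    return candidate_load <= limit
```

For example, with a flat history of 10.0 per day, a candidate load of 12.0 would pass and a candidate load of 40.0 would be flagged as NoGo under the assumed tolerance of 1.5.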