22.11.2024 incident post-mortem:
Issue: Network issues on the server provider's side caused a crucial component in the message processing to become unresponsive. This led to most SMS messages not being forwarded to the next stage to be sent out.
Background: A contributing factor to this incident was a recent migration carried out ahead of the server provider's upcoming maintenance date: connections that had previously been evenly distributed were moved to a single server because the other one was scheduled for maintenance. While this was not necessarily the cause of the issue, it increased the size of its impact.
Timeline:
00.00 UTC: Network latency issues caused a crucial component in the message processing to become unresponsive
05.25 UTC: The Messente team noticed the issue and started investigating
06.00 UTC: The team restarted the unresponsive component
06.09 UTC: New incoming messages were being processed without further delays, and ‘stuck’ messages started to be sent
06.17 UTC: The queue was clear and all pending messages had been processed
Conclusions:
While the fix was found and implemented quickly, the time from the start of the incident to its detection was prolonged because it occurred during night hours. This has further highlighted the need for improved alerting and internal escalation. We are now exploring our options to ensure that critical incidents like this are escalated and fixed faster. We have also created a ticket to add health checks and improved monitoring to our components. We are also looking into how to increase redundancy in certain components to reduce the chance of a similar incident in the future.
Thank you for your patience while we have been looking into this, and our apologies for any inconvenience!