API Outage April 29, 2024
Summary
This post-mortem describes the incident investigated on 04/29/24: https://status.livepeer.studio/incidents/2mklfnf2hqbf
Incident
Description
Internal alerts notified the Livepeer Studio team of high memory and CPU utilization in the queuing system. The queuing system required an update, which had previously been tested successfully in the staging environment. When the update was deployed to production, however, the upgrade became stuck, which led to the outage.
Impact
Livestreams:
- New streams could not be started
Viewers:
- Only existing streams could be viewed
Regions:
- Europe (Sweden/Russia), North America (Los Angeles/New York), South America (Brazil)
Current status
The service has been fully restored
https://status.livepeer.studio/
Timeline
- 7:52 AM EST - The Livepeer Studio team was alerted to an incident in which APIs were not responding
- 7:57 AM EST - The Livepeer Studio team's investigation found that tasks in the AMQP queuing system were being disconnected and were backing up. This caused high CPU and memory consumption, which led to tasks timing out
- 9:10 AM EST - The Livepeer Studio team ran an automated upgrade of the queuing system. The upgrade became stuck partway through and caused this issue
- 8:23 AM EST - The Livepeer Studio team put a fix in place and began monitoring the systems
- 9:55 AM EST - After monitoring the fix, the Livepeer Studio team concluded that the issue was resolved
Prevention
We are conducting broader audits and revamping our queue utilization practices.