Summary
This is a post-mortem describing the incident being investigated on 05/20/24 https://status.livepeer.studio/incidents/tdrw49vj8y87
Incident
Description
Users reported frequent rebuffering and the inability to start or view streams. Upon investigation, the Livepeer Studio team discovered that this issue affected all regions. Our primary cloud storage provider reported an outage on May 20 from 15:11 UTC to 15:59 UTC. During this outage, a bug in the retry mechanism for uploading recordings caused several servers to lock up and become unresponsive.
Impact
Livestreams:
- Current livestreams in some regions (London, Frankfurt, Stockholm) would experience rebuffering
- New livestreams in some regions may not have been able to stream
Viewers:
- Current and new viewers in some regions experienced a high rebuffer rate
VOD:
- Uploading assets and live recording will take a long time to process
Current status
The service has been fully restored
https://status.livepeer.studio/
Timeline
- 11:00 AM EST - An internal alert triggered an investigation by the Livepeer Studio team to identify and find the cause of this alert
- 11:11 AM EST - A status alert from our storage provider notified us of an outage in one of the US regions
- 12:05 PM EST - An investigation led to an outage by our storage provider at 11:11 AM EST indicated as one of the reasons for this incident
- 12:18 PM EST - After monitoring the fix for the incident, the Livepeer Studio team concluded that the issue was resolved
Prevention
- We enhanced our retry mechanism and implemented additional failover solutions. These solutions correctly switch to our secondary backup storage provider if the primary storage provider experiences any outages.