Overview
On Friday, November 8, 2024, our storage provider experienced a disruption in service across the North American and European regions. This affected our asset links, causing failures for video-on-demand (VOD), clipping, thumbnails, and livestream recordings.
Incident Details
The outage stemmed from an issue with our storage provider’s services. After their investigation, they reported that links for assets were inaccessible due to a broader regional service outage. Specific impacts included:
Link Accessibility
- Links generated for VODs, clipping, thumbnails, and livestream recordings were inaccessible, returning an “Access Denied” error.
Asset Generation
- No new assets (including clips, recordings, and thumbnails) were generated during the incident window.
Resolution
Once the storage provider resolved the regional outage, our services began to recover. However, we discovered that a high volume of requests was repeatedly hitting our livestream thumbnail link path. This request rate exceeded the provider’s limits, and the provider blocked our account. We have since implemented a solution that reduces the number of requests, restores livestream thumbnail functionality, and unblocks our account.
Mitigation Steps
To minimize service impact and restore functionality, we:
- Blocked access to the shared link path specifically for livestream thumbnails, allowing VOD, clipping, and recording paths to function as usual.
- Plan to implement a rate limit on the livestream thumbnail path once the request volume normalizes.
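As an illustration of the rate limit on the livestream thumbnail path, the sketch below uses a token bucket, which is one common approach and not necessarily the one we will deploy. The class name, the `rate` and `capacity` values, and the injectable `clock` parameter are assumptions made for this example.

```python
import time


class TokenBucket:
    """Token-bucket limiter (illustrative sketch, not production code).

    Allows bursts of up to `capacity` requests and refills at `rate`
    tokens per second. `clock` is injectable so the logic can be tested.
    """

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()         # time of the last refill

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical usage: one bucket guarding the livestream thumbnail path.
thumbnail_limiter = TokenBucket(rate=50, capacity=100)
if not thumbnail_limiter.allow():
    pass  # e.g. respond with HTTP 429 instead of forwarding to storage
```

Rejected requests never reach the storage provider, so the request rate seen by the provider stays at or below the configured refill rate after any initial burst.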
Root Cause
- Primary Cause: Storage provider outage in North America and Europe.
- Secondary Cause: Excessive requests to the livestream thumbnail path following service restoration, triggering a rate-limit response from the provider.
Impact Assessment
- Users Affected: Users attempting to access VOD, clipping, thumbnails, and livestream recordings received an “Access Denied” error.
- Service Downtime: More than 12 hours, with the issue partially resolved once we blocked the livestream thumbnail path.
Follow-up Actions
Implement Rate Limiting
- Deploy CDNs in front of the affected links to cache responses and distribute incoming requests, so that fewer requests reach the storage provider.
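To show why a caching layer in front of the links lowers the request rate at the origin, here is a minimal in-memory sketch of time-to-live (TTL) caching. It is not a CDN configuration. The class name, the `fetch` callback, and the TTL value are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve repeated requests from memory until an entry is older than `ttl_seconds`."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float,
                 clock=time.monotonic):
        self.fetch = fetch              # function that calls the origin (assumed)
        self.ttl = ttl_seconds
        self.clock = clock
        self.origin_hits = 0            # how many requests reached the origin
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]             # fresh entry: no origin request
        value = self.fetch(key)         # missing or expired: go to the origin
        self.origin_hits += 1
        self._store[key] = (now, value)
        return value
```

With a TTL of a few seconds, many viewers polling the same livestream thumbnail would produce roughly one origin request per TTL interval per thumbnail, rather than one per viewer request.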
Monitoring and Alerts
- Set up monitoring for high request volumes on specific asset paths to detect and address potential bottlenecks before they impact service.
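One way to detect high request volume on a specific asset path is a sliding-window counter per path, sketched below. The window length, threshold, and path strings are assumptions for this example, and a real deployment would more likely rely on the metrics and alerting features of the existing monitoring stack.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RequestRateMonitor:
    """Count requests per path over a sliding time window (illustrative sketch)."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window = window_seconds
        self.threshold = threshold
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, path: str, timestamp: float) -> bool:
        """Record one request; return True if the path is over the threshold."""
        hits = self._hits[path]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window:
            hits.popleft()
        return len(hits) > self.threshold


# Hypothetical usage: alert when a path exceeds 1000 requests in 60 seconds.
monitor = RequestRateMonitor(window_seconds=60, threshold=1000)
if monitor.record("/thumbnails/live/example", timestamp=0.0):
    pass  # e.g. page the on-call engineer or enable the rate limit
```

Tracking each path separately lets an alert name the specific path under load, such as the livestream thumbnail path in this incident, rather than only reporting an aggregate spike.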