API failing in Asia region
Incident Report for Livepeer Studio
Postmortem

Incident Report

Date: July 23rd, 2024

Time: 9:53 UTC

Resolved: 11:28 UTC

Incident Summary: At 9:53 UTC on July 23rd, 2024, a stream trigger alert fired, indicating that stream triggers were being sent but were timing out without receiving responses.

The issue was traced to an excessive amount of database load triggered by a specific customer's misconfiguration of their implementation.

The incident was resolved at 11:28 UTC by rate-limiting the customer and then working with them to correct the issue.

Incident Details:

  1. Initial Alert:
* **Time**: 9:53 UTC
* **Event**: Stream trigger alert activated indicating no response from triggered streams, leading to a timeout.
  2. Investigation Findings:

    CPU usage on the regional database replica in Singapore was pegged at 100%. As a result, nodes failed to connect to the database, which in turn disrupted the processing of new playback requests.

  3. Mitigation Steps:

    Our initial attempt to increase database capacity did not absorb the increased load. We then traced the problematic queries to a specific customer, applied rate limiting to that customer, and restarted the database to cancel in-flight queries, which immediately resolved the issue.
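
As an illustration of the per-customer rate limiting described above, here is a minimal token-bucket sketch. It is not the limiter deployed during the incident; the class names, the `customer_id` key, and the `rate`/`capacity` parameters are all assumptions chosen for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""
    rate: float
    capacity: float
    tokens: float
    updated: float

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerCustomerLimiter:
    """Keeps one bucket per customer so a single noisy customer is throttled
    without affecting the others (hypothetical helper, for illustration)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, customer_id: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(customer_id)
        if bucket is None:
            # New customers start with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[customer_id] = bucket
        return bucket.allow(now)
```

A request handler would call `limiter.allow(customer_id)` before touching the database and return an HTTP 429 when it yields `False`. In a multi-node deployment the bucket state would need to live in a shared store, which this in-process sketch does not cover.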

Actions Taken:

  1. Restarted the database replica in Singapore.
  2. Suspended streams causing rapid spikes.
  3. Analyzed stream and viewer behavior to identify patterns and prevent future occurrences.

Next Steps:

  1. Rate limiting by default: Make sure all API endpoints have per-customer rate limits.
  2. Increase internal caching, including of errors: Avoid a compounding (exponential) increase in load when an issue occurs.
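
To make the "caching, including of errors" item concrete, the sketch below shows one possible shape: a small TTL cache that also remembers failures for a short time, so repeated requests for a failing key do not each hit the backend. The `CachingLoader` name, the `load` callback, and the TTL values are hypothetical and are not taken from the Livepeer codebase.

```python
import time


class CachingLoader:
    """Caches successful results for `ttl` seconds and errors for `error_ttl`
    seconds (negative caching). Illustrative sketch only."""

    def __init__(self, load, ttl=30.0, error_ttl=5.0, clock=time.monotonic):
        self.load = load            # function that queries the backend
        self.ttl = ttl
        self.error_ttl = error_ttl
        self.clock = clock
        self.entries = {}           # key -> (expires_at, is_error, payload)

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            _, is_error, payload = entry
            if is_error:
                # Serve the remembered failure without calling the backend.
                raise payload
            return payload
        try:
            value = self.load(key)
        except Exception as exc:
            # Remember the failure briefly so retries do not pile up.
            self.entries[key] = (now + self.error_ttl, True, exc)
            raise
        self.entries[key] = (now + self.ttl, False, value)
        return value
```

Keeping `error_ttl` much shorter than `ttl` is a deliberate choice in this sketch: it limits how long a transient failure is replayed to callers while still shielding the database from retry storms.
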
Posted Jul 25, 2024 - 13:34 UTC

Resolved
This incident has been resolved.
Posted Jul 23, 2024 - 11:28 UTC
Monitoring
A fix has been implemented and API error rates have dropped. We'll continue to monitor.
Posted Jul 23, 2024 - 11:09 UTC
Investigating
We are currently investigating the issue.
Posted Jul 23, 2024 - 09:53 UTC
This incident affected: Livepeer Streaming API.