As demand for GPUs has grown, so has the number of GPU-as-a-Service (GPUaaS) providers. Conventional wisdom suggests that GPU users primarily care about cost and performance. While these are indeed crucial factors, other aspects are equally important, such as availability, data locality/sovereignty, service termination features (e.g., bulk data transfer options), disaster recovery, business continuity, data privacy, ease of use, reliability, data egress costs, carbon footprint, and more.
In this blog, we will focus on availability. According to BMC Software, availability is the percentage of time that the infrastructure, system, or solution is operational under normal circumstances. For example, AWS EC2 provides a 99.5% instance-level availability SLA (which is quite low, roughly 3.6 hours of downtime per month), with service credits issued if this SLA is not met. To be fair, AWS also offers a higher regional SLA of 99.99%, equating to approximately 4.3 minutes of downtime per month.
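To make the downtime arithmetic concrete, here is a minimal Python sketch that converts an availability percentage into a monthly downtime budget. The function name `monthly_downtime_minutes` and the 30-day month are assumptions made for illustration; rounded figures quoted in prose vary slightly with the month length assumed.

```python
def monthly_downtime_minutes(availability_pct: float, days_in_month: float = 30.0) -> float:
    """Allowed downtime per month, in minutes, for a given availability percentage."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for sla in (99.5, 99.99, 99.999):
        minutes = monthly_downtime_minutes(sla)
        print(f"{sla}% availability allows {minutes:.2f} minutes of downtime per month")
```

For a 30-day month, this works out to 216 minutes (3.6 hours) at 99.5% and a little over 4 minutes at 99.99%.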
If you are a GPUaaS provider (or an aspiring one) or an NVIDIA Cloud Partner (NCP), you need to determine what level of availability suits your ideal customer profile. You’ll also need to establish how to measure this SLA and what credits (if any) to issue if the SLA is breached. As an aside, availability can be a key differentiator for your GPU cloud service.
Once you’ve set your availability criteria, the next step is to figure out how to meet the availability SLA. Here’s the equation to calculate availability:
Availability = MTBF / (MTBF + MTTR)
MTBF = Mean time between failures
MTTR = Mean time to repair
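The equation can also be rearranged to find the largest MTTR that still meets a target: MTTR = MTBF × (1 − Availability) / Availability. The sketch below implements both directions in Python. The function names and the 1,000-hour MTBF are hypothetical values chosen only for illustration, not measurements from any real GPU cloud.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr(mtbf_hours: float, target_availability: float) -> float:
    """Largest MTTR (in hours) that still meets the target availability.

    Obtained by solving Availability = MTBF / (MTBF + MTTR) for MTTR.
    """
    return mtbf_hours * (1 - target_availability) / target_availability


if __name__ == "__main__":
    mtbf = 1000.0  # hypothetical MTBF in hours, for illustration only
    print(f"Availability with a 2-hour MTTR: {availability(mtbf, 2.0):.4%}")
    print(f"Max MTTR for 99.999%: {max_mttr(mtbf, 0.99999) * 3600:.1f} seconds")
```

With this assumed MTBF, a 2-hour MTTR yields roughly 99.8% availability, while a 99.999% target leaves only about 36 seconds per repair on average.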
In other words, to calculate availability, you need to determine the MTBF for your GPU cloud and the MTTR across all failure types. Automated failure resolution is typically near-instantaneous, whereas manual resolution can take minutes or hours. The challenge is deciding which faults to repair automatically and which to repair manually so that the blended MTTR is equal to or lower than the MTTR your availability target requires. At Aarna, we’ve developed an MTTR calculator to help address this question.
The calculator uses data from Meta on GPU Cloud MTBF. With this data, you can align your repair strategy with your Availability SLA goals. The MTTR calculator requires two inputs:
- The required Availability SLA based on your (i.e., the GPUaaS provider's or NCP's) requirements.
- Average failure resolution time, assuming faults are identified and repaired manually.
After entering these inputs, the calculator will specify which fault repairs need to be automated and which can be managed manually.
For example, if your goal is 99.999% availability and it takes your operations team an average of 2 hours to identify and repair faults manually, you’ll need to automate the following types of faults:
- Faulty GPU
- GPU HBM3 Memory
- Software bug
- Network Switch/Cable
- Host Maintenance
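One way to reason about such a result is to treat MTTR as a weighted average of repair times across fault types, then automate the most frequent fault types first until the blended MTTR drops to the required value. The Python sketch below shows that idea. It is an assumption about how such a calculation could work, not the actual logic of the Aarna calculator, and the fault counts, the 0.01-hour automated repair time, and the 1-hour MTTR target are placeholder values, not Meta's data.

```python
def plan_automation(fault_counts, manual_hours, auto_hours, required_mttr_hours):
    """Greedy sketch: automate the most frequent fault types first.

    Returns (fault types to automate, resulting blended MTTR in hours).
    If the target cannot be met, every fault type is returned and the
    blended MTTR stays above the required value.
    """
    total = sum(fault_counts.values())
    blended = manual_hours  # start with every fault repaired manually
    automated = []
    by_frequency = sorted(fault_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in by_frequency:
        if blended <= required_mttr_hours:
            break
        # Automating this fault type replaces its manual repair time
        # with the automated one, weighted by how often the fault occurs.
        blended -= (count / total) * (manual_hours - auto_hours)
        automated.append(name)
    return automated, blended


if __name__ == "__main__":
    # Placeholder relative fault frequencies (hypothetical, for illustration only).
    fault_counts = {
        "Faulty GPU": 5,
        "GPU HBM3 Memory": 4,
        "Software bug": 3,
        "Network Switch/Cable": 2,
        "Host Maintenance": 1,
    }
    to_automate, mttr = plan_automation(
        fault_counts, manual_hours=2.0, auto_hours=0.01, required_mttr_hours=1.0
    )
    print("Automate:", to_automate)
    print(f"Blended MTTR: {mttr:.3f} hours")
```

Because every manual repair is assumed to take the same time here, sorting by frequency minimizes the number of fault types that need automation. If manual repair times differ per fault type, ranking by frequency times manual repair time would be the natural variation.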
Feel free to experiment with the MTTR calculator and share your feedback. If you make any improvements, please let us know so we can update the tool for the benefit of the broader community.
Additionally, our GPU Cloud Management Software (AMCOP) features fault management and correlation capabilities to aid in automating repairs. In the future, our product will also provide your BSS with Availability SLA violation details and a list of affected tenants, enabling you to issue credits as needed. Contact us to explore these topics further.
About us: Aarna.ml is an NVIDIA- and venture-backed startup building software that helps GPU-as-a-service companies build hyperscaler-grade cloud services with multi-tenancy and isolation.