" "

Milind Jalwadi

Navigating the best fit Job Scheduling option for your HPC and AI/ML workloads

1. Introduction

HPC (High-Performance Computing) and AI / ML (Artificial Intelligence / Machine Learning) workloads are integral to addressing some of the most challenging problems in science, engineering, and business. HPC workloads enable precise simulations of natural and physical phenomena, while AI/ML workloads empower systems to learn, adapt, and make predictions from data.

Efficient execution of these tasks relies on robust job scheduling, optimized resource utilization, and effective coordination between CPUs and GPUs. While HPC workloads benefit from traditional, deterministic scheduling, the diversity of AI/ML workloads calls for more dynamic and workload-specific approaches. This blog explores the intricacies of HPC and AI/ML workloads, the critical role of job scheduling, and a detailed mapping of the best suitable schedulers for specific workload types.

2. Understanding the nature of various HPC and AI/ML Workloads

HPC workloads revolve around solving deterministic problems through simulations or computational models. These tasks often run for extended durations and require vast computational power distributed across CPUs and GPUs. HPC jobs typically need uninterrupted execution, and their use cases do not mandate real-time responses.

AI/ML workloads are fundamentally data-driven and probabilistic, encompassing a diverse set of tasks. These workloads range from training and fine-tuning machine learning models to deploying them for inference and decision-making.

Let’s delve into various categories of AI/ML workloads:

  • Large Language Models (LLMs), such as GPT or BERT, are a foundational AI workload. Training LLMs involves optimizing billions of parameters over massive datasets to enable models to understand and generate human-like responses. This workload requires extensive GPU resources for long durations, often spanning several months.
  • Small Language Model (SLM) training focuses on smaller, domain-specific models that are optimized for particular applications. While GPUs still dominate the training process, SLMs need GPU infrastructure for shorter durations, which means more frequent set-up and tear-down of infrastructure.
  • Fine-tuning is a process where pre-trained models are adapted to specific datasets or domains. This allows businesses to customize AI capabilities for their unique needs. Fine-tuning relies on GPUs for computationally intensive retraining tasks, but needs fewer GPU cycles than SLM training. For this reason, these jobs can be even more dynamic than SLM jobs.
  • Batch inference processes large datasets using pre-trained models in a non-real-time manner. Based on the size of the data to be processed, the GPU usage time for such workloads can be predicted. These jobs tend to be of even shorter duration than fine-tuning.
  • Retrieval-Augmented Generation (RAG) is a technique that combines a generative model (e.g., GPT, BERT) with an external knowledge retrieval system to generate responses. RAG is characterized by bursty GPU usage, mainly for inferencing, with minimal training. Unlike the other workloads above, RAG is a transactional workload that is “always on”, with dynamic scaling to adapt to the load.
  • Real-time inference is another vital AI/ML workload, used in latency-sensitive applications. This workload is also transactional and relies on GPUs to perform the necessary computations with minimal delay. Of course, not all RAG and real-time inference jobs need GPUs; some are perfectly fine on CPUs.

Among all the AI/ML workload categories above, RAG and real-time inferencing need quick response times, and their traffic is bursty, with peaks and valleys.

To summarize, the table below compares the AI/ML workloads.

3. Why Is Job Scheduling Critical?

Efficient job scheduling is vital for handling the diverse demands of HPC and AI/ML workloads. It ensures that compute resources, whether CPUs or GPUs, are utilized optimally, preventing bottlenecks and idle time. Scheduling also allows for dynamic adaptation to fluctuating workloads, ensuring that resources are allocated in real-time based on priority and demand.

Having looked at the characteristics of the various HPC and AI workloads, let's take a look at some of the prominent job scheduling options.

SLURM (Simple Linux Utility for Resource Management) is a robust open-source job scheduler designed for High-Performance Computing (HPC) environments. It efficiently manages and allocates resources, such as CPUs, GPUs, and memory, across large clusters of nodes. SLURM supports features like job prioritization, pre-emption, dependency handling, and partitioning, making it ideal for running computationally intensive, batch-oriented workloads.
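As an illustration of how batch jobs reach SLURM, here is a minimal Python sketch that submits a GPU training job through the sbatch CLI. The partition name, GPU count, and walltime are assumptions to adapt to your cluster; the snippet only shows the submission pattern, not a full pipeline.

```python
import subprocess

def submit_training_job(script_path: str) -> str:
    """Submit a GPU batch job to SLURM via sbatch and return the job ID."""
    cmd = [
        "sbatch",
        "--job-name=llm-train",
        "--partition=gpu",      # assumed partition name; check `sinfo` on your cluster
        "--gres=gpu:4",         # request 4 GPUs on one node
        "--time=72:00:00",      # 3-day walltime for a long-running training job
        script_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # sbatch prints e.g. "Submitted batch job 12345"; the last token is the job ID.
    return result.stdout.strip().split()[-1]

if __name__ == "__main__":
    print("Submitted SLURM job", submit_training_job("train.sh"))
```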

Ray is an open-source framework for distributed, parallel processing of AI/ML workloads. Ray's job scheduling capabilities enable distributed model training, fault tolerance, task pre-emption, and scaling, which results in efficient resource utilization. Ray lets developers harness multi-node GPU processing by defining tasks that execute in parallel across distributed GPU clusters spanning on-premises, cloud, or hybrid environments.
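To make this concrete, here is a minimal Ray sketch: tasks decorated with a fractional GPU request are fanned out across whatever GPUs the cluster exposes. The fractional value and the per-shard work are illustrative assumptions.

```python
import ray

# Assumption: ray.init() with no arguments starts a local instance;
# pass address="auto" to join an existing multi-node cluster instead.
ray.init()

@ray.remote(num_gpus=0.5)  # fractional GPU request: Ray packs two such tasks per GPU
def fine_tune_shard(shard_id: int) -> str:
    # Placeholder for the real per-shard training or fine-tuning logic.
    return f"shard {shard_id} done"

# Launch the shards in parallel across the cluster and gather the results.
futures = [fine_tune_shard.remote(i) for i in range(8)]
print(ray.get(futures))
```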

Kubernetes (with the NVIDIA GPU operator) enables scheduling of GPU-accelerated workloads, pod-level resource allocation, and seamless scaling for AI/ML applications. It supports a wide range of use cases, including containerized training pipelines, inference deployments, and batch processing, particularly in multi-tenant or cloud-native environments. By combining Kubernetes' powerful orchestration features with NVIDIA's GPU optimizations, this solution simplifies the execution of AI/ML workloads in scalable, containerized environments while providing flexibility for both batch and real-time processing tasks. Kubernetes can also be used for transactional workloads with soft isolation through different namespaces. We call the isolation soft since it is not 100% secure, but it is good enough for sharing infrastructure across departments within an organization.
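For a flavor of pod-level GPU allocation, the sketch below uses the official Kubernetes Python client to request one GPU through the nvidia.com/gpu resource advertised by the GPU operator. The namespace and container image are illustrative assumptions.

```python
from kubernetes import client, config

# Assumptions: a local kubeconfig is available, the "team-a" namespace exists,
# and the GPU operator is installed so that nodes advertise nvidia.com/gpu.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-demo", namespace="team-a"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="nvcr.io/nvidia/tritonserver:24.01-py3",  # illustrative image tag
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one whole GPU for this pod
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
print("GPU pod submitted")
```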

Run:ai is a proprietary Kubernetes-native platform focused on optimizing GPU resource utilization and scheduling by extending the native Kubernetes scheduling capabilities specifically for AI/ML workloads. It enables dynamic GPU allocation, pooling, and partitioning, allowing multiple jobs to share GPUs efficiently. Run:ai's job scheduling capabilities include pre-emption, priority-based scheduling, and resource quotas, ensuring fair allocation in multi-tenant environments.

The table below gives a brief comparison of the features offered by the job schedulers described above.

4. Finding the best-fit job scheduler

Determining the best-fit job scheduler depends on the short-term and long-term nature of the AI workloads planned for the available GPU infrastructure.

For long-duration LLM training runs that span months, bare metal nodes serve the purpose without any need for a job scheduler. The only requirement is a sophisticated infrastructure manager that automates bare metal provisioning, so that nodes can be quickly reprovisioned for any new LLM training requirement.

For every other category of AI/ML workload, it is prudent to use a job scheduler.

The table below depicts how the various job schedulers fare in supporting the different categories of AI/ML workloads.

As evident from the chart above, no single job scheduler perfectly addresses the diverse requirements of all use cases. To bridge this gap, at aarna.ml, we have developed a comprehensive GPU Cloud Management Software (CMS) that enhances existing job, instance, and model scheduling solutions by introducing robust multi-tenancy, hard isolation, and seamless support for a wide range of workloads, including bare metal, VMs, containers, and Kubernetes pods.

5. Conclusion

Efficient job scheduling is essential for maximizing the performance of HPC and AI workloads. By understanding the unique strengths and limitations of these schedulers—and aligning them with specific workload demands—organizations can achieve optimal resource utilization, reduce costs, and unlock the full potential of their computational infrastructure.

Milind Jalwadi

Create your NVIDIA AI Cloud Digital Twin: Design, Deploy and Operate AI Cloud with Unmatched Flexibility and Efficiency

Deploying large-scale AI infrastructures comes with significant complexity for NVIDIA Cloud Providers (NCPs) who need to validate intricate, multi-tier network architectures built with NVIDIA’s state-of-the-art GPU and networking technologies.

In large-scale deployments, NCPs manage thousands of GPU nodes connected through multi-layered networking that includes leaf, spine, and super spine switches. NVIDIA also recommends a Reference Architecture (RA) for NCPs to ensure that these configurations achieve optimal throughput and low latency. However, implementing this design and validating the configurations is challenging because of hardware limitations in testing environments, which puts service reliability and deployment timelines at risk.

Robust validation is therefore crucial. The complexity of these topologies, coupled with multi-tenancy requirements, makes a reliable and scalable validation solution a necessity. The alternatives are performance degradation or network downtime, both of which are catastrophic in terms of revenue loss or SLA violation penalties.

To tackle these challenges, aarna.ml presents an innovative Digital Twin solution, which works seamlessly with NVIDIA Air to simplify network validation and streamline operations. This blog will provide an in-depth look at the importance of a Digital Twin, the common challenges faced by NCPs, and how the aarna solution can transform network deployment and management.

Typical Large Scale AI Cloud deployment

The diagram below depicts a high level topology of a large scale deployment.

Typical deployments comprise multiple nodes grouped into scalable units (SUs). GPUs within these SUs are connected through multi-tier switches comprising leaf, spine, and core switches, so that GPU-to-GPU communication across the entire data center is optimized with minimal hops, ensuring high performance and low latency. Please note that the topology above is only for reference and does not indicate any recommended deployment configuration.

Challenges in Current AI Cloud Deployments and Operations

As we have seen above, the sheer scale of deployment, the lack of adequate hardware resources for testing, and the absence of automation tools make it difficult for NCPs to ensure that the actual deployment matches the intended design. The current challenges that NCPs need to address can be categorized under Day 0/Day 1 activities and Day 2 activities, as detailed below.

Day 0 and Day 1 Design and Validation:

  • Manual Setup: Traditional methods involve time-consuming manual configuration of underlay networks and the testing of network cabling and server deployments. The entire lifecycle to make the deployment live could span a few months.
  • RA Compliance: Ensuring that the deployment aligns with NVIDIA’s RA specifications is challenging without a standardized validation tool and test and configuration scripts, increasing the potential for errors.
  • Hardware Limitations: Testing expansive, multi-tier topologies in lab settings is constrained by limited resources, leading to incomplete validation.

Day 2 Operations and Management:

  • Configuration management: Ensuring synchronized and versioned configurations across hundreds of switches is a complex task. Configuration drifts can lead to inconsistent network behavior and potential service issues.
  • Tenant Life Cycle Management (LCM): Allocating and deallocating nodes and virtual resources for tenants requires overlay configurations to be performed on several switches and nodes. A manual approach, validated directly on the production set-up, requires careful design and implementation; errors can be costly.
  • Topology Changes: Routine maintenance, switch replacements, and topology updates necessitate quick, error-free configuration updates to maintain network stability.
  • Reducing MTTR: Identifying and correcting GPU-related errors is time consuming because of manual steps and configurations. This can cause long mean-time-to-repair (MTTR) durations.

Introducing the Digital Twin Solution

Aarna.ml’s Digital Twin solution is a transformative tool that works with NVIDIA Air to create a comprehensive digital replica of physical network infrastructure. This enables NCPs to simulate their network topologies, test various deployment scenarios, and validate configurations before moving to live production, greatly reducing the risk of errors.

Key Features and Capabilities

  • Complete Network Simulation: The Digital Twin allows NCPs to specify and simulate their desired network topologies, including multi-tier switching, to ensure compliance with NVIDIA RA standards. This simulation environment supports detailed testing of both underlay and overlay configurations.
  • Automated Day 0 Configurations: The solution automates the initial setup of underlay networks, minimizing manual errors and significantly reducing the time required for validation.
  • Dynamic Tenant Overlays: Support for dynamic overlay configurations ensures that tenant-specific requirements are met, enabling seamless management of both east-west (GPU-to-GPU) and north-south (GPU-to-storage) traffic flows.
  • Simulation of Day 2 Operations: The Digital Twin simulates Day 2 operations such as configuration changes, switch replacements, topology updates, and GPU errors. This ensures that all such scenarios are tested and that the deployment scripts and configurations generated can be used by NCPs in their production setup.

Benefits of aarna.ml Solution

  • Accelerated Deployment: Reduces the certification process from months to weeks by automating validation and configuration tasks.
  • Enhanced Reliability: Ensures comprehensive validation of network configurations to prevent errors before they reach production.
  • Efficient Day 2 Operations: Simplifies tenant-specific changes, configuration drift corrections, and ongoing topology management, and reduces MTTR by automating GPU-related fault corrections.
  • Improved ROI: Maximizes operational efficiency, saving time and reducing the potential for costly network issues.

Summary

The aarna.ml solution, which utilizes NVIDIA Air, is an essential capability for NCPs looking to streamline their deployment processes, reduce risks, and maintain optimal performance in NVIDIA-based networking infrastructures. By automating testing and operational tasks, this solution empowers NCPs not only to validate complex deployments but also to extend its usage to their production set-up.

How to Engage

Engage aarna.ml for a Digital Twin Professional Service. Complete your day 0 and day 1 provisioning of GPU infrastructure within 2 weeks.

Explore Your Options: Learn more about how Aarna Networks’ Digital Twin technology can transform your network validation and management strategies. Contact us at info@aarna.ml today to integrate this innovative solution into your operations and unlock the full potential of your NVIDIA-based infrastructure.

Sriram Rupanagunta

Dynamic AI-RAN Orchestration for NVIDIA Accelerated Computing Infrastructure

NVIDIA accelerated computing can significantly speed up many different types of workloads. In this blog, I will explain how the same NVIDIA GPU computing infrastructure (all the way down to fractional GPUs) can be shared between different workloads, such as RAN (Radio Access Network) and AI/ML workloads, in a fully automated manner. This is the foundational requirement for enabling AI-RAN, a technology being embraced widely by the telecommunications industry to fuse AI and RAN on a common infrastructure as the next step towards AI-native 5G and 6G networks. I will also show a practical use case that was demonstrated to a Tier-1 telco.

First, some background before diving into the details: the infrastructure requirements for a specific type of workload (e.g., RAN or AI/ML) vary dynamically, so workloads cannot be statically assigned to resources. This is particularly aggravated by the fact that RAN utilization can vary wildly, with the average being between 20% and 30%. The unused cycles can be dynamically allocated to other workloads. The challenges in sharing the same GPU pool across multiple workloads can be summarized as follows:

  • Infrastructure requirements may be different for RAN/5G & AI workloads
  • Dependency on networking, such as switch re-configuration, IP/MAC address reassignment, etc.
  • Full isolation at infra level for security and performance SLAs between workloads
  • Multi-Instance GPU (MIG) sizing - Fixed partitions or dynamic configuration of MIG
  • Additional workflows that may be required, such as workload migration/scaling

This means that there is a need for an intelligent management entity capable of orchestrating both the infrastructure and the different types of workloads, and of switching workloads dynamically. This is accomplished using AMCOP (Aarna Networks Multicluster Orchestration Platform), Aarna's orchestration platform for orchestrating infrastructure, workloads, and applications.

The end-to-end scenario works as follows:

  • Create tenants for different workloads – RAN & AI. There may be multiple tenants for AI workloads if multiple user AI jobs are scheduled dynamically
  • Allocate required resources (servers or GPUs/fractional GPUs) for each tenant
  • Create network and storage isolation between the workloads
  • Provide an observability dashboard for the admin to monitor the GPU utilization & other KPIs
  • Deploy RAN components i.e. DU, CU, and NVIDIA AI Aerial (with Day-0 configuration) from RAN tenant
  • Deploy AI workloads (such as an NVIDIA AI Enterprise serverless API or NIM microservice) from AI tenant(s)
  • Monitor RAN traffic metrics
  • If the RAN traffic load goes below the threshold, consolidate RAN workload to fewer servers/GPUs/fractional GPUs
  • Deploy (or scale out) the AI workload (e.g. LLM Inferencing workload), after performing isolation
  • If the RAN traffic load exceeds the threshold, spin down (or scale in) AI workload, and subsequently, bring up RAN workload
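A highly simplified control-loop sketch of the scale-in/scale-out logic in the steps above is shown below. The threshold value and the orchestrator method names are hypothetical placeholders for illustration; they are not AMCOP APIs.

```python
import time

RAN_LOAD_THRESHOLD = 0.3  # hypothetical utilization threshold (30%)

def ai_ran_control_loop(orchestrator, poll_seconds: int = 60):
    """Sketch of dynamic workload switching between RAN and AI tenants.

    `orchestrator` is a hypothetical client object; its method names stand in
    for real orchestration calls and are not AMCOP API names.
    """
    while True:
        ran_load = orchestrator.get_ran_traffic_load()  # normalized 0.0 - 1.0
        if ran_load < RAN_LOAD_THRESHOLD:
            # Low RAN traffic: consolidate RAN onto fewer GPUs / MIG slices,
            # then scale out the AI workload (e.g. LLM inferencing).
            orchestrator.consolidate_ran_workload()
            orchestrator.scale_out_ai_workload()
        else:
            # RAN traffic back above threshold: scale the AI workload in
            # and give the resources back to RAN.
            orchestrator.scale_in_ai_workload()
            orchestrator.restore_ran_workload()
        time.sleep(poll_seconds)
```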

The demo showcasing a subset of this functionality on a single NVIDIA GH200 Grace Hopper Superchip is described below. It uses a single GPU, divided into fractional GPUs in a 3+4 MIG configuration, which are allocated to the different workloads.
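For context, a 3+4 MIG split like the one used in the demo is typically created with the nvidia-smi mig commands. The sketch below wraps those calls from Python; the profile names (and their memory sizes) vary by GPU model, so the ones shown are assumptions to verify with `nvidia-smi mig -lgip` on your hardware.

```python
import subprocess

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Assumptions: MIG-capable GPU at index 0, admin privileges, and profile names
# that match your hardware (list them with `nvidia-smi mig -lgip`).
run(["nvidia-smi", "-i", "0", "-mig", "1"])        # enable MIG mode on GPU 0
run(["nvidia-smi", "mig", "-i", "0", "-lgip"])     # list available GPU instance profiles
run(["nvidia-smi", "mig", "-i", "0",
     "-cgi", "3g.40gb,4g.40gb",                    # illustrative 3g + 4g profile names
     "-C"])                                        # also create the compute instances
```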

The following functionality can be seen in the demo, as part of the end-to-end flow.

  • Open the dashboard and show the RAN KPIs on the orchestrator GUI. Also, show the GPU and MIG metrics.
  • Show all the RAN KPIs and GPU + MIG metrics for the past duration (hours / days)
  • Show the updated RAN & GPU / MIG utilizations + AI metrics
  • Initiate the AI load/performance testing and then show the AI metrics and GPU/MIG utilizations on the dashboard
  • Query the RAG model (from a UE) from a custom GUI and show the response.

The functional block diagram of the demo configuration using the AMCOP-based solution is shown below.

Next Steps:

Over the next few years, we predict every RAN site will run on NVIDIA GPU-accelerated infrastructure. Contact us for help getting started with sharing NVIDIA GPU compute resources within your infrastructure. Aarna.ml's AI-Cloud Management Software (also known as AMCOP) orchestrates and manages GPU-accelerated environments, including support for NVIDIA AI Enterprise software and NVIDIA NIM microservices. Working closely with NVIDIA, we have deep expertise with the NVIDIA Grace Hopper platform, as well as NVIDIA Triton Inference Server and NVIDIA NeMo software.

Milind Jalwadi

NCP Technical Considerations for Building GPUaaS Cloud Infra

Introduction

To address the ever-growing industry demand for executing AI workloads, many different players are entering the market of hosting NVIDIA GPU based infrastructure and providing it as a service, in some cases by becoming an NVIDIA Cloud Partner (NCP). Essentially, these entities need to offer GPU processing instances to their clients in a manner similar to how the hyperscalers offer their infrastructure: API driven, on-demand, elastic, secure, isolated, and usage based.

Let’s delve into various technical aspects for the implementation of such a GPU infrastructure that needs to be offered “as a service” by the NCPs.

Multi-tenancy: The basic ask for any "as-a-service" offering

Any “as-a-service” offering needs to fundamentally support “multi-tenancy” at its core. The same physical infrastructure needs to be logically sliced and isolated for every tenant without compromising on the throughput and latency requirements of the tenant workloads.

This “slicing” of infrastructure needs to span across all the layers encompassing host hardware (CPU & GPU), platform software (e.g. Bare Metal as a Service i.e. BMaaS / Virtualization / Container as a Service i.e. CaaS), storage and networking devices (Switches & Routers).  

Logical isolation of such an infrastructure also needs to be elastic and dynamic in nature. It should be completely software driven with no manual steps. All the required resources for that tenant should be reserved & inter-connected during the lifespan of that tenant instance and then released back to the common pool once the tenant instance is deleted.

In summary, for offering GPU as a service, the NCPs need to be able to dynamically provision all the layers of the GPU based infrastructure including hardware, networking and software platforms - based on the API driven request from tenants.

So what’s the list of functionalities that NCPs need to have for offering GPUaaS?

With this context, let’s now get to the precise functional capabilities that NCPs would need to implement in their infrastructure so that it could be offered in a GPUaaS model.

  1. Day 0 provisioning of the GPU based DC -- This feature should bootstrap all the DC nodes and configure them with the appropriate OS versions, firmware, BIOS settings, GPU drivers, etc. It should also perform Day 0 provisioning of network switches, covering both InfiniBand Quantum and Ethernet based Spectrum switches. If included in the configuration, this software module should also provision the BlueField-3 (BF3) DPUs. In summary, this module should automate provisioning of hosts, GPUs, and underlay networks, making the infrastructure ready for use by the NCP.
  2. Compute (CPU / GPU) allocations -- Next, the NCP needs the capability to allocate CPUs and GPUs per the tenant-requested parameters. Allocating CPUs for tenants is a solved problem (BMaaS, virtualization, or CaaS with correctly tuned OS packages and Kubernetes Operators), so the main focus is on how GPU allocations can be done for tenants. Options ranging from fractional GPUs to multiple GPUs per tenant should be supported, based on the tenant workload requirements.
  3. Network isolation -- Tenant AI workloads may be executed across multiple GPUs within a node and across nodes. The nodes could be connected using InfiniBand Quantum or Ethernet based Spectrum switches (BF3 soft switches may also be involved). Per-tenant network isolation based on the underlying network capabilities (e.g. PKEY for InfiniBand and VXLAN for Ethernet) should be configured to ensure tenant workloads do not impact each other.
  4. Storage configurations -- Tenant-specific ACLs should be configured on the storage solution so that tenant workloads can access their subscribed quota of storage. GPUDirect Storage should also be configured for tenants that require it, enabling faster data transfer between GPU memory and the storage devices.
  5. Job scheduling -- Many HPC and AI training workloads require unique batch scheduling tools and algorithms that differ from those used for transactional CPU based workloads. These scheduling tools maximize GPU utilization and optimize workload execution across tenants.
  6. RBAC -- Role Based Access Control should support various personas for global admins, tenant admins, and tenant users, along with their respective privileges. The global admin should be able to create tenant admins and allocate a quota for every tenant. A tenant admin should be able to create enterprise-specific users and project hierarchies within its ambit. Tenant users should be able to manage and monitor their respective instances and workloads.
  7. Observability & Monitoring -- Every user, based on their privileges, should be able to monitor statistics related to CPU, GPU, memory, CaaS, workloads, etc., and take any manual or automated actions required to maintain the health of their workloads.
  8. Usage metrics -- NCPs should get per-tenant usage metrics at their desired intervals for billing purposes. Based on the NCP BSS capabilities, batch or connection oriented interfaces should be supported for passing on the tenant usage metrics.
  9. GPUaaS API -- Finally, all of these features should be made accessible to tenants through an API based interface. Tenants should be able to invoke the appropriate APIs on the NCP gateway for requesting infrastructure, submitting AI workloads, getting usage metrics, etc. The API should also be exposed through other means such as a GUI or Kubernetes CRs (a hypothetical request sketch follows this list).
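To make the API-driven model in point 9 concrete, here is a hedged sketch of what a tenant-facing provisioning request could look like. The endpoint, payload fields, and token handling are entirely hypothetical; they illustrate the pattern, not any specific NCP gateway or aarna.ml API.

```python
import requests

# Hypothetical GPUaaS gateway and tenant token -- illustrative only.
GATEWAY = "https://gpuaas.example-ncp.com/api/v1"
TOKEN = "tenant-api-token"  # obtained out of band, e.g. via the admin portal

def request_gpu_instance() -> dict:
    payload = {
        "tenant": "acme-ai",
        "instance_type": "bare-metal",   # or "vm" / "kubernetes"
        "gpu_count": 8,
        "gpu_profile": "full",           # could also be a fractional/MIG profile
        "network_isolation": "vxlan",    # per-tenant overlay
        "storage_quota_tb": 50,
    }
    resp = requests.post(
        f"{GATEWAY}/tenants/acme-ai/instances",
        json=payload,
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. instance ID and provisioning status
```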

In addition to this primary functionality, additional NCP capabilities are needed as well, such as an image service, key management, workflow orchestration, active & available inventory management, and a policy engine. These additional components bring the benefits of automating Day N operations and also help NCP tenants optimally size the GPU based infrastructure resources for their workloads.

Implementing the above functional modules will enable NCPs to offer a complete E2E GPUaaS infrastructure to their customers. It should be noted that various NVIDIA solution components and a few other 3rd-party components do support some of the building blocks above, but they all need to be combined in a logical manner and supplemented with additional features to provide a truly "as-a-service" platform.

In our next blog, we shall delve into the solution blueprint that NCPs could implement for offering their GPU infrastructure to tenants in an "as-a-service" and fully software-driven mode.

Please reach out to info@aarnanetworks.com to learn more.

Amar Kapadia

99.X% Availability? Why Most GPUaaS SLAs Fall Short and How to Fix It

With the growth of GPUs, there has also been a significant increase in the number of GPU-as-a-Service (GPUaaS) providers. Conventional wisdom suggests that GPU users primarily care about cost and performance. While these are indeed crucial factors, other aspects are equally important, such as availability, data locality/sovereignty, service termination features (e.g., bulk data transfer options), disaster recovery, business continuity, data privacy, ease of use, reliability, data egress costs, carbon footprint, and more.

In this blog, we will focus on availability. According to BMC Software, availability is the percentage of time that the infrastructure, system, or solution is operational under normal circumstances. For example, AWS EC2 provides a 99.5% availability SLA (which is quite low, roughly 3.5 hours of downtime per month), with service credits issued if this SLA is not met. To be fair, AWS also offers a higher regional SLA of 99.99%, equating to approximately 4.5 minutes of downtime per month.

If you are a GPUaaS provider (or an aspiring one) or an NVIDIA Cloud Partner (NCP), you need to determine what level of availability suits your ideal customer profile. You’ll also need to establish how to measure this SLA and what credits (if any) to issue if the SLA is breached. As an aside, availability can be a key differentiator for your GPU cloud service.

Once you’ve set your availability criteria, the next step is to figure out how to meet the availability SLA. Here’s the equation to calculate availability:

Availability = MTBF / (MTBF + MTTR)

MTBF = Mean time between failures

MTTR = Mean time to repair

In other words, to calculate availability, you need to determine the MTBF for your GPU cloud and calculate the MTTR across all failure types. Automated failure resolution is typically rapid and nearly instantaneous, whereas manual resolution can take minutes or hours. The challenge is deciding which faults should be automated and which should be repaired manually so that the blend of repair strategies results in an MTTR that is equal to or lower than the required MTTR. At Aarna, we’ve developed an MTTR calculator to help address this question.

The calculator uses data from Meta on GPU Cloud MTBF. With this data, you can align your repair strategy with your Availability SLA goals. The MTTR calculator requires two inputs:

  1. The required Availability SLA based on your (i.e. the GPUaaS provider or NCP) requirements.
  2. Average failure resolution time, assuming faults are identified and repaired manually.

After entering these inputs, the calculator will specify which fault repairs need to be automated and which can be managed manually.
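As a quick worked sketch of the arithmetic behind such a calculator (not the tool itself), the snippet below inverts the availability formula to get the MTTR required for a given MTBF and availability target. The MTBF value is an illustrative assumption.

```python
# Availability = MTBF / (MTBF + MTTR)  =>  MTTR = MTBF * (1 - A) / A

def required_mttr_hours(mtbf_hours: float, availability: float) -> float:
    """Return the maximum MTTR (hours) that still meets the availability target."""
    return mtbf_hours * (1.0 - availability) / availability

if __name__ == "__main__":
    mtbf = 50.0  # illustrative assumption: one failure every 50 hours across the fleet
    for target in (0.995, 0.9999, 0.99999):
        mttr_minutes = required_mttr_hours(mtbf, target) * 60
        print(f"Availability target {target:.3%}: required MTTR <= {mttr_minutes:.2f} minutes")
```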

For example, if your goal is 99.999% availability and it takes your operations team an average of 2 hours to identify and repair faults manually, you’ll need to automate the following types of faults:

  • Faulty GPU
  • GPU HBM3 Memory
  • Software bug
  • Network Switch/Cable
  • Host Maintenance

Feel free to experiment with the MTTR calculator and share your feedback. If you make any improvements, please let us know so we can update the tool for the benefit of the broader community.

[Interactive tool: MTTR Calculator for GPUaaS Providers -- enter the required availability SLA (%) and the average manual resolution time (hours) to see the required MTTR, the corresponding monthly downtime, and a GPUaaS failure analysis based on MTBF data.]

Additionally, our GPU Cloud Management Software (AMCOP) features fault management and correlation capabilities to aid in automating repairs. In the future, our product will also provide your BSS system with Availability SLA violation details and a list of affected tenants, enabling you to issue credits as needed. Contact us to explore these topics further.

About us: Aarna.ml is an NVIDIA- and venture-backed startup building software that helps GPU-as-a-Service companies build hyperscaler-grade cloud services with multi-tenancy and isolation.

Amar Kapadia

Aarna’s Role in Enabling Sovereign GPUaaS Providers in India

The government of India (India AI) issued a document titled, “Inviting Applications for Empanelment of Agencies for providing AI services on Cloud.” This document invites in-country GPUaaS providers to bid for sovereign opportunities. It is a detailed and thoughtful document and will no doubt spur innovation at all levels of the AI/ML stack within India.

If you are responding to this invitation or plan to, we would like to congratulate you! However, some of the requirements in sections 6.7 “Admin Portal”, 6.8 “Service Provisioning”, 6.9 “Operational Management”, and 6.12 “SLA Management” are complicated. They essentially require a GPU Cloud Management Software layer. And this cloud management software needs to be up & running in t0 + 6 months.

Let’s explore what your options are since it’s the classic “make” vs. “buy” situation. Here are the pros and cons of these two options.

“Make” Option

Pros:
  • Full control of the software with the ability to differentiate and customize (it may not actually be possible to differentiate at the IaaS layer, so the differentiation argument might be questionable)

Cons:
  • Requires very strong in-house development skills, especially given the tight development timelines
  • Matching ongoing feature requirements will get challenging in the long term

“Buy” Option

Pros:
  • Get access to a purpose-built 3rd party product
  • Save cost (since a 3rd party product will be less expensive than in-house development)
  • Focus precious development resources on AI/ML rather than infra

Cons:
  • Customization will be possible, but might be more difficult than with in-house software

If you are going for the “make” option, the rest of this blog is moot. However, if you want to explore the “buy” option, we can help you with the requirements below, organized by section of the document[1]:

General
  • Admin portal available within 6 months of LOI
  • Dynamically manage 1,000+ GPUs
6.7 “Admin Portal”
  • User registration/account creation
  • Service catalog and prices
  • Capacity dashboard
  • Utilization monitoring
  • Incident management
  • Service Health Dashboard
  • Ability to customize dashboard for the subsidy workflow
6.8 “Service Provisioning”
  • Online, on-demand instances that can be scaled up/down
  • Management portal
  • Public internet access with VPN
  • Support for BMaaS and VMs
  • MTTR SLAs and recovery
  • User notifications
  • Data destruction (so it cannot be forensically recovered)
6.9 “Operational Management”
  • Patch management
  • OS images with latest security patches
  • Root cause analysis and timely repairs
  • System usage
6.12 “SLA Management”
  • SLA measurement and MTTR improvement to meet incident management SLA (99.95% or higher)
  • Service availability measurement

 Finally, to our knowledge, we are the only GPU Cloud Management Software company in the market. If this blog sounds interesting, learn more:

  • Our GPU Cloud Management Software demo

And please feel free to contact us.