Runtimes and Compute Requirements

Infrastructure: Machine Learning Hardware Requirements

Choosing the right hardware for training and operating machine learning programs greatly affects both the performance and the quality of a machine learning model. Most modern companies have transitioned data storage and compute workloads to cloud services. Many companies operate hybrid cloud environments, combining cloud and on-premise infrastructure. Others continue to operate entirely on-premise, usually driven by regulatory requirements.

Cloud-based infrastructure provides flexibility for machine learning practitioners to easily select the appropriate compute resources required to train and operate machine learning models.

Processors: CPUs, GPUs, TPUs, and FPGAs

The processor is a critical consideration in machine learning operations. The processor executes a program's arithmetic, logic, and input/output instructions; it is the engine that carries out machine learning model training and prediction. A faster, better-matched processor can reduce the time it takes to train a machine learning model and to generate predictions by as much as 100-fold or more.

Two primary processor types are used for most AI/ML tasks: central processing units (CPUs) and graphics processing units (GPUs). CPUs are designed to execute complex calculations sequentially and are suitable for training most traditional machine learning models. GPUs handle many simple calculations in parallel and are suitable for training deep learning models and for image-based tasks. In general, GPUs are more expensive than CPUs, so it is worthwhile to evaluate carefully which type of processor is appropriate for a given machine learning task.
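
In practice, many frameworks let a practitioner target whichever processor is present. The sketch below, which assumes PyTorch is installed, shows the common pattern of checking for a GPU and falling back to the CPU; the toy model and batch are placeholders for illustration only.

    # Minimal sketch: use a GPU when one is available, otherwise fall back to the CPU.
    # Assumes PyTorch is installed; the model and batch below are illustrative placeholders.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(128, 10).to(device)   # toy model moved onto the chosen processor
    batch = torch.randn(32, 128, device=device)   # toy input batch created on the same device
    predictions = model(batch)                    # runs on the GPU if present, else on the CPU

The same pattern applies at inference time, which is one reason the processor decision is usually made once per pipeline step rather than per model.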

Other specialized hardware increasingly is used to accelerate training and inference times for complex, deep learning algorithms, including Google’s tensor processing units (TPUs) and field-programmable gate arrays (FPGAs).

Memory and Storage

In addition to processor requirements, memory and storage are other key considerations for the AI/ML pipeline.

To train or operate a machine learning model, programs require data and code to be held in fast, local memory so the processor can execute them. Some models, such as deep neural networks, require more of this memory because the models themselves are large. Others, such as decision trees, can be trained with less memory because the models are small.
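
As a rough illustration of why larger models demand more memory, the sketch below estimates how much memory a model's parameters alone occupy, before counting activations, gradients, or optimizer state; the parameter counts are hypothetical.

    # Back-of-the-envelope estimate of the memory occupied by a model's parameters.
    # Parameter counts below are illustrative, not measurements of any real model.
    def parameter_memory_gb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
        """Memory needed just to hold the parameters (float32 = 4 bytes each)."""
        return num_parameters * bytes_per_parameter / 1e9

    print(parameter_memory_gb(1_000_000))      # a small model: roughly 0.004 GB
    print(parameter_memory_gb(7_000_000_000))  # a large neural network: roughly 28 GB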

For disk storage, cloud object storage and distributed file systems largely remove the capacity limits that local hard disk size historically imposed. However, AI/ML pipelines operating in the cloud still need careful design of both the data store and the model store.
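
One common pattern, sketched below with hypothetical bucket and key names and assuming boto3 with configured AWS credentials, is to keep the training data store and the model artifact store in separate, versioned object-storage locations.

    # Sketch of separating the data store from the model store in object storage.
    # Bucket names, keys, and local paths are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    # Pull a training dataset from the data store.
    s3.download_file("example-data-store", "datasets/train.csv", "/tmp/train.csv")

    # ... train the model and serialize it to /tmp/model.pkl ...

    # Publish the trained artifact to a separate, versioned model store.
    s3.upload_file("/tmp/model.pkl", "example-model-store", "models/churn/v1/model.pkl")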

Many real-world AI/ML use cases involve complex, multi-step pipelines. Each step may require different libraries and runtimes and may need to execute on a specialized hardware profile. It is therefore critical to manage libraries, runtimes, and hardware profiles explicitly during algorithm development and ongoing maintenance, because these design choices have a significant impact on both cost and algorithm performance.
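
One lightweight way to make these choices visible is to declare each step's runtime and hardware profile alongside the pipeline definition itself. The sketch below uses hypothetical step names, runtimes, and profiles and is not tied to any particular orchestration framework.

    # Sketch of recording each pipeline step's runtime and hardware profile in one place,
    # so library versions and compute requirements are managed explicitly. All names are
    # illustrative.
    PIPELINE_STEPS = [
        {"name": "ingest",    "runtime": "python3.11 + pandas", "hardware": "cpu-small"},
        {"name": "featurize", "runtime": "python3.11 + spark",  "hardware": "cpu-large"},
        {"name": "train",     "runtime": "python3.11 + torch",  "hardware": "gpu-a100"},
        {"name": "serve",     "runtime": "python3.11 + torch",  "hardware": "gpu-t4"},
    ]

    for step in PIPELINE_STEPS:
        print(f"{step['name']}: run with {step['runtime']} on a {step['hardware']} profile")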