Storage: The unsung hero of AI deployments

As enterprises begin to deploy and use AI, many realize they’ll need massive computing power and fast networking, but storage needs are often overlooked.

Spinning up a chatbot or adopting an AI assistant isn’t likely to tax most enterprises’ storage capacity, but large AI projects with access to millions of data points may require many terabytes of new storage, potentially costing tens of millions of dollars, some AI and storage experts say.

A handful of storage options exist, and for some AI functions, cloud storage or traditional hard drives may be adequate, says Jeffrey Necciai, CTO of Duos Technologies, which uses AI with imaging to inspect railroad cars in motion.

But for AI systems that need to provide instantaneous responses or information, hard drives and cloud storage located hundreds of miles from the AI workloads may be too slow, Necciai and other experts say.

For example, Duos Technologies delivers inspection results on rail cars within 60 seconds of a car being scanned, Necciai says. That means Duos needs super-fast storage that works alongside its AI computing units.

“If you have a broken wheel, you want to know right now,” he says. “We don’t necessarily process anything in the cloud, because obviously, we don’t want the latency. We need to get that information out as quickly as possible.”

Not just the size of the drive

Enterprises considering large AI projects need to weigh both how much storage they need and whether that storage can handle multiple tasks at the same time, Necciai says.

“We need to write to the storage rapidly at the same time for multiple threads, and we need to read from the storage rapidly for multiple threads,” he adds. “It’s that ability to do things simultaneously to that storage that was so important to us.”
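Necciai’s point about simultaneous access maps to a parallel I/O pattern. The following is a minimal, hypothetical Python sketch, not Duos’s actual pipeline, showing writer and reader threads hitting the same storage volume at once; the file names, payload sizes, and thread counts are illustrative assumptions.

```python
# Hypothetical sketch: concurrent writes and reads against one storage path.
# Paths, payload sizes, and thread counts are assumptions for illustration only.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

STORAGE_DIR = tempfile.mkdtemp()      # stand-in for a fast NVMe-backed volume
CHUNK = b"\0" * (4 * 1024 * 1024)     # 4 MiB payload per write

def write_images(thread_id: int, count: int = 8) -> int:
    """Simulate a scanner thread persisting image chunks to storage."""
    written = 0
    for i in range(count):
        path = os.path.join(STORAGE_DIR, f"scan_{thread_id}_{i}.bin")
        with open(path, "wb") as f:
            f.write(CHUNK)
        written += len(CHUNK)
    return written

def read_images(thread_id: int) -> int:
    """Simulate an inference thread reading whatever images have landed so far."""
    read_bytes = 0
    for name in os.listdir(STORAGE_DIR):
        with open(os.path.join(STORAGE_DIR, name), "rb") as f:
            read_bytes += len(f.read())
    return read_bytes

# Writers and readers run at the same time against the same volume,
# which is the simultaneous access pattern described above.
with ThreadPoolExecutor(max_workers=8) as pool:
    writers = [pool.submit(write_images, t) for t in range(4)]
    readers = [pool.submit(read_images, t) for t in range(4)]
    total_written = sum(f.result() for f in writers)
    total_read = sum(f.result() for f in readers)

print(f"wrote {total_written >> 20} MiB, read {total_read >> 20} MiB concurrently")
```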

Last year, Duos scanned 8.5 million rail cars, with each scan potentially generating more than 1,050 images. The Duos Railcar Inspection Portal uses four high-performance storage arrays, each containing 16 NVMe drives, for a total capacity of about 500 terabytes.
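Those figures allow a rough back-of-envelope check. The sketch below uses the article’s numbers plus one assumption, a hypothetical average image size that Duos does not disclose, to estimate daily data volume and how many days of scans roughly 500 terabytes of NVMe capacity could hold.

```python
# Back-of-envelope sizing from the figures above.
# The average image size is an assumption for illustration; it is not published.
railcars_per_year = 8_500_000
images_per_scan = 1_050
avg_image_mb = 1.0            # assumed average image size in MB
nvme_capacity_tb = 500        # 4 arrays x 16 NVMe drives, ~500 TB total (from the article)

images_per_year = railcars_per_year * images_per_scan
data_per_day_tb = images_per_year / 365 * avg_image_mb / 1_000_000
retention_days = nvme_capacity_tb / data_per_day_tb

print(f"~{images_per_year / 1e9:.1f}B images per year")
print(f"~{data_per_day_tb:.1f} TB/day at an assumed {avg_image_mb} MB per image")
print(f"~{retention_days:.0f} days of scans fit in ~{nvme_capacity_tb} TB of NVMe")
```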

The company also uses about 25 terabytes of more traditional storage for training and developing its AI in house, with less need for instantaneous results. “We want to leverage all of it to do what we need to do,” Necciai says. “It really comes down to using the right tool for the right job.”

Intense data needs

Like Duos, some other enterprises running huge AI projects are turning to high-capacity SSDs or NAND flash memory for their storage needs.

High-speed memory options are significantly more expensive than hard drives, costing up to $1,000 per gigabyte, but they provide other advantages. For example, they can be nearly three times more power efficient and take up less space than racks of servers and hard drives, says Roger Corell, senior director of leadership marketing at enterprise SSD maker Solidigm.

As enterprises adopt more complex, multimodal AI projects, and more employees begin to use AI tools, the demand for high-capacity, multi-thread storage options will only increase, he says.

“AI is so intense in terms of the amount of data that needs to be stored, and how rapidly these massive data sets need to be accessed,” Corell adds.

In addition to SSD or NAND options, some companies are using private clouds or co-location facilities for their storage needs, says Ugur Tigli, CTO at MinIO, maker of an object store for AI and ML projects.

MinIO clients adopting AI are typically increasing their storage capacity by four to 10 times, he says. He encourages large-scale AI users to look beyond the public cloud for their storage needs, because private cloud or co-location services can cost 60% less than the public cloud.

“At the scale of hundreds of petabytes or an exabyte or two, the economics don’t work in the public cloud,” he says. “The overall cost would be in the tens to hundreds of millions of dollars per year depending on capacity, tiering, and data access profiles.”
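Tigli’s claim can be sanity-checked with rough numbers. The sketch below uses an assumed public-cloud unit price, an illustrative figure rather than a quote from MinIO or any provider, and applies the roughly 60% saving he cites to compare annual storage cost at the 100-petabyte scale he describes.

```python
# Illustrative cost comparison at 100 PB; all unit prices are assumptions,
# not vendor quotes. Public-cloud object storage list prices commonly fall
# around $0.01-$0.03 per GB-month before egress and request fees.
capacity_pb = 100
gb = capacity_pb * 1_000_000               # decimal GB in 100 PB

public_cloud_gb_month = 0.02               # assumed $/GB-month
public_cloud_annual = gb * public_cloud_gb_month * 12

private_saving = 0.60                      # ~60% lower, per Tigli
private_annual = public_cloud_annual * (1 - private_saving)

print(f"public cloud:  ${public_cloud_annual / 1e6:.0f}M per year")
print(f"private/colo:  ${private_annual / 1e6:.1f}M per year")
```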

Instead of the public cloud, enterprise users can build privately and “burst” to the cloud for additional GPU use, Tigli adds. “The key here is that compute is elastic but data has gravity and is growing at a predictable — albeit accelerated — rate, so it needs to be architected accordingly,” he says.

Storage as a platform

Another option is a petabyte-scale storage platform, adds Priyanka Karan, field CTO at digital transformation firm AHEAD. These platforms “aim to reduce the data movement challenges of bringing data from where it initially landed to places where it can be leveraged for AI training,” she says. “The goal is not to create a new storage silo.”

Some available storage platforms are built on NAND flash, which offers the high throughput and low latency essential for feeding data to GPUs and TPUs, she says.
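As one illustration of the access pattern Karan describes, the hypothetical sketch below streams training objects from an S3-compatible store (such as a MinIO deployment) so a GPU data loader is not starved. The endpoint, bucket, and credentials are placeholders, and the MinIO Python client is one possible choice, not something the article prescribes.

```python
# Hypothetical data-feeding sketch: stream objects from an S3-compatible store
# toward a GPU training loop. Endpoint, bucket, and credentials are placeholders;
# the MinIO Python SDK is one possible client, not a prescription.
from minio import Minio

client = Minio(
    "storage.example.internal:9000",   # placeholder endpoint
    access_key="ACCESS_KEY",           # placeholder credentials
    secret_key="SECRET_KEY",
    secure=False,
)

BUCKET = "training-data"               # placeholder bucket name

def stream_batches(prefix: str, batch_size: int = 32):
    """Yield lists of raw object payloads, ready for decoding on the GPU host."""
    batch = []
    for obj in client.list_objects(BUCKET, prefix=prefix, recursive=True):
        resp = client.get_object(BUCKET, obj.object_name)
        try:
            batch.append(resp.read())
        finally:
            resp.close()
            resp.release_conn()
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# A training loop would decode each payload and hand tensors to the accelerator;
# throughput here is bounded by the storage tier's read bandwidth and latency.
for batch in stream_batches("images/2024/"):
    pass  # decode + train step would go here
```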

With several options available, some AI users and experts say the amount and kind of storage needed depend on the AI project an organization is deploying.

Offline batch processing has lower memory requirements than real-time workloads, Karan says. In some cases, secondary storage options can be used to hold vast amounts of data needed for training and running AI models, she adds.

Choosing the right storage option also depends on the often-mentioned data gravity: the size of the data set, whether it can be moved to the cloud for processing, or whether it makes more sense to bring the processing to the data. In some AI projects, storage is co-located with the AI compute in a data center; in others, it sits in a public cloud or at the edge where the data is created.

Enterprises have many other factors to consider as well, including security and regulatory or compliance challenges. With cloud storage, “networking, distance, and latency are factors here, but they must consider the added cost variable,” Karan says.

Beyond the cost of the storage itself, off-premises options may carry data transfer fees, access fees, and management fees. On-premises storage, on the other hand, can involve a large upfront investment, as well as ongoing maintenance, power and cooling, and staff salaries.
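Those cost components can be laid out side by side. The sketch below is a simplified, hypothetical total-cost comparison; every figure is an assumption chosen only to show which fees belong on each side of the ledger, not a real price from any vendor.

```python
# Simplified, hypothetical TCO comparison. Every number is an assumption used
# only to show which cost components belong to each option.
def cloud_annual_cost(capacity_tb: float) -> float:
    storage_fee = capacity_tb * 1_000 * 0.02 * 12      # assumed $0.02/GB-month
    transfer_fee = capacity_tb * 0.10 * 1_000 * 0.09   # assume 10% egressed/yr at $0.09/GB
    access_and_mgmt = 0.05 * storage_fee               # assumed request/management overhead
    return storage_fee + transfer_fee + access_and_mgmt

def on_prem_annual_cost(capacity_tb: float, years: int = 5) -> float:
    hardware = capacity_tb * 150                       # assumed $150/TB purchase price
    amortized_hw = hardware / years                    # spread over an assumed 5-year life
    maintenance = 0.15 * amortized_hw                  # assumed support contract
    power_cooling = capacity_tb * 10                   # assumed $/TB-year for power + cooling
    staff = 50_000                                     # assumed share of admin salary
    return amortized_hw + maintenance + power_cooling + staff

for tb in (500, 5_000, 50_000):
    print(f"{tb:>6} TB  cloud ~${cloud_annual_cost(tb) / 1e6:.2f}M/yr  "
          f"on-prem ~${on_prem_annual_cost(tb) / 1e6:.2f}M/yr")
```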

“Organizations must evaluate their specific needs, including performance, cost, and scalability, to choose the best solution for their AI initiatives,” Karan says.

© Foundry