AI’s too pricey right now — fixing chips, software and networks could finally make it fast, cheap, and everywhere.

A new era of AI is taking shape. Over the past year, demand for deploying trained models in real-time applications has surged, along with a continuous stream of challengers and disruptors entering the scene. AI inference, the process of running new data through a trained model to produce predictions or decisions, has become a critical and complex growth area. It’s backed by deep investment and projected to grow at a compound annual growth rate (CAGR) of 19.2% through 2030.
Right now, a top concern across the industry is simple: We need to process significantly more data (tokens) through more AI models for dramatically less cost. Processing inference tokens today is 10 to 100 times more expensive than it should be, a barrier to entry that cuts across a wide variety of use cases and input modalities, such as text, image, video and audio, as well as multimodal combinations of them.
Complexity and cost are tightly connected. Model size, data movement and computational demands all stack up, and every one of them impacts ROI. To move beyond limited pilot programs and into real business impact, we must confront this challenge head-on.
We need to fix the broken economics to unlock adoption. Until the cost curve is repaired, we will fall short of realizing AI’s full potential for enhancing existing markets and creating new ones.
In the race to commoditize generative and agentic AI tokens, the winners will be those who can deliver the best, fastest and most cost-effective options. To get there, we must break with legacy assumptions.
What led us here
Looking back at the deep learning revolution, it’s clear everything changed with AlexNet’s breakthrough in 2012. That neural network shattered image-recognition accuracy records and proved deep learning could deliver game-changing results.
What made AlexNet possible? Nvidia’s GPU. Originally built for gaming, GPUs turned out to be well-suited to the massive, repetitive calculations of neural networks. Nvidia had already invested in making GPUs programmable for general-purpose HPC workloads, giving the company a huge head start when AI erupted.
GPU performance on AI workloads began outpacing CPU performance at an exponential rate, leading to what became known as Huang’s Law: the observation that AI performance on GPUs doubles every 12 to 18 months. Unlike Moore’s Law, Huang’s Law is holding strong, fueled by increasingly parallel GPU architectures and the evolving system architecture surrounding them.
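A quick back-of-the-envelope calculation shows what that cadence implies. The doubling periods below are illustrative assumptions (a 15-month midpoint of the 12-to-18-month range versus a classic 24-month cycle), not measured data:

```python
# Back-of-the-envelope compounding, relative to a fixed baseline.
# The doubling periods are illustrative assumptions, not measured data.

def relative_performance(years: float, doubling_months: float) -> float:
    """Performance multiple after `years`, doubling every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

for years in (2, 5, 10):
    huang = relative_performance(years, doubling_months=15)  # midpoint of 12-18 months
    moore = relative_performance(years, doubling_months=24)  # classic two-year cadence
    print(f"{years:>2} yrs: ~{huang:.0f}x at a 15-month doubling vs ~{moore:.0f}x at 24 months")
```

Over a decade, the faster cadence compounds to roughly 256x against roughly 32x.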
Which brings us to today. We have powerful GPUs and custom AI accelerators, aka XPUs, but we’re connecting them to an infrastructure built with repurposed legacy CPUs and NICs. It’s like putting a Ferrari engine into a go-kart.
The legacy x86 architecture used in most AI servers’ CPU head nodes isn’t built to keep up with AI. A general-purpose processor can’t sustain the volume and velocity of processing that ever-evolving AI workloads demand of the head node, so expensive GPUs end up sitting idle, underutilized and underperforming.
One GPU is no longer enough, so larger arrays of GPUs are now used. Together they form a bigger virtual processor that runs huge models faster, which improves the user experience and shortens response times for agents that need multiple rounds of model queries. Fast network connectivity and low data-transfer time between GPUs are now critical so that GPUs are not wasted waiting on data.
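A rough way to see the cost of that waiting: if host-side work and inter-GPU transfers sit on the critical path of every inference step, accelerator utilization drops fast. The sketch below is a deliberately simplified model with placeholder numbers, not measurements from any real system:

```python
# Deliberately simplified model of accelerator utilization when host-side work
# and inter-GPU transfers sit on the critical path. All times are placeholders.

def gpu_utilization(compute_ms: float, host_ms: float, network_ms: float) -> float:
    """Fraction of each step the GPUs spend on useful compute."""
    return compute_ms / (compute_ms + host_ms + network_ms)

# Fast accelerators stalled behind a slow head node and slow interconnect ...
print(f"bottlenecked: {gpu_utilization(compute_ms=4.0, host_ms=3.0, network_ms=5.0):.0%}")
# ... versus the same accelerators with offloaded host work and faster networking.
print(f"offloaded:    {gpu_utilization(compute_ms=4.0, host_ms=0.5, network_ms=1.0):.0%}")
```

In this toy model, trimming the host and network overhead lifts utilization from roughly a third to nearly three-quarters of every step.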
Employing the full stack
AI needs massive data flows in the blink of an eye (actually, much faster than that). To drive the cost-per-data-token down toward near zero, we need a full-stack approach with smarter software, purpose-built hardware and intelligent orchestration.
On the hardware side, GPUs and other XPUs are evolving rapidly. Their performance improves year over year, not just because of more transistors, but because of better architecture, tighter integration and faster memory. Huang’s Law continues to deliver.
But these powerful AI processors are held back by the systems they’re embedded in. It’s like lining up Ferraris and asking them to race during rush-hour traffic.
New classes of specialized AI chips are emerging, fundamentally transforming computing, connectivity and networking for AI. This isn’t another GPU or XPU; it’s innovation at the core of the system, on the head-node side as well as on the scale-up and scale-out network sides. These AI-optimized, purpose-built chips are masters of traffic control and processing, enabling GPUs to run at full speed.
It is increasingly clear that we need faster, smarter, compute-enabled and AI-optimized NICs natively integrated into the evolving networking frameworks and collective-communication stacks (NCCL, xCCL and the like), bypassing the CPU during data transfer. The network becomes not just a superhighway, but part of the brain of the operation. These new NICs can also adapt to new and future protocols designed for AI and HPC, such as Ultra Ethernet.
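For a concrete sense of the traffic these stacks carry, here is a minimal PyTorch sketch of an NCCL all-reduce across local GPUs; with the right NIC and driver support (GPUDirect RDMA, for example), this kind of collective can move data GPU to GPU without staging it through host memory. The script, its port number and its tensor sizes are illustrative assumptions, not a reference setup:

```python
# Minimal sketch: an NCCL all-reduce across local GPUs with PyTorch.
# With capable NICs/drivers (e.g. GPUDirect RDMA), NCCL can move these tensors
# GPU to GPU without bouncing them through host memory. Illustrative only.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # single-node example
    os.environ["MASTER_PORT"] = "29500"       # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each GPU holds a shard of activations/gradients; the all-reduce sums them.
    shard = torch.full((1024, 1024), float(rank), device=f"cuda:{rank}")
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: result[0, 0] = {shard[0, 0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # assumes two or more local GPUs
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

The collective itself is a few lines; the cost question is whether the surrounding system lets it run at wire speed without detouring through the head node.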
On the software side, we’re seeing major advances. Techniques like pruning and knowledge distillation help make models faster, lighter and more efficient. Smaller models, like DeepSeek, outperform expectations by optimizing and balancing inference compute and data flows.
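As one concrete example of those software levers, a knowledge-distillation training step nudges a small “student” model to match the softened output distribution of a larger “teacher.” The sketch below is a generic, minimal version; the toy models, temperature and loss weighting are placeholders, not a recipe from any particular lab:

```python
# Minimal knowledge-distillation step: a small student learns to match the
# softened outputs of a larger teacher. Models, data and hyperparameters
# here are toy placeholders.
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(
    torch.nn.Linear(128, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).eval()
student = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

x = torch.randn(32, 128)              # toy input batch
labels = torch.randint(0, 10, (32,))  # toy hard labels
T, alpha = 2.0, 0.5                   # softening temperature, loss mix

with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

# Soft targets from the teacher, blended with the ordinary hard-label loss.
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
hard_loss = F.cross_entropy(student_logits, labels)
loss = alpha * soft_loss + (1 - alpha) * hard_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The payoff at inference time is that only the much smaller student has to run, cutting compute and data movement per token.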
To truly reduce the cost-per-AI-token, everything across the stack must work in sync, from silicon and software to system design. The right combination delivers that synchronization, bringing costs down while unlocking new levels of performance.
The path to zero marginal cost
AI inference costs remain stubbornly high, even with massive capital investment. Tech companies and cloud providers often run at negative margins, pouring money into inefficient systems that were never designed for the demands of modern inference.
The fundamental issue is marginal cost. In any scalable business, success depends on driving down the cost of producing one more unit. That’s what makes these businesses profitable, and it’s what made the whole SaaS model viable and scalable.
The same principle applies to AI. To be truly transformative, the cost of generating additional tokens needs to approach zero. That’s when the market stops subsidizing and starts extracting real, repeatable business value.
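To make that concrete, consider a toy cost model (every number below is invented for illustration): amortized hardware plus operating cost, divided by the tokens actually served. Throughput and utilization, the very things better silicon, networking and software improve, are what push the marginal cost of one more token toward zero:

```python
# Toy cost-per-token model. Every figure below is an invented placeholder;
# the point is the shape of the curve, not the specific dollar amounts.

def cost_per_million_tokens(server_cost_usd: float,
                            amortization_years: float,
                            power_and_ops_usd_per_hour: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    hourly_capex = server_cost_usd / (amortization_years * 365 * 24)
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (hourly_capex + power_and_ops_usd_per_hour) / tokens_per_hour * 1_000_000

# Same hypothetical server: mostly idle and slow vs. kept busy at high throughput.
print(f"${cost_per_million_tokens(300_000, 4, 6.0, tokens_per_second=5_000, utilization=0.3):.2f} per 1M tokens")
print(f"${cost_per_million_tokens(300_000, 4, 6.0, tokens_per_second=20_000, utilization=0.9):.2f} per 1M tokens")
```

In this toy example, quadrupling throughput and tripling utilization cuts the cost per million tokens by more than an order of magnitude, without changing the hardware bill at all.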
We can get there by closing the gap between Moore’s Law and Huang’s Law and by making sure networks leap ahead of GPUs. Doing so demands architecture that works with, not against, the GPUs and XPUs already leading progress.
This article is published as part of the Foundry Expert Contributor Network.