The multimillion-dollar fabric decision: Why neoclouds need unified networks
Published: March 31, 2026

The fabric decision that defines margins

An enterprise CFO reviews AI spend: training bills are spiking at a hyperscaler where contracts are already in place, even though the company's data does not live exclusively there; inference performance is lagging; and a neocloud pilot is on the table. Only a year or two ago, practical choices were mostly limited to hyperscalers. The recent explosion of AI has made neoclouds a real option for organizations that hit limits on performance, cost, flexibility, service, or GPU availability. The question is no longer whether to use a neocloud; it is whether that neocloud can capture the full AI lifecycle, training and production inference alike, or only a one-time project.

Across the market, providers with comparable GPU footprints are seeing very different outcomes. Some watch customers train models on their infrastructure, then move production workloads elsewhere; for every dollar of training revenue retained, multiple dollars of higher-margin inference revenue walk out the door. Others are seeing inference revenue grow faster than training revenue, gross margins expand from the mid-teens into the high 30s, and valuations that reflect durable platform economics rather than commodity pricing. With thousands of AI projects underway globally, some variation between providers is expected, but a clear trend is emerging in how architecture correlates with business model.

The difference isn't better GPUs or temporary discounts. Providers pulling ahead have made one specific architectural bet: unified AI fabrics capable of running training and inference simultaneously at high performance, backed by a unified control plane. This is a structural decision that compounds over the years. Once you choose between dual fabrics and a unified fabric, you have effectively chosen your margin profile.

The economics are stark. A dual-fabric provider running separate training and inference infrastructures carries elevated capital and operational costs, constrained flexibility, and margins that tend to settle in the mid-teens. A unified-fabric competitor with a similar GPU count handles both workloads on a single fabric, capturing inference SLAs alongside training jobs, shifting the business mix toward higher-margin recurring revenue, and earning higher valuation multiples in the process. In realistic scenarios, the gross profit gap between these two paths can reach hundreds of millions of dollars at scale. That gap determines who has the cash flow to keep investing and who gets left behind in a consolidating market. Neoclouds should therefore ask not only how their fabric is built, but also what share of their business mix comes from higher-margin, recurring inference rather than one-off training projects.
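
To make the order of magnitude concrete, here is a minimal back-of-the-envelope sketch in Python. Every input is a hypothetical assumption chosen for illustration, not a figure from any provider; the point is only that shifting the revenue mix toward higher-margin inference, at the same scale, compounds into a large annual gross profit gap.

    # Back-of-the-envelope comparison of dual-fabric vs. unified-fabric economics.
    # Every number below is a hypothetical assumption for illustration only.

    ANNUAL_REVENUE = 1_000_000_000  # assumed total revenue at scale, USD

    def gross_profit(mix_inference: float, margin_training: float,
                     margin_inference: float) -> float:
        """Blend gross profit across the training/inference revenue mix."""
        return ANNUAL_REVENUE * ((1 - mix_inference) * margin_training
                                 + mix_inference * margin_inference)

    # Dual fabric: training-heavy mix, commodity margins (assumed).
    dual = gross_profit(mix_inference=0.20,
                        margin_training=0.15, margin_inference=0.25)

    # Unified fabric: inference-heavy mix, platform margins (assumed).
    unified = gross_profit(mix_inference=0.55,
                           margin_training=0.18, margin_inference=0.40)

    print(f"dual fabric:    ${dual / 1e6:,.0f}M gross profit")
    print(f"unified fabric: ${unified / 1e6:,.0f}M gross profit")
    print(f"annual gap:     ${(unified - dual) / 1e6:,.0f}M")

With these assumed inputs the gap is roughly $130M per year; over a three-to-five-year infrastructure cycle, that is how "hundreds of millions" accumulates.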

Platform or GPU broker?

Through 2024 and 2025, the dominant neocloud pitch was straightforward: GPU access at prices below the hyperscalers'. That differentiation still matters for many clients, but new decision criteria are emerging. Does the neocloud own and operate the GPUs? Do customers get direct access to level-3 experts in AI networking and GPU optimization? Can the provider troubleshoot across the full stack and offer dedicated or shared GPU environments, with advisory and benchmarking support, before a commitment? These may sound like minor points, but they become critical the moment a training or inference cluster stops working and the question becomes who can fix it, and how fast.

For some segments, the pure price gap is narrowing as the largest neoclouds and hyperscalers converge on similar capacity, while many emerging neoclouds still offer substantially lower effective TCO once service, support, storage, and microservices are included. In some regions and for some large buyers, hyperscalers appear to have caught up on GPU supply, yet many organizations with modest or even significant AI footprints still experience shortages in the type, timing, and location of capacity they need. Pricing continues to compress. Competing on "cheaper GPU rental" alone is a race to the bottom.

The providers that survive through 2030 are likely to look less like GPU resellers and more like integrated AI platforms, managing training, inference, fine-tuning, and iteration so customers can run AI as a business capability rather than a one-off project. Platform providers command pricing power and stickiness: when a customer's recommendation engine, fraud detection, and personalization models all run on integrated infrastructure, switching costs become prohibitive, and the customer stops re-evaluating providers for each new project. The pattern is consistent: the winners behave like platforms offering differentiated services, not like GPU resellers with no value add.

The customer lifecycle makes this concrete. A retailer trains a recommendation model on a few hundred GPUs and now needs to serve thousands of inference requests per second, under strict latency SLAs, for its e-commerce site. A dual-fabric neocloud can't guarantee those production SLAs alongside other tenants, so the customer is steered to a hyperscaler; the neocloud is left with a one-off training win and millions in lost lifecycle revenue. A unified-fabric neocloud deploys the same model into production on the same infrastructure: no second vendor, no data migration, no egress fees, no new tooling. Twelve months later, fine-tuning and new use cases land on the same platform. Within two years, the customer has standardized on it.

Why training fabrics fail at inference

Training and inference represent fundamentally opposed traffic patterns flowing through the same physical network. Large-scale training requires synchronized gradient updates across thousands of GPUs: bulk, predictable transfers of megabytes per synchronization step. The workload tolerates brief delays; a congestion spike that slightly extends training time is acceptable. Traditional training fabrics optimize for exactly this: sufficient buffering to absorb bursts, high bandwidth, and congestion-aware routing.

Figure 1: Side-by-side comparison of training traffic, dominated by large synchronized gradient exchanges, and inference traffic, characterized by small, irregular, latency-sensitive requests

As shown in Figure 1, inference traffic is the opposite. Requests arrive asynchronously from many customers at unpredictable intervals, each one small (kilobytes rather than megabytes) and each one latency-critical. When a production application expects an 80 ms response and receives 200 ms, SLA penalties loom. Buffering tuned for bulk training traffic adds latency to small inference requests queued behind gradient bursts. Operations teams often respond by segregating workloads onto separate racks and fabrics, creating two infrastructures with duplicated capital and operational overhead.
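
The queueing arithmetic behind that penalty is simple. A minimal sketch, assuming illustrative link speeds and burst sizes (none of these numbers come from the article), shows what happens when a small request lands behind a bulk burst in a shared queue:

    # Queueing-delay illustration: a small inference request stuck behind a
    # bulk gradient burst in a shared, deeply buffered queue on one hop.
    # All sizes and rates are illustrative assumptions.

    LINK_GBPS = 400                      # assumed link rate
    GRADIENT_BURST_BYTES = 8 * 2**20     # assumed 8 MB of queued gradient traffic
    INFERENCE_REQUEST_BYTES = 4 * 2**10  # assumed 4 KB inference request

    def drain_time_us(queued_bytes: int, link_gbps: float) -> float:
        """Time to drain queued bytes at the given link rate, in microseconds."""
        return queued_bytes * 8 / (link_gbps * 1e9) * 1e6

    wait = drain_time_us(GRADIENT_BURST_BYTES, LINK_GBPS)
    own = drain_time_us(INFERENCE_REQUEST_BYTES, LINK_GBPS)

    print(f"waits ~{wait:.0f} us behind the burst; "
          f"needs only ~{own:.2f} us to transmit itself")

Even at 400 Gbps, the request waits roughly 170 microseconds behind a single 8 MB burst on one hop. Multiplied across several congested hops, deeper buffers, and many concurrent bursts, the queueing contribution inflates tail latency well past what a production SLA tolerates.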

Unified fabric architecture

Unified fabrics bring workload awareness into the network itself. When gradient traffic flows, the fabric recognizes it as bulk synchronous communication, routes it to paths with appropriate buffering, and lets it queue briefly. When inference requests arrive simultaneously, the fabric identifies them as latency-critical and steers them onto the lowest-latency paths, protecting SLAs without starving training.
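
In practice, workload awareness comes down to classifying flows and mapping each class to a different queueing policy. A minimal sketch of that decision logic follows; the class names, thresholds, and DSCP convention are assumptions for illustration, not a description of any product's internals.

    # Sketch of workload-aware flow classification: map each flow to a traffic
    # class that determines its queue and scheduling policy. Names, thresholds,
    # and the DSCP convention below are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class Flow:
        avg_msg_bytes: int   # typical message size observed for this flow
        is_collective: bool  # part of a synchronized RDMA collective
        dscp: int            # DSCP marking set by the host or NIC

    def classify(flow: Flow) -> str:
        """Return the traffic class a unified fabric might assign to a flow."""
        if flow.dscp == 46 or flow.avg_msg_bytes < 64 * 2**10:
            return "latency-inference"  # shallow queue, strict-priority service
        if flow.is_collective and flow.avg_msg_bytes >= 2**20:
            return "bulk-training"      # deep buffers, ECN-governed rate control
        return "best-effort"

    # A gradient exchange and an inference request land in different queues:
    print(classify(Flow(avg_msg_bytes=8 * 2**20, is_collective=True, dscp=26)))
    print(classify(Flow(avg_msg_bytes=4 * 2**10, is_collective=False, dscp=46)))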

Figure 2: Conceptual diagram highlighting the Cisco N9000 unified architecture, where a shared fabric and control plane manage both bulk, high-bandwidth training flows and fine-grained, low-latency inference requests

Cisco N9000 Series Switches provide silicon-level support for this model: sub-5-microsecond fabric latencies for fast collective operations, RoCEv2-based lossless Ethernet with ECN and PFC for large-scale training, and deep shared buffers to absorb gradient bursts. At the same time, workload-aware congestion management and live in-band telemetry maintain latency guarantees for inference flows under heavy load.
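
The ECN piece of that feature list follows the standard RoCEv2 congestion-control pattern: switches probabilistically mark packets as a queue fills, and marked senders reduce rate before the queue grows enough to trigger PFC pauses. Here is a sketch of the usual linear marking curve, with assumed threshold values:

    # RED/ECN-style marking as used by RoCEv2 congestion control: below K_MIN
    # nothing is marked, above K_MAX everything is, and in between the marking
    # probability rises linearly. Threshold values are illustrative assumptions.

    import random

    K_MIN_KB = 200   # assumed lower queue-depth threshold
    K_MAX_KB = 1000  # assumed upper queue-depth threshold
    P_MAX = 0.10     # assumed marking probability at K_MAX

    def ecn_mark(queue_depth_kb: float) -> bool:
        """Probabilistically ECN-mark a packet based on current queue depth."""
        if queue_depth_kb <= K_MIN_KB:
            return False
        if queue_depth_kb >= K_MAX_KB:
            return True
        p = P_MAX * (queue_depth_kb - K_MIN_KB) / (K_MAX_KB - K_MIN_KB)
        return random.random() < p

    # As gradient bursts deepen the queue, senders see more marks and back off
    # before loss or PFC pause frames become necessary.
    for depth in (100, 400, 800, 1200):
        marked = sum(ecn_mark(depth) for _ in range(10_000))
        print(f"queue depth {depth:>4} KB -> ~{marked / 100:.1f}% marked")

Tuning these thresholds per traffic class is one concrete way a fabric can let training queues run deep while keeping inference queues short.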

At the rack level, Cisco N9100 switches built on NVIDIA Spectrum-X Ethernet Silicon handle GPU-to-GPU collectives while enforcing per-rack isolation for multi-tenant inference. Disaggregated storage platforms such as VAST Data serve both workloads on the same network, carrying training checkpoints, model repositories, and inference request data, all with appropriate prioritization.

Real-time intelligence under load

The control plane determines whether unified fabric intelligence is usable at scale. Cisco Nexus One and Cisco Nexus Dashboard provide a unified management layer that centralizes telemetry, automation, and policy enforcement, so multi-tenant AI clusters operate as a single platform rather than a patchwork of domains.

Consider the pressure test: a large pre-training job running across thousands of H100-class GPUs, with inference endpoints serving production models for dozens of enterprise customers simultaneously. A customer's application goes viral; inference request rates jump two orders of magnitude in under a minute.

On a training-optimized fabric, the sequence is familiar: inference traffic collides with gradient bursts, P99 latency blows past SLA thresholds, timeouts cascade, and incident channels light up. Even after the training job is throttled, the damage to SLA metrics and customer trust is done.

Figure 3: Graph illustrating latency behavior at peak load; the training-optimized fabric experiences sharp latency spikes, while the unified fabric maintains steady P99 latency

On a unified fabric with Cisco Nexus One as the control plane, the response is automated. In-band telemetry surfaces the traffic shift; the fabric auto-tunes policies: inference traffic receives priority lanes, training traffic shifts to alternate paths with deeper buffering, and explicit congestion notifications guide training senders to briefly reduce rate. The training job's all-reduce time increases only marginally-within convergence tolerance-while inference stays inside its P99 SLA. No manual intervention. No SLA violation. The operations team watches everything on a single dashboard: training convergence metrics, inference latency distributions per tenant, and the fabric's own actions.
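
Stripped to its essentials, that automation is a closed feedback loop over telemetry. The sketch below shows the shape of such a loop; the function names, thresholds, and policy actions are illustrative stand-ins, not the Nexus One API.

    # Simplified telemetry-driven control loop: when a tenant's P99 inference
    # latency breaches its SLA, prioritize its flows and tighten training-side
    # congestion thresholds. All names and values are illustrative assumptions.

    import random

    SLA_P99_MS = 80.0

    def read_inference_p99_ms(tenant: str) -> float:
        """Stand-in for an in-band telemetry read; returns a simulated sample."""
        return random.uniform(40.0, 160.0)

    def apply_policy(tenant: str, breach: bool) -> None:
        """Stand-in for pushing queue and rate policy into the fabric."""
        action = ("strict-priority lanes for inference, tighter ECN on training"
                  if breach else "steady-state policy")
        print(f"{tenant}: {action}")

    def control_loop(tenants: list[str], cycles: int) -> None:
        for _ in range(cycles):
            for tenant in tenants:
                p99 = read_inference_p99_ms(tenant)
                apply_policy(tenant, breach=(p99 > SLA_P99_MS))

    control_loop(["tenant-a", "tenant-b"], cycles=3)

The value of a unified control plane is that this loop closes in seconds, fabric-wide, instead of waiting on a human to correlate alerts across two separate networks.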

The cost of delay

A provider operating separate fabrics might tell itself that unified fabric can wait for the next budgeting cycle. Meanwhile, a competitor deploys unified fabric this year. Within a few quarters, that competitor begins capturing customers whom the first provider trained but couldn't serve in production. Their margins improve. Their next funding round prices in platform economics, not commodity pricing.

By the time the first provider decides to act, tens or hundreds of millions may already be tied up in dual fabrics. Retrofitting unified fabric becomes a multi-year migration instead of a clean build, and during that window the most valuable customers are signing multi-year platform agreements with someone else.

The market is consolidating. The window to lead rather than follow is narrow. For neocloud CEOs, CTOs, and infrastructure leads, the fabric decision made this year will determine whether your organization becomes a differentiated AI platform or remains a GPU broker in a market that no longer rewards commodity capacity.

Unified networks: The strategic choice

Cisco works with neoclouds and innovative providers worldwide to build secure, efficient, and scalable AI platforms that deliver results across the entire model lifecycle. Detailed AI fabric white papers, design guides, and partner reference architectures, with full metrics, test data, and topologies, are available for readers who want to go deeper.

Explore how unified architectures from Cisco can accelerate your AI journey

Additional resources:

  • Neocloud Providers Are Making Waves-and Cisco Is Helping Them Do It
  • Neoclouds and winning the race to scale in the AI era
  • How neoclouds and sovereign clouds can accelerate GPUaaS and AI factories
  • AI-ready infrastructure design guides and reference architectures