Standard Ethernet simplicity vs. ultra-low-latency RDMA networks
Choose NVMe/TCP if you want NVMe performance on existing TCP/IP infrastructure without specialized hardware or RDMA expertise.
Choose NVMe/RDMA if every microsecond counts: high-frequency trading, real-time analytics, or latency-sensitive HPC workloads that justify the RDMA infrastructure investment.
| Feature | NVMe/TCP | NVMe/RDMA |
|---|---|---|
| Latency | 25–40 µs | 10–20 µs |
| Throughput | ~95% wire speed | ~98% wire speed |
| Random I/O (IOPS) | ~1.8M IOPS | ~2.1M IOPS |
| Hardware Required | Standard Ethernet NICs | RDMA-capable NICs (RoCE/iWARP) |
| Setup Complexity | Low — standard TCP/IP stack | High — RDMA fabric config, PFC, ECN tuning |
| Infrastructure Cost | Standard Ethernet cost | 2–4× higher (RDMA NICs + switches) |
| CPU Offload | Partial (some kernel bypass options) | Full kernel bypass |
| Operational Risk | Low — familiar TCP/IP ops | Higher — RDMA-specific failure modes |
NVMe/RDMA achieves its 10–20 µs latency through two mechanisms that TCP fundamentally cannot replicate: kernel bypass and zero-copy data transfer. In a standard TCP stack, every I/O crosses the kernel networking subsystem multiple times — data is copied from application buffers into kernel space, processed through the TCP/IP stack, handed to the NIC driver, and the reverse happens on receipt. RDMA eliminates this entirely. The NIC reads from and writes to application memory directly, without involving the CPU or kernel for the data path. The result is that RDMA latency is bounded primarily by hardware propagation delay and NVMe drive access time, while TCP latency includes software scheduling jitter on top of that baseline.
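The data-path argument can be made concrete with a back-of-the-envelope latency model. The component values below are illustrative assumptions chosen to land inside the ranges in the table above, not measurements; real numbers depend on the NICs, kernel version, and fabric.

```python
# Illustrative per-I/O latency model (all values in microseconds).
# Every component estimate here is an assumption for illustration only.

NVME_DRIVE_ACCESS = 8.0   # flash access + controller time (assumed)
WIRE_PROPAGATION = 2.0    # NIC + switch + cable, round trip (assumed)

# Per-I/O software cost that RDMA's kernel bypass and zero-copy remove:
TCP_SYSCALL_AND_COPIES = 10.0  # buffer copies in/out of kernel space (assumed)
TCP_STACK_PROCESSING = 6.0     # TCP/IP protocol handling (assumed)
TCP_SCHEDULING_JITTER = 6.0    # softirq / context-switch variance (assumed)

# RDMA latency is bounded by hardware propagation plus drive access;
# TCP stacks the software costs on top of that same baseline.
rdma_latency = NVME_DRIVE_ACCESS + WIRE_PROPAGATION
tcp_latency = (rdma_latency + TCP_SYSCALL_AND_COPIES
               + TCP_STACK_PROCESSING + TCP_SCHEDULING_JITTER)

print(f"NVMe/RDMA ≈ {rdma_latency:.0f} µs, NVMe/TCP ≈ {tcp_latency:.0f} µs")
```

With these assumed components the model lands at roughly 10 µs for RDMA and 32 µs for TCP, consistent with the 10–20 µs and 25–40 µs ranges above: the gap lives entirely in the software terms.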
That 15–20 µs difference is real and measurable. The question is whether it is meaningful for your specific workload. For high-frequency trading systems where decisions are made in sub-millisecond windows, 15 µs is significant. For a Kubernetes-hosted PostgreSQL replica responding to application queries that average 2–5 ms, the storage protocol's 15 µs contribution to that total is less than 1% — below any threshold that would change a business outcome. The trap is assuming that lower latency always translates to higher application throughput. It does at the extremes. For the vast majority of production workloads, other variables — CPU scheduling, query planning, application caching — dominate by orders of magnitude.
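The "less than 1%" claim is simple arithmetic; the sketch below works it through for the 2–5 ms query times quoted above, using 15 µs as the protocol gap.

```python
# Share of total query latency attributable to the TCP-vs-RDMA gap.
protocol_gap_us = 15.0

for query_ms in (2.0, 5.0):
    share = protocol_gap_us / (query_ms * 1000.0)
    print(f"{query_ms:.0f} ms query: RDMA saves {share:.2%} of total latency")
```

At 2 ms the protocol gap is 0.75% of the total; at 5 ms it is 0.30%. Neither would register in a p99 dashboard.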
There is also the operational cost of achieving that RDMA latency. RoCEv2, the most common RDMA transport, requires Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) to be precisely configured on every switch in the fabric. A misconfigured PFC setting can trigger a pause-frame storm or fabric-wide deadlock that takes down an entire storage fabric. iWARP avoids some of these issues but introduces its own implementation complexity. An NVMe/RDMA administrator needs skills spanning storage, networking, and hardware firmware, a combination that is genuinely scarce. NVMe/TCP, by contrast, runs on the same TCP/IP stack your networking team already understands.
| Workload | Better Choice | Why |
|---|---|---|
| High-frequency trading | NVMe/RDMA | Every microsecond directly impacts trading strategy performance and P&L |
| AI/ML training (cloud) | NVMe/TCP | Standard infrastructure suffices; training throughput is GPU-bound, not storage-latency-bound |
| HPC clusters | NVMe/RDMA | Predictable ultra-low latency for tightly coupled parallel workloads like weather modeling |
| Cloud-native Kubernetes | NVMe/TCP | No RDMA fabric to provision; ops teams use familiar TCP/IP tooling |
| Enterprise block storage | NVMe/TCP | Pragmatic TCO; the 2–4× hardware premium rarely yields proportional application benefit |
For most workloads — general-purpose databases, analytics pipelines, object stores, content delivery, and the overwhelming majority of Kubernetes persistent volumes — the 15–20 µs gap between NVMe/TCP and NVMe/RDMA is dwarfed by other latency contributors. A typical PostgreSQL query involves connection overhead, parsing, planning, index traversal, and result serialization. The storage access component might represent 10–30% of total query latency on a well-tuned system. Shaving 15 µs from that fraction will not move the p99 latency your users actually experience. NVMe/TCP, at 25–40 µs, already delivers dramatic improvements over older protocols like iSCSI (100–200 µs) while running on infrastructure you already own. For 90%+ of production deployments, that trade-off — standard hardware, standard skills, NVMe performance — is straightforwardly the right one.
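The diminishing-returns argument can be sketched numerically. The sketch below assumes a hypothetical query with 4 storage I/Os and 2400 µs of non-storage work (parsing, planning, serialization); the per-I/O latencies are midpoints of the ranges quoted in this article. These workload parameters are assumptions for illustration, not benchmarks.

```python
# End-to-end query latency under three storage protocols (illustrative).
NON_STORAGE_US = 2400.0  # parsing, planning, serialization (assumed)
IOS_PER_QUERY = 4        # storage accesses per query (assumed)

# Per-I/O latency midpoints from the ranges in this article:
protocol_latency_us = {"iSCSI": 150.0, "NVMe/TCP": 32.0, "NVMe/RDMA": 15.0}

totals = {
    name: NON_STORAGE_US + IOS_PER_QUERY * lat
    for name, lat in protocol_latency_us.items()
}

baseline = totals["iSCSI"]
for name, total in totals.items():
    speedup = (baseline - total) / baseline
    print(f"{name:10s}: {total:.0f} µs total ({speedup:.1%} faster than iSCSI)")
```

Under these assumptions, moving from iSCSI to NVMe/TCP cuts total query time by about 16%, while the further step to NVMe/RDMA recovers only another 2 percentage points: the big win is leaving iSCSI, not leaving TCP.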
NVMe/RDMA is the right answer for a narrow, well-defined category of latency-critical workloads where the operational and financial premium is justified. NVMe/TCP is the right answer for nearly everything else — and as kernel implementations mature and smart NICs bring partial offload to commodity hardware, the latency gap is narrowing. The deployment complexity gap is not. For cloud-native teams who want NVMe performance without managing an RDMA fabric, simplyblock.io delivers NVMe/TCP-based storage that integrates directly with Kubernetes — no RDMA expertise required.
simplyblock.io provides native NVMe/TCP block storage with automatic CSI provisioning.
Explore simplyblock.io →