RDMA (Remote Direct Memory Access) is a network I/O technology that lets one computer read or write another computer's memory directly, without involving the remote host's CPU or operating system on the data path, enabling ultra-low-latency transfers.
RDMA achieves its performance characteristics by implementing a kernel-bypass data path: the RDMA-capable NIC (RNIC) directly reads from or writes to application memory buffers without copying data through kernel buffers and without interrupting the host CPU. This eliminates the context switches, memory copies, and interrupt processing that dominate latency in conventional TCP socket I/O. The operating system is involved only in setting up the memory registrations and queue pairs (QPs) — steady-state data transfer is entirely offloaded to the RNIC.
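The control-path/data-path split described above can be made concrete with a small conceptual model. This is a toy cost accounting, not a real verbs API: the function names and the "two setup crossings" figure are illustrative assumptions, but they capture the key asymmetry — sockets pay a kernel crossing and a copy per transfer, while RDMA pays its kernel cost once at setup.

```python
# Conceptual model of the two data paths (NOT a real RDMA/libibverbs API).
# A conventional socket receive copies through a kernel buffer and crosses
# into the kernel on every transfer; an RDMA transfer pays its kernel cost
# once at setup (memory registration + queue-pair creation) and then moves
# data with no copies or kernel crossings.

def socket_transfer_cost(n_transfers):
    """Per-transfer cost of conventional TCP socket I/O."""
    kernel_crossings = n_transfers  # one syscall per recv()
    data_copies = n_transfers       # kernel buffer -> user buffer, once per transfer
    return kernel_crossings, data_copies

def rdma_transfer_cost(n_transfers):
    """Steady-state RDMA: the OS is involved only during setup."""
    setup_crossings = 2             # illustrative: register memory + create QP
    kernel_crossings = setup_crossings  # no per-transfer syscalls
    data_copies = 0                 # RNIC DMAs straight into registered buffers
    return kernel_crossings, data_copies
```

Under this model, a workload of 1,000 transfers costs 1,000 crossings and 1,000 copies over sockets, versus a fixed 2 crossings and 0 copies over RDMA — which is why the steady-state data path is described as fully offloaded to the RNIC.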
Three main RDMA transports exist: InfiniBand (a purpose-built fabric), iWARP (RDMA over standard TCP/IP), and RoCE (RDMA over Converged Ethernet). InfiniBand provides the lowest latency (sub-1 µs in HPC environments) but requires a dedicated InfiniBand fabric. RoCE v2 operates over standard Layer-3 IP networks but expects a lossless or near-lossless fabric — typically built with Priority Flow Control (PFC) and ECN-based congestion control such as DCQCN — because congestion-triggered packet drops force expensive RDMA retransmissions. iWARP trades some latency for simpler network requirements, tolerating packet loss through TCP's built-in retransmission.
The RDMA programming model exposes verbs: RDMA Read (the initiator reads directly from a remote node's registered memory), RDMA Write (the initiator writes directly into a remote node's registered memory), and Send/Receive (two-sided messaging that consumes a Receive posted by the peer). NVMe-oF over RDMA uses Sends for NVMe command capsules and one-sided operations for data: the target pushes read data into the initiator's pre-registered buffers with RDMA Write and pulls write data with RDMA Read, with no initiator CPU involvement on the data path.
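The difference between one-sided and two-sided verbs can be sketched with a toy model. The `Peer`, `rdma_write`, and `send` names below are hypothetical stand-ins, not libibverbs calls; the point is the semantics: a one-sided RDMA Write lands in the peer's registered memory without the peer posting anything, while a two-sided Send completes only against a Receive the peer posted in advance.

```python
# Toy model of verbs semantics (hypothetical classes, NOT libibverbs).

class Peer:
    def __init__(self, buf_size):
        self.memory = bytearray(buf_size)  # "registered" region the RNIC can access
        self.recv_queue = []               # offsets of posted Receive work requests
        self.completions = []              # completion events for two-sided messages

    def post_recv(self, offset):
        """Post a Receive buffer for an incoming two-sided Send."""
        self.recv_queue.append(offset)

def rdma_write(peer, offset, payload):
    # One-sided: data lands in the peer's registered memory; the peer's
    # CPU is not involved and nothing is consumed from its queues.
    peer.memory[offset:offset + len(payload)] = payload

def send(peer, payload):
    # Two-sided: must match a posted Receive; otherwise the receiver
    # is "not ready" (analogous to an RNR condition in real RDMA).
    if not peer.recv_queue:
        raise RuntimeError("receiver has no posted Receive")
    offset = peer.recv_queue.pop(0)
    peer.memory[offset:offset + len(payload)] = payload
    peer.completions.append(len(payload))
```

This mirrors the NVMe/RDMA usage above: command capsules travel as Sends (the controller pre-posts Receives for them), while bulk data moves with one-sided operations into buffers registered ahead of time.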
RDMA is the transport used by the NVMe/RDMA binding — the highest-performance but most operationally demanding member of the NVMe over Fabrics family. NVMe/TCP was designed explicitly as an alternative to NVMe/RDMA for environments that cannot justify the cost and operational complexity of RDMA-capable NICs and lossless Ethernet fabrics. In practice, NVMe/TCP running on modern CPUs with kernel 5.0+ drivers achieves latency within 1.5–2× of NVMe/RDMA — a gap that continues to narrow with hardware and software optimizations such as Intel DDIO, RSS-aware queue mapping, and io_uring integration.
NVMe/RDMA achieves 10–20 µs latency and near-zero CPU overhead for data transfers, making it compelling for latency-sensitive HPC and financial workloads. However, it requires specialized RNICs (typically $500–$2,000 per port), lossless Ethernet configuration (PFC/DCQCN), and dedicated operational expertise. NVMe/TCP requires none of these — any standard NIC and any Ethernet switch will work — at the cost of higher CPU utilization and latency in the 25–40 µs range. For most enterprise and cloud-native workloads, the simplicity of NVMe/TCP outweighs the incremental latency advantage of RDMA.
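A quick back-of-envelope calculation makes the trade-off above concrete, using only the latency figures quoted in this section (10–20 µs for NVMe/RDMA, 25–40 µs for NVMe/TCP); the helper below is illustrative, not a benchmark.

```python
# Back-of-envelope comparison of the per-I/O latency gap between
# NVMe/TCP and NVMe/RDMA, using the ranges quoted in the text.

RDMA_US = (10, 20)  # NVMe/RDMA latency range, microseconds
TCP_US = (25, 40)   # NVMe/TCP latency range, microseconds

def added_latency_us(rdma, tcp):
    """Extra per-I/O latency of TCP over RDMA: (best case, worst case)."""
    best = tcp[0] - rdma[1]   # fastest TCP vs slowest RDMA
    worst = tcp[1] - rdma[0]  # slowest TCP vs fastest RDMA
    return best, worst

best, worst = added_latency_us(RDMA_US, TCP_US)  # 5 µs to 30 µs extra per I/O
```

The gap works out to roughly 5–30 µs per I/O. For a latency-bound trading or HPC workload that delta is decisive; for throughput-bound enterprise workloads issuing deep queues of parallel I/O, it is amortized away — which is the crux of the conclusion that NVMe/TCP's simplicity usually wins.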