Protocol

What is RDMA (Remote Direct Memory Access)?

RDMA is a network I/O technology that allows one computer to read from or write to another computer's memory directly, without involving the remote host's CPU or operating system on the data path, enabling ultra-low-latency data transfers.

Technical Overview

RDMA achieves its performance characteristics by implementing a kernel-bypass data path: the RDMA-capable NIC (RNIC) directly reads from or writes to application memory buffers without copying data through kernel buffers and without interrupting the host CPU. This eliminates the context switches, memory copies, and interrupt processing that dominate latency in conventional TCP socket I/O. The operating system is involved only in setting up the memory registrations and queue pairs (QPs) — steady-state data transfer is entirely offloaded to the RNIC.
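The copy-elimination argument above can be sketched with a toy Python model. This is illustration only — real RDMA programming uses the C verbs API (libibverbs), and `CopyCounter`, `socket_receive`, and `rdma_receive` are invented names, not real APIs:

```python
class CopyCounter:
    """Counts CPU-driven buffer copies on a receive path."""
    def __init__(self) -> None:
        self.copies = 0

    def copy(self, src: bytes) -> bytes:
        self.copies += 1
        return bytes(src)

def socket_receive(payload: bytes, counter: CopyCounter) -> bytes:
    """Conventional TCP path: the NIC DMAs into a kernel socket buffer,
    then the kernel copies into the user buffer when the app calls recv()."""
    kernel_buf = counter.copy(payload)   # DMA into kernel socket buffer
    user_buf = counter.copy(kernel_buf)  # kernel-to-user copy + context switch
    return user_buf

def rdma_receive(payload: bytes, registered_buf: bytearray) -> None:
    """Kernel-bypass path: the RNIC places data straight into the pinned,
    pre-registered application buffer -- no kernel buffer, no CPU copy."""
    registered_buf[:len(payload)] = payload

counter = CopyCounter()
data = b"nvme-capsule"
socket_receive(data, counter)

registered = bytearray(64)
rdma_receive(data, registered)

print(counter.copies)  # 2 CPU copies on the socket path, 0 on the RDMA path
```

The model deliberately ignores interrupts and protocol processing; its point is only that the RDMA path reaches the application buffer with zero CPU-driven copies.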

Three main RDMA transports exist: InfiniBand (a purpose-built fabric), iWARP (RDMA over standard TCP/IP), and RoCE (RDMA over Converged Ethernet). InfiniBand provides the lowest latency (sub-1 µs in HPC environments) but requires a dedicated InfiniBand fabric. RoCE v2 operates over standard Layer-3 IP networks but requires a lossless (or near-lossless) underlay, typically Priority Flow Control (PFC) combined with ECN-based congestion control such as DCQCN, because congestion-triggered packet drops force expensive RDMA retransmissions. iWARP trades some latency for simpler network requirements, tolerating packet loss through TCP's built-in retransmission.

The RDMA programming model exposes verbs: RDMA Read (the initiator pulls data from a remote registered buffer), RDMA Write (the initiator pushes data into a remote registered buffer), and Send/Receive (two-sided messaging). NVMe-oF over RDMA uses Sends for NVMe command and response capsules, RDMA Writes to push read data into the initiator's pre-registered buffers, and RDMA Reads to fetch write data from them, keeping the initiator's CPU off the data path entirely.
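A toy sketch of these verb semantics, assuming a single queue pair and ignoring completion queues, memory keys, and error handling (`ToyQueuePair` and `MemoryRegion` are invented stand-ins, not libibverbs types):

```python
class MemoryRegion:
    """Stand-in for a registered (pinned) buffer exposed to the RNIC."""
    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)

class ToyQueuePair:
    """Connects an initiator to a passive peer's registered memory."""
    def __init__(self, peer_mr: MemoryRegion) -> None:
        self.peer_mr = peer_mr
        self.recv_queue = []  # receive buffers the peer posted in advance

    # One-sided verbs: the passive side's CPU executes no code for these.
    def rdma_write(self, offset: int, data: bytes) -> None:
        self.peer_mr.buf[offset:offset + len(data)] = data

    def rdma_read(self, offset: int, length: int) -> bytes:
        return bytes(self.peer_mr.buf[offset:offset + length])

    # Two-sided messaging: a Send consumes a pre-posted Receive.
    def post_recv(self, size: int) -> None:
        self.recv_queue.append(bytearray(size))

    def send(self, data: bytes) -> bytearray:
        rbuf = self.recv_queue.pop(0)  # fails if no Receive was posted
        rbuf[:len(data)] = data
        return rbuf

initiator_mr = MemoryRegion(4096)
qp = ToyQueuePair(initiator_mr)

# Target pushes read data straight into the initiator's registered buffer.
qp.rdma_write(0, b"READ-DATA")

# Command capsule travels via two-sided Send/Receive.
qp.post_recv(128)
capsule = qp.send(b"NVMe command capsule")
print(qp.rdma_read(0, 9))  # prints b'READ-DATA'
```

The asymmetry is the point of the model: `rdma_write`/`rdma_read` touch only the peer's memory region, while `send` requires the peer to have actively posted a receive buffer.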

How It Relates to NVMe/TCP

RDMA is the transport used by the NVMe/RDMA binding — the highest-performance but most operationally demanding member of the NVMe over Fabrics family. NVMe/TCP was designed explicitly as an alternative to NVMe/RDMA for environments that cannot justify the cost and operational complexity of RDMA-capable NICs and lossless Ethernet fabrics. In practice, NVMe/TCP running on modern CPUs with kernel 5.0+ drivers achieves latency within 1.5–2× of NVMe/RDMA — a gap that continues to narrow with optimizations such as Intel DDIO, RSS-aware queue mapping, and io_uring integration.

Key Characteristics

  • CPU involvement: Zero on the steady-state data path (kernel bypass)
  • Latency: 1–20 µs depending on fabric type (IB < RoCE < iWARP)
  • Memory model: Requires memory registration (pinning) before use
  • Transport variants: InfiniBand, RoCE v1/v2, iWARP
  • Network requirement: Lossless fabric (PFC) for RoCE; standard TCP for iWARP
  • NVMe-oF binding: NVMe/RDMA (defined in the original NVMe over Fabrics 1.0 specification; TP 8000 later added NVMe/TCP)

RDMA vs TCP for NVMe-oF

NVMe/RDMA achieves 10–20 µs latency and near-zero CPU overhead for data transfers, making it compelling for latency-sensitive HPC and financial workloads. However, it requires specialized RNICs (typically $500–$2,000 per port), lossless Ethernet configuration (PFC/DCQCN), and dedicated operational expertise. NVMe/TCP requires none of these — any standard NIC and any Ethernet switch will work — at the cost of higher CPU utilization and latency in the 25–40 µs range. For most enterprise and cloud-native workloads, the simplicity of NVMe/TCP outweighs the incremental latency advantage of RDMA.