

13th ANNUAL WORKSHOP 2017

# **Asynchronous Peer-to-Peer Device Communication**

Feras Daoud, Leon Romanovsky

PeerDirect ASYNC

[ 28 March, 2017 ]



#### Agenda

- Peer-to-Peer communication
- PeerDirect technology
- PeerDirect and PeerDirect Async
- Performance
- Upstream work



# **Peer-to-Peer Communication**

#### **Peer-to-Peer Communication**

"Direct data transfer between PCI-E devices without the need to use main memory as a temporary storage or use of the CPU for moving data."

#### Main advantages:

- Allow direct data transfer between devices
- Control the peers directly from other peer devices
- Accelerate transfers between different PCI-E devices
- Improve latency, system throughput, CPU utilization, energy usage
- Cut out the middleman





# **PeerDirect Technology**

#### Timeline



#### **Prior To GPUDirect**

- GPUs use driver-allocated pinned memory buffers for transfers
- RDMA drivers use pinned buffers for zero-copy kernel-bypass communication
- It was impossible for RDMA drivers to pin memory allocated by the GPU
- Userspace needed to copy data between the GPU driver's system memory region and the RDMA memory region



### **GPUDirect/GPUDirect P2P**

- GPU and RDMA device share the same "pinned" buffers
- GPU copies the data to system memory
- RDMA device sends it from there
- Advantages
  - Eliminate the need to make a redundant copy in CUDA host memory
  - Eliminate CPU bandwidth and latency bottlenecks



### **GPUDirect RDMA/PeerDirect**

- CPU synchronizes between GPU tasks and data transfer
- HCA directly accesses GPU memory
- Advantages
  - Direct path for data exchange
  - Eliminate the need to make a redundant copy in host memory



#### **GPUDirect RDMA/PeerDirect CPU Utilization**





#### **GPUDirect Async/PeerDirect Async**

#### Control the HCA from the GPU

- Performance
  - Enable batching of multiple GPU and communication tasks
  - Reduce latency
- Reduce CPU utilization
  - Lightweight CPU
  - Less power
- CPU prepares and queues compute and communication tasks on GPU
- GPU triggers communication on HCA
- HCA directly accesses GPU memory



#### **GPUDirect Async/PeerDirect Async**



#### **Peer-to-Peer Evolution**







#### PeerDirect Memory Region Registration

- Allow ibv\_reg\_mr() to register peer memory
- Peer devices implement a new kernel module, io\_peer\_mem
- The module registers with the RDMA subsystem through ib\_register\_peer\_memory\_client()
- io\_peer\_mem implements the following callbacks:
  - acquire() detects whether a virtual memory range belongs to the peer
  - get\_pages() asks the peer for the physical memory addresses matching the memory region
  - dma\_map() requests the bus addresses for the memory region
  - Matching callbacks for release: dma\_unmap(), put\_pages() and release()
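The callback order described on this slide can be illustrated with a small userspace mock. This is only a sketch under stated assumptions: `struct mock_peer_client`, `mock_reg_mr()`, `mock_dereg_mr()`, the `PEER_*` constants and the invented bus window are all hypothetical, and the real kernel callbacks take additional arguments (private contexts, scatter lists) that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified callback table. The real kernel interface
 * has different signatures; only the call order is illustrated. */
struct mock_peer_client {
    bool (*acquire)(uintptr_t addr, size_t len);
    int  (*get_pages)(uintptr_t addr, size_t len);
    int  (*dma_map)(uintptr_t addr, size_t len, uint64_t *bus);
    void (*dma_unmap)(uintptr_t addr, size_t len);
    void (*put_pages)(uintptr_t addr, size_t len);
    void (*release)(uintptr_t addr, size_t len);
};

/* Made-up peer that owns [PEER_BASE, PEER_BASE + PEER_SIZE). */
#define PEER_BASE ((uintptr_t)0x100000)
#define PEER_SIZE ((size_t)0x10000)
#define PEER_BUS  0xF0000000ull          /* invented bus window */

static int pinned;  /* ranges currently pinned by the mock peer */

static bool peer_acquire(uintptr_t a, size_t len)
{
    return a >= PEER_BASE && len <= PEER_SIZE &&
           a - PEER_BASE <= PEER_SIZE - len;
}
static int peer_get_pages(uintptr_t a, size_t len)
{ (void)a; (void)len; pinned++; return 0; }
static int peer_dma_map(uintptr_t a, size_t len, uint64_t *bus)
{ (void)len; *bus = PEER_BUS + (uint64_t)(a - PEER_BASE); return 0; }
static void peer_dma_unmap(uintptr_t a, size_t len) { (void)a; (void)len; }
static void peer_put_pages(uintptr_t a, size_t len)
{ (void)a; (void)len; pinned--; }
static void peer_release(uintptr_t a, size_t len) { (void)a; (void)len; }

static const struct mock_peer_client mock_gpu = {
    peer_acquire, peer_get_pages, peer_dma_map,
    peer_dma_unmap, peer_put_pages, peer_release,
};

/* Registration: 1 if the peer claimed and mapped the range, 0 if the
 * range is not peer memory (the caller would pin system memory
 * instead), -1 on error. */
int mock_reg_mr(const struct mock_peer_client *pc,
                uintptr_t addr, size_t len, uint64_t *bus)
{
    assert(pc != NULL && bus != NULL);
    if (!pc->acquire(addr, len))
        return 0;
    if (pc->get_pages(addr, len) != 0) {
        pc->release(addr, len);
        return -1;
    }
    if (pc->dma_map(addr, len, bus) != 0) {
        pc->put_pages(addr, len);
        pc->release(addr, len);
        return -1;
    }
    return 1;
}

/* Deregistration runs the matching callbacks in reverse order. */
void mock_dereg_mr(const struct mock_peer_client *pc,
                   uintptr_t addr, size_t len)
{
    pc->dma_unmap(addr, len);
    pc->put_pages(addr, len);
    pc->release(addr, len);
}
```

A caller would pass `&mock_gpu` and an address range to `mock_reg_mr()`, then hand the same range to `mock_dereg_mr()` when the region is destroyed.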



OpenFabrics Alliance Workshop 2017





#### PeerDirect Async How Does It Work?

- Allow peer devices to control the network card
  - Latency reduction, batching of management operations
- Two new supported operations
  - Queue a set of send operations to be triggered by the GPU: ibv\_exp\_peer\_commit\_qp()
  - Test for a "successful completion": ibv\_exp\_peer\_peek\_cq()
- Dedicated QPs and CQs for PeerDirect Sync
  - Avoid interlocking PeerDirect Sync with normal post\_send/poll\_cq
- Device agnostic
  - Currently built to support NVIDIA's GPUs
  - Can support other hardware as well, such as FPGAs and storage controllers

### **Transmit Operation**

Create a QP -> Mark it for PeerDirect Sync -> Associate it with the peer

1. Post work requests using ibv\_post\_send()
   - Doorbell record is not updated
   - Doorbell is not rung
2. Use ibv\_exp\_peer\_commit\_qp() to get bytecode for committing all WQEs currently posted to the send work queue
3. Queue the translated bytecode operations on the peer after the operations that generate the data that will be sent


#### **Completion Handling**

Create a CQ -> Mark it for PeerDirect Sync -> Associate it with the peer

1. Use ibv\_exp\_peer\_peek\_cq() to get bytecode for peeking a CQ at a specific offset from the currently expected CQ entry
2. Queue the translated operations on the peer before the operations that use the received data
3. Synchronize the CPU with the peer to ensure that all the operations have ended
4. Use ibv\_poll\_cq() to consume the completion entries







#### Performance Mode



[\*] modified ud\_pingpong test: recv+GPU kernel+send on each side.

2 nodes: Ivy Bridge Xeon + K40 + Connect-IB + MLNX switch, 10000 iterations, message size: 128B, batch size: 20

#### Economy Mode



[\*] modified ud\_pingpong test, HW same as in previous slide





#### Peer-to-Peer – Upstream Proposals

- Peer-to-Peer DMA
  - Mapping DMA addresses of a PCI device to the IOVA of another device
- ZONE\_DEVICE
  - Extend ZONE\_DEVICE functionality to memory not cached by the CPU
- RDMA extension to DMA-BUF
  - Allow memory region creation from a DMA-BUF file handle
- IOPMEM
  - A block device for PCI-E memory
- Heterogeneous Memory Management (HMM)
  - A common address space will allow migration of memory between devices




# **THANK YOU** Feras Daoud, Leon Romanovsky








#### Bytecode

```
#include <stddef.h>
#include <stdint.h>

/* Operation types carried by a peer_op_wr descriptor */
enum ibv_exp_peer_op {
    IBV_EXP_PEER_OP_FENCE          = 0,

    IBV_EXP_PEER_OP_STORE_DWORD    = 1,
    IBV_EXP_PEER_OP_STORE_QWORD    = 2,
    IBV_EXP_PEER_OP_COPY_BLOCK     = 3,

    IBV_EXP_PEER_OP_POLL_AND_DWORD = 12,
    IBV_EXP_PEER_OP_POLL_NOR_DWORD = 13,
};

/* Flags for fence.fence_flags */
enum ibv_exp_peer_fence {
    IBV_EXP_PEER_FENCE_OP_READ   = (1 << 0),
    IBV_EXP_PEER_FENCE_OP_WRITE  = (1 << 1),
    IBV_EXP_PEER_FENCE_SCOPE_CPU = (1 << 2),
    IBV_EXP_PEER_FENCE_SCOPE_HCA = (1 << 3),
    IBV_EXP_PEER_FENCE_MEM_SYS   = (1 << 4),
    IBV_EXP_PEER_FENCE_MEM_PEER  = (1 << 5),
};

/* One descriptor in a linked list of peer operations */
struct peer_op_wr {
    struct peer_op_wr *next;
    enum ibv_exp_peer_op type;
    union {
        struct { uint64_t fence_flags; } fence;
        struct {
            uint32_t data;
            uint64_t target_id;
            size_t offset;
        } dword_va;
        struct {
            uint64_t data;
            uint64_t target_id;
            size_t offset;
        } qword_va;
        struct {
            void *src; uint64_t target_id;
            size_t offset; size_t len;
        } copy_op;
    } wr;
    uint32_t comp_mask;
};
```
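To show how a peer might consume such a descriptor list, here is a minimal interpreter sketch. It deliberately uses its own trimmed types (`struct sketch_op`, `enum sketch_op_type`) rather than the structures above, covers only the store and copy operations, and assumes that `target_id` is an index into a table of base addresses. That last point is an assumption of this sketch, not something the slides specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trimmed, hypothetical descriptor covering only stores and copies. */
enum sketch_op_type { SK_STORE_DWORD, SK_STORE_QWORD, SK_COPY_BLOCK };

struct sketch_op {
    struct sketch_op *next;
    enum sketch_op_type type;
    uint64_t target_id;   /* assumption: index into a table of bases */
    size_t   offset;      /* byte offset from that base */
    uint64_t data;        /* value for the two store operations */
    const void *src;      /* source buffer for SK_COPY_BLOCK */
    size_t   len;         /* length for SK_COPY_BLOCK */
};

/* Walk the list in order and apply each operation. Returns the number
 * of operations executed, or -1 on an unknown target or type. */
int sketch_run(const struct sketch_op *op,
               unsigned char *const *bases, size_t n_bases)
{
    int done = 0;

    assert(bases != NULL);
    for (; op != NULL; op = op->next, done++) {
        unsigned char *dst;

        if (op->target_id >= n_bases)
            return -1;
        dst = bases[op->target_id] + op->offset;

        switch (op->type) {
        case SK_STORE_DWORD: {
            uint32_t v = (uint32_t)op->data;
            memcpy(dst, &v, sizeof v);
            break;
        }
        case SK_STORE_QWORD:
            memcpy(dst, &op->data, sizeof op->data);
            break;
        case SK_COPY_BLOCK:
            memcpy(dst, op->src, op->len);
            break;
        default:
            return -1;
        }
    }
    return done;
}
```

Because the operations are plain data, the CPU can build the list once and the peer can replay it later without calling back into the verbs library.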