Hi, I'm Yoshiro Yamabe. I work at the NTT Software Innovation Center, where I am currently investigating Remote Direct Memory Access (RDMA) technologies. Today I will give an overview of RDMA, a case study, several implementation techniques, and my conclusions.

First, RDMA's features. RDMA offers low latency and low CPU overhead, but it is hard to implement. For example, in a simple ping-pong microbenchmark, RDMA's latency is about one-fifth of IPoIB's latency. However, an RDMA program requires the programmer to understand and implement several low-level mechanisms: in that benchmark, the RDMA program is about 1,600 lines, while the TCP/IP program is only about 300 lines.

Still, RDMA's benefits are very attractive, so I tried to make use of its potential by applying RDMA to MXNet, an open-source distributed deep learning framework. MXNet adopts a parameter-server architecture, and this figure shows the architecture's processing flow. First, each worker computes its parameter update; second, it pushes the parameters to the parameter-server node; then the parameter server aggregates them; and last, each worker pulls the aggregated parameters back. That is two communications per batch, so I expected RDMA to be effective for this model.

Applying RDMA changes the data flow as shown in this figure. In the existing implementation there are four memory copies per push or pull: two user-to-user copies and two copies across the user/kernel boundary. In the RDMA implementation the user/kernel copies are eliminated, leaving only two memory copies per push or pull. In an ideal RDMA implementation the remaining user-to-user copies would also disappear, since RDMA can read the parameters directly; however, that is very costly to implement, so it is out of scope for this work and left as future work.

In addition to reducing memory copies, implementation techniques matter a great deal for achieving high performance with RDMA. This table shows four
representative implementation techniques used in this work and the implementation status at each step. Step 0 is my first implementation and step 2 is my latest.

The first technique is the choice of RDMA operation: RDMA write or RDMA read. In this work I used RDMA write only, so every step uses RDMA write.

The second technique is completion detection. There are two ways to detect a completion: polling or interrupts. Polling has high CPU overhead but low latency; interrupts have low CPU overhead but high latency.

The third technique is a ring buffer, which plays the same role as the socket buffer in the kernel layer of TCP/IP.

The last technique is separating the completion-detection thread, which mirrors TCP/IP's separation of receiving and data processing between kernel and user space.

I want to emphasize that the third and fourth techniques are important. RDMA bypasses the kernel layer, so it can remove the kernel-to-user memory copy, but it also bypasses the kernel-layer tuning that TCP/IP enjoys. Programmers therefore have to reimplement those kernel-layer tunings in their user programs. That, I think, is very important.

Now, the results. Step 2 is about 1.5 times faster than the step 0 implementation. I am happy about that, although the existing implementation is about as fast as step 2. Still, other tunings remain, so RDMA can become faster; I hope the upcoming step 3 implementation will be faster than the existing one.

In conclusion, RDMA is efficient, but several techniques are needed to achieve high performance. And, of course, it is necessary to choose appropriate applications for RDMA: in this workload and environment, the network load may not have been high enough, since the Tesla V100 GPU is very fast. My conclusion is that, to enjoy RDMA's benefits easily, libraries providing RDMA design patterns are needed, and a true zero-copy RDMA implementation is needed for higher performance. These two items are my future work. Thank you for listening to my presentation.