Okay, so we are here to talk about 100 gigabit networking. This is some cool new stuff that we've been investigating. We both work for a financial company in Chicago, and we operate a lab where we test new gear as it comes out; we also interact with the vendors and make sure that new products satisfy the requirements of our company when we want to deploy this stuff. Fernando works hands-on with the equipment and has practical experience with how these things work, and I'm more the old kernel developer who does the architectural stuff, so I'm more hands-off. But I can tell you the story of how this all comes together, and Fernando is going to give you the details on measurements and things like that. If you have a question, just say something — I like interacting with the audience; I really don't like to give monologues up here. I'll try to look into your faces: if I see you're bored I will speed up, and if you're interested and make some kind of movement, I will explore the area more, okay?

So why are we doing this? The capacity and speed requirements for data links keep on increasing. This is especially true for wide-area providers like telecommunication companies. Most of these links today are 10 gig, so on one fiber you have 10 gig going across. With 100 gig you can move ten times that amount of data on the same link, so the same infrastructure supports ten times the capacity, and right now there is a large amount of work going on in the telcos to change the basic speed from 10 gig to 100 gig because of the higher efficiencies.

The other thing is that 100 gig is now possible because most of the hardware we have today — the Broadwell and Skylake Intel processors — has gotten to the point where it can actually sustain 100 gig data reception to memory. The earlier processors, and sometimes even Broadwell, struggle with these speeds, and if you have a memory subsystem that can't take the data in from the network, then you probably shouldn't be using this. But we're getting to that point now, and it's very interesting, because the network speed now roughly follows the saturation speed of the memory subsystem — there's a kind of correlation going on there.

Then we also have a huge amount of development now in machine learning: artificial intelligence, algorithms that learn on their own and extract new information from data. Machine learning is better the faster you can go through iterations and the more data you can process. With 100 gig you can bring the nodes of a compute cluster much closer together; much more data can be exchanged between these algorithms, and therefore your machine learning is much better if you have a higher-speed interconnect. So high-end deployments are going on this year: a lot of the new research clusters being built in the United States are on 100 gig in one form or another, they have the corresponding processors, and there are pioneering projects going on from various companies that compete with one another.

Then there's the US Department of Energy. They are funding new developments in the computer industry, and there's this vision of the exascale computer for 2020 that has been around for more than a decade. The intent is to build a much more powerful supercomputer that can
do one exaflop of computation. I think we are now at 100 or 200 petaflops, so we're gradually getting there. In order to get there, we need many more computational elements that are much closer together and communicate at very high speeds, and 100 gig networking is one of the key components that will enable this exascale vision of the Department of Energy. Therefore there's also funding available from the government; it is looked upon as a strategic investment by the government in the area of supercomputing.

If you look at the slide here, there's a diagram where you can see which Ethernet standards are deployed and in development. We had the 1980s, where we had 10 megabit — and actually much of the Internet was designed around 10 megabit. Then we moved to 100 megabit, and around 2000 we got into the gigabit era. 10 gigabit Ethernet, which I think is widely deployed these days, came about a decade ago. Then we have 100 gigabit, which until now was provided by the WAN providers, and there's the expectation that we're going to end up with 400 gigabits per second by the end of the decade. There are also fractional elements here, of 50 gigabit and 25 gigabit, that are going to be important for the development of the higher speeds.

So what kind of technologies are available for 100 gig networking? The question is how much you can put on a strand of fiber, and things like that. There's an old standard called the CFP standard, which requires ten links of ten gig each. That actually works with the existing old infrastructure of the telcos: you just give it ten strands and it can put 100 gig on there. That has been out there for more than a decade now, so it has been in use for a while for specialized long-distance applications. It requires huge converters at each end; it's a lot of effort and it's very expensive.

The new standard does four strands of 28 gig links each, which is called QSFP28. This is comparable to the QSFP standard — QSFP with 4×10 gigabit is the 40 gig standard — and this is the evolution of that. It requires higher modulation on the wires, and that gives you the increase in efficiency that the long-distance providers seek, and it also gives us the ability to network the components of a cluster closely together. This brings the hardware down in price, from those huge gizmos to the small transceivers that you know from the 40 gig and 10 gig standards, and that will fit into each server — the huge 10×10 transceiver really can't be built into a host; it's mostly termination equipment.
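As a cross-check on the "four strands of 28 gig" figure (my arithmetic, not from the talk): each lane actually signals at 25.78125 Gb/s, and the usual 64b/66b line encoding eats the difference, so the four lanes deliver exactly 100 Gb/s of payload:

\[ 4 \times 25.78125\ \mathrm{Gb/s} \times \frac{64}{66} = 100\ \mathrm{Gb/s} \]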
So this QSFP28 form factor is what actually makes 100 gig feasible, makes it cost-effective, and makes it possible to deploy for a cluster.

Then there are various standards you can use for 100 gig networking. Usually what comes first in networking at a given speed is InfiniBand — it is maybe two to three years ahead of the curve. This is the standard pushed by the Mellanox corporation. They have switches that allow transitioning from lower InfiniBand speeds to higher InfiniBand speeds, and it seems this is the most mature technology in existence today; all the pieces are available, and most of our testing focuses on the 100 gig InfiniBand solutions from Mellanox.

Then Ethernet. Well, there was some early deployment of server-to-server communication in 2015, but that deployment focused on just using the Linux network stack, which gives you results with certain limitations on the speed. The company that deployed it was very much against using offload capabilities — and it wasn't one of the major companies — so a lot of effort went into just getting the basic network stack right and doing 100 gig Ethernet from the network stack. The switch that was used for that purpose had some issues; the early switches were recalled and reissued in March, and hopefully they're fixed now. There were some NICs under development, but they couldn't get any NICs in time for the deployment, so they used the EDR NICs from Mellanox, because those can also be switched into an Ethernet mode. So what that large deployment ended up with was basically Mellanox NICs in Ethernet mode with a specialized Ethernet switch. This has all been resolved now, and I think it's more mature — probably now would be the time: if you want an Ethernet fabric at 100 gig, you can do this.

Then there's Omni-Path, which is Intel's answer to InfiniBand. They're trying to challenge Mellanox and take over the area, because they see it's very important for the future — for this exascale vision of the government and for machine learning — and Intel really wants to get into that area. So they redesigned the serialization and made it software-compatible with InfiniBand: if you have an InfiniBand application, you can run it on Omni-Path. It supports more nodes, because the InfiniBand limit is, I think, 16k or 10k nodes, and the exascale vision calls for 100k to a million compute nodes. And of course they tout production readiness, but we had a lot of issues with our testing — we thought this was going to be more of an alpha thing, and it seems it is getting to the point of working. That's the Omni-Path solution; version two probably comes out next year, and we hope that one will be much better and much more competitive with InfiniBand.

Then the form factors. Here we have the CFP form factor, the 10×10 — that has been around for a long time, but it requires a lot of power and complexity to handle. Then the CFP2, a kind of shrink of that; then the CXP, which is another shrink of it, almost getting to a usable form factor. And then there's the QSFP28, which is a new connector that you also know from the 40 gig standard.
It looks the same and works the same way, and with that we can actually upgrade existing setups from 10 gig to 100 gig.

Then, once 100 gig was actually out, people said: okay, it's four times 25 gig, and we don't really want 100 gig down to a server — most servers are older and can't take 100 gig anyway. So the idea was: all right, we want a high fan-out, and we have these switches with lots of ports, so we're just going to use splitter cables to split one 100 gig port into two times 50 gig or four times 25 gig, and then we have a huge fan-out. That resulted in the approach of breaking ports into 50 gig or 25 gig links to increase the port density. The switches typically handle 32 links of 100 gig, 64 links of 50 gig, and actually 128 links of 25 gig. That is a massive number of servers that you can bind together with just a single switch — this has never been seen before — and if you don't need the full speed, it gives you quite a nice way to scale.

But this came late, and the vendors got ahead of things before the standards were completed. This year the 25 gig standard was finally completed — hopefully; I haven't heard of issues with that one yet. The 50 gig standard is also in the works, probably completed by 2020, and that may be where things fall in the future, because storage and memory speeds keep increasing, so it's a convenient compromise.

Then what happened is: okay, the QSFP28 connector was four strands; they cut it down to a single strand, and that gave rise to the new connector and transceiver standard SFP28, which looks the same as the SFP connectors, just with a higher rate on the wire. All of this allows an easy upgrade path from the existing 10 gig deployments to 25 gig or 100 gig.

And we have some interesting cabling here. This adapter is basically QSFP to SFP28 — you see, this end of the cable goes into the switch, and the other end has four tails of 25 gig each. So this gives rise to a couple of these connectors: QSFP to QSFP, and then the octopus cable here.

Now, something on the switches that are available. For InfiniBand we have an EDR-based switch with 36 ports; this was actually ready in the first quarter of 2016 and is called the SB7700 series. Those ports cannot be split — InfiniBand doesn't have that capability; the four lanes are firmly bound together. Then Broadcom came out with the Tomahawk chip, and that actually can be split down to 25 gig segments. It was first released in the fourth quarter of 2015 — with issues — and was recalled. The Tomahawk chip is built into various switches by numerous vendors and sold in various form factors, but you basically get the same underlying switch technology from all these vendors. Then there's the Mellanox Ethernet switch, called the Spectrum switch, which can be split into the two 50 gig strands, because 25 gig was not considered viable at the time; it was also delivered in 2015, and there are continual firmware improvements. And then Intel has the Omni-Path switch with 48 ports — it had to be higher than Mellanox — which was released in the second quarter of this year and is called the 100 series.

So those are the basic form factors. Now let's see what actually happens here. Let's look back at what we had before, in the era when we had 10 megabit. There you had about a hundred nanoseconds per bit, right?
And if you wanted to receive an MTU-size frame of 1500 bytes, it took about 1.5 milliseconds. The time for a 64-byte packet was 64 microseconds, and you could receive about 10k packets per second. 64 microseconds is sufficient time to process a packet, send it back, and do something with it — no problem. That is what TCP was designed for.

Then we moved to 100 megabit, and now you have 10 nanoseconds per bit: 150 microseconds for a large packet, 6.4 microseconds for small packets. So we can receive 100k packets per second, and in a 10 microsecond window we may get two packets if they're small. If you get a stream of small packets, we may not be able to process them fast enough — I think it takes 5 to 10 microseconds to process a packet — but that buffers out very well, so fundamentally we didn't see an issue there.

When we get to one gig, you have one nanosecond per bit, and now the time for a 64-byte packet is 640 nanoseconds. So you can receive a million packets per second, and in the 10 microsecond frame you can receive 20 packets if they are small. At that point the kernel and the network stack are not able to process this anymore: if you send a large stream of small packets to Linux, at some point you will cause an overrun and the system won't be able to handle it, or we get into a high-latency mode. That's where the problems start. Usually with one gig we can still buffer all of that — this is only a problem for very small packets. So let's hope that most packets are of a larger size, maybe 1500 bytes; then we still have 15 microseconds to process a packet, which is good enough, and if there are small packets in between, we can buffer them, it works out, and the application can still handle it.

When you go to 10 gig, the situation gets really worse. With 10 million packets per second, if they're very small, this is a huge problem — we can't even generate that many interrupts, and we have various techniques to avoid interrupts at that point. And even if you get large packets: in 10 microseconds you get six packets of the largest frame size, and you can't process that anymore either. Therefore, in the 10 gig time frame, we added a lot of little hooks to the kernel: to distribute the packets over multiple cores, to do buffering, to make the packets bigger by the time they actually reach user space. All sorts of fancy things are going on to remedy the situation and still be able to make use of the full bandwidth of a 10 gig link.

Now we're going to 100 gig, and the situation gets absolutely catastrophic. A 1500-byte packet takes 150 nanoseconds. There's no time for you to process that, and you can get 60 of those maximum-size packets within the 10 microsecond window. You will never be able to process this stuff at full speed. This means that the mechanisms we introduced in the 10 gig time frame to compensate must become even more sophisticated — and we might as well find other ways to process this data.
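For reference, here is the arithmetic behind these numbers. The talk's round figures work out if you count roughly ten bit-times per byte on the wire, which absorbs the preamble and framing overhead:

\[ t_{\mathrm{bit}} = \frac{1}{\mathrm{rate}}, \qquad t_{\mathrm{packet}} \approx 10 \times \mathrm{bytes} \times t_{\mathrm{bit}} \]

At 100 Gb/s, \( t_{\mathrm{bit}} = 0.01\,\mathrm{ns} \), so a 1500-byte frame occupies the wire for about \( 10 \times 1500 \times 0.01 = 150\,\mathrm{ns} \), and a 10 microsecond window holds \( 10000 / 150 \approx 66 \) of them.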
Then flow steering was added. With flow steering, the NIC can figure out which network streams are coming in and where they are destined, so a packet does not enter the general network stack: it is pre-selected by the NIC for a certain processor, for a certain socket. The central action of the host is taken out of the critical data path; the packet ends up in user space directly via a separate queue, and that avoids serialization and allows us to support concurrent flows that in aggregate reach the full bandwidth of a 10 gig link. But note: no individual process or core is able to receive the full bandwidth, because there is still enough per-packet processing — even though it's not serialized anymore — probably in the range of three to five microseconds per packet. So you can't really receive the full bandwidth in one stream, and a lot of applications in the 10 gig era try to split their data streams in order to receive the traffic at full rate. We now have switch-like logic on the NIC: the NIC routes traffic to a certain core in order to avoid serialization on a central resource.

And here are some time frames for what you have to work with: a random memory access takes about a hundred nanoseconds. Remember, it takes a hundred and fifty nanoseconds for a 1500-byte packet to come in. This is a bit bad: a random access fetches a 64-byte cache line, and you have 1500 bytes coming in, so you can only keep up if you stream to memory.

The 100 gig NICs that we actually worked with: the ConnectX-4 adapter. It supports EDR InfiniBand and 100 gig Ethernet and has very sophisticated offloads. They have done what they could to improve on the existing flow steering scheme: the NIC actually has a flow steering macro language where you can program the chip to do various funky things with a packet, so that you don't hit the kernel for serialization. It is also multi-host. The interesting thing here: if you have a dual-socket machine — since we have the problem of getting all the data into one socket — they have split the PCIe lanes so that some pins go to one socket and some to the other. If you put the NIC into such a machine, it can distribute the load over the two sockets and basically service two processor chips at once. That is just the common two-socket arrangement — it actually has the ability to serve four processors at the same time, so you can dish out the PCIe lanes, put in say two Xeons and maybe two ARMs or a GPU, and it can serve all of these and do hardware routing to each processor. The NIC becomes a very sophisticated switch that forwards data into the memory of various processors and attached devices — because you would otherwise easily overrun a single one.

Then we have the Intel Omni-Path adapter. This is focused on MPI, and it does Omni-Path only. Here we have a redesign of the InfiniBand fabric to remedy all the deficiencies that have been seen over the last few decades. It's called a host fabric adapter; it supports more nodes and larger transfer sizes than InfiniBand, but it is its own protocol only — you must enter the Omni-Path world if you want this — whereas with the ConnectX-4 adapter you can play around with Ethernet and InfiniBand and do various other things.

So how do you actually work with this stuff? There's the socket API that you usually use — send message, receive message, and so on. There is a huge number of applications that run using this API, and the largest group of developers knows how to program against it. But with the socket API you are relying on kernel calls, and that limits the throughput.
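For contrast with what comes later, this is the kind of per-message kernel round trip the socket API implies — a minimal sketch (UDP for simplicity, hypothetical port number); every received datagram costs a system call plus a copy from kernel to user memory:

```c
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);            /* UDP socket */

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                        /* hypothetical port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* One system call per datagram: the kernel receives into its own
         * buffers first, then copies into buf. The syscall overhead and the
         * extra copy are what cap the packet rate at high link speeds. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0)
            break;
        /* ... process the packet; at 100 Gb/s the next one is already here */
    }
    close(fd);
    return 0;
}
```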
Then there's block-level file I/O — another POSIX API. There's stuff like NFS, which can do NFS over RDMA. There you can basically say to the kernel: I'm mapping this area, and NFS over RDMA does direct transfers from another host into the memory of the system and does not buffer in the kernel. Socket APIs usually assume that you receive buffers into kernel memory and then copy the buffers into user-space memory, and that is additional latency; it makes it impossible for you to process the data while it's being received. So NFS over RDMA is kind of an intermediate step.

And then we have the RDMA API. These are one-sided transfers: you say, okay, I have this memory area available, and you tell the other side, you can transfer at liberty into this area — without going through the kernel first. This allows you to receive large amounts of data directly, at full bandwidth, into memory, and the higher the speed goes, the more you have to rely on these RDMA APIs to do direct transfers. You basically talk directly to the hardware; you can avoid system calls and the overhead there, and that actually allows you to take very high-volume data streams into memory.

Then a new API has been developed called OFI, mostly by Intel engineers. They want something nicer than the RDMA API, which is said to be very difficult to program, so OFI is an API for interacting with the fabric in a more abstract fashion. It's pretty new, and we're not sure where it's going yet.

And the last solution is: maybe we don't want to deal with this stuff in an abstract way at all — we want to program the chip and its registers directly. With DPDK you get the registers of various NICs at your disposal in user space; you can manipulate those things and then do whatever you want. That is for the hardcore who want to mess around at the lowest level. The RDMA API still has an abstraction layer — it is a defined way to interact with any hardware that supports these transfers — but DPDK puts you down at the register level, where you can program with full freedom, and some people see that as the ultimate solution.
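To make the one-sided RDMA model above concrete, here is roughly what a single RDMA write looks like at the verbs level. This is a sketch, not a complete program: it assumes the expensive setup (device open, protection domain, connected queue pair, and the out-of-band exchange of the peer's address and rkey) has already happened, and the function name is mine:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Push 'len' bytes from 'buf' directly into the peer's memory at
 * remote_addr/rkey, with no kernel involvement on the data path.
 * pd, qp and cq are assumed to be set up and connected already. */
int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Register the buffer so the NIC may DMA from it. In real code this is
     * done once up front: registration is far too slow to do per message. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: peer CPU idle */
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion     */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr)) {       /* hand off to the NIC */
        ibv_dereg_mr(mr);
        return -1;
    }

    struct ibv_wc wc;                            /* spin for the completion */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    int ok = (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    ibv_dereg_mr(mr);
    return ok;
}
```

Compile against libibverbs (-libverbs); the peer must have registered its buffer with IBV_ACCESS_REMOTE_WRITE and shared its address and rkey beforehand.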
So, if you try to use the socket API with 100 gig: if you have a fast talker, it's easy to overrun the buffers. In a very short time frame it can transfer a few hundred megabytes or a gigabyte into memory, and that will overrun any reasonable buffering configuration you have in the kernel. You can use the flow steering of the socket API to scale and have multiple processors serve the same NIC, bypassing the centralized queue there. We can also generate per-processor queues for sending, so the send path doesn't require a centralized submission queue either — every queue submits directly to the NIC, which increases the sending speed. Then there are the offloads for sending and receiving large amounts of data: we can give the NIC a 64k buffer, and it will chop it into pieces of 1500 bytes each to stay compliant with the Ethernet frame sizes. So there are numerous tricks that are there, that you have to be aware of, and that have to be configured correctly.

Then, if you use a protocol with congestion control, you may not be able to use the full bandwidth, because TCP was really not developed for 100 gig. What you see in practice is bursty behavior that causes long latencies, and you're going to fight with the congestion protocol until you have something that actually works well. So a lot of people just use UDP and forget about congestion completely.

Also, the socket APIs can't be used directly on Intel's fabric or on InfiniBand. There is an implementation that emulates Ethernet on top of InfiniBand, but it has various non-Ethernet-like semantics and is full of surprises. If you want to try it, most things will work, but you'll probably have to chase down some details here and there.

So the InfiniBand API — the RDMA API — is based on this: you have memory buffers registered with the NIC, and the OS is on the side. It facilitates the permissions to the memory, it sets up the connection and so on, and then one application can say, okay, this memory range is now transferred to the remote host, and the hardware will do exactly that — directly, without OS intervention, memory to memory. That allows you to avoid the overhead of the kernel stack and go straight from memory to memory, which lets you use the full capacity of what your memory subsystem and the network can offer.

This is natively InfiniBand, but later a variation of it was developed on Ethernet, called RoCE, which allows you to do the same thing over Ethernet. The original RoCE was not routable; RoCE version 2 is routable. People have actually built clusters using Ethernet this way, because they didn't want to use something like InfiniBand — Ethernet makes it easier to adopt, the existing tools work, and you can easily route inside the cluster. The problem is that this is not that mature; there are some unusual things going on, and you need specialized gateways for some of these issues. Also, RoCE — and the RDMA API in general — assumes that you have a lossless fabric. So standards were added to Ethernet, called the data center bridging extensions, and with those extensions you can actually make an Ethernet network lossless as well; only if you do that will RoCE actually work. So this is a bit deceptive: you may have much more effort to get your Ethernet fabric to work than you think. It's not as easy as simply running RoCE over the Ethernet protocol.

Then there's OFI, the new fabric interface. This is a recent project by the OpenFabrics Alliance to redesign the API; it's based on RDMA and driven by Intel. It also allows you even more kernel bypass: you can put the logic to steer the NIC, and all of that, into a library, so you avoid having to upgrade the kernel when you get a new device driver. There are some issues, though — multicast is not there yet, which kills our whole use case here.
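A small taste of OFI/libfabric: before any communication, you ask the library which providers can satisfy your capability requirements, and it hands back fabric descriptions — the hardware-specific steering logic lives behind this interface. A hedged sketch of just the discovery step (API as of the libfabric 1.x releases contemporary with this talk):

```c
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    /* Describe what we want; the provider (verbs, psm2, sockets, ...)
     * that can satisfy it is chosen by the library. */
    struct fi_info *hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;     /* reliable datagram endpoint */
    hints->caps          = FI_MSG;        /* basic send/receive messaging */

    struct fi_info *info = NULL;
    int ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %d\n", ret);
        return 1;
    }
    /* Walk the matching providers; the first entry is normally used to
     * open the fabric, domain and endpoint objects. */
    for (struct fi_info *cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```

Link with -lfabric. The point of the design is that a new NIC shows up as a new provider behind the same calls, rather than as a new kernel interface.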
So what kind of software actually runs this stuff? EDR with Mellanox ConnectX-4 is available in Linux 4.3, and Red Hat 7.2 fully supports it. 100 gig Ethernet with Mellanox ConnectX-4 is in Linux 4.5, and 7.3 also supports it, with RoCE and everything. There's also some support in 7.2, but it's preview quality only — you can't do too much with that. For Omni-Path, the drivers are in Linux 4.4 staging, and it's currently supported via an out-of-tree kernel driver from Intel; they're working on pushing that into Red Hat — I'm not quite sure where that stands right now — but the intention is for it to work the same way as the EDR stuff from Mellanox and be native in the distros.

And you want to talk about the tests?

They told me to get real close to the microphone, so hopefully you guys can hear me. This is our test setup. We had some Omni-Path cards, some ConnectX-4 cards, and the Omni-Path switch — which is the version one of Omni-Path — as well as the EDR switch from Mellanox, the SB7700. For some reason we decided to put one gig up there too; if you're doing performance work with one gig, at least move to 10 gig. As you can see, the latency with one gig is terrible. Once you get to the higher packet sizes, one gig is basically worthless, but 100 gig Ethernet and EDR basically follow the same line, so you get about two microseconds of latency throughout. As you can also see, Omni-Path — because this is their version one of the technology, I guess — has somewhat higher latency than what you would expect from an InfiniBand fabric. For financial data and trading, latency is essential, and that's why we say one gig is worthless: because of the extremely high latency you get with it. Like Ricky Bobby's dad said: if you're not first, you're last.

Here you can see some bandwidth measurements. The 100 gig Ethernet line is hard to see, but it basically follows the same line as EDR, which is the blue one that is able to go to line rate at 12 gigabytes a second. Basically, EDR and 100 gig Ethernet are probably the best solutions you can get right now. And it seems, speaking to vendors, that because 200 gig got delayed, it might never be a thing — the 400 gig timeline is slowly coming through. So basically just focus on 100 gig, or on 400 gig when it becomes available.

Again, most of the stuff that we do in-house is multicast, and these are some of the multicast latency results that we got. For Omni-Path, they told us not to pay too much attention to these latency results, because they will fix it at some point. That's what Intel said, so hopefully they do.

Here we have a comparison between RDMA — the RDMA side is the same test as before, which is ib_send_lat — and a test called UDP ping, which uses sockets. As you can see, the socket latencies are pretty bad, even with this small payload test. We'll see the big payload test in a second.
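For the curious, the sockets side of such a ping-pong latency test is only a few lines — this is a hypothetical sketch of the client half, not the actual harness we used (the server just echoes the datagram back; address and port are placeholders):

```c
#include <stdio.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define ITERS 100000

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9000);                        /* hypothetical server */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);    /* placeholder address */
    connect(fd, (struct sockaddr *)&peer, sizeof(peer));

    char buf[64] = { 0 };                               /* small-payload test */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        send(fd, buf, sizeof(buf), 0);                  /* ping */
        recv(fd, buf, sizeof(buf), 0);                  /* pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("avg round-trip: %.2f us\n", us / ITERS);    /* halve for one-way */
    return 0;
}
```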
So this is a packet being sent and then waiting a while, just to see how much the processing overhead is — and the kernel has a very high overhead there; fast processing of packets means you have to cut that latency down as much as possible. It probably doesn't matter too much, but note that these Ethernet tests are cross-connected, so there's no switch in between; there might be a little bit of difference once you plug in an Arista or a Cisco or something like that. And this is the big-packet test. Again, I don't know why one gig is in there — just for fun, I guess. Most people have a one gig network, so I think this comparison is pretty important.

All right, that's it for the testing.

Okay, so here are some references to more material; the presentation is also on the website, so if you want to find these links and read more about this, you can find them there. And I have some things here that I would like to do in the future with this stuff. I'm trying to organize an RDMA workshop right now for the Plumbers Conference and Kernel Summit, so we're going to have a full day talking about the issues that we're wrestling with here, on this level. What I want is basically a full integration of RDMA and direct messaging into the Linux network stack, which has been a contentious issue for the last decade or so. We really need easier-to-use APIs for this stuff, and we need full integration with the network tools, so that we can use these high-speed adapters transparently. Currently this is more like a sidecar of the kernel, right? Either you are in the network stack or you are in the RDMA subsystem, and the two have different tools, different techniques — but fundamentally, messaging is very similar, almost the same, in both. So you would like to have this operate like a regular network device. The problem is that this removes the kernel from the data path, and that is one of the issues the network stack developers have with it. But then again, the flow steering mechanism already takes the kernel out of the data path to some extent, and this is just going further. So I hope we can have some fruitful discussions at the Plumbers Conference to figure out how to make this easier to use, because I think there's no way to avoid it in the future.

Actually, what I've heard from the vendors is that they intend to start a war now over who has the fastest network: every two years we're going to double the speed of the networking, and they have the idea that in the mid-2020s we're going to end up with one terabit to a server. If we want to keep up with that in some fashion, we need to find better ways to deal with this, and since we can't increase the speed of a single core to any significant degree anymore, we need some other way to get the data directly into memory, or something like that.

So with that — we have some more time for questions; otherwise, I have some more material. Yes?

[Audience question about how a packet travels over the four lanes.] Yes, the packet gets serialized over all four links. If the switch supports it, it can also do this over just two links and you get 50 gig — but that's a different modulation then. And there's de-skew logic in the transceivers and so on; they've been working on that for a while.
So they've been working on this for Okay, the question is what is the best practice of sending the empty you for a hundred gigabit Well, this depends on what you want to do with it If you want to transition to a regular ethernet fabrics that have a 1500 bytes, which is typical then you can't do nothing You just stuck with a 1500 byte approach. Otherwise, of course, it's advantageous to have jumbo frames main 9k frames But fast fast, I know all Diplomars that I've seen are at a 1.5k for the empty size Yes Well, the reason that that 9k is not used that much is because you now have a lot of stateless offload in the kernel This means you can give a 64k frame to the kernel and the nick will notice Okay, this is 64k. I have to hack this into pieces and 10 1500 bytes a piece And so if there is gear at the other end that can recognize the situation you can reassemble the 64k frame Otherwise, it's just regular gear that will forward the 1500 byte frames. So and I need to net this is not that critical anymore on On each net typically it's 1500 bytes or 9k. I think a 9 meg a 9k 9k No on an infinity band you have 64k so you can get a good greater to extremely large frames on that level, but As far as I can see it They are all feathers feathers 1500 bytes and the stateless offload and even 9 9k is not used that much So I would be skeptical if I if you want to try it in that you may be running two issues there The 1500 bytes is pretty standard and especially if you want to reach the internet and stuff You cannot avoid that the most application at some point hit the net near Xdb. Oh, I haven't I haven't looked at that indeed yet. Sorry Hey, he was asking if I could comment on xdp, which is a new technique To kind of avoid the doorbell issues and and stuff and I haven't looked at that It's also didn't come up and any of the synonyms that we had and we try to run Yes, I'll run off the mill as possible without getting too much booked down into details of Colonel stuff Speed. Yeah Well, I know that there's plans on the book to double this and triple this and four quadrabits any further So they probably gonna go to 50 gig series and 100 gig series as soon as I can and that is Their hope that they would that I can double and quadruple the speed again And so we have two major companies Intel and Milanox waging this wall right now who gets there first And because they see this is a major issue for Connected clusters and for the machine learning and everybody wants machine learning At 100 gig it means what you want to do if you wanted to do large transfers into memory That's easy. 
And now we also have adapters with dual 100 gig connectors, so actually you can already do 200 gig. If you have two sockets, you have two processors and each has its own memory subsystem, so you have to make sure that one stream goes to one processor's memory and the other to the other's — there are all these problems that may arise. And the question is also: what happens if you saturate the memory subsystem with a transfer that takes a couple of seconds? Can the system still do anything, or does it just fall over? We haven't tested that; it may be very interesting to see.

More questions? Let me see if I have anything else here.

Yeah, looking ahead: I think 100 gig is maturing and it's going to be generally rolled out. You have 200 gig available in 2017–2018, as promised by both vendors, and they're talking about even faster links by 2022. Maybe they're just making promises and trying to one-up one another — that is what they do. But the software really needs to mature: we need the OS network stack to handle these speeds, we have to get proper APIs, we have to deal with the memory throughput issues, and we need a deeper integration of CPU and memory. There are some thoughts that maybe we should just put the network connection directly onto the die and have the processor deal with it directly in the L1 and L2 caches. At some point, when you can't reach memory anymore at these speeds, this may make sense: the L1 and L2 caches are faster, so you may be able to operate at that level in a much faster way and then be selective about what you actually commit to memory. But for that you need new processor architectures that can handle this stuff.

Which brings me to the end.
Okay, if anybody wants to get involved, please talk to me. Again, we have these meetings at the Kernel Summit and the Plumbers Conference to talk about these issues and to try to find ways to improve the situation. We also have an extra meeting outside of the Kernel Summit and Plumbers Conference, on the following Saturday, to deal with issues that we weren't able to get through during our day at the Plumbers Conference, and so on. We also have a mailing list — the linux-rdma mailing list on vger.kernel.org — if you're interested in seeing what's going on there. I'd love to have more people involved in this.

Any more questions? Yes?

[Audience question about Omni-Path.] It's not there, no, sorry — that's why we're saying we're hoping for Omni-Path version two, which comes out next spring.

[Audience question about Intel integrating the fabric with the processor.] Intel does have this idea of connecting the fabric directly to the chip — you have the right idea; the question is when we are getting there. The problem is — I tried to look into this in more detail at a photonics conference at MIT a year ago, and I had a series of photonics specialists tell me that what Intel is saying is absolute nonsense and can never be done, because the photonics process is different from the regular processor process: you can never integrate the photonics chip with the main processor. So what I'm seeing now is that there are two chips on the package, one for the network hookup and one for the processor, and they are linked via PCI Express. We're still hoping for the best in the future, but we're watching the situation, and hopefully we'll have something useful at some point.

Okay, so there are no more questions — thank you all for coming. Talk to me later if you want.