I'll introduce myself as well. I'm Philip Paeps. I'm a FreeBSD developer, and I'm also a director of the FreeBSD Foundation. The FreeBSD Foundation has a booth downstairs; if you are interested in FreeBSD stickers or the FreeBSD Journal, you should definitely turn up and ask for one, or take one if I'm not there.

So this presentation is about modern network servers. By network servers, I mean things that you would run inside your network, providing network services as opposed to application services. I'm not necessarily talking about web servers, but things like authentication services; this particular presentation goes into DNS services, how you would run those on ARM64 rather than on AMD64, and why you would want to do that.

I originally wrote this presentation for network operator groups, so I picked DNS as an example. I got some numbers from DK Hostmaster, and when I say "I got", I mean of course that I stole these numbers from someone else who got them from DK Hostmaster. DK Hostmaster runs the .dk ccTLD, and Denmark is a smallish European country. The ccTLD zone gets a reasonable amount of traffic, and I think it's representative of a typical top-level-domain DNS setup.

As of December 2016, so about a year and a bit ago, nearly a year and a half ago, the traffic on the .dk DNS servers was about a thousand queries per second. Compared to, say, the roots or .com or other larger DNS domains, this is pretty much a rounding error. There are about 1.3 million .dk domains in the .dk zone. 11,000 of them are signed, which is comparatively low even against the sad state of DNSSEC deployment in the world. The main effect on this presentation, as I'll show later, is that the DNSSEC signatures do increase the zone file quite a bit; I'll talk a bit later about what that affects. The zone file is about 190 megabytes of text in total, for 1.3 million domain names. Most domain names will just have one or two NS records and maybe a glue record, but still the zone file is 190 megabytes of text. That's because all of those 11,000 signed zones have DS records as well, and that sort of blows up the zone file size. Not a big problem. Depending on the name server, this 190 megabytes of text expands to about a gigabyte of space in memory.

Yes? "Do you have a ratio of how much of that 190-megabyte zone size was DNSSEC records?" Yes, but do I have it on the slides? No. Off the top of my head, I would guess about 30% of those 190 megabytes is probably the hashes, because you have two hashes and they're fairly long. So I'd say about 30%, but don't shoot me if that's wrong; I can calculate the numbers later on.

But depending on the name server, it expands to about a gigabyte in RAM. When you load the zone file, the DNS server is going to put it in a format that makes sense to the DNS server rather than to the author of the zone file. So about a gigabyte in RAM, which again is not a very large number, but it's something to take into account.

The CPU load of these DNS servers is approximately zero. When the zone is loaded, to take this 190 megabytes and expand it and do all the funny things with it, there's a little bit of load, but at runtime the CPU load is approximately zero. The CPUs are not doing much work. And these DNS servers see about 10 megabits per second of continuous traffic. So not a huge amount of traffic, right?
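Before going on, here is a quick back-of-the-envelope check of those zone numbers; the ~100 bytes of zone-file text for a plain, unsigned delegation is an assumption for illustration, not a measurement:

```python
# Rough sanity check of the .dk numbers above. The zone size and domain
# count come from the talk; the plain-delegation size is assumed.
ZONE_BYTES = 190 * 1024 * 1024   # ~190 MB zone file
DOMAINS = 1_300_000              # ~1.3 million .dk delegations
PLAIN_DELEGATION_BYTES = 100     # assumed: one or two NS records plus glue

print(f"average bytes per delegation: {ZONE_BYTES / DOMAINS:.0f}")  # ~153

# Whatever exceeds the plain-delegation baseline would be DNSSEC overhead
# (DS records, RRSIGs, NSEC3 chains and the like).
dnssec_share = 1 - (PLAIN_DELEGATION_BYTES * DOMAINS) / ZONE_BYTES
print(f"implied DNSSEC share of the zone file: {dnssec_share:.0%}")  # ~35%
```

Which lands in the same ballpark as the roughly 30% guessed above.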
If you have a web server or any application service, 10 megabits of traffic is not a lot. It's just a rounding error.

So what do the DK people run? They have four unicast servers; I think three of them are inside Denmark and one of them is somewhere else. They have seven anycast clouds. I don't want to talk about anycast too much, but DNS is a UDP service, so it's very easy to anycast. They have seven clouds, and in total about 120 nodes. Across this entire infrastructure, which runs completely off-the-shelf Intel servers with FreeBSD, they see resource utilization of about two to five percent. That's all resources, memory and CPU, across the entire infrastructure: about two to five percent. What these machines are mostly doing is heating up the buildings they're standing in. They're not really doing a lot of work.

They are using a mix of BIND and NSD, and I think also Knot. Actually, I know for a fact that they also use Knot. The .dk servers are authoritative only, so they don't run any Unbound. I don't think they run PowerDNS; I could ask, but I don't know. But certainly Knot and NSD.

So let's talk about network servers. What do they need to do? The resources they care about, in order, start with CPUs. Especially in the case of DNS, you have a lot of really trivial operations running in parallel. There are no long-running compute tasks; there's nothing to encrypt, nothing to decrypt, nothing to calculate. Here's a record: just stick it in a buffer, send it out on the wire. The CPU just needs to do lots and lots of things and do them all at the same time. Memory and bus bandwidth are important. You don't want to fetch these things from the disk, because that takes time; all of this information somehow needs to get into the network as quickly as possible. So the bus bandwidth and the memory bandwidth are important. The network is obviously important, but again, for a DNS server doing 10 megabits of traffic, the network is not hugely important. And finally, the least important of all is disk IO. Under ideal circumstances, the DNS server will hit the disk exactly once, when it boots, and then it just doesn't hit the disk again. Disk IO is just not interesting for network servers. CPUs are important, bus bandwidth and memory are important, the network is important, and the disk is just very unimportant.

Historically, there have been a couple of attempts at low-power, high-throughput CPUs. Sun tried the T1 processor, which died in a fire; it just never got off the ground. Cavium has a moderately successful MIPS implementation, the Octeon; it's used by companies like Ubiquiti in routers, mostly residential sorts of setups, but they have some larger machines as well. Then there have been things like Tilera and Quanta; they eventually ended up at Mellanox, who make 10 Gigabit Ethernet adapters, but their CPUs just descended into irrelevance after a while. Intel tried the Atom CPU. Yeah, that was a nice try. And finally there's ARM: 32-bit ARMv7 is very popular; it's in every phone, or, well, every phone as of a couple of years ago. But the architecture I want to talk about is ARM64, which is a new, or newish, attempt at a server-grade architecture. ARM64 is widely known as the embedded architecture that lives in your phone and your tablet. It's the CPU in your pocket, but it doesn't have to live in your pocket. It can also live on servers.
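Since "lots of trivial operations in parallel" carries most of the argument, here is a minimal sketch, not from the talk, of how such a service spreads across cores: every worker binds the same port and the kernel distributes incoming queries among them. The port number and worker count are arbitrary; FreeBSD 12 and later spell the load-balancing socket option SO_REUSEPORT_LB, and the sketch falls back to plain SO_REUSEPORT elsewhere.

```python
import os
import socket

PORT = 5353     # arbitrary free UDP port for the example
WORKERS = 4     # in practice, one per core

# FreeBSD >= 12 spells the load-balancing option SO_REUSEPORT_LB;
# on Linux, plain SO_REUSEPORT already balances across the sockets.
REUSE = getattr(socket, "SO_REUSEPORT_LB", socket.SO_REUSEPORT)

def serve() -> None:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, REUSE, 1)
    s.bind(("0.0.0.0", PORT))
    buf = bytearray(512)         # classic DNS-over-UDP message size
    while True:
        # The per-query work is trivial: read it, answer it, move on.
        n, peer = s.recvfrom_into(buf)
        s.sendto(buf[:n], peer)  # echo stands in for the real lookup

# Fork the workers; parent and children all serve the same port.
for _ in range(WORKERS - 1):
    if os.fork() == 0:
        break
serve()
```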
And there are a couple of nice things about ARM64 on servers. It's the same instruction set architecture as the ARM64 in your phone, but the way it's laid out on a board, and its companion chips, are a bit different on servers. The ARMv8 instruction set is the same as in your phone; it's the same 64-bit instruction set. The server implementations of ARM64 also have cryptographic instructions, which are very similar to the AES-NI instructions you find on your Intel CPUs, and on FreeBSD you get pretty similar performance out of them as well.

On ARM64 servers, you find the same standardized on-core peripherals you would find on an Intel CPU. You'll have things like MSI, message-signaled interrupts; you'll have a timer interrupt; you'll have an IOMMU. All those things which are familiar from the Intel world also exist in the ARM64 world. And for one reason or another, the ARM64 server world also went for ACPI and UEFI at the firmware level, so all of this is familiar to platform builders from Intel, and all of that knowledge can be reused on ARM64. If you have an ARM64 server board, it will have ACPI, which can tell you about the hardware configuration and bus topologies and things, and you'll have UEFI firmware, which brings up your operating system and loads firmware blobs into the various peripherals. And finally, all of the standard peripheral buses which are familiar from Intel also just work on ARM64: if you have a PCI Express network card that works in your Intel server, you plug it into a PCI Express slot on your ARM64 server and it will work the same. All of these buses are just there.

I found a couple of commercial off-the-shelf ARM64 server boards. There's the Cavium ThunderX; the FreeBSD testing cluster has a couple of these sitting there. The basic implementation has 48 cores, and you can have 96 cores. That's a few more cores than you would find in a traditional Intel server. It's got 16 PCI Express Gen 3 lanes, so it's got pretty good bandwidth to the external world, and you can get it with 40 or 10 Gigabit Ethernet. A board like this will set you back about $3,000. So if you are running a ccTLD, or any other network service, about $3,000 for a server is pretty much what you expect to pay; depending on how many servers you have or how big your domain is, anywhere between $3,000 and $10,000 is what you would expect to pay for a server. If I were operating a ccTLD, I would go out and buy one of these. No problem.

A competing product, though, is an AMD design with a Cortex-A57; it's also an ARM64 CPU. It's a lot cheaper: for $600 you can get a four-core device with eight gigabytes of memory, some Ethernet, some USB. So the price goes a lot lower. And between these two, there's a whole spectrum of things. On the high end, you have the 96-core monsters with all the memory in the world; on the lower end, you've got four cores with a little bit of memory.

So let's look at operating system support for a moment. On FreeBSD, ARM64 is a fully supported platform. Binary updates with freebsd-update, and packages, are completely supported, just as they are on Intel. It's also fully supported by the security officer and the release engineering team, so whenever bugs happen, they get fixed pretty quickly. And all of the 20,000 third-party packages that work on Intel will usually just work on ARM64 as well.
So you can just replace your AMD64 machine with an ARM64 machine without any problem. In particular, for this presentation I looked at DNS, and the DNS packages are completely supported on the ARM64 platform: BIND, NSD, Knot, PowerDNS. You can just pkg install nsd on an ARM64 server and it will just work, out of the box, without any sort of difficulty.

Now, some random performance comparisons on a completely fictional workload. I took an Intel Xeon, I think it was a first-generation one, with 10 cores with hyperthreading, so 20 threads in total, 128 gigabytes of RAM, and a moderately fast SSD. And I ran this against the ThunderX with one socket of 48 cores, the same amount of memory, and spinning rust. So similarly priced hardware, and I ran an LLVM Clang build on both. Clang is a terrible, terrible workload to throw at something, and the ThunderX spent forever on it: the ThunderX took 32 minutes of wall time, or 20 hours of CPU time, building LLVM. On ARM64 you have lots and lots of cores, but they're not very fast. The Intel CPU managed to get the same workload done in 10 minutes and one hour of CPU time. So you would say it's 20 times faster on this workload, right?

But DNS servers spend very little of their time compiling. The LLVM workload is lots of continuous disk IO. As I said at the beginning of my presentation, under ideal circumstances the DNS server is just never going to hit the disk: it boots from the disk, it loads the zone file, and then it never touches the disk again. Even as the zone updates and new records come in, under ideal circumstances, in the steady state, the disk is just never hit. Most of the LLVM time is loading the millions of C++ files from the disk, compiling them, writing them back to the disk, and then loading them back again. The LLVM build has a gigantic memory footprint and a lot of churn, things coming in and out of memory all the time in different shapes and forms, especially in the linker. And the CPU load can only be partially parallelized in the compiler. You're compiling lots and lots of little C++ files, but you're not really exercising a lot of parallelism, because you're always going to be hitting the disk. Your opportunities for parallelizing your load are very much limited by how quickly you can fetch things from the disk, and if the disk is slow, you're just going nowhere.

DNS, on the other hand, has a tiny bit of disk IO when loading the zone file, and after that nothing at all. Your entire working dataset will fit comfortably in memory. This machine has 128 gigabytes of memory, and the zone file is 1 gigabyte in memory: I can fit 128 Denmarks into memory. Call it 127 Denmarks, with the overhead of the operating system. It fits very comfortably in memory. And most importantly, it is very easy to parallelize DNS queries.

Yes? "I noticed that the ARM64 server has a rotating disk only, while the Xeon server has..." Right. You mean if the server were putting the load onto a database or something? Sure. But it would not change a lot. It would probably turn it from 20 hours into 18 hours, because the contention is mostly not using all the cores. The way the Clang build is constructed is lots of little C++ files that are compiled into object files, and then all of these are linked together. And then finally you get the last stage, the linking stage, which is where it spends most of its time.
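You can see the serial bottleneck in the benchmark numbers themselves if you do the division; a quick check, using the figures as given above:

```python
# Wall-clock vs. CPU-time for the Clang build above (figures from the talk).
xeon_wall_min, xeon_cpu_min = 10, 1 * 60   # 10 minutes wall, 1 hour of CPU
tx_wall_min, tx_cpu_min = 32, 20 * 60      # 32 minutes wall, 20 hours of CPU

print(f"total CPU work ratio: {tx_cpu_min / xeon_cpu_min:.0f}x")    # ~20x
print(f"wall-clock ratio:     {tx_wall_min / xeon_wall_min:.1f}x")  # ~3.2x
```

The many slow cores close most of the wall-clock gap on the parallel part of the build; what is left over is dominated by the serial link stage, which is where the rest of the answer goes.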
The linker runs on a couple of cores, but very few, because there's very little for the cores to do. It needs to resolve some symbols, but then it gets stuck waiting for something else. So you can parallelize the compilation, but the linking will just stall there. In that benchmark I did, which was very unscientific, the spinning rust definitely contributed to the slowness, but the major portion of the slowness was just not being able to parallelize the workload: you're just sitting there without things to do in parallel. Whereas all a DNS server has to do is fetch a record, stick it in a network buffer and send it out. Not even that: it just has to copy the answer into a packet that's already there. All of this can be trivially done on many, many CPUs at the same time.

So that's the load side of things, but let's look at power. Intel servers run very, very warm. The Haswell, which is the last one I looked at in any sort of detail, specifies 135 watts of thermal dissipation idle. You just put it in your rack, and it's warming up the room with 135 watts while doing nothing. Under load, there are 250 watts being produced by the CPU. That's a lot of power, and that's one CPU; at the same time you've also got the platform, you've got the spinning rust, you've got everything else sitting there, but that's 250 watts just for your CPU warming up the room. In some countries, cooling a data center is a matter of opening the window in summer and opening two windows in winter. In places like Singapore, cooling your data center is a little bit more involved than that. So if you've got 250 watts under load, or 135 watts just sitting there doing nothing, that's quite a lot of power to dissipate.

Look at ARM64, on the other hand. This is the ThunderX: it's 120 watts idle, so that's already quite a bit less, and under load it dissipates 50 watts less, about 200 watts. And this is a massively over-specced machine for DNS; you could probably run the DNS just as effectively on a machine with 20 cores. This is just the machine I had available. At 200 watts under load, it's already 50 watts less, and with four cores, that's not even opening a window; that's not even opening the extra window in winter. Several vendors have much lower-power CPUs. You'll still need cooling, but less than you do with Intel servers. So that's worth considering: on the one hand, you've got a server where you're spending money on hardware that's actually doing something; on the other hand, you're saving money on cooling. And then, well, I had a slight digression on security, which I'm not going to go into in this talk; I added that for another conference.
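For a sense of scale, here is the back-of-the-envelope saving per socket; the wattages come from the talk, while the electricity price and the cooling factor are assumed placeholders:

```python
XEON_LOAD_W = 250        # Haswell under load, from the talk
THUNDERX_LOAD_W = 200    # ThunderX under load, from the talk
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.20     # assumed price; adjust for your data center
COOLING_FACTOR = 2.0     # assumed: ~1 W of cooling per watt dissipated

saved_kwh = (XEON_LOAD_W - THUNDERX_LOAD_W) * HOURS_PER_YEAR / 1000
print(f"~{saved_kwh:.0f} kWh per year saved per CPU socket")
print(f"~{saved_kwh * PRICE_PER_KWH * COOLING_FACTOR:.0f} "
      f"per year including cooling, under the stated assumptions")
```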
So the conclusions to take away from this talk: ARM64 is a very viable platform for network server workloads. Depending on what your web server is doing, it is probably not going to be very useful for a web server, because web servers do quite a bit of non-trivial computation building pages and sending them out. But if your network service has a fairly trivial workload which is easy to parallelize, ARM64 is definitely a better architecture than Intel. You'll have a machine which is still going to be expensive, but it's actually going to work for its money, and you're going to save a bit of money on cooling it. So definitely consider ARM64 for your next DNS server. Also, FreeBSD is very well supported on ARM64, so you get all the goodness of FreeBSD on an architecture which works really well for you. And that is the end of this presentation. Does anyone have any questions?

I can show it, yes. When I gave this presentation at APRICOT earlier this year, about whether ARM64 is easier to secure than Intel, we had a 10-minute lightning talk disguised as a question, as these things are known in that community, about what the Meltdown and Spectre vulnerabilities are. It's probably not a huge concern on a DNS server. Yes, ARM64 is also vulnerable to speculative execution vulnerabilities. ARM64 is also an out-of-order architecture, and it is equally as vulnerable as Intel to stuff ending up in the cache, for reasons nobody understands, from privileged instructions being speculatively executed in unprivileged mode. Yes, they're vulnerable; they've also been mitigated in the same way. But on a DNS server it's really the least of your concerns. This is a concern if you've got something like a web server with mutually untrusting tenants running on the same machine; on a DNS server, you're usually going to have one process running, and that process is just serving public information. So I didn't want that digression here, but, you know, we've had it anyway. Any other questions?

Yes, the other way around: DNS over HTTP. DNS over HTTP has been a thing for a little bit longer than just recently. I've not looked at it too closely. Well, I'm on the IETF DNSOP list, but I've not looked very closely at what it is they do. It should not be a huge problem, because as far as I understand it, they are just putting DNS packets into HTTP packets, so they're probably almost as easy, if not as easy, to parallelize as DNS packets in their native formats, either UDP or TCP. The one thing which is probably different in DNS over HTTP versus DNS over UDP, or DNS over DNS, let's say, is that DNS over DNS can be very easily optimized, because the response format is exactly the same as the query format. The DNS server can just keep the buffer in the last-level cache, add the response, and send it right back out. Whereas with HTTP, even if, I think, they run HTTP over UDP as well, on TCP at least you have a lot more buffers in the way. You can't just go to the socket, write your reply, and send it out again, because you have a lot more housekeeping to keep up with.
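To make that last optimization concrete, here is a minimal sketch of answering a UDP query in the very buffer it arrived in. The port is arbitrary, the unconditional NXDOMAIN is a stand-in for a real lookup, and a real server would of course append answer records too.

```python
# Sketch of the in-place reply trick described above: a DNS response reuses
# the query's wire format, so a server can flip a few header bits in the
# buffer it just received and write it straight back to the socket.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("0.0.0.0", 5353))   # assumption: any free UDP port
buf = bytearray(512)

while True:
    n, peer = s.recvfrom_into(buf)   # the query lands in our one buffer
    buf[2] |= 0x80                   # set QR: this is now a response
    buf[3] = (buf[3] & 0xF0) | 0x03  # RCODE 3 (NXDOMAIN) as a stand-in
    s.sendto(buf[:n], peer)          # the same bytes go right back out
```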