Good morning, everyone. Welcome to the final day of the containers and Kubernetes track at SCALE. By now all of you should be experts in container cloud deployment, right? Yeah, yeah, that's the spirit. Yesterday you heard a lot about security and related topics, so we're going to start this morning with a little bit about storage, and then there are two more sessions today; take a look at the schedule. I want to remind everybody that the final talk in this room, the one that would normally have been at 4:30, has been cancelled because the speaker is ill. That one will not be happening, but the other two will, and I hope you'll join us for them. So let's go ahead and get started with OpenEBS, with Jeffrey, who's going to explain how it works and where to store all your bits. Thanks.

Okay, thank you. All right, thanks for joining, everybody. First of all, let me introduce myself. My name is Jeffrey; I actually flew in from Amsterdam to be here, so I'm pretty excited, and if I appear a little low on energy today, it's probably because I'm still recovering from a bit of illness, so apologies for that.

Today we're going to talk about OpenEBS. OpenEBS is an open source project sponsored by my employer, MayaData, and we are building a platform for persistent workloads in cloud native environments. We believe that, looking at how these new cloud native patterns are deployed, it is justified to rethink somewhat how we do storage for these types of workloads, so we're going to talk a lot about that.

As I mentioned, the inception of the project is around providing a stable platform. The problem you typically end up with these days is that you have different cloud providers, on-prem or off-prem, it doesn't really matter, and somehow, some way, you need to make these systems work across vendors that all have their own traits. What I mean by that is that you want the ability to move data across these clouds, and this is obviously not a trivial thing to do, let alone the bandwidth issues you might have there. How do you abstract away the differences among these different cloud systems? When I say we try to abstract that away, we do that at the data plane level, depicted by the underlying arrows, as well as at the control plane; we do both components. Based on that, we have a declarative data plane on which we collect all kinds of heuristics, which we feed back into different services that help you manage your different workloads across potentially different cloud vendors.
You don't have to have different cloud vendors, obviously; you could just have things on-prem. But that's, at a very high level, the 10,000-foot view of what we're trying to do.

As you can see, the common denominator here is Kubernetes. You've heard a lot about Kubernetes already, today or yesterday for sure, but just to touch on it briefly, because it is pretty important: Kubernetes was originally based on the Google Borg paper. If you haven't read it yet, I recommend it; it really is an eye-opener in terms of what they tried to achieve. The unit of management is the container, so Kubernetes is a container management system, and it was initially mostly for web-based applications, so stateless, if we can argue there is such a thing as stateless; even a web server that holds no valuable state still has state to a certain extent. In this most simplistic view, Kubernetes is like a control loop: you give it your desired state, and the system works its way through all kinds of algorithms to achieve that desired state (a minimal example follows below). The fundamental thing they wanted to achieve was abstracting away the underlying differences between compute platforms. For Google in particular this was important because, as they noticed, every time they needed to do a hardware refresh, prices went up; Google doesn't buy one or two servers, let's say. So it was very important to decouple from the underlying differences and let developers focus on the deployment, not so much on setting up the system, figuring out differences between systems, and making things work from there. It had to be hardware independent.

At a very high level, schematically (I took this from the Borg paper): at the core you have the Borg master, where the fundamental control code basically lives, so to speak, and then there is a persistent key-value store. They refer to Paxos here; Kubernetes these days uses etcd, which is Raft, a protocol designed to be simpler than Paxos. And you have the scheduler, which figures out where the workload should go based on labels, on utilization, and on all kinds of other metrics there might be.
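To make the declarative, desired-state model concrete, here is a minimal sketch; this is my illustration, not a slide from the talk, and the names and image are hypothetical. You declare three replicas, and the control loop plus scheduler converge the cluster toward that state:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical stateless workload
spec:
  replicas: 3                # the desired state; the control loop reconciles toward it
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.17    # a stateless web app, the original Kubernetes sweet spot
```

If a node dies, the scheduler simply places the missing replicas elsewhere. That is exactly the property that breaks down once state enters the picture, which is where this talk goes next.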
So, this being a storage talk: what does this mean for having persistent volumes in a system that originally was not designed to have state? There was a problem there, and the problem goes pretty deep, in the sense that containers originally did not have persistent volumes at all; everything was ephemeral. If the container died, the data died along with it. In order to allow stateful workloads to run in the cloud, this problem needed to be solved.

Ephemeral data can be seen in several ways, but it either means that the temporary data has no value, or that it can be regenerated. A good example of regenerable data is a compile farm, where the intermediate object files can be regenerated, assuming the code doesn't change. You also might want to share data between containers. The last part is serverless, which is the ultimate containerization, let's say, where you only pay when your container does useful work, and the intermediate state between serverless functions is also a form of persistent data. So in order to achieve that, what you want to do is abstract away the underlying storage, the underlying data abstractions, and decouple the data from the systems, just like you do with the compute platform. And the expectations in terms of usability have been set by the cloud providers, like Google with Persistent Disk and Amazon with EBS. How does the user define what he wants? He wants to have a volume, and he doesn't really have to care about how the volume is created or where it is put; it's just there, replicated across multiple regions. So the level of expectation is already pretty high.

Let's look at how this appears in the YAML. This first approach is almost an absolute guarantee to lose your data, because the problem is that you use a local mount point on the node where the workload happens to run. That means that when Kubernetes decides to move your workload, for whatever reason, the data does not move with it. This is typically something you don't want to do unless you know exactly what you're doing; there are cases where you can't avoid it, but you need to be very, very well aware of the fact that if your workload moves, the data doesn't move with it (a sketch of this pattern follows below).

An alternative approach is to use a cloud disk, like the Google persistent disks in this particular example, and this makes the problem a little better, in the sense that the disk lives in the Google cloud, so you could connect that same disk to multiple nodes. But the problem then is: how many persistent disks do I create? How do I make sure these things get attached to the right nodes at the right time when workloads start to move? So it gets better, but still not really good.

If you evaluate the current process, in both cases we fail to abstract away the underlying details and differences. We are basically cherry-picking pets out of our herd, which is a very cool thing to say in Kubernetes speak, I suppose, but it's basically an anti-pattern: you should not create servers that have a very specific meaning in a cloud native architecture. The second example, with the Google disk, is better, but you still have the problem: who mounts this thing when the workload moves, who knows that the workload needs to move, and what's the order of doing things?
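To make that first, data-losing pattern concrete, here is a minimal sketch of a hostPath volume of the kind being warned about; the names are hypothetical, and this is not the exact YAML from the slide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: db
    image: postgres:11
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    hostPath:
      path: /mnt/data   # lives on whichever node the pod lands on; a reschedule leaves the data behind
```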
So I should typically not start the compute container before the persistent volume has been mounted; there is a form of orchestration causality that needs to happen in the proper order.

Kubernetes, obviously, is evolving. There are the persistent volumes and the persistent volume claims, which I'll go into a little deeper over the next couple of slides, and the Container Storage Interface (CSI), which helps make this a lot simpler to manage at scale. The flip side of the coin is that CSI also helps Kubernetes avoid a sprawl of different volume drivers in its repo, because every vendor has their own implementation; it's basically an abstraction around these persistent volumes.

If we look at the PVs and the PVCs, we could argue about how much easier it has become. You typically create a persistent volume; against that persistent volume you create a persistent volume claim; and then within your application YAML you refer to this persistent volume claim, and now things are set up. One thing that strikes me is that it's a whole lot of typing to do a very simplistic thing. All the blue things on the slide are keys, and the number of keys is higher than the number of values; doing this in YAML is not always all that easy, to be honest. There are different toolsets around to make it a little easier, but this basically is the gist of it (a sketch follows below), and when a workload moves, the PV and the PVC in the CSI subsystem work together to make sure the volume is at the right node at the right time.

To sum that up a little: you register a set of mountable things, and then you can take ownership of such a mountable thing; when I say register, you register it to a Kubernetes cluster. Then you obviously have the problem: okay, so I need to create 10,000 volumes, add them to the cluster, and grab these volumes as I go, and that's also a little unfriendly, so to speak. What you want is a dynamic provisioner that can create these mountable things for you on demand when you run out of persistent volumes. Now, that sounds really good, but the flip side is: what happens if I have a storage system that can only do, I don't know, a thousand mounts, and that's the max? These storage systems have an absolute limit on how many logical units they can create. Sometimes it's hard-coded; sometimes it's a vendor trick to make you go up to the next tier in terms of money. But there is a limit. So the dynamic provisioner makes things easier, but it can also open up a can of worms with the storage administrator.

The attaching and detaching of volumes is now more or less standardized through CSI, which is a gRPC interface consisting of several components. I won't go into the details of CSI, because there was a talk about it just yesterday, but it consists of three parts: a controller, an agent, and an identity module. The agent runs on the node, and it knows how to talk to Kubernetes in order to mount the volume at the right time, and also to get the causality in place such that you don't start your compute before the PV is mounted, and so forth.

To visualize this in a picture: imagine that I have two or three nodes here, and I have my persistent volume claim.
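As a sketch of the PV-to-claim wiring described a moment ago (sizes, names, and the hostPath backing store are illustrative, not the slide's exact YAML), note how many of the lines are keys rather than values:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: /mnt/disks/pv-demo          # placeholder backing store
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: mysql:5.7
    volumeMounts:
    - name: data
      mountPath: /var/lib/mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-demo             # the application refers to the claim, never the volume
```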
And now I want to connect them through the YAML, as I mentioned before. The first thing that needs to happen is that the PV needs to be attached through regular storage protocols, iSCSI or Fibre Channel or NVMe over Fabrics, what have you, but then it needs to be mount-propagated into that container. That's an additional step that typically does not happen in a traditional VMware environment, say. This mount propagation is handled completely transparently by the CSI plugin, so you don't necessarily have to worry about it, but it's an interesting thing to know, because the namespacing separates the volume from the rest of the system. The PVC becomes almost like a snowflake in your deployment: without the PVC, the rest won't work. So: how do you make sure that your PVC is always available, always online, has the right dimensions, and is flexible enough to grow with your workload?

There are some things we need to consider, so we can ask ourselves: is the problem solved? A couple of questions come up immediately. First, how does a developer configure the persistent volume for his particular workload? This may seem a little far-fetched, but these days you have all kinds of databases that can do all kinds of magic, like encryption and compression and what have you, so you typically don't need that on your persistent volume. But if you have a persistent volume connected to a storage system that happens to have all these attributes turned on, you get them whether you want them or not. That doesn't necessarily have to be a bad thing; if you compress things twice, well, you just burn some CPU. But at scale, efficiency is obviously very important as well.

The other question is: how do you abstract away the differences between vendors? Imagine that, through a merger, you have two sites with two different storage systems, and you want to deploy cloud native applications across both sites. Vendor A does it vendor A's way, vendor B does it vendor B's way, and these vendors typically go out of their way to be as incompatible with one another as they can possibly be, because they want to lock you in. So how do you abstract away these differences, not just in features but also in protocol? What if I have an iSCSI system and a Fibre Channel system; how do I create a PV that spans those two systems, efficiently and effectively?

The other thing, as I mentioned, is that you want a cloud native look and feel. The developer who is familiar with EBS-like volumes, and the way he provisions them, should have that experience on-prem as well. How do you achieve that? The PVs, the PVCs, and the CSI driver don't help you with that; they just help you with attaching these types of things.
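The EBS-like, on-demand experience being referred to is roughly what a storage class plus dynamic provisioning gives you. A minimal sketch, with a hypothetical provisioner name and driver-specific parameters:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: csi.example.com        # hypothetical CSI driver name
parameters:
  replicaCount: "3"                 # illustrative, driver-specific key
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: fast            # the claim triggers on-demand volume creation
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
```

The developer only ever touches the claim; the class hides which vendor or protocol sits underneath, which is exactly the gap being described here.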
So I guess, to sum it up: you want your data to be as agile as the applications it serves.

The other aspect worth considering here is a thing called data gravity. As data grows, it has the tendency to pull the applications towards it, and by "it" I mean the storage system. The problem there, obviously, is that you get a huge blast radius if something goes wrong in your storage system. Well, you could say, we could go for a distributed system, but distributed systems can fail as a distributed system, on their own terms, just like any non-distributed system, so that doesn't necessarily help. The other thing is that the degrees of freedom you have achieved with Kubernetes, you basically lose the minute you are tied to a storage platform, because everything revolves around that storage system. So you need to find a way to break away from that. Some solutions involve replicating the sun: you basically have this big storage system and you replicate the whole big storage system to something else. That model works, but it also exacerbates the problem; it makes it worse, because now you have two supernovas, so to speak. So you need to be careful not to lock yourself in, or you end up in the Borg scenario; Star Trek, obviously. When you ask Picard, "hey, let's replicate the whole sun and go with that in production", it's probably not the best solution.

So we asked ourselves: what if storage for containers was itself container native? And I have to attribute that to our CEO, who is adamant that we need to understand the problem we're trying to solve for our users. In anthropology, they say that "going native" is where you actually become part of the environment that you study. So we really tried to go native and figure out what these users are trying to do and what they are trying to solve, such that we could build a system that would actually help them, instead of developing a system that we think is great and then enforcing it upon them. We basically inverted the engineering here: bottom-up rather than the top-down approach that storage vendors typically like.

So we studied, well, studied is a big word, but we spoke to a lot of people about what it means to have a cloud native architecture, and the one thing that stood out the most is that applications have changed, and somebody forgot to tell storage. What I mean by that is that when you look at cloud native architectures, they are distributed systems themselves. They don't necessarily have to be, but more often than not they have sophisticated protocols embedded within them, maybe not even consciously chosen, but part of the software stack they're using, which already provides high availability and scale-out. So do you still need a distributed, scale-out storage system underneath? The other thing is that these applications are designed to fail, right?
Usually you had this piece of software that ran on a hypervisor, and the hypervisor could restart it if the node failed, but your storage system, again, was built on the basic principle that it could never fail. These new applications are designed to fail across nodes, across racks, across regions. Scalability is built in, batteries included: HAProxy, Envoy, NGINX, you guys probably know these better than I do. Then you have these databases with built-in sharding capability. So these applications are fundamentally different from the applications we were talking about ten years ago, and, as I mentioned earlier, with that in mind we are freed from certain complexity that we do not have to implement, so we can build a product that better suits these cloud native environments.

The other thing we observed is that the data sets of individual containers are relatively small. The sum of it all is still whatever it is, but the individual containers are relatively small; the data we've seen so far is around 20 to 200 gigabytes. That's practically nothing, and it allows you to reconsider, for example, certain rebuild algorithms, because the problem domain is the same but the scale is a lot smaller, so things that didn't work before might work again at this different scale. Prefer many small stars over one big sun, I suppose. The other thing is the rise, and fall maybe, we'll see, of these cloud native languages, Ballerina, Metaparticle, where you basically develop in units of containers as well, and within these languages things like PV creation are implicit. So it pays to figure out how these systems work internally in order to figure out how to build a good storage system.

Other trends we saw: hardware trends, storage trends, system trends. Hardware obviously is not standing still: 40-gig and 100-gig RDMA-capable devices, NVMe and NVMe over Fabrics. NVMe, by the way, is a transport; people implicitly assume it has something to do with SSDs, but it doesn't necessarily. Then there are increasing core counts, with concurrency primitives built into the languages: if you look at modern languages that are popular these days, Go is one of them, Rust is another, the concurrency primitives are built into the language. You typically do not type pthread_create anymore; you type something and implicitly it happens in the background, and that's because core counts keep increasing (see the sketch after this passage). And storage limitations are bubbling up into software design. What I mean by that is that storage won't affect how you unroll your for loop, but in terms of infrastructure as code it does affect how your application is deployed, and you get friction between teams: "don't run your CI/CD pipeline while I'm running my backup job", right? These are very typical things we see, because you don't want to back up build artifacts from a CI/CD pipeline run that failed. So you get this friction, and the developer needs to be in control; that's basically what you want to achieve.
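As a trivial illustration of that point about language-level concurrency (my sketch, not the speaker's), in Rust you spawn threads and pass messages with the standard library alone; there is no explicit pthread_create plumbing:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    // Spawn one worker per (pretend) core; the primitives are in std,
    // and the compiler enforces that shared data is used safely.
    for core in 0..4 {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(format!("worker {} done", core)).expect("receiver alive");
        });
    }
    drop(tx); // close the channel so the receive loop below terminates
    for msg in rx {
        println!("{}", msg);
    }
}
```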
Anyway, because of this friction, what happens is that you get a lot of shadow IT. People deploy something in Amazon, it works, "I've delivered my job", and then they need to move it back on-prem, and that's when the problems start, because the storage properties in the cloud are different from the ones you have on-prem.

The other extreme we came across, which was very interesting: this guy had the budget to buy any storage system in the world, but he eventually decided not to buy any at all, because he said, "I'll just give each developer his own server, and I'll buy three of them, and he needs to make it highly available himself; it's not my problem. He needs to take care of backup, he needs to take care of scalability, all by himself." The reason was that all these vendors had their own different quirks and limitations, and it was just impossible to keep up with them. Obviously there are no enterprise features in that model either. So that's the ultimate extreme, and I think he was a little burned by all the problems he had had in the field with these types of storage systems. Storage seems to be an agility anti-pattern thus far.

To zoom in a little on the hardware trend: at the top right is a Supermicro server, and it has one petabyte of NVMe storage. That is a huge amount of storage, and that's just one 1U server. The other thing is CPU core count and frequency; the images overlap a little, but you can see that since around 2004 the frequency has basically been flat and is not increasing anymore, while the number of cores is increasing linearly. So how do you make sure that you utilize these cores as efficiently as possible, not just when you are building your application but also when building a storage system? Then at the lower right corner there is a typical NVMe system, where each CPU has a lockless ring onto which it can put requests, and these requests are consumed by the NVMe driver, and vice versa. You have a completion queue, a submission queue, and an admin queue, and every CPU has its own queues; they're not shared. If you have a lot of cores, each core has its own queue. So this ties back into the hardware trends that force a change.

And then a third thing, which is probably the most important, is that the persona has changed. Normally you would talk to storage admins or CIOs or whoever: "yeah, we need a new storage system of a petabyte or two or three". But these days you're targeting the actual developer, because he or she knows best what he or she needs for the application. Yet storage hasn't changed at all, thus far.

So the idea we have is this: on the left side you see a Kubernetes deployment that is completely stateless, and on the right side we do the exact same thing with a stateful application, through Kubernetes and YAML. When we deploy the application, next to the actual compute container we also spin up a data container, and within the data container we run the storage subsystem that makes sure your data gets where you want it to be.
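A rough sketch of what that looks like from the developer's side, assuming an OpenEBS-style storage class (the class name here is assumed, not taken from the talk): the claim below is all the developer writes, and on its creation the control plane spins up the data containers, a target pod plus replica pods, that serve the volume next to the workload.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-data
spec:
  storageClassName: openebs-standard   # assumed OpenEBS-provided class
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
```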
So that's the idea. Based on that idea, and on the conversations we had, we imposed some design constraints on ourselves.

We wanted to make sure we build on top of the substrate of Kubernetes. When we started this out, almost two years ago, it was not at all obvious that Kubernetes was going to win. There were different orchestration platforms, and they're still around, but I think it's fair to say that Kubernetes basically won that war. So we wanted to be one hundred percent built on Kubernetes, not just something that's jacked into Kubernetes afterwards.

We also chose not to build yet another distributed storage system, because there are so many already. In fact, in the Linux kernel there are several distributed file systems that you could just use and get started; you don't necessarily need anything at all. But somehow, some way, these systems are not cutting it, and there are multiple reasons for that, one of which is that they are inherently complex. The second reason we decided against it is that when we looked at the data sets, they are relatively small, so you don't need one volume that spans, I don't know, half a petabyte or ten petabytes; the individual workloads are small. But we should not confuse distributed with scalable: it doesn't mean the system is not scalable, it's just that we don't have one big volume whose writes span multiple systems, although there is nothing that prevents us from doing that. We believe that putting distributed systems on top of one another is an operational nightmare.

Declarative intent, aimed at the persona I mentioned: the developer, through YAML, defines exactly what he needs for his workload, and we take the YAML and make it so.

It runs in containers, for containers, so it needs to run in user space, and that obviously opens up a can of worms, which I will go into in a bit, because performance matters: if you're talking about NVMe-class workloads and you put a system on top of them that can't reach those performance numbers, you have a problem.

The other thing is that we want to make the volume omnipresent. What I mean is that if you have a NetApp system, an EMC system, and some other system in three different data centers, we are the abstraction layer in between that makes sure the I/O goes to all three systems at the same time, so you don't have to worry about it. I usually say it's not a clustered storage instance, but rather a cluster of storage instances. Decomposing the data; Picard likes that a little better. So instead of having one big, fat storage system, we create very small, isolated storage systems, and a lot of them.

The other way to look at it is SAN/NAS versus DAS. The SAN and the NAS typically have all these cool features enabled, like pooling your storage, aggregates, snapshots, clones, and what have you, and then you have DAS, which is the complete opposite: it doesn't do anything, it's just really, really fast, and it stays in the box. If you combine these together, what we ended up with was CAS, container attached storage, and that's where the term comes from. So how does that look?
We actually have several implementations, but this is one of them. We have the PVs and the PVCs that we mentioned earlier; we haven't changed those at all. But the request goes through the OpenEBS provisioner, which is the dynamic provisioner I mentioned before, and then we have this m-apiserver, and don't ask me why, but if you create something that works with Kubernetes and uses the Kubernetes API, it's called an API server. That talks back into the kube-apiserver, which in turn talks to the scheduler, and that spawns the containers that then constitute your container attached volume. The fundamental difference here is that the right side, where you see those containers spin up, is the part that would typically be offloaded to whatever storage system you have.

We also started to work on topology visualization, because you can imagine that with all these containers, how do I visually verify that my containers are indeed connected properly? So we integrated with Weave Scope, where you can click on your workload and see the PV and how it is connected, and all these types of things. This is upstreamed; it's pretty nice, I suppose, and you really do need good visualizations when you get to 10,000 workloads.

The other thing you could do, and I'm not saying you should, but you could: if we're talking about making storage as agile as the applications it serves, you have this thing in Kubernetes called blue-green deployments. Imagine you want to upgrade your storage controller. You could swap out the storage controller, and if it fails, you swap it back; that's basically the whole notion of a blue-green deployment. There are also canary deployments, where the canary warns you before everything else dies, these types of things. You could do this, and it lets you do it at a per-application level, so you control the impact. I think everybody who has upgraded a storage system with a complete VM farm running on top of it squeezes their butt when they type "failover", right? It should work, but you typically don't do it all the time; with this, you can do it all the time, and so you reduce risk. That's the whole idea.

As I mentioned: route your data anywhere you want. This is basically, for lack of a better word, a proxy. You have your container attached volume, and you have Box 1, Box 2, and Box 3, which is a reference to Silicon Valley, obviously, and you can basically say: make sure this data lands on these three systems, whatever they are. I don't really care how you do it; just make it so.

Composable, with the developer in full control of what he wants the system to do. If you go back to the PV, you have the different protocols that you typically find in a storage system, and in Kubernetes terminology they call that ingress and egress, so I put those terms up here as well. The egress is basically your output.
That could be a local NVMe device, or a local SAS device, it doesn't really matter, and it could be something remote, but it's important to understand that we are the ones who initiate the connection; we are not using the host to set up the connection with the remote storage system. Then, through different transformations, you can tie systems together however you want: you can compress, encrypt, and mirror. I would, by the way, do that in that order, not encrypt and then compress. That's how you can compose your storage for any particular workload, however you see fit.

Okay, so how do we do that? Because I basically have this situation where I need to connect my PV to my CAS instance, and my PV is just a mount point abstraction in the operating system, so the operating system somehow, some way, needs to connect to my CAS instance. We abstracted that away, or are abstracting it away, through different ingress protocols: network block device (NBD); NVMe-oF, NVMe over Fabrics, over TCP, which got ratified in November 2018, so it's still very young; NVMe over RDMA; and then there's the virtio family. I think the virtio family is really important, because if you look at the development of these new micro-VMs, like Firecracker and Kata Containers, they basically stripped down a hypervisor to the level where it can only boot Linux, and that's it, but it creates a form of isolation, in terms of security, that is not achievable with plain container systems today. So: a very small hypervisor that only runs Linux. gVisor is another thing, from Google, that does something similar. In these situations, hypervisors typically communicate through virtio, and virtio, not surprisingly, looks a lot like a ring buffer shared between two processes, where one writes something into it and the other grabs it out. But then you need to get out of the box, obviously, so how do you do that? Basically the same way, and again, I'd like to point out that we are making the outbound connection here.

Then you can mix and match everything the way you want, because within the CAS layer we abstracted the I/Os into reads, writes, and some unmap calls for SSD-based subsystems; we abstracted these differences in protocols away. So you could say: I want an RDMA-connected CAS volume that writes to a CAS instance that writes to my local device through AIO and through an NVMe device. I'm not saying that's a really good idea, but that's what you could do.

So let's talk a little about performance, because as I mentioned, that is rather important: user space and performance. One of the things that I think is a game changer for sure is NVMe. The SCSI layer is decades old, probably older still if you count the development cycle; it's older than I am. NVMe breaks away from the SCSI model completely, and that's why I'm so excited about it: it is truly innovating.
It's not just building on top; it gets rid of it entirely. The NVMe spec looks a lot like InfiniBand, which is a technology from the high-performance computing segment that basically resulted from a merger of earlier efforts; I can't recall the exact names. In any case, the snippet you see below is from an email by Jens Axboe, if I pronounce his name wrong, probably, but he is the Linux block maintainer, and he sent out this code where he was reworking some of the internal block structures of the Linux kernel to circumvent, among other things, the cost of Spectre and Meltdown. It just shows you the word "ring", and this is just the first paragraph. Everything is rings these days: lockless rings, completion queues, submission queues; you see them everywhere.

So how does that look? On the left-hand side you have the NVMe driver, and on the right-hand side you have the traditional world, at least for me: the SCSI layer, the block layer, the HBA driver, and then the storage systems. And as probably every storage engineer knows, behind the HBA there is so, so much firmware that it makes you want to cry and run away, because you have expanders and JBODs and switches and all these types of things. It's a very complex network. With NVMe, that's completely gone.

As I mentioned, these changes, called io_uring, are merged in Linux 5.1, so this is very, very new, and before it becomes mainstream available, I don't know how long these cycles take with the distribution vendors, but I assume it's going to be one and a half to two years for sure. But you can see the performance of io_uring, which uses these lockless queues, and compare that to libaio, which has been in the kernel since forever: there's no comparison. Below that there is SPDK, the Storage Performance Development Kit from Intel, which uses these same exact rings but does the whole thing in user space. And you can probably guess what we're using to build container attached storage systems in user space as well.

It's not about the absolute speed here. We did not pick SPDK because it's, I don't know, a hundred thousand IOPS faster; that's not the point. The reason is that you want your storage system to run in user space, because you do not know which kernel you'll be running if you move your workloads from cloud to cloud. If you run Ubuntu on Amazon, that's not the same Ubuntu you download and install on your laptop. If you use Kubernetes on Google, it's COS, the Container-Optimized OS, a spin-off of Chromium OS, which has no kernel modules at all.
So you have to do these things in user space: you get rid of the kernel differences by going to user space.

The other important aspect here is huge pages; we need huge pages to make this work. Huge pages have been used for a while now in databases, and the most important thing here is not so much the MMU, although that is important, but in our case the IOMMU, which exists in the first place because of virtualization. I don't know if it's fully readable, but on the right-hand side you have 22 million misses, and on the lower side you have zero misses, and every time there is a miss, the operating system has to do additional work to fix that miss, so to speak. With 2 MB huge pages we basically get no misses at all. The other aspect is that these 2 MB pages are pinned in memory, which is an undocumented feature of the Linux kernel: they physically don't move, and because they physically don't move they become viable for DMA, which is what makes it go fast from user space. That's the trick, basically.

The other thing is poll mode drivers, PMD. We're not using interrupts at all. The frequency of the interrupts would be so high that you would either accumulate several interrupts before actually starting the interrupt service routine, or use some other coalescing technique like that, but the problem is that the CPU gets interrupted so often that it makes no forward progress at all. So instead of being interrupted, you constantly poll: you sit in a tight while loop reading a register. Your initial response might be, "well, that's not really efficient, is it?", and that's true. But compare the two: the lower one uses a poll mode driver, the upper one is interrupt driven, and you can see that the idle is higher for the poll mode driver; higher is better in this particular case. The other thing you can do is adaptive polling, which is what Linux also uses, by the way: once there is I/O, it keeps track of how many I/Os there were, and if there were none for a certain amount of time it increases the sleep, and when the sleep goes beyond the latency threshold for interrupt-driven operation, it dynamically switches to interrupts until load comes back. These are optimization techniques, but the gist of it is that you typically shouldn't have systems sitting idle to begin with; that's the whole thing about efficiency. You make sure everything runs, I wouldn't say to the max, but certainly at 70 to 80 percent, before you add a new box.

So: rings in container attached storage. We have a core which is constantly polling a ring for new messages or new events. These events are basically a small data structure with a function pointer to the actual function that needs to be executed, plus two additional pointers to callback arguments, and we poll that ring constantly (a toy sketch follows below). Then we have socket connections, in this particular case iSCSI, which we constantly read through a poller, with an interval typically set to zero, where we read the socket non-blocking: it either returns EWOULDBLOCK or not, and if it doesn't, we read the data, and things like that.
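A toy sketch of that event-ring idea (mine, not OpenEBS or SPDK code): each event carries a function pointer and two callback arguments, and the reactor core polls the ring in a tight loop instead of sleeping on interrupts. A real implementation would use a lockless per-core ring; a mutex-guarded queue stands in here.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// An event: a function pointer plus two opaque callback arguments.
struct Event {
    func: fn(usize, usize),
    arg1: usize,
    arg2: usize,
}

fn complete_io(lba: usize, blocks: usize) {
    println!("I/O complete: lba={} blocks={}", lba, blocks);
}

fn main() {
    let ring: Arc<Mutex<VecDeque<Event>>> = Arc::new(Mutex::new(VecDeque::new()));
    ring.lock().unwrap().push_back(Event { func: complete_io, arg1: 2048, arg2: 8 });

    // The reactor loop: poll, never block.
    loop {
        let ev = ring.lock().unwrap().pop_front();
        match ev {
            Some(e) => (e.func)(e.arg1, e.arg2),
            None => break, // a real poller would keep spinning, or back off adaptively
        }
    }
}
```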
Then we have a gRPC interface attached to that, so we can orchestrate it through CSI, because CSI implicitly uses gRPC, and we're also polling the NVMe device itself, which has hardware registers that signal that a certain I/O has completed. The nice thing about this is that it scales very easily with the number of cores you assign to it.

So, the final quick part: you don't just build a new storage system; it's not trivial. Storage tends to be very important, because you could consider it the crown jewels of your company, so to speak: if you lose the data, you lose all the assets. As I mentioned, Kubernetes is like a control loop, so we looked at how we could expand this control loop, fit its model, and make it do something useful in our domain, which is storage. Kubernetes has ways to extend it; I alluded to them a little earlier with API servers, but you also have operators that you can build. Operators basically allow you to inject domain-specific knowledge about the container running your application: you add the knowledge that it's a SQL Server or, I don't know, a Kafka broker, and then you have an operator that scales not the containers but the Kafka instances. That's what these operator frameworks do. You can write your own schedulers; it's not all that complex, actually. There are the API servers, and there are other things out there as well, like metacontrollers and whatnot, but I don't really want to go into all of those, hence the dots.

I don't know if you can all read this; this is European notation, by the way, the American notation is a little different, so if you're confused, that's fine. This is a model of a model-reference adaptive control system, and you can forget about the details; they're really not all that important. The thing is that you can split it into two loops, where the first loop is the one we saw before, and the second loop is where we measure and try to adapt the storage parameters based on whatever we want the storage system to do. These can be simple things, like disk-busy or the number of errors, but also something a little more sophisticated, where we inject a reference model, knowing that, based on the hardware and the workload, these are the numbers you typically should be getting. That reference model comes through a SaaS offering we are building, called MayaOnline.

The way we built that up: we have written a performance tool, all in Rust, by the way, so if you're into Rust, ping me; I think it's a very interesting language. In any case, we have this CAS instance and we connect different types of protocols to it, since it speaks all these protocols inherently, and then through Kubernetes we scale the workload up and down and see whether the system behaves the way we want, and we collect telemetry to see whether we have regressed. For example, if we are adding a feature, say compression: is the cost of adding compression within the boundaries of what is acceptable?
And not only that: we think it is important to be able to quantify, roughly, what the impact would be if you were to enable compression. Compression is not deterministic; it has a band within which it operates, because not all data is equally compressible, so the compression rate is not constant.

Taking that a little further: there is this tool (the name is not on the slide) with which you can record the I/Os at the block layer as they are submitted to the Linux kernel. So instead of spinning up MongoDB, scaling it to 10 or 20 nodes, and having it consume memory and do all the wonderful things it does, we record one Mongo instance, replay that through fio, scale it up to the roof, let's say, and see where the system starts to crack, if at all. The cool thing then is that as we inject errors into the system, read errors, SMART errors, whatever type of error you want, we can collect data and figure out which type of error impacts your application in which way. Based on that, and this is a little far-fetched, I know, but through lambda functions, because we have to get the whole CNCF namespace in there, obviously, so through Knative functions and whatnot, you can call back into the declarative data plane and, for example, decide that you're going to move this workload, because based on the pattern we see, it's better to move it, or you will be negatively impacted by the errors we collect and observe while running it. So the idea, eventually, is that storage just fades away as a concern.

And with that, if there are any questions, I would be happy to answer them.

We have about 10 minutes for questions. So, who's first? Nobody? Heads still spinning from all the storage stuff. I actually have one, then: are there currently a lot of limitations in terms of applications, platforms, and layers of the stack that have proper support for NVMe?

Well, from a host perspective, using NVMe is completely transparent. The common denominator is that you still have the block layer, and underneath the block layer is where the differences bubble up, so the application doesn't necessarily need to change. However, if you want the biggest bang for the buck, you would ideally have your application write immediately through the RDMA fabric, directly to whatever the subsystem is. What you also see developing in the NVMe space is key-value stores. RocksDB, you've probably heard of it, has a tremendous amount of very intelligent and complex software to make dumb, fat disks behave smart and fast. But if you have this unbelievably fast media, do you still need all that software? That's a little bit the playing field right now: you see key-value semantics being built directly on top of NVMe devices. So an application like RocksDB would definitely benefit from using direct RDMA and NVMe-oF-type APIs to bypass the operating system altogether, for example. But other than that, NVMe is still relatively young, of course; NVMe/TCP was only released in kernel 5.0, very recently.
So it's expected to have a couple of bugs; you need to develop it first in order to make it usable, of course. But we work with iSCSI, because iSCSI, NBD, virtio, it's all the same to us.

Yes, sir? What's the current maturity of the project, and can I use it now?

Right. Most of the things I talked about are being actively worked on by myself and a couple of others. However, the OpenEBS project is available on the internet, and we have two storage engines that more or less do the same thing, just not as highly optimized for performance, because the first thing we needed to find out when we started this project was: is this an actual problem that needs solving? We did not necessarily know that. We thought it was true, but we did not know that container attached storage would actually take off, and fortunately for us, it did. So now the question becomes: how can we do this efficiently and fast? We didn't optimize for speed in the beginning at all; we focused on the workflow and the usability, such that a DevOps guy with no knowledge of storage could actually spin up an OpenEBS volume. I think one of our biggest features is that we are very easy to get running: today you just apply some YAML and you're done, and people seem to like that.

My main concern at this point, especially where we're at in our current set of projects, is its robustness and its reliability, more than its performance. Those are the big issues for us.

Yeah. You know, that's the gift and the curse with storage: it needs a long time to cook, in general, and this is true even for the protocols in the Linux kernel. My first experience was: the Linux kernel is developed by the brightest minds in the world, and I run it and it just breaks on me. That shows you how complex this is. But the OpenEBS subsystem only concerns itself with blocks, so we don't have the complexity you'll find in file systems like ZFS, for example; that's a whole different level. We do have thin provisioning and things like that, but this is what I tried to say earlier: because the dimensions are changing, you can reapply technologies that were deemed unfit as sizes grew, because the volumes are smaller again. If you look at things like device mapper in the Linux kernel, or MD RAID, those papers date back decades; you can reapply those technologies and brush off the dust, so to speak. So we don't necessarily have to reinvent them. But I understand exactly what you mean: storage needs to cook, for sure.

Hey, Jeffrey, there are a number of devices the hardware guys are coming out with that do things like key-value stores inside the NVMe disk. Yes. So how does your architecture extend to allowing those devices to shine?
Well, it depends a little on the interface that the application would, no, let me put it differently: it would depend on the interface that the vendor gives the application developer to use the key-value semantics. As far as I understand, and it's been a while since I looked into it, it's basically based on a write-stream-type approach: instead of saying "I'm writing at this LBA with this length", where the length is typically fixed, 4K or 512 or whatever, you say "I'm writing at this LBA, one meg", and that leads to the ability to abstract key-value-type semantics on top. The way we would directly integrate with that is, probably, hopefully, to expose a standardized key-value interface towards these device-level key-value interfaces, which might not be standard, with translation layers where we do the translation. But above all, it will be interesting to see what the standardized API is going to look like and how that evolves. I hope a form of standard comes along.

Okay, well, thank you very much. Thank you. And we will return to this room after lunch with a talk on how to integrate Consul into your Kubernetes clusters. So, thanks.

Welcome back from lunch. I know this is always a hard slot to fill, because your stomach is making your brain go back to sleep, probably, but for our next talk we have John DeHoney from HashiCorp, talking about service mesh. And just one announcement: for the final talk in the track on virtualization and containers today, if you're looking at the paper schedule, the speaker is sick, so that one has fallen off the schedule. Just FYI.

Okay. So, by virtue of being at this conference, you more than likely are somebody who is advocating change; you are moving the technology ball forward in your company. And that's what HashiCorp is. If I had to think about our DNA, it's change, specifically cloud native change, the change that digital transformation has pushed down on most business organizations and that is causing many of these technology shifts: the shift to a DevOps culture as a result of going from waterfall to agile; static infrastructures, the ones we found in our traditional data centers, our brick-and-mortar comfort zones, moving out to the cloud; changes from high-security, high-trust networks to no trust, zero trust, where the internet is the network. So today I want to talk about this little topic of transitions, because I think it's hard, and many products impose change without thinking about how to get these brownfield applications that we all have, these legacy applications, into containers, into microservices, and into technologies that can make that transformation a little easier. And then, of course, the focus of the talk is service mesh and Consul Connect.
Consul Connect is a new feature in HashiCorp's Consul that focuses on service-oriented networking, and we'll dig into it.

A little bit about me: I'm a solutions engineer. I just came over to HashiCorp from Mesosphere; I was a Mesos guy for quite a few years, and then DC/OS. I used to work here with Jeff; I was VP of engineering at CarsDirect, so I did the whole automotive thing down in the South Bay area. I had a break with insanity and went into management for a while, and then I got back into individual contributor roles, mostly working with startups. My technical pedigree these days is DevOps and systems architecture; I started out in development and databases, and before I even got into that, I was running with some guys here at JPL, in the embedded systems world. That was where I started after college, when I retired from active military, though I stayed a reserve officer up until about 2008. My interests are obviously technology, as all of yours are, plus the cultural impacts of technology, which I think are fascinating, and our crazy culture that we're living in. And I love anything outdoors: I still ride a bike, I still love the fresh air going through my lungs.

So anyway: transitions, they're hard. Waterfall, scrumfall, agile. Brick and mortar to multicloud. Bare metal to containers. Digital transformation driving corporate culture. It was interesting, even back around 2000, working with a very hard-to-get-along-with CEO, having these discussions that we were really a technology company. Remember that, Jeff? He wanted to think we were a content company, but if you didn't have the technology, that content didn't get out. You can look at the business side of that, but technically, where we're going is from systems of record in brick-and-mortar data centers, opened up with API gateways and service discovery, to systems of engagement: mobile apps, different kinds of engaging web experiences. And then, of course, traditional businesses coming to the realization that they're technology companies. For some, I don't know why this is a hard thing, but it just is.

So what happens if transitions don't occur? I was with a startup, an API company in Palo Alto, and among my first clients were this new company called Netflix that nobody had ever heard of, and another one, Redbox, that nobody had ever heard of either. And these guys were cooking with charcoal when we came along. We were exposing everything via APIs; the world was massively changing, and digital transformation was starting to make an impact. I remember going into this company and telling them: hey, you guys are going to get left in the dust, and we've got a great idea that can help you out. Obviously, we wanted to sell them our product, but: what about an app that would take care of all those penalties for returning those videos? Everybody in this room has probably coughed up a few bucks for that. And of course, Reed started Netflix because he got pissed off after being charged 40 bucks for returning Apollo 13; that's a real story, by the way. So all of these things actually happened.
And the point of putting the timeline up there is what happens today to non-adopters, to people who don't change. There's a saying out there: "I don't mind change, but I hate the act of changing." I think we can all relate to that one. Most of us are probably more comfortable with change than the average person, just by virtue of what we've been doing for the last five to twenty-five years. But Blockbuster did not evolve, and now there's one store open. I find that fascinating: going from 9,000 stores down to one. And it's because they didn't change. You can think about a lot of other reasons why they didn't, but that's just presupposition. Changing was critical. Look at what Redbox did to them, and then of course Netflix, and how they've evolved. The other thing I want to point out is the open source pedigree, which I think is an extreme differentiator and something we want to bring into our companies. Take a look at these three companies. Blockbuster, one of the biggest companies of its time, went from 9,000 stores to one. Redbox, well, that was a technological obsolescence thing; are those kiosks still out there? I think they are, but they're probably not doing this well. And now look at streaming, but also look at the contributions of Netflix to the open source world. That's what we're here for today: open source. I just look at some of these tangential cause-and-effect things, and once again, open source stands out.

So, as a military guy: before we'd jump into a gunship, we would get briefings and be given rules of engagement. And the number one rule of engagement we have to think about with change is fear. Fear, having been in management, just seems to be epidemic. The thing that dispels fear is knowledge; that was the point of the Bill Gates quote. And understand that when you come under condemnation, you're not alone; that was the point of the Albert Einstein quote. I think we've all seen this. I don't like to think of us as better than anybody else, but I do think we have more courage when we're willing to put our reputations on the line and move the technology ball forward. And then, to be truly disruptive, we have to keep one other law in mind: Conway's law. It's easy to fall prey and slide back into very comfortable habits because of it. Disruption requires breaking that rule; it's key. If you want to be a disruptive company, that rule has got to get broken. So I see these as our rules of engagement, and that leads to this discussion of Consul.

So we've made this evolution, and we are making this evolution, and I see it every single day. People are in one of these states, or in several of them at once. Corporate acquisitions: you could have one company running containers and three more over here still in the data center running off IBM mainframes. So we have a lot to work with and deal with in our space. As you know, traditional networking is pretty much predicated on VIPs and static IP addresses: we deploy the application, we put in a ticket, and we wait, right? And one of the mega trends in digital transformation, as I see it, is velocity. It's getting stuff done faster. It's what caused the DevOps culture.
It's what caused Agile to come about: from four releases a year to pumping out software every two weeks on sprints. It's got to be done faster. So how do we do that? With Consul, we come at this with the idea of a service registry that enables routing. We're going from IP addresses to names, because things are going to move around: you have a cluster, pods can move around, they don't necessarily stay on the same node. Then there's being able to find each other and establish configuration, and finally, how do we secure all of this? There are a lot of different tools out there to choose from; I personally love our segmentation approach, where we use intentions, and we'll get into that today. So these are the big three: service discovery, service segmentation, and service configuration. And when I say service configuration, I mean the configuration you'd read from your config file or fetch from your database; we have a different, distributed way to do that in Consul.

So let's talk a little bit about service discovery and connectivity in dynamic infrastructure. The whole idea is that service discovery pairs much better with dynamic infrastructure. The static way of doing things was fine in the client-server world: we passed around an IP address, we embedded it in a couple of config files that were easy to manage, and life was okay. It wasn't that bad. In fact, back in the day, Jeff, we didn't even put that stuff in version control, did we? That was one of the things I felt early on: we should be saving off our configuration, because what happens if it gets slammed? Today, if DevOps users aren't well versed in Git, they're probably out looking for another job. So how does it work? The traditional way was to stand up a monolithic application with several subsystems inside it, and those subsystems talked to each other with what was basically an in-memory call. It was pretty easy; they knew how to obtain data from each other. But what happens when we start putting these in containers and we have multiple replicas? Is the replica we're trying to reach healthy? Is it active? Where is it? What node is it on? People have solved some of this with things like HAProxy and load balancers, but then you have issues of high availability; you've got a single point of failure with some of these. One of my colleagues likes to call this an anti-pattern. I don't know about an anti-pattern; it's just not highly available. It certainly works fine, it's not a bad thing. But the service-oriented solution is a little different: when a service comes up, it registers itself. In this case, with Consul and the service registry, a client of that service will then query, typically via a REST call, to find the IP and port that are available, and then make the call. So today we have two challenges, and this is an evolving space. There's the north-south challenge, which is getting traffic into the cluster from the internet, and then there's the east-west challenge of what you do inside the cluster. Different clusters take different approaches. I was a Mesos guy, a DC/OS guy, so we had Minuteman taking care of east-west there.
It was built on top of VXLAN: very elegant, very fast, doing round robin load balancing. Pretty cool. But what we're going to talk about here is Kubernetes, though I've also had Consul working elsewhere. I have an older GitHub repo, under John DeHoney Jr., with a Marathon implementation I did as a 12-factor app. Actually, I wrote a paper called "the 13-factor app," because I felt they left health checks out of the 12-factor app. That one used environment parameters to set up the Consul config file, and it's still out there; I looked at it before I came here. I've got some newer stuff in my new account, John-DeHoney on GitHub, and that's where I'll put these slides when I'm done. I had a little more work than I expected at the conference, so I haven't gotten to that yet.

So these are the bullets for Consul. The service registry, with our own DNS interface. An HTTP interface, which is to say REST; all of HashiCorp's tools have HTTP interfaces. Load balancing and health checks, which is where I put the big green open source emblem. And then multi-data center: some aspects of that are in the enterprise version, as is our Sentinel product, which applies real-time policy to your network. So, service registration. This is the process from the first set of slides that any service goes through to register and advertise what it does. And the health check part: as lead developers, as lead DevOps people, you really have to hammer the developers on this. I don't know why they hate putting /health on their microservices, but how many of you have seen that? No /health. It's crazy. In the containerized world, that health check is what tells the executor or the control process: are you healthy or aren't you? If a service fails its check and you do a DNS query using Consul's DNS, it will never come back in the answer section. If you did an SRV query, you would not get that service's address and port back in the dig output. It just won't happen. And it's the same with REST: you query for healthy, and all you get back are healthy containers. So to succeed with this kind of service discovery, this is an area where you've got to get a little angry every once in a while and say: guys, make this a review criterion. It's got to be a checkbox. Does your API have /health? It's got to return a 200. Depending on the platform there may be other information, but typically it's the 200; anything else that comes back fails the service. Then we have the DNS interface, so we can query based on the Consul domain, and the domain is configurable, so that can be changed.
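To make those two lookup paths concrete, here's a quick sketch of a DNS SRV query against Consul's default DNS port and a health-filtered HTTP query against the default API port; the service name "web" is illustrative, not from the talk:

    # DNS: SRV records come back only for instances passing their health checks
    dig @127.0.0.1 -p 8600 web.service.consul SRV

    # HTTP: list only the healthy instances of a service
    curl http://localhost:8500/v1/health/service/web?passing

Either way, an instance that fails its health check simply disappears from the answers, which is what makes the registry safe to dispatch against.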
Oh, and by the way, one of the things I was thinking about when I really looked at Consul: I'd been at HashiCorp for five months, and I had used all of HashiCorp's tools before I came here, but in very pointed ways. I was working for a cruise ship company and worrying about satellite timeouts and WAN configurations; I was working for other kinds of companies using Terraform; and so on. So now I've got to know everything. And I think of command line parameters like tree rings: you know how you look at a tree and go, oh yeah, this thing's this old? You can almost look at a piece of software and gauge its age by its command line parameters. So think about that as you look at things.

Okay, the HTTP interface. So we do have REST here. This is just showing a query for a specific service, a name-based query. Now, think about your API gateways. This example only has one instance, but if you scaled out your replica set, you're going to get multiple healthy ones back. So say you installed Ambassador or Apigee or some other gateway: you want to do a lookup on your service before you dispatch your call, because you don't want to dispatch to an unhealthy instance. That's the point of the catalog query. You can also query for all your services, but when you're dispatching a call from an API gateway, you want to dispatch to a healthy instance, right? And then, depending on how different east-west load balancers work, be sure to read the documentation: sometimes it's least connections, sometimes round robin. Think about the best strategy as you put this together on the DevOps side.

And then health checks. I can't say enough about these. The gossip layer provides liveness checks for all the nodes, and agents run the health checks locally. This is very much like some other architectures: you can perform health checks from masters, but you can also have executors performing them. We do it locally, which I think is a much better approach. There are several different types of checks available, so check the documentation for your situation, whether you use a Nagios-style script or a Docker check, as examples. So here's just a sample of a health check, and oh yeah, a shameless plug for the 13-factor app paper I wrote. The thesis really boils down to: get developers to treat the health check as a review criterion for code inspections. It's got to be something that gets checked off. And then as DevOps, we enable it when we create our service configurations. And then our web UI: if you haven't looked at Consul for a while, it used to have a green bar across the top, around versions six, seven, eight, and now we've got the red bar. That's what the web UI looks like, and we'll take a look at it in real life in a minute. So, service discovery using Consul: there are a couple of approaches. I think one of the bigger ones, with our enterprise product, is multi-data center resiliency for cross-region clusters, a situation you're all going to run into. We're a global community today; acquisitions happen across countries. If it's not on your radar today, it'll get there very soon.
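For reference, a service definition with one of those health checks looks roughly like this when you hand it to the local agent; the service name, port, and endpoint here are illustrative, not from the talk:

    {
      "service": {
        "name": "web",
        "port": 8080,
        "check": {
          "http": "http://localhost:8080/health",
          "interval": "10s"
        }
      }
    }

The agent runs the check locally on its interval, and only instances whose /health returns a 200 stay visible through the DNS and HTTP interfaces.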
So, service segmentation. I think this approach is very elegant. I remember back to the first times we were using CNI with Mesos: I thought that, gosh, Weave and Calico were so cool. Right up until the point I started looking at the iptables rules for it, and it was allow-all, right? It was nice, but it created another problem: how am I going to secure all this? I don't know about you guys, but I think I'd rather go to the dentist than manage my iptables. So this is the challenge: how do we secure service communication in a dynamic world? Do we want to be up at all hours of the day and night with iptables? Or, as I'm going to show you, use service segmentation and our intentions, where it's all name-based? Everybody remembers this one, right? Good old IP rules, firewall rules; that's how we did it. Now we go to the dynamic world: we still have our main business application, but because of horizontal scaling we may have multiple instances of it. So how do we do that, in a way we can still maintain, you know, without paying a lawyer for the divorce?

The way this works with Consul is that we use proxies. These proxies can be inside the Kubernetes cluster or outside it. Typically there are two proxy options, and one thing I found out in doing this is to focus on Envoy, which is just much more dynamic than our vanilla Consul proxy. I was told by one of the engineers that we're going to be putting more effort into that, from the open source side as well as the product side, so it will be in our open source product as well as the enterprise one. I'm only three steps ahead of you in learning this, because it's so new, but I had fewer problems once I started using it. So this web application can be inside the cluster and the DB outside it, one side in, one side out; these proxies work both ways. And the way it works is that you instrument your Kubernetes deployment with some annotations, basically enabling injection of these proxies into your pod. What happens then is that each side gets a unique certificate, mutual TLS is established, and everything is secured between them, whether it's in-cluster or outside the cluster. And here's the fun part: once you have these proxies, in the debug output there's this little key exchange, SPIFFE X.509, back and forth. So it's all secure, it's locked down, probably the best and latest of the X.509 PKI practice. And then we get rid of this mess. What do you want to maintain: that pile of iptables rules, or the name-based rules over there? This is definitely going to keep the family happier. You're not going to have to leave the room to fix firewall rules because somebody launched a container and forgot to tell you about it. And if you do have to run into the other room, how long is that going to take versus this? With iptables you're going to forget one over here, and they're going to call you at two o'clock in the morning when something migrates to another node. So the sidecar proxy is kind of cool: no code modification. You don't have to change your Docker image. The only thing you change is adding some annotations to your deployment. That's it, it's one or two lines, as in the sketch below. I don't want to say trust me; just look at it.
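Here's a minimal sketch of what that annotation looks like with the consul-k8s sidecar injector; the deployment name, image, and port are hypothetical:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
          annotations:
            # the one line that asks the injector to add the Connect sidecar
            "consul.hashicorp.com/connect-inject": "true"
        spec:
          containers:
          - name: web
            image: registry.example.com/web:1.2.3   # hypothetical image
            ports:
            - containerPort: 8080

Once the sidecars are in place, the name-based rules he contrasts with iptables are just intentions, for example `consul intention create -allow web db` to let the web service talk to the database.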
And of course all the other cool things here: there's managed and unmanaged, it's flexible, and the performance overhead is negligible. Here is a Connect proxy registration for the database outside the cluster. This is not a Kubernetes annotation; this is a registration for a service outside the cluster, where you're still running Consul.

Now, service configuration is another area, and I think this is where a lot of people come to Consul first. You typically don't get to Kubernetes on step one; it's usually here. The scenario plays out like this: I'm a company, and I've been using config files and databases to get configuration into my C# code or my Java code. Now we're going to put this thing out on the network, and my database may be a container and may move around, and all kinds of things. So what's a better way to do this? As I said, service configuration before meant command line arguments, static config files, registries, environment variables, and, God forbid, the hard-coded magic number. Remember yelling at people about that one, Jeff? And the challenge is convergence: hours to make configuration changes. Once again we go back to that mega idea of velocity; we've just got to do stuff faster these days. And reacting to dynamic services with static files? It's not going to happen. So what we do is implement a distributed key value store. Changes are made to the key value store, and clients are efficiently notified. I'll walk through how this works when I talk through the network architecture. And honestly, service configurations can be pushed out in literally seconds: seconds to the first server, and probably a few more seconds to propagate through the WAN. But it's not hours; it's seconds.

So let's talk a little about the architecture. Basically, we run Raft for state, very similar to etcd and ZooKeeper, where you've got to have a quorum: three, five, seven, nine servers for these kinds of configurations. Agents forward client requests to the Consul servers; everything goes through the agent to the server, and the server worries about replicating. The servers are the source of truth, though some things are cached by the agent. Each agent manages the local services and health checks, because we want that to be fast; for a particular cluster it's usually a LAN concern, not a WAN concern. So here's how this works: agents talk to each other, and there's a leader election among the servers. The difference between an agent and a server is just a command line parameter that says -server, or a config file entry that says server is true. And what we can see here is a key value update coming into an agent. The update comes into the agent, and the call is made to the server.
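As an aside, that key value store is directly scriptable from any agent; a minimal sketch, with hypothetical key names:

    # write a configuration value
    consul kv put config/web/db_url "postgres://db.service.consul:5432/app"

    # read it back (from any agent; the request is forwarded to the servers)
    consul kv get config/web/db_url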
Back to that update: if the server receiving the call is a follower, it forwards the call to the actual leader, where the update is made, and then it replicates out. That's how synchronization is done for a simple key value update. And like I said, this is where a lot of people start with Consul. This distributed key value store, what we call service configuration, is where most people dip their toes into the water moving from monoliths toward microservices. It's just getting their monoliths out onto the network, getting out of the data center. Maybe your company bought a company that used Google Cloud and AWS, so now you've got to work in a multi-cloud environment. Or, hey, we're going to do business in China, so we're going to be on the Alibaba Cloud. All kinds of fun things are headed our way.

So, sources of truth: agents hold service registration and health check data, and everything syncs back to the servers, which are the authoritative sources of truth. Register services using the /v1/agent endpoints, not /v1/catalog. I put that in there because I don't think the documentation was clear, and this one bit me: why isn't this working? So I put that note in the slide. The Consul servers hold everything else; they are the sources of truth, and everything is obtained via a network call. WAN federation is available to a small degree in open source, but more of it is in the enterprise version. Agents don't change; the Consul server agents form a WAN cluster using the same gossip protocol, so it's not too different. The one thing to remember about Consul, and I put this in here, is don't forget about your UDP ports when you're using NAT gateways and the like. You open up the TCP side and, crap, it doesn't work. Don't forget UDP: typically 8302 and 8600 are your UDP ports, and I think 8301 too. And later on there's a gRPC port as well, which is something to start thinking about when you're using protocol buffers. It's really fast. In fact, some of the guys were talking about a performance test they did last week with HTTP/2 and protocol buffers, and they were like, wow, this is awesome: single-digit-millisecond response times.

So this is a multi-data-center cluster, and you can take that "data center" word and put regions in there for the clouds; it doesn't matter. You can take data center two and make it an Azure region, or an AWS region; the configuration is going to be the same. And once again, here's the key value update. I talked through this already, but I forgot I put this slide in: assuming the call goes to a non-leader server, it gets forwarded to the leader, replicated out, and then synced back. So, architecture: how would we use Consul? This is a traditional eight-region, hundred-microservice architecture, with the typical F5 configuration of a global traffic manager and a local traffic manager, getting traffic down to our microservices through VIPs. And for any of you who have worked with this: man, I can't believe all the configuration behind it. It's incredible.
So, changing to a Consul architecture, and I notice some of the letters got chopped off here. What we would do: now, there are some other cool tools that HashiCorp has, and I threw consul-template in here. This could be HAProxy, this could be NGINX, but as the configuration changes you need to reload that reverse proxy, and consul-template makes that a breeze. Once you get past the configuration language and how to write it, and it's a lot of curly brackets, it makes updating NGINX and HAProxy very simple. So check out consul-template. (Yes, it's very similar to that.) We have one for files as well as one for environment variables: one's called envconsul, the other consul-template. I think they're eventually going to take the word Consul out of the names, because they don't really have to be tied to Consul; we want people to think of them as general tools. On the host, yeah; it doesn't have to be on a pod basis. Go ahead and ask your question. (Oh, I was just going to ask whether the agent needs to be deployed per pod or just running on the host, but I guess you answered that.) Just on a node basis, yeah.

And unfortunately I'm using Minikube for the demo. I got beat up a little: dude, why didn't you use EKS? Because I fear the network. I've been to conferences where it's time to give my talk and the network decides to die. So I'm sorry, I'm lame: I used Minikube rather than EKS or the other big-boy tools. (What's that? Yeah, just for the demo.) And by the way, everything done here under Minikube I've definitely had working on EKS: through Terraform, through eksctl, and through gcloud as well as Terraform on Google Cloud. Consul works on all of these, no problems, vanilla. I do have a Vagrant configuration I can put up there, but it was kind of funny: Vagrant acted weird with persistent volumes. Somehow the persistent volume claim registered, but the persistent volume wasn't there. I didn't want to leave you with headaches like "oh, here's an apply you need to do with kubectl to set up the persistent volume." There were some things where I just said, okay, I'm not going to do this; I'm sure there's a good reason it's doing that, but I like a demo to just work.

So anyway, back to our configuration, where we make a request in through our global traffic manager to our reverse proxies and then down to our microservices. The north-south is being handled at this level, and the east-west is being handled with Consul Connect. And as the gentleman here asked, the Consul agent would be on each node, and then we would have a quorum of servers. I should have drawn that as a quorum; it looks like one server, but it's really a quorum. Three, five, seven, nine, yeah. So, to talk about distributed systems, and this is where I'm going to get into Kubernetes; we're doing well on time. We all started off with the Hadoop file system, running big data jobs. Then the cloud came along and we pushed things out to S3. Then we went to distributed memory, and now we have distributed compute with different types of containerized applications.
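To show why consul-template makes those reloads a breeze, here's a minimal sketch of a template for an NGINX upstream block; the service name and file paths are illustrative:

    # web.conf.ctmpl -- consul-template re-renders this whenever membership changes
    upstream web {
    {{ range service "web" }}
      server {{ .Address }}:{{ .Port }};
    {{ end }}
    }

You would run it with something like `consul-template -template "web.conf.ctmpl:/etc/nginx/conf.d/web.conf:nginx -s reload"`, so every render is followed by a reload of the proxy.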
So basically, to start into the whole Kubernetes discussion, this is a discussion of a job scheduler. Whether it's Nomad, whether it's Kubernetes, whether it's Mesos, what we're talking about is a job scheduler. The only differences are how it schedules, how it considers fairness, and the granularity of objects in the system. The big thing is that we need to run our applications and provision compute resources. If there's a failure, we need the ability to maintain the desired state; we'll talk a little about that. Obviously generate logs. If any of the nodes become unhealthy, move the pods or containers to ones that are healthy. And then micro-scaling: I like to make a distinction between scaling inside the cluster and things like auto-scaling groups for the actual nodes, because scaling inside the cluster is about pods, the number of replicas.

So this is Kubernetes in a nutshell. The workhorse here is the pod. We have master nodes and worker nodes, and on the worker nodes the kubelet starts up pods, which load containers from whatever container registry, Quay, Docker Hub, wherever we're getting them from. We have many different types of Kubernetes resources, and here's just a sample deployment: a simple API that gets deployed and runs on port 8080. Kubernetes is a desired state tool. You give it a deployment, you give it a replica set: this is what I want. And Kubernetes drives to that desired state. This is very similar to Terraform, if any of you are Terraform users: Terraform is desired state with its declarative nature, unlike the configuration management tools. You tell it, I want five servers, and if you have three, it's going to provision two more. It's just a different way of doing things, driving toward a desired state. So this is the big thing people need to understand about Kubernetes: it functions on desired state, and the control loops in Kubernetes drive to make that desired state happen.

So how do Consul and Kubernetes work together? We advocate the use of Helm. Helm is our tool for deploying Consul. It's very simple: we have Helm charts out there on the internet today, and while I'm sitting here talking you can Google them and find them very easily. We have copious documentation on our website. These are your friends. Now, I do want to say one thing. I know, having worked with certain government organizations, that some can't use Tiller; it's a security problem. So there's such a thing as a Tillerless install with Helm: you run Tiller on your local host, and it uses the API to apply your Helm charts. Just be aware of that. The other thing you're going to run into, not with Minikube, but with the big-boy Kubernetes implementations, is that most of them have to be appeased with RBAC. There's an RBAC role that needs to be set for Tiller in order to allow you to use it on these clusters. In other words, what happens if you start to install a Helm chart and it just sits there? Well, before you even do that, just do a helm list and make sure that Helm is talking to Tiller.
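For reference, the Helm 2-era install he's describing looks roughly like this; the chart path and release name follow the demo's conventions, so treat them as assumptions:

    # grab HashiCorp's Consul chart and install it under a fixed release name
    git clone https://github.com/hashicorp/consul-helm.git
    helm install --name consul ./consul-helm

    # sanity check that Helm can reach Tiller before anything else
    helm list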
Nothing will come back on your first helm list, of course, but at that point you'll know: oh, I remember that guy talking about RBAC. I will put those RBAC credentials in the repo I put out there, so you'll have them. The alternative that always works is to bring up another window, start Tiller in that window, and set the environment variable to point at it. I think it's TILLER_HOST; I have it written in my notes, so I'll tell you before we're done.

So here's a data center configuration with Consul. This is what I was talking about: you're going to have Consul loaded up with Helm in your Kubernetes cluster, and then out here, where your monolithic application is, you'll have your proxies that communicate with your Kubernetes applications, and this is where you can manage your intentions. You do have to use the proxies, and you want to use the proxies. Because here's what's cool. How many of you are using Istio now, with the Istio gateway? Are you running that over IPsec, or something like AWS Direct Connect? The cool thing about this is that you don't have to worry about that stuff, because everything's encrypted. You don't have to deal with IPsec, you don't have to deal with Direct Connect. Okay, ten minutes, oh my gosh. Okay, let me get through this quickly. Cluster networking: Consul Connect takes advantage of localhost and uses the Connect proxy, and our tool suite is basically Kubernetes, Consul, and Helm. That's our solution to networking with Kubernetes: Consul Connect.

So what I want to do is jump into the demo, but real quick: yes, we are hiring. I always get asked this question. We are hiring, so definitely go out to the HashiCorp site and take a look. And what I really want to point out is learn.hashicorp.com: great tutorials, most of them with little videos. The gal who put the Consul ones together worked with me at Mesosphere; she's a rock star, she's really good. We have some great code repos out there, Vagrant configurations, things like that with Consul, and a lot of videos on our resources page. I counted, and the first page on YouTube was completely full: everything from our founder, Armon, talking about high-level ideas that would be great for your managers, down to some HashiConf talks on bit-twiddling-type stuff, distributed semaphores, things like that. Our enterprise tools use policy as code; Sentinel is our policy engine. And we have a Google Groups community for Consul, so if you need help, that's the best place to go on the open source side. So let me get to this demo. The only thing about this demo that makes me grumpy is that I sort of feel like I'm having to do it with one arm behind my back. There we go. Can everybody see that? All right. What I like about this is that it's fast as hell; even on EKS, you're not going to be standing around waiting long. And like I said, guys, I can't type, and this is horrible.
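Since the RBAC credentials are going into his repo, here's the shape of the standard Tiller service account manifest: the usual, if coarse, cluster-admin binding, so check it against your own policy before copying:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: tiller
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: tiller
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
    - kind: ServiceAccount
      name: tiller
      namespace: kube-system

After applying it, `helm init --service-account tiller` ties Tiller to that account.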
So here's what we've done, and note the --name flag: if you don't put that in, then, like with Docker, you get a crazy generated name, and it gets prepended here. So if you forget it, instead of nice names like consul-server and consul-ui, you'll get something like fuzzy-tiger-ui, okay? Exactly, yeah. Sometimes it's kind of funny when your wife comes in the room, looks over your shoulder, and goes: what are you doing? Okay, so let's take a look at the UI. Hmm, is that right? I don't think that's right. Let's just start again and go back over here. (And the question was about the default browser; it's just a little awkward having to use it like this.) So we see our Consul service, we have our nodes here, and we have our distributed key value store here. I have the ability to query this, and I did cheat, I put some of these queries over here, but the network changed ports on me. Yeah, that's the thing about turning 40, guys: this vision thing doesn't get better, and neither does this mirrored display thing. It's not working out well for me; I'm going to let this thing kick my butt, I can't work this way. But basically you have the ability to curl anything and everything in Consul. You have the ability to set the intentions, whether via curl or the UI. We have a command line interface; I was going to kubectl exec into these things and show you, but I can't work this way. So, apologies about the demo. Are there any questions? One minute left. (Are there any plans to implement dynamic DNS in Consul?) Dynamic DNS? I may have seen that on the roadmap, but if you want, after the talk I'll give you one of my cards; shoot me an email and I can talk to one of the product managers to find out whether that's a requirement for you. I don't remember it being on the roadmap, but it's hard to say; we're an open source company and things change all the time. Okay, thanks. Yep, I'll give you back 30 seconds of your day. Have a good one, guys.

We'll get started in just a second, but ahead of that I wanted to let you know: if you're looking at the paper schedule, the 4:30 talk in the containers and virtualization track is cancelled because the speaker is ill. It's been removed from the online schedule, so it's no longer discoverable there, but according to the printed schedule the cancelled 4:30 talk was "State of the Art Edge Routing." Just FYI, 4:30 is not happening. So it's three o'clock, and this is going to be the last talk in the containers and virtualization track: Michael Ducy from Sysdig, talking about audit events and Falco.

Thanks, everyone, for joining me. I'm Michael Ducy; I run our open source programs and projects at Sysdig, one of which is our namesake project, Sysdig. Another is Falco, and there are a few other ancillary tools that help you use Sysdig or Falco. I'm primarily going to talk about Falco, but I'll give a quick demo of Sysdig, what you can do with it and how it works, because explaining the concepts of Sysdig will help you better understand what we're doing inside Falco. So first I'm going to run through some container security best practices real quick.
This sets the stage for why we're doing what we're doing and why we feel you need an additional control in your environment: mainly because container security best practices are hard. Not because they're hard things to implement; there are just so many knobs to tune that traditionally have been tuned for you by the hypervisor. The hypervisor provides very strict isolation, and in a containerized world all those tunables are now collected behind the abstraction we know as a container. So it's really good to understand what the tunables are and how you can tweak them. I like to think of the layers of container security as controls at the infrastructure layer, controls at build time, and controls in the actual runtime environment. Some of these bleed together: controls you put at the infrastructure level also affect the runtime environment. Things like network policy feel like a control applied when you actually go and launch, but you have to have the underlying components in place in your infrastructure to use those features of Kubernetes.

So let's talk about the infrastructure layer. Container security really starts with what the host operating system provides, and we've had these things around for a while: seccomp, and mandatory access control systems like AppArmor and SELinux, commonly known as Linux security modules. There we go, I remembered the term. You can see on the left-hand side what you can actually do: this is an example of a policy restricting the program tcpdump. AppArmor tends to focus on what individual processes are allowed to do; SELinux takes a broader approach around the whole host operating system. So you can tune at the process level or the host level, depending on what you want to do. And seccomp is something you actually use every time you launch a container without necessarily knowing it. Seccomp provides a lot of the isolation around what Linux containers can do: it restricts which system calls can be made, a very fine-grained control over what a container can ask of the kernel.

There's also resource usage, and this is really important, because in a containerized world, if you put no restrictions around scarce resources such as CPU, RAM, and Linux tunables, you can have a situation where somebody can easily DoS a container host by using too many resources and starving all the other containers running there. So always make sure you're putting resource restrictions in your deployment YAMLs around CPU, RAM, and storage. File descriptors matter at times as well: if you have a program that leaks file descriptors, it can consume all the file descriptors of the node your containers are running on. These are all important things to put in place inside your deployment YAMLs.
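A minimal sketch of the kind of deployment-YAML restriction he means, as a container spec fragment; the numbers are illustrative:

    # requests/limits keep a noisy or leaky container from starving the node
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
        ephemeral-storage: "1Gi"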
There are also a lot of best practices around the Dockerfile itself and what you actually put into it. Don't forward privileged ports. Don't run as the root user: force the USER directive so you're not automatically running as root. Even if you're not running as a privileged container, running as the root user means that if somebody breaks the isolation, they have root across the entire host instead of access as a particular user. Also really important: when you build your containers, remove unnecessary packages. Containers give us the ability to restrict the blast radius when somebody compromises an application running inside one, and limiting the tools inside a container keeps that person from using it to hop around, explore the environment, exfiltrate data, and so on. And then there's the COPY-or-ADD question for your Dockerfile. COPY is much more explicit: you say exactly what you want copied in. ADD lets you do things like pulling down files from HTTP endpoints and having tarballs automatically extracted. That's good in some cases, but it doesn't let you know what was actually inside the container when you built it; with COPY you can be more explicit.

So how do you control privileges in this environment? You need to enforce the mandatory access control systems we talked about, and make sure applications are not able to run as the root user. Just like before, we have to create user accounts, application accounts, or service accounts for our applications to run as. You also might want to look at removing programs with setuid or setgid permissions, or try to avoid using them if possible. And do not set your namespaces to the host's: if you do, you essentially end up with access to the entire host namespace, which could then allow you to escalate privileges further. If you want to check a lot of this host-level configuration, there's Docker Bench for Security, and there's kube-bench as well, which will check the host your containers are running on to make sure you're enforcing a lot of these things. These are enforced in a couple of different ways: explicitly, via the configuration of the container runtime, or when containers actually launch, via the flags you pass in.

Some other considerations, and this goes more toward the build side: always pin a version. We watch Docker Hub pulls for our open source projects, and it's pretty interesting how many people deploy with always-pull and just constantly use latest, because as soon as we do a build and people cycle their nodes, we see a spike of people downloading the new versions of our software. You should mirror what you need back into your own container registry, and be very explicit about what software you're allowing to be pulled into your environment.
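Putting a few of those rules together, a Dockerfile following this advice looks roughly like this; the base image, binary, and user names are illustrative:

    # pin the base image; never build production images from :latest
    FROM alpine:3.8

    # COPY is explicit about what lands in the image; prefer it over ADD
    COPY ./app /usr/local/bin/app

    # create and switch to an unprivileged user instead of running as root
    RUN addgroup -S app && adduser -S -G app app
    USER app

    # expose an unprivileged port
    EXPOSE 8080
    ENTRYPOINT ["/usr/local/bin/app"]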
Avoiding image and container sprawl is important as well. If you're able to control what images are actually available on a host, you can make sure you won't have software you don't want running in your environment. If a container image with a vulnerability is left lying around, somebody can possibly spin it up, so if you want to enforce these best practices around container images, that's one area to keep cleaned up. Yes? (Sorry, the question is: what do you have against mounting the Docker socket?) Because you basically get control of the entire Docker daemon if you have access to the Docker socket and can run commands directly across it. You then have a root-level process that can control a lot of things on the underlying host operating system: you're able to change namespaces of running containers and manipulate the actual container runtime. There are use cases where you do need it. We actually do it, because we have to interface with the Docker daemon, and if you're building containers you might have to interact with the Docker socket as well. That's why rootless containers, and being able to build containers rootlessly, have started to become more of a thing: you assemble the layers without going through a container runtime, so you don't need the Docker socket to build the container at all.

All right, let's talk about Kubernetes real quick. In Kubernetes we have this concept known as RBAC. We're one abstraction layer higher than the container runtime now, talking about who can interface with my container orchestrator, the platform I'm building to offer developers resources. You have users, service accounts, and groups, which are your actors. You have resources those actors can use. And you have roles, which tie actions to those resources. RBAC was initially confusing when I looked at it, but it starts to make a lot of sense once you understand the three components and how you piece them together. Basically, you create a user; that user gets access to resources through roles; and you can very explicitly define which resources can be used inside a role and which actions can be performed against each resource. This is really important, especially because a lot of things running in Kubernetes need access to the API server to query information about the other things that are running. For instance, Falco needs to talk to the Kubernetes API server to get information back about running pods and the like, so we can include it in our alerts. So we create a service account that limits our access to the API server: just enough access to do our job. It's important to make sure RBAC is enabled, that you're actually using it the same way, and that you're not just giving everyone cluster-admin rights across the entire cluster.
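The three pieces he describes, actor, resources, and role, fit together roughly like this for a service account that may only read pods; all the names here are illustrative:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: default
      name: pod-reader
    rules:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      namespace: default
      name: falco-pod-reader
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: pod-reader
    subjects:
    - kind: ServiceAccount
      name: falco          # hypothetical service account
      namespace: default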
There are also pod security policies in Kubernetes, an abstraction on top of the container runtime settings I just described. They give you the ability to create policies for your pods that say: in this environment, I want to drop these particular Linux capabilities; I want this particular seccomp profile to be used. So for an entire namespace, maybe one dedicated to serving PCI workloads, you can have a pod security policy that controls a lot of the things I talked about before, and your other environments can have different pod security policies. We've created a pod security policy advisor, one of our side open source projects: it scans your namespace and gives you a recommendation, based on what's running there, of what your pod security policy should be, and then you can modify it and tighten it up if you want.

Another pretty cool Kubernetes feature for enforcing policy dynamically is admission controllers. Admission controllers let you authorize actions inside the cluster, and Open Policy Agent is a good example of a back-end service you can use for this. Say somebody's creating a new deployment: the request goes to the admission controller, which gets asked, should I allow this in, based on this set of parameters and what's trying to be launched? So you can say: I want to make sure images are only ever pulled from my private registry, and if someone tries to pull from the public Docker Hub registry, deny the request, and the deployment won't be created. There are also mutating webhooks, where the same question gets asked, but if you have that exact same image in your own repository, you can just rewrite the image reference to your private registry's URL, and that workload is still able to spin up and move on. Additionally, there's Kubernetes network policy, typically handled by tools like Flannel and Calico, which lets you control on a very fine-grained basis who can talk to whom inside a Kubernetes cluster, at the level of an individual pod, deployment, or service.

And the last component of the Kubernetes platform security features: secrets and certificates. The important thing to emphasize is that if you're using a key management system, one power it gives you is the ability to rotate your keys much more often than when they were static things we used to manage in our /etc/apache directory. The process of rotating keys was always problematic; now that we have an API we can request keys from in a seamless way, and can integrate with Kubernetes, you can get secrets very quickly and deploy them onto your pods or your applications. The other important thing to think about is your users: you want to rotate the users we were talking about in our RBAC world, and their tokens as well.
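On the network policy point above, a minimal sketch: only pods labeled app=web may reach the database pods on their port; the labels and port are illustrative:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: db-allow-web
      namespace: default
    spec:
      podSelector:
        matchLabels:
          app: db
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: web
        ports:
        - protocol: TCP
          port: 5432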
If the tokens don't expire, they can just live forever. So if there's an old laptop, or somebody's laptop gets stolen, those keys are still out there. If you have a policy that automatically rotates tokens every two weeks, you know that any tokens that got leaked will eventually expire and no longer be any good. So it's good to have a policy around that. The same goes for service account tokens: they should be treated just like user tokens and rotated on a periodic basis.

So let's talk about the build layer. The build layer really consists of image scanning, plus best practices about what actually gets into the image. Image scanning is important because the layer design of Docker makes it very easy to abstract away and build new things on top of work other people did. We can provide the base-level image we want everyone to build from, but the problem is that when there's a vulnerability at a lower layer, you have to understand your exposure across your entire environment from that one vulnerability. It was introduced at the Apache layer, and now the WordPress image and the PHP image are vulnerable as well. There are a couple of different options you can use for this. One we're fans of at Sysdig, and we partner with them and use their engine, is Anchore Engine. Anchore Engine provides a centralized service for inspection and analysis, and more importantly, it provides user-defined rules, or policies. It also looks at a lot more than just your container images and what the operating system has installed: you can point it at things like RubyGems, pip packages, npm packages, and so forth. I'll talk about that more in just one second. There's also Clair, by the CoreOS team. (The question is: which of these is in Quay? I believe Clair is the one in Quay; I'm not sure whether they add features on top of it or not.) The limitation, when you compare Anchore to Clair, is that Clair tends to just give you a report, or at least it did in the past: a report of what was vulnerable, without any intelligence about whether you want to flag something as important to you. Something might be a low-severity vulnerability, and Anchore lets you define a policy that says: if it's low, don't stop the build, keep going, and don't consider the image build failed. With Clair, you have to take its analysis and build your own engine to define that policy yourself. More on the host side, because we do have to worry about vulnerabilities on the hosts we deploy our containers on, there's Vuls.io, which does Linux vulnerability scanning oriented toward the host rather than container images. And there's OpenSCAP as well, which lets you scan for best practices around passwords and similar things; it's built around the NIST SCAP standard. So what Anchore does is let you check packages, as I said: Python packages, RubyGems, binaries, and so on. You can create an allow list of ones that are cached, to speed up the process. And you can analyze Dockerfiles as well.
So you can analyze Dockerfiles for things like exposed ports, running as a privileged user, and using latest versus a pinned version. Anchore helps in the build process: you build the container image, have Anchore scan it, and then you can have Anchore fail the actual build. There are allow lists, deny lists, and policies you can create using the Anchore engine, so it's a pretty powerful tool.

All right, let's talk about why we're all here: the runtime side of things, and how Falco works in conjunction with all these different layers. Falco is a behavioral activity monitor. It detects suspicious activity defined by a set of rules, using our sister project Sysdig's filtering expressions, and I'll go through those in just one second. It has full support for containers and orchestration: we connect into the container runtime and pull metadata out of it, and we connect into the Kubernetes cluster and pull back metadata about what's running inside it. You can then create rules that take into account, say, the container name, the Kubernetes namespace, a particular deployment, application labels, and so forth. You can alert to a variety of destinations: a file, standard out, syslog, calling a program, and HTTP. Probably the easiest way to alert is to set up something like Fluentd as part of your logging process and have these logs processed however you want. And it's open source, so anyone can contribute rules or improvements. We're looking for more help, especially on the rules side, and on the engine side as well; the engine is written in C++, so some C++ experience is useful there, while the rules are written in YAML. Yeah, you had a question: does Falco require Sysdig? It does and it doesn't: we statically compile in what we need, and I'll show you what the overlap is in just one second. Falco is also a CNCF Sandbox project; we've been in the CNCF for about six months now, and it's helped in a lot of ways, especially in seeing what other people in the CNCF are doing and understanding how people are trying to deploy platforms like Kubernetes at scale in their organizations.

So let me give a quick example, a live one rather than a GIF. It's a little washed out, I apologize. I have Minikube running, and I'm going to jump into Minikube. A cool thing is that if you want to play around with Sysdig and you have Minikube installed, Minikube actually ships with Sysdig; the default build has it in there. So all you need to do is run sudo sysdig, and I can see everything that's happening from a process and system call perspective. I can see that sshd, which was process 26192, read from file descriptor number 10, /dev/ptmx, and how many bytes that read returned. So let's see what else we can see with Sysdig. I can also run it in a more interactive mode, kind of like htop: I can see everything that's running and drill down into any one of these components. So let's drill down into the Kubernetes API server.
And then I can see all the threads for the Kubernetes API server as well. If I hit F2, I can go and see whether the Kubernetes API server is listening on any ports, and we can see 8443 and some other ports that it has traffic on. So let's see what else is happening. Why don't I see if I can grab this when it comes back up. Yeah. So now I'm just looking at that particular port, and now I'm actually seeing the raw HTTP traffic going across it. So I'm seeing everything that's happening on the system at a very low level; I can actually see the raw HTTP headers, because this is an unencrypted connection as well. I could also go in, and there's a whole variety of chisels you can run as well. These chisels will, for example, spy on a particular port. So let's spy on, sorry, that port we were looking at earlier. And there we can see we're basically catting out what was actually sent across that particular connection. There are lots of different chisels you can use.

So let's jump into the architecture and how Sysdig works. What happens is we insert a kernel module or, as of late, an eBPF program. What this allows us to do is add tracepoints around the particular system calls that we care about and want to trace. We can then take each event and send it back into our rules engine; that's how Sysdig, sorry, that's how Falco works. On the Sysdig side of things, instead of the libraries taking the event to a rules engine, it goes to the sysdig command I was running, which parses the event out and displays it on my terminal. What the rules engine allows you to do is load up rules, and these rules are based on what are called filter fields.

So if I switch back over here and run sudo sysdig proc.name... and now I'm seeing everything the Kubernetes API server is doing, just by being able to say proc.name equals kube-apiserver. So if I wanted to see what containers are running: here are all the containers running in this environment. So let's pick this container ID. Instead of proc.name, I can say container.id equals that ID, and I should have seen something, and I'm not sure why I don't. What's that? Oh, I picked the pause container. There we go, thank you. So I'm seeing the exact same thing. I could then get more specific: if there were multiple processes running in there, I could add and proc.name equals kube-apiserver. I could also do, say, fd.port equals 8765, and then I would see any system call that manipulated that particular port; that's just another view of what we saw earlier of the requests hitting that port. So we use those filter fields to create rules, and then whenever a rule is violated, we send an output.
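Reconstructing the filter-field expressions from that part of the demo; the container ID here is a made-up placeholder, and echo_fds is the chisel that prints the raw data moving across matching file descriptors:

    sudo sysdig proc.name=kube-apiserver                        # everything the API server process does
    sudo sysdig container.id=504dd2a1b2c3                       # scope to one container (placeholder ID)
    sudo sysdig container.id=504dd2a1b2c3 and proc.name=kube-apiserver   # combine fields with boolean operators
    sudo sysdig proc.name=kube-apiserver and fd.port=8765       # only syscalls that manipulate that port
    sudo sysdig -c echo_fds fd.port=8765                        # chisel: cat out what was sent on that port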
So some example rules. We have event type open: from the system call we take evt.type and check that it's set to open, then we take another piece of the system call, fd.directory, and ask whether that directory is in our list. If we get a match, we fire this output, and from that same system call event we pull out the user.name field and put it into the output, and the same for the command line and also the filename. So we piece together all of that information we saw on the terminal screen, much like you can do with tcpdump, Wireshark, and other tools like that.

Falco can be extended to work with other kinds of events as well. We started with system call events, but what we wanted to add was the ability to pull in Kubernetes audit events too. This is a new feature; I feel like the version number on this slide is wrong, but it's at least in beta in 1.11, so if you want to try this you need a 1.11 cluster or higher. What Kubernetes audit events give you is a chronological set of records documenting changes to the cluster: any time somebody goes to the Kubernetes API server and performs some sort of action, you get a unique entry in your log identifying that activity, as a JSON object per record. Then there's what's called an audit policy, which you apply to your API server to say which things you want logged: you can say you want everything, or pare that down to, say, only these particular ten kinds of activity. And then there are log backends that control where events are sent: there are log files, of course, and you can also use generic webhooks, where the JSON gets posted directly to a webhook you specify. That's how we integrated with the Kubernetes API server: using a webhook. In 1.13, I believe, what are called audit sinks are coming, which let you configure this through the API itself; right now, in 1.11, you have to go and modify the Kubernetes API server configuration to turn these events on.

So let's take a look at what one of these events looks like, and then we can talk about what we did in Falco to differentiate them from system call events and pull out the data we want. We can see that this was a successful request, because we see the response is complete and we have a code of 200. Somebody tried to do a delete; we can see the URI on the API server that they hit; we can see who it was, the username; and we can also see the object they were referring to: they're going to delete a namespace called foo. Then we have the audit ID, a GUID that uniquely identifies the request. And you can also have annotations, so that if you had some backend service making an authorization decision for you, you can see whether the request was allowed or denied by that engine.

So what we did to the Falco internal architecture was embed a web server to consume these events: the API server pushes these events into our embedded web server, and we modified the rule engine to be able to parse them.
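Concretely, the cluster-side wiring just described might look like the following hedged sketch: a minimal audit policy and the kubeconfig-format webhook file that the apiserver is pointed at via --audit-policy-file and --audit-webhook-config-file. On newer clusters the policy API is audit.k8s.io/v1; on 1.11 you'd use v1beta1. The Falco endpoint and port reflect Falco's documented defaults, but treat them as assumptions:

    # audit-policy.yaml
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      - level: RequestResponse              # log request and response bodies for these
        verbs: ["create", "update", "patch", "delete"]
        resources:
          - group: ""
            resources: ["configmaps", "namespaces", "pods"]
      - level: Metadata                     # everything else: metadata only

    # audit-webhook.yaml (kubeconfig format)
    apiVersion: v1
    kind: Config
    clusters:
      - name: falco
        cluster:
          server: http://falco.example.com:8765/k8s_audit   # Falco's embedded web server (assumed default)
    contexts:
      - name: default-context
        context:
          cluster: falco
          user: ""
    current-context: default-context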
And now you can actually use that filter expression language to filter and alert on events coming through your Kubernetes API server. So now we just have two different event types: before, we had only system call events; now we have an event type for system calls and an event type for audit events as well, and the rule engine differentiates between the two and decides which rule set applies to which. So a rule can only apply to system calls or to Kubernetes audit events; you can't use fields from a Kubernetes audit event and fields from a system call event in the same rule, because the event won't have both sets of fields and the engine will actually crash on you.

Yes. The question was what we use for a rules engine. This is actually a rules engine we wrote ourselves. There's some talk of seeing how we can integrate with Open Policy Agent, to see if we can use some of what they do around rules to make our rule language a bit more complete or easier to use, and also to make some of the rule evaluation a bit faster. What is it? No, Flowable; I've never looked at Flowable.

All right, so what we did in the actual Falco code itself is create a generic event interface, a class that exposes at least the event time and lets you extract other values using fields. From that class we inherited and created a Kubernetes audit event object, and we also have a system call event one as well. What this allows us to do is extract the particular JSON we want and expose those values as new filter fields inside the engine itself. And then we use the concept of JSON pointers to extract values and create macros, and I'll show you how we did that real quick.

So this is what a JSON pointer looks like. If I reference the /foo pointer, I get the value 1 back; if I reference /bar/baz, I get true back. And in the actual code, the way you use these, either directly inside the Falco rules or in the C++, is to say jevt.value and then the JSON pointer we had on the previous screen. It gives you a very easy way to get into the JSON and pull the data out as a data structure. So what we ended up doing was implementing a bunch of macros to make it easier to extract values, so you don't have to write jevt.value everywhere: you just reference ka.verb, ka.uri, ka.user.name, and so forth. If you want to see the full list of filter fields you can use with Kubernetes audit events, you can run falco with --list k8s_audit.

All right, so let's look at an example of how an audit rule looks, and then we'll jump into a demo real quick and I'll show you it in action. So we have this macro, sorry, yes, a macro; let me tell you what a macro is real quick, because I haven't talked about that yet. You saw that I could write these filter expressions and chain them together into much larger expressions. In rules, it becomes unwieldy to reference the same thing over and over again; you might need to use this definition of something containing private credentials over and over. So we just create a shortcut by using a macro.
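To make the jevt.value idea concrete, here's a hedged sketch that uses a raw JSON pointer in a condition, the ka.* shortcut fields in the output, and a macro as exactly this kind of shortcut. The rule itself is illustrative, not one that ships with Falco:

    - macro: k8s_delete
      condition: jevt.value[/verb] = delete        # raw JSON pointer into the audit event

    - rule: Namespace Deleted Via API (illustrative)
      desc: Alert whenever a namespace is deleted through the API server
      condition: k8s_delete and jevt.value[/objectRef/resource] = namespaces
      output: namespace deleted (user=%ka.user.name ns=%jevt.value[/objectRef/name] uri=%ka.uri)
      priority: WARNING
      source: k8s_audit                            # apply this rule to audit events, not syscalls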
It also makes the rules a lot easier to read, and if I have to change the condition, I can just change it in one place; I don't have to go and find that big long condition string over and over again in all of my rules. Then we have another macro that defines what a configmap is, which is ka.target.resource equals configmaps. And then we have a macro for what we define as somebody modifying something through the API server: there are a couple of different ways to change something, you can call create, patch, or update, so we create a macro called modify that covers those. And then we have the actual rule itself: the rule name, a description of the rule and what it's trying to do, and then, now that we've defined all these macros, the condition in our rule becomes much, much simpler: if I'm a configmap, and somebody tried to modify me, and I contain private credentials, then trigger this particular rule. Make sense?

All right, so let me jump into a demo real quick and show you how this works. In this particular environment I already have Falco running, and I'm just going to do a kubectl exec. Oh, by the way, if you never knew there was a --previous flag on kubectl logs, it's very useful when you're trying to figure out why a pod is crashing; I learned that the other day. And I'm just going to tail the Falco logs over here. So you notice right here I got two events right away. I ran that kubectl exec, and you can actually see that I got an attach/exec-to-pod alert, with the username, and this one is actually coming from the Kubernetes audit event. So I got an alert there. One bug I noticed: we do need to specify in the output which rule set, or event stream, an alert is coming from. So this one is from the Kubernetes API server, and this one is the actual system call: from a system call perspective we also saw the system calls actually being executed. So the request was made through the Kubernetes API server, it was granted, and then we actually see the system calls executing.

And now if I go over to this pod, I can do things like touch hello, and I can see over there that, sure enough, I get an alert saying I tried to open a directory for writing. And then, let me get out of here, I'm going to create a configmap real quick: that wonderful configmap somebody was going to store their secrets in, which might be a little contrived. But how many people have ever had somebody trying to store private credentials like that? And the thing is, if it's an AWS-specific configuration file for the AWS command line tool or something like that, it's always in a particular format. So while it's not going to catch state actors, it is going to catch Bob from accounting who's trying to store his credentials. I don't know, it's supposed to empower the developer, right? Anyone can do it. So I create this configmap, and sure enough, there's the alert. It took a second, but there's the warning that I created a configmap with private credentials in it. The other nice thing is that you can see we've put all of the data that was in the configmap into the alert output, so at least you know that secret has been leaked and needs to be rotated. Yes, a comment from the audience about catching cryptocurrency miners; that's another use case that you can use this for.
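For reference, a hedged reconstruction of the macros and rule from that demo. The field names follow Falco's Kubernetes audit fields as I understand them, but the exact names and the credential patterns here are illustrative rather than the shipped rule:

    - macro: contains_private_credentials
      condition: >
        (ka.req.configmap.obj contains "aws_access_key_id" or
         ka.req.configmap.obj contains "password")

    - macro: configmap
      condition: ka.target.resource = configmaps

    - macro: modify
      condition: (ka.verb in (create, update, patch))

    - rule: Configmap With Private Credentials (illustrative)
      desc: Detect a configmap created or modified with something credential-shaped in it
      condition: configmap and modify and contains_private_credentials
      output: configmap with private credentials (user=%ka.user.name verb=%ka.verb name=%ka.req.configmap.name)
      priority: WARNING
      source: k8s_audit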
On that note, we have a few blog posts as well about how to catch Bitcoin miners running in your environment. All right, so let's talk a little bit more about how this can be used. One thing that we're a fan of, and that we're going to put more work into doing at a larger scale, is the idea of a response engine with a set of security playbooks around it. The idea is that you detect abnormal events with Falco and publish those events to some sort of pub/sub service. You can use Fluentd and Kafka, and a lot of the cloud vendors' agents as well: you can push the logs off to Stackdriver if you want, or CloudWatch Logs, and so forth. It can also be done with SNS or Google Pub/Sub and whatnot. Then you can have different subscribers subscribing to Falco topics. In the case of NATS, for example, you can say: I want all alerts, or I only want alerts of critical priority. When we publish alerts to the pub/sub service, we also include the name of the rule, so you can have actors that react to particular rules being fired. And the nice thing about this is that you can have multiple subscribers: you only have to publish to one location, and then those subscribers each take action. For example, killing an offending pod when you detect that, all of a sudden, it's trying to connect to a Bitcoin mining pool over a well-known mining port, or it's got stratum+tcp in the command line, which you would never see anywhere else except in mining cryptocurrency. You can taint the node to prevent scheduling, so that all the workloads come off that node, but the node itself sticks around and you can use it for forensics. You can isolate the pod with a network policy, which similarly lets you do forensics on it later. The other thing we saw somebody doing is using Lambdas to push every single event into S3 buckets, so every event gets logged and kept in long-term storage. And then they have a more real-time path where they send events to Elasticsearch, and before they send them to Elasticsearch, they use Lambdas to enrich each event with more metadata about the environment and where the event actually came from.

To give you a more graphical representation of how this works: you have NATS picking up the message published from Falco, and then these pieces are all pluggable. You can use whatever Kubernetes functions framework you want and whatever pub/sub service you want; you just need the ability to have triggers and things like that. And this works with AWS as well; we have a blog post on how to set that up. If you use the Helm chart, there are options you can turn on to very easily push things into AWS SNS. And there's actually a good paper around a similar idea that somebody presented at HotCloud '18, a USENIX workshop, where they talk about using CloudWatch Logs and Lambdas to build a whole event-processing stream.
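One concrete way to get alerts flowing into a pipeline like this is Falco's program output. A minimal falco.yaml sketch; the relay URL here is a made-up stand-in for whatever publishes into NATS, Kafka, or SNS in your setup:

    # falco.yaml (excerpt)
    json_output: true             # emit alerts as JSON so downstream subscribers can parse them
    program_output:
      enabled: true
      keep_alive: false
      program: "curl -s -d @- http://localhost:8080/publish"   # hypothetical relay that publishes to the pub/sub service

Each alert is piped on stdin to the program as a JSON object that includes the rule name, so subscribers can key their reactions, kill the pod, taint the node, apply a network policy, off that field.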
And this is something I think we should look at more: how you begin to write functions to process these event streams and take action, now that we have the ability to run small, very specialized pieces of code. It makes our automation a lot easier, because you get this reactive model: events have action taken on them by small, reusable components that are maybe easier to maintain than some of the automation we've had in the past, especially compared with what we did with runbook automation back in the day. The ability to have these small functions taking action, or notifying you in different ways, can be really useful.

So that's all I have for you. I'm happy to stay around and answer questions. These are some useful links if you want to check out the projects. If you join the Slack channel, you'll probably be interacting primarily with me and some of our other developers; I do a lot of the community work and project management, managing issues, the roadmap, and other things like that. So we're always interested to hear what you have to say, and we're always interested in trying to help with your problems as well. My Friday and Saturday were spent debugging with GDB for the first time in about ten years. With that, thank you very much, and I'll be happy to take questions.

So, Sysdig recently announced support for eBPF; how does that work with Falco and the event structure? It just replaces the kernel module: we get our event stream of system calls from the eBPF programs instead of the kernel module. This allows you to load it in environments like COS, Google's Container-Optimized OS, where you don't have full access to the operating system. Generally, we have one performance problem we're trying to fix right now, and then we'll be able to tell better whether there's a real performance difference between the eBPF probe and the kernel module; initially, we thought it was about a 10% difference. But it's just another way we can create the traces around the functions we want to pull information out of. You still need to be able to compile the probe against the kernel headers, so we don't necessarily get rid of the kernel header problem, but we're starting to publish more precompiled probes and kernel modules for well-known platforms, so you can download one from us if you don't want to build it yourself.

I guess, but on the event side, would the event structure change, or can you use the same rules? No, the libraries just understand that they're using the eBPF probe, based on an environment variable, and so they attempt to use the eBPF probe instead of trying to open the sysdig or falco device.

On one of your previous slides, before you started the Kubernetes part, you had a bunch of dos and don'ts for security. One of them was: don't use docker0 in production. Well, we're violating that; what's the alternative, what should we be doing there? That's a good question. I didn't create that slide, so I'm not 100% sure about the limitations there. I think it's mainly that it gives you too much access to the Docker local network, which allows ARP spoofing and other things like that, so you can impersonate somebody else on that particular network, poison the ARP table, and get access to an application that way. Yeah. What's that? You can push commands to the host. Yes, you can push commands to the host as well. Thank you. Other questions?
Well, thank you very much. And I have some stickers; I forgot to order more before this trip, so if you want one, come up and see me. And just a reminder: if you were planning to come to the remaining session in this track in this room, it's been cancelled because the speaker is ill.