I have to turn this on. There we go. Perfect. All right. Good morning, Dad, if you're listening. One more minute. I heard 100 slides is the world record for a one-hour talk. Wow. And most people, you find them boring, you slow them down; there aren't many people you speed up. There aren't many people that you slow down to listen to at half speed. It is 11:30; I am going to start.

So good morning, everybody, and thank you for choosing to hear this talk about the Linux ABI. In referring to the Linux ABI, people very often quote Linus Torvalds' famous slogan for the kernel: "don't break user space." I'm going to explain where that slogan comes from and what it means, and I'm going to discuss the changes and drift in the Linux ABI, which I would say has some large print and some fine print.

The Linux ABI, the application binary interface, is different from the application programming interface, the API, with which you may be more familiar; we'll talk about the difference. If there's one thing I want you to get from this talk, it's why the stability of the Linux ABI is important: it means that user space and the kernel can be independently updated. If we did not have a stable ABI in Linux, if the ABI changed all the time, then if there were a CVE, a security problem like the ones Eva Blacks was just talking about, where the kernel needed an update, it could be that you'd have to completely reinstall the system. That would not have been good for the adoption and popularity of Linux. So that is the reason that ABI stability, not breaking user space, is important.

Now let's talk first about breaking. This large pile of glass shown at the right of the image was my shower door just minutes before the photo was taken. The kernel can also break if you do an upgrade, and that makes people nervous, so sometimes commercial projects in particular are hesitant to upgrade the kernel. Things that can commonly go wrong are performance regressions, new bugs which are worse than the old bugs to
which we were already accustomed, or perhaps changes in the behavior of the kernel, or the disappearance of features on which we rely. That third category, changes or disappearance of features, is how ABI breaks typically manifest themselves.

So that's a general overview of the talk; let's discuss some of the details. Obviously, I'm going to explain a little more clearly what an application binary interface is and distinguish it from the API. Then I'm going to discuss some examples of ABI breaks that have been in the news in recent years. One is the apocalypse, Y2038; that is, I think, a success story for the kernel and user space and their cooperation. The second one is less happy: it's about priority inversion and real-time Linux, and problems that robotics is having. I work on robotics, so this problem affects me personally in my work. Those first two topics are basically about the large print of the kernel guarantee. The next two topics are more about the fine print. One is writing and maintaining log monitors and reading the kernel dmesg buffer with various tools, and the second one is mounting old file systems: does the ABI stability guarantee
cover metadata for file systems? And finally there's the topic very much in the news recently, discussed with respect to ABI stability in the run-up to the kernel summit: the ABI stability guarantee, or not, for BPF programs. That is currently a hot topic. Then at the end I'll discuss a little bit about how projects — any project, not just the kernel — might design a good software interface to avoid ABI breaks and all this heartache and controversy that we're going to discuss.

But first let's talk more about what an ABI is. An ABI is something that any compiled program has, not just the kernel and not just a C program. The kernel's ABI is obviously affected by the fact that it is a long, large C program. For C programs in general, the function signatures that appear in the header are part of the API, as far as how they're expressed in the header, but the layout of the binary programs is the ABI. So that is the difference between the ABI and the API: I think of the API as the programming interface, as what you see, whereas the ABI is what you actually get when you try to address the kernel with system calls or with user space applications. Besides function signatures, there's the layout of structs and the ordering of items in structs, and also the meaning, range, and ordering of enumerated values, or in fact any other constants. C has the feature that when a program is linked, the linker will choose whatever function in the library has a name matching the one specified by the application. In languages like C++ and Rust that support function overloads, you may get a different symbol, a different function, from the binary if you change the function signature. So in those languages, dealing with different versions of the same function, but with the same name, is a little bit easier than it is in C. I should say at this point that understanding C, or C++, or Rust, heaven knows, is not really necessary for the rest of this talk; all compiled languages have these same
problems.

This view of ABI stability is a rather nerdy, geeky one. An equally valid way to look at ABI stability is to think of it as cooperation among different software projects. In the case of the kernel, we need the kernel community to cooperate with the compiler communities, with user space applications to some extent, and, most importantly for the rest of the talk, with the C library that implements the system calls. System calls, you may recall — since Linux is a Unix operating system — are read, write, open, and close, with which files are addressed, but there are also things like poll and select and mount and umount and clock_gettime, which I'm going to talk about in just a minute. So all these software projects have to cooperate with each other, and it's when people in different software communities don't agree on how to solve a problem that we get some hard feelings.

With that, I can't give this talk without telling you what Linus actually said and showing you his warranty for the kernel, which is presented in one of its more safe-for-work versions here. There are many instances of this rant, in various forms over the years, that you can find online. Fundamentally, Linus says that not breaking user space is the number one rule of the kernel community and everyone should know it. Then he goes on to say that he needs to know if people break the rule so that, essentially, he can bash them and cast them out from paradise. I should say that there are hyperlinks to all the resources in the slides here, and the slides are linked from the abstract at the SCALE website. So this is very serious.
There is a footnote down here at the bottom: terms and conditions may apply to the kernel warranty, like any good warranty, and those terms and conditions, exceptions, and fine print on further pages are the subject of the rest of the talk.

With that, you may still be a little puzzled as to exactly what an ABI is, so let's dive into an example that has been in the news, is important, and is, I think, easy to understand. Everyone knows how an odometer works, and everyone understands the concept of an odometer rolling over: if this vehicle is driven one more mile, the odometer is going to go from all nines to all zeros. This is a situation where, in terms of the odometer, the needed ABI break would be to add another digit; otherwise it would look like the car was new. So it is with the apocalypse, the Y2038 problem, where 32-bit time is about to roll over: if you don't upgrade your software so that you are using 64-bit time, you will be rocketed back to January 1st, 1970, which is the zero time for Linux. So everyone agrees that an ABI break is needed.

2038 is a long time from now, so why do we care? The answer is IoT. You can easily see, if you think of these kinds of devices — many of which are now running Linux, many of which may have 32-bit processors — that they may well be installed in 2022 and not be replaced, and then there's the other question of whether their software will be upgraded between now and 2038. So the kernel community was very motivated to get a solution out there so that newly installed devices could be running 64-bit time. In fact, there is some good news in that regard, as reported here by Phoronix: glibc and the kernel community — so the system call implementation and the kernel entry points it addresses — are ready for 64-bit time, and you don't have to have 32-bit time anywhere in your code if you're using Linux. So that seems like great news.
We have 16 years left; all is well. Except all is not well if you're a Linux distribution that supports 32-bit architectures, and here's the reason why. If you think about it, if you started shipping 64-bit-time packages now, and you're CentOS or Fedora or SUSE or Debian, then your 32-bit-architecture customers would have to recompile their user space applications to use those new libc libraries, and a lot of people wouldn't be very happy. While they should fix this problem in 2022, maybe they will want to fix it in 2037 instead; you can't make people stop procrastinating. So the distributions could offer, going forward, both 32-bit-time and 64-bit-time packages for 32-bit architectures, or they could do what Debian apparently has tentatively planned, which is make people reinstall, and have a 64-bit-time version of, for example, i386, called i386t. So Debian's recommended way of dealing with this problem will be to upgrade your system from i386 to i386t, which is going to require reinstallation. This is really where the rubber meets the road with ABI breaks; this is why the kernel and glibc try to avoid them. But here's an example — think of the odometer — where there was no way to avoid an ABI break. So while the situation with 64-bit time is a little painful, I think people have done a good job of getting ready for it.

Now I'm going to talk about another situation where an ABI break is needed and has not happened, and has had — from the point of view of those of us who work in automation and control of real-time systems — not a happy ending, at least up to this point. I call this section of the talk "Servers 1, Robots 0," because the distros haven't had to have an ABI break, but the robotics community has some problems. So let's dive into that and see why. To explain what's going on,
I'm going to give you a two-minute explanation of real-time Linux. Why do we have real-time Linux? Real-time Linux is about low latency and predictability for control systems. It is not about speed, which is a common misconception. To explain real-time Linux and the need for it, I'm going to talk about a robotic system that has three threads: a high-priority thread, a medium-priority thread, and a low-priority thread. We have here an x-y plot with priority on the y-axis and time on the x-axis. The problems arise when the low-priority thread and the high-priority thread — which, say, might be operating a scalpel in a medical robot, or a machine tool, or a motor — need to write or read the same critical data section, and so they share a lock. If the low-priority thread takes the lock and then gets preempted, and the high-priority thread starts to run, the high-priority thread will have to stop and give up the processor when it hits the section where it needs the lock. That doesn't sound so terrible if the low-priority thread finishes its work and releases the lock. However, on a non-real-time system we have what's called priority inversion, where a medium-priority thread can take the processor while the high-priority thread needs the lock, and it may run for minutes — it may run for an unbounded period of time. That is the opposite of predictability, and if the surgical tool is not getting the processor because disk defragmentation is running, that is a bad situation. So that was non-real-time Linux. In real-time Linux we have a fix for that. It is called priority inheritance.
It is the simple idea that when the high-priority thread needs the lock and can't proceed, it lends its priority to the low-priority thread, which finishes its work and releases the lock, and meanwhile the medium-priority thread just has to wait until these two other threads or processes in user space are finished. So clearly we need both user space threads and kernel threads to have this priority-inheritance behavior, and the difficulty that I'm going to talk about is described by this bug report, which says that glibc — the GNU C library, which is the most used C library with Linux — does not actually honor priority inheritance in its locking. That is exactly the problem that causes priority inversion. What's remarkable about this bug is that you can see it was reported in 2010, I took this screenshot last month, and this bug is still open twelve years later. How could this bug be open for twelve years? Why would that possibly happen? And, you know, it's not marked "won't fix" or closed or invalid or something;
it's still marked "new." The reason this problem — which people acknowledge, and for which there are reproducers that in fact are attached to the bug ticket if you follow this link — persists, the reason for this unhappy situation, is that fixing it would require an ABI break. In fact, it would require changing the function signature of a system call. If I ran the world, we would have fixed this problem at the same time as Y2038, when we're already making everybody recompile, but alas, that did not happen. I'm not going to go into the details of what is wrong, but let's just say that if you're using, say, libstdc++, the implementation of locks under the hood can easily have the problem described in this bug report.

So who knows what the solution for this problem is? Let's say we have an open source library and it doesn't have a feature we need. What could we possibly do about that? A fork? Indeed it is. In fact there is a fork, because the situation wasn't considered tolerable, and the fork is called librtpi. It's a fine product that I believe originally came from VMware and IBM, and it's now also supported by National Instruments. librtpi is just a pthread library — a concurrency library — that is the same as NPTL, the pthread library inside glibc, except that it solves this problem. So it's very easy for people to adopt this pthread library and use it. Or, if you don't want to use glibc plus librtpi, you could use an entirely different C library: in fact the musl C library already supports priority-inheritance-honoring locks. So this is a situation where it's a little extra trouble for the robotics community to adopt an alternate C library, but with open source, at least we can solve our own problems.

I'm now at the halfway point of the talk, and I'm going to switch gears a little bit now.
I've talked about two problems, Y2038 and priority-inheritance-honoring locks, that are both classic ABI-break problems in that we're referring to changes in the behavior of system calls. However, there is more. System calls in Linux address the kernel, as I said previously, as do sysfs and procfs. But Linux exposes all kinds of other artifacts to user space which are not sysfs and procfs, and which can be addressed by programs you write yourself — programs which are, you know, not using system calls, possibly beyond open, close, read, and write. So the question is: which of these other things that the kernel exposes to user space are part of the stable ABI? Not everyone agrees. This is the fine print, on pages four and five in eight-point font, of the kernel ABI guarantee, and you won't be surprised to hear that individuals don't necessarily agree about it. The answer to these questions is not necessarily known in all cases, I would say, or it changes over time in less obvious ways than changes to sysfs and procfs, which everybody knows about. So let's talk about the first example.
Let's talk about the dmesg buffer and files in /dev. First I want to clarify a principle that applies to the fine print in general. If you consider ABI breaks as a board game, and an aggrieved user says "you broke my ABI," it's an open-and-shut case in general if sysfs or procfs has changed. If what changed is not one of the parts of sysfs excluded from the ABI guarantee, it is a regression, the patch will be reverted, and you win. That happens because you are a user, a Linux consumer who wants the system to just work; you aren't trying to tune it or configure it, you just want things to function as you anticipate. The large print says that if the change is not visible in sysfs or procfs, then in fact it's not part of the stable ABI — and if you're consuming one of the other things in the list, you're probably a developer, and whether the user space guarantee covers developers, it turns out, is what the fine print is about. So the first place that I'm going to mention where this comes up — yes? Okay, do you want me to give you the microphone? Is it on? Okay. Could you go back a slide, please? Yes — that "no, you lose."
[Audience member] No, you could win. PowerTOP was using debugfs, and a change I made broke it, and Linus and I had a huge argument about it. But yeah, my change got reverted, and the answer was that I had to go change PowerTOP to do it the new way. So the "yes, you win" is definitely true, but for "no, you lose" — mostly it matters what the tool was.

[Speaker] Well, that is the perfect introduction to what I'm talking about next. Actually, while we've paused here, and since we're at the halfway point more or less, does anybody else have a question they want to ask? Is there an acronym I've used that someone doesn't understand, or some other special jargon? No? Bill?

[Audience member] I've had some people tout the virtues of versioning in libraries and such. Is there anything you would have to say about that?

[Speaker] Sure. Versioning in libraries obviously helps, but it's not true that you can update the kernel and user space independently in that case, right? You might need to change user space too, and the goal of the kernel as a project really is to make them completely independent.

[Audience member] I mean, you get a new version of linux-libc-dev very often when you install the kernel, and that, you know, has the headers and so forth in it.

[Speaker] Yeah. Well, just to revisit this previous slide again for a moment: the other thing you hear is that you can just recompile when you get the new headers. But if the meaning of the constants in the headers changed, that's not going to help you; if the format of the device file changed, or if it's now, you know, mode 600 and it used to be 644, recompiling is not going to help you. The really hard ABI problems are the ones I'm going to talk about now, where recompiling is not a solution; the first half of the talk was about the cases where it was a solution.

With that, let's talk about dmesg, which is a utility a lot of people use to read the kernel's own log. Under the hood, dmesg reads a file called /dev/kmsg. /dev/kmsg
is a real binary file in the sense that it has what look like strings in it, but they're not all terminated, and a lot of people write their own log monitors to read /dev/kmsg. That's where they may run into ABI problems. So here is an example of a user who had a problem with the ABI — well, no, I'm sorry, here's the addition of the feature that caused an ABI problem. A developer proposed: why don't we make the conversations in dmesg threaded? Why don't we order the messages by which thread contributed them? Threading email certainly makes it easier to understand, and it would be easier for log monitors to parse some messages if they were ordered this way. The problem, of course, is that it is necessary to add another parameter — the thread ID — to the /dev/kmsg records in order to support threading, and guess what: adding another parameter to a comma-separated list in a binary file is an ABI break. And so someone naturally complained. The people who complained were from the GDB community: GDB used to be able to read the dmesg buffer, and something broke in Python when this parameter was added, and so the user was upset. This person was complaining that the problem was occurring in scripts/gdb. That is critical for understanding the situation: although GDB is a developer tool, somebody using GDB is debugging the kernel and isn't hoping for it to just work because they want to play their game or listen to their music. scripts/gdb is a reference to the kernel's own source tree, so the program that was broken by the ABI change is "in-tree," in the kernel's own parlance. This is important because, while the kernel community feels free to change the internal APIs on every release, and does change them, people aren't crazy, right?
I mean, they don't want one script in one part of the kernel's own source to stop working with the rest of the kernel. Is PowerTOP in-tree? No, because it's Intel-specific — oh, is it? Okay, somebody in the audience points out that PowerTOP is cross-platform, very good. So anyhow, the kernel community isn't crazy: obviously, if GDB is broken and a GDB script that is in-tree no longer works with dmesg, everyone agrees that this is a problem. And the problem was fixed, and GDB went on working. This case shows a little nuance in the fine print: even if the regression you're complaining about is not part of normal user activity, if the code that's broken is in the kernel source tree, then people agree that you have a grievance and your problem should be addressed.

However, there's yet more nuance — I'm probably on page 11 of the kernel warranty now — in the case of file system metadata, which I am going to turn to next. If there's any kind of stability guarantee that people really care about, in general they want hardware that they have been using to continue working, but in addition they would certainly like old backups, on disks or USB sticks with important files on them, to continue mounting when they are attached to a Linux system. They don't want a kernel upgrade to make it so that file systems no longer mount. And there was, I guess a couple of years ago now, a discussion when a developer posted to the mailing list and said: hey, my file system no longer mounts, and, you know, I was under the impression that I was covered by the warranty. It certainly seems like mounting file systems is ordinary user behavior — I am a user, fix my problem. Or actually, to be honest, he said "here is a fix to my problem," since he was secretly also a developer. However, the maintainer of the file system said that, you know, the ABI stability
guarantee does not cover all the metadata that the kernel happens to accept now. Relying on metadata that is not specifically part of the ABI guarantee can result in undefined behavior, which everyone fears and dreads. In fact, the problem arose because the developer had written his own file system creation tool. He did something slightly different with the metadata, and the kernel mounted it: fundamentally, he compressed the superblock so it took up less space, for those of you who know that kind of detail. The maintainer responded that the ABI guarantee covers file systems which are generated using official tools. So this is yet a new wrinkle. I would guess that PowerTOP at some level is an official tool, and other official tools might be things like e2fsprogs, which has utilities like mkfs.ext4, or the btrfs or ZFS utilities — but consider that these programs are not part of the Linux kernel source tree; they are in separate packages. The developer, who was not very happy, pointed out that it's a little bit breathtaking to say that maybe FAT file systems created by Windows machines are not covered by the ABI guarantee, because they're not created by one of these official Linux tools. So there was quite some unhappiness in this Linux kernel mailing list discussion thread. The new picture now of the communities cooperating is that the communities making these official, anointed Linux tools are cooperating with the libc mount system call maintainers, and when those two entities are working together, the kernel ABI guarantee is preserved. That's the position that the maintainer took. So this shows an example where people were even more unhappy than the robotics community, I would say — particularly if their customers had a lot of disks that wouldn't mount; I don't know the details. So we've decided that dmesg, device files, and file system metadata are kind of a gray area; PowerTOP probably also is in the same category. Let's go on and talk about valid BPF programs.
Let's go on and talk about valid BPF programs This is a very a current topic BPF as a lot of people are probably already aware in this audience is a It's a system call. It's also a virtual machine inside the kernel that will Execute programs on behalf of a user It's a mechanism both to read data out from the running kernel and also to do simple control functions The original purpose of BPF was to harden the kernel against denial of service attacks So that user space could present to the operating system filters for packets that should be dropped It's been wildly expanded to have all kinds of purposes for both debugging and control and the controversy arose because It sure seems like BPF is a developer tool of the kind that wouldn't necessarily be covered by the ABI guarantee and So In fact the kernel documentation basically says that BPF is a system call. So obviously it has the same system call guarantees that other system calls It has the same stability guarantees that other system calls have but Because BPF can interact with vast parts of the kernel internals that are no way Possibly part of the ABI guarantee then it's clear that you can have BPF programs that are that are not covered And The controversy that arose Was in the mailing list leading up to the kernel summit Developer proposed the idea of essentially having a you dev rule for a human interface device That would rely on BPF You dev rules a lot of people probably know are now a part of the system deep project They are responsible for helping the kernel recognize hotplugged attached hardware Looking up device drivers and setting configuration from user space And human interface devices are mice and keyboards and joysticks and like so human interface devices are clearly used by ordinary users and Ordinary users therefore relying on you dev rules although I'm surely they are unaware of it But they would be come aware of you dev rules Indirectly if all of a sudden they stopped working and they Plugged in 
their mouse, or attached it over Bluetooth, and it didn't work. That is exactly the kind of ABI break that Linux would like to avoid, and so the conversation about this topic was very interesting. The developer who wanted to make a udev rule that relied on BPF made the point — which seems very reasonable — that it would be inappropriate to change a device driver in the kernel by submitting a patch if you merely wanted to change the backlight color on the keyboard or the button assignment of the mouse; that's kind of a heavy lift. So the origin of his desire to make a udev rule depend on BPF, since BPF is a great solution to this problem, was clear. But if you think back to the diagrams with the handshakes among the projects: he's creating work for kernel maintainers, who would now have the responsibility of making sure they don't change an internal so that BPF stops working and the mouse stops functioning. He's creating a lot of work for people. And so the position that was taken on the mailing list harks back to the problem that GDB had with /dev/kmsg, which is basically that people suggested the BPF invocations that were in udev rules be in-tree, in the kernel source tree. That at least gives people the opportunity to have tests that will break if the kernel internals change so that the udev rule no longer works.
So this would be putting BPF scripts, for the first time, on the same footing as the scripts used by GDB, but it would be a bit of a change in the way the kernel handles BPF. Since this controversy was going on during Open Source Summit, Linus himself was asked by Dirk Hohndel about BPF programs and ABI stability, and Linus said, basically, that ABI stability is guaranteed whether BPF is used or not. But of course he's kind of talking about page one of the kernel's warranty and not all the messy details, so stay tuned as we move together and try to figure this one out and get a happy ending for everybody.

I'm going to cover the last topic very quickly, which is how to design software to avoid ABI breaks. As with many of the rest of the comments, the solutions here are not particular to the kernel or to C programs, but are just good design ideas in general — or maybe bad design ideas, depending on who you ask. One method the kernel community uses to avoid ABI breaks is that we have a couple of system calls that are very generic. You don't need to know what prctl does to see that a system call that takes all these unsigned long arg2, arg3, arg4, arg5 arguments is not really focused on a very specific purpose.
This is pretty general, and of course in C programs we're free to cast enumerated values to unsigned long, no problem. If you don't find prctl generic enough, you can use ioctl, which also takes an unsigned long request and is varargs to boot. So you can do a lot of things with these, and if you don't like the existing system calls, at some level you can kind of roll your own with these very generic interfaces. Whether this is a good example of interface design or not — arguably not; not everyone is a fan.

A second case that is definitely a good ABI design principle is the principle of unused arguments: system calls whose arguments no longer have any function don't need to remove the parameter from the function signature. It's perhaps a little misleading to have a parameter that has no effect, but it doesn't really hurt anything, and it's surely better than making people recompile and reinstall.

Another interesting recent development, at least to my awareness, is intentionally putting reserved-for-future-use parameters in new system calls. People are often putting in flags parameters that are essentially bit masks. It has seemed like a lot of security problems have needed new flags for their solution, and people are thinking ahead that the same could happen with new system calls.

A kind of amusing one is system calls that have multiple versions. It wasn't an ABI break, but maybe the old version isn't really usable anymore and is sort of deprecated: clone has clone, clone2, and clone3 — hopefully with clone3 we've gotten it right now. And then, finally, glibc, the most used C library, has simply not supported some system calls — just throwing up their hands, perhaps, or maybe nobody has stepped forward to write an implementation.

The most interesting case, I think, is that if you read kernel mailing lists, you find people discussing ABI stability when they're considering the design of new features. This hyperlink up here at the top is to a discussion where one developer said that clearly the most convenient, logical way to make the new feature accessible to user space is to put it in procfs or sysfs, and the other developer said: we can't do that, because we don't know how people are going to use it, we don't understand what the interface would be, and if we put it in procfs or sysfs we're committing to maintaining it and maintaining backward compatibility. This really shows you the hard decisions people are making, because, you know, maybe if it's too hard to use the new feature because it's not in procfs or sysfs, no one will use it. There are no good solutions here.

With that, we're hitting the summary. Let me just say that the Linux kernel ABI boundary has very clear, well-defined parts, and other parts that are a little more puzzling and controversial. Some ABI breaks just cannot be avoided, like with 64-bit time; everybody agrees about those. In the case of robotics and priority inversion, we didn't have an ABI break, and maybe we should have had one when 64-bit time came in. The situation for developers and for ordinary users with ABI stability is different, and perhaps a little hard to understand; certainly if you're writing applications that touch parts of kernel metadata that are not in sysfs or procfs, you should think about these points.

Oh — I'd like to thank a colleague, Sarah Newman, who will be presenting about live patching in production later today, in this very room at 6 p.m.
So consider hearing that talk. And then finally, I'll say that my employer is still hiring and has maybe a hundred job openings for software engineers. They're all remote, all the time, and they're for things like JavaScript and Golang as well as C and C++; you can find them at a road tech dot slash careers, or talk to me. Thanks for your attention. I'll run around and give anybody the microphone if they are not too stunned to ask a question.

The statement in the earlier part about Linus talking about not breaking the ABI kind of alludes to this, but besides review of code, are there any mechanisms the kernel employs to ensure that ABI breakage doesn't occur, like automated ones?

The kernel does have more and more tests every year, which is great. I'll see who is going to respond.

Believe it or not, there is not a stable ABI. That's not the rule. Rule number one is not "there's a stable ABI"; the rule is "don't break user space." It's the whole tree-falls-in-the-forest thing: if an ABI breaks and no one notices, did it break? No, it didn't. So in fact one of the solutions is that you could change all the users first: keep the old way and the new way, and go and change all the users. You could do that too. So you can't really say there's a test to make sure there's no ABI breakage; it's basically when someone complains. So again, if people think that you can't change an ABI: yes, you can, we do it all the time. We just don't break user space.

Right, but there are many tests. I'll talk on both microphones. But this break to GDB was unintentional; the people who approved it just didn't think of it. (You won't be on YouTube if you just speak; you can't speak that loudly.) So I think that one of the questions from what this gentleman said is that if I was to do some work
and I want to make sure that I haven't broken things before handing it over, so I don't get yelled at for doing stupid stuff, then is there a way for me to do that testing?

Yeah. You certainly can't think of all user applications. (Oh, it wants me to enter my SSH password. Fascinating. It only popped up now; that is so great.) You certainly can contribute tests to the kernel, or to glibc. As I pointed out, the problem with priority inheritance and mutexes was confirmed by the fact that the bug reporter actually posted a reproducer. That's why that bug, while not fixed, is not closed either. Okay. Anybody else? Well, thank you all very much for attending.

Okay, I think I can see that the audio is now working as well, so that's good. For me this is the first event post-pandemic; it's kind of good to see people's faces for real. Is that the same for you guys? Have you been to other events? No? Nobody from the event seems to be here, so I guess I'm just supposed to start when the time arrives. Okay, I've still got one minute. Every time I stand at this podium I tell myself: okay, next time I'm going to remember some joke to tell in front of the audience. And I keep forgetting, and then I'm just waiting for the clock. One time the video setup really didn't work; that's why I was fiddling with it at the last moment. Well, thanks to help I was able to fix it, but on one occasion they couldn't, so the staff all came out and tried to fix things, and there was a much bigger audience, and I had to fill the space for a good 15 minutes. So I talked about the Jenkins origin stories. And one time I had this recurring nightmare where I show up at the podium and I do not have slides; I forgot to prepare the slides in time. Today I have slides.
So that's good. When I was telling somebody about that story, they said: oh, if you really find yourself in that situation, you should talk about the origin story, because that keeps the time. So guess what, it worked. All right, I think it's time, so I'll just get going. The microphone is on, so it should be okay; I guess it's probably recording me. Hello.

Okay, so today I wanted to talk about flaky tests. Actually, how many of you are developers? Okay, so pretty much everybody. You've probably already heard of flakiness, but according to Google, or at least a blog post written by people at Google, a whopping one in seven tests have some level of flakiness. Since you're all developers, this is probably not news to you, right? I have dealt with flaky tests myself in my involvement with the Jenkins project, and I'm sure you do the same in your day job, or else you probably wouldn't be here on a Saturday afternoon.

Nonetheless, it's good to start with why flaky tests are a problem. I think the most important reason, perhaps, is that they create an erosion of trust in these automations. In my high school, the fire alarm system was broken. From time to time it would just go off for no particular reason, and for the teachers and the students it was: oh, it's just another broken fire alarm, ignore it and move on as normal. But if that had actually been a fire, then you'd have seen, you know, 250 dead high school students, because the fire alarm was ignored. The erosion of trust in the CI system has a similar effect, right?
If you have too many flaky tests, what's going to happen is that people start ignoring test failures. That's okay most of the time, until it's not okay, and then you have a large fire. Fortunately it's probably not going to kill people, but it does, you know, embarrass the company in front of the customers.

Another reason flakiness is a real problem, especially for the whole process, is that it creates really annoying, unnecessary work. I'm sure you have done this: you commit the code before you head off home for the evening, expecting that the CI processes will kick in, validate things, merge, and all the rest. Except they didn't; a flaky failure happened. So later in the night you have to log back in over VPN only to press that retry button again. If you're lucky, that goes through, but if it doesn't, then you're in an even harder world of pain. I'm sure every one of us here has burned some midnight oil like that.

And then the other problem with flakiness is: well, is that really just a flaky test problem, or is it actually observing a problem in the production system? I'm sure you have highly concurrent code with some race condition that really only happens once in a while, and your tests might actually be catching those problems. You think it's flakiness, a problem in the test, but if you just retry until you succeed, well, the next time the failure happens it could be in the production system trying to process some important transactions, and that's the last moment you want to see it. So this fear of an actual problem slipping into production is, I think, another reason flakiness is bad.

So, as we established earlier, everybody has this problem; even Google and their highly paid engineers can't escape from it. So what are we doing about it?
That, I think, is a question worth asking, and since I work in this space, developer productivity (I used to work on the Jenkins project, and now I'm a lot more focused on using data to drive developer productivity), I decided to look at what we could actually be doing as an industry.

So one thing the Google folks do great is that they're very data-oriented. In the same blog post, they correlate a sort of test size with the likelihood that the test contains flakiness, and as you can see, it's got a really nice linear correlation. In some sense this is not a surprise, right? When they say tests that occupy a lot of memory, they're talking about the big integration tests and end-to-end tests, that kind of thing, and we know without anybody telling us that those tend to be flaky. But nonetheless, when you can quantify things like that, it's really helpful, because we the developers know, but your managers do not know. I'm sure we always struggle: when we say, wow, we need to fix flakiness, your manager balks: you're going to spend 50% of your time working on this? That's time you're not spending on delivering features. So they have a hard time saying yes. But numbers have a way of transcending professions and specialties, so that tends to work very well.

The other thing that some other people have discovered about flakiness is that there are some well-known causes. One, apparently, is time-related. I think this was reported by the GitHub folks, and this is actually one pattern that I hadn't seen before, but they said that around the end of the month or the beginning of the month, or on some weird cycles, flakiness tends to happen. So that was the first one. The second one, I
think, is at least for me a lot more common: the test unintentionally modifies the global state, essentially altering the state of the test system, so the tests that run afterwards sort of trip over that state change. In other words, the tests don't clean up after themselves. I'm sure you've had the experience where, when you swap the test order, tests just start failing or passing.

The other thing we know is that flakiness is very unevenly distributed. The first number I mentioned, which Google actually stated, is that one in seven tests have some flakiness, but the same blog post also says only 1.5 percent of test runs are flaky, and I'm like: okay, how does that make sense? Is it one in seven or one in a hundred? Make up your mind. Well, it turns out what they mean is 1.5 percent of test runs, right? So some tests fail because of flakiness all the time, others not so often, so collectively they only fail about 1.5 percent of the time. GitHub reported the same thing in a slightly different and more sensible way. They said that, in a given time frame, about 24 percent of tests have been seen failing because of flakiness more than once, while only about 0.3 percent of tests showed failures caused by flakiness more than a hundred times. So that's what's really going on, and it is both good news and bad news. It's good news because you can focus on that really tiny number of tests: if you can solve the flakiness in those 0.3 percent, then you can make a huge impact on the perceived flakiness that your developers see. But at the same time it's bad news, because we can't fix all that flakiness.
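The test-order pollution described a moment ago can be reproduced in a tiny sketch. Everything here is invented for illustration: the first test leaks a change into shared module-level state, and a second test silently depends on the default value, so swapping the order flips its result.

```python
# Invented example of a "polluting" test: the first test mutates shared
# module-level state, and the second depends on the default value.
CONFIG = {"mode": "default"}          # shared global state (the hazard)

def test_enables_fast_mode():
    CONFIG["mode"] = "fast"           # mutation leaks out of the test
    assert CONFIG["mode"] == "fast"

def test_assumes_default_mode():
    # Passes only when it runs before test_enables_fast_mode; swap the
    # order and it flips, which is exactly the symptom described above.
    return CONFIG["mode"] == "default"

ok_before = test_assumes_default_mode()   # runs first: sees the default
test_enables_fast_mode()
ok_after = test_assumes_default_mode()    # runs after: trips on leaked state
```

Run in one order, the dependent test passes; after the mutation it fails. Same test, different result, with no code change anywhere.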
You're not going to fix every one of those one-in-seven tests. So what this tells us is that we really should be intelligent about figuring out where to spend the effort, and try to contain the flakiness rather than eliminate it. Containing is good; eliminating is a losing game, right?

So given those things that we know about test flakiness, what have we been doing collectively as an industry? Well, it turns out a lot has been done, and the good thing is that engineering teams like to talk about these things. So without too much work I was able to discover all these different approaches taken by different engineering teams and blogged about publicly, which is wonderful. It's kind of designed to make you feel ashamed, because the whole point of these people talking about it is to brag about how awesome their engineering teams are, which makes you feel like: okay, I'm not even there. But nonetheless it's quite instructive, quite educational.

When you read about all these things, you start to get a big picture of what's going on, and I sort of classified it into three different pillars. It's like the three legs of one stool: all three need to work together for the solution to make sense. When I talk to the actual practitioners, the engineers in the depths who are struggling with flakiness, I feel like they're often trying to rely on just one leg of the stool, trying to make the chair stay stable, which is a hopeless game.

So what are those three pillars? The first is what I call keeping builds green. At the end of the day, our job is to deliver software, so what flakiness does is almost like friction in your engine, right?
It creates some heat from time to time, so the number one goal is to keep the engine cool enough to operate, so that the software keeps getting delivered. Keeping builds green means making sure that the occasional flakiness doesn't suddenly stop the process. That's obviously the most important thing, and it's probably how everybody enters this space.

The second pillar is measuring flakiness. People often measure flakiness by how often the engine stops, but if the goal is to not stop the engine, then that's obviously a bad way to do it. So as you try to keep the engine moving, you need a different way to measure flakiness, and that's another key effort that these engineering teams take on.

And the third is this: if you're just measuring the flakiness and not doing anything about it, you might as well not measure it at all. Why care, right? And as we saw earlier, flakiness is unevenly distributed, so if you can focus people's attention on the right part, and also focus your manager's attention on this being a problem worth solving, then you've kind of already won the battle. That's what the teams are doing; I call it escalating flakiness to human attention in the right place.

So let's look at the first one, keeping builds green, a little more carefully. Not surprisingly, and I'm sure you do this too, the number one most common way people keep the builds green is to just retry. (Okay, yeah, the monitor is also retrying. What have I done? At least it's retrying at the right place.)
I'm just not going to even touch it. Okay, so yeah, it just kept retrying until it passed. Interesting.

The thing about this is that retrying happens at almost every layer of the tools we use: the build tools have retry logic, Jenkins has its own retry, the test runner has its own retry logic, and so on and so forth. I call this the different sizes of the retry loop. Different solutions create different sizes of retry loop. The smallest one you can think of is probably the test runner: say you spend five milliseconds running a test and it fails; immediately, the test runner can rerun the same test in the same place, on the same computer. But if you're using something like Jenkins, then the failure bubbles up and the retry loop is much bigger. Your developers get to see that the test failed, and then they schedule a rerun; it will probably run on a different computer at a different time, and as we know, sometimes that difference is what fixes, or masks, the flakiness. So you kind of need to combine these solutions.

Some of the teams went much further to tackle this problem. GitHub and Dropbox came up with ways of messing around with the system time before they rerun the tests, which actually surprised me, because even in containers you can't really change the system time; but they figured out how to do it cleverly. Because, if you believe what they reported, some classes of flakiness happen because of time-related problems, so one way to shake things up is to change to a different time. Or they simply move the work to a different VM, etc.
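The innermost retry loop described above, a runner rerunning a failing test in place, might look like this sketch. Both the runner and the flaky "test" are invented for illustration:

```python
import random

# Invented sketch of a test runner's retry loop: rerun a failing test up
# to `retries` times and report success if any attempt passes, which is
# exactly how retries can mask flakiness from whoever reads the results.
def run_with_retries(test, retries=3):
    for attempt in range(1, retries + 1):
        if test():
            return ("pass", attempt)
    return ("fail", retries)

random.seed(42)                          # deterministic for the demo
flaky = lambda: random.random() < 0.5    # "test" that passes about half the time
status, attempts = run_with_retries(flaky, retries=5)
```

Note the cost that comes with this: with a retry count of 10 and a five-minute integration test, a legitimate failure burns 50 minutes of reruns before anyone sees it.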
So those things have been done. There is also a difference in whether you want the failures to be transparent to the developers, which is good because it reduces noise, or whether you want them to see the problem, which is good because they get to see that there is flakiness in the test, that there's a problem that needs work.

Now, the downside of retries is, first, that they mask the flakiness. This is usually the number one reason people hesitate to use logic like retries to hide flakiness: if you're not measuring it, people incorrectly assume that the tests are not flaky, and they let the problem linger. But the other key problem is that retries delay feedback, because the retry count has a way of growing. Nobody goes in and says: okay, right now the retry count is set to five, but I'm going to reduce it to three, because that's the right thing to do. Nobody ever does that. What happens instead is: okay, this test keeps failing, but I know it's flaky, so I'll just increase five to seven so that next time it passes. Pretty soon, with everybody doing that, the retry number grows higher than it should be, and then a legitimate failure happens. That's it.
Say the retry count is 10, and you have an integration test that takes five minutes to run, and it fails for a legitimate reason: it has to spend a whopping 50 minutes before the developer gets to see the failure. People of course complain that the test takes too long, but that's because it keeps retrying. So that is the cost of the retries.

So what else are people doing to keep the build green? Well, guess what: some teams are basically just ignoring the flaky tests. This might feel wrong to you, but I have actually come to appreciate what they're doing here. Teams like Dropbox and Google report that they run the tests, they don't retry, they let the failure happen, and they basically ignore those failures, especially pre-merge, where things are most visible to the developers. So they let the code merge even if a test that's known to be flaky fails. Post-merge, they're still running all these tests, but those runs are further away from the developers, and they're owned by the DevOps engineers, by the people who are, let's say, more tolerant of the flakiness, or better equipped to handle it. They take a more statistical approach to the failures, so they can decide whether the level of failures they're observing is okay or not.

Another point worth mentioning here: because I was involved in the Jenkins project, the people I often talk to are DevOps engineers, and usually these are the people who care about flakiness. But they often do not see what developers are seeing pre-merge, because pre-merge, the developers keep retrying until the test result is okay, and then it merges; the delivery that is the DevOps engineers' domain usually happens afterward. So the DevOps engineers
often lack an appreciation of just how painful this flakiness is for the developers, and it creates a conflict within the team that's often not great.

I guess I was speaking about this psychological barrier: people's instinctive reaction that, no, I don't want to be ignoring flaky test failures; that feels bad; they should be worked on, not ignored. Which is quite understandable, but for all the reasons and all the numbers I shared, I have come to think that it's probably not practical to try to eradicate flakiness. The thinking goes something like this: if you put developers closer to the fire, meaning the flaky test failures, then the pain is good for them, and that's what motivates them to work on it, right? You know, try that on your kids: pain is a great motivator, right? It just does not work very well, in part because the developers aren't monolithic; it's not a single person. What often happens is that when Bob is trying to commit his code, the flaky failure he sees was created by another developer, who could have left the company a long time ago; it's a third party. So this idea that making the pain more visible makes people more incentivized to work on it, in my mind, just does not work very well. So anyway, that's about keeping builds green.

The next pillar I wanted to cover is measuring flakiness. What most of the teams I talk to are doing is what I call tribal knowledge. The way they measure flakiness is by asking the engineer who has been on the team the longest; especially when they see a failure, they kind of know.
Yeah, I've seen that before; this test is flaky, and don't ask me why, but it just is, right? In other words, it's not really quantitative in any meaningful way, but that's the usual state of business in this world, and in my mind it speaks to a missed opportunity. We usually think of engineers as numeric, rational, objective; nothing like that in reality.

But some of the teams are doing better, and this is something we can see in their engineering write-ups. The usual way of computing this is that you track what code is being tested, and then if you run the same test against the same code and it produces different results, you count that as flakiness. Even the Google folks do it this way, so that makes me feel good: if it's good enough for them, it's probably good enough for me. In my company we produce services that compute these metrics and provide them to the engineers, and we measure it the same way.

But I also feel it's a kind of naive, simple way to do things. As we know, one of the common ways tests fail is that some external dependency is broken; say your network is having issues, so tests start failing. So you stop the run, fix the dependency, and rerun the tests: the same code, tested again by the same tests, and now they pass. Different results, so in this measurement it gets counted as a flaky test failure, but I don't think the developers would see those as flaky failures. So I say it's a naive way, but it does work; it's easy enough, and the lesson I learned from it is that sometimes a stupidly simple solution works well enough, which is basically engineering principle number one, right?
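That naive measurement, same code and same test producing different results counts as flaky, is easy to sketch. The data and names below are invented:

```python
from collections import defaultdict

# Invented sketch of the naive flakiness metric: a (commit, test) pair
# that produced both a pass and a fail on identical code counts as flaky.
def flaky_tests(runs):
    outcomes = defaultdict(set)             # (commit, test) -> {True, False}
    for commit, test, passed in runs:
        outcomes[(commit, test)].add(passed)
    return {test for (_, test), seen in outcomes.items() if len(seen) == 2}

runs = [
    ("abc123", "test_login",  True),
    ("abc123", "test_login",  False),   # same code, different result: flaky
    ("abc123", "test_search", True),
    ("def456", "test_search", False),   # code changed: not evidence of a flake
]
found = flaky_tests(runs)               # {"test_login"}
```

As noted above, this naively counts an external-dependency outage as flakiness too: a fixed network followed by a rerun looks exactly like a flaky test to this metric.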
Some other teams have gone even further. This one came from Spotify, which I thought was pretty clever. They observed that flaky test failures happen randomly. So this is tracking different test cases run at different timestamps, I think that's what's going on. When they see one build where lots of tests fail at the same time, they don't count those as flaky failures; there's probably some common cause behind them. But the random failures here and there, the one-offs, those they recognize as flaky failures. So this idea that you should compare a failure against the other tests that ran at the same time seems like a very reasonable, practical thing to do.

And then there is the Facebook trick. Oh, no. I do apologize on behalf of computer systems everywhere; doesn't that happen to you? Every time I'm at the airline counter and the staff are having trouble with their system, I feel like apologizing, on behalf of software engineers as a whole, that the program is not working well. This is not coming back... okay. All right. Yes, you can do it; I know you can do it. It was working until now. I'm tempted to start a proper troubleshooting session, but I also feel that if I was able to keep going for 30 minutes, I should be able to go for another 30. Good suggestion, I love it; let's do that. Let me quickly try a different connector, since the organizers have one here. Oh, come on. Let's not jinx it. I am so sorry. Maybe this is the downside of missing in-person conferences for two years;
I forgot how to do these things. Yeah, this is still not working. I think I'll avoid full-screen mode, because before the show began that's what was causing trouble. I've tried two different connectors here, so it must be my computer. I guess I'll just keep talking, and if the flickering starts to get too annoying, maybe you can just look at me.

So, what was I saying? Yeah, Facebook. They have some of the craziest approaches of everyone, and this is what happens when you hire PhDs and put them on these problems. They came up with a statistical model, otherwise known as Bayesian inference; it's the kind of mechanism used by things like spam filtering. It's pretty straightforward but interesting statistics. What they say is that there are two reasons a test could fail. One is that the test fails legitimately: something is bad, so whenever you run it, it always fails. The other is that even if the code is good, a failure can happen from flakiness, statistically, just by sheer chance. So they model those two things separately. But, they say, the only thing we can observe from outside is pass or fail; we can't tell those two causes apart directly. So given only the passes and fails you see, what is the likely combination of those probabilities? That's how they go about it.
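A toy version of that two-hypothesis model might look like this. All priors and probabilities below are invented: the test is either "broken" (always fails) or merely "flaky" (fails with some probability), and each observed pass or fail updates the posterior.

```python
# Invented toy Bayesian sketch: hypothesis A, the test is broken and
# always fails; hypothesis B, the test is flaky and fails with probability f.
def posterior_broken(observations, prior_broken=0.5, f=0.3):
    p_broken = prior_broken          # weight of "broken" given the evidence
    p_flaky = 1.0 - prior_broken     # weight of "flaky" given the evidence
    for failed in observations:      # True = failed run, False = passing run
        p_broken *= 1.0 if failed else 0.0   # a broken test never passes
        p_flaky *= f if failed else (1.0 - f)
    total = p_broken + p_flaky
    return p_broken / total if total else 0.0

streak = posterior_broken([True, True, True])   # repeated failures
mixed = posterior_broken([True, False])         # one pass rules "broken" out
```

A single observed pass drives the "broken" posterior to zero, while an unbroken failure streak pushes it toward one, which is the pass-or-fail disambiguation the model is for.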
And then the last, the third leg of the stool, is escalating the flakiness. So the naive solution, and actually the one I think we should be using more often, is to just throw the test away. I mean, okay, I know we are supposed to be fixing tests, but really, because of the nature of testing, some of these flaky failures are just inherent to the problem. (Oh, do we have professional help? Thank you. See, that's what the pros can do: they look at the problem for five seconds and already know what's wrong. I love it. Thank you. Wow, it's like magic. Thank you very much. And that's almost what technical troubleshooting is always like, right? We're looking in this place and the problem is over there. I love it.)

Okay. So, yeah: if you're doing integration tests or large-scale testing, there are always some bits of uncertainty in the behavior, and they compound and cause flakiness. So sometimes, if the flakiness is just too high, you should be asking: is it worth keeping this test? The test has a certain benefit, which is that it catches some problems, but if the cost of maintaining it, or of the flakiness it creates, is too high, then the price may be too high for the benefit, and you should throw the test away.

But again, some teams have done a lot more. This next one is done by GitHub.
You can see the logo here. What they did was quantify the impact of the flaky failures. They compute this thing called an impact score, and I don't know how they come up with the number, but they can say things like: okay, this flaky failure has affected 2,700 builds, and it created delays in 37 deployments. All of that is to quantify how painful a given flaky failure is, and as I mentioned, since flakiness is unevenly distributed, that's a great way to help you focus on the right thing. And again, they're appealing to the managers: saying this is slowing down 37 deployments is a better metric than saying, I know this flakiness is really annoying, which is qualitative information that doesn't move those people. They even try to diagnose the problem by saying: okay, this test touches code modified by this person. They try to pin the investigation down to a specific person, at least to begin with, so that it doesn't become a collective team exercise; when the blame is shared by the whole team, nobody is really responsible, right?

Just think about how much effort they spent building a system like this. Can you imagine somebody on your team doing it? It's probably going to take at least one quarter just to build this web app, not to mention all the data collection and computation behind the scenes. That even GitHub has a flakiness problem big enough to justify engineers spending this much time is both encouraging and depressing about the state of software engineering.

And then there's Dropbox, with something I thought was really clever. They have a system built around getting a lot more information about the flakiness.
What happens is that when they first see a test failure, they put the test into a "noisy test" state, meaning their system hasn't determined whether the failure is caused by flakiness or by a legitimate problem. Usually, in most teams I've worked with, a failure alone is enough to get escalated to a developer right away. They don't do that. Instead, an automated system kicks in and starts running the test, let's say, multiple times: they check out the old code and run the same test against it, to see whether some of those runs fail too. Then, with that extra information, they determine either that the test was indeed flaky, in which case the test is quarantined, the developers are notified that the test is potentially problematic, and the change that was flagged by the failure is allowed to pass in; or the code is determined to be bad, in which case the proposed change gets rejected and reverted.

The fact that they built this kind of system to triage, almost, before the developers are asked to look into it is a great first line of defense. They are trying to shield the developers from this lower-level analysis, and I feel that throughout our history of automation in software engineering, that's basically what we've been trying to do: shielding the developer away from all the lower-level details. So they built a system just like that.

Now, what I'm really curious about, as I was hearing all these stories, is: okay, once the developers get this information, what are they doing with it? If at the end of the day it's just to throw the test away, then what's the point of building all these systems?
That's something I'd love to ask the people building the CI flakiness tracking system at GitHub.

And then the last bit, which I couldn't quite fit into any of these pillars but is nonetheless the approach I love the most: the more architectural approach to dealing with flakiness. In other words, changing the product itself so that it becomes more testable. The Airbnb folks developed their whole mobile application framework with testability designed in from the get-go. The mobile app is one of the hardest things to test because it's got stateful transitions and interactions with the server. Originally they were doing what the rest of us do: mock up the server, simulate people's interactions, and see whether the app lands in the desired state or not. But just like web apps tested through Selenium, this kind of test is inherently flaky, because it's testing a big thing end to end. So they built what to me looks like the React Native idea transposed — or maybe they just wanted to build their own framework, whatever — but they made testing so easy, and as a result they tamed the flakiness that way.

I'm more from the Java developer land — an old-school guy. Early on, maybe ten or twenty years ago, in server-side Java, the whole programming paradigm called dependency injection was invented in part because it was deemed more testable. When we say DevOps, we like to say devs and ops join together, we love each other. But when testing became part of what software engineers do, it really changed the way we write code. So this is a key example of that, which I wanted to mention. We should be having more of those, but we
don't.

So, the main approaches I mentioned: like I said, the status quo is keeping the builds green, and then these teams measure flakiness, and that's how they tame flakiness. So where do we go from here? What's up next?

After seeing these different teams take on this problem in different ways, I started to notice a certain trend, which is that the problem is splitting into two parts. One part is inherently local — unique to every development team — which is being able to run the tests against the code in your setup. Most of the build pipelines I've seen intermix build and test together in ways that make it hard to just run the tests. But in order to do any of these sophisticated approaches, it's key to be able to run an arbitrary version of the software against an arbitrary test, quickly. That requires some surgery on your delivery pipeline, and that's unique to each development team. The environment you run the tests in is often also very unique and not portable.

Then there's this other part that seems more universal. Like I mentioned, the Bayesian inference algorithm that the Facebook folks developed — I can't imagine different teams each implementing that, and there shouldn't be a need to, right? It's just pure data processing. So if we can capture test results in portable ways, and then build a common open-source library or service or something, then that part — processing the data in a statistically sophisticated manner to extract useful insights, what I think of as the brain part — should be reusable. We shouldn't have every engineering team in every company reinventing that.
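To make the "brain part" concrete: one simple, reusable signal you can compute from portable test results is a flip rate — how often a test's outcome changes between consecutive runs. This is my own stand-in for illustration, much simpler than the Bayesian approach mentioned above.

```python
# A simple flakiness signal computable from portable test results:
# the fraction of consecutive runs where the outcome flipped. This is
# an illustrative stand-in, not Facebook's Bayesian algorithm.

def flip_rate(history: list) -> float:
    """history: chronological pass(True)/fail(False) outcomes for one
    test. A stable test scores 0.0; a coin-flip test approaches 1.0."""
    if len(history) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(history, history[1:]))
    return flips / (len(history) - 1)

print(flip_rate([True] * 10))                             # 0.0
print(flip_rate([True, False, True, True, False, True]))  # 0.8
```

Note that the input is exactly the "simple data model" a portable test-result format needs: test identity plus pass/fail history.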
So if that's the reusable part, and running tests against different versions of the software is the more local part, then that seems like the role split we should be heading toward. I'm more interested in building this brain part, with tentacles spreading into your build and test environment. That part, I think, should be collectively built — and when I say "we" I mean as an industry; collectively we need to figure out a way to do that.

So that's what I wanted to talk about. And I appreciate your patience with all this — this screen position, it turns out I put my bag in the wrong place. Any questions? Any opinions?

All right. Oh, sorry. Yeah, uh-huh. So the question is: if we're talking about this reusable, industry-wide brain that collects test results, how should we be collecting them?

Well, I think at the end of the day the data model for a test result is pretty simple: which test, did it pass or fail, maybe how long it took — and if you can capture the log output from the run, that'd be perfect. So the data model, I think, is pretty straightforward. Now, how is that produced today? There's this de facto format called the JUnit report, because the JUnit folks invented it, and that's pretty much emitted by most of the test runners out there. You'd be surprised — the Python folks, the Ruby folks — it's basically the only format that's even semi-widely used. Every time somebody wanted to write a test report, it became the format. So "JUnit" is kind of the wrong name for it at this point. But anyhow, that's how I see the practice being done, if that answers the question.

Not quite? Uh-huh. So, yeah — the question is: what about TAP, and what about go test? I believe go test does have a JUnit-report plugin or module. As for TAP,
I honestly don't know. But I'm not too worried about the actual physical format, whether that's JSON or XML — you only need a little bit of glue code to extract it, since it's such a simple data model. At least that's how we implemented our system, and Jenkins also works the same way.

All right. So the question is: the time and effort to figure out test flakiness is too much, so why don't we just accept that flakiness happens — forget measuring, and just react, like what the Dropbox folks are doing?

Yeah, I'd actually posit that building a system like that involves far more effort. Most of the teams I've seen do not have that kind of state-machine logic and the ability to retry individual tests and so on. But if you're in a situation where that's easy enough, great. Whereas, in contrast, the effort to compute the flakiness is a one-time problem. You shouldn't each be implementing it; somebody should be able to come up once with the logic, the math, the algorithm that takes data like this and spits out the flakiness of each test, right? That's the part I'm calling for reuse on. So that's how I think about it.

But perhaps another way to think about it is that it's not either-or. Different teams are in different situations; some things are easier here, other things are easier over there. That's great — you should do what makes sense for you.

So, let's see. If I try to summarize the comment: sometimes flaky tests are still useful, right? Well, I can't think of a politically correct example, so let's skip that, but yeah — sometimes the tests are flaky but useful.
So what about those? And I guess the other point was: once the flakiness is measured, do individual developers actually do the work of using it? Indeed, I feel like the flakiness of a test is team-level information that's supposed to be visible to everybody. I never think of these things as individually calculated by developers; it's the development system's job to keep computing them and make them visible, so that the delivery process itself, or a developer, can use this information — maybe to retry, maybe to decide which tests to pick. And precisely because you can't always throw away a test that shows some flakiness, it's useful to measure, so that you can decide when to throw it away. Yeah.

All right, anything else? Oh — yeah, so the question is: great, these guys have already done it, can we just use theirs? You'd wish. No. What's actually happening is that there's a small number of these mega engineering teams that do engineering in their own ways, so unique that you cannot pull out individual pieces and use them elsewhere. I think that's what's going on with these companies. And I know this because in the early days, when I was building Jenkins, everybody had their own in-house CI systems. When these big teams solve the problem, they only solve it for their own team, and they don't make their solutions generally usable. That's why I don't think we'll see the code or anything come out of this, and I don't expect that to change. It's actually surprisingly hard to make general-purpose software.

To put it differently: if you're working at a SaaS company and you have a system like this, and somebody comes along and says, "just give me the source code, I'm going to run it elsewhere" — you'd just laugh.
"Good luck," right? There's so much operational insight built up among the people on that team to operate that kind of system that the code alone is probably not going to be very useful.

All right. Oh — oh, thank you. Thank you. Yeah, so first there's the story: there used to be this test suite that took two hours. Nobody cared; they basically ignored it. Then somebody came along and shrank it down to, you said, five minutes? That's an amazing reduction — can I hire that developer? And then people suddenly got motivated to add more tests and actually started using the suite for real. So I do agree that test duration matters — a long test suite is a problem, because people start to mentally check out, right? I care about this problem so much that it's the problem I'm solving at Launchable, my current startup. But I've been struggling to quantify the difference. It's one of those things developers feel — the difference between a test suite that takes two hours and one that takes five minutes — but that we struggle to describe in ways that the people with money would understand.

Yeah, I'll leave it at that. All right, anything else? If not, well, thank you very much, and enjoy the rest of the day.

All right, you guys hear me okay? All right, we'll go ahead and get started — it's three o'clock. Thanks for coming to my talk. I appreciate it. It's great to be back at SCaLE.
I've been an attendee for many years, but this is my first year actually speaking, so that's pretty cool — a little different. My name is Dan Isla, and I've been here in SoCal for maybe 13 years, so I'm not quite a native, but definitely a transplant, like most other SoCal residents.

I'm going to share with you a project I've been working on, as well as how I've continued to work on it since I left Google. I'll give a little bit of my background — where I'm from, some of the things I've worked on before — and then talk specifically about this open source project I created, called Selkies, that's used to orchestrate stateful workloads for users on Kubernetes, as well as to deliver remote development environments and IDEs to the browser.

This is a small group here, so just curious: how many of you would say you're developers — you do developer stuff all day? All night? Sweet, definitely all night. Okay. And what about folks who do DevOps, maybe systems-level stuff — Kubernetes, containers? Yeah, very good. This talk will hopefully appeal to both crowds. From the developer point of view, I want to share my perspective and some of the tools I've built around enabling developers to be more productive through remote development environments. From the DevOps point of view, I'll show the usefulness of containers and orchestration, and ultimately running a managed service at scale.

We have a whole hour. I have content that fills some of that, but we'll kind of go with it — we can make it interactive, and there are a few demos. So let's jump right into it. I think this is being recorded and streamed to YouTube, so if anyone out there is watching, welcome to the stream.

So again, my name is Dan.
I went to Boise State and spent four years there studying electrical and computer engineering, as well as computer science. What I learned over that course of time was basically that final exams are really hard. But in reality, I always wondered why it was so hard to just start programming. Right? You have to set everything up: what are all these Java dependencies, why is this left-pad thing breaking all my code, why won't this build on my Mac? It was basically just a lot of pain trying to set up a developer environment so you could do a simple task. And especially when you're learning, in a university setting, you're not only trying to learn the language and how to code, but also what all these package managers are, how this works, why it isn't working on your machine. If you take these practices to your job, the cycle just continues: okay, I got a new computer, this is going to be a problem. Step one: spend a week setting it up.

So right after Boise State, I moved here and joined the Jet Propulsion Laboratory up in Pasadena — this is where this conference normally is, so it's kind of my old stomping ground. I was there for a long time, about eight years, as a systems engineer and tooling engineer. What I learned there was that operations is pretty hard. And I'm talking about operations not only in the sense of software operations and running services at scale, but actual spacecraft operations at JPL.
We fly spacecraft to the outer planets and do exploration. Real-time ops is hard: not only might you have to live on a Mars-time schedule, or deal with mission-critical hardware and software, but operations in general is hard even in the software world. You're trying to keep services up and running that might be falling over, and maybe a bug in your code is breaking not only your services for your users but, at JPL, a spacecraft on another planet.

I worked on the Mars Science Laboratory — the Curiosity rover — and was there on landing night. It was a lot of fun, pretty much the highlight of my career at JPL; dream job number one, if you will. We landed in 2012.

But what I also learned was that processes and bureaucracy pretty much ruin your productivity. As soon as a lot of process and bureaucracy comes into your sprints, or whatever you're trying to get done, it gets really hard to be productive, and it can kind of eat away at your soul. Being productive is very important to developers: we want to get things done, we want to write code, we want to ship it to production, and the more of this that comes into play, the harder that becomes.

We had a lot of problems like that: you have to be given company hardware, you can't code on your own device from anywhere, there's all the security and antivirus stuff to worry about, there are VPNs — how do I even connect to my source repository? Where even is my source repository, behind the firewall? Work from home was actually really hard at JPL.
We didn't really have a lot of tools, such as video conferencing, that worked, or streamlined processes to enable work from home; it was very much an office culture. This has changed a lot since 2020, but it was still difficult, and it just became a productivity nightmare. So I left to work at Google.

At Google I joined as a solutions architect and was there for about four years working on cloud. I really wanted to work for a big tech company and see what that was like, and Google had always been on my radar — it's amazing. So I joined at Google Irvine, down here in Southern California.

What I learned there was very eye-opening. Right away, of course, the free food was amazing — basically all your meals are provided; everyone's heard the stories of the Google food, and most of it's true. And it was like stepping into a time machine and going to the future, where all the technology is completely alien. It's like: what is a virtual machine? I don't know, we have this thing called Borg. All the processes for developing software were very different from anything I'd ever seen in industry, and it felt very futuristic. Google has this thing called Borg, which is literally alien technology, and they were gracious enough to beam down a portion of something that looks like it, through Kubernetes, to us earthlings. But they still run Borg for the most part.

The developer experience was actually really nice — it was amazing. But it had quite a lot of differences. We weren't allowed to have code on laptops: nothing from google3 was allowed to leave, so if you were working from home, you kind of had to be in the office to access this stuff, or use some kind of remote desktop. And there were no VPNs — this was kind of amazing. Everything just worked: you open your browser
They everything just kind of worked you open your browser And you can get to your source repository. You can do your code reviews. You can You just go to url and everything was kind of magic This was done with like what we now know is like the beyondcorp proxy and mutual tls, but it was just kind of a it just worked Code reviews were we're very straightforward because it was a giant mono repo You had actual folks that could own portions of that repository You always knew who to call who to contact who to pull into a code review You had people that were actually certified in Readability, you know, like this person has java readability So they're gonna automatically be pulled into my code review And they're going to make sure i'm following all the standards best practices and style guides The mono repo, um, some of you may have heard google's mono repo You know it works when you're at that scale And the one thing you have to do is bring in all of your tools all of your tools have to fit in that model Like you can't have a distributed repo and have all of your tools kind of working the same way it's very difficult So the mono repo like you have to use those tools. They're all built from the ground up There's very little open source if you will And the ci city's chains were were fully integrated. So everything just kind of worked you make a change A small change everyone gets automatically pulled into do review unit tests automatically run Things automatically run to get deployed out to production And sre just kind of holds everything up and running in production. 
It was very alien, but it was also extremely streamlined. Developers were kings: they did everything they could to make sure developers were the most productive people in the company, and it was great.

But this only really works for Google because they started that way. It's very difficult for an enterprise to shift into that model — especially if you don't need a monorepo, or you don't have all the tooling, or you have a hodgepodge of tooling, some built in-house, some vendored. It can be very difficult; Google started that way from the beginning. So the rest of us are stuck in our little Next Generation shuttle, trying to figure out how to do Kubernetes well, while Google's off doing their Borg thing.

What I really learned was that the browser can enable this secure productivity from anywhere. Like I said, for any asset we needed at Google, you could just open a browser and go to an internal website — you didn't even know whether it was internal or external; you could just get to things very easily. Everything from code reviews to searching for code, even writing code in a basic IDE — again, built from the ground up, and it only works with the internal tools. And also Google Workspace: most of you have probably worked with Gmail or Google Docs or Slides. It's all in a browser; you don't need to install anything. You log in with your normal credentials — if you ever even see the login screen, because there's SSO that automatically logs you into your tools as well as your source repositories. You've got your little security key that you tap, and that has a browser integration. So everything from two-factor to identity is fully encompassed using browser standards.

It also helped that Google owned the browser, right?
They own Chrome and the Chromium open source project — they're the main contributors. So you can actually build a lot of these opinions into your platform, into the environment you want people to be working in.

Seeing this, I really wanted to see if we could solve these problems and bring some of this technology to the rest of the world. Since I can't actually take it out of Google, how can we rebuild it so that we can obtain this level of productivity, this level of access and convenience?

I identified three problems when I was there as a solutions architect working with customers. Those were: access — access to the software, to environments, to tooling; delivery — how you deliver that: do you have to install some client, do you just go to a web page, is it in a Google Doc?; and orchestration — how you run this at scale, how you deliver it to your entire developer team, how you enable everyone to be productive on day one, in a way that isn't falling over and is easy to manage and operate.

From the access point of view, it's my opinion that remote is actually better, and the thin client is kind of coming back into vogue — not only in the Google world. The browser is your thin client, and the browser can run on almost any device, so you don't need super powerful local devices. The Chromebook was actually incredibly useful, because all you need is a browser and everything else is being run remotely. This means you have access to better hardware as a developer: you can compile bigger things, because you're running in the remote environment, so you can have way more memory and CPUs, and a faster internet connection. If you're working from home and you only have maybe 50–100 megabits, you connect to a remote system that's got a gigabit, two gigabit plus, network connection.
It's just nice. If you're working with things like Docker, pulling in large images while your local connection is crying — no problem, you're just running it remotely.

Working in this remote world also becomes more collaborative: you can build content in a remote session and share it with peers or collaborators, because everyone's connecting to a common point, which really is the cloud at the end of the day. And it lets gamers play from anywhere — you may be familiar with Google Stadia, a game-streaming platform; there are several game-streaming platforms out there now, but it's the same idea: remote environment, remote connection, remote compute, bringing that all the way to the local device and taking as much off the local device as we can. It's making this remote world a little better.

And as for connections — people are always connected. In fact, the internet is like a utility; I think we all learned that in 2020 when we had to go home and work. If your internet goes out, you can't do anything, right? You're very non-productive. So as the world becomes more and more connected, developers are always online.
So the idea of having a remote environment is becoming more and more palatable. One of the exceptions is still flying on an airplane, where you have some satellite-connected internet connection — your remote environment is not going to be so great if you're sitting on a plane. But that's okay; I think I saw a tweet saying maybe you should talk to the person next to you on the plane, or take a nap. Either way, remote is coming back, and I think the browser is basically the new thin client.

Now, some of the problems with access. First, identity and authentication. For example, if you have to download a fat client to connect to a remote environment — an RDP or VNC client or something — you have to log in twice, or get some special certificate, or somehow integrate with an additional identity system. That just adds friction, for a developer or really anyone: dealing with multiple layers of authentication, needing to be on the right device with the right configuration. Identity can be a major problem.

Second, networks and firewalls are hard — even running a conference at this scale there can be network problems. And this is especially true when you're at home on a home network: you can't rely on someone's router configuration, and you're working with your network security team to open firewall rules, make exceptions, peer networks, set up VPNs, things like that.

And another problem is security and endpoint isolation. How do you isolate your local device from everything else you might be running on the computer? What if you have a virus on your computer — how are you going to isolate that from your actual corporate IP, the work you're doing, the code you're writing, that environment? How do you keep that safe?
If you can't trust the local device, you end up with all these solutions — enterprise endpoint management, virus scanners — trying to stay ahead of the curve. These are the problems that come with access.

And the web solves the majority of them. For identity, web SSO has been around for many, many years: you log in with Google, that token follows you around to any Google asset, it just works, and you don't have to think about it. It's better, too, because you can have your two-factor authentication — it's all in the browser; the browser understands these things now and it's very seamless.

For networks and firewalls: web proxies, things like nginx and HAProxy, are very common now for granting access, integrating with your SSO system, and traversing the network. Also, the browser is pretty limited: you can't really do a whole lot; you can only make web requests, right? You can't just open some random port with some random protocol — the browser doesn't let you do that. And technologies like WebRTC make this even more seamless; we'll talk about that a little later.

From an isolation point of view, there's this concept of pixel streaming, meaning that the content of your environment never actually touches your local device. You're just looking at the rendered manifestation of it, streaming those pixels over the wire. This lets you do things a lot of CISOs are concerned about with developers, like one-way copy-paste: you want to make sure you can copy all your code from Stack Overflow into the environment, but you can't take the custom stuff you write out. With pixel streaming you can disable that outbound copy-paste, as well as things like outbound file transfers. And from a remote-solution point of view, you can firewall and control the network environment remotely, and make sure you can't send things to Dropbox or Google Drive or other random destinations. So
those are some of the access problems and the solutions I've found for them.

So let's talk a little more about pixel streaming and why it's interesting. Pixel streaming lets you deliver graphical applications basically as they were built, as they were intended. There are web-based IDEs out there, like Cloud9 from Amazon, or Visual Studio Code in its server form, that let you run an IDE in a browser. But at the end of the day it's not the original IDE, and it may be an IDE you're not familiar with, so you're having to do a lot of context switching: okay, I used to be able to do this in my VS Code environment, or my IntelliJ or Eclipse environment, but now I'm having to relearn this new IDE's way of doing things, because the environment and the application are fundamentally different. With streaming, you can just run that application, send it over the wire, and it's the exact same application you have on your local device — just wrapped in a different way and delivered remotely.

Another thing I like is that you can recover where you left off. I like to write some code, maybe run a build or a few builds, step away for a bit, close my computer, maybe go sit on the couch with my laptop — and you open it back up and just resume where you were working, without having to switch gears, recover your session, or restart a process. If anything, only your keyboard or your resolution might be different. There are no local dependencies when you're streaming: if you can play a video on YouTube, then you can basically pixel stream to your device. And as we talked about earlier, endpoint isolation is another benefit you get with streaming — that extra amount of local device security and isolation.

Now, some of the problems with streaming. The biggest one everyone throws their hands up about is latency:
"I'm going to be coding, and it'll take ten seconds for my characters to show up," right? That might have been true a few years ago — it's a little better now — but it is definitely a problem. Then bandwidth: you need to make sure you have a good enough internet connection. Image quality: I don't want to be looking at blurry text if I'm staring at text all day trying to write code; that's going to give you a headache, so you need protocols that render text well and keep the quality high. And then, again, networks and firewalls: how do you create connections that are low latency and can actually get through everyone's crazy, messed-up networks — and your corporate networks too?

There are even more challenges specific to browsers. On latency: the browser cannot create direct socket connections. You cannot just open a random UDP port to a server on the internet — the browser won't let you do that; there's no such thing. So protocols built on UDP, which would reduce latency because there's no transmission control and no packet queuing, aren't directly available; the browser has a problem there. Instead you get TCP congestion control — you can run into this if you're using WebSockets — and you just get this inherent latency because of the protocol the browser limits you to.

Then there's image quality and bandwidth.
Obviously, if you reduce the amount of bandwidth you're streaming with, your image quality can suffer, and a lot of that is usually related to which codec you're using: the method you're using to compress the image and send it to your browser. The browser doesn't have all these codecs built in. It doesn't know how to decode, say, lossless images using a brand-new compression algorithm that can give you nice, color-correct YUV444 output; that's not built in, so it's something you have to overcome or compromise on. And then native interactions: we all know browsers have keyboard shortcuts we're used to, Command-T to open a new tab, Command-W to close one. Well, what if you're in an IDE and you try to use those same shortcuts, and now it closes your whole browser rather than closing the source file you're working with? These are browser problems: the browser likes to hijack some of your shortcuts, and developers love shortcuts. So, some of the solutions. WebRTC fixes a lot of these things. WebRTC is a standard that has recently been finalized; it's what a lot of video conferencing tools use, and it's for peer-to-peer browser communication. We're going to talk more about WebRTC in a minute, but it's basically the only way browsers can make UDP connections to another endpoint on the internet. Browsers have a WebRTC peer connection object that understands, and very tightly controls, how to connect to another browser or another server. It doesn't have the latency problems, not as many anyway, primarily because it uses UDP and secures it with DTLS, datagram TLS, so you can get a much lower latency connection.
It's much more tolerant of internet traffic and internet weather, and it can recover from packet loss a lot quicker. Then, from an image-quality point of view, there's a new standard coming to browsers called WebCodecs that tries to standardize the more modern codecs like VP9, H.264, H.265, and AV1, and make them more usable within the browser. Right now the browser supports a lot of those, but it's still not 100 percent there. And local GPUs: as of about 2019, almost every laptop and consumer device has a basic GPU capable of hardware-decoding these codecs, H.264, VP8, VP9, and that's mainly because of YouTube, because of internet video. To decode 1080p video on a low-power device you have to have a GPU; your CPU is going to cry if it tries to do that by itself. So they're pretty much standard in every device: you could buy a hundred-dollar Chromebook at Best Buy and it will have a GPU capable of hardware-decoding video. That's really important, because if you can't achieve 60 frames per second on your local device from a stream, then your latency is going to be capped. You're not going to get sub-20 milliseconds, or even sub-50 milliseconds once you add in network latency, if your computer is taking hundreds of milliseconds to decode each frame. From a native-interactions point of view, there are progressive web applications, PWAs or progressive web apps. These basically give you back control of your keyboard shortcuts by making the application look like it was installed locally, and kind of feel like it too.
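The frame-budget math behind that claim can be sketched in a few lines; the decode and network numbers below are illustrative assumptions, not measurements:

```python
# Rough latency budget for 60 fps streaming.
# All timing numbers are illustrative assumptions, not measurements.
FPS = 60
frame_budget_ms = 1000 / FPS  # ~16.7 ms available per frame

hw_decode_ms = 5    # assumed per-frame hardware (GPU) H.264 decode time
sw_decode_ms = 40   # assumed software decode on a low-power CPU
network_ms = 20     # assumed one-way network latency

def end_to_end_ms(decode_ms, network_ms):
    """Simplified end-to-end latency: network transit plus decode time."""
    return network_ms + decode_ms

# With hardware decode the client keeps up with the 60 fps budget;
# with software decode each frame overruns the budget before network
# latency is even counted.
print(hw_decode_ms <= frame_budget_ms)  # True
print(sw_decode_ms <= frame_budget_ms)  # False
print(end_to_end_ms(hw_decode_ms, network_ms))  # 25 ms total with GPU decode
```

The point of the sketch is the inequality, not the exact numbers: once per-frame decode time exceeds roughly 16.7 ms, 60 fps is unreachable no matter how fast the network is.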
So you get the full-screen experience, and the shortcuts you would have lost when the app was just a tab in the browser: you get those back, and they don't take over the entire system anymore. It's something you get for free in browsers now; you just have to write a little manifest, set up some JavaScript workers, and you can install your application as a PWA. These problems were mostly solved with a project called Selkies. Selkies is the open-source component that I started at Google, for a customer, when I was a solutions architect, and then open-sourced on GitHub as a way to show people examples of how to solve these streaming problems, this delivery and access problem, in a fairly opinionated way centered around containers and Kubernetes. Selkies has a few components. It has a broker that's used to actually assign workloads and do identity management, and then it has a streaming component; this is the WebRTC core that uses GStreamer to do accelerated encoding and transport to a browser over WebRTC. So there are a few components, but they all run in Docker. They can also be run on a VM, and it's very Linux-centered right now. There are efforts in the community to port it to Windows, but we're not quite there yet; it's very Linux-focused, especially container-focused. So I want to show you a little bit of Selkies and talk about this project. Selkies, like I said, lets you do this kind of stateful workload. You can basically go to a URL in your browser that's backed by your Kubernetes cluster, and it will connect to the WebRTC stream when the container is running. This is the web interface: connect to the stream, and we're streaming. This is Rocket League running in the browser using H.264 with a GPU attached, a Tesla T4 accelerator. I don't know why the screen is glitching like that, sorry, I hope it's not too distracting, but this is streaming at 30 frames per second right now, and there's pretty much no input latency. I'm sure a pro gamer would be able to tell if there's any latency in this connection, but if I were just using this casually I could play this game and it'd be fine. And this is running at basically 1080p, streaming to my Chromebook, which doesn't even really have a powerful GPU; it can run videos. This isn't glitching on my screen, by the way; I think it's just their adapter. Let me try to reset it real quick so you guys don't go crazy. There, it's refreshed. We're going to talk about the WebRTC architecture here in a bit; I just want to give you a little preview of what it can do and how it works. All my keyboard and mouse interactions are translated from JavaScript, sent over a WebRTC data channel, and typed into the container, so you get responsive interactions, and then the video stream is sent over WebRTC. The web interface has some statistics that show up in the side here, giving you an idea of your current bit rate, frame rate, and latency (the server's running in Oregon right now, on Google Cloud), plus GPU load and some stats about the actual environment's resolution, as well as the audio; it does have audio, I think it's just muted right now. So this is the basic web interface, and it's very simple. From a WebRTC point of view it's very developer friendly.
You don't need a whole lot of code. The back end is a completely different story: to build a pipeline like this there's a lot more code on the back end. And it's just running on a Kubernetes cluster. One of the things you can do with Selkies is GPU sharing, so you can actually run more than one workload on the same GPU. This Blender instance is running on the same node, with the same GPU, in parallel with my Rocket League session. So that's the GPU-accelerated WebRTC side, but Selkies, being an orchestration platform, can also run things like IDEs and streaming platforms that don't even use WebRTC. This is a non-WebRTC application running on the same cluster in a container, but this time I'm delivering VS Code to my browser. This is the full VS Code IDE; we have access to the entire Microsoft extension library. It looks and feels just like a local IDE, but you'll notice I'm still kind of in a browser. So this is that progressive web application: you install it, it gives you a little icon on your desktop, and it looks just like an IDE you would have locally installed. But now, in this view, you have your keyboard shortcuts again; I can do things like Ctrl-W and it closes the tab, and it doesn't look like it's running in a browser anymore. In fact, you can even close it and then launch it from the shortcut, and it will start back up as if it were just another application. There it is, there you go. Again, this is the full IDE, but because we're pixel streaming, this is actually using a different codec; this isn't even using WebRTC. This is using an open-source project called Xpra, which basically does window forwarding.
Window forwarding is an old Linux concept that's pretty popular on thin clients, where you only send the window contents to the remote session, and it's up to the remote session to style it, put decorations on it, and make it draggable. That's what Xpra is doing. So this window here: there are basically two windows being streamed in my browser now, and there's no latency when I'm dragging a window around, because it's its own HTML5 element now. Only the window decorations are local; the actual contents of the window are remote. And because I'm running my IDE in a Linux container, I also have access to things like a browser, so I can have an embedded browser running alongside my IDE on the remote system. I don't have to worry about firewalls or connectivity if I want to test something like a web app. One of the things developers love is multi-monitor support; you like to put things on different monitors. How do you do that when you're streaming a session or a view? So I've built some extensions to the Xpra project that give you this sort of window undock feature: it opens up a window in its own surface that you can then drag to any monitor you want, put them side by side, interact with them. So it gives us kind of a new way of working with windows in a remote environment; not Windows the OS, I mean window management. So the Selkies project has several use cases, some of which are highlighted on the website here.
The website's very early; I haven't put a lot of effort into it yet, but it highlights the main use cases. Code editors are one. Full developer environments, with all the tools you'd be familiar with on a local device. Game development with high-performance GPUs; this is actually the Unreal editor running in a container on Selkies. The Docker image was 70 gigabytes, seven zero, just to build and run the entire Unreal Engine; it's pretty heavy. And then you can also do game streaming; we saw Rocket League. My first prototype was getting SuperTuxKart running, kind of the hello world of game streaming if you ask me. So there's the main website, and then you've got the GitHub repos for Selkies. This is the broker that we'll talk about in a bit, and then there's the selkies-gstreamer repo; this is the actual WebRTC component I mentioned, which we'll dive into a little bit more. We'll come back to these repos; I'm going to jump back into the slides here. The screen isn't glitching anymore. So the theory, after building this and proving it out, without actually turning it into a product at Google or anything like that, was just a validation that hey, if this is good enough for games, it's probably good enough for developers, from a latency and productivity point of view. And WebRTC is basically at the core of that, because of the problems identified earlier: not being able to make UDP connections, the codecs, the GPU acceleration. One of the most popular ways to get a remote desktop in a browser is noVNC. It's an open-source solution where you're basically just wrapping the VNC experience in a WebSocket and putting it into an HTML5 canvas. This is fine.
It kind of works if you're just doing remote system administration, logging into something to check stuff out. If you're trying to do full-resolution work, it really starts to suffer; the compression algorithms are not great, and it was never designed to run in a browser. The bandwidth can very quickly shoot to 50, 100, 200 megabits just for full-screen video, because VNC isn't using video codecs; it's a hextile-based encoding with JPEG compression, a very different algorithm that was never designed for the web. One of the comparisons I built early on was a side-by-side of WebRTC versus noVNC performance. This was actually connecting to the same container, with noVNC running in one process and WebRTC running in the other, both accelerated by a GPU and running side by side, so you can see that not only the pixelation but also the frame rates are much worse with noVNC. The best I was able to get was maybe 15 to 20 frames per second, while with the full hardware-accelerated WebRTC component we were pushing 60-plus frames per second to the browser out of the same container. This is mainly due to the GPU acceleration and the protocol being used. For stuff like video this is fine: you can compress the image quite a bit, and things like artifacts don't bother you that much. noVNC, or VNC in general, also has a lossless mode where you eventually get every single full frame. But this was very eye-opening to me: oh wow, I can now break past that 15-20 frames per second limit I had with some of these other open-source tools. And I could not find another open-source project that would enable this level of performance.
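A back-of-the-envelope calculation shows why raw or hextile-style transports blow up bandwidth where a video codec does not (the H.264 bitrate below is an assumed typical value, not a measurement):

```python
# Uncompressed 1080p60 framebuffer bandwidth vs a typical codec bitrate.
width, height, fps = 1920, 1080, 60
bytes_per_pixel = 3  # RGB24

# raw frame bandwidth, converted to megabits per second
raw_mbps = width * height * bytes_per_pixel * 8 * fps / 1e6

# assumed typical bitrate for 1080p60 H.264 streaming, in Mbit/s
h264_mbps = 8

print(round(raw_mbps))              # 2986 Mbit/s uncompressed
print(round(raw_mbps / h264_mbps))  # 373, i.e. ~373x smaller with a codec
```

Even with JPEG-compressed tiles, VNC sits far closer to the raw end of that range than a motion codec does, which is why full-screen video pushes it into the 50-200 Mbit territory mentioned above.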
At least, not one that wasn't commercial. So I just decided to build it myself. So that's noVNC versus WebRTC. WebRTC is a very different architecture from what you might be used to. Normally there's a client-server model for communicating from a browser to a service, and WebSockets are kind of your only way to do any sort of low-latency direct streaming. You usually have a server behind a firewall; you connect to it with a WebSocket from your client, the browser, which may have its own firewall that lets traffic out but not back in. It's a very simple client-server architecture running over TCP. With WebRTC in a client-server model, you've got a lot more components. You've got these things called peers; they're not really client or server, they could be either. So your peers are the server side, a GStreamer instance running in the cloud, and the browser being the other peer. They connect together using a signaling server. The signaling server basically facilitates the call and connects all the peers together, making sure they have a common place to communicate. This can usually be run over a WebSocket, or even just standard HTTP.
You're basically just sending little messages saying hey, I'm here, and the other guy says hey, I'm here; they exchange connection details through the signaling server. From a firewall-traversal point of view, you have these components called STUN and TURN. These are a set of internet protocols used for traversing networks and firewalls. They use techniques that, I don't know if you want to call them exploits, but they take advantage of some properties of how sockets are created: doing firewall hole punching, exposing ports on their way out, and making them available for services like WebRTC to get through. When you have a fully operational STUN and TURN setup, you can traverse just about any network, regardless of what firewall you're behind. Through STUN and this reflexive-IP approach you can discover ports that are connectable, and if that algorithm doesn't work out, then the TURN relay server is used, where both peers connect and all traffic is relayed through a common point. That's the most reliable way to make a WebRTC connection work through pretty much any firewall, and you need this component because either party could be behind a firewall in a real peer-to-peer video call, so you need something that can bridge that gap and find common ground. Those STUN and TURN candidates are exchanged through the signaling server. Once that whole algorithm runs in your browser and on the server side, the peers, the server and the client, communicate directly with each other over UDP and datagram TLS. You're not going through the signaling server anymore; you may or may not be going through your STUN/TURN infrastructure, but it's a peer-to-peer connection.
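The signaling handshake just described can be sketched in plain Python. A real deployment exchanges these messages over WebSockets or HTTP; here an in-memory queue stands in for the transport, and the SDP and candidate strings are placeholders:

```python
# Minimal sketch of WebRTC-style signaling: two peers exchange an SDP
# offer/answer and ICE candidates through a shared signaling channel.
# The signaling server only relays messages; it never carries media.
from collections import deque

class SignalingServer:
    def __init__(self):
        self.inbox = {"server": deque(), "browser": deque()}

    def send(self, to, msg):
        self.inbox[to].append(msg)

    def recv(self, who):
        return self.inbox[who].popleft()

sig = SignalingServer()

# The server-side peer offers; the browser peer answers.
sig.send("browser", {"type": "offer", "sdp": "v=0 ..."})
offer = sig.recv("browser")
sig.send("server", {"type": "answer", "sdp": "v=0 ..."})
answer = sig.recv("server")

# ICE candidates (the STUN/TURN discovery results) travel the same way;
# once exchanged, media flows peer-to-peer, bypassing this server.
sig.send("browser", {"type": "ice", "candidate": "udp 203.0.113.1 3478"})

print(offer["type"], answer["type"])  # offer answer
```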
So the latency is basically as low as your server Distance from your from your client Um, so because this is kind of a complex system It can be very difficult to debug. There's a i'm leaving out about 10 of the protocols that are used in webRTC But this is kind of like the minimum set of infrastructure that you would need The unique thing about this client server architecture for webRTC Is that everything north of the firewall where the server client server peer is running You generally have control over right you you can decide where the signaling server runs You and you have control of the firewall that's in front of that just like deploying any other web service Uh, you have control of where the stun and turn server live Um, normally you don't have that much control and just like a standard peer-to-peer browser to browser Um, and so because of that you can you can expose certain ports and make it feel very much like a standard client server architecture Uh gstreamer is an open source Uh video streaming transcoding processing engine. 
It's very powerful, it's easy to use, and it's been around for a very long time. It's a great platform, mostly written in C, but it has Python bindings, which I adopted for Selkies. You construct these pipelines in a Python or C application by connecting together what are called elements; those elements flow from left to right in this diagram. What Selkies does is take that X11 application, feed it into the X11 image source element for GStreamer, and send those frames directly to the GPU, where CUDA is used to transform the color space from RGB to YUV 4:2:0, which the encoder can actually use. Then it's hardware-encoded on the same GPU to H.264, which compresses the image significantly and produces a group of pictures; it's now a completely different stream from just scraping a bunch of frames out of your frame buffer. After it's been encoded, it's sent to webrtcbin. This is a GStreamer element that handles all the WebRTC signaling, the negotiation of the connection, the TLS setup for securing it, and packetizing all of your H.264 frames and sending them over the wire. So that's basically the end of the pipeline; the signaling server is a web service that runs alongside this, and the Python app in Selkies basically just sets up the pipeline, wires everything together, and then connects to the signaling server. Selkies was built on Kubernetes, so it has a sidecar architecture by default. I don't know how many of you are familiar with Kubernetes, but there's a concept of a pod, and a pod may have one or more containers; what they have in common is network and storage. So there are three main containers in a Selkies pod.
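The pipeline stages described above can be written out in GStreamer's gst-launch notation. The element names below follow GStreamer's NVIDIA (nvcodec) plugins, but treat this as an illustrative sketch of the stages, not Selkies' exact pipeline:

```python
# A gst-launch-style description of the capture -> GPU encode -> WebRTC
# pipeline. Element names are from GStreamer's nvcodec/WebRTC plugins;
# the exact Selkies pipeline may differ.
stages = [
    "ximagesrc",    # capture frames from the X11 display
    "cudaupload",   # move raw RGB frames into GPU memory
    "cudaconvert",  # GPU colorspace conversion, RGB -> YUV 4:2:0
    "nvh264enc",    # hardware H.264 encode on the same GPU
    "rtph264pay",   # packetize H.264 into RTP
    "webrtcbin",    # WebRTC: ICE, DTLS, and signaling hooks
]
pipeline = " ! ".join(stages)
print(pipeline)
```

In the actual application these elements are created and linked through the Python bindings rather than a launch string, but the left-to-right flow is the same.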
The first one is your actual application; this is, you know, Rocket League or your IDE or something. That's one container, and it doesn't have to have any GStreamer components or dependencies; it doesn't even have to have an X11 server. It just has to have the DISPLAY variable set correctly and it'll run. Then you have the actual X11 server, which is connected to a GPU, so you can run hardware-accelerated workloads like games and game editors. And then GStreamer sits alongside in its own sidecar. GStreamer is pretty big and has a lot of dependencies, and the reason I took the sidecar approach is that if you required everyone who wanted to stream an application to install GStreamer and have a properly configured X11 server, every single workload would be different and would have to fight this dependency problem. The sidecar method cleans that up, and you can run basically any X11 application without modification: you just land it in the environment, and the sidecars are already there taking care of the actual streaming infrastructure. The other thing this allows for is GPU sharing. Not only can the containers share the GPU, so the GStreamer app can use it to accelerate the stream and the X11 server can use it to actually run the workload, but you can also share it with other pods on the node using this method. I didn't mention this, but the importance of the GPU is in encoding; that GStreamer pipeline is what lets you reduce the latency. Not only can your local device decode the stream at a hardware-accelerated rate, but you can encode the stream the same way. A lot of the time noVNC and VNC, or even RDP, struggle because they can't grab the frames and compress them fast enough; the CPU is too slow. By offloading that to the GPU, you can really achieve those higher frame rates, because it's clocking out frames in just a few milliseconds rather than crunching on each one for too long. So you pair those together, remote hardware-accelerated encoding and local hardware decoding, and that's how you get those full frame rates. The third problem is orchestration: how do you actually deliver environments, streaming workloads, to the browser in a way that's scalable and easy to manage? This is where the Selkies broker comes into play. There are problems like pod templating; there are tools out there like Helm, Kustomize, maybe even just doing some Jinja templating, but how do you create those pods for each individual user in a way that's tailored just for them? There's the lifecycle of your pod: when I say launch, when I say shut down. Obviously you don't want your end user's pod to run forever, or it'll never shut down and it'll cost too much, so that whole lifecycle of create, delete, update can be problematic. And then per-user routing. This was a big problem I found building the solution: how do I get an individual, browser-identified user into a single pod on Kubernetes? Normally when you're running a service on Kubernetes you put a load balancer in front of it and traffic gets spread between all the different pods, randomly or by some algorithm. The idea of sending it to one pod for one user was pretty foreign. These are some of the problems we have, and the Selkies broker seeks to solve a lot of them. Once we were able to prove, one, that we can run these workloads in a Docker image, and two, that we can run them on Kubernetes, how do we orchestrate those in a meaningful way?
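The three-container layout described above can be sketched as a pod spec. The container and image names here are illustrative placeholders, not Selkies' actual manifests:

```python
# Sketch of a Selkies-style three-container pod. All containers in a
# pod share network and volumes, so the sidecars can reach the X11
# socket without any changes to the application itself.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "desktop-user1"},
    "spec": {
        "containers": [
            {   # 1. the user's actual application; only needs DISPLAY set
                "name": "app",
                "image": "example/my-app:latest",
                "env": [{"name": "DISPLAY", "value": ":0"}],
            },
            {   # 2. the X11 server, attached to the GPU
                "name": "xserver",
                "image": "example/xserver:latest",
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            },
            {   # 3. the GStreamer/WebRTC streaming sidecar
                "name": "gst-webrtc",
                "image": "example/selkies-gstreamer:latest",
            },
        ],
        "volumes": [{"name": "x11-socket", "emptyDir": {}}],
    },
}
print(len(pod["spec"]["containers"]))  # 3
```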
So the Selkies broker creates an abstraction in the form of an operator, and this operator is an abstraction over that custom workload. You have some manifests written in YAML that are templates, stored in a ConfigMap on your cluster; these are prescriptive templates for creating those sidecars in that pod architecture, tying together all the GPUs and all the inputs and everything. Then you have, on the bottom left here, the BrokerAppConfig. This is the custom resource that says: this is my application. It speaks application: here's a name, a description, an icon, the image it uses, and any input parameters you want to pass to the templates. So the BrokerAppConfig is that abstraction on top of the Kubernetes machinery, to make it feel more familiar for running workloads. Then the controller reads those manifests, reads the custom resource, and owns the outputs, which are a StatefulSet (basically a pod controller), a Service, and a VirtualService. VirtualServices are an Istio component; they're basically virtual load balancers, software load balancers that run in the Istio proxy, and you can create and update them instantly. It's not like reconfiguring a load balancer on a cloud provider, where you may add a route or a port and it takes 10, 15, 20 minutes to update because it's changing a bunch of infrastructure on Amazon or Google. With a VirtualService, you can think of it as just updating an nginx config and reloading the process, except it's now a Kubernetes resource, an abstraction on top of that load-balancer language. So that VirtualService in the Kubernetes world gives you programmable load-balancing routes. The VirtualService says: if a user comes in with this email address, which is templated in by the controller, then send them to this pod, this StatefulSet. And because the controller knows who the individual users are, since it owns the identity as well, and knows where all the pods are, since it created them, it can tie together the entire routing and program the Istio control plane to do it. So from a templating and lifecycle-management point of view, the Selkies operator solves those problems, and the per-user routing is solved by Istio. Those are the major components of the orchestration in Selkies. So at the end of the day, what I learned was basically that remote apps in a browser are very enabling; it's liberating from a developer point of view, and for being productive in a remote world in general. And of the remote apps we showed earlier, IDEs are basically the main one I wanted to focus on. So after Google, which I left about a year, a year and a half ago, I joined a company called itopia. itopia specialized in VDI and remote desktops, mostly orchestrating Windows on Google Cloud, and I saw a fit where we could productionize the Selkies infrastructure for running developer environments: how can I create this developer utopia at itopia to solve a lot of these problems I'd found? This really reimagines access to this software: no more company hardware, you bring your own device; you don't need antivirus stuff because you're running in some container somewhere that gets deleted every time you shut down; you don't need VPNs because you've got web-based SSO; you can work from anywhere. It really becomes a productivity dream. Here we go.
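The per-user routing described above can be sketched as an Istio VirtualService. The field layout follows Istio's networking API, while the identity header, hostname, and service names are illustrative assumptions, not the broker's actual output:

```python
# Sketch of per-user routing: match an identity header injected at the
# edge (here a hypothetical x-user-email) and route that user to the
# Service backing their personal pod/StatefulSet.
def virtual_service(email, service):
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"route-{service}"},
        "spec": {
            "hosts": ["desktop.example.com"],
            "http": [{
                "match": [{"headers": {"x-user-email": {"exact": email}}}],
                "route": [{"destination": {"host": service}}],
            }],
        },
    }

vs = virtual_service("dev@example.com", "desktop-user1")
print(vs["spec"]["http"][0]["route"][0]["destination"]["host"])
```

Because the controller creates one of these per user, updating a route is just a change to the Istio proxy's in-memory config, which is why it takes effect in seconds rather than the minutes a cloud load balancer needs.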
Yeah, so there's this movement, or I don't know if you'd call it a movement, DevX, developer experience, and things we can do to help improve it. I believe containers are a major part of DevX, and it goes all the way back to the developer environment. Containers allow for consistent development environments. Your buddy may have a completely different set of dependencies than you do, so it only works on his machine; but if everyone's running in the same containerized environment, created from the same Dockerfile, specific to the project you're working on, then all your environments are very consistent. You can pre-install the dependencies and tools your team uses, and you can basically DevOps your whole environment as code: everything is a Dockerfile, put it in your CI/CD, run it, and everyone's got the same environment. You can onboard users much, much quicker, because you just authorize them to a platform and they're running. So the first productization of Selkies is itopia Spaces. This is something we launched late last year, which takes the developer use case of Selkies and turns it into a remote development experience for enterprises. It incorporates all the security, the endpoint isolation, a containerized approach to developer environments, and it's a fully managed service: we run the Kubernetes, you don't have to know how to run Kubernetes. We put an admin portal in front of it.
So it's very easy to configure and onboard your developers, and we've made it very customizable. This is the first use case of Selkies, and we're looking to do even more, especially on the WebRTC front. At the end of the day, we want to really meet developers where they are by providing these remote development environments with server-side opinions, a DevX stack if you will. You have the access layer, which is solved by things like web browsers; this is where you get portability, data-loss protection, and SSO. Then you have your environment layer: your IDEs, compilers, libraries. What you get from this layer is IDE flexibility; you can run any IDE in this environment because it's just sending that window to your browser. You've got your tools, things like code review, CI/CD, unit testing, and you can increase your reliability and supply-chain integrity just by having everything be very consistent, with everyone running the same toolchain to verify their commits end to end. And then you've got your system integration layer, which is your IDE extensions and commit hooks. This is something I'm looking to do next on Spaces: tying the browser in at a deeper level with the container, so that whenever you commit something, it triggers a bunch of other actions. This is where you get a lot of that Google-like integration and experience. So no longer does it work on my machine; it actually works on everyone's machine. And that's the end of my talk. Thanks, guys, for hanging out. We've got maybe five minutes or so, and I'll hang out for a while after. We do have a booth down in the expo hall if you want to come hang out and chat; otherwise I'm here. I'll go ahead and answer a few questions before we shut it down. Yeah: OpenGL. The question was, can I run all my Steam games on Selkies?
You can run some. Steam has native Linux games that may use OpenGL or Vulkan, and those are great. Steam also has a compatibility layer, which they built before the Steam Deck and shipped with it, called Proton, which is basically Wine, but for games. You guys remember Wine? Wine Is Not an Emulator, for running Windows applications on Linux; Proton is just that, and it works really well, though it doesn't have full compatibility. Proton does work on Selkies, an earlier version anyway; I haven't gotten the latest version to work yet, but it's definitely there. I'm trying to get more involvement on the Selkies repos, so collaboration and pull requests are definitely welcome. But yeah, you can definitely run many Steam games; I was running Elden Ring a few weeks ago. I'm somewhat familiar with DCV. I know DCV has a web client, but what I do know is that they have their own protocol; it's a custom protocol, it's not H.264, it's not JPEG, things like that, and there's a web client version of it, and it works pretty well. I don't know if it can do the full-frame-rate GPU stuff, but it is a custom thing. It's also not open source, and so I was like, I don't even want to touch that. For building this and enabling people to use it and making it shareable and collaborative, it had to be open source. But yeah, it is very similar; I know DCV is what AWS AppStream uses. It's very powerful; it's similar to things like Teradici's PCoIP. PCoIP is a great protocol; traditionally you had to install a client to really get the full performance. PCoIP does support GPUs and lossless, but you've got to have that downloadable client, which introduces problems with access and orchestration. But it is a great protocol.
Yeah, so drag and drop in a multi-window situation: that needs deeper browser integration. It is possible to do, because the browser does have concepts like drag and drop; you could listen for the drag-and-drop event and then transport it to the other window, or pass the events along. Selkies doesn't do that today, and Xpra doesn't really do that today, but it's something I'm definitely interested in doing. Right now, especially with Xpra, it's window forwarding, so you get those windows and the events are all there; it's deeply hooked into the X11 libraries. But yeah, that's something that's very interesting. Did you have another question? Yeah, I did, yeah. There's even some Guacamole stuff in the Selkies code somewhere, but Guacamole is another protocol. It's very similar to VNC, especially in terms of frame rates. It's a slightly different, text-based protocol, but it has similar encoding mechanisms. It's not as fast, though; I had trouble getting more than 20 frames per second out of it. But it works pretty well, and it's a little more browser friendly, because, like I said, it has a text-based protocol that's meant to go over a WebSocket. But it was heavy; a lot of it is built in Java. There is a project called guacamole-lite that speaks the guacd protocol directly to your browser and then renders it in a canvas, and that actually worked pretty well. But again, as soon as you throw a GPU or something at it, it falls apart. So I'm not a huge fan of Guacamole for that kind of stuff, but it was pretty good. Plus, the idea of having a full desktop: from a developer point of view, I'm really trying to get people to realize, this is your application. This is the browser.
This is the IDE, right? Focus on your code. If you need to go to a video conference call or chat with someone, just go to a different window; that should be handled on the local device, and you keep your code, you keep your environment. It's just that one app, right? So this idea of app streaming is kind of what I think is the new thin client. Do you have a question back here? Oh yeah, a question about WebRTC versus QUIC. Great question. So QUIC is a protocol that's mostly meant for multiplexing streams, and it's similar to WebRTC; in fact, you can run QUIC over WebRTC. I think QUIC is mostly the protocol aspect: how you interleave messages and streams and then decode them on the other side. It's an integral component for peer-to-peer low latency, because you have multiplexed streams and you only need one connection. And if that one connection is a datagram (UDP) connection, you're going to have extremely low latency and a very, very responsive API. The Stadia team uses QUIC for the underlying protocol, and there's also a newer technology that's going to be using QUIC, called WebTransport. WebTransport is a new web protocol that eliminates a lot of the complexity in WebRTC: it takes away the signaling component, takes away the STUN and TURN stuff, and allows more direct connections. You can run QUIC on that, and from what I've seen, that's like the future of WebRTC. And go ahead. It's a wonderful question: do I want to make this more platform agnostic, or is it always going to be GKE dependent? I built this when I was at Google, and I'm a huge Google Cloud Platform fan, but yeah, right now the orchestration layer, the broker, all the Terraform, all the identity stuff, is built around GKE. And that's mostly done as a "here's an example of best practice" that actually works.
You can just deploy it right now. I do definitely want to bring it to other platforms. I've started some of that work by bringing the GStreamer component out of the main repo, and now it's its own thing. So I'm starting to work on Helm charts so that you can deploy this on any Kubernetes cluster, and the biggest change there is going to be identity and ingress. That was a huge problem to solve on Google Cloud, and I realized that in order to solve it for everyone, it's going to be too hard. So my plan right now is to put it out there and say: you bring your own ingress, but here's the operator and here are the containers that you need to run it on any Kubernetes cluster. I'm trying to finish that out before the end of the year; I have a couple of collaborators on it, but it's coming. Yeah, good question. Yep. Okay, so I think the question was around whether you still need local endpoint protection when you're pixel streaming. The idea with pixel streaming is that you've isolated your browser from the rest of the device, because there is no content going into your browser other than the pixels of what you're working with. So if you're working with, say, an ERP system remotely, or code, then as long as your network is encrypted and you're using your SSO to log in, you don't have to worry so much about endpoint protection. Now, if you're not on a secure Wi-Fi network and somebody is doing some kind of certificate man-in-the-middle attack, then yes, you're still going to have that problem, and you will still need some kind of endpoint protection. But at that point, almost any solution is going to be vulnerable to things like that. Yeah. Yep. So even then, if you had a bunch of malware on your Windows machine and your browser was open, it would be like watching a YouTube video, right? That malware usually isn't going and attacking YouTube's servers; maybe it's taking your information or doing things like that. But yeah, so you still need some kind of endpoint protection for securing your SSO, the identity component of logging in, and that would be the same for getting to any corporate asset. I think the main thing with endpoint isolation is that it's usually centered around IP exfiltration and copy-paste redirection. Spaces and the developer use case are great for onboarding remote teams, you know, like India and Russia, places where developers and networks aren't as trusted, but you want to give them an environment that's sandboxed: they do their work, but they can't take it out and take your IP with them. Any other questions? Okay, I'll be at the booth the rest of the day and a little bit tomorrow, so feel free to stop by. I'd love to chat more about this project and our product. So thanks again for coming, guys. Appreciate it.

On the slide there, with some current Splunk Trust members at the time, I had posted a question and then my own answer to that question. The first statement was "if the data set permits," and they actually believed my answer, because, you know, there are so many unspoken caveats to "if the data set permits"; it might mean you have to understand your data before you can do anything. So of course, the answer to every question is "if the data set permits." Once you get past that caveat, from there it's all about the commands you want to apply to your data, whether you need to extract fields out of it or move on to normalizing it; the same process applies, obviously. You know, maybe there's an app or a TA, right, so you don't have to do it all yourself.
TAs are the next thing you deploy to your search head, and they build all your field extractions for you. What happens all the time, though, is that vendors and engineers will change their schema, and then everything you built no longer works. Also, I pretty much don't ever trust anybody else, so I might go and install a TA, but I will go through basically the exact same process to verify that their work is the work that I would have done. There was a TA that was parsing network traffic in what looked like the reverse order, so all of your inbound traffic looked like outbound at first glance. So, you know, even if there is an app, even if somebody does this for you, you might still want to go and do it again, and you might have to do it later anyway. So here's the process that I use for any data. Data preparation: this is going to be like 90 percent of the work. Publishing your base search comes once you've gotten all of your data sorted out and you've identified the most important fields; then you just reuse that base search in all of your analytics, right? So maybe the first 10 lines of the search are the same across an entire dashboard, and only the last three lines do something different to turn it into each analytic product. Two: if you have these base searches published, then you aren't the only one that has to do all of the data analytic work, right?
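The base-search pattern being described can be sketched in SPL; the index, sourcetype, and field names below are hypothetical stand-ins, not the talk's actual data:

```spl
index=my_index sourcetype=my:endpoint:events
| fields - punct date_* timestartpos timeendpos linecount splunk_server
| rename ComputerName AS host, EventCode AS signature_id
| table _time host signature_id signature user
```

A dashboard panel then reuses those lines verbatim and appends only its own tail, for example `| timechart count BY signature_id`, so the data preparation is done once and each analytic is just a few extra lines.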
People don't have to come to you every time they want a dashboard with pie charts and bar charts on it. You can say: here's your base search, go build your own dashboard, and I'm freed up for other stuff. So that's one of the other benefits: if you get the data preparation in your base search done, you can outsource the analytics built on top of it. No matter what, as soon as you publish a dashboard, somebody says, oh, that's great, but now that I see it, can I split the data by another dimension? With a base search, they can do that part on their own, and the analytics is more democratized.

So this is the method, and we'll go through each of these steps fairly quickly. First, in the data preparation stage, you want to reduce and remove all your noise. There are two main pieces to that. There's Splunk noise, just the metadata that Splunk injects, which you don't need for analytics, so you want to remove it. There's also event noise: there are always extra fields you don't really need, that aren't helpful, or that are duplicates; if you've got an app installed, you might have the original field and then the normalized CIM field. So you just want to get rid of the noise. I found that really helpful, because it's like a pile of dirty laundry: you pull out the things you don't want and throw them in the corner, and eventually all that's left is the stuff you do want. That's what we're doing here. Then, once you've got the fields you do want, you start to normalize them: the field names, and then also the field values. And really, these are the most basic commands you're going to use; that's like 90% of everything you'll ever need to do anything.

There are a couple of options once you've got your search and all of your data inline: you can put it into a configuration file, or you can just save the base search. Either way, you have your base search, and then your analytic. When I'm getting set up, I always run in verbose mode (you can see that down in the right-hand corner there), and I set a static time range. You don't want a rolling window, with the data shifting underneath you while you're trying to perform your normalization and your analysis. Later on you'll spot check with wider ranges, but for now you want repeatable, reliable results, so that when you make a change in your search and look at the output, you know it wasn't the data that changed; it was your change.

Depending on what I'm working with, I'll also pull up the relevant data models. In this case (and we'll see when we get to the demo, there's some ransomware data that Splunk publishes that you can play with, which is what I was using), it's the Intrusion Detection data model that I want to map my data to. And of course the admin guide too, because there will be things like event type 1, and you need the admin guide to translate what that means; you can build those translations in Splunk, but not without it.

All right, moving on to Splunk noise. When I start with a brand-new data set, I'm going to assume you already know where your data lives, and by that I mean your index and your source type. Somebody had to ingest the data and put it somewhere; maybe that was you. I maybe should have included a slide on how to find that, but let's just say we know the index and the source type for your data. This is the first line that I throw into any search: I just get rid of my Splunk noise. Then we want to focus on the fields that are present in 100% of the events we're working with. The reason for that: if you search, let's say, `action=*`, that will return all of the events that have an action field. But if 50% of your events don't have the action field, those events will get completely dropped, and maybe that wasn't intentional; maybe you were thinking it would also include the ones where the value is empty, but it won't return an event that has no action field at all. Sometimes people have a hard time figuring that out at first. So I'll show you in the demo where you can select the fields that are in 100% of your data; that way it's guaranteed that you can use them in your analytics without arbitrarily losing the data you just prepared. Then I'm going to table all of those fields out, because I need to see them. And then you see I've got this `fields -` line; the dirty laundry pile that we're throwing into the corner of the room, that's `fields -`. So there's that first `fields -` line, and then we have another `fields -` here, where I figured out that a field was on my noise list. Well, how did I figure that out? I had to go look at my data. So you almost have to know your data before you can get to know your data, before you can figure out what to get rid of. That's why I table it all out, and I can start to see that I've got a bunch I don't need; I don't need the same version string in every event, so I can throw those out. The other thing, with the same command: I like to keep my piles segmented. Maybe I have a black pile of dirty laundry, and then a red pile, and a blue pile, because later I might say, you know what, let me go look at the red pile again, and I can find it more easily this way. And then, once you've identified everything (in the previous slide I had the second `fields -`, right, both `fields -` lines applied): am I sure they're trash?
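The Splunk-noise and trash-pile steps described above might look roughly like this in SPL; the index, sourcetype, and the fields in the second pile are hypothetical examples:

```spl
index=my_index sourcetype=my:endpoint:events
| fields - punct date_* timestartpos timeendpos linecount splunk_server
| fields - Version Keywords OpCode
| table *
```

The first `fields -` is the Splunk metadata pile and the second is the event-noise pile, kept as separate lines so each pile can be revisited on its own; wildcards such as `date_*` are valid in `fields`. Swapping the second `fields -` for a `table` of the same fields is one way to double-check a pile before discarding it.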
I would usually go back and double check all my work. So I flipped that second `fields -` to a `table`, which then displays only those fields, and then I ran a quick `stats values(*) AS *`, which gives me a quick visualization of all of their values. So I can say: yeah, for sure, that event ID duplicate can go for good; Keyword is completely useless, because I don't know what it means and it's the exact same value in every single event, so it's really not giving me information; and I know the source type already, because that's my source type. One caution: a high-cardinality field can have thousands of values, so use this trick with care. Then you want to start to identify the important fields. So you can see now I've got three piles; I've still got my table command with the fields that are 100% prevalent, and I wanted to start to work with some of my most important fields. From those same fields: which ones do I want to trash, and which need to be normalized in some way? And I'm just going to use, again, our basic fields-and-table stack. We're going to find the source fields that have duplicate or similar values to make our trash piles, and figure out which ones we want to keep. While I'm working, the search is messy, renames and evals piling up; don't worry about it, right? We're just doing data analysis right now, so none of that really matters; we'll clean it up at the end. I also made a regular expression. Are you guys okay with regular expressions? If you're not, there's a really excellent book; it's like a quarter inch thick, 15 bucks on Amazon, I think. I only read the first half of it, so you can get through like 30 pages and have most of the regex you'll need. Then you start to normalize. You can see right here at the bottom right, I decided to rename one of those fields to dest.
And then I also did a rex to extract dest_nt_domain, which is just the domain and not the full FQDN. And then I made a new trash pile, right, because I no longer need the raw field it came from. And again I'm going to spot check, and then I'm going to keep dest and dest_nt_host. Here's the thing: I know for a fact that my field was present in about 100% of my data, because that's where I'm starting, 100% of all of my fields. So if I perform a rex, and then I do a `where isnull()` on the field I created, and that comes back empty, I know I'm still preserving 100%. Now, a CIM schema is a CIM schema, right, but why did I use dest instead of device? It turns out, after I worked with this data set for about four days building this, I ended up not doing that at the end; I went back to the original device and destination. On day one, device looked like this weird intermediary thing, but by the time I got to the end, I understood, with the way that the traffic and the data were working in this data set, that I actually needed the intermediary and the two directions. So it's not exactly cut and dried, and everybody is going to do things a little bit differently; fundamentally, though, you should come up with the same results, and at the end have roughly the same fields, normalized.

We walk through all of this again, and then we start to look at fields with 90% coverage, and then 50% coverage, because a lot of times those 100% coverage fields are things like the hostname and the IP, which might not carry the real contextual information about what is occurring in the data. So you almost always have to go down to 90%, probably 50%. Anything less than 50%, and you might actually be building a very specific use case, as opposed to a more generic base search that's going to cover your 80%.

So here we've got fields, fields, fields, tables, tables. At the very end you clean that up: you actually don't need the `fields -` lines anymore, because we already know what's trash. My table command now has all of the things that I've normalized and want to keep, and I've just got my evals and regexes and my renames above it. The way I like to do it, when I table out the very last table command, is to start with my most important fields. For me, that's always going to be a timestamp, the system generating the event, the type of event, and then an outcome. So in this case you can see the time, the host, the signature_id, which would be something like event 95, and then the signature, which is a human-readable version of whatever that event is; then direction, user, and at the very end things like command line and file count. I think I was getting all the way down to the fields with 1% event coverage, as long as I had enough room on my screen to show them; I really don't like scrolling left and right. As a bonus, if you're building a dashboard, people really don't like scrolling up and down either, so I try to fit my panels on screen, a few in a row.

Then again I'll validate, and this is where you expand beyond the static range we've been working in this whole time. How that goes is going to depend on your data set: you might run a search over 7 days and it comes right back, or you might run a search over 15 seconds that seems to take forever. Things like firewall logs, or, I don't know, the mail gateway, that stuff has so many events that it's hardly a 30-second search; but if it's AV, you should be able to run 7 days globally and have a ton of data come right in. So when you spot check, when you think your base search is done: expand your time range, and if you used event sampling to make your search run faster while you were working, remove the sampling so you get the complete data set. Then check again: are all your regexes still working, are the filters behaving as expected?

Now, about all the stuff in the middle: if you put it into your props.conf, then you wouldn't need all the stuff in the middle, because the system would do all of the work. Everything from the rename down to the table is what you could push into configuration. But your more advanced users might actually want to see all of it; maybe they don't trust you, right? Like I don't trust anybody else, and maybe they don't trust me either. And if they want to start learning to do this work themselves, they've got some cheat-sheet code, their own private Stack Overflow. In a Splunk user group in San Diego, one of the women there does this, but what she does is she puts these into macros. That's smart, right, because then people just know: oh, the syslog base search macro, or the Palo Alto base search macro. I thought that was really smart, and I stole it from her; what I had been doing was just bookmarking the hyperlink. But then that's where you can actually start to do some analytics, right, on top of your base search. So with this one here, we can write a first practice search that will show us which system was getting ransomware at the time, and it's trivial to turn that into an alert that would flag that system now. And that's where you start to do your detection engineering.

So here are my resource slides. Again, if you go to the .conf website, the conference talks are online; this talk is from last year, it is recorded, and they have the PDF of it as well, where these hyperlinks are clickable. Otherwise, you can see the picture and google this stuff; the whole conference is recorded there too. So maybe let's check the time, in case anybody wants to leave, and then we can do some questions and a demo of whichever of this you guys want. Questions? Yes, we can show that in the demo for sure. Should we just do a little bit of a demo? This data set comes from Splunk's Boss of the SOC competition, where you play the blue team against the attack in the data. And you know what else is really cool? This is actually free. You can get the challenge questions along with it, which is pretty neat. So this is the actual data set project, and there's a workshop that will actually step you through some of the exercises as well. It's super big, but we actually do need this data set. You should pretty much never, ever search over All Time, but I'm going to use it right now, only because this is a demo and I just want to show you how I go about it: I'm going to be iterating this command over and over and over again, and I don't want to sit here and wait for it, but I do need a good, representative sample of my data. You know, if I'm only working with 5% of my data, that undermines all the work I do. But this looks promising, and it will run more quickly. And again, I'm running in verbose mode. What that does is give me all the native field extractions over here; if we run it, I'll just show you. In training they'll tell you to use fast mode, because they already know the data and they're trying to teach you how to use the commands, but then you don't get all those field extractions, which is fine if you already know what your fields are. That was one of the things that always left me feeling confused when I first started taking the Splunk training; I almost never use fast mode for this work. So: we run in verbose mode, we've got our static time range, and there's a whole ton of fields over here. First thing, you've got this filter here, right? If we go to all fields, you can see which ones you'd want to work with, probably 30 or so, and then you do have to go through them.

So now, why do I do that? Because when we table out these fields, we can start to see what's interesting and not interesting, instead of squinting at raw events. So we table the fields out, and we can start to see some things right off the bat: four timestamp-looking fields in two different formats, some more stuff over here; oh look, there's event 9, you've got your IDs. So we've got four fields that are suspect, and we can start to get rid of some of them. Every time I set this up for endpoint stuff, I happen to know off the top of my head that signature_id and signature are the normalized fields I'll be mapping to, so those go into a new pile. Even though this is going to look like a trash pile, it's not; I want to keep it, but it's separate so I know I want to keep it and I don't have to look at it anymore, because I've already seen it and accepted it. So I'm going to do `fields` on it, knowing that this is eventually the final approved table list. What's left is maybe the hostname question: I like the short version, because it doesn't take up a lot of real estate, but it's also going to depend on your environment. If you work in an environment with multiple domains, you might actually need the additional context of the domain along with the asset; if you're in a single, stable domain, I personally could go with the short version. So do I need two fields for this, or do I just want the domain? Maybe I keep the hostname; maybe I want two fields, hostname and domain; or maybe I want one field, which is the FQDN. I decided I want two fields, so these are my two lines, and eventually I'll have my eval to make the hostname uppercase. Let's see what we've got.

We've started to build our search here, and, just for our own brains, we've kept all of our trash piles and everything else visible. Oh, you know what, since this is Windows data, I maybe should have picked a different data set to explain this one: there's a user account on the event, but it's the account that's collecting the event, not the account that's performing the activity. I think it's event 18 or 19 or 20 where that distinction shows up, and I forget which, but in this particular data it's not the hand on the mouse performing the action, it's the collector. Because of that, it's a field that I don't need cluttering things up. If this were a different data set, still Windows but not Sysmon, I might keep it. Right, and here we've got two timestamp fields. When you do a `stats values(*)`, you can find the field that Splunk actually uses to mark the timestamp for an event. The event channel is the same as our source type, so we get rid of that. We like the event description, though; that one is actually helpful. And even if this were a narrow, single-purpose search, you could literally put the description in the title of your chart and then you don't need the field at all. Basically, we just do this, run it again, and then we come over here and start to look at all the fields at 90% coverage, and work through that again. Eventually what we get to is that device and signature pair that we like, and here is the actual base search. You can see I did filter something out: I don't care about the system monitoring itself; that's data that I just don't need. I've got my renames here, so I renamed ImageLoaded to the standard normalized name; I don't even think it's actually the same field in the schema, but at the very least it's lower snake case now. Otherwise it would drive me completely crazy: even if image_loaded weren't the same field, leaving it in camel case up top with all the lower-snake-case fields down in the table commands. Then we remove our sampling and run over the full data. It turned out that in the hundred-some, two hundred thousand events, four did not have a process field. Once I was done with all my normalization, I went and checked every field, every single one: where is it null? And what I found were four events out of 200,000 that did not have a process; they were from boot time, so each did have a process, but it was in a different field, because the system was monitoring itself. I just bring that up to show the level of spot-checking; I'm sure nobody else would have cared about those four events. So now let's try to find something that would be characteristic of ransomware. What do we have, five, maybe 10 minutes? Maybe we can sum up with something. Let's say ransomware: what does it do?
It encrypts every file on your system, and I think it triggers a unique process for each file when it starts to encrypt them. So I think counting processes by host is the search with which we would perform the detection. This would be our base search, the one we save and give our users, and we can reuse it any number of times. All we have to do is throw a threshold on top, and now we have a ransomware detection: a host spawning an unusual number of distinct processes in one hour is something fishy. It could or couldn't be ransomware, but we could easily schedule this to run every hour. Of course, if you've got 20,000 systems in your environment, every hour you would get 20,000 results, so we need some kind of threshold, and here we can set that to 4 so the alert only triggers above it. Really, we need a larger data set in order to pick an appropriate threshold; right now we're just looking at 3 hosts in an environment. With more data you would be able to see what's normal; there might be servers, you know, web servers are really busy, so they might need their own threshold, and maybe we also need to take accounts into account. Do we have a chart?
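Following that reasoning (a distinct process per encrypted file), here is a sketch of the detection; the index, sourcetype, the Sysmon process-creation filter, and the threshold of 4 are all illustrative:

```spl
index=my_index sourcetype=my:sysmon:events signature_id=1
| bin _time span=1h
| stats dc(process) AS process_count BY _time host
| where process_count > 4
```

Scheduled hourly, every surviving row is a host whose process churn exceeded the threshold in that hour; picking the threshold itself is the judgment call, and it needs a larger data set to tune.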
So maybe a chart is the one thing we add, maybe by time of day, so that we can look at it interactively. When we think about security, detection engineering is rarely about something that announces, hey, I'm bad; it's normal activities that are just abnormal, and that's the hard part of figuring it out. No matter what search you're writing, there's always some judgment call in it: what's the appropriate threshold? You don't want to get blasted with a thousand alerts a day. Right, I think I'm good; about one last Splunk slide. When I was learning, there was a lot of noise, and it was just too easy to do something similar. I'll be around; I'll be in the hotel and probably at the bar, so if you have other questions, come find me there. Perfect. All right.
Okay, so it's 6pm, I guess we might as well get started. Welcome to Live Patching. I'm Sarah Newman. I want to start this talk with some disclaimers: I'm very much not the expert on live patching. I'm somebody who was running a virtual private server company and decided that live patching seemed like a pretty useful technology, so I decided to try it out, and it seemed to work. I would say that incorrect live patches can cause crashes and data loss, so if you're going to try anything from this presentation, you probably want to try it in a test environment first, not in production. Don't do what I did and crash a server because you didn't do adequate testing in advance. Alright, agenda: I'm going to start with some terminology, then why live patching, then some types of live patching, and then the bulk of this presentation is going to be about building live patches. Terminology first, because I'm not sure where everybody in the audience is starting from; this is all probably pretty basic for most of you. A hypervisor is a system that runs virtual machines; common examples are KVM and VMware. CentOS and Ubuntu are Linux distributions. Binutils is a suite of utilities that includes readelf and objdump, which can be used to look at program sections. An executable on disk will include a section for machine instructions, called .text, and it will include something for initialized data, in the .data section. With newer versions of GCC there are also the compiler options -fdata-sections and -ffunction-sections, which allow your functions and your data to go into their own sections, which happens to be handy when you're building live patches. Then finally, microcode is something I
would describe as firmware that goes on a microprocessor, such as an Intel or AMD CPU. A lot of the time when there's a security update, often related to side-channel attacks, there might be a new version of microcode, and you might have a live patch that depends on it. Next I want to talk about why live patching. Problem statement: I want to apply a software update to a kernel or a hypervisor, or in some cases a core user space program, and I want to do this with no restarts or reboots required. To motivate why live patching, I'd like to talk briefly about some of the alternatives. Reboot is an obvious one for applying a software update; in the case of a database server, unless you attended one of the previous presentations, this might take on the order of hours. When I was running my virtual private server company, it might take up to an hour to bring down the machines, reboot, and bring the virtual machines back up. Kexec is a way to avoid the hardware part of that reboot, but you still have the overall time associated with the reboot. Redundant services, behind a load balancer or some kind of high-availability solution, allow you to perform a software update on one of the systems without any downtime for your customers. And then finally, in the case of virtual machines only, there's live migration: you have your unpatched hypervisor running multiple virtual machines, and your patched hypervisor that maybe doesn't have anything on it; you use the live migration protocol to copy over the virtual machines one by one, and when your old machine is evacuated, you can bring it down for the software update. Given these alternatives, why would I look at live patching? At least in the case of virtual machines, not all operating systems are live-migration capable, so you'll try to live migrate them and end up having to perform a reboot instead. Live migration is not always as
well tested, so you can have bugs that occur after the migration, and then you have to reboot the virtual machine anyway. And finally, live patching is very fast to apply compared to pretty much all the other options I mentioned. If it takes any effective time at all, we're talking on the order of probably tens of milliseconds, which is much faster than doing a reboot. Redundant services would obviously be comparable, but you can perform live patching without loss of redundancy. Okay, any questions before I continue? All right. Types of live patching, and by this I just mean, conceptually, how I could perform live patching. First there's edit-in-place, where I could overwrite an existing function with a new one. Generally, when I talk about doing a live patch, I'm talking about patching a function; I'm not talking about patching individual bytes or a module. So in this case I have a function here that I want to overwrite with this new one, and if the new one happens to be the same size or smaller, I can just overwrite it in memory. I haven't found any examples of anyone doing edit-in-place that I'm aware of. Another possibility, in theory, would be splicing: if you could find all the callers of a function, you could edit them to point to the new function. You do have a problem with function pointers that point to the original memory; trying to find all those function pointers and edit them might be a challenge. Again, I'm not aware of any examples of somebody trying to use splicing for live patching. The final method, which is the method that people actually use, is a trampoline, where conceptually you're just inserting a jump at the beginning of the original function. In Linux this uses, I believe, the ftrace framework. You then have the advantage that your new function can be any size. This does
have the disadvantage, compared to the other approaches, of some overhead, but it's fairly minimal, and as far as I'm aware, all of the live patching technologies use it. As somebody who built and applied live patches in production, the bulk of this talk is going to be about building live patches, and in this case, all the debugging that went along with building them. There are a number of different implementations, and if you come out of this talk interested in live patching, but it's not your dream to be a kernel engineer, I would highly recommend you talk to one of these companies about a live patching solution for your system. I will specifically be talking about Xen and also kpatch, because those both had pretty good open source implementations, or at least open source implementations I could find. Next, your build environment: you really should use the same compiler and the same compiler options as the original build. If you don't do that, kpatch at least will warn you; I'm not sure that the Xen live patch build tools would. I would say that it's easiest to self-compile and preserve the original environment, probably using mock or Docker or whatever it is, just something where you're not continually trying to keep that build environment up to date. Then there's the question of what can be made into a live patch. This may vary between implementations, but I think the basics of what is easy and what is hard are generally the same. What I found to be easy was changing just the logic within a function. You can also add variables, so add a global. Harder would be changing an existing data definition: perhaps you have a bit field and you want to add bits to it, but you aren't changing the size of anything; this should in theory be possible with some help that I will get into in a bit. Hardest, if not impossible, would be changing the size of a data structure, or changing
the initialization code, because the initialization code has already run, and therefore you can't live patch it. Changing parts of the live patch code itself is probably not advisable either. And as I said earlier, some of the live patches you might want to generate for security reasons might depend on new microcode; well, Linux and newer versions of Xen can load new microcode at runtime, in case you weren't aware of that. Okay, so now I'm going to go through a few small code examples. Again, these are from the Xen code base, so they may not be generally applicable. Here I'm patching something in the init section, scheduler initialization, and I'm just adding a printk. Do you think it will patch? Raise your hand. Alright, yeah, you're totally right: no. It explicitly ignores the init section, because that code already ran; there's nothing to do there. Then here I'm adding a variable and referencing it from a scheduling function. Stupid code, but do you think it will patch? Raise your hand. Yeah, it actually patched successfully, and I ran the code that exercises the patch; here's some of the output at the bottom. Okay, now I have a structure that used to be const, and I'm removing the const. Do you think that will work? Nope, definitely not: the symbol changes sections, from the read-only section to a writable one, so it definitely doesn't work. And then adding on to a structure, where there are statically declared instances of it: will it patch?
Again, the answer is no, and that's because of an object size mismatch. I suspect, but I haven't tested, so I don't know for certain, that if you didn't have any statically declared instances and were just doing dynamic allocations, this would look like function changes only. I could be wrong, and it might actually generate a patch, which of course would not work properly; unfortunately I didn't have time to get that far. One of the workarounds available for things that will not patch properly is hooks. Both Xen and Linux have a concept of hooks: you can make changes before, during, or after a patch is applied, and with revert hooks, right before, during, or after a patch is reverted. You can use these for sanity checks, and you can use them for modifying data. So if, for example, you wanted to add that bit to that bit field, and you knew where all of the instances of that structure were in memory, maybe you could go through and backfill the definition of that bit, as an example. Another workaround is shadow variables. Conceptually, this is adding a variable onto a structure. In Linux there's a shadow variable API you can use to do this; I can have multiple shadow variables associated with a single structure, or really just an address in memory. In Xen, unfortunately, there's no shadow variable API; you'd have to hand-roll it yourself. And alternative instructions I mention only because they're the reason I had my one crash in production. Different versions of CPUs have different capabilities and different instruction sets, and generally, when we build a binary for x86, it targets the lowest common denominator. Alternative instructions are the framework for patching in the correct instruction based on the instruction set. For example, if a processor has Supervisor Mode Access Prevention, I believe, but not all Intel processors do, it'll patch in the correct
instruction to protect or unprotect memory. For a Linux module, it does this when the module is loaded, and in the Xen hypervisor, it does it when it loads the live patch. This really needs to be run as part of live patching, or, like me, you will discover that you've encountered a memory access violation. The bug: I was apparently an early adopter of this in Xen, and I did not test on the specific processor that I was going to apply the live patch to. The CPU features weren't being handled correctly when applying the alternative instructions, so I encountered a bug where the alternative instructions weren't applied correctly, and had my crash. That's why I mention it. Any questions before I continue? Okay, moving on. Patching Xen specifically: I want to talk a little bit about the consistency model. Unlike Linux, patches are applied all at once in Xen; there's no concept of having applied a patch to one virtual machine but not another. It does it all at once; a patch is simply enabled or it isn't. And you probably shouldn't patch anything that's in the call stack while the live patch code is running, because the old code will continue to execute. The process of applying a patch in Xen: you upload the patch, it allocates memory for the patch, it resolves the symbols, and then it applies the alternative instructions. When you go to apply the patch, there are work queues in Xen, so it schedules work to apply the live patch; then, within the code, like in the idle loop, the CPUs will check for live patch work and all rendezvous at a specific point in the code, as they call it. The first CPU there applies the patch, so it'll run the apply hooks. Xen does not have, I believe, the fentry mechanism that in Linux puts no-ops at the beginning of the function, so it has to save the old instructions before it overwrites them, so that you can revert the patch. Then it overwrites the beginning of the
function with a jump: a jump from your old function to your new function. For the test patch, I decided to do an actual security patch: Xen Security Advisory 401, which had a race condition whereby a safety TLB flush is issued too early. In any case, I picked it because there was a total of one function patched: get_page_type in mm.c. The process of building the patch: I will note, in case you're familiar with kpatch, that the Xen live patch build tools are based off of kpatch. Here I give it the source to Xen, I give it the config for Xen, and I give it the path to the patch. Then, because Xen wants to do a safety check when you stack live patches, to make sure that you're stacking the patches as expected, you give it a path to a dependency, which will be either your previous live patch or Xen itself; here I'm giving the build ID of Xen itself. There's a similar check for the build ID of Xen with the xen-depends option. Then I give it a path to the Xen symbols (xen-syms), which is used both as a sanity check, to see whether the function that I'm patching is actually in the original version of Xen that was built, and also because the size of the function is pulled from the Xen symbols. And finally I give it the output directory. So, building my live patch: it builds everything, and it has this "reading special section data" step, which will do things like pull the size of the alternative instructions, among other things; I didn't look at it in that much detail. To apply the patch, there's a shim in between make and GCC that will take any source that is built and copy it off to a special directory; then it will unapply the patch and do the same; and then it compares the two using create-diff-object, which is part of the build tools. So it effectively does a binary diff, with a number of safety checks to make sure that what you're generating can be a valid live patch. One thing I would like to point out is that if you are adding a source file, which
you might think would be easy for it to add to the live patch: if you add a new source file, it doesn't get included in the live patch. So don't do that, as I discovered a little while ago. Alright. This was not my first experience; my first experience, actually, was that nothing worked. The last time I seriously used this was CentOS 6 and Xen 4.8, as of 2020. With Xen 4.14 and Xen 4.16, whether it was CentOS or Ubuntu, it didn't work out of the box. So I did a bit of digging. I was able to generate a patch for Xen 4.13, and that's still security-supported, so I figured good enough for this presentation. I asked about it on the xen-devel mailing list, and it turns out that with master and some public patches for the live patch build tools, I was successfully able to create a live patch for that. But I wasn't able to backport cleanly to Xen 4.16 or 4.14, so as far as I know, nobody has built live patches for those versions; hopefully when Xen 4.17 comes out, that'll have support. Then, problems I encountered: I needed DWARF debug info re-enabled, because it had been disabled. I encountered a segfault with create-diff-object that was fixed upstream in kpatch, so I backported that successfully to the Xen build tools. And then I had a changed function that was not detected. I'm mentioning all of this only because I want to emphasize that there's a pretty strong dependency between the live patch build tools you're using and the version of the compiler. Then I tried to load my live patch, and it failed because I had an ABI breakage. Having fixed the ABI breakage, I was successfully able to upload and apply my live patch. But then, when I went to take a look, it had patched a lot more than I expected: rather than patching my one function, it had patched eight functions. I took a look with objdump, and indeed those were all changed. So now I needed to do some debugging, and I happened to know from former experience that log messages can
be a problem when generating live patches, I think more so with Xen than with kpatch, because kpatch has some logic that tries to filter out changes to the log messages. Having fixed my formatting, my new objdump showed just the one function patched, and it successfully applied. And so we generated a live patch for Xen. Some of the other problems I encountered previously: with CentOS 6, the compiler would kind of randomly insert no-ops into the code. I would manually look at those and try to figure out what had changed, and I didn't really observe anything changed, so I carried a local patch to filter those functions out of my live patch. I haven't yet observed it on CentOS 7, so I'm hoping that was an upstream bug that has been fixed in GCC. And the final thing for Xen that I'd like to mention is that there is a payload limit; this applies to Xen only, for the most part. If I try to generate a live patch bigger than two megabytes (by, for example, inserting a line at the beginning of a whole bunch of files so that we get all those debug line-number changes), it rejects it with "invalid argument". That's because there's a check for the live patch max size, which is defined at two megabytes, and I have not tried live patching that to make it larger. Just so you know. Then, finally, I'd like to finish with patching Linux with kpatch. I didn't go into this as deeply as I did with Xen. The consistency model: there's a live patch framework in Linux that is somewhat common between the different tools. With this, patches are applied on a per-task basis. When you go to apply the patch, the kernel will check the stack of a sleeping task to make sure that nothing being patched is in the stack; if nothing being changed is in the stack, it'll go ahead and apply the patch for that task. Otherwise, it applies the patch when the task is in user space, because when it's in user space it's not executing anything in the kernel and is therefore safe. So here
I have a test patch that just modifies /proc/devices. A very simple patch, not terribly interesting, but it has a visible effect, which is nice: if I successfully live patch, there'll be "devices" at the beginning. With kpatch I looked at Ubuntu 20.04, and I tried doing an apt-get of the kpatch build tools and building against that. Very similar to Xen, I need to give it a path to the vmlinux (so, the original symbols), I need to give it the source directory and a configuration, and then finally my devices patch. When I kicked this off at 5pm one evening, I came back at 8pm and it still wasn't done, and I thought, well, maybe it just takes a long time. I came back in the morning and discovered it was still running, and this was not something I expected. So I decided to attach gdb. It turned out that it was performing an fscanf of Module.symvers, from symvers_read. I looked at the offending line; it was doing "while fscanf is not end of file", and if fscanf does not convert anything successfully, it doesn't advance the file pointer, so it just stays there in a tight loop doing nothing. That's the bug I encountered. I looked at the upstream build tools' symvers reading code and found out that this was a bug they had fixed upstream, so then I used the upstream kpatch build tools from /usr/local/bin and got a working patch. Then I made a beginner mistake, where I did not notice that the source I was building did not match the running kernel, because apt-get source got the latest Linux source. Having fixed that, I was able to generate a live patch successfully; this is very similar to the Xen build process. And I was able to load it: it loads as a normal Linux module, it waits up to 15 seconds for the transition to complete, and it is patched. So, huzzah. All right. So to summarize this presentation: live patching is very fast compared to other software update options. It's typically performed via a trampoline. Live patches may need to be specially coded.
You can't just take a random bit of source and expect to be able to generate a live patch from it. And don't assume that live patch tooling will work with any particular software or compiler version; you really need to test. Thanks to Alison, Venim, and Luke for early feedback on this presentation. I have some references here. Any questions? Yes. Right, so you can tweak some of this. The patch author guide here is pretty good; it talks about, when the compiler decides to change the inlining, you can do things like apply noinline, or apply always_inline, in order to get that to work. There are cases where you might not be able to patch everything; for one thing, your function has to be large enough to apply whatever trampoline you have, for example. I agree with you. This is something where, when I was doing it, if I could naively get a live patch, I would do things like check the call stack to make sure that the change looked safe to apply from the context of where the live patching was happening. I would have a test system with representative virtual machines running on all of the different architectures I wanted to support, and I would patch and unpatch several thousand times in order to make sure that this was safe to do. But I agree with you that it is delicate, and you won't always be able to successfully generate a live patch. Sorry? Sometimes we would basically just automatically re-apply the live patches at boot; that's one way of doing it. The other way is to do a full recompile and then just load the fully patched version the next time the machine happens to reboot. Any other questions?
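The inlining issue raised in the question above is easy to see with a toy example. This is a hypothetical illustration (the file and function names are made up): at -O2, a small static helper can be inlined away entirely, leaving no symbol for a trampoline to hook.

```shell
cat > inline_demo.c <<'EOF'
static int helper(int x) { return x * 2; }   /* small static: inlining candidate */
int api(int x) { return helper(x) + 1; }
EOF
gcc -O2 -c inline_demo.c -o inline_demo.o
nm inline_demo.o   # only 'api' survives; 'helper' was inlined and discarded
```

Marking the helper `__attribute__((noinline))` keeps it out of line as its own symbol, which is the knob the patch author guide points at when a function you need to patch keeps disappearing into its callers.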
I guess we're done. So, I shall explain how this works across each of these components. A lot of the time you have a data steward who is supposed to support these people and components and share the data, and when that doesn't work, it ends up costing money and efficiency, against the benefits these systems are supposed to have. Say you're in contact with the other side of the company, and that makes you go through this whole cycle, trying to come up with a situation similar to what the other team was going through. That is a key component of understanding why most data systems fail: the teams have no visibility into the other parts of the organization. So the objective of data governance is to inform data management and organizational decision-making, and to ensure that data is consistently defined and better understood, including the use of large-scale data across the organization. This requires a lot of support and time; it's not something that engineers can do on their own. Building trust has to be part of the strategy if you try to build this bottom-up, because you're asking other groups in the organization to work with you, and for the most part those groups won't engage, because it's not their business; it's not their job. If working with you helps them do their jobs, they'll get on board. So unless management sets the rules and mandates the work to be done, you have to do that yourself. And the other part of winning over management is the bottom line. That can be money.
And most of this is on the back-end side of things, so you probably won't see dollars; you won't see direct value out of it. You can spend time and money on it, and it can take a few years to see the value, so you will notice push-back from the management side. The amount that you need to spend on this, versus the amount of value that comes out, is not favorable in the near term. It's somewhat of a marathon, and it needs a sponsor who understands the long-term value of it; if you don't have that, it's hard to do this, because you need the support of management. There will also be compliance drivers on the management side. So how is this done? There are several frameworks, kind of normative ones, about the architecture of how governance can work, and there are lower levels of the actual architecture: what it needs for objects, for storage as well, and you can have sources and architecture on top of that. Then, how do you carry it out? The main driver would be your business: things like compliance with regulations, CCPA, GDPR, all of that; you need to apply that in a way that comes out of your organization and powers your business and work. Things that work across the board, so that whatever happens, the policies are all being followed. Alright. So I want to talk a little about the data approach, the way to do things, the data catalogs, and the people who are doing the work: do they have something to help with their work? Is there a way to do that?
So I want to talk a little about two things. One: in the case of query federation, you have to do individual SQL calls and then combine them, which makes your query a function of several systems. If you query across systems, you have to do this kind of composition, and the biggest problem is what happens when part of your SQL query doesn't work for you. Something breaks, and the system doesn't tell you what it's doing, doesn't tell you what's happening or what's going on, or why. You shouldn't have to find out later, in production. That's one thing we look at, and we call it graceful degradation: if part of it fails, you should not fail entirely at that point; you should be able to handle it, so that it does not create single points of failure. We know things are going to fail, and there have been plenty of relevant situations in the past; this is something we're working on, and we're really invested in it. Two: how was the data transformed? Right from the start: where did it come from, what kind of transactions produced it, and what was done to it? We need some understanding of where the data was supposed to be going, what sources and what kinds of tools are applied, and what kind of transformation will be applied to it. These are some of the things that are important, especially when it comes to data.
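The graceful-degradation idea above can be sketched in a few lines. This is a hypothetical shape, not a real federation API: the regions and the query_region function are stand-ins. When one backend fails, the query returns partial results and flags the response as degraded, instead of failing outright.

```shell
# Stand-in for a per-backend federated query call.
query_region() {
    case "$1" in
        us) echo "us: 120 rows" ;;   # healthy backend
        eu) return 1 ;;              # simulate a failed backend
    esac
}

partial=0
for region in us eu; do
    if ! query_region "$region"; then
        echo "warn: $region unavailable, returning partial results"
        partial=1
    fi
done
[ "$partial" -eq 1 ] && echo "status: degraded"
```

The caller still gets the rows from the healthy backend, plus an explicit "degraded" status it can surface, which is the opposite of the silent all-or-nothing failure the talk complains about.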
There's a lot to keep track of when it comes to data; we often don't know the state and balance of our data until the nightly runs. Any questions? Yes: related to what you just mentioned, and also to one of your slides, the data quality challenge was one of the four points on one of your slides. After you ingest external data, especially for loading into the platform, what are the solutions? So, first, a little discussion. Most sources are different, so we have different rules in place, and we understand a lot about how our data should look. For example, data can arrive that doesn't match what we expect: it could happen that the data moved, or that it came from a different source than expected. So we have a lot of different rules in place, and we have a lot of data quality challenges. There can't be just a couple of fixed checks; we have to look at the data so that we can see when something is broken, from the external data through to the data going into the next stage, and we can understand how much data ran through in one night. There are a lot of checks that we don't have to write by hand. So I want to recommend this for everybody: you have to understand your data sets. If you don't understand your data sets very well, it's difficult to control their quality, and sometimes we don't understand them because they come from upstream sources we don't control. So it's about understanding the data sets, quality control, and the data itself.
So we have data quality rules, and we may want to solve quality issues by applying those rules, but then again, quality costs: if you want quality data, you spend money for that, and you spend the time, and you may not know the source system. So it's difficult, and it's difficult to understand the data sets, but a lot of it comes down to that. The people who are in charge of the problem are the ones picking up the data, and they're very familiar with the application; data scientists go to their common data, and then finally you're able to actually focus on the source of the data, the real system the data is coming from, and the transformation of that data. We focus on where each person, each part of the organization, concentrates: which sources the data comes from, who wants to use it, and what the state of the data is in the system. Conceptual architecture: this is the conceptual architecture of the platform we use. The services platform, obviously, is something the services make use of, for example sourcing the data from other places. Then you have security and compliance frameworks; the data systems and the bodies of the organization describe how they do what they do, and if you need to demonstrate compliance, you follow the guidelines on which tools to use. And then on top of that you have, for example, authorization and auditing, single sign-on, and the user-facing functions. So that's one part of how the machine works, and then teams can plug in their own systems that come from the... Yeah, I understand. With data lineage, I'm not quite sure what that means in this context.
So, as I mentioned, with lineage you take a data set and trace it back to the source of the data, and you follow it through the transformations along the way: for each source, what kind of transfer happens, what kind of logic is applied, the whole story. That's what lineage shows: this is the source, and this is how it was transformed. And as I mentioned, that's very important for data governance. Data governance is about the responsibility for data management: who in the organization owns which data, and how it is distributed. We show people the values of the data along the different quality dimensions, where the information comes from, and the policies that apply to those resources. So if something looks wrong, you can tell whether the problem is in the source data or in how it was transformed, and how that affects quality. Now I'm almost done, so let me say a little about one last thing. I don't know if you're familiar with this, but there is the MAD landscape, Machine learning, AI, and Data, which Matt Turck puts together. It maps the application tools in this whole ecosystem; I just took a screenshot of it, and you'll see the number of application tools on there. It's humongous. It's hard to read even at full size, but you get the picture even in this session.
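The lineage idea described above, tracing a data set back through its transformations to the sources, can be sketched minimally. The pipeline names here are illustrative, not from the talk; real systems record much richer metadata, but the core is just a graph from each output to its inputs.

```python
# Minimal sketch of data lineage: each derived data set records which
# transformation and which inputs produced it, so any data set can be
# traced back to its sources. Names are illustrative.

lineage = {}  # output name -> (transformation, list of input names)

def record(output, transformation, inputs):
    lineage[output] = (transformation, inputs)

def trace(name, depth=0):
    """Print the whole story of how a data set was produced."""
    transformation, inputs = lineage.get(name, ("source", []))
    print("  " * depth + f"{name} <- {transformation}")
    for parent in inputs:
        trace(parent, depth + 1)

# Hypothetical pipeline: a raw feed is cleaned, then joined with another.
record("enrollment_clean", "drop invalid rows", ["enrollment_raw"])
record("report", "join on student id", ["enrollment_clean", "grades_raw"])

trace("report")
```

With even this much recorded, the governance question in the talk becomes answerable: when a report looks wrong, you can walk the graph and see whether the problem entered at a source or in a transformation.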
It's so humongous, there are probably a thousand applications involved in this landscape. I'm not going to go over which ones you should use; the best tool is the one that works for you. The point is just the complexity, the sheer number of tools out there that can do all of this: data processing, data transformation, data hosting, data routing, databases, they are all there. Now, I just picked out the data piece from this. The full landscape covers AI, machine learning, and data, but even in the data piece alone there are probably a hundred applications, storage and everything else. So you can see, just from the number of applications and tools, it's kind of mind-boggling. You don't need to worry about all of it, but knowing what's out there makes your life easier and better when you go through an evaluation, so maybe that's something you do. And what does the future hold for the data world? Understanding that the data is moving around doesn't mean the transformation stops being important. A few things have changed in the last few years: object storage, and the separation of compute and storage, which gave us a lot of room. And then there's data mesh, which is how people talk about decentralizing data. Last year everybody was talking about it; it was the shiny thing, but it's not talked about as much now. And honestly we are still a bit confused about how we use a lot of these features in the data world; the features are part of the work.
So the plan is to start studying the resources. Apart from your own data, you should be able to use public data, and what I'm showing here is a very small subset of the open data resources out there. So try to understand the plans and use the resources you've got: bring in outside data, combine it with your own, and find out whether there's something out there that can help your development process succeed. You bring that into the community. There is a lot of data freely available in open buckets, so you find that data, bring some of it in, and put it alongside what you already have. Once you have the data, you can hold it, look for something really big in it, and then bring in more resources to supply that work. A question from the audience: so it's acquired data? Yes, it's acquired; that's traditional for this kind of program. But you don't start from a theory, because if you go in with a fixed theory you just cut things off. Sometimes you cannot say in advance what you will find, and we probably see a lot more than we expected to.
And you know, the real thing with them is the expectation that you can see the problem. At the end of the day, how are you going to do it? Do we want to cover how you can go through it or not? Not entirely. How close are we? This is just getting started right now; it's running on one of the campuses, on a small portion of it, but it shows us that this can work. That's kind of the whole thing. One of the other things is that even though we announced it, it's up to each campus, and a campus can say, we don't do that. You're right: they have their own architectures that they use, and some of them won't join. So this is where we have our own platform, our own building blocks, what they call Bajtega. We focus on our own pieces, and we've found, for example, that if a campus doesn't want to join us, we're probably only reaching a portion of the audience. Another question: I was more asking about the streaming data, more of the architecture where you were showing that board. We just have to focus on this; at least now we're going to have one, and we may have to focus on the big picture there. If you want to take this on yourself, you have to follow the best practices you've got. What we do is push the boundaries. In higher education, most of the boundaries are not good, so we have to push the boundaries of what we do. Thank you guys, I appreciate it. Good conference coming up.
Thank you. Yeah, glad to meet again. Oh, good, I thought it was you when I saw the name. Yeah, the conference. Very cool. Thank you. Yes, yes, definitely.