Are we ready? Are we ready? Welcome to the 10:40 talk on testing OpenShift on OpenShift. Take it away.

So, hello guys. My name is Samvaran Kashyap Rallabandi, I go by SK, and I work for the Continuous Productization team. I have been working on a project called CI Pipeline, which is part of the CentOS PaaS SIG community. In today's talk we are going to discuss the following: testing OpenShift on OpenShift as a use case, how it is feasible for us, how we worked on it, and why we need OpenShift on OpenShift; the basic terminology we need to understand for the whole process, like containers, privileged containers, and the differences between them; why we need OpenShift on multiple clouds and how we are deploying OpenShift on multiple clouds; and how the whole process is enabled by a tool called Linchpin. We will also have a short introduction to the CI Pipeline project, we will have a demo, and we will conclude the presentation there.

Going ahead, our use case is to install and run end-to-end tests of OpenShift on a VM which is running inside an OpenShift container. It's a nested virtualization scenario, where we are running a virtual machine inside a container which is itself running on an OpenShift cluster. Why do we need it? Because of the regular system updates, we found a need to test OpenShift on multiple distros, for example CentOS and Fedora. Fedora is moving very fast (we have Fedora 26, 27, 28, and 29 is maybe in beta), and every time there is an update, there is a need to check whether OpenShift works on that update or not. That is the thing we are going to address. We also need to check how OpenShift works with multiple deployments, and why and how it fails on multiple distros. So we need to test OpenShift in a very feasible manner, using OpenShift itself.

Before we go ahead with the talk, these are the things you need to know. First, containers: most of the people attending this talk should already know about containers. Containers are nothing but isolated user spaces, processes running on a shared kernel, which can simulate your working environment, like a CentOS server, in an isolated user space. Then we have privileged containers: these are containers which gain access to the host kernel, and we will discuss more about that. OpenShift is a container management platform based on the Kubernetes distribution, which recently changed its name to OKD, the Origin distribution, so I should be using that name more, promoting it instead of just saying OpenShift. Finally, we should also know about the libvirt daemon: libvirt is an open source API which is used for managing different kinds of virtualization platforms and hypervisors, like the Xen hypervisor or KVM, etc.
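Since the use case above leans on nested virtualization (a VM inside a container that itself runs on a virtualized OpenShift node), a quick sanity check on the host is useful. This is a minimal, illustrative sketch; the module name assumes an Intel host (use kvm_amd on AMD), and the exact output varies by distro.

```sh
# Check that the host exposes the KVM device that will be passed into the container
ls -l /dev/kvm

# Check whether nested virtualization is enabled for the KVM module
# (Intel hosts; on AMD hosts read /sys/module/kvm_amd/parameters/nested instead)
cat /sys/module/kvm_intel/parameters/nested   # prints Y or 1 when nested virt is on
```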
Going ahead, let's talk about a container versus a privileged container. When we talk about a container, it is just a process, a command running inside a user space. Then why do we need a privileged container at all? Most of the time, containers are secured by a container engine, which runs on top of an operating system that uses the kernel and runs on some infrastructure. In the case of privileged containers, however, the container bypasses this container engine and gets access to the operating system and the kernel devices directly, which we might or might not want. Unless the container needs to use a device shared by the kernel, we shouldn't use privileged containers, because there is a risk that the container can run a command like rm -rf / and remove the whole host all at once, which we don't want to happen.

I have a perfect analogy for this. How many of you use hotels and Airbnb? You definitely have. So what do you think about Airbnb, is it secure enough? Which one do you prefer, staying in a hotel or staying in an Airbnb? If my company is sponsoring me to stay in a hotel, then I would definitely prefer hotels. But with Airbnb, you share resources like the kitchen and sometimes the washrooms; I have had pretty bad experiences with Airbnb, and great experiences too. Airbnb acts like a privileged container: a person using an Airbnb can be well-mannered and make use of the whole accommodation in a very good way, or he can destroy the house, he can set your house on fire. That's what happens with privileged containers: if you are not careful with them, the whole infrastructure is on fire, people might delete devices, or things can go into an error state. Hotels, on the other hand, are secured spaces which have security mechanisms, security guards patrolling around, access to the cops at any point of time, and each room is on its own and doesn't share resources. Hotels have the best service too, because hotel management is responsible for maintaining the rooms, whereas with Airbnb the guests are only morally obligated to clean up their rooms when they leave; in hotels that's not the case.
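To make the hotel versus Airbnb point concrete, here is a minimal sketch of the difference on the command line. The image name and the device check are just illustrative; the point is that --privileged hands the container all of the host's devices (such as /dev/kvm), which is exactly why it has to be treated carefully.

```sh
# Regular container: only a minimal /dev, no direct access to host devices
docker run --rm fedora ls /dev/kvm        # expected to fail: no such file or directory

# Privileged container: host devices (including /dev/kvm) become visible inside,
# so whatever runs in here can touch the host's hardware and kernel interfaces
docker run --rm --privileged fedora ls -l /dev/kvm
```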
Going ahead, why do we need OpenShift on multiple clouds? Recently we have seen OpenShift popping up on every type of cloud provider: OpenShift on AWS, OpenShift on Azure (in collaboration with Red Hat), OpenShift on Google Cloud Platform. There can be many scenarios you want; you might even want to run OpenShift on your local machine for your development environment. But there must be a better way to choose cloud providers, an easier way for your deployments to happen.

The advantage is that you can choose what you want when you use multiple cloud providers. In some cases, say, Amazon is more costly (it's just an example, it might not be true in real scenarios); Google cuts its prices every few months to compete with Amazon, so I might want to run my OpenShift deployments on Google Cloud. Or maybe Amazon is more efficient in terms of storage, so I want to keep the storage on Amazon and run my machines on Google Cloud, and I want to connect them together. That is very difficult these days, because each cloud provider has its own API, it's hard to connect them, and the person using those APIs needs deep domain knowledge of both cloud providers at any point of time. Also, when we deploy the whole infrastructure on a multi-cloud basis, I'm pretty sure Google Cloud and Amazon wouldn't both be down at the same time, so there will be less downtime, and there might be lower latencies according to the regions they offer. Wouldn't it be dreamy if we had a lightweight tool that does the whole multi-cloud deployment? That's where we get the tool called Linchpin. Linchpin is a collection of Ansible playbooks, modules, and simple Python scripts which enable these cross-cloud and multi-cloud deployments.

Going ahead, Linchpin has its own terminology which you need to understand before using it. Linchpin has workspaces, where a workspace is nothing but a collection of files generated to manage your cloud deployments; it's not difficult to create a workspace, it's just a simple command called linchpin init and it creates your workspace magically. We have a PinFile, which is the starting entry point for Linchpin to grab details from. There are topologies and layouts: each topology consists of different resource definitions for multiple clouds, which you will see in the upcoming slides, and layouts are used to generate multi-cloud inventories automatically, based on the data we fetch from the different cloud providers. And the best part of Linchpin is the hooks: hooks are pre-provision and post-provision scripts which you can run, and which enable things like OpenShift installations.

Going ahead, this is the Linchpin flow. If you see Linchpin as a black box, it takes a topology and a layout as input, gives you an output file, and gives you an Ansible inventory if there is a layout. It connects to all the cloud providers, like AWS, OpenStack, GCE; there are 6 to 7 providers we support right now through Linchpin, and it provisions the instances. All the outputs are gathered from the APIs provided by the cloud providers, and Linchpin hooks work on the generated inventories: they run Ansible playbooks, Python scripts, Node.js scripts, or Ruby scripts on the generated inventories once the resources are up and running, and they create a deployment using some magic. There is no actual magic in between, but it does use Ansible, which we call magic because it's a pretty good tool that uses SSH; I always wonder how SSH can be an integral part of deployment, so that's why I call it magic. We use Ansible in between to deploy onto the inventories.

Going ahead, this is what a typical workspace looks like. This one is a workspace to install MySQL on a particular topology. It consists of the credentials, which are stored as YAML; we can use whatever credentials we want, and these three are standardized credentials for OpenStack, AWS, and Google Cloud. We also have hooks; in this particular case we created an Ansible hook which installs a DB server from an external role. After that we have different folders to store different kinds of files: inventories, layouts, topologies, and resources. This is how a PinFile looks: each PinFile is a collection of key-value pairs where you just give the reference to the topology and layout. In our case we can use a layout for, say, an OpenShift three-node or four-node cluster, which creates the Ansible inventory for us to deploy the whole OpenShift environment.
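As a rough illustration of what that workspace and PinFile look like on disk, here is a hedged sketch. The directory names follow what the talk describes; the target name and the exact YAML keys referencing the topology and layout files are assumptions, so treat this as the general shape rather than a copy-paste example.

```sh
# Inspect a workspace (layout described in the talk:
# credentials, hooks, inventories, layouts, topologies, resources)
ls linchpin-workspace/
#   PinFile  credentials/  hooks/  inventories/  layouts/  topologies/  resources/

# A PinFile is a small set of key-value pairs pointing at a topology and a layout
# (target name and keys below are illustrative, not copied from a real workspace)
cat linchpin-workspace/PinFile
#   openshift-3node:
#     topology: openshift-3node-topology.yml
#     layout: openshift-3node-layout.yml
```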
These are examples of the topologies, which consist of resource groups and resource definitions, and there can be any number of definitions. The biggest deployment I have made, accidentally, using Linchpin was a 20-node deployment on my AWS account, which cost me about $200 over two days. That was crazy, but I learned a lesson: you should be very careful with the count attribute in these topologies. Linchpin works with bigger deployments too; we haven't tested 20-node clusters in production environments, but it does provision them.

This is the basic structure of a Linchpin topology: we have resource groups, each consisting of different resource definitions, and each resource group can have its own metadata which can be parsed. The best part of these topologies is templating, where you can dynamically render the whole topology template, so you get ad hoc provisioning like with other provisioning tools. And this is the inventory layout, which is cloud-agnostic in nature: it doesn't tell you to choose a particular cloud provider, it intelligently goes to the provisioned instances and pulls out the resources based on the count attribute, to map the example layout onto the inventory. There are mainly three sections: one section is the vars, which translates roughly to the all-vars inside the Ansible inventory, and at the same time each host has its own host group, which carries its own metadata, via the layouts. Going ahead, this is a successfully generated inventory for an app server and a DB server. As you can see, there can be an AWS instance and a Google Cloud instance working together, or a private cloud instance, and you can connect them together as long as the network permits it.

Coming to the Linchpin hooks, which are the part of our use case for testing OpenShift on OpenShift: Linchpin hooks are context-aware scripts which run around the provisioning of the instances. There are five types of hooks; they can be written in Ansible, Python, shell, Ruby, and Node.js too. And there are four states where you can initiate a hook: one is the preup state, before the provisioning happens; one is postup, after the provisioning has happened; one is before the destruction, if you want to do some cleanups on the external cloud providers with your own custom scripts; and one is postdestroy, which can be helpful when you want to be certain whether the resources were destroyed properly or not. This is an example hook for installing a DB server. Basically we don't do much work in creating hooks, because if we use a playbook which uses an existing role, this is all the playbook looks like: it just refers to a role which is generally available on Ansible Galaxy, so you need not write different roles if they are already on Galaxy.

This is a quick Linchpin 101 of how Linchpin is installed and how we create instances. You install it via PyPI, then you create a workspace using linchpin init, and after giving it the credentials path you can use the linchpin up command, just like you do with vagrant up; it creates all the instances, and if a layout is specified it creates the Ansible inventory too. Finally, if you want to tear everything down, it won't destroy the inventory, but it will destroy all the provisioned resources, using the linchpin destroy command.
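Put together, the Linchpin 101 just described looks roughly like this on the command line. The workspace name is made up for illustration; the commands themselves (pip install, linchpin init, linchpin up, linchpin destroy) are the ones named in the talk.

```sh
# Install Linchpin from PyPI
pip install linchpin

# Create a workspace (generates the PinFile and the folder structure described above)
mkdir my-workspace && cd my-workspace
linchpin init

# Provision the resources described in the PinFile; if a layout is referenced,
# an Ansible inventory is generated as well
linchpin up

# Tear the provisioned resources back down when you are done
linchpin destroy
```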
Finally, to make OpenShift on OpenShift possible, we used a container which basically installs Linchpin and the other dependencies, and which also runs libvirtd inside the container. We borrowed this Dockerfile from Mr. Brenton Bard; it's one of the great examples we found for running libvirt inside a container, and on top of it we just had to install some dependencies like libvirt-devel, rpm-build, and bash-completion, which are necessary for creating libvirt instances. Since we are using privileged containers here, we were able to run the whole libvirt daemon by overriding the existing setup with the host machine's KVM device. We got to the point where we created an inception inside an inception: we have a miniature VM, on top of it we are running a Linchpin container, on top of that we are running OpenShift, and we are using Linchpin to run the end-to-end tests. This is a simple workspace; I would like to show how it looks, and you can access it in this particular repository.

Before I conclude and show the demo, I would like to talk a bit more about our CI Pipeline project, because the whole testing of OpenShift on OpenShift is part of the CI Pipeline project, where we are trying to build an automation framework that uses different types of containers and tools to make your CI process easier. In the example project that testing OpenShift on OpenShift is a stage of, we were actually taking the packages of Fedora Atomic Host: every commit made to Atomic Host runs through the pipeline, triggers a package build, runs the functional tests, and composes an OSTree; further, integration tests are run on the composed images, and finally, once the image is generated, that image is fed to the Linchpin libvirt container, where the OpenShift cluster is booted inside a container and the end-to-end tests are run. This is how our pipeline looks: as I said, whenever there is a Git commit it goes through all the stages, and this is the part where Linchpin works inside a container to run the OpenShift tests.

Oh sorry, coming back to the demo, I hope I can play this. This is an OpenShift environment running locally, and we have the Linchpin libvirt containers which were already built as part of the CI Pipeline process. This is our Jenkins environment where the actual pipeline runs; for this particular demo I isolated all the other stages and used the Linchpin libvirt container directly to run our OpenShift end-to-end tests. In this case it has started provisioning the instances using the libvirt provider. As a quick walkthrough: this is a privileged container which uses the host machine's libvirtd, but the virtual machine is actually running inside the container, where it uses Linchpin to install OpenShift and run the end-to-end tests on it.
So it took 28 minutes, 38 minutes last time, and it's going to take a little while more, but the whole demo is within 2 minutes, so I took the liberty of editing the demo. Now it's downloading the image, that would be a Fedora Atomic image, and it uses Linchpin to boot that image and run the OpenShift tests on it. Let me just forward it a little. Now it's generating the outputs, and it is generating the OpenShift inventory with the hook, which would be the post-provisioning hook, and it has started installing the OpenShift environment onto the virtual machine. In this current experiment we tried to run a single-node environment, because running a full-blown OpenShift deployment inside a container, which we tried, crashed our environment multiple times, so we just wanted to check the single-node environment only.

So, coming back to the presentation: I thank my whole team, the Continuous Productization team, for all the support they have given me and for the opportunity to work on this particular project. Feel free to fork the CI Pipeline repository, we are looking for contributors; on Freenode we are continuous-infra, and we have a mailing list at continuousinfra.com if you have any doubts. So, any questions?

Yeah, sure: performance, in terms of comparing Linchpin to just using a playbook that has the AWS modules or the OS modules for OpenStack. Since the underlying playbooks of Linchpin actually use os_server and the other modules, if you are just talking about provisioning of instances, it hardly differs, maybe by milliseconds, because all we do is instantiate it through the Ansible API, and the Ansible API calls the playbooks where the OpenStack server modules, the Ansible OpenStack modules, are referenced. So there shouldn't be much of a difference there. But when you talk about a whole run, from creating an instance to generating an inventory, where you would otherwise need multiple playbooks, that's where Linchpin gets its advantage, by simplifying the process rather than through raw performance. In Linchpin we also have a component called RunDB, where we use a database to store the existing topologies and successful runs, and you can repeat them again and again based on a transaction ID. Essentially, if you think about it, it could act as an external cloud service provider: if you write a REST wrapper around it, it could be a full-blown provisioning-as-a-service kind of thing, but currently it's just a small lightweight tool which does provisioning across multiple clouds.

Any more questions? Have you played with KubeVirt instead of doing this manually? Not yet, but a couple of people in my team have started working with KubeVirt, and I have heard pretty cool stuff about it, so we are going to look at KubeVirt soon. Any other questions? All set? Thank you. One more small announcement: I have a talk about CI pipelines for dummies, which maybe would have been better if I had given it first, because it has all the basics of how the pipeline works, how OpenShift works, what containers are, and it starts from the very basic level of what software is and how it works. So feel free to attend that. And that's it, so thank you.

Welcome to the 11:20 talk by Harshal and Andrew.
They are graduates of Westford High School and they are heading off to college soon, this week, so take it away.

All right. Hi, our talk is Skua: extending distributed tracing vertically into the Linux kernel. I'm Andrew, and I'm Harshal. We did this as part of the MIT PRIMES research program for high school students.

So first let's talk about distributed systems. Nowadays applications are getting more and more complex, and they are no longer monolithic; they are now written as distributed systems. The advantage of that is that you get modular development, continuous deployment, and better scaling. You can see them increasingly at large companies like Twitter, with many services that all depend on each other. Let's look at an example distributed system for something like web search. The user first makes a request to the front-end service, then that front end requests the web results service, which contacts the page rank service and returns results; the web results are returned back to the front end, which then gets the images, which uses visual rank, and so forth. Now suppose that a user is unhappy about search results being returned too slowly. How do you tell where the problem is? This is where distributed tracing comes in.

Distributed tracing lets you monitor and troubleshoot these distributed systems. It helps you discover latency issues, it helps you find out which services depend on each other, and it helps you find where the issues are occurring. It traces a specific request as it propagates through the entire distributed system. Distributed tracing tools nowadays still miss a lot of things, though, because there's more to performance than meets the distributed tracing tool: there might be other services running on the same server, there might be kernel bugs that cause performance issues, and even security patches like Spectre and Meltdown can have pretty significant performance impacts. So the question is, can we gain visibility into these issues via the kernel? That's our goal.

How did we approach this? We started with the Jaeger distributed tracing framework. It's a tracing framework built by Uber, released to the open source community fairly recently I believe, and it's a fairly mature system; it's actually used in production by Uber as their own tracing infrastructure. To this we added LTTng, a Linux kernel trace toolkit built to gain visibility into the syscalls and the kernel events that are generated while an application is running. When we combine these two, we get Skua.

Now let's back up and talk a little more in depth about how tracing actually works. As Andrew said, the goal is to follow an individual request as it propagates through the entire distributed system. You start with a front-end request, and the first thing the front end does is generate a context for this request. That includes a trace ID, a parent ID, a span ID, and whether or not this request is sampled. The trace ID identifies this individual request throughout the entire system; it's a unique identifier for the request. The parent and span IDs together are used to construct the causal relationships between the different services. In this example, the front-end request has no parent ID and it generates a random span ID; then, when it contacts the web results service, the span ID of the front end becomes the parent ID of the web results, and so forth with page rank. And when the front end makes additional requests to the image service and the visual rank service, their parent IDs and span IDs are set accordingly. The span IDs are randomly generated, and the parent ID identifies the parent span of that individual span.
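For a feel of how that context physically travels between services, Jaeger's default HTTP propagation uses a single header whose fields are exactly the pieces just described. The hex values and the service URL below are made up; only the header name and the {trace-id}:{span-id}:{parent-span-id}:{flags} layout reflect Jaeger's documented format.

```sh
# Front end calling the web results service, forwarding its trace context.
# flags=1 marks the trace as sampled.
curl -H "uber-trace-id: 7f3d0a1b2c4e5f60:1a2b3c4d5e6f7081:0:1" \
     "http://web-results.example.internal/search?q=devconf"
```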
So what exactly does a span entail? A span usually identifies a specific amount of work being done by an individual service, and it's marked by a start and end time. Additionally, the user space application can attach logs and different events to the span, so that you gain visibility into what that individual request was operating on at the time the span was generated. In addition, when the front end gets the request, it usually generates a random number and decides whether or not to sample that individual request based on that number. The reason is that if you tried to collect every span, all of the trace data would simply be too large; while it is possible to do this in development for testing purposes, in production a sampling rate of, say, one in a thousand is usually used. Now, if the entire trace is supposed to be sampled, then all of the different services report the spans they generate to a central span aggregation service; in Jaeger this is the collector service, which then stores them in a database that can be queried through a web front end.

Speaking of that web front end, this is what it looks like. You can see the top-level request, and the length of each bar is the duration that that individual span took; you can see the causal relationships between each of these based on the indentation of the services, and you can also see how different services operate concurrently, when each one starts and ends, and how long each one takes. So it's fairly easy to identify a bottleneck: if something is taking longer than you expect, you can simply look at the length of the bar and identify that as the bottleneck that caused your top-level request to take too long.

Now a little more detail on how Skua works. A user space application is already running; we attach a Jaeger client to it, which collects these spans and reports them to the Jaeger framework. This is already built as part of Jaeger, so we didn't actually have to change much in the Jaeger framework itself to make it work with our system. The Jaeger client we did have to modify, which we'll talk about in a little bit. Now, while the user application is running, it's going to be making a bunch of different syscalls into the Linux kernel, and additionally the kernel is going to be generating a bunch of different events; for example, a kernel event could be a scheduler event switching that process off the CPU and putting another one on, which usually happens immediately after a blocking syscall is made. LTTng already allows us to collect these syscalls and kernel events, using a set of kernel modules that LTTng provides. So we needed a way to propagate that Jaeger trace context into the kernel, and we did that by using procfs with our own custom kernel module. We store that data in the task struct, which is the thread-specific data structure the kernel uses to identify each individual thread; you can think of it as thread-local storage in the kernel. What we did is take the data from the task struct, which contains the context, and attach it to all the syscalls and kernel events that LTTng is generating as part of its kernel modules.
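For context, this is roughly what recording those syscalls and scheduler events looks like with stock LTTng from the command line; it uses the unmodified lttng-tools CLI, not the patched kernel modules the talk describes, so the events recorded here would not yet carry the Jaeger context.

```sh
# Create a tracing session and enable kernel-side events:
# all syscalls plus scheduler switches (the kind of events Skua tags with trace context)
lttng create skua-demo
lttng enable-event --kernel --syscall --all
lttng enable-event --kernel sched_switch
lttng start

# ... run the instrumented application / benchmark here ...

lttng stop
lttng view | head     # print the first recorded events from the trace
```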
We propagate that information back into user space using an LTTng adapter that we built, and then report it into the Jaeger framework using the conventional methods of reporting spans. So, to recap: first, the Jaeger client propagates its context into the kernel, which stores that information in the task struct; when the user application generates syscalls or kernel events, the LTTng modules read the data out of the task struct and pair it with the events that are generated; eventually those events are propagated back up into user space through the LTTng adapter, and then the Jaeger client, as well as our adapter, both report to the Jaeger framework. The change to the Jaeger C++ client so that it sends its context into the kernel took around 25 lines of modification to the existing client library. We treat each Linux kernel event as the next level in the span hierarchy: say there was a single span running for the visual rank service; each syscall that the visual rank service makes appears as another span beneath the visual rank span, and all of the kernel events that are generated, like RCU events, scheduler switches, allocations and frees of kernel memory, each become an event in the logs of the kernel span they belong to. Our modifications to the LTTng kernel modules, to tag each span with the associated context information, took around 80 lines of code, and our LTTng adapter took around 250 lines of code.

Now, we wrote several programs to evaluate how well we did. First we wrote a correctness tester: a small C++ program that just creates a bunch of threads and makes about 10 different syscalls, just to make sure Skua is actually recording the syscalls as expected. We found that it was, although it appears LTTng does not instrument a few syscalls that are called very often, like gettimeofday, and the events that happened during these syscalls were properly recorded as logs. We also evaluated performance using two different benchmarks, both running on the same machine with a sampling rate of 0.1%, under several scenarios: no tracing, where we just turn everything off; unmodified Jaeger, where we include the normal Jaeger client with normal span creation; our modified Jaeger client, the one that reports the context into the kernel whenever spans are created; LTTng without Jaeger, just to see the overhead that LTTng imposes; and finally Skua, where we combine both the modified Jaeger client and LTTng recording the syscalls.

The first such program was a small HTTP server written in C++. We used autocannon as our benchmarking tool: we send a million requests over 10 connections as quickly as possible, and then we measure the latency and throughput under each of those scenarios. Here are our two graphs, with the benchmarking scenarios on the bottom axis and the throughput or latency on the y-axis. The first bar is our baseline performance. Unmodified Jaeger introduces a little bit of latency but nothing on throughput. We also have Jaeger plus procfs, our modified client: in this benchmark it hurt throughput and latency by quite a bit. LTTng alone also has a moderate amount of overhead, on throughput especially. And then with Skua we combine all of those.
So in the end we have a 12% throughput decrease and about 200 microseconds of extra latency per request. The next benchmark we used was Fortunes; we borrowed the code from the web frameworks benchmarks. What it basically does is query for fortunes from a database, in this case Postgres, and it is intended to be a real-world application, simulating a real workload someone would run instead of a hello world example. We used a similar benchmarking process, but we ran autocannon twice, because it's Java, and we did it with 100 connections to exercise Spring's threading model. This is what it looks like after we run that benchmark: you can see each syscall appears below, for example for the Postgres query you see it doing a sendto and a recvfrom, which is expected, and you can see the logs that happened during the recvfrom syscall; for example, you can see it did a context switch and some RCU and datagram copying and whatnot. Again, here are the graphs: that's our baseline performance; adding Jaeger, and even our modified Jaeger, doesn't do anything to performance; LTTng decreases throughput a tiny bit; and Skua ends up with about a 6% performance decrease in terms of throughput and a little bit of extra latency on average.

So, to discuss our performance overhead: the unmodified Jaeger overhead is negligible. LTTng decreases throughput and increases latency, since it's logging all that data; we could improve this by only enabling some of the instrumentation points, because for these benchmarks we enabled everything, all syscalls, all events. Our modifications to Jaeger cause additional latency depending on the benchmark, since each Jaeger library is written differently and performing syscalls is quite expensive, even when you're only doing it for one in every 1,000 requests. And LTTng plus our adapter ingesting the kernel events also performs more work, so that performance degradation is expected. A note about the tiny HTTP benchmark: each transaction is less than a millisecond, so the latency impact appeared to be extremely large even though it wasn't that bad.

All right, looking forward, one of the things we hope to do is improve our performance even more. Currently, with the Fortunes benchmark, we had around a 6% decrease in throughput and a very small increase in latency, but we think we can get that down even further, and Andrew outlined a couple of the ways we plan to do that on the previous slide. Another thing is to simplify the installation process of this entire system. Currently we require modifications to the Linux kernel, so you have to recompile a custom kernel; we have a modified version of the lttng-modules, the kernel modules that are part of LTTng, so those are custom as well and need to be recompiled from source and installed properly; and then there are the modifications we made to the Jaeger clients. Basically, putting all of these together to make the installation and usage process a little simpler would be nice. One of the other things we are looking into is adaptive sampling reconfiguration: a talk that Lily gave earlier at DevConf discussed how you can modify the tracing parameters in real time, to adaptively add different instrumentation points and logging as you detect that you need them. So if there's a specific service that seems to be taking too long or is too slow, then you can add additional tracing to that specific service in order to get better insight into that specific thing.
We were thinking you could do a similar thing to decide whether or not to enable Skua: you would only enable Skua if you detect that a specific service is taking too long, and then you gain insight into the kernel-level goings-on of that service, in addition to the user space application itself. Another thing we are looking into is attempting to trace Ceph with Skua. Manio is doing some great work replacing the existing Ceph tracing framework with Jaeger, and we were thinking about whether we could integrate Skua into Ceph, basically to show another example that Skua can trace real-world applications, and additionally to measure how much of a performance impact it has when doing so.

So in conclusion: we can use distributed tracing to monitor and debug all of these complex distributed systems; however, current distributed tracing frameworks miss kernel information, so we developed Skua to integrate the kernel-level data from LTTng with the information that Jaeger is already collecting. Skua has some impact on throughput and latency; while it's not that bad, and it could be used for some applications, it may be too large for others, so whether it's applicable to production systems depends, as of now. Our code is open source, you can access it there. Really quickly, we'd like to acknowledge Raja, who mentored us through this process, and the PRIMES program. All right, here we go, we would like to take some questions; here's the microphone, it's over there.

Thanks for presenting that, that was really cool. Do you have any plans for contributing this back? So, our modifications to Jaeger are fairly minimal; the thing is that they actually do have a fairly significant performance impact as of now, because they force the Jaeger client itself to propagate its entire context information into the kernel for every sampled request. I don't think we should directly contribute it back to the Jaeger client, but I think it would be worthwhile to maintain a set of patch files, say, that can be applied to the Jaeger client to enable the Skua tracing integration. So while I don't think we should contribute it directly back, we would still build off whatever they're working on. I would encourage you to reach out to them and discuss it, because I think they'll be very interested.

Any more questions? Great talk, thanks so much. You're recording system calls into the kernel, and a service writer wouldn't necessarily know what the meaning of those system calls is; do you have any tools to help break out whether it's the networking stack that was entered, or scheduling, or file system access, etc., to help those kinds of users? So, we could do something where we group different syscalls into categories like networking or file system, such that you can see all of your networking syscalls together and then all of your file system calls together and so forth. We could do that; as of now we're just pulling the syscall name, or the kernel event name, directly from LTTng, and we're not actually doing any analysis on it, we're just sending it up to the Jaeger framework. So right now we're not doing any introspection, but we absolutely could. Okay, thank you very much.

As others said, thanks for the cool talk. I had one question about LTTng versus eBPF: did you look at what modification would be needed to work with eBPF instead of LTTng, and are there pros and cons, shortcomings, or reasons that you picked one versus the other?
We looked at eBPF briefly. The thing with eBPF is that, as I understand it, it's mostly used for networking-related work, so you could absolutely capture syscalls and other networking-related events using eBPF, but in order to capture the grand picture of all of the different events going on in the kernel, we opted for LTTng. We were actually considering using eBPF to propagate the span context from kernel to kernel, so that the user space application wouldn't have to send its trace context with its request; we could do that between kernels using, say, an extension to TCP or something like that. However, it seemed like it would be a little too complicated, and it would be difficult to represent the causal relationship between those two kernels properly using the current Jaeger framework. So we looked into it, but I'm not sure it's exactly right for what we were trying to do. Got it, thanks. All right, thank you.

Good morning, ladies and gentlemen. It is not every day that I have the privilege, the honor, dare I say the golden opportunity, of introducing a talk such as this: a talk that has world-shaking implications, a talk that reaches into the depths of our souls. You'll laugh, you'll cry; I can't even begin to describe the excellence of this talk. Please, if you would, please welcome to the stage Red Hat Senior Research Scientist Jacob Kozol.

Esteemed colleagues, gracious guests, welcome to the reprise of the DevConf 2018 Easter egg talk: The Hot Dogger, a comprehensive scientific experiment. Hot dogs are a universal part of life. Everybody eats them: celebrities eat them, you go to a ballgame and you have a hot dog, presidents all eat them, in varying manners, but everybody eats a hot dog. Traditionally we cook our hot dogs with the core five: you bake them, you boil them, you pan fry them, you microwave them, or you grill them. But thanks to a significant archaeological discovery, a fellow senior scientist at BSS Labs uncovered a revolutionary way of cooking hot dogs. I present to you the Presto Hot Dogger: cooking six hot dogs in just 60 seconds, coming in its original packaging, unopened box. We are thrilled to have this phenomenal discovery in our presence at BSS Labs.

Upon opening it, we saw this gleaming metal case holding six hot dogs; as you see, six prongs, six hot dogs, all cooked in parallel. When we opened it up we found the instructions, very, very explicit instructions. First, separate the cover and base; you need that cover to reduce splash, very revolutionary, because if you're cooking in a pan that splash can stain clothes and get all over the place, but no, we have a cover here. Second, secure the hot dogs: we have these electrodes that you put your hot dogs on, and the electricity is going to flow through them in parallel. Three, insert the cover back on, got to watch out for the splash. And four, plug it in. No on/off switch, we don't need that; if you want to cook a hot dog fancy you can use your smart stove or your smart microwave, but no, we put it right into the wall, 60 seconds, they're done. Now, careful to note: the Hot Dogger is designed to cook frankfurters only, not designed for sausages. Frankfurters.

Now, we decided to use four brands with a variety of different meats: your Oscar Mayer classic uncooked wieners (turkey, chicken, and pork), Ball Park franks (chicken and pork), and then we mix it up with Lightlife Smart Dogs (soy protein), because we want to do the veggie dog; a lot of households won't be cooking meat, and we want to see how well the Hot Dogger can handle modern veggie dogs.
And of course the Hebrew National, straight beef; we wanted a single-source meat product, because we wanted to see how that would affect cook times versus the double or triple meat sources. Notice the packaging: very horizontal for the Ball Park dogs, very vertical for the Hebrew National. Hebrew National packed vertically, Oscar Mayer with a vertical logo.

Of course, before we can plunge into the depths of advanced scientific methods of cooking hot dogs, we have to go back to our roots, the ones passed down through generations. So we start with the pan fry. We let it preheat for four minutes and fifteen seconds, toss the hot dogs on, and then cook for three minutes and ten seconds. This led to a very even slope, going from around 65 to 70 degrees Fahrenheit and finishing around 158 to 163, with all of the hot dogs cooking very linearly, very evenly, close together. We then moved on to the microwave. No preheat time here, so it's quicker; we cooked for a hundred seconds, and this led to a bit of a different skew: some of the hot dogs cooked up to about 160 to 165, some were a little lower, around the high 140s to 150 degrees Fahrenheit. Then we move on to the oven: preheat for 15 minutes. This is variable, some ovens will preheat to 400 degrees a little faster, more like 10 minutes, but we did 15 minutes to make sure we were at exactly 400 degrees Fahrenheit, because we don't want to undercook the dogs. Cook for exactly 15 minutes; again fairly even for most of the dogs, going from the 85 degrees Fahrenheit range up to around 180. However, the Oscar Mayer, maybe due to its smaller size or maybe due to the triple meat product in it, cooked much faster and to a higher temperature. Moving on to the boil: preheat for 14 minutes, which is how long it took my stove to get to a raging boil, and then cook for exactly five minutes. The one benefit of boiling is that you can boil quite a few hot dogs one after another, but that preheat time slows it down, and this was the most variable of the cooking methods; it seems the different meat sources really affect how well a hot dog heats up in water, with the Ball Park franks barely getting into the hundreds, coming in around 120, whereas the Oscar Mayer again hit a very high temperature, easily cooking up to the 150s. And then the traditional method, the grill. It takes a little longer to preheat, you have to get your coals going and make sure the grill preheats itself, and then we toss them on and they cook for eight minutes, to get to that nice charring point where you get that crunch on the outside but the juicy, well-cooked inside. And this was back to fairly even; it seems that when you have an external heat source such as a stove or grill, you can get those hot dogs to cook pretty uniformly.

And now we move on to the extremely advanced scientific method of electrocuting hot dogs. First we decided to measure their resistance at base temperature: Oscar Mayer and Ball Park franks came in relatively similar at 2.1 megaohms, whereas the Lightlife veggie dogs and the Hebrew National were a little lower at 0.25 megaohms and 0.5 megaohms. Now you can see the hot dogs lined up in parallel, and an interesting note: a lot of times when you cook a hot dog, you want that slightly curved shape, with maybe a seam down the middle where it splits. You don't want too many juices coming out during the cooking process, but once you put that hot dog on a bun, you want the juices to come out and make the bun a little bit moist, so that you can really taste that grease.
So if you see here, we get that nice curve; that U shape provides perfect bun laying. OK, so the cook method: we start by cooking them for 60 seconds, per the instructions; however, we found that for certain hot dogs that didn't achieve the desired temperature, so we increased it to 80 and 100 seconds, finding that 100 seconds is actually the ideal time to cook a hot dog in the Hot Dogger. This might be down to our current limitations on electricity; this is an ancient product from back when there was less regulation on how much current and voltage you could have flowing through your wall outlets. The Hebrew Nationals were a real standout here: a slow initial time, that beef was taking a while to electrocute up to temperature, but once we hit the 60-second mark it started to spike up. The veggie dog and Oscar Mayer also had a slightly slower heat-up, but achieved that final temperature around 160 degrees Fahrenheit.

Now, as we do this experiment, it's very important to get feedback from our advanced panel of culinary experts and electrical scientists, to see how these hot dogs actually performed under this electricity. For the Oscar Mayer we got "technically warm" and "warmer in the center than the edges". For the Ball Park franks we got "kind of split", and that's when we learned that the hot dog is superior with that split you want in your bun; also "not very hot", which turned into "this one's pretty hot", "surface temp was high", and "cold and dead on the inside", maybe we can all relate to that one. The Lightlife veggie dogs: this is the one I was personally most interested in, because a product designed around a frankfurter is generally going to be designed around the meat, but this is a meatless product and it actually performed quite well: "surprisingly warm", "tofulicious", "synthetic smoke taste", "a dope mouthfeel". And finally the Hebrew National, that single-sourced beef: "this one's better, better than the last", "more evenly cooked", "smells like a hot dog", "this one tastes like a hot dog that's actually cooked", and finally, "I'm a really bad vegan".

And now the comparison we are all here for. Here is the scaled-down version of the cook time in the first hundred seconds. The Hot Dogger goes from a low temperature around 70 degrees Fahrenheit all the way up to that 160 to 165 range within that hundred seconds. The microwave also did pretty well with a quick cook time, coming in just a little below the Hot Dogger, and then there's the drop-off as we go to pan frying and grilling, which both did a pretty decent job, although the grill had the benefit of the hot dogs starting at a higher temperature; boiling and baking were both very slow to heat up the hot dogs.

Now, we talk about this char: you have a grilled hot dog and you want that char on the outside, you want to bite in and feel that crunch on the outside and then the juice on the inside. But what if you had a slightly softer outside, still got even more juice on the inside, and then got that char on the inside, so you bite in and you're surprised by the amount of texture you get on the inside? Well, with the Hot Dogger, due to the inside-out cooking method, we are able to get a charred inside of the hot dog; you can see here there's that nice split and a bit of charring and texture on the inside. And now, what if you have a larger hot dog? What if you decide to go really avant-garde and throw in a sausage? Well, the Hot Dogger can actually handle those larger products, because you don't need to keep them parallel; you can have them go in a more weave pattern, where you shift them over.
You won't be able to cook quite as many hot dogs, but if you're trying to experiment, maybe you're cooking for yourself, you can put them in at an angle and cook larger hot dogs. As you can see here, these ones got quite large at the end, very swollen and bloated, but they cooked fine.

And now, an experiment that was very important to all of us at BSS Labs: determining how long it takes to get the hot dog to pop, or as we prefer to call it, TTP, time till pop. We started with the Oscar Mayer and a veggie dog. At 58 seconds we heard a big popping sound; at 75 seconds we got that electric sizzling noise; at 100 seconds, a strong burning smell, like paper on fire; at 128 seconds it smells like hell; at 150 seconds, a lot of smoke and some light charring smell. This is the point where we unfortunately had to call off the experiment: smoke coming out of your machine is not a pop, and we didn't want to burn the hot dogs, we wanted to explode the hot dogs. The veggie dog ended up feeling very weird at 171.2 degrees Fahrenheit, and as you can see here, there's some serious warping of that skin texture, so I would highly refrain from cooking veggie dogs for 120 to 150 seconds in the Hot Dogger. And as you can see here, the inside has been riddled by the electric current going through.

Now, our brand verdict, thanks to our esteemed panel of hot dog judges, came down to the Hebrew National, with that single-source beef, as the best hot dog, followed by the Oscar Mayer, and then the veggie dog beating out the Ball Park. So those Lightlife veggie dogs are a solid choice if you want a veggie dog. And now I want to bring your attention back to that packaging: notice that our two losers, Ball Park and the veggie dogs, have horizontal packaging, whereas our winners, the Hebrew National and the Oscar Mayer, are very vertical, the Hebrew National actually packaged vertically and the Oscar Mayer with that vertical branding. So if you're unsure, because there are so many, you must go for the vertical packaging; it's just going to be the best and most reliable hot dog.

So now we look at the quality of the various cooking methods. Grilling is number one: if you want a really high quality hot dog, we recommend you put the time in, preheat your grill, and cook the hot dog on the grill. But not everyone has access to a grill all the time, sometimes it's winter, so you can also pan fry them, and that comes out with pretty high quality: you get that crispy outside, juicy inside. But following those two standard cooking methods, the Hot Dogger performed extremely well; it allows you to cook six hot dogs in 60 to 100 seconds with very good quality, not the best, but very good quality. Baking, boiling, and microwaving came in below, with microwaving actually being the least desirable choice. Now we look at the verdict on time. The fastest we gave to the microwave, because there's no cleanup; it took about the same cook time as the Hot Dogger, but with the Hot Dogger you do want to clean off those metal spikes. Then the pan fry came in a close third, with not that long a preheat time, whereas boiling, baking, and grilling all took quite a bit of preheat time and quite a bit of cook time. And now, the quality-to-time ratio is where the Hot Dogger really shines: a good quality hot dog with a fast cook time. Pan frying and grilling come in next, because the grill has the highest quality but takes a long time, and pan frying is a nice middle balance: if you're just coming home from a long day of work and you just want to fry up a hot dog, it's going to be good, because you're providing for your family or yourself.
And then you can microwave: if you're in a rush, you can microwave your hot dog and it will turn out satisfying. I recommend against boiling and baking them, because the other methods you can resort to are either faster or higher quality.

Now, this is one very small step; there is so much more progress we can make in hot dog cooking and in our research with the Hot Dogger machine. Some alternative possibilities we want to look into: what other foods can we cook? A banana in the Hot Dogger? What about a hot dog wrapped in bacon, can we cook that in the Hot Dogger? Sausages, of course. We also need to learn more about hot dog cooking methods: what if you sous vide your hot dog, how does that compare to the Hot Dogger? What if you sous vide your hot dog and then cook it in the Hot Dogger to get that crisp? We also think we can push the machine a little: we can put higher voltage through it, pass a higher current through that hot dog, and really see what we can do; this is the pop we're all looking for. And stay tuned, because a fellow senior scientist has found an alternative model of the Hot Dogger, released later, that may produce even better results, so we must follow up by testing the different models of the Hot Dogger. This is sponsored by GWS, David Cantrell, David Shea, and Sophia Fondelin. Thanks to science, and thank you all for coming here today, and just remember: the mustard indicates progress. Any questions for our senior research scientist here?

Wait, let's get on with it. Can you guys hear me? So I thought it would be a talk, but it's going to be a conversation now, so let's get it started. My name is Samvaran Kashyap Rallabandi, I go by SK; on the Red Hat internal network I have an IRC ID of SK, and on Freenode I am Samvaran. I work for the Continuous Productization team, and my work mostly revolves around OpenShift, building CI pipelines, and building tools to optimize the pipelines, etc. So today, as you can see, the talk is about the OpenShift CI pipeline for dummies, and I don't assume any knowledge of OpenShift or containers, or even of software per se, but I guess you might know about software since you are at DevConf. Our agenda is to talk about what software is and what the problems with software are; version control and why we need it; common terminology which comes up when you start with CI pipelines, like continuous integration and continuous delivery; how we distribute code on PyPI; and how to build your pipeline using Jenkins on OpenShift. This will be followed by a small demo which was actually recorded over 20 minutes, but I will try to fast-forward it for the talk.

Going ahead with software: as this is a conversation, do you guys have any definitions of software? Yeah, that's from Wikipedia, it doesn't count. In my opinion, software is just code, code written in a file, which runs on my computer and can run on any other device that has computing power; it can be my washing machine or it can be a television. Thanks to Samsung and the other tech companies, software is running everywhere now. So that's how software is defined: finally, it comes down to a piece of code, a set of instructions that you tell a machine to work on. And these are the problems with software. Usually, in our software world, the first problem is: does it install properly? And the second problem: does it install properly on my machine?
Because you might be working on MacBooks, Fedora, or any other machine. But does it work, finally? And does it really work? There is a subtle difference between "does it work" and "does it really work", because each person using a piece of software has their own use case. For example, with a paint program, a child would be painting random diagrams, while a professional has other use cases like clipping or cropping; so "does it really work" is about testing extensively whether the software works for many use cases or not. And then: where do we store the software, how do we store it, and how do we verify everything mentioned in those questions?

Another nightmare we see with sysadmins: are any of you sysadmins, by any chance? Okay, have you ever gotten "it works on my machine" from one of your customers? That's what always happens to me, because I am one of the maintainers of a project called Linchpin, and most of the time the GitHub issues we see say "it doesn't work on my machine"; some people say it works on Fedora but it doesn't work on CentOS, and we want to make it work on CentOS. That is the usual nightmare.

The next big thing is: where do we store the software? There are many options. During my undergrad days I used to email my code, which I am totally embarrassed about, because we didn't have GitHub, or maybe I just wasn't aware of version control systems at that moment; we used to mail code using Gmail, but thanks to Gmail they have started blocking code in zip files these days, so that is not an option anymore. Later on we started using traditional storage providers like Google Drive, Dropbox, and OneDrive, but the inherent problem with all of these is that we couldn't maintain versions of the software. For example, I made a change 10 days back and I want to get that change back right now; with Google Drive, every time I upload my files it overrides them (unless it is Google Docs, which has recently introduced version history), so that didn't feel like a proper way to store my code. Then came Git. I just wanted to share this manual page of Git, where it is described as "git - the stupid content tracker". It's not stupid anymore; it's the best content tracking, or version control, software I have ever experienced. How does it work? It has many, many features, but all I do is memorize four or five commands: git pull, git add, git commit, git push. These are the commands you need to know in order to maintain your software with Git. There is an interesting joke definition out there that the basic idea of version control is about "homomorphic endofunctors mapping submanifolds of a Hilbert space", which I don't even understand; I'm not sure the creators do either. The advantages of version control, as I found them, are that we get a continuous backup of the software, so we can revert it, and we can literally time travel from one particular checkpoint to another checkpoint, not just back 10 days.

These are the Git 101 basics. There will be a remote repository, on GitHub or somewhere on your own hosted server, and you have your working directory; you use the git add command to add changes from the working directory to the staging area, then you use git commit to actually finalize that version of your software, and you push it back to the remote repository with git push. That's how you do Git.
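For completeness, the handful of commands just mentioned look like this in practice. The repository name, file, and commit message are placeholders; the commands themselves are the standard Git workflow described above.

```sh
# Get the latest changes from the remote repository
git pull origin master

# Stage your edited files from the working directory
git add myscript.py

# Record a new version (checkpoint) of the software
git commit -m "Fix installation on CentOS"

# Publish that version back to the remote repository
git push origin master
```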
to the extent that we can revert the software not just ten days back; it is literally time travel from one checkpoint to another. The Git 101 workflow looks like this: there is a remote repository on GitHub or on your own hosted server, and you have a local working directory; you use the git add command to move changes from the working directory onto the staging area, git commit to actually finalize a version of your software, and git push to send it back to the remote repository. That's how you do Git, and once you have a reliable backup of your software you always have the confidence to move on to the next steps, like installing the software or fixing other problems with it. Going ahead, here is the terminology you need to know about CI pipelines before we get into the actual topic: one thing is containers, and the other terms are continuous integration, continuous delivery and continuous deployment. There are many "continuous" things these days; our team at Red Hat is called Continuous Productization, and there is continuous improvement coming up, so there are many things you do continuously, and we will go through the definitions of some of them. We should also know what Jenkins is, what OpenShift is, and the different types of configurations in OpenShift. Coming to the continuous things: the problem that has bothered everyone for decades is that whenever software is released there is a stream of enhancements, bug fixes and new features coming in, so the basic trend all software follows nowadays is continuous integration, continuous delivery and continuous deployment. Continuous integration is where you keep merging every commit a developer pushes and continuously create a build of the software that stays aligned with the actual production repository. We call it continuous integration because every piece of software nowadays has lots of dependencies and pieces that you have to fit together like Lego, and if one of those pieces breaks you want to know; continuous integration is the process that makes sure everything still works together. The other two, more ambiguous, terms are continuous delivery and continuous deployment, and there is a subtle difference between them. Delivery means reliably releasing your software: when I say software X has been released, it is the most stable version and I am going to deliver it to all the distributions out there so people can install it. Continuous deployment is about a running service, like an Apache server or a Python-based server, where you continuously update the existing packages while the software is running, with staging, production and test environments all together, making sure the deployment runs according to the tests that are run against that particular application. Going ahead, why continuous? Because releasing a piece of software doesn't mean it is perfect;
it is a process of evolution, driven by all the test mechanisms and everything else, that turns it into quality software people can use. The basic reasons for all these continuous practices are to ensure standard practices, to deliver software at a faster rate and to reduce the time to market, and if we fail, we would like to fail consistently so that we can fix it in the next version; that is the important goal of continuously improving existing software. Going ahead, this is the CI/CD loop every software project now follows: you plan and code a piece of software, build it, test it, release and deploy it into production, let customers use it and monitor it, and depending on the feedback you get, you plan again, re-code, rebuild and enhance it. So what are containers? Containers are the buzzword that came back into the picture after a long time; they are isolated user spaces. If you have a very big machine, say 128 GB of RAM and 8 TB of disk, and you want to share the whole machine between different processes, you need to create some isolation. Previously that was done with virtual machines, but the problem with virtual machines was that they are very heavy: each one runs its own complete operating system and a lot of unnecessary software just to provide an isolated environment. Containers lessen that burden by creating small user spaces using a kernel technology called cgroups; I am not going to go into the details, but it simulates the whole server environment as a process and gives you the feel of a complete operating system inside a container. The result is that where people used to say "it works on my machine", they now say "it works on my container". This is an example Dockerfile used to create your own container: it pulls the Fedora image down from the main repository, runs a dnf install command to install software on top of that image, and sets the working directory; a rough sketch of such a Dockerfile follows below. If you run the container without the last statement, it would start and then stop after a few seconds, because a container needs something to do: each container exists to run a command, a web server, or whatever. If you want the container to keep running, you give it a command that blocks forever, something like tail -f /dev/null, which works like an infinite loop. And finally Jenkins: Jenkins is a tool people use for all kinds of automation, which it achieves by means of plugins, shell scripts and Jenkinsfiles, creating scheduled builds, triggers, notifications, automated scripts and pipelines. It also does multi-node delegation: if there are different operating systems you want to test your software on, Jenkins is the tool you go to, because it can delegate, for example running a script on a Fedora slave, and it has a logging mechanism that records every instruction you run against a particular slave.
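Before moving on, here is roughly what the Fedora Dockerfile described a moment ago might look like. This is a minimal sketch, assuming a placeholder package and working directory rather than whatever was actually on the slide:

FROM fedora
# install some software on top of the base image (the package here is a placeholder)
RUN dnf install -y python3
# assumption: set a working directory inside the image
WORKDIR /tmp
# keep-alive command so the container does not exit immediately after starting
CMD ["tail", "-f", "/dev/null"]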
In short, Jenkins is a tool that automates things with the help of scripts, plugins and job files. How does Jenkins come into the picture with OpenShift? Jenkins, as I said, builds software or does whatever you want: it takes code from a Git server, it has the ability to build a Docker container, and it can run it on OpenShift, so OpenShift effectively comes with a CI server that can build and deploy containers for you. Nowadays, to make sure software is always running, we deliver it inside a container instead of an RPM package or some other script or repository, because containers have become a reliable means of software distribution thanks to their isolated nature, and you can be reasonably sure they will work at any point in time using Docker. OpenShift recently changed its name to OKD, the Origin Kubernetes distribution; it is built around Docker containers and the Kubernetes platform, which manages different kinds of containers and can also load-balance across them. OpenShift has many configurations you can use to build containers, deploy containers, store metadata about containers, and also create pipelines and pods, and we will see those in detail. In OpenShift terminology, each configuration is a template for a specific task. A build config is where you tell OpenShift how to build the image of a container. A deployment config is about how you deploy, or run, the container, with all the load balancers, health checks, triggers and so on. Config maps are something we use to store credentials or other metadata related to the container, which is made available as environment variables inside the container at run time. OpenShift also supports pipelines now (it used to be a preview feature), where a series of steps can be run through Jenkins, using containers, with an OpenShift domain-specific language and a separate OpenShift plugin, so you can drive a whole deployment or release process with user-defined steps inside a Jenkinsfile. And a pod is simply a collection of containers you want to run standalone, rather than as a deployment; the advantage of a pod is that it is the best mechanism for testing your container: you create a pod file, paste it into the OpenShift environment, it runs as a container, and you can delete the pod at any time (a minimal example follows below). A deployment, by its default configuration, is replicating in nature: even if you accidentally delete it, it comes back, so deployments are what you use for a production environment. Finally, how do we define all of these deployments? The best part is that OpenShift talks YAML: instead of long JSON or XML definitions, OpenShift keeps things simple by defining everything in YAML.
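As an illustration, here is a minimal sketch of the kind of pod definition just described, assuming a stock Fedora image and the keep-alive command mentioned earlier; it is not the file from the talk:

apiVersion: v1
kind: Pod
metadata:
  name: fedora-test          # hypothetical name
spec:
  containers:
  - name: fedora
    image: fedora:latest
    # keep-alive so the pod keeps running until you delete it
    command: ["tail", "-f", "/dev/null"]

Pasting something like this into the web console, or running oc create -f pod.yaml, gives you a throwaway container you can poke at and then delete.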
First, though, if you want OpenShift running on your local machine to try these things out, there is a short instruction set for installing it on your current Fedora machine; if you are on Windows or any other distribution, there is a set of documented instructions on the OpenShift website. To run OpenShift locally for testing, you install the libvirt dependencies, make sure your current user is added to the libvirt group so it can manage KVM and the virtual machines, download the MiniShift binary from the MiniShift GitHub repository, and start it with the minishift start command. Next is an example build config for building the Dockerfile mentioned before. What most people do is copy-paste a working YAML file and edit it while understanding it, and build configs are fairly intuitive: if you read through the whole thing, you can understand it from the key-value pairs. For example, it uses API version v1; it is a template, labeled with the template name fedora; and it has various annotations, which can safely be ignored because OpenShift generates them even if you don't write them. Each config has its own objects: here we use an image stream to hold the built Fedora container, and that image stream is referred to as the output for the build, so whenever a container is built it is pushed to the OpenShift image stream. The section called source is used for S2I, source-to-image, builds from Git: if I give a GitHub repository URL as a parameter, OpenShift can go and pull that repository and put it inside the container, or it can refer to a Dockerfile hosted remotely on GitHub or any other external Git server. It also supports different strategies; here we use the Docker strategy with noCache set to true, which means that whenever a build request comes to OpenShift, it builds from scratch instead of reusing the cached Docker layers or previously run steps. You can also define different kinds of triggers, so OpenShift can do things like rebuilding whenever someone pushes to the repository or a new commit is created. For more information, refer to the OpenShift documentation, which is much more detailed; this is just a basic example that creates a Fedora builder container using a build config (a rough sketch follows at the end of this part). Going ahead, you can pass parameters into a build config using the parameters attribute, which is nothing but key-value pairs handed to the container while it is being built. The next part is the deployment config: as I said, deployment configs are replicative in nature, so if anything is accidentally deleted in the environment it is recreated, and the customers using it are not affected. And then there are config maps which, as mentioned before, are mostly used for passing data or credentials into the containers as environment variables.
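To make that walkthrough concrete, here is a minimal sketch of such a template. It follows the v1 template layout the speaker describes, with a placeholder repository URL and parameter; it is not the exact file from the slides:

apiVersion: v1
kind: Template
metadata:
  name: fedora
  labels:
    template: fedora
parameters:
# key-value pairs handed to the build, as described above
- name: REPO_URL
  description: Git repository to build from (placeholder value)
  value: https://github.com/example/fedora-builder.git
objects:
- apiVersion: v1
  kind: ImageStream
  metadata:
    name: fedora
- apiVersion: v1
  kind: BuildConfig
  metadata:
    name: fedora
  spec:
    source:
      type: Git
      git:
        uri: ${REPO_URL}          # or point at a repository that holds a Dockerfile
    strategy:
      type: Docker
      dockerStrategy:
        noCache: true             # always build from scratch, no cached layers
    output:
      to:
        kind: ImageStreamTag
        name: fedora:latest       # built images are pushed to the image stream
    triggers:
    - type: ConfigChange          # rebuild when the configuration changes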
Finally, the pipelines: a pipeline is where you declare, as I said, the different stages of a Jenkins run, so that Jenkins can pull a Jenkinsfile down from an external repository and run those steps inside containers. The example pipeline looks like this; as with build configs, it can also pull from remote Git repositories. This sample file uses the Groovy DSL, where you declare the different stages of building a container: in this example we have stages to build, deploy a container, wait, clone the source code, install it, test it, start building another container from the same source code, and create a release by running commands such as twine, or any shell script; finally there is a cleanup stage, so once the whole job is done you clean everything up. And as I said before, you can run pods as individual containers, without anything related to deployment configs or build configs, and this is how you do it. So let's create a project, create a build config, create a pipeline and start building. I have a demo that simulates a whole distribution pipeline, using a Python package called gummy bears which is already in a GitHub repository and does nothing: if you run gummy bears, it just prints "hello there". It is a PyPI package, so let me start the demo. As mentioned in the examples, this is one of the Dockerfiles I have been using to create the whole pipeline release, and the steps in the pipeline build the package, then test it, and deploy it to the PyPI repository. This is the OpenShift environment: there is an option to add YAML to the project, where you just copy-paste the build configs, and it creates the images, the deployment configurations, and also the pipelines. Now we create the pipeline, again just pasting the YAML files and changing the parameters accordingly, and once the pipeline is created we start it. In this case the pipeline immediately failed because the credentials were not available, so I created a config map holding the credentials and started the pipeline again; let me fast-forward a bit. As you can see, it is currently building inside the Jenkins container; the container is built from scratch, so it starts with dnf update, installs the packages and installs the whole Python package. You can see there is an error here, because there was no change in the PyPI package, no change in the repository, so I created a commit, bumped the version of the software and re-ran the build, and once the build was re-run it uploaded to the PyPI repository again through the pipeline. There you go, we got version 0.0.3, and that's it. Any questions? As for where the Jenkins comes from: OpenShift has an image catalog where the Jenkins images are already available, so once you deploy the Jenkins image it automatically detects the pipelines, and it comes pre-packaged with the OpenShift plugin that identifies OpenShift resources from within Jenkins. That's it, any questions? You're welcome.

Good afternoon, ladies and gentlemen, thank you for coming to the 1:40 session. Next up we have Sushil Kulkarni, engineering manager at Red Hat, and Ani Slaghilam, sorry for the
mispronunciation, who is a software engineer at Red Hat. They are going to be talking about catching network regressions using LNST. Good luck, guys. Thank you. I think I know some people in the audience, but how many of you work in networking development? How about CI/CD? A little bit, probably. Okay. This talk is basically going to walk you through a tool, a framework, called LNST that we've developed; I work in a networking group at Red Hat, and so does Anis, and we will show what this framework does and give examples. Here's the agenda: I'll first talk about why we built LNST and why it was needed in the networking team, then what LNST is and what it is capable of doing, then we'll give an example of how you can set up a test, or topology, using LNST and how it reports results that you can interpret and use for catching regressions, and at the end I'll tell you what's coming up in the future for LNST. Some time ago an engineer in the networking group started writing regression tests for bonding, and he soon found out he had to redo a lot of things when he started testing teaming, a similar technology. That's where the need for a framework arose: it would be important to have a universal framework for running networking tests that could be easily extended and that was really consistent, so that you could run the tests across releases or cycles of nightlies, compare the results, and see whether somebody had introduced a regression. It was also important to make it easily extendable, because you don't want something you can only script against; we wanted a framework people could add new functionality to, so you get newer and newer topologies as people add more features and functional items. There is also something called test recipes, which Anis will talk about in a bit: we wanted something in which you could describe your topology, so that the framework could create that topology for you and run the test. The reference to CI/CD is another motivation: at some point we found a bug that we could have caught really early in the development cycle, and if we had something a developer could run at the developer level, that would be great; the developer, or the LNST team at Red Hat, could run the tests and catch regressions well before the very end of the release. These are some of the reasons this framework came into being. The original engineer at Red Hat started it, there is now a team at Red Hat behind it, and the whole thing is upstream; the front page has the Git repo on it, so you can go and look at how you can do things like this. I'll hand it over to Anis, who is going to cover a lot more than me. Thank you. I've been with Red Hat for a little over four years and I've been using LNST since then. What is LNST? It is an abstraction, a collection of programs that help developers ease their work: a tool written in Python with a set of tools and definitions that reduce the workload of the
developer. It is automated testing: as Sushil mentioned, when Jiri Pirko was working on bonding and teaming he had to do a lot of things by hand, and automation helps eliminate the human error, the human factor, by using the same commands in the same sequential order over and over, which keeps the test steady. Portability: you write it once and the idea is to use it multiple times; it doesn't depend on any special hardware, so the hardware you test on could be used by another developer, and on different hardware you just use the same test again. The fact that it is abstract, portable and extendable makes it really easy to use and saves the time of redoing or reconfiguring tests over and over. So what can LNST do for you? It helps you set up your environment faster and more easily, and when I say it can be used by developers, it can also be used by hardware people and firmware people. You can test different topologies on the fly; it has a library of pre-designed, pre-configured tests, and you just pick which test you would like: bonding, teaming, virtual, guest-to-guest, there is a wide variety. You can test functionality, you can test throughput, and you can use add-ons like IPsec and MACsec; it is very extendable. It has a feature where you keep a pool of machines in your lab, you tell LNST what you want to test, and LNST checks that pool, sees which configuration is good for you, and uses that set of machines. It logs tests with timestamps for debugging, so in case you catch anything or see something odd you can always go to the logs, which are very detailed. And most importantly, it cleans up after itself: it sets up the environment to do the tests, and once it's done it flushes the NICs and returns them to their original state. This is really just a simple overview of how it looks in our tests: we use Beaker, an open source program that controls the systems for us and does the provisioning, what kind of OS I would like to use, and it is pluggable into Jenkins, which is also an automation framework. When a kernel build is ready, Jenkins tells Beaker the build is ready, go install these machines with this kernel; Beaker does the installation for you, and once it's done it kicks off the LNST test. The blue line and the green line in the diagram are totally different networks: the blue one is the control network, how you SSH or VNC to your systems, and the green network is where your test traffic runs, your pings or your iperf runs, whatever tool you are using. So how is LNST set up? In chronological order, and for the sake of time, we will assume we have two systems that already have a Linux OS installed, and I have access to these machines through a NIC that is not in this diagram, so I can SSH to them and install whatever I want, RPMs and all that, and these two NICs are the ones I am going to test, either after a driver upgrade or a firmware upgrade. Then I install the LNST slave on both machines (you can install it with dnf install lnst-slave), and you only need one controller; the controller can be on your laptop, on your desktop, or on one of the test machines. Once you kick off LNST, it goes and finds the two NICs you are trying to test; in this case it's going to be a bond, and it configures the bond with two slaves and an IP address of your
choice, and after that it runs traffic, whether it's a ping or a throughput performance test, and at the end it cleans up after itself, logs, and reports back to you: fresh machines, ready for a new test if you want. This is as simple as it can get. We're going to go through how the test was built: two machines with the LNST slave installed, a controller with the OS installed, and this is the final look of my test connection right here, either for functionality or for throughput. How does LNST know all this? Through what we call a recipe. A recipe is an XML file with attributes where you state which machines you are going to use: machine one refers to a machine XML in your directory, your LNST pool, which holds all the machines LNST knows about, with their MAC addresses and host names, and in this case machine one has one NIC and this is the IP I will be using. In the same XML there is machine two; the three dots are just a copy-paste of the same thing, abbreviated for the sake of space, with different values. LNST reads this XML, finds the task you want, and looks for it in the same directory; the task right here is as simple as it can get: it imports modules and libraries, assigns a variable, and runs a ping command from this IP to that IP. How do you run it? With the command lnst-ctl run recipe.xml, and as I said there is a variety of recipes, all upstream and accessible; I am just showing this one as an example. Next one, please. This is another example: I want to test bonding versus a plain NIC. I did not change the OS and I did not remove anything; I'm done with the NIC-to-NIC test, and for my second test I just run lnst-ctl run with the recipe that has the bonding in it, and it does the rest. Next page, please. The XML for the bonding is one single XML, just split across slides: machine one has two NICs in it, and the driver for the NICs is listed (in case a machine has more than one NIC with different drivers, I can specify which driver to test), and then the bond: its name is bond0, its type is active-backup, it enslaves eth1 and eth2, and it assigns an IP address, IPv4 or IPv6 or both at the same time. Machine two is similar to machine one from the previous XML, and once it reaches the Python task it runs it; this is more complicated Python code than the simple ping we just showed, and it is available online. At the end I will show the Git repo where you can find all the recipes, the XMLs and the Python we use. Next one, please. So essentially, just to complete the picture, you can see how the recipe is actually defining the topology you really want to test: you can customize your topology to what you want, and provided there are libraries that will run the test for you, that's perfect, you can just run it; if there aren't, we have to either add a new library or modify an existing one. This is a more complex setup: again two systems with an OS on them, each system has two guests, VM1, VM2, VM3, VM4, with Open vSwitch running the network internally, one controller, and if you notice there is a slave running on each host and a slave on each VM. Those slaves are listening for instructions from the controller: what to do, what type of topology to configure, which IP family to test, IPv4 or IPv6. The controller submits the instructions and the slaves just execute them. In our case we are going to run a functionality test, a ping, for
example between VM1 on this host and VM3. eth0 on each VM is the control interface, how I SSH to the machine, and eth1 is the NIC being tested. This slave configures eth1 with VLAN 10, this slave attaches eth1 to the switch, and the physical NICs are attached to the switch with a bond and an IP address; the same thing is mirrored on the other host. Then VM1 tries to ping VM3, and all the results end up in the log. It will also ping VM4, which is on a different VLAN, just to see whether that passes or not; in the past we caught a regression where traffic was crossing the VLAN boundary. It's important to note that it's not just ping: you can run any traffic generator you want, you can extend it to iperf, netperf and so on, and we are also thinking about using another in-house traffic generator that we have. Essentially we run traffic to see if there is a regression in the kernel networking stack. How does LNST report? If it's a pass, you get a nicely detailed report with a pass for each step and a summary at the end, in case you want to skip the debugging part. If it fails, it could fail for any reason, but we care about the functional failures: if, for example, the ping didn't go through, it tells you that it failed. You can see the pass here: in our tests we test performance as well, not just functionality; this result, measured with netperf, was 9.4, and it passes because it was at the baseline. The baseline is not just an arbitrary number; it was decided based on multiple runs, making sure that version was stable, and we average each run. In our LNST runs we don't run netperf just once but five times and average it, and I will say more on the failures: if it fails for throughput reasons, the result was less than the baseline and it tells you; the second kind of failure is the deviation, for example in our tests the standard deviation has to be within 20% of the measured throughput, and the next slide gives a more visual idea of what a deviation failure looks like. In our tests we rely on an open source project called PerfRepo, a database: since we care about performance as well, we have to store results and compare them to a baseline (that baseline, the one I spoke about before, is this green line), and PerfRepo has a web UI that makes life a little easier. The Y axis is the throughput in gigabits per second; the X axis has the runs, so each natural number means there was a run with some kernel and this was the result. The orange line is the average throughput, then the minimum throughput and the maximum throughput, and the space between them is the deviation of the runs we collected, how scattered they were between the maximum and the minimum. If that space is more than 20% of the average throughput, it means there is a failure; the numbers are not reliable. This one is a nice example: same machines, but a different test with a different protocol, and a nice-looking graph where, after we tested it, we decided this is the baseline, and every run since has passed; this is how it is supposed to look. Now, this is an example of how early regression testing can be helpful: these are kernel versions during a development cycle; we ran this test several times, then we decided on the baseline at this point, and this dotted line is our baseline. Then after
that, every time we test, if it fails we go and debug and see why it fails. Here there is a clear regression: starting with this kernel right here there was a regression, a bug was filed, and the developers are working on it. That's how LNST makes it easy for us to just automate this and run it. I'll hand it back over to Sushil. So, what's next, what's coming up? As you saw, the recipes were XML-based, and that has a set of limitations: it is not very flexible, and some of the more complex topologies are difficult to implement or to express as a recipe. So the next thing, which is actually under way, is adding support for Python-based recipes, which also means we have to convert our existing recipes into Python recipes; that is ongoing. Like I said, there are different traffic backends you can plug in, like iperf and netperf, and there is another one we are trying to use called rushit, a traffic generator we built inside the networking team, which is also upstream, so we are trying to integrate that as well. Of course there is work happening on the next branch, and there is the conversion from Python 2 to Python 3, because Python 2 is going to be deprecated, I believe, in a couple of years, so that conversion is happening. There are other supporting things that are not specific to LNST: if you remember the picture Anis showed with Beaker and Jenkins, we sometimes see setup issues that give false positives, so we created bots that rerun the test again just to make sure we really caught a regression. We might integrate that into LNST, we are not sure, but it is something we run on the side: it runs the test again, we do three passes, and if the majority pass we know it was a setup glitch and we mark it as something to go look at and fix, a test setup issue, a test recipe issue or a test case issue. We also want to bisect, so that once a regression is found we can go back and tell exactly which commit caused the problem; that is another bot we are working on, so we can go tell the developer, look, this is where your regression started, please fix it. And of course more topologies: networking has so many topologies, so as people add more functionality to the Linux kernel we add recipes and tests for it; that is continually ongoing. Finally, credits: the engineer who started this project is Jiri Pirko, he is effectively the founder of it; Ondrej Lichtner is the maintainer; and Jan Tluka, Anis and the others all work on the LNST team as well. This is the place where you can go get the code, and if you want to contribute tests, or reach out with any questions, we are also on IRC; I put the channel at the beginning of the slide deck, and we're on Freenode at this point, so reach out if you need help with any of your tests. At this point we'll take questions. Are there any? Sorry, and thank you for the information; I just wanted to know whether LNST tests things like DPDK and SR-IOV. Yes, we have added work around that; we are adding DPDK and OVS as well. Okay, and for DPDK, what kind of traffic generator, testpmd or something like that? The traffic generator, I think it was Moongen, I believe,
or if it wasn't Moongen then it was the other one, the Cisco one, TRex; I don't know which, one of the two has been given to it. Okay, thanks. Any more questions? Thank you very much, thank you. Oh yeah, we kind of have all these hosts on the network, and you can have these YAML files that say run these jobs on these hosts; we just keep reinventing the wheel.

Ladies and gentlemen, thank you for coming to the final session at DevConf 2018. We have with us Micah Abbott from Project Atomic at Red Hat, and he is going to be talking about misusing Ansible as an integration test framework. Welcome, Micah. Thank you. We made it, guys: last day of DevConf, last talk. Thank you all for being here; I know everyone most likely wants to get home or go off to do other things. So, my name is Micah, I work at Red Hat, I'm a senior quality engineer; I've been working on Red Hat Atomic Host and now Red Hat CoreOS, and this is about how we were misusing Ansible to test Atomic Host. This is going to be more of a retrospective of what we were doing, because as we move to support CoreOS we are actually taking a different tack, and I'll get to that towards the end of the talk. I'm going to start with how we got to the point of using Ansible, then some of the problems we faced when we used this method, and the solutions, and sometimes just workarounds, to the problems we encountered. So how did we get here? In December of 2014 I was hired as the first QE person to work on Red Hat Atomic Host. We had little to no test process and little to no automation at the time, so I got tasked with the normal activities: come up with a test plan, automation, and since we were really starting to invest in continuous integration and delivery, I was also tasked with starting to contribute in that area as well. The goals we had for testing Atomic Host, after we got ourselves settled, were to test the integration of the many parts rather than the separate parts themselves: Atomic Host is an OS, there are a bunch of different packages and components that go into it, and we didn't want to be responsible for testing each individual piece; we wanted to leverage the testing that had already happened, but make sure that when you put it all together it worked correctly. We also wanted to make the automation easy to develop, easy to use and easy to integrate with the continuous integration system we were working with, and we had to overcome the challenges of working with an immutable host like Atomic Host. For those of you who aren't familiar with it, it is based on OSTree and rpm-ostree, developed by the one and only Colin Walters back there (sorry, Colin), and it presents certain challenges you wouldn't normally have on a traditional Fedora or RHEL system: most of the file system is immutable, so you can't write to the places you are normally accustomed to writing to, there is no concept of yum or dnf, so you can't install packages, and we were also trying to learn container best practices, how to put software and applications into a container, and apply that to how we test as well. Since we didn't really have anything specifically geared towards testing Atomic Host, I started asking around among the people who were working with it, and I found someone who had started a makeshift framework called the UAT framework: a combination of Python and Behave that used the Ansible API to talk to the host. Behave is a behavior-driven development tool built on
top of Python, so I was kind of encouraged that maybe we could leverage that and get more test cases contributed from people outside the QE organization. With Behave in particular you write a test case in natural language, something like "I want to upgrade Atomic Host", and it is up to the person implementing the test to write the necessary code on the back end to translate that. I was hopeful we would just get a bunch of test cases in natural language and we would implement them, but that didn't really pan out. The whole framework had a lot of layers of abstraction, so we were really slowed down as we tried to take apart the pieces of the model and figure out where it broke in the different parts of the stack and in different use cases. Additionally, the Ansible API (we were using Ansible 1.9 when we started) was really simple, but simple in terms of how you use it, while being difficult to make do the things you wanted it to do; it was a weird juxtaposition. So we spent a lot of time unwinding errors, as I said, going through the abstraction and figuring out where the problems were. The API was really poorly documented; I was basically in the Python shell calling help() on the different functions, trying to figure out what they did. And then Ansible 2.0 came out, which was a whole new API and introduced a lot of breaking changes, and all of a sudden we were left with a decision point: do we continue to push this framework forward, or do we go in a different direction? We decided to evaluate our options against what would be easier to develop and use. We realized that a lot of our tests were basically just running commands over SSH on the remote host, and we wanted to be able to run the tests across the different variants of Atomic Host: multiple Fedora versions, the Red Hat version and a CentOS version, so if we had a test that could run across all three platforms with little modification, that would be awesome. Additionally, I'm kind of a lazy engineer and, quite frankly, a terrible programmer, so I didn't want to invest the time to build up an entirely new test framework to achieve these goals, and since we were already using Ansible under the covers, maybe I should just do this with an Ansible playbook. So we started doing that: we took the existing sanity set of tests we had for Atomic Host and turned it into a playbook as a proof of concept, around February 2016. Being completely new to writing and using playbooks, I didn't have any roles; it was just include files full of tasks. It kind of looked like this (this is actually the Git checkout from the first commit I made to the repo): a couple of Ansible file modules, a couple of shell callouts, and a bunch of includes that are just relative to the test itself; a rough sketch of that style follows below. But we started to have some success, so I convinced the powers that be to host this set of tests, called atomic-host-tests (awesome name, right?), under the Project Atomic organization on GitHub. We added Vagrant support to the framework, so you could do a vagrant up, get an Atomic Host, and it would automatically run the sanity playbook, or any other playbook, for you. That didn't really take off in terms of people using it; it was sort of a nice "look what we can do with Vagrant", but it didn't really pan out.
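For flavor, here is a rough idea of what that early include-based style might have looked like. This is a hedged sketch with hypothetical file names, not the actual first commit:

# no roles, just a flat list of tasks and relative includes
- hosts: all
  become: true
  tasks:
    # a couple of file modules and shell callouts...
    - file:
        path: /var/tmp/aht-test
        state: directory
    - shell: rpm-ostree status
      register: host_status
    # ...and a bunch of includes that are relative to the test itself
    - include: ../common/setup.yml
    - include: ../common/upgrade.yml
    - include: ../common/rollback.yml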
And then, finally: I had been doing a lot of this work by myself, with some help from the development team and other people, but we finally got a second quality engineer to work on these tests with me, and that's when we really started to get better at it. We started learning how to use roles and things like that, we were testing across multiple versions of Atomic Host, we were hitting a lot of the goals, so we were feeling really good about it. But there was a lot of pain. Number one: Ansible is terrible at handling reboots. Ansible is a descriptive language where you basically describe the state of the system, and it's hard to describe a system as "rebooting"; it's either up or down, it's either doing some action or not, and a reboot doesn't really fall into that model. The default output from Ansible playbooks is god-awful, it's just JSON (oh wait, where did it go; I'll come back to that later, sorry, I forgot the order of my slides), but anyway, it's JSON without any line breaks, and it's really difficult to parse, debug and triage, so we were fighting with that whenever we got errors. When we chose Ansible I had the idea that if we fail as early as possible, it would prevent us from shipping an Atomic Host that has problems; well, the problem with that is that if you fail early in your test run, you're missing a lot of coverage. Yes, exactly. The other problem we had is that once the playbook fails, Ansible basically terminates all your SSH connections, and it's hard to get back onto the host and grab debug information, logs or whatnot, using just the normal Ansible model. We also had some difficulty selecting which tests from a playbook to run, originally, because as I said we were still Ansible novices and hadn't quite solved all the problems yet. And, number one, Ansible is not a programming language, and we were completely misusing it as one: we were trying to do conditional operations and complex operations that would be pretty easy in any other language, but in Ansible you're trying to do it with YAML and its DSL, and that's when we kind of realized we had made a huge mistake; but we were already too far into it, so we had to keep pushing forward. So we started to attack the problems we had and figure out solutions. A lot of these solutions didn't come from me; they came from the other engineer on our team, his name is Mike Nguyen (he's not here today) and he's been very helpful, and Jonathan over there helped us as well, specifically with improving the reboot handling. One of the problems is that when you just issue a shutdown command, the SSH connection can get terminated before Ansible knows it has actually been successful at issuing the command. At the time, Ansible didn't really have a good way of telling its users how to handle reboots; it wasn't until around Ansible 2.0 that they actually put out an article saying "oh, this is how you do reboots, guys" (makes perfect sense, right?). The common way to solve it is to run it as an asynchronous action: you sleep before the shutdown in a shell command, logically combined, and that actually worked pretty well; we were able to reboot pretty consistently.
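The pattern looks roughly like this. It is a hedged sketch, not the project's actual task: the timeouts are arbitrary, and the wait_for_connection step is a later convenience; on the older Ansible versions discussed here you would typically wait on the SSH port with wait_for instead:

# fire the reboot asynchronously so the dropped SSH connection does not fail the task
- name: reboot the host
  shell: sleep 3 && systemctl reboot
  async: 1
  poll: 0
  ignore_errors: true

# block until SSH is reachable again before the next test step
- name: wait for the host to come back
  wait_for_connection:
    delay: 30
    timeout: 300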
But we still had other problems in that space: if the reboot command doesn't actually succeed, the host never goes down, so Ansible doesn't know it hasn't gone down, and when it comes looking for the host again in a later task it just assumes it has rebooted; but if you're expecting something to have changed because of the reboot, your tests are going to fail. What we came up with was comparing the boot ID from the proc filesystem (/proc/sys/kernel/random/boot_id): I'm not going to show you the code, because it's just taking the boot ID before you reboot and comparing it to the one after the host comes back. So here, this is the slide I wanted, this is the part I wanted to show you: the default output from Ansible is awful. Here I've run a command on the host, atomic host status, which, when you run it as a user, gives you a nice pretty summary of the state of the system, the version you're running and the commit ID; but if you tried to parse this output you'd be pulling your hair out of your head, because it's just a mess. So Mike wrote a callback plugin to basically make it pretty. It takes the result object from Ansible, breaks it into the different parts, standard out, standard error, the message you get, the return code, and actually honors the line breaks that come through standard out, so the same command now looks like this in our logs. When we have an error in a test we can go into the logs and easily pick out, say, that the return code is not zero (not in this example, but if it were a failure), and we can look at the standard out and even the standard error. This was a huge one; it made our lives so much easier once we could actually look at the output and understand it. For the problem of capturing information from the host after a failure, Mike came up with a kind of hacky way of using a handler to go onto the host and pull out the journal: it's a role that calls a handler on failures, which then goes in, sets up some names and grabs the journal. You shouldn't have to do this much work in a test framework just to get the logs from a failed system, but hey, we were already into this, so we had to do it, and it actually works; it's not pretty, but we were able to get logs and journals and whatnot. Then the failing-fast problem: how do we test the rest of the system that we missed? Mike (who should really be giving this talk; he's not here, he's in Hawaii, tough life, right?) came up with the idea of a meta playbook. Once we switched to Ansible 2.2 or 2.4, I can't remember which, we were able to use block and rescue more easily, better suited to our use case. If you can see this on the screen, we've got a block and rescue definition that includes some tasks, sets a fact based on the result of the included tasks, and at the very end rolls all of those facts up into one and prints out a nice log of which parts of the test failed or passed. Again, you shouldn't have to do this in a test framework; it should just happen more naturally.
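A minimal sketch of that block/rescue roll-up, assuming two hypothetical included test files; the project's real meta playbook is more involved:

- block:
    - include_tasks: tests/upgrade.yml
    - set_fact:
        upgrade_result: passed
  rescue:
    - set_fact:
        upgrade_result: failed

- block:
    - include_tasks: tests/reboot.yml
    - set_fact:
        reboot_result: passed
  rescue:
    - set_fact:
        reboot_result: failed

# roll the facts up into one report instead of dying at the first failure
- name: report results
  debug:
    msg: "upgrade: {{ upgrade_result }}, reboot: {{ reboot_result }}"

The rescue sections swallow each failure, so the remaining tests still run instead of the play aborting at the first failed include.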
As for the problem of it not being a programming language, there's nothing you can really do to get around that entirely; you can work around some of it by using inline Python, and I was told when my slides were reviewed that I should be prepared to defend this, because apparently there are people better at this than I am who would just write a bunch of YAML to do it. But, for example, you can do these kinds of hacks where you use splits on strings and capture those only if one variable is set, and otherwise use another variable; not fun to read by any measure, but these are the kinds of hacks and tricks that let you do Python in your Ansible playbooks if you care to. Selecting a subset of tests was a pretty easy one: we just started using tags. We tag everything into functional groups, so this is the upgrade set of tasks and this is the reboot set of tasks, and that made development a lot easier, because we could just say "run this set of tests" when we were trying to debug which one failed; it still wasn't really great, though, because if you forgot to tag a piece of the functionality, you'd be left wondering why it didn't work correctly. The way our repo is set up right now, at the top level we have a roles directory, a callback plugins directory and a tests directory; the tests directory has all the playbooks, which require the roles at the top level. You could use a relative path to the roles, but that didn't look pretty to us, so we just symlinked everything; again not the best or prettiest solution, but it got us around it and it's been working so far, and symlinks are symlinks. So, the conclusion here: it was a lot of pain, not just some pain, a lot of pain, but it worked for our use case, and our use case was pretty well defined. We had this immutable host, and, the other thing I should mention, all of these were basically single-host test cases: we ran one playbook against a single host and tested functionality, and that's it; we never got into the larger things like running a Kubernetes cluster across multiple hosts. It was relatively easy to develop, maintain and execute these playbooks; it didn't require any specialized software on the host, which was big for an immutable host, because all we needed was SSH, which every Linux system has, and Python, which most Linux systems have; it was easy for us to run them from a Jenkins job; and the other big gain was the ability to write a test once and, with a little bit of sugar, make it work across multiple variants. We ended up with this ad hoc dashboard, basically in the repo itself on GitHub: here we're testing 11 different variants of Atomic Host, which we run on our internal Jenkins, and we publish the results to an S3 bucket, so you can still get the logs; you can't see the jobs themselves publicly, but you can see the results publicly, which was important to us for the community distributions like Fedora and CentOS, and we have a nice badge generated from badges.io, so it kind of looks like we know what we're doing; it's mostly green, so we must be doing something right. And this is our bar for success: we're testing eleven-odd variants, plus the RHEL variants that are internal, so we're doing pretty good. However, and this is the swerve at the end of the talk: don't do this. Just don't use Ansible as a test framework. It's not suited for it; you saw the problems we had, you saw the workarounds we had; use something else, a bash script would probably be better at this point. You could use Ansible to prototype
some tests that fall into the model of "I need to run some commands on a host and make sure they ran correctly"; that's a fine use case for Ansible, I think, because it has the underlying functionality to do exactly that. Again, it's not a programming language; I think there's actually a blog post out there with that exact title, "Ansible is not a programming language", which instructs you not to do exactly what I just did. We had to use hacks, workarounds and general abuse of Ansible, and if we were starting over we would use something like pytest or Avocado or openQA or, you know, anything other than Ansible right now, I'm sad to say. And like I said at the beginning, now that we're working on Red Hat CoreOS, we got really lucky that the CoreOS folks have a really slick test framework called kola, written in Go: it does things like provisioning in all the different clouds they support, it captures the logs and the errors automatically, and you can write native Go functions and run them on the host, so it really expands what you can do in terms of testing an immutable host, and I'm actually looking forward to using that in the future as we test CoreOS. That's it, thanks for listening. Are there any questions? Adam; no, wait for the microphone, sorry. So, I'm just curious: there's a whole process that got written specifically for Fedora Atomic Host, the Two Week Atomic process, and that has its own set of tests which run in a thing called Autocloud, and I didn't know until now that there was this whole other set of tests also being run on Fedora Atomic Host, which presumably isn't looped into the release process in any way. How did that happen, and are there plans to change it, or to communicate? I don't have a good reason other than communication; much like a lot of the open source community, we have a limited amount of time to get the things done we want to get done. I remember having discussions with Kushal, Kushal Das, right; I had a few discussions with him, and I think he actually got some of this working in Autocloud, but we never got it fully plugged in. And this stuff came first, to be clear, then the Two Week Atomic stuff; I just checked the timeline. Well, that's when I got hired; we didn't start doing this until 2016, so I think there was a little bit of overlap. Anything else? So, in my workplace we're using Ansible for patching homogeneous workstations and dozens of servers running very different applications, and usually our patching verification is "make sure the service is running, make sure it's listening on a port"; it seems like, even just for patching, this would be a good way to do that basic system verification. Yeah, I think that's a valid use case for Ansible, describing the state of the system, and Ansible does a very good job of that: once you describe the state of the system, it does everything it can to make sure it is in that state, so if you say "apply this patch and make sure this service is started", it will make damn sure, as best it can, that the service is started, and that's your verification check. Right, but what about making sure it actually started successfully, though? Yeah, I believe the service module for Ansible has ways to configure how it determines success: you can look for a particular port being open, like you mentioned, or a particular string that gets output in the
logs or whatnot; don't quote me on that, because I haven't actually used those. A simple example is: the service is enabled, but it doesn't start on boot and you have to manually start it five minutes later; that's the kind of thing we want to detect, and I feel like this would be a good way to solve it, to be honest. Yeah, Ansible has its use cases, for sure; but as a test framework, as I said, Jonathan and I had this discussion and we came to the conclusion that we got the mileage we needed out of it and it's time to move on to different things: use Ansible for configuration management, for large-scale deployments, that kind of thing; don't use it to test software. Anything else? I'm sorry if I disappoint you by showing up just to tell you that the thing you expected to be the magical unicorn of software doesn't work, but that's the breaks. Okay: is there a plan to try to be more unified with testing, using the kola thing, in the future? The Autocloud stuff is basically dead, no one's maintaining it, it just sits there and runs, so on the Fedora side I'm planning to sort of move forward with kola or anything else, to integrate more and have all the things tested using the same thing. I'll say yes; there was somebody from Fedora QA, do you remember his name, Jonathan, or Colin? He was commenting on the Fedora tracker looking for collaboration with Fedora QA. Caleb? Camille? Yes, Camille. So Camille reached out to us, and I was actually encouraged that he's getting involved as early as possible. We're not quite there yet, especially on the Fedora side, where things are still very much being defined, so we're not yet at the point of saying "hey, here are our tests"; we don't even have an image for them to test against yet, so we've got to crawl before we can walk, but the short answer is yes, we're having those conversations. That's it, have a good DevConf, everybody. Thank you very much, Micah. So that was the final talk of DevConf, and right now, at 3pm, we are distributing a lot of fantastic prizes in a trivia game at the closing ceremony, which starts at 3pm in the keynote room, the Metcalf Large. See you guys there.