Aha! Good morning, everyone, and welcome to the fifth RootConf. Today we have a lot of wonderful talks for you, which I'll introduce, but first I'll introduce myself. My name is Todd McQuillan. I'm an American living in Japan, who comes to India every year for reasons. And my good friend Zainab has asked me to emcee the conference for you today, so I'll be introducing all of the speakers. And maybe I'll tell a few jokes, I don't know, we'll see. So today's talks are focusing on monitoring. We have a few talks on monitoring, and also on automation. So those are the two themes for today. We also have a couple of talks that don't fit in those themes, just to break things up a bit for you, keep it interesting. Today's first speaker is coming all the way from Nuremberg, Germany, which is a small town of about 500,000 people. He's been in the open-source business for about 10 years, and is an expert in all things configuration management, monitoring, and... oh no, I forgot my answer. So, without further ado, I will introduce Mr. Bernd Erk, who is giving a talk on the state of the open-source monitoring landscape. Thank you, Mr. Erk. Now it's on, now it's on. There's a cell phone. Okay, I'll call you later. Okay, so let's start. Good morning, everybody. Yeah, in the next 45 minutes I'll try to give you a bit of an overview of what's going on in the monitoring landscape. So forgive me if I don't mention every tool out there, because there are too many. I checked GitHub yesterday evening and there are about 25,000, so I cannot go through every one, but I picked the most interesting things, hopefully. So, a short introduction: like I already mentioned, my name is Bernd. I've been in the open-source business for over 10 years now. I'm CEO of a company named NETWAYS. We are based in Germany, and it's a service company. I'm also heavily involved in the DevOpsDays movement; there are a lot of local DevOpsDays events, and I'm on the co-organizer team, so if you're interested in any DevOpsDays thing, please get in touch with me. The best way to do it is Twitter; you can find me there as @gethash. So just one slide about NETWAYS. What are we doing? We are a very old company, talking about tech companies, founded in 1995. We have been doing open source for about 20 years now, and we are heavily focused on open-source data center solutions. So everything open source in the data center is the field we work in. And my last thing is a little promotion: we're having an Icinga Camp on Saturday, working together with HasGeek on that. So if you are interested in Icinga, the camp will be on Saturday, starting about 9:30 in the morning. You can find all the details on the website. So please join us on Saturday for a full day of monitoring here. Okay, that's the marketing part. Now we go into the monitoring. So if we talk about monitoring, we have to talk about what it is. What's monitoring? I don't want to scare you in the morning, but perhaps wake you up a little bit. There's no standard definition of what monitoring is, so I'll try to explain what monitoring is for me. The basic foundation of monitoring, I guess, is availability and functional monitoring. If I'm not able to ping a device or a service or something else, I have no idea what's going on, what the performance is, what the metrics are. So it means figuring out if your infrastructure is available, if your services are available, and if they are basically functional: if you can log into the database, for example. That's the foundation of everything that comes later.
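To make the "basically functional" idea concrete, here is a minimal sketch of such a check, written as a hypothetical shell plugin following the Nagios exit-code convention (0 = OK, 2 = CRITICAL) that most of the tools discussed in this talk understand; the MySQL host and the monitor user are assumptions made up for the example:

```sh
#!/bin/sh
# check_db_login.sh: a hypothetical functional check. Can we actually
# log in to the database, not just ping the machine it runs on?
# Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
HOST="${1:-localhost}"

if mysql -h "$HOST" -u monitor -p"$MONITOR_PASS" -e 'SELECT 1' >/dev/null 2>&1; then
  echo "OK - database login on $HOST succeeded"
  exit 0
else
  echo "CRITICAL - database login on $HOST failed"
  exit 2
fi
```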
On top of that come metrics and time series. Metrics and time series have become very popular, especially in the last three or four years, starting with Graphite. There were things before that, RRDtool, MRTG and all that stuff, but people were never so interested in metrics, and that has changed over the last few years. So this is a big topic for me. Then there are logs and events, which means pulling information out of the systems, dealing with syslog and all of that. And another very important thing for me is user experience. It doesn't help you in any way if all your checks are super green and everything is okay, but the user experience is bad and the user is not able to use the web interface or the client applications. So getting some kind of perspective on how your application is served to users is also important, I think. So, what to monitor? When we visit customers, when we are on projects, it's sometimes really hard to figure out what we should monitor. That's really a hard question: what is important for me? There are different approaches to get to a full-featured monitoring. I would say if you have no idea how to start, the best way is to focus on your business. What are the services you're making money with, internally or externally? What's the important service? What drives your business? And to achieve that, I think a top-down approach to monitoring is very helpful. Bottom-up means you monitor every device you can find, do an auto-discovery, and every IP address that gives a successful reply is monitored; but it doesn't help you. It ends up with 2,000 emails in your mailbox, and you will probably spend half of your day creating rules to move them to the trash. Therefore, a top-down approach is really very helpful. So first of all, focus: what is your business logic? What are you doing? The funny thing is, some people don't know what they are doing. I meet a lot of customers where I ask, what are you making money with? "Yeah, that's hard to say." Then you have a problem. You should know what you're making money with. Starting with the business logic, starting with the external services your customers use, the ones your customers are unhappy about if they are not running, is a very good point to start. Then focus on the applications: figure out which applications are responsible for your business logic being up and running. It could be one application, it could be several applications that together provide the service behind your business logic. Underneath that there are services, meaning, for example, a Tomcat, a database, whatever is needed for an application to be up, and then you come to the business logic on top of that. Of course, there are different perspectives on infrastructure, because perhaps the people interested in, I don't know, a failed disk are not the same people interested in a failure of the business logic. Perhaps your management is not interested in a hardware failure, but somebody should take care of it, because if all your hardware has crashed, then you have a problem in your business logic as well at a later point. So there are different perspectives, depending on who is going down the top-down path. So how to monitor, how to do it, heavily depends on your perspective.
What's important for you, what you like to see, is something nobody can answer for you. The perspective on your infrastructure, on your services, can be so different that, depending on your employees, the service guy who comes to your company will tell you a totally different story. There's no perfect rule for doing it. One thing I've heard a lot of times when people discuss monitoring is the push-or-pull debate: which is best? And I think there's no "or", there's an "and", because there are things where push makes sense and there are also other cases where pull makes sense. So there's no "or" for me; it's the same as arguing whether Vim or Emacs is better. Perhaps there is a better one, I don't know, but for me it's definitely push and pull. Sometimes it makes sense to go to a machine and get the metrics out. On the other hand, it's also important to deal with passive events coming in. Metrics are usually done the pull way, or, for example, if you're dealing with SNMP traps: they are still out there, and I think we are not going to kill them in the next 10 years. Personally, I don't like auto-discovery. Auto-discovery is very helpful in marketing, because you press a button and you have 10,000 green or red lights, but the quality of an auto-discovered environment is most of the time not very good. Because what do you have? You have a bunch of services you figured out exist in your environment, but perhaps it's a laptop, a notebook, a workstation, whatever, so it's really hard to make a good environment out of an auto-discovery scan. There are some exceptions: especially if you work in big network environments, auto-discovery can make sense. If you have a good tool, for example OpenNMS, which is really good for telcos, for big network environments, there's a very good auto-discovery capability, and creating dependencies from that can also be helpful, because creating dependencies on the network layer, for example, can be really hard work if you do it with infrastructure as code. But in general, for IT services and infrastructure, I would say: think about infrastructure as code. It means you have a process where you define where your services are; you use configuration management, whether it's Puppet or Ansible or whatever. We already have something where you orchestrate your infrastructure, where you say that service should run on a bunch of machines, and monitoring should be a part of it. The days where you created all these services and then opened a ticket to the monitoring team saying "please take care of the monitoring": please don't do that anymore. Monitoring must be part of your lifecycle. The service goes up, it has to be monitored. Also, in the early stages of development it should be part of the process that monitoring is an important part, and when the service goes down, it gets removed from the monitoring system. So infrastructure as code is, for me, the only way to do monitoring right, with monitoring as part of the process, because it reduces failure rates and it reduces alerts on things that are okay or are no longer there. It's definitely a good way, and also an important criterion when choosing a tool: you should figure out which tool you can configure with your favorite configuration management tool. Provide monitoring as a service: if you have a monitoring system, make sure that the other people in the company, the other folks, have a chance to work with that monitoring system. So provide an API, an interface, whatever, so that they have a chance to participate in the monitoring and are not required to open a ticket or write an email.
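As one sketch of what such an API-driven workflow can look like: Icinga 2, which comes up later in this talk, exposes a REST API on port 5665 by default, and a configuration-management run could register a freshly provisioned machine with something like the following (host name, credentials and variables here are illustrative, not from the talk):

```sh
# Register a new host in Icinga 2 through its REST API during provisioning.
# The same endpoint accepts DELETE when the machine is decommissioned.
curl -k -s -u apiuser:apipassword \
  -H 'Accept: application/json' \
  -X PUT 'https://icinga-master:5665/v1/objects/hosts/app-server-01' \
  -d '{
        "templates": [ "generic-host" ],
        "attrs": {
          "address": "10.0.0.12",
          "vars.environment": "production"
        }
      }'
```

Hooked into a Puppet or Ansible run, this gives exactly the lifecycle described above: the service comes up monitored, and it disappears from monitoring when it is decommissioned.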
Monitoring should be a fundamental part of your infrastructure design, and therefore you need something like a service interface. Whatever you use, monitoring needs to be part of the process, and for people to engage with it, you have to provide it as a kind of service. If you would like to have a service monitored, independent of whether you are on duty or not, you use an interface, add your service there, and then you have it in the monitoring system and you get your metrics. Also create dependencies on that: if you don't add it in the right way, you will not be able to get metrics for your service. That's an important fact. So, having explained what I think monitoring is and how a technical approach can work, let's talk about the tools, because that is hopefully why we are here. Talking about availability and functional monitoring: what's out there? There are dozens of tools; there's no reliable database of open-source monitoring tools. We can look on Wikipedia, where there are about, I don't know, 64 tools. There's a monitoring survey James Turnbull has run for a couple of years. It's a little bit outdated, this one is from 2015, but it gives a good impression of what's out there. People are asked on a yearly basis: what are you using, what are you doing with your tools, what's important for you? And at the top of the list you can see that Nagios is still number one. Then there are a couple of SaaS services like CloudWatch, New Relic, and some homegrown tools, which usually means modified Nagios or something like that. Then Icinga is coming up, Sensu, Zabbix, and all these tools from the open-source, on-premise area. I will talk about those here, except op5 and Centreon, because they are kind of flavors of Nagios and I don't want to cover that twice. They are, of course, individual products with advantages and disadvantages, but it's also not possible to fit them into a 45-minute talk. So I will focus on the open-source, on-premise tools in this survey, which are Nagios, Icinga, Sensu, Zabbix, Riemann and OpenNMS. Perhaps you noticed that OpenNMS is not in the survey, but I think it's worth mentioning, so I put it in. Let's talk about the first one. It's kind of a love-hate relationship: I was using Nagios for years, I was on the Nagios community advisory board once, and the problem is this. I think Nagios is a good system, but Nagios was a good system in its time, the way steam machines were cool at some point. Nagios has a lot of advantages, it is easy to extend and all that stuff, but today I would definitely say there are better alternatives. Of course you can still use Nagios if you have to monitor 20 or 30 hosts, and Nagios is still really, really reliable, because the codebase is so old and so many people have patched it to death that it really works. But there are better options out there. I'm not a Nagios hater, to make that clear, but if you're starting fresh, if you're thinking about doing something with monitoring, please don't start with Nagios. Start with anything else, but not with that. Sorry if somebody from Nagios is here. The other one on the list is the Icinga project, which I'm involved in. It has pros and cons as well, and I would like to treat every product fairly. Icinga originally came out of a Nagios fork: the Icinga project forked the codebase into Icinga. But at some point we figured out that it was really hard to change that codebase, and then the team started to rewrite the product from scratch in C++; that rewrite became Icinga 2.
Definite advantages of Icinga 2 are that a lot of integrations to other tools are built in: if you would like to write metrics to Graphite, InfluxDB, OpenTSDB, all that stuff, it's just a feature you need to enable, meaning you don't need other external tools to make that happen. It has an application-based cluster stack and a REST API to add, delete and modify services during runtime, which makes it very easy to run in an HA environment, and that's definitely an advantage. A disadvantage of Icinga could be that there are so many possibilities, active and passive checks, checks running on the server or on the client, that it can be complex to set up, especially if you have a system with multiple nodes, or the certificate handling you need to make it secure, and all that stuff. People sometimes have problems with it; there is always room for improvement. The documentation is not so bad, but we regularly find that people have problems with the sheer number of possibilities.
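As a taste of the "just a feature you need to enable" point, on a stock Icinga 2 installation the built-in Graphite writer is switched on roughly like this (a sketch; the service manager and paths may differ per distribution):

```sh
# Enable the built-in Graphite metrics writer, then restart to activate it.
# Other integrations (InfluxDB, OpenTSDB, ...) are enabled the same way.
icinga2 feature enable graphite
systemctl restart icinga2
```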
Talking about Sensu: Sensu has, in general, a very similar approach to Nagios and Icinga. They also have kind of standalone and subscription checks, meaning the server can do stuff and the client can also do stuff. A lot of people complain about RabbitMQ, which is necessary to run Sensu. I'm not a RabbitMQ expert, but if you look at the forums, a lot of people complain that it's not running, it's not stable, and it's hard to install. I don't want to rag on RabbitMQ, because I have no idea, but this seems to be a problem with running Sensu in production: people often have a problem with the transport layer. There is no historical data: if you would like to do SLA reporting afterwards, it's really hard in Sensu, because the information is there and logged, but you don't have a data model to access it. And what's really sad is that Sensu moved a lot of things to enterprise-only. Dependencies, for example: there are no dependencies in the open-source version. SNMP moved to the enterprise version. It's not wrong to have an enterprise version, don't get me wrong, but I don't know where the border between the open-source and enterprise versions should be. I think the open-source version of Sensu is really limited, because it's probably not enough for an advanced monitoring setup. If you're able to buy the enterprise stuff in addition, then it will pretty much do everything you need, but this is something you should take care about. Zabbix. Zabbix is also a very, very popular monitoring tool; in Japan, for example, it's the de facto standard, so a lot of people are using Zabbix there. Perhaps you know better, but that's what I've heard. It's a full-featured solution. An advantage of Zabbix is definitely that you get a lot out of the box: you install the agents, turn them on, and you get all the data, you get your graphs, everything. So it's really easy to start a new monitoring system. Logging and graphing are integrated in Zabbix, which is easy. It's a little bit harder to orchestrate, but that's usually the case with all these tools: the more you get out of the box up front, the harder it usually is to extend later, because everything is integrated. If you would like to extend it in some way, it's really not that easy. And scaling Zabbix can be a problem, because while the satellite systems in Zabbix have their own databases, in the end all the data has to be written into a single Postgres database. To be fair, the other tools, Icinga as well, have issues there too, but that's the scale-out limitation I know customers have with Zabbix: if you would like to scale out, your database needs to be very good at it, meaning you definitely need a large environment to deal with all the data coming from Zabbix. Riemann. Anyone heard of Riemann? Okay, a couple of you. Riemann is a project where there's not so much going on; that's my last point on the slide, I found out that there's not much going on in the project right now. Riemann is a stream processor: you have a server running, and all the clients constantly push streams to the server. It's based on Clojure, and it's really important to know that, because you also have to write your streaming rules in Clojure; you have to be familiar with that language to really work with Riemann in a good way. If you would like to measure different things, like a web server or an application server, they provide different Riemann tools to send the metrics over to the Riemann server. It's stateless: it doesn't store any data, it just handles it, shows it in a web interface, and you can see the metrics. So the advantage of Riemann, perhaps in combination with another monitoring tool, is definitely that you have real-time information about your system: you constantly get your performance streams out of the system and see what's going on. Like I said, there's not so much going on in the project. I checked GitHub again yesterday; there are a couple of small documentation enhancements, but not much else. On the other hand, it could be that it's just perfect right now and there's not much to do, I don't know. OpenNMS. OpenNMS has been on the market for a very long time. It's also a full-featured open-source solution, meaning you have everything in there, like I told you before. It's very good in large, homogeneous network environments; telco networks, they are very strong in that area. It's based on Java, and it's not important whether you hate or love Java, but it hurts when you fork out of Java: every time you leave the OpenNMS Java context, when you have to execute an external plugin, performance is horrible. They have a lot of stuff included, SNMP and everything, inside the JVM, but if you have to do something which is not included in OpenNMS, performance is not so good. This is not really an OpenNMS problem, it's a Java problem: forking out of Java is expensive, and therefore scaling out external checks with OpenNMS doesn't make sense. But I also guess that's not its field of expertise. The auto-discovery is really cool, and they are also really, really nice guys in the OpenNMS project; it's definitely a cool tool if it fits your needs. Okay, so leaving functional monitoring: let's assume we have set everything up using our favorite tool and we can figure out whether something is running or not. Metrics and time series: it's all about counting. Whether it ends up in money or not, we would like to know how the metrics look. And, not talking about all the RRD-based tools which are out there, MRTG, PNP, whatever: RRD is definitely also a very cool thing. The problem with RRD, for most people, is that it's not fancy enough. That's not a technical argument, I know, but sometimes you don't need technical arguments, and it's definitely a problem that people would like to have fancier graphs where they can add different things and all that stuff, and that's hard with RRD.
So, Graphite. The database behind Graphite is Whisper; it's very similar to RRD, with a couple of differences. For example, you can add data out of order: with PNP the data points have to arrive in sequence, and that is not necessary with Whisper. It kind of started the metrics revolution; it really started to be popular a few years ago. And its biggest advantage is also its biggest disadvantage, because Graphite consists of different components: you have Whisper, which is the database, like the RRDs; you have Carbon; you have graphite-web. Some of the original components are, first of all, really hard to install, so a lot of people fail with Graphite because they're never able to get it up and running. But also, around the different components, if you look at GitHub there are so many alternatives replacing individual pieces of Graphite, for example carbon-c-relay, which is like a proxy for the carbon-cache, that it's really a moving target, and it can be hard to debug. I know there are a lot of large environments based on Graphite and they have figured out how to deal with it, but getting started needs some knowledge, I would say. It's definitely still kind of the standard. Another thing is OpenTSDB. Anybody using OpenTSDB? Congratulations, you made it. Pardon? Ah, okay. So OpenTSDB is pretty cool, but it's hard to set up, because it's based on Hadoop and HBase, meaning you have to know how that stuff works. If you know it, congratulations, then you can do it. It's really able to scale like crazy. You can also store all your data: you don't have to downsample, you can live with raw data forever, if you're able to pay for the disks, of course. And once it's up and running, it's very easy to scale, thanks to HBase and Hadoop: you just add a node and that's it. Also, a lot of monitoring tools provide an API to OpenTSDB. So if you already have knowledge about Hadoop or HBase, if you already have a cluster, that would be a good point to start. If you just would like to start with metrics and would first have to start with Hadoop and HBase, well, if you have time, anyway; it's harder to start with OpenTSDB. Another thing is Prometheus. Prometheus was originally developed by a Berlin-based company named SoundCloud for their internal metric storage. They open-sourced it, and since about a year ago, I don't know exactly, they are a member of the Cloud Native Computing Foundation. Prometheus is also a time-series database, with a dimensional model: the database model is not tree-based like in Graphite, for example, where you have these tree-based metric paths. Prometheus is very flexible in its data model. Originally it's designed for web services that you can query externally; for getting OS metrics or something like that, you need other tools installed, as there is no plugin mechanism in there. For example, if you would like to get load information or infrastructure information, you need to install the node exporter. And what makes Prometheus very powerful is that it has rule-based alerting: there's a component, the Alertmanager, with which you can create alerts based on metrics and thresholds and send them to a user. If you have a setup that is really based on metrics, where probably all your information comes out of, let's say, response times and specific load scenarios, then Prometheus is really good for setting up a monitoring based on metrics. It has to fit your environment. Definitely in the cloud area, also because Kubernetes is in the Cloud Native Computing Foundation, it is gaining a lot of traction; it's heavily developed and moving forward. So it's definitely an interesting tool to check out in the next year.
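To illustrate the rule-based alerting idea, this is roughly what an alerting rule looks like in the YAML rule format of newer Prometheus releases; the metric name, job label and threshold are invented for the example:

```yaml
# alert-rules.yml: evaluated by Prometheus; firing alerts are sent
# to the Alertmanager component mentioned above.
groups:
- name: latency
  rules:
  - alert: HighRequestLatency
    expr: http_request_duration_seconds{job="webshop"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Request latency above 500ms for 10 minutes"
```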
Another thing is InfluxDB. InfluxDB has a very similar scope to Graphite. I think they basically said, okay, let's make Graphite easier. Therefore it's really much easier to install, and it has a very powerful SQL-like query language, so if you're familiar with SQL, it's really easy to get metrics out of InfluxDB. The thing is, enterprise here also means horizontal scale-out: in InfluxDB you need to pay for that. And they have put a lot of energy into doing more with it: they developed the stack named the TICK stack, where TICK stands for Telegraf, InfluxDB, Chronograf and Kapacitor. These are different components: Telegraf is able to send metrics to InfluxDB, Chronograf is the web interface for analyzing them, and Kapacitor is something like the Alertmanager in Prometheus. So the people at InfluxData figured out that they are more than a metrics database, and created components to cover the full chain, from sending metrics, to having a web interface for analyzing them, to a rule-based approach for getting alerts out to the user. Elastic: I would say it's the de facto standard. Who is using the Elastic stack in some way here? Definitely more than OpenTSDB, don't ask me why. I would say they were the first player, and they made very good decisions in buying other projects. Elasticsearch has been there for a couple of years, based on Lucene, and the Elastic stack also started to get serious in the time-series area, I would say, one and a half years ago. There is a Kibana extension named Timelion for metrics and time series, which is pretty cool. You should also look at Elastic Beats, which is a method to directly send metric information from your tool: there are perhaps Beats for Icinga, for Nagios, for your application, and you can directly send metric information to Elastic, bypassing Logstash, and then use Kibana to query it. It has a different modeling approach, because the fundamental concept of how Elasticsearch works as a database is different from the model of Prometheus or Graphite. It's more important that you know what you would like to see: where the Graphite approach is more like "put everything in, and if you can afford it, store everything and look later at what you need," with Elastic, I would say, you need to think more about what you want metrics for and how the node design, the object design, should be. StatsD could be helpful here, StatsD in combination with Logstash. StatsD is a metric aggregation daemon from Etsy with which you can work with counters and aggregate specific values, and putting that information back into Elastic is very powerful; it could help you in some way as well. So now we have found a way to store the metrics. The different tools have their own web interfaces, the InfluxData stuff, Kibana of course, but if we talk about visualization, about getting all the metrics out: Grafana is it, I would say, right now the standard, because it works with all these databases. Grafana has interfaces to all these tools and more. The people working on Grafana, Torkel and the Grafana team at Raintank, are bringing out new releases, I would say, every week. It's very easy to start with, and, probably one of the reasons people use it, it's easy to combine different data sources, also from different back ends, into a single panel: you can pull some series out of InfluxDB into Grafana and, if you have more than that, get information out of Elastic. It's a very cool thing. Who knows Grafana annotations? Not so many.
So annotations are a cool way, when you have your graph and are storing your metrics in it, to mark, for example, a Puppet run or, usually, a git commit, when somebody breaks your system, and then you can show that event in your graph. With Grafana you can look into your metrics database and also look at your syslog information, and that's pretty cool, because when you see that something is wrong with your performance, with your metrics, with the response time, sometimes it's just a Puppet run exchanging some software, and this makes it very easy to see. Depending on the event, here it's just a test event, you can also add more information: you can say there's a git commit by developer xyz, so you can call him. So it's really easy to do a quick analysis of what happened when the performance changes. If you use Grafana and have some kind of log management tool in the middle, you should definitely give it a try. First of all we have to start, I think, with the difference between a log and an event. A log is just a flow of unstructured data. Hopefully we have some kind of timestamp, which makes it easier to work with, but it's not more than that: we have a timestamp and we have a bunch of information in a log. And if we would like to do more with it, we have to split it up into different attributes. Going from a log to an event means we have to check what is what: what is the timestamp, which service is responsible, what is the message. And I would say it's always the same process. The exception is if you only have to store your logs by law; then perhaps you don't care what's in there. But if you would like to work with your logs, you have to turn them into events, so the process should always be going from a log to an event. Thinking about logs, there is Grok, for example: you have different patterns with which you can split a log up into different parts. For most services the patterns are already out there, so you don't have to reinvent them; just grab one, adapt it a bit, and then take some action. If you really work with logs, you first have to identify the attributes, and then you can work with them later on.
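For example, a minimal Logstash filter that turns an Apache access-log line into structured fields with one of the shipped Grok patterns looks roughly like this:

```conf
# logstash.conf, filter section: from unstructured log line to event.
filter {
  grok {
    # COMBINEDAPACHELOG ships with Logstash and splits the line into
    # client IP, timestamp, verb, path, status code and so on.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # Promote the parsed timestamp to the event's @timestamp field.
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
```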
In this area, Elastic is definitely even more the standard than in the time-series area. It's highly integrated. Elasticsearch, at the time, was the first one, Elastic being based on Lucene; then they, I don't know, kind of bought or adopted Logstash, developed by Jordan Sissel, and then Kibana. So they got all these tools together, and Beats was also an external project, named Packetbeat before. Elastic did very well in getting the right components together into a complete solution. It's now the Elastic Stack; it was previously the ELK stack, but since they added Beats, "ELK" doesn't work anymore. You don't have any user authentication or such things in it; if you want to have everything in a fancy way, you need X-Pack for it. But I think how Elastic does it is very good, because, for example, Logstash has a very powerful API, which is an advantage over others: you can really go to a Logstash instance and see what's going on, how the data is processed, how quick it is, and this is open source. If you would like to have it in a fancy way in your Kibana, you need to buy X-Pack. But I think the border they draw between the open-source and the enterprise edition is pretty good, because you have everything, you have the API, you can access it; if you need it in a more fancy, clickable way, you have to buy X-Pack. The Logstash API came, I think, with Logstash 5, a year or one and a half years ago. If you're working with Logstash and you see some kind of processing performance problem, the API is very, very helpful. And by far, it's the largest community: if you think about log management and all that stuff, Elastic is the biggest community. Another cool tool is Graylog. Graylog is also based on Elasticsearch, so they use kind of the same database, but the biggest difference is that all the configuration and the rules and everything that you have to do by hand in Logstash is provided through a graphical interface. Also, if something like authentication and authorization is important for you, Graylog could be a very good choice, because everything is in there: connecting to LDAP, having user roles, all that stuff is available for free. And it's easier to start with Graylog, because you have an interface where you can see your input sources and output sources. It also works very well in combination with Logstash: it's not always an "or", it can be an "and" as well, meaning you can combine all these tools, and there are a lot of users running Graylog but using Logstash to get the information out of the systems. Or another tool, named Fluentd. Fluentd, like Prometheus, has also joined the Cloud Native Computing Foundation. It's kind of a unified logging layer, so I would say Fluentd could be a replacement for Logstash: it has a very powerful logging layer and connects every system with the others. I have 2 minutes? You have 9 minutes. I only have 10 minutes, okay, I have the right time zone, hopefully. Okay. Fluentd could be a good alternative to Logstash, or, if you don't want to use Elasticsearch as a data store, if you use something else, because Fluentd supports multiple back ends, Fluentd is also a good point to start. An advantage over Logstash could be that it has full reliability: it has a file- and memory-based buffer system, and you can also replicate to multiple Fluentd services, which Logstash doesn't have. So it could definitely be a good man-in-the-middle replacement. User experience: probably the last area, and it covers all the others. I don't know, who is doing end-to-end monitoring in some fashion? Okay, we don't have a lot of fans here, I guess. It's not super popular, really figuring out how your browser works. Not a lot of people do it, because it's a lot of work, and perhaps, I would say, for a typical ops guy it's not so much fun to work with a front end and fill in different variables. But it's really, really helpful. End-to-end monitoring gives you that other kind of assurance, so you don't end up with a disappointed customer, and it's really often the case that technically everything is right, but the user experience is shitty anyway. I'm not talking about a bad interface, that's another story, but simply that your interface doesn't work as expected. And therefore, for some services, I don't know, if you're talking about a web shop: going through your shopping experience, adding a product to the cart, doing a checkout and checking that the invoice is correct could make sense. There were these tools out there, WebInject and AutoIt; they were cool a long time ago, but they are really not in active development. WebInject's last release was 2006, and still people are using it; and AutoIt is also a little bit outdated, the current Windows versions are not supported.
There are two end-to-end user monitoring tools I would like to mention here. One is Sakuli. Sakuli is a combination of a tool named Sahi and one named Sikuli; one is for web testing and one is for fat-client testing, and it works with Nagios-compatible systems, meaning Nagios, Icinga, Sensu and all that stuff. Most of the checks can just launch in a Docker container, so it's very well isolated. I'm not really sure, but some of the Sahi features are enterprise-only; you have to check what you need here. And another interesting product, which I guess nobody knows here, is a product named Alyvix. It's developed by an Italian company. It's a complete solution for monitoring web user experience and monitoring fat clients. They have an IDE where you can really create test cases on an end-to-end basis. And they have a full audit trail and notification system as well; it sends you a screenshot, for example, if something is wrong. You should definitely have a look. So for end-to-end monitoring, if you would like to check your user experience, if you would like to see if your Citrix is working: they are also able to work with mainframe terminals, the 3270 thing. Mainframe and DevOps, I don't know, it could work in some way. Anyway, if you have it, they can do it. They can also work with Java applets. They can do pretty much everything. It's really cool. So, the conclusion. If you go out of this talk and come back next week: did you learn anything? So what now? Your boss comes on Monday... Perhaps it's not looking so bad, I see, or I don't know. I'm sorry, there's no best tool, and it was probably not the goal of this talk to say "choose that one." I think it depends on what you need. There are kind of two different approaches. There's the monolithic way, where you have tools that do a lot out of the box, like Zabbix and OpenNMS; they have everything in there. You install the agents, you have graphing, logging, everything. And if that's enough for you, then okay, because it's easy to set up and you don't have to deal with all the external components. Also, perhaps your main focus is not tech: if tech is an important part, but not all your people are in tech and you just would like to have a monitoring, then it could be a good choice, because you get something easily. If you need more, if you need to scale it, if you also want to play with the newest fancy hot shit on the market, then a modular approach could be better. If you're a tech-driven company, if your ops are really up to date and would like to play in a new way, combining different tools together is the best way. So I prefer a modular approach with a tool chain, because, while sometimes the monolithic approach is good enough, the problem is that at some point a new thing comes up again, a new metrics system you would like to play with, and then going from a monolithic approach to a modular approach, saying "I would like to replace my integrated graphing solution with Graphite" or whatever, is hard. Best is: take your favorite tool and set up a kind of real-life use case. Don't just test on your local Linux box, or only in addition; play with the use case you have in the company, play with the integrations, try to hook different tools together, and choose your favorite. Sometimes you are not able to make the best decision; like in life, you can go through every argument, but sometimes you just flip a coin. If you don't know, just start with one and figure out if you like it. And if you like it, and it does everything you need, then it's perfect for you. So I'm hopefully on time. Thank you very much for listening. Are there questions, or is there time for questions? Five minutes. Are there questions? Are you awake? Kind of. I didn't get the question, I'm sorry. Hello. Hello.
I was just curious why you didn't mention PagerDuty in this. Because of open source. Open source. PagerDuty is cool, like others, VictorOps and all these tools, but it's an external alerting service. VictorOps is also cool because they're using Icinga, I know. No, it's an external service, like a lot of other cool things in the software-as-a-service market, of course, like New Relic or CloudWatch or Datadog, Librato. But this talk, for me, is focused on open-source, on-premise tools, and therefore it's not part of it. Of course I know that all this external alerting and notification, SMS, voice, is a big part of a monitoring tool chain. You can do it on your own with various solutions, but PagerDuty is definitely a good choice as well. More questions? There's somebody. This question is not related to the tech exactly, but I have used both the open-source monitoring landscape as well as the proprietary stuff. So, in your opinion, how far ahead is something like Splunk compared to the ELK stack and everything, for a scaling, tech-driven startup? Yeah, that's a good question. If you name Splunk: Splunk is awesome if you can afford it. Splunk is expensive as hell. If you would like to store a lot of data, it's really hard to afford. So if you say money doesn't matter, I would say go with it; it's super easy, and you can call somebody and scream at him if it doesn't work. So Splunk is really, really good. But since you pay storage-wise, you pay for the data you store, for me it's more of an alternative. What I see is a lot of customers going away from Splunk, not because the tool is bad, but because they're not able to afford it. It's really only a cost reason why people go away from Splunk. Pardon? In terms of the features: definitely, Splunk is much easier to configure. I would say Graylog is closer to the Splunk approach than Elastic, because all the configuration of inputs and outputs can be done via the web interface. Also enterprise integration: open-source tools that provide enterprise add-ons, like Puppet or Ansible, work pretty well with Splunk. Splunk is fast, and Splunk is also capable of working with probably every input and output necessary. So it's definitely a good tool. I'm not into Splunk in the sense that I could install it; I can work with it, and I know in general what it is and what it's able to do. But I would say it's really just a money reason; otherwise it's a good tool. Like I said, if you can afford it, congratulations, it's a good one. I don't mean that badly; if somebody from Splunk is here, I'm open for discussion. You never know who's in the audience. More questions? Yeah, there's just time for only one more question. Sorry? You've discussed many open-source tools. My question is about activity tracing. Let's say I'm talking about a visualization, a particular graph where all my use cases are on the y-axis and all my microservices one, two, three are on the x-axis. If I want to do something like a color-coded activity tracing for each of my use cases, which tool is the most suitable one? Color-coded on what, is it on the performance, or? No, no, no. Activity tracing. Suppose I have, like, eight microservices: one for the UI, and then one for the database, the service that's talking to the database, and then to my IoT devices. I want to activity-trace my use case. Suppose, let's say, I'm switching on my device.
So, starting from my switch to the application, I want to activity-trace it, and I want to visualize a graph like that. Which of the tools we discussed would be most suitable to visualize that kind of graph? I would say none of these tools. If you really want to go into the application, New Relic is very powerful. It has some requirements, because you have to replace PHP or the JVM with their stuff, but they are really able to go into it. There's also an open-source alternative to New Relic; I can tell you later, it's not part of the presentation, but with it you can really see what's going on inside your application, and the combination of the use case coming from a service and what's happening in the database. I'll talk to you later, because I have something, but in general I would say none of these tools is good for it. Visualization could work in all these time-series databases, but in the end they just show the result of your previous work. I have something I can show you later. Okay, I'm here both days, and also, like I mentioned before, the Icinga Camp on Saturday. Thank you very much and enjoy the conference. Hi, can everybody hear me? I think I can hear myself. I am Zainab Bawa, I run HasGeek, and I am the producer of RootConf. So thank you all for being here today. I'm going to eat a little bit into your coffee break, and I'm hoping you will forgive me for that, but I do have a bunch of announcements to make and a bunch of very quick thank-yous. I'm not going to hold up Aditya, who has other important appointments right after his talk, but I will take about five to seven minutes. Very quickly: there have been two changes in the schedule today. Krum, who was to speak about the anatomy of an alert, is not here because of health reasons, and so Aditya Patawari, editor of RootConf, is here today filling in for him. The talk that Aditya will deliver today is on deployment strategies with Kubernetes. Sorry, but surprises are always the case in life. We have one other surprise, and I'm again sorry about it, but it was completely unintentional. There was an OTR, an off-the-record session, on microservices scheduled this afternoon from 2:30 to 3:30. Unfortunately Anand has missed his flight for the second time, and so it is rescheduled for tomorrow at 3:45. The change is reflected on the website. I'm sorry, but yeah, life is full of surprises. Very quickly, I want to make a very quick introduction to RootConf, and then I want to call upon Satish Mohan to talk very briefly about DevConf and why we're doing it here. RootConf is in its fifth edition, as Todd pointed out. This time we have 500-plus participants who've registered across both days, so it's going to be a very big audience, and I recommend that you make the most of this conference by doing a bunch of things. Please attend the talks in both tracks. There is an SELinux tutorial that will happen this afternoon; the Ansible tutorial is already on. Again, for those who couldn't make the RSVP early, we hope there will be more partnerships with Red Hat and others to do Ansible tutorials over the rest of the year. Chat with speakers during break times. Bernd is around; I always feel like a midget in front of him, but that makes him very easy to spot. And of course the others who are going to be around across both days. Participate in the off-the-record sessions. Off-the-record sessions are not panel discussions, they are conversations.
We put together facilitators to help you have conversations around microservices, AWS cost optimization, the art of software architecture, et cetera. OTR sessions will happen in the porch opposite the auditorium. They will be in the Pagoda, so please watch out and be there. There are two happening today. The microservices one is rescheduled for tomorrow, but there's one on MySQL, which Lig and Colin will be leading, and then another on AWS optimizations and hacks, which will be led by the Bangalore AWS user group and Jeevan Dongre. On that note, I also want to thank four communities who have actively contributed this year: the Bangalore AWS user group, for leading the OTR session; the Internet Freedom Foundation, for helping us push the digital rights agenda and educate the audience on questions of privacy, network management, and internet shutdowns; Anurag Bhatia, who will be speaking tomorrow about the RIPE NCC's Atlas project; we'd like to thank the IFF for helping us put this agenda together. I also want to thank the SELinux community members who are here; there's a tutorial today and then a talk tomorrow by Tushan Bharvani. I think securing our systems is becoming a lot more important in today's day and time, so I hope you take the most out of this session and hopefully end up with more secure systems. I also want to thank all the communities associated with Red Hat, and I want to thank the FreeBSD Foundation for coming here and putting the agenda together for educating the audience about FreeBSD. So thank you so much to all these communities. And on that note, I do want to mention that RootConf this year has turned into a platform where it's no longer just about RootConf; it's about all these communities coming together, and RootConf becoming the space for everybody to come and interact with each other. Given that, I'd like to invite Satish Mohan to very quickly tell us about DevConf and what the future roadmap is. And with that, we will carry on with Aditya. After Aditya's session, we will have a very quick demo of the Contact Point app. You have these QR codes on your badges; we will quickly show you how to scan them and make the most of them, because you can exchange contact details. And last but not least, before Satish takes over: I hope you have a copy of RootConf's cheat sheet. If you don't have one, please pick it up from the help desk. It has all the important information you require, including the Wi-Fi SSID and password, and other things, including food tokens and how the systems work. And there's a party this evening, so there'll be transport from here to the party venue. Definitely be downstairs at 6:20; there'll be various tempos going from here to the party venue. On that note: Satish, thanks for having DevConf here, and very quickly, about DevConf. Thank you, Zainab. Definitely, everybody noticed that the party... Yeah, is this on? Okay. So yeah, it's great to be here. The moment I came into this conference, I met a lot of old friends and I see a lot of new faces, so I can validate Zainab's point that this has become a platform. RootConf is not just a conference, it's a platform where we talk not just about the infrastructure pieces, but also about application design patterns and deployment patterns. So, a brief on DevConf: this is a conference we started way back in 2009 in Brno, and it was more of a conference focused on infrastructure technology.
All the community members who had been working with Fedora would come together to discuss the roadmaps, talk about the features and the progress they'd made, and take suggestions from other community developers on how these pieces would fit together. That was the starting point for DevConf. It has evolved since then into a platform, again, that focuses on application designs and deployment patterns, and it brings in a lot of topics. We just saw from the first keynote that even one particular segment, monitoring, has so many projects, and with the new design patterns, every layer we add has so many different designs, so many different modules coming in. So our objective with this conference is to create a forum where the developers and the users can meet and discuss, which can cut short the time it takes to deploy an application. It's about learning from each other: even the developers building these technologies would like feedback on what to prioritize next, what to bring in, or what course correction they need to make in a particular stack. So that's the real objective. And when we had the thought of bringing it to India, we were looking for partners: where and how should we host it, where should we start? HasGeek and the Linux Foundation have been great partners, and we found RootConf to be a great starting point for this conference. So we're looking forward to a lot of conversations. There are a lot of workshops and talks designed for the next few days, and a lot of people have put in effort, not just the speakers but also many community members, helping us figure out the important topics to pick up and what the RootConf audience would like to hear from us. But most importantly, the other objective we have with this initiative is to increase contributions from this region, contributions to existing projects or new projects started as an outcome of the conversations we have in this forum. That's the larger objective we have for this conference. So, a lot of topics to cover in the next few days. I hope you will have fun, and I won't take too much time out of your coffee break, so enjoy the coffee and then on to the next session. Thank you. All right. Our next speaker is a native of Bangalore, a local guy. He is an expert in DevOps and systems engineering, and he runs a consulting company here in Bangalore. So please welcome Aditya Patawari with deployment strategies for Kubernetes. All right, good morning. We have quite a sleepy audience, and I may actually help with the sleeping. So, one of the things that I really liked about this... okay. One of the things that I really liked about this RootConf is that there's a rope line there. I'm not really sure why there is a big rope bundle; I don't know what they're planning to do with it, but I thought it would be nice to point it out. All right. So I'm going to talk about Kubernetes today, and we are going to look at deployment strategies, some very common deployment strategies. A little bit about me: I have been a systems engineer and doing DevOps since 2011, and I led the systems engineering team at BrowserStack before this. Currently I am running my own systems engineering and DevOps consultancy called DevOps Nexus.
I am a contributor to open-source projects, including Kubernetes and the Fedora project, and I've been a published author and a speaker at various conferences, including RootConf, FOSDEM, Flock and FOSSASIA. That being said, I actually lied when I told you that I'm going to talk to you about deployment strategies. That was my ploy to get into the schedule, because they were not letting me in. I thought that if I put "deployment" in, they'd allow me. So that was a very clever way to fool the editors. Right, so what I'm going to do is talk about certain concepts of Kubernetes that will help you with deployments; they'll help you design your own deployment strategies. By understanding these concepts, like labels and selectors and schedulers and all those things, we'll learn deployment as a by-product. Right, before we go ahead, we need some sort of initial setup. I basically need a Kubernetes cluster set up, and I need an image with which I will demonstrate how this works. Now, I'm going to use the nginx image, because that's a pretty common thing to use. I request you not to follow along with the demo right now, because that will swamp the internet, and consequently my demo will not work that nicely. So please don't try to download the nginx image right away. With that, I want to introduce you to labels. How many of you here are familiar with Kubernetes, or at least have heard what Kubernetes is? How many of you have used any cloud provider out there? Any one of them? Quite a few. In any cloud provider, there's a way to identify resources, and the most common way is to use something called a label or a tag, depending on what your cloud provider calls them. Kubernetes has a similar thing, known as labels. Basically, you pick any artifact, like a pod or a deployment or a service, and you assign a label to it; subsequently you can identify that particular resource using that label. Now, we are going to use this to our advantage: when we want to do anything along the lines of deployment or routing traffic or whatever, we need to make sure we have the right labels in place. So I'm going to use quite a few labels here. The most common ones I've written down: we have a label for environment, for application, for service. The color label is something I will talk more about later, but yeah, that's another label. The second thing I want you to do is imagine how your traffic flows from a client's machine to the service that you want. So we start at the very beginning: somebody sends a GET request or something like that, and it hits the load balancer, right? Now, what happens to the traffic at the load balancer? The load balancer basically acts as a central brain. It tries to understand where it should direct this traffic to. The load balancer, or the router, or the service, is supposed to find out the resource for which the request is meant and send the traffic there, and then that resource, that particular server, is supposed to take care of processing the request. Now, this entire thing is quite standard; we have all been doing this. If you have worked with any of the cloud providers, be it AWS, GCP, DigitalOcean, all of them have their own load balancers, and if you have used HAProxy, it works on almost the same principle. Now we have this key concept in mind: we can actually tweak the routing of our traffic.
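As a small sketch of the idea (the label keys and values here are illustrative, not taken from the demo): labels are free-form key/value pairs in an object's metadata, and selectors pick objects by them.

```yaml
# Any Kubernetes object can carry labels in its metadata.
metadata:
  labels:
    app: webshop
    environment: production
    color: blue
```

Once objects are labeled, a selector query pulls out exactly the matching ones:

```sh
# List only the pods carrying this combination of labels.
kubectl get pods -l app=webshop,environment=production
```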
Let's now start with the deployment strategy that we want, okay? Before I go ahead, let me bring up the initial cluster which I'm going to use. So I have two machines with me. This one, if you notice, is the Kubernetes master; on that side is the Kubernetes node, okay? I'll keep toggling between the two, but I'll make sure the one I'm writing to is maximized. You can always see that "master" is written here and "node" is written there. Kubernetes, for those who are not familiar, has a very standard client-server architecture. There is a master, against which you pass commands, and there's a node which actually processes the commands, in the sense that it is the one responsible for running the containers. Just to get a hang of it, let's see: I have one node ready with me, all right? Now, if you look at the initial setup, we have a two-node setup, one master and one node, and I'm going to use nginx 1.7.9 for this demo. So I'm going to create a deployment and a service. A deployment basically will create the Docker containers on the node, and a service is responsible for routing the traffic from the user to one of the two containers that will be created. Let's quickly look at what we have here. I have an initial deployment with me: I'm basically starting two replicas of nginx 1.7.9. Now, the interesting thing to note is that I have applied certain labels to it. There's an app label, there's an environment label, and there's a color label. Please ignore the color label for now; we'll come to it later. The most interesting labels here are the environment and the app. So let's go ahead and create the containers. And just to make sure we don't have any residual thing running here, if I do a docker ps... yeah, there's nothing running here. So I'll just create the initial deployment. It's created. Let me also show you the service that we have here. So this is my service. Now, remember that we tagged our resources; I'm going to use the same tag, environment: production. What this service is going to do for me is receive the traffic on port 80 and direct it to any resource that has this particular tag. So all the pods I have, all the containers I have with this particular production and blue tag, are going to receive the traffic. I'll just create it quickly. Once I create the service, I get an endpoint; I'm basically just going to copy this and, right. So I'm getting served with nginx 1.7.9 here. This is my initial setup. Right now I've not done any deployments; this is the first stage.
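Since the manifests are only shown partially on screen, here is a rough reconstruction of what this initial setup looks like; the object names and the service type are assumptions, and the API versions are the current ones rather than what was available at the time of the talk:

```yaml
# deployment-blue.yaml: two replicas of nginx:1.7.9, labeled for routing.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-blue
spec:
  replicas: 2
  selector:
    matchLabels: {app: nginx, environment: production, color: blue}
  template:
    metadata:
      labels: {app: nginx, environment: production, color: blue}
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
---
# service.yaml: sends port-80 traffic to ANY pod matching the selector,
# regardless of which deployment created it. That detail is what powers
# the canary and blue-green tricks that follow.
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  selector: {environment: production, color: blue}
  ports:
  - port: 80
    targetPort: 80
```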
Only a portion of the traffic, to a different machine or to a different container. Any ideas? Is it too early in the morning? Okay, too early in the morning, I guess. So what I'm going to do is create another deployment, just like I created the nginx 1.7.9 deployment, but I'm going to make sure that I use the same labels. What happens because of that is that the selector my service uses still matches: if the labels are the same, the service is going to route a proportionate amount of traffic to the new set of machines as well. So let's say you have a cluster of 10 or 100 nodes and you just want a canary of one node; then have a single deployment of one replica where you can basically, is it visible? Should I, is the font small? Good morning. I increased the font before coming, but I didn't realize it would still be small. Is it visible now? I'll move it up a bit. Is it visible now? All right. So now what I'm trying to do is run a canary deployment with one replica. For me that's 33% of my infrastructure, because I already have two and I'll add one more; but if you want to do a real canary in a production kind of setup, then probably make it less than 5%. I'm going to update the image: I was using nginx 1.7.9, now I'm going to make it 1.9.1. The thing you need to notice here is that I have added a few extra labels, like type, but remember the selectors that the service had? The service had two selectors, environment: production and color: blue. The service does not care what pod it's sending traffic to as long as the selectors and labels match. You should take advantage of that and route a part of your traffic here. Before I do that, wait, let me just show you: I'm going to fire about a thousand requests to see what happens. It's all nginx 1.7.9. Now I'm going to create the canary. Right, my canary is created; it'll take about a second to come up. If I fire the requests now, I should actually see certain occurrences of 1.9.1, right? Can you observe that? In fact, we can go a bit further: fire the requests in the background and count the responses: 328, which is approximately 33% of what we were sending. So this gives you a very good way to do canary deployments and test a release before actually rolling it out production-wide. This is what we do: the service sees where the labels are and sends the traffic across. Once you're done with this, the next step is a rolling deployment, basically deploying to the entire cluster, and that is slightly easier because Kubernetes supports rolling deployments out of the box, so you don't need to play around a lot with labels. I'm just going to quickly show you that. But to show you the rolling deployment concept, I need to increase the size of my cluster, because right now we just have two or three replicas running, which might not be very helpful. So I'm just going to increase the size of my cluster to six. Right, now I have a lot of replicas running. What I'm going to do next is just set the image to the new one that I want. As soon as I do that, the rollout happens, and the rollout looks like this: if you look at it, not everything is replaced at once. Only four of them were replaced, then the fifth one got replaced, and now it's trying to terminate some of the older ones.
So it basically rolls to ensure that your users will not see downtime; the entire cluster will not be taken out in one shot, it will be rolled. While your users will not see downtime, you have to understand that there might be some latency issues. So when you actually do this, keep in mind that your users will not see downtime but might see slightly degraded performance. The best thing is to either increase your cluster size, like I did, before doing a rolling deployment, or do it at a time when you know you have very few customers around, so the rolling deployment will not actually hurt their experience. Rolling deployment actually has a lot more around it, which I quickly want to show you. For example, you can check out the revision history. Now, there is an open bug because of which this column shows "none", but what you can do is go ahead and check the revision history by revision number. If you do that, you'll see that there was a previous version with 1.7.9 deployed. And if you think that things are not working your way and you want to roll back, you want to go to a previous version, that's easy as well: all you need to do is a rollout undo, and that puts you on the previous version. So instead of getting 1.9.1 here, you will get 1.7.9 back. Okay, so that's basically rolling deployment. Lastly, I want to talk to you about blue-green deployment. This is a different case which doesn't fall into rolling. What you do in blue-green is maintain two separate sets of clusters, one blue and one green; that's why I used the color label, if you noticed. You say that my traffic is going to go to blue, but as soon as you want to update, you set up a new cluster called green, and when you're satisfied, when your tests are done on green, you route your entire traffic to green. This is also very helpful if you are into immutable architectures, where you don't want to change what has been deployed once; in that case you create a new cluster, do your tests on that, and just switch the traffic. So what I'm going to do, since right now we are on 1.7.9, is first delete the canary; we don't need it anymore. Once I delete the canary, I am going to create a green deployment. Green is basically 1.9.1. So if you look here, whenever you hit this endpoint you're going to get 1.7.9, because that is the default right now; that's the blue, and that's where our traffic is going. Now, to get our traffic to the green cluster, what do we need to do? Any ideas? Basically, we need to edit our service: instead of the selector blue, we'll use the selector green, and that should do the trick. Kubernetes gives you the facility to edit a live artifact, so I'm going to do just that, edit it live, which means I'm just going to go here, these are the selectors we have been using, color and environment, set the color to green, and as soon as I save this, my traffic will start going to green. Now, if I go here, I should see 1.9.1. Yeah. All right. So my traffic is now going to the green cluster. I can take down the blue cluster as and when the pending requests are served. So with that being said, that's all I have on clusters and labels and Kubernetes.
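For readers who want to see roughly what these three moves look like outside the live demo, here is a minimal sketch using the official Kubernetes Python client. The deployment and service names, namespace, and label values are illustrative assumptions, not the exact objects used on stage; the speaker drove all of this from the command line.

# Illustrative sketch (not the speaker's exact setup): canary via shared labels,
# rolling update via an image change, and blue-green via a service selector flip.
from kubernetes import client, config

config.load_kube_config()              # assumes a working kubeconfig
apps = client.AppsV1Api()
core = client.CoreV1Api()
NS = "default"                         # hypothetical namespace

# 1. Canary: a one-replica deployment whose pods carry the SAME labels the service
#    selects on (environment=production, color=blue), so it receives ~1/N of the traffic.
canary = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="nginx-canary"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "nginx", "type": "canary"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={
                "app": "nginx", "environment": "production",
                "color": "blue", "type": "canary"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="nginx", image="nginx:1.9.1")]))))
apps.create_namespaced_deployment(namespace=NS, body=canary)

# 2. Rolling update: patching the image on the main deployment makes Kubernetes
#    replace pods gradually (the equivalent of `kubectl set image ...`).
apps.patch_namespaced_deployment(
    name="nginx-deployment", namespace=NS,
    body={"spec": {"template": {"spec": {"containers": [
        {"name": "nginx", "image": "nginx:1.9.1"}]}}}})

# 3. Blue-green: flipping the service selector from blue to green moves all traffic
#    to the green deployment in one step (the equivalent of editing the service live).
core.patch_namespaced_service(
    name="nginx-service", namespace=NS,
    body={"spec": {"selector": {"environment": "production", "color": "green"}}})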
Do you guys have any questions? No? Yeah. Yeah. While changing from blue to green, as we are humans, what if you make a mistake? Is there a way to handle that in Kubernetes? No. If you make a mistake, then your traffic will go wherever you pointed it. It's generally a bad idea to make mistakes. What I did here manually is probably something that you should not do manually. This was done for the demo, but ideally you should fire commands. For example, if you noticed, when I changed the image to 1.9.1 while doing the rolling deployment, I could have done that using edit also; I could have changed it live, and if I had made a mistake, it would have failed. So use a continuous integration solution, probably Jenkins or something, and have those commands in place rather than firing them manually. That usually helps. And while I'm saying that it's possible to do it live, I certainly do not recommend doing it like this. This is a demo and this is just a way to showcase the features, but ideally you should use a continuous integration system like Jenkins, or a deployment tool, to make these changes. [inaudible question from the audience] That's slightly peculiar, because a container is basically the same process running in a different namespace and with its own isolation, so it doesn't really, that's actually an interesting use case, I did not realize that happened. But have you tried using Java restrictions like -Xmx and all those things? So Java is not honoring that as well? Okay. Yeah, that has to be looked at; I'm not entirely sure. My name is Satish. Suppose we have deployed microservices in multiple containers, and instead of routing through the service, these microservices want to talk internally. How can we make sure that traffic is not hitting the load balancer, so that containers can talk to each other in the Kubernetes world? You're saying that you want containers to talk to each other without involving any sort of service or load balancer? Yes. I would probably not recommend that. I will tell you how, but I would not recommend it. I would want you to use something like Kubernetes services, because that is helpful in case one of your containers decides to fail on you. That being said, if you really want to do it, pick any overlay network that you like; Flannel is very useful for that. You have to pick an overlay as well as a DNS add-on, because if you pick just the overlay network, you will have to memorize IP addresses. Your applications would need to know the IP addresses, which is not advisable because IP addresses change. If you pick the overlay along with the DNS add-on, then you just need to know the name, and the containers should be able to hit each other directly. Ideally it should work; I'm not sure how your system is set up. But see, a load balancer like a Kubernetes service is actually very lightweight. By default it's based on iptables, so basically you're just routing the traffic using iptables, and it normally works very well with very minimal overhead. I have never seen a delay of even 2-3 milliseconds. Something that doesn't even add a couple of milliseconds is, I think, a worthy option if it increases the reliability of your application. I can... Yeah. Okay, cool. It does. Okay, so blue-green means that you actually...
I mean, if you talk about the by-the-book definition of blue-green, then you have to have enough capacity to run two clusters in parallel. Now, I know for most organizations that's a waste of money; for me, it's a waste of money. If you are on a provider like AWS or Google Cloud or DigitalOcean or something like that, it's usually not a problem, because you basically just pay extra for an hour or whatever they bill you for, and that's not a big cost. But if you are with a data center provider, if you have your own physical hardware that you're managing, then it becomes a cost issue, and yes, then it will be a problem. A rolling blue-green, a hybrid of rolling and blue-green, helps. But technically speaking, that's not exactly blue-green, because at some point in time your capacity will be compromised while you are in the transition state; compromised in the sense that it will be reduced while you are in the process of rolling out. Either you have to reduce capacity, or you have to make sure that the users are getting served from both of them simultaneously. So you have to handle either that situation or the reduced-capacity situation; it's your pick which one you want to handle. Last question. You are very... Oh, okay. Right, right. Right. Yes. Yes, yes. Hmm. It can handle it to some extent. If you are hitting your CPU limits, then Kubernetes can scale automatically; there are autoscalers available which work out of the box, built into Kubernetes. That's one way. I personally have found it to be limited, because it does not cater to all the parameters that I want to autoscale on. So some time back, for a client, I ended up writing a custom solution which hooks into Graphite and then scales, because scaling is just a command; by the way, there are Kubernetes APIs, so you can use the REST API as well and hit it directly instead of using the command line. So what I would recommend: if you are just looking for very basic, CPU-based scaling, then you have the Kubernetes autoscaler, you can look at that. If you want something more advanced, if you have more parameters to consider, then you'll have to get a little hands-on with the code and the API. It's not too difficult: just have your data shipped to Graphite or whatever graphing engine you have, where you can see that these metrics are being received and so on, and based on that you can call the Kubernetes API to scale up and down whenever you want. It usually takes a little time to understand, but coding it is not very difficult; you can do that. If you are on data centers, then, I mean, you are not on a cloud provider, you are on a data center. Okay, the reason cloud providers came into the picture and gained popularity was exactly this, because scaling with a hosting product is very difficult, and I don't think Kubernetes or any other tool will scale hardware for you. For hardware you have to do capacity planning. No, it will direct the request; your application will time out. That's the standard thing: forget Kubernetes, forget containerization or anything like that. If you have a process running, a web server running, and you bombard it with requests, at some point of time it's going to time out for you or for some of your customers. It might serve a partial number of requests, but for certain customers it will time out. That's not a Kubernetes thing, that's your application.
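As a rough illustration of the custom-autoscaler idea mentioned a moment ago (poll a graphing engine, then call the Kubernetes API to change the replica count), here is a minimal Python sketch. The Graphite URL, metric name, thresholds, deployment name, and namespace are all made-up placeholders, not details from the talk.

# Hypothetical sketch: scale a deployment based on a metric pulled from Graphite.
import requests
from kubernetes import client, config

GRAPHITE = "http://graphite.example.com"      # placeholder URL
TARGET = "stats.web.requests_per_second"      # placeholder metric name

def latest_value():
    # Graphite's render API returns JSON series with (value, timestamp) datapoints.
    # This assumes the target resolves to exactly one series.
    resp = requests.get(f"{GRAPHITE}/render",
                        params={"target": TARGET, "from": "-5min", "format": "json"})
    points = [v for v, _ in resp.json()[0]["datapoints"] if v is not None]
    return points[-1] if points else 0.0

def desired_replicas(rps, per_pod_capacity=100, minimum=2, maximum=20):
    # Naive sizing rule: one pod per 100 req/s, clamped to [minimum, maximum].
    return max(minimum, min(maximum, int(rps / per_pod_capacity) + 1))

config.load_kube_config()
apps = client.AppsV1Api()
replicas = desired_replicas(latest_value())
apps.patch_namespaced_deployment(
    name="web", namespace="default",          # placeholder deployment
    body={"spec": {"replicas": replicas}})
print(f"scaled to {replicas} replicas")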
That's your application. All right. So the lights are on, which means I'm being shooed off the stage. Okay. Thank you, everyone. Before we break for our morning beverage break, which is coming up next, I'd like to take a moment to talk about something that we may not think about very often in our jobs. We're always very focused on using the computer, and technology and DevOps as an occupation is very stressful. We sometimes don't stop to think about our own physical health as well. HasGeek earlier this year had a conference called Kilter, where we were thinking about physical health and fitness, and today as well we're offering some activities for people who may want to step outside of their brain and more into their body. This afternoon at 3:30, in the lawns outside, there will be yoga and slacklining; for something completely different, turn around 180 degrees and try something else. So if that sounds interesting, or if you need to move your body, please by all means have a look at that. And also this evening we'll be having a talk at the banquet on sleep and its effects on the brain, so for anyone who's interested in knowing more about this, I encourage you to make sure you don't miss that talk. So thanks, everyone. Enjoy your coffee and chai. Hello, can you come over here to help me with that? Okay. Welcome back, everyone. I have a couple of announcements from our sponsors. You will have a chance to win exciting prizes, and I'm sure that appeals to everyone. From Widest Concepts, you can win exciting prizes in the DevOps quiz, so have a look for the Widest Concepts booth outside in the vendor area. In addition, there is a DigitalOcean spin-and-win contest in which you can win a one-terabyte portable external hard drive. That's one trillion bytes. Is that right? Yes. So, excitement. Also, there will be a flash talk session today at 5:20 p.m. If anyone would like to deliver a flash talk, it's not too late: you can sign up at Talkfunnel online, or you can look for the whiteboard in the seating area at the box office. So please, if anyone has an idea for a flash talk they would like to deliver, come one, come all, put your name on the board, and you can give the talk in the session at 5:20 p.m. today. I'd also like to remind everyone that in your bag is a feedback form for the conference. We would really love feedback; in fact, we use the feedback to make each conference better every year. So please fill in the feedback form. It will only take a few minutes and it really makes things much better for us. You can drop your completed feedback forms in the laundry bags placed outside the auditorium. Very easy, just drop it off. Please only put the feedback forms in the bags, no coffee cups or anything else, and don't put your laundry in either. All right. I'd like to introduce our next speaker, who has come all the way from Rajasthan, the land of kings and colors. But she's not picky about that; she'd prefer to be known simply as being from India. So let's see, what else can I tell you about Pooja? I've taken some notes here. She'll be giving a talk starting the automation segment of our talks today, so she's talking about a little bot; we'll find out what that's all about. And she's constantly striving to learn everything she can and share her knowledge with others, so that's part of her mission today. So, without further ado, I give you Pooja Shah. Well, I really want to thank you for joining me here. I am Pooja. I work with MoEngage. A brief about me: it's not so brief.
After years of trying to find out about myself, I figured out one word which explains me best: I'm an explorer. I have explored a lot of things, but I'm still one of you who feels there is a lot more to explore. At work, I have been a developer, a QA, an automation engineer, and now I'm trying to dabble with DevOps. Personally, I am a foodie and a sleep lover, but loving yoga puts me in a situation where nobody can tell that these are my hobbies. Coming to the work, I am... coming to the work, sorry. I work with MoEngage, and at work I try to bridge the gaps between all of my teams. With that mindset, I lead a QA team and am currently actively focusing on building automated systems which can improve collaboration across the teams and help us build and ship a healthier product. Along with the work, I am an open-source lover; with that interest, whichever project I use for my work, I make sure to contribute back to it. My recent contributions can be found in Jenkins and automation-related plugins. Coming back to the talk: today I am going to introduce you to one of our special team members. Her name is Alice, and I can see some faces glowing; sorry for getting your hopes up, she is a bot. Yes. So we are going to go on a small, crisp ride to meet Alice, starting from why Alice, to what all it can do, and how, the ingredients and recipes, how we cook it, followed by the demo and Q&A at the end. So are you guys with me so far? Alice needs more voices. Yes. We are going to start with why. I take you to one of our typical release-day conversations; just try to recall your own release days to understand it better. It's nine o'clock in the morning, you have decided to release at ten o'clock, and all of a sudden a roadblock comes. So here is one of our team members, named Karna, who comes running: hey, Draupadi, something is broken and you are about to release, can you figure it out? And here is Draupadi, one of the QA members, who was super sure last night that it was working, and she's like, no, it cannot be. She panics and says, okay, let's try to talk to Bhima, who was the developer of that feature. Bhima, on the other hand, has his usual defence: hey, I swear to God, I have not touched this for a year now, how can it be me? Right? At the same time, seeing the situation, Krishna, who is the CTO, joins the discussion and says, hey guys, calm the hell down, try to find the buggy commit, let's revert that commit and move ahead. Now Arjuna has to come and do the dharma, and the dharma is: go to GitHub, go through all the pull requests, and find the commit which was buggy. It is going to take time; a lot of pull requests have been merged by now. Seeing this, Vishwamitra, the developer, gets into a clear confusion: oh no, we cannot release now, because it's going to be peak hours for customers, so let's just postpone the release. And, you know, with the release already planned, the others have to say: no, we cannot postpone, we have planned this and we have committed the timelines. What do you do? I want to know how many of you feel the same pain which I just described. Hands are rising, after checking that the bosses are not sitting around. Yeah, yes. The same thing happened to us, and when that happens, what do you end up with, seeing this situation? What do you call this? Blame game.
We don't like to use negative words, but let's say some kind of conflict, wherein people are not able to move ahead because of so-and-so reason. And when that happens to you, what do you want to do immediately? You feel: no, there should be some way. I also felt that, and that's where we figured out that all the problems we were seeing were just the symptoms; the root causes are beneath, and we wanted to really work on the black area which we see here, the problems which we were not able to figure out until then. From here I'll take you to the problems one by one, and the solutions which Alice brings in. But before that, let's understand how a typical code flow happens in our organization, to give you a better understanding. Here are our developers, who write code in their own branches and push to GitHub. From there, a pull request goes to the dev branch, and from there a pull request goes to our test branch, which we call QA in our context. At that point we declare the code freeze and the QA team tests only on this, so that no confusion happens, and when everything goes right here, we move it to master and we take the release from there. What we figured out is that the root causes were within this area, which we called the sensitive area, and all the branches within it we call sensitive branches: dev, test, and master. So we figured that if we could put some kind of monitoring on this, we could resolve those problems. Looking at the problem again: the developer inside me said he was right, he did not touch that code. Now should he leave his current work and go check it? No, we don't want to lose that productivity. As a QA, I feel the same: whoever's bug it is, I just want a solution to take it ahead. Same for DevOps. And finally, the automation engineer inside me comes out in all this and says: hey, this communication problem is repetitive in nature, it's happening every release. Can we do automation around it, can we automate the monitoring of these sensitive branches? So from here I take you to the how part: what the problems are and how Alice solves them. For example, the very first problem, the last-moment panic attack, comes because we are somehow letting unreviewed commits inside the system, and that's where Alice says: hey, do not let them go inside. Now, in the present scenario GitHub has launched its approval and review feature, but when we started, it was not there, so we started with this, and it still has its own value; I'll come back to that in the demo section. So whenever such a thing happens, whenever somebody merges code without a code review, what Alice does is auto-revert it immediately, informing people. For communication we are using Slack, and for code collaboration we are using GitHub, so Alice knows, parses, and finds out what is wrong, and reverts it. Second problem: no quick record. We saw that it was Arjuna who had to go through all the PRs and find out, and it was him all the time. Could we make it more visible to every member of the team? That's where we decided we'll have a permanent channel for each repository we have, and Alice keeps a record of all the events happening in the sensitive branches. For us, merge was the sensitive event, and the branches were QA, dev, and master, so Alice does this.
So looking at this, we can figure things out: let's say I'm a QA engineer and this feature was working two hours back but doesn't work anymore. Now I do not need to just alert the channel with "hey everyone, check this"; I have an understanding that these two pull requests are probably the ones I can check, and I know who the authors are, so I can actually discuss with them. It eases up everybody's task, and whoever has committed within that timeline can see it and quickly fix it. Next problem: no danger boards. Mistakes are prone to happen, accidents can happen if you don't put up the danger boards, and that's where Alice says: hey, be proactive and put up the danger boards. For example, we introduced this feature especially for DevOps. We found that there were machine-level, DB-level changes which people were making and forgetting to inform the DevOps team about, and that's where we decided, okay, how do we not repeat this? One example is this: whenever some sensitive file gets touched by someone in one of those sensitive branches, Alice comes and informs the respective DevOps team, hey, this is being changed, so DevOps can check it, and whatever they want to do, they can do afterwards. Another example is this: we said that we will take a release only from the QA branch, not from any other branch. So what Alice does is, whenever somebody creates a pull request from some other random branch to master, Alice auto-closes it immediately. So we are saving the time spent on mistakenly merging a wrong PR and then reverting the code again. Lack of awareness: this is one of the areas where Alice got more feathers in its cap. Alice says that people have a tendency to forget, so keep reminding them, and that's where "keep me posted" comes in. Alice posts about multiple things. For example, the moment you as a developer merge some code into a special branch, Alice replies: hey, you have merged this and that in this branch, now be nice and just mention it in the release notes. It helps the QA team plan their bandwidth for when they can release, and others also have clarity about things. Next is this: whenever we are about to release, we want to inform everyone. This is an auto-alert from Alice saying we are about to release, especially to tech leads, so that if something is missing or there is something they want to add, they are aware. This is one of the very interesting ones; I myself have made this mistake a lot of times. We took a release live without the JS update, and we all know what happens: for five minutes our UI was not rendering properly, and people were trying to find out what went wrong. It was just forgetfulness. So we decided that when somebody merges to the release or master branch, Alice immediately pings: hey, did you fill in this checklist? If this checklist is not filled, they, whoever the release engineer is, won't take the release live. Next: automatic guide. We were around 20 engineers when I joined MoEngage, and we grew quickly to around 40 engineers, and telling everybody, educating everybody about the process was becoming a tedious task for me. That's where I saw the full potential of including an automated guide, bringing in the talking bot. So I'll come to the tech inside, how we are doing it.
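Before moving on to the conversational side, here is a rough sketch of how rules like the two just described (alert on an unreviewed merge into a sensitive branch, and auto-close pull requests that target master from the wrong branch) could be wired up around GitHub's webhooks. This is not Alice's actual code (Alice is built on HuBot and Jenkins, as described later); the endpoint, tokens, and branch names are placeholders.

# Hypothetical sketch of a GitHub pull_request webhook handler with two "Alice-like" rules.
import requests
from flask import Flask, request

app = Flask(__name__)
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # placeholder incoming webhook
GITHUB_API = "https://api.github.com"
GITHUB_TOKEN = "ghp_placeholder"                          # placeholder token
SENSITIVE_BRANCHES = {"qa", "dev", "master"}

def notify_slack(text):
    requests.post(SLACK_WEBHOOK, json={"text": text})

@app.route("/webhook", methods=["POST"])
def on_pull_request():
    event = request.json
    pr = event.get("pull_request", {})
    base = pr.get("base", {}).get("ref", "")
    head = pr.get("head", {}).get("ref", "")
    repo = event.get("repository", {}).get("full_name", "")

    # Rule 1: a PR merged into a sensitive branch without any approving review gets flagged.
    if event.get("action") == "closed" and pr.get("merged") and base in SENSITIVE_BRANCHES:
        reviews = requests.get(
            f"{GITHUB_API}/repos/{repo}/pulls/{pr['number']}/reviews",
            headers={"Authorization": f"token {GITHUB_TOKEN}"}).json()
        if not any(r.get("state") == "APPROVED" for r in reviews):
            notify_slack(f"{repo}: PR #{pr['number']} merged into {base} without review "
                         f"by {pr['user']['login']} - please check or revert.")

    # Rule 2: only the QA branch may be released to master; close anything else automatically.
    if event.get("action") == "opened" and base == "master" and head != "qa":
        requests.patch(
            f"{GITHUB_API}/repos/{repo}/pulls/{pr['number']}",
            headers={"Authorization": f"token {GITHUB_TOKEN}"},
            json={"state": "closed"})
        notify_slack(f"{repo}: closed PR #{pr['number']} - releases to master must come from qa.")
    return "", 204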
So you just ask Alice about the system; it knows everything about that system, from static replies to dynamic ones, to giving it a task. For example: "Hey Alice, give me the release notes" and it replies. "Hey Alice, how do I take my patch live?" and it replies with the instructions, what you need to do. Dynamic: "Hey Alice, start my machine" and it goes and triggers whatever is responsible for starting the machine. "Hey Alice, get me the branch name": if I'm a product manager or somebody else who does not want to go into all the details of SSHing to a machine and finding out what code is deployed, that's very handy; Alice reports back. Now, coming to the tech behind it: it's no magic, it was all already there, but what it took me to do it was to connect the dots. There were the GitHub APIs, the Slack APIs, the HuBot APIs, and Jenkins, and I really want to appreciate the Jenkins community, because that's where I got all the ideas to automate things. And how we did this: we have GitHub repositories, and each repository has a webhook mechanism from GitHub, wherein whenever you merge a pull request, an event is triggered, and this event triggers Alice. There is a full payload of the pull request, which Alice parses, and then it does the particular action, the business logic. The business logic can be pushing back to Slack, the business logic can be talking to the talking bot, HuBot, and Jenkins. These are all interconnected so that whoever is best at a given task does that task; they are just interconnected through the main source called Alice. So let's quickly have a look at the demo. For simplicity I have kept one. Is it visible? Now somehow it's not increasing on the laptop. No, it was a joke, I had a demo recorded. So, coming back: I have one commit ready, and I create it, rootconf-live from rootconf-17, and I say create pull request. The moment I do, Alice does multiple tasks. Do you see? Immediately the guideline has come up, which the respective person has to follow. Now people might object that we could probably do this with GitHub's contribution templates alone, but I'll come to why we needed this: for different branches we wanted different checklists. The moment you see this, there is a Slack channel; for simplicity I have created one to show you. Here is the repo-code channel, and it has got the entry: hey, in this repo, this PR is being opened by so-and-so, etc. And if you see this, sorry, if you see this, I have made a change in a file which has a rule that if this file is changed, a product +1 is essential, and the rule says that if this is merged without the product approval, it should be either rejected or reported. So I merge this commit; let's see what happens in the channel. You see, it says: hey, this was merged without the product +1. In one go I got this detail, and there is one more entry in the repo-code channel, which we saw is the common place to trace back; it records: okay, this has been merged. So now Alice has all the traceability. If a person wants to know what was merged, they will come to this channel, and the release channel is especially there to inform people to take action if it has to be reverted, or it can be reverted back automatically; you can set or deselect the rules in Alice for that purpose. Now, coming to the talking part: let's talk to Alice. I can invite Alice as a user into my Slack channel. I can say, no, it's not Alice, it's not Alice. I can say, "Hey Alice, that's life." If it knows, it replies; if it doesn't know, it says "I don't know."
And you can ask, "Hey Alice, what is the branch on my machine?" One more like that, for my machine: it goes to that machine, this is a dynamic reply, it goes to that machine and reports back. And if you feel that this is too much noise in the channel, you can talk to Alice directly. You can directly ask, "Hey Alice, when is the next release?" These are dynamic data which somebody from the QA team or the release team feeds in, and you can go more in-depth and say, "Hey, what are the release items?" It goes and finds out whatever has been fed, either statically or in a file, or automatically from the merge events; you can do whatever you wish. And if you're getting bored, you can just ask, "Hey, get me a cup of coffee," things like that. The idea is you can do static, dynamic, or any sort of thing. Coming back to the talk: yes, we saw that it can solve multiple problems, and we can control it and say, "Hey Alice, do this." I saw DevOps engineers especially having trouble, like at 4 or 5 o'clock in the morning they have to wake up, log into a laptop, and run some script on some machine. This can be very handy: we can just feed in, okay, go to this machine and do that. You can connect Jenkins or other CI things with it, and at the same time you also want to ensure that nobody else can do it; in Alice you can do that authentication, you can say, hey, do this only if it's me. And you can also say, hey Alice, be more intelligent. For example, I have seen DevOps teams having this question again and again: is my release deployed on that particular server, or what are all the releases on that particular server? You can set this up quickly: whenever a release goes out, from that push event you can save the data from whatever branch it came; for whichever machine the release is going to, DevOps have their own way of scripting and finding out all the server details, and that can be pushed into one file which Alice can keep reporting on for each release. And this is one of the interesting things which we did, because I did not want to be a grandmother; I didn't want to tell everybody every time that this is a code freeze, we are not going to take more items. So we fed it into Alice: whenever a code freeze happens on the test branch, it automatically replies; it has all the entries of what was merged during that period, and it gives the quick details, so people can see, okay, my code is there and getting tested, and QA teams can plan their bandwidth for the future. So what do you think the future can be? Beyond whatever is written here, it remains open; there is not a single thing you cannot code, whatever you want. But the immediate future I am seeing is continuous integration, and the reason is simple: continuous integration requires the code commit data, and Alice already has all of that. All it needs to do is run some checks and move the build to the next staging server. So that is what I am working on; the code will probably be there soon, it is coming soon. And some people are like, you must be kidding. Yes, I am kidding: it is not "coming soon," it is already there. We have open-sourced it, and you can try it, and if you like it, star it, fork it, contribute back to it. I would love to hear more stories about what problems Alice solved for you. And with that, I want to give credit to a lot of people, mostly for the problems at work and for the APIs, and the photo credits go to Unsplash and Google.
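The "hey Alice, start my machine" flow described earlier boils down to the bot calling Jenkins' remote build trigger. As a hedged illustration (job name, URL, and credentials are placeholders; Alice itself drives this through HuBot scripts), the call might look like this:

# Hypothetical sketch: trigger a parameterized Jenkins job, the way a chat-bot command might.
import requests

JENKINS = "https://jenkins.example.com"          # placeholder Jenkins URL
AUTH = ("alice-bot", "api-token-placeholder")    # Jenkins user + API token

def start_machine(machine_name, requested_by):
    # POST to /job/<name>/buildWithParameters is Jenkins' standard remote trigger endpoint.
    resp = requests.post(
        f"{JENKINS}/job/start-machine/buildWithParameters",
        auth=AUTH,
        params={"MACHINE": machine_name, "REQUESTED_BY": requested_by})
    resp.raise_for_status()
    # The job itself would post back to Slack in a post-build step, so there is a record
    # of who triggered what, as described in the Q&A below.
    return resp.status_code

start_machine("staging-web-01", "pooja")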
Sorry, to summarize: we started with solving one problem, which was that people were merging unreviewed commits. We started with that one problem, and gradually we learned in the process that Alice could do much more beyond that. Now I hope you have an idea of how to solve your own specific problems in a more automated way, and an idea of how the future Alice will look. With that, I appreciate your time. Thank you. Q&A: I can take a few questions. Hi. All right, so I saw your checklist for developers before the merge is done, right? I know the developer's mindset is: I want it merged right now, just click the box and merge it. How long have you had this checklist, and how effective has it been? Yes, so the checklists: the first checklist which I showed in the slides, if you remember, was the release checklist. That one we definitely take care of, because we already made that mistake, right? So for the release checklist it's not a problem; everybody follows it, because otherwise the release will go wrong. Now, the developer checklist: yes, it is hard to enforce. There are people who genuinely do not want to miss things, and for them it is very beneficial; and for those who, even after seeing it, do not want to do it, that's where continuous integration comes in: we do not enable the merge button itself until those things are done. Was I able to answer that? Actually, the thing is, we have one bot, a HuBot, and we have integrated it with Jenkins, but my Slack user and the Jenkins user might be different, so how do I control the credentials? It's simple. So I created a Jenkins... what do you want to solve with that? What I want to control is: we have a production-deploy job in Jenkins, and I want to log who is triggering the build. The Slack user might be a different one, and the same person might have a different username in Jenkins, so how do you handle that? So the answer is yes or no depending on whether you want everyone to do that or not. For us, if I create a user with the same name as Alice, nobody gets to know about it; for internal authentication, for whatever user you have created, you can authenticate just that specific user token. There is no need to make everybody everywhere Alice; you just name Alice internally for that specific user token, and you can say that only this user token should be allowed to perform a particular task. But then you will not be able to log who is triggering the build: suppose I have a bot, anyone can send a message to the bot, so I have to do the locking in Alice, not in Jenkins. What if I want to do it at the Jenkins level? At the Jenkins level: everything goes through Alice. Alice triggers a call to Jenkins to perform a certain task, and in Jenkins, in the post-build step, we have written a reply back to Slack, so we know who did what. It's never open-ended; for everything we do in Jenkins we have a post-build Slack call replying back, saying hey, I did this on this particular server, so people get to know. It's not like somebody privately talks to Alice and does something unseen. So that would be our way of handling it. Can you help me understand the question a little more: what do you mean by phases here? Okay, okay, I get it. So, with HuBot: when you interact with Alice through HuBot, HuBot has scripting, it's called CoffeeScript, it's just like JavaScript, and in it, all you need to write is mostly regular expressions.
So probably, the richer the regular expressions you write, the more natural the language becomes, so people do not need to mug up or remember commands. They just say plain English: "I want to start this machine," and Alice will understand which command to hit for it. People can come with completely plain English; whoever joins new, you just need to say, hey, just talk to Alice, and they will figure it out. That's how I have improved at writing my own regexes, which I used to hand out to people, and that's how it came to have a more natural style of understanding. Okay, so we're cutting down questions; speakers will be available outside, you can talk to them later, and we'll have a Q&A session as well. Thank you all. Thank you for the time. I'm around here; I would love to chat if you have something in mind, and if you have more ideas to implement in Alice, come talk to me. Thank you. Thank you, Pooja. I'd like to just take this moment to remind everyone, especially the people who came in late from the coffee break: please do remember to fill in your feedback forms. We rely on the feedback from the participants to make the conferences better each year, and you can put your completed feedback form in the bags outside the auditorium for collection. I would like to encourage everyone to do a quick stretch, move your body just a little bit so we can stay awake, take a deep breath, roll your head around. And then I'm going to turn over to our dear friend Karthik, who's going to do a demonstration of a wonderful technology called Contact Point. You may have noticed these little codes on your badge; there's a secret hidden inside there. Hi, everybody. Okay, so this is a quick two-minute demo. This is a really small app on the Google Play Store; it's not on iOS yet, and we are open to contributors for this project if you want. It's a simple app that right now just gives you the ability to share the contacts that are printed on your badge. So I'll just take you through a quick demo. Okay, so a quick run-through for those of you who haven't downloaded this yet: it's got a simple overview of all the things that are happening during the conference, so if there are any announcements, sponsor posts, or anything else regarding the flash talks, we will have them as little announcement boxes here; as you can see, one of them is the code of conduct for now. You can also go and view the schedule on the schedule tab, so all the talks are here. Even if the badges don't have the updated schedule, like the two talks that were switched, this one will always be the right one, for today and tomorrow, across all three tracks. And the interesting thing is this contacts tab: if you go to this and then hit the floating action button, the plus one here, you will have to log in first, so do that. Yeah, so once you've done that, when you hit the plus button, it should just open up the scanner, and it's as simple as scanning the code. The idea is that you will be able to scan each other's contacts. This doesn't save it to your contact list; instead you click add, and when you go back you should see a list of contacts in this contacts tab, and you can hit the little export button on the top right, and this will generate a VCF that you can then share on email or whatever else; you can also tap on the contact if you want and add it to your phone. And the other thing I want to show you is that we also have a Slack team, so you would have received an email, I think yesterday, and on Monday if you had bought tickets before that, regarding the Slack team.
So I really encourage all of you to join. I can show this to you: if you don't have an invite, on the first tab there is this button called discussion, so if you are not part of the Slack team you can just say "send me an invite," and it opens a little pop-up where you can give your email and get invited. So this is what the Slack team is like; we have a lot of conversations. We're just trying to experiment with having an online thing with people at the conference itself, and to see if you will break off into smaller groups and have discussions. So yeah, if you have any feedback on this, I'd love to hear it. I'll be sitting either here or near the registration desk outside, today and tomorrow, so if you have any feedback on things we can improve, please let me know. And like I said, all of this is open source at this URL, so you are welcome to send patches and commit, because our idea is that this is an open app that other conferences can hopefully also adopt. Okay, thank you, Karthik. I'd like to remind everyone, let's see, of a couple of things. Oh, the two sponsors I mentioned with the lucky draws: it's not only those two; it turns out a number of the sponsors outside at the vendor booths are having lucky draws and quizzes, so you can win all sorts of exciting prizes. So pay a visit to the vendor stalls. I'd also like to remind you that HasGeek is doing some workshops this weekend. There are three different workshops, I believe, and there are still a few seats remaining, so if you're interested in attending a workshop, it's not too late: have a look on the RootConf website at the workshops that are available. Okay, our next speaker is coming from Delhi, and he is going to be talking about some more monitoring, back to the monitoring part of our talks, and we're going to be doing a Q&A session with our monitoring speakers later on. This is Manan Balala; he's coming from Delhi. He has worked both with not-for-profit companies and for-profit companies, so he's seen it from both sides. In today's talk he'll be sharing a little bit of experience from his for-profit work on tooling and monitoring for performance-critical applications. So please welcome Manan Balala to the stage. Hi, everyone. Just wave if you can hear me at the back. Awesome. All right, so thank you for staying back; I hope you got enough coffee and energizers to keep you going until lunch. This talk is about monitoring. My name is Manan, and that thing below is my Twitter handle. I'd love some feedback on this talk; I'm trying to constantly improve it, so if you have any feedback, please write it there, and that'll be great. So yeah, this talk is about monitoring, but not about monitoring in a generic context; it's in a very specific context, the context of a large online retail website, and we'll see how that changes the equation. We'll see what sort of monitoring tooling we need when it comes to an e-commerce website. Are you with me so far? I hope so. Yeah, it's about monitoring in an e-commerce context. We'll also try to conceptualize the entire monitoring system into phases; each phase sort of warrants its own tool, and there are certain tools out there which cater to more than one phase, but hopefully after this talk, when you see a tool, you'll easily be able to map which phase of the monitoring setup it goes into. So first, a brief about Otto.
Otto is a large online e-commerce retail company based out of Germany; it's the second-largest e-commerce company in Germany. Amazon surpassed Otto in 2014, before which Otto used to be the biggest. A little bit of history: it began post World War II, when this guy decided that he wanted to sell shoes. He took pictures of shoes, pasted them in a catalogue, photocopied 300 copies, hand-delivered the catalogue in the local area, and the orders came in; he delivered them. Over time the guy built enough loyalty that today Otto is the largest mail-order company in the world, the second-biggest e-commerce company in Germany, and operates in more than 20 countries. The net worth of the Otto family is about 18.4 billion, of which about 4 billion comes from retail; a big chunk comes from their real estate business as well. That is the Otto campus, where I was working with them; the campus is about 205,000 square meters. For those who can't picture how big that is, it's roughly 35 football fields put together, so it's pretty big. But when we talk about scale at Otto, that's definitely not what we're talking about; we generally talk about their online shop, otto.de, which records about a million unique visitors every day and about two orders every second. Given those statistics, which are public, we can extrapolate. If it takes these screens to place your order (that, by the way, is the home page of Otto; you could say it's fairly simplistic if you compare it with the Flipkarts and Snapdeals, and that's because they appeal to a very different persona than what Flipkart or Snapdeal appeals to, so they like to keep things simple, and that's basically their selling point), so if it takes these screens, you pick a product, you add it to the cart, later on you proceed to checkout, about eight screens to get your order placed, and if one in thirty visitors converts, we can say it's about 480 page impressions happening every second. Otto has their own big billion days, which they call boom days (translated to English; in German it's something else), and on those days we saw roughly 1,000 page impressions every second. With those statistics I'm not really trying to claim that this is huge traffic; of course there are companies with much larger traffic and much more visitor inflow compared to Otto. The thing that makes e-commerce unique is that we're talking money; there is no other context in which it is easier to relate downtime to direct losses. We had these numbers being thrown at us: okay, that one minute of downtime just lost us 100,000 euros, or something like that. So it's very easy to connect the loss value. Essentially, we don't want to do anything that affects the money we're making; we want to maximize it, we want to offer an optimal solution which keeps the inflow going. We want to continue building more features; there are great features coming in, everyone trying to get one up on the other, but we don't want to build any features that don't tie back to the original point: we want to keep building features which add more value, which in turn bring in more money. We need to make better decisions, and we figured out that the features we're building can give us enough insights to decide these things, for example how to prioritize fixing a bug.
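As a quick sanity check of that 480 figure, the arithmetic works out from the public numbers quoted above (two orders per second, roughly one conversion per thirty visitors, about eight screens per order); this is just a back-of-the-envelope reproduction of the speaker's estimate:

# Rough back-of-the-envelope check of the page-impression figure quoted in the talk.
orders_per_second = 2
visitors_per_order = 30      # "one in thirty customers convert"
pages_per_visit = 8          # screens from product page to placed order
print(orders_per_second * visitors_per_order * pages_per_visit)  # 480 page impressions/second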
Say something broke in production, and someone comes to you and, off the top of their head, says: hey, this is a massive failure, we need to fix this now. Unless it's backed by real data, unless you know how much of your user base is affected by it, it's really hard to judge what priority to associate with it. So it's very important that the monitoring setup we have equips us with an understanding of how big an impact something has. Lastly, we really want to verify our assumptions. For this I'm going to take an example: let's say we were a little late to the game and we decided that we now need to personalize our e-commerce shop, and we said that if we personalize one user out of three, we expect a revenue increment of x. We make this claim, and using this claim we're able to sell the feature to the business, and we're able to begin development, and so on and so forth. We develop the feature, roll it out, and it's great, everybody's happy, except we never go back and verify whether the claim we made actually holds: are we performing at that level, are we able to personalize one in three users? Without that, it really just doesn't make sense. We're developers; it's not our job to make up numbers. We need to base our numbers on hard facts, and that's what our monitoring tooling should give us. So what do I need from my monitoring setup? I need what every other monitoring setup does. There's no discounting the fact that you need basic database monitoring: we need to know if our queries are not performing well, if maybe we need to add an index, if some queries are slow, if maybe we need to shard, and our monitoring system needs to tell us this much. We need to monitor standard server metrics; every monitoring tool does that, and we need this too: what's our throughput currently, how many requests are we serving, what's the open connection pool count, and so on. And we want to be alerted upon exceptions; this is a no-brainer, we need all of this. What we also realized is that what is really important to us is to measure the state of the system, and while this is a vague statement to make, what it essentially means is that we want to grasp how the system is performing, not just from an infrastructure point of view. For example, your website may be performing very optimally at 1 a.m.
at night, but there are no users there, so essentially you need to know what's actually happening. So it becomes important to also track these other metrics, and for this we turned around and said, okay, let's just ask the product team, let's ask the business teams how they would measure the state of the system. All of this agile jargon, what it's done is it's broken through those walls; now you can communicate. And they were equally shocked; they were like, okay, thank you for coming to us. Yes, we'd like to define our e-commerce system using orders per second and the users who are on the website. We're like, okay, these are important metrics, we need to keep track of them. So we realized there are essentially two sides to monitoring: there's monitoring of infrastructure, which ensures that you're offering stable and optimal performance, and then there's the business side of things, which is key performance indicators. Lastly, our monitoring tool should definitely allow us, when something is broken, to narrow down on the source of the bottleneck, on the source of the bug. And the thing that helps us here is continuous delivery: by continuously deploying your code onto production, what we've essentially achieved is that the marginal difference between the last deployment and this one has become smaller and smaller, so it's become much easier to track down what exactly broke something, and it should ideally become that much easier to fix it. Our monitoring tool should leverage that. Lastly, tying into the earlier point, we want to validate whether we are achieving whatever business assumptions we set out to achieve. And that brings me to logging. So logging is great, right? We developers love logs; they point out the exact source of the problem, the exact line number down to the character, and they tell you exactly why the error happened and what sort of error happened, after the error happened. And that's the inherent problem with logging: it's inherently reactive. When I get a log-based alert, I sort of feel like I'm on this planet which is about to explode, and somehow I need to figure out something that fixes the situation. I mean, I'm not saying logs are not good; logs are great. If it weren't for logs, I would probably be having a beer on that planet, I wouldn't be doing anything about it, but at least now I know that there's something wrong. But what about the time leading up to the point when it actually broke? Could I have known that this was going to happen? That's why I feel that if you're building a monitoring system today, you don't want to base it on logging; you want to base it on something more explicit. One more problem with logging is the signal-to-noise ratio: with distributed systems and microservices, the number of systems has grown and their logging has grown. Of course you have great tools, Splunk, Kibana, the ELK stack, all good, very easy to track your logs; but for a newcomer to the system it's very hard to gauge at a glance what's happening in the system. So even with all these tools, the signal-to-noise ratio of logs is just terrible. So I like to define it this way: if logging is about scattered incidents that happen in the system, accidents that happen in the system, we figured that explicitly collected metrics about the system help you understand the state of the system better and allow you to gain valuable insights. But the biggest advantage of explicitly collected metrics is that they give you insight
at any time. They don't just give you something after everything is broken; leading up to that point they're excellent sources. So we went forward with basing our entire monitoring system on metrics. This is where we actually get to our monitoring system, which can be divided into four phases. Our first phase is the collect phase, where you essentially gather the data, and I'd just like to begin with a disclaimer: we want to gather as much data as we can, but only if it is going to give us useful insight. It's very easy to collect all the data out there, except you'll be running out of disk, and secondly you'll just be increasing your clutter, so try and be reasonable about it. So when I spoke about metrics, I was talking about this metrics library. It's a library written by Coda Hale, and I think he presented it at a meetup; there's a talk about it called "Metrics, Metrics Everywhere." I strongly urge each one of you to watch that talk if you haven't already; it's an excellent introduction to the library and to why the library makes sense. Metrics is a library that you plug into your system and it runs with your code: you instrument it in code, and what you achieve is collecting data while your code is running in production, and that's its biggest selling point. Ours was a Clojure project, but ports of the metrics library exist: there's a Java one, there's a Scala one, and I believe there are similar solutions available in most languages. We found that it was as easy as just including the dependency, and then there were three parts to it. First you create a registry; a registry is nothing but a container for your metrics, so if you have a microservices-based architecture, each microservice can have its own metrics in its own container. Next, you choose what sort of tool you're going to use from the metrics toolkit; it gives you five tools, and we'll discuss each of them briefly, but in this case we're using a counter, and a counter is good when you control when something is increased, decreased, opened, or closed, so a counter serves a good purpose here. And finally it's all about instrumenting it in your code: just increment the counter where you see something opening, or whatever. Essentially, you know your code better than any third-party tool, and that's why this makes sense; this is much more explicit than a third-party tool saying "let me do all your monitoring for you." So, the first tool is counters. As I said, they're great if you control when something is opened or closed, so they're particularly good when you want to track connection pools: how many connections are open right now? Open one, increment; close it, decrement. Gauges are great if you're relying on a third party or an external source: you want to gauge its value at every moment and see how the value is changing. For example, if you're relying on a certain amount of disk space, or you're using Redis and want to keep track of how the Redis memory is changing over time, you can use gauges. Meters are great for tracking the rate of something, for example if you want to track how many requests you're receiving per second. While these are good use cases for these tools, we found even better use cases with them: tracking actual business metrics. For example, a counter can be used to track the number of users you're personalizing, and it's as easy as: if I'm giving any personal recommendations at all, just increment the counter.
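As an illustration only (this is not the API of Coda Hale's library, just a self-contained Python sketch of the same three ideas), a registry of counters, gauges, and meters could look roughly like this:

# Self-contained conceptual sketch of three of the metrics tools described above.
# This is NOT the real metrics library's API; it just mirrors the ideas.
import time

class Counter:
    """You control when it goes up or down, e.g. open/closed connections."""
    def __init__(self): self.value = 0
    def inc(self, n=1): self.value += n
    def dec(self, n=1): self.value -= n

class Gauge:
    """Reads an external value on demand, e.g. Redis memory usage."""
    def __init__(self, read_fn): self.read_fn = read_fn
    def value(self): return self.read_fn()

class Meter:
    """Tracks the rate of events, e.g. requests per second."""
    def __init__(self): self.count, self.start = 0, time.time()
    def mark(self, n=1): self.count += n
    def rate(self): return self.count / max(time.time() - self.start, 1e-9)

# Usage in application code: instrument explicitly where things happen.
personalized_users = Counter()
requests_meter = Meter()

def handle_request(user):
    requests_meter.mark()
    if user.get("recommendations"):
        personalized_users.inc()     # a business metric, not just infrastructure

handle_request({"recommendations": ["shoe"]})
print(personalized_users.value, round(requests_meter.rate(), 2))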
Similarly, for gauges: if you have a cache of products that you use for lookups, and you base certain recommendations on it, and you see those recommendations underperforming or overperforming, you can correlate directly (the gauge tells me the cached product count was too small or too large), which gives you an insight into why something is happening. Similarly, meters can be used to track how many orders are being placed every second. The next one is histograms. Histograms are particularly good if you want to measure the count of something, or the frequency of something, and how that changes over time. For example, if you're serving a response, how does the response size change? If something goes down, maybe a database query gets messed up, your response size drops drastically and you can easily tell. But we'll come to a better use case, which is: what quantile of my customers are getting five personal recommendations or fewer? Histograms let you measure quantiles: what is the mean number of recommendations you're serving, which is an average; what is the median; how many recommendations are the 99th percentile of your users getting? So it gives you a lot of insight. Timers are great: timers have a meter built right in and can measure how long something took, and at what rate. For example, you can measure page rendering time and correlate it with how many requests you were getting. Similarly, a great insight is: it took me 80 milliseconds to serve recommendations to 99 percent of my users when I had 200 requests per second, but the moment my requests became 800 per second I suddenly started taking 200 milliseconds. So there's a problem the timer is pointing out to you, and you can take action on it. Again, implementing these in code is fairly straightforward; depending on your language it's usually just one line of code. Now that we've collected all this data, we want to make sense of it, and first we want to store it somewhere. The choice of storage solutions has been increasing; luckily, all of these metrics are time-series data, so what makes sense for storing them is a time-series database. A great number of solutions were discussed this morning; one of them is Graphite, another is InfluxDB, and mostly all of them have an architecture similar to this one. You'll have some sort of back-end storage, where they use some sort of database, maybe SQLite, or maybe a different file format altogether. They'll have an API, which is your queryable interface; you may get SQL-like syntax, or you may have the ugly strings that Graphite provides. And then they all have a layer in between, which is your true metric storage, which takes the pain of compressing and also helps with queries. Of course, metrics arrive at a tremendous rate; there are just a lot of metrics that you gather, so it's not optimal to write them all to disk immediately. That's why these solutions usually have a cache built in, where the data goes first and is regularly flushed to disk, and when reading you fetch from both the cache and the disk, so you maintain a certain recency. Some characteristics of a good storage tool: it should have a good query DSL, and here I think InfluxDB did a good job.
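Here is a hedged sketch of the histogram and timer usage described above, again with the Java library; the metric names and the lookUpRecommendations placeholder are made up for illustration:

```java
import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class RecommendationMetrics {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Histogram recosPerUser = registry.histogram("recommendations.per-user");
    private static final Timer recoLatency = registry.timer("recommendations.latency");

    static int[] recommend(String userId) {
        // Time the whole recommendation call; the timer also has a meter built in,
        // so we get a rate (calls per second) alongside the duration quantiles.
        try (Timer.Context ignored = recoLatency.time()) {
            int[] recommendations = lookUpRecommendations(userId);
            // Record how many recommendations this user got, so we can later ask:
            // how many does the 99th percentile of users receive?
            recosPerUser.update(recommendations.length);
            return recommendations;
        }
    }

    static void report() {
        Snapshot latency = recoLatency.getSnapshot();
        System.out.printf("median=%.1fms p99=%.1fms rate=%.1f req/s%n",
                latency.getMedian() / 1_000_000.0,          // snapshot values are in nanoseconds
                latency.get99thPercentile() / 1_000_000.0,
                recoLatency.getOneMinuteRate());
        System.out.printf("p99 recommendations per user=%.0f%n",
                recosPerUser.getSnapshot().get99thPercentile());
    }

    // Placeholder for the real recommendation lookup.
    private static int[] lookUpRecommendations(String userId) {
        return new int[] {1, 2, 3};
    }

    public static void main(String[] args) {
        recommend("user-123");
        report();
    }
}
```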
They picked a SQL-like query language, which everybody understands; Graphite not so much, because they picked plain strings, and you have to apply mathematical functions inside them, so the query editor is pretty prone to mistakes. Well, take your pick. The tool should give you the ability to apply functions, because the same metric may give you different information depending on which function you apply. For example, a counter may give you the current count of open HTTP requests, but if you check the slope of that (I'm talking statistics here, but essentially the rate at which the counter is growing), you can easily get requests per second and see that I'm facing more traffic now than before. The solution needs to be scalable; again, metrics come at a tremendous rate, so whatever solution you have should scale. Graphite claims to work in a distributed setup; I didn't see it working that well, but maybe there are setups that work better. The results should be up to date, which is very important, otherwise your graphs will just be completely silly. And there's a great post to read about InfluxDB and how their journey has been: first they used LevelDB to store their metrics, then they moved to a different data structure altogether, and today they use the Time Structured Merge Tree. It's a great blog post and there's a link down there; hopefully you'll get it when we share the slides, or you can just check it on the InfluxDB website. Now that we've gathered all this data, the next phase is to visualize it, so that we can finally draw some insights out of it, and for this it's very important that I'm able to drill down into why certain things happened. Any visualization tool you use should allow you to drill down; it should be interactive, so I can read what the value of something was at that very instant and what happened at that point; it should have a good query language and an editor that's not as prone to errors, so it would be good if it used an easier syntax; and you should be able to correlate changes in metrics with certain events that happened. I think this was also discussed this morning when we saw the annotations feature in Grafana; similar features may be offered by other tools as well. Given that we need all of this, let me present to you the oscillator. This is a tool we wrote; it's a D3-based tool, and it's written in Clojure. You may like the language or you may not, but you sure can like the tool. It's very easy to implement a chart, it's very declarative, and the charts are highly interactive thanks to D3, which is a great library. The advantage you get is that you can define your charts in plain data. For example, you can have a simple map defining what your chart is called, what your page is called, and what charts it has; you define tiles, and you may have multiple charts on one page. We'll see a demo soon, but for each of these tiles, each of these charts, you can choose what kind of chart you want to show. In this case we're showing my user requests, and I'm pointing it to a Graphite target, because that's what we were using when we wrote this presentation. But what we also open-sourced was a Graphite DSL: a bunch of Clojure functions that allow you to work with Graphite query strings without having to actually write those strings by hand.
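The talk doesn't show how the collected metrics actually reach Graphite; one common route with the Java library is its metrics-graphite reporter module, so here is a minimal sketch of that route. The endpoint host, the prefix, and the reporting interval are assumptions, not details from the talk:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class GraphiteShipping {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        registry.counter("requests.incoming").inc();

        // Hypothetical Graphite endpoint; the plain-text protocol usually listens on 2003.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.internal", 2003));

        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("shop.web-1")                 // per-host/per-service prefix (assumed naming)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);

        // Flush the whole registry to Graphite every 10 seconds (batched, not per event).
        reporter.start(10, TimeUnit.SECONDS);

        // At query time you then apply functions on the stored series, e.g. the Graphite target
        //   perSecond(shop.web-1.requests.incoming.count)
        // to turn the raw counter into a requests-per-second rate, as described above.
    }
}
```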
This minimizes the chance of making an error. So, I have a small demo. This is a random generator I have, generating one type of request or another, and I'm plotting the count of these requests. As you can see, the chart is fairly interactive: at any given point in time you can compare what the value of one particular type of request was against another type. Moreover, you can drill down into the timelines: you can see what was happening in the past hour, or in, say, the past 24 hours. Well, it was an application running on my local system, which is why you see the breaks; hopefully your production system does not suffer the same. You can also have summarization: how has the trend behaved if you summarize over a 10-minute scale? It seems to be going down a bit. On the same oscillator you can have all your environments: your development environment, your pre-production, or whatever you define your environments to be, as many as you want, and you can hide one and only see another, however you like. You don't just have line charts either; you have all sorts of charts: bar charts, pie charts if you're into that, and maybe stacked charts make more sense. Another great feature is, again, annotations. These are events that significantly change something in your application and may lead to variations in these metrics, so it makes a lot of sense to have them on the same graph as the metrics visualization. For example, those dots down there point to what happened at the instant the graph changed, so it becomes easier to correlate: there was a code change, there was a deployment, there was some provisioning; you can easily see why the change happened based on that event. I had a few learnings from this whole flow. The first was to choose the right metric. The first thing when someone talks about monitoring is 'hey, let's just plot the CPU load', and I was left out of that conversation because I honestly did not know what goes into calculating CPU load. So I googled, and I found that the formula to compute CPU load is actually very complex, and you don't want your metrics to be that complex. I googled further to see what spikes CPU load, and I found a post that says if you're parsing text, oftentimes your CPU load will spike, which is odd: if my application needs to parse text, I will be parsing text, so what is a CPU load plot really telling me? So choose simpler metrics, metrics that point to exactly what's happening, for example disk utilization or user logins. The second learning was that plotting the mean value of something may be absurd at times. For example, we had a contract with a third party that we'd give them recommendations they could show, and the contract said we'd serve them in under 50 milliseconds. When we looked at our graph we were serving in under 20 milliseconds, and we were happy: that's pretty cool, we're well below 50% of the threshold, so that should be okay. Until they came back to us and said they were running into circuit breakers, that we were often going above the limit, and that's when we realized what we were seeing on the chart was actually the mean, an average value.
When we plotted the max instead, the maximum value in every summarization interval, for example the max value in every minute, we saw a graph something like this: we were easily peaking over 100 milliseconds every minute, and that's why the other team kept running into circuit breakers. So choose how granular you want to be, and don't pick the mean just because it sounds cool. Choose wisely, and choose the right tool for the job: histograms, for example, are particularly good if you want to measure the size or frequency of something and how that changes. The last phase of the monitoring setup is alerting, because we all want to go home and sleep at some point; we don't want to just keep glancing at the chart. When it came to alerting, again, being the team that we were, we didn't like any of the solutions out there and we wrote one ourselves. This tool is called xray. It's again a tool written in Clojure, in which you define your condition for the failure of an alert in a simple function: you have a function, and if it returns true, it's all cool; if it returns false, alert me. Then you can define what sort of alerting strategy you have: do you want one alert every five minutes, do you want to be alerted only if the condition is continuously breaking, or do you want an alert on each instance of the condition breaking? You can also define what sort of alert you want to receive: you can have Slack integration, because it's all code, and that's the advantage you get from an explicitly written-down solution; you could have Slack, emails, or text messages, though I don't think you want to go that far. In this example we're simply logging, and you can change it to whatever; it's pretty flexible in the end. Xray also gives you a nice little dashboard with three levels of views. This is the middle level; I've had to blur out the names because it was a live system and I can't share that, but essentially you have all your environments and how the checks are performing on each of them, and you can drill down into the ones that are red and see what exactly happened, what the expected value was, and what value we got. You also have a higher-level view which aggregates all of this and tells you whether the whole system is behaving okay or not. We gave these alerts very high importance: early in the morning, the first thing we did in our stand-ups was decide what was going to happen in the rest of the day based on these alerts. We added them to the definition of done of our stories, so any new feature you added would not be complete until you added the relevant alerts or changed existing alerts as required. We wanted to make them as visible as possible; in our case we were able to provision a nice little monitor, plus Slack notifications. But at the same time you don't want to spam anybody, so try to choose the right alerting strategy. The entire solution looks like this: you capture your metrics, you aggregate them, you store them in a nice time-series database, you visualize them using some sort of graphing tool, and finally you alert, choosing an alerting library for that. There are plenty of solutions that offer all four phases, the TICK stack and whatnot, so you can take your pick for each of those phases.
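The xray code itself isn't shown in the talk (and the tool is Clojure), so the following is purely a hypothetical Java sketch of the condition-as-a-function idea described above: an alert is a metric lookup plus a predicate, and the strategy and notification channel are separate, swappable pieces. The metric name, threshold, channel, and fetchLatestValue stub are all invented for illustration:

```java
import java.util.function.DoublePredicate;

public class AlertSketch {

    // An alert pairs a metric lookup with a predicate: true means "all good", false means "alert".
    record Alert(String metricName, DoublePredicate healthy, String notifyChannel) {}

    public static void main(String[] args) {
        // Hypothetical check: 99th-percentile recommendation latency must stay under 50 ms.
        Alert latencyAlert = new Alert(
                "recommendations.latency.p99",
                value -> value < 50.0,
                "#team-recommendations");   // e.g. a Slack channel

        double currentP99 = fetchLatestValue(latencyAlert.metricName());
        if (!latencyAlert.healthy().test(currentP99)) {
            // Strategy is up to you: alert on every breach, or only after N consecutive breaches, etc.
            System.out.printf("ALERT to %s: %s = %.1f breached its condition%n",
                    latencyAlert.notifyChannel(), latencyAlert.metricName(), currentP99);
        }
    }

    // Stand-in for querying the time-series store (Graphite, InfluxDB, ...).
    private static double fetchLatestValue(String metricName) {
        return 83.0; // pretend the store returned this
    }
}
```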
Just a last note on continuous delivery: as I mentioned, since the marginal change between deployments has become so small, it's become much easier to find out what broke something. For example, there was one instance where we noticed a graph shift drastically, and we realized we'd had a deployment, but since the previous deployment only 10 commits had gone live. If it's just 10 commits, it's very easy to find the culprit: roll back, apply one commit at a time, and see which commit broke things. So it's a matter of practice as well; no monitoring tool is going to let you do wonders without you following some sort of etiquette, and I think it's important to follow the etiquette of continuous delivery, especially in a retail shop where you constantly want to release features and don't want to hold back the teams. And a note on software that claims to monitor everything for you: it just seems like they claim to know your system better than you know your system, so I don't know why you would want to go for something like this, especially because, for the stacks we saw, there are like a hundred alternatives, all of them free. Why would you ever choose something that doesn't even know your code? The only way I can think of that they could do monitoring is by hooking into individual function calls, which could be too expensive both performance-wise and monetarily, because they charge an exorbitant price. A great monitoring setup is available for free, so there's really no need to go for such solutions. These are the references I talked about: the first one is a great talk, again, if you haven't seen it, 'Metrics, Metrics Everywhere'; the oscillator is available at that link, it's on GitHub; and xray, the alerting library, is at the next link. I also urge you all to follow the Otto dev blog, where they publish excellent things about how they're changing their architecture; they're one company that's keeping up with the times and doing great with all of that. That's all I had, thank you. Does anybody have any questions? Oh yeah, we fell into that trap as well; honestly, we got enough flak for it. Initially, when we built the solution, we were spamming everyone; in fact, we were spamming more than just our team, practically the whole organization, so we've faced that trouble. We left the choice fairly flexible, because you may want to categorize based on the priority of the alerts. Since it's all in code, it was a little easier for us: we could say, for this particular alert, if something happens I do want to be alerted, and for this other one, don't. On the other hand, you can also have summarizations: alert me if this thing happens continuously for five minutes, but if it's a one-off, don't. But really, I don't have a perfect answer for that; it's a grey area, and yes, we fell into the trap as well. I'd say we really did not set out to build a product that was different from Grafana; we think both are great products. Honestly, at the time we built the oscillator, Grafana wasn't as mature as it is now. That being said, we would probably still have done it anyway: we love writing explicit charts in code, and we like having more granular control in the language that we're using. I guess those are the reasons, and writing something yourself, although frowned upon, sometimes makes sense for larger enterprises, because it offers you the
flexibility that no prepackaged tool could offer you. Hello? Yes, hi. So, from the talk it looks like the metrics library you are using is meant to infer a lot of details from the code, rather than from a log file, and push them to a system. What happens is, in a web application, at an endpoint you interact with, say, ten other systems, collate the results, and finally present them to a user, so you have two places from which to infer the information: for example, my web application can infer how much time the recommendation engine took, so I could put the instrumentation at the web application level or at the recommendation engine, and later push it to the system. Which style did you follow, and is there any reason to choose putting everything at the web application level or at the final system level? We had metrics in both; we were collecting metrics from both. The rule of thumb we followed, which Coda Hale also mentions in his talk, is that every microservice should be capturing its own metrics, and you may have up to 30 to 40 metrics from each reasonably sized microservice; of course we often went above that, but I guess it's one of those separation-of-concerns things. The web application did capture these metrics for a third party that was not written by us, because we did not have control over them and we did want to measure how that third party was behaving, so that we could communicate effectively. But yes, we went ahead with the approach of every service having its own metrics in its own container. Thank you. Hello. Hi. This is about the 'collect only useful data' part: how do you choose which data is useful? I understand the intent behind that statement, but there's also this fear of missing out. Have you run into situations where you actually missed out on some useful data that might have led to, say, a new metric? Of course, obviously. But what I'm trying to say is: yes, you should capture nearly everything you think could be plotted on a chart, but definitely not something like 'what's the CPU load right now'. That's why I tied it to the end of the presentation as well: capture metrics that are simple and really offer you some insight into what's going on in your system. Of course we missed capturing useful insights at times; at times it was even hard to keep the charts updated, because there may have been a change in a feature after which the metric should also have changed accordingly but did not. Tests helped us in that situation. I guess that's not exactly your question, but yes, it's a grey area and we ran into that trouble. You can aspire to be as vigilant as possible: when you're building a new feature, anything that's simple enough to be captured and gives you some insight when plotted on a chart, just do it. But there's no black and white.
I had a follow-up question to that. You said that you captured business metrics as well as infrastructure metrics; would you say that looking at the business metrics daily is the only important thing, and that you only go into the infrastructure metrics if there's something wrong with the business metrics, not otherwise? Why would you say so? Essentially, for something like Redis, I want to keep track of how full Redis is, or, for example, whether I have expired too many keys too suddenly. Am I done? Sorry, okay. So yes, we were keeping track of all of these charts. But by the same logic, suppose you have Redis at, say, 90% fill capacity but your business is going on as usual; there's no problem, right? No: you may run out of capacity, and then your business will not be doing well. So the trigger should be the business-related metrics, that's kind of what you're saying? No; the entire intention is to not let the business metrics be affected in the first place. There may be business metrics telling you something is wrong, or there may be infrastructure metrics telling you that. Okay, people who want to ask questions can stay back; we'll have speakers Manan, Pooja, and Bernd to answer your questions on this stage, and the people who are not really interested in the Q&A session can proceed for lunch. So we're taking more questions, but combined for all three speakers; Pooja and Bernd, could you please move to the stage? Where does it happen? Sorry, what solution? Okay. Well, the first part of your question was how we get developers to be motivated to maintain the code they've instrumented just for capturing metrics. Essentially, this is a culture change. In this case we were the driving team; we were the team that went to the business and said, we think this makes more sense. I guess with Agile the entire point is that everybody is contributing and you don't work in silos. I do understand that there are, and always will be, teams which have a clear division, and that makes sense for them, but in this case it was not so. We chose to be explicit and maintain the code; we took the call that we would be happy to do this, and the benefit we got was flexibility, and that we don't have to monitor things we don't really care to monitor. I hope that's some sort of an answer; I'm happy to have this discussion afterwards as well. Hello, I'm Pawan here, from Goldman Sachs. Sorry, I can't hear you. I'm Pawan from Goldman Sachs. This is a specific question about process monitoring: I just want to know what was done at Otto, or in your space, for critical process monitoring, because, say, a process is down, your order rate could go down at any time, or a host is down. Can you put the microphone... did you get the question? To some extent; I think he's asking about critical process monitoring. Okay, so yes: in certain cases there's a host outage and you have a barrage of alerts about processes being down and you have to deal with that; was there any intelligent solution that was adopted? I guess it's not really related to this talk, but we essentially implemented things like circuit breakers, so that even if some third party or some critical process is down, we still fail gracefully. Though some would argue that if it's such a critical process, how graceful can you really be? To be honest, I was not able to get the question, I'm sorry. A critical process goes down, some sort of outage; how do you handle that?
Your entire order processing is down; what sort of thing can you do to manage or circumvent the problem? So, in my opinion, that process is part of an application. You can simply monitor whether the process exists, but that alone doesn't help; at the end it should have an effect on a service further up, and then, getting to the root cause, you figure it out. If you think about the first talk, going top-down, thinking about your business and the impact, and then seeing which application is responsible, then you get to the process. Also, a process being present in the process list doesn't mean the process itself is working. Process monitoring itself could be easy, as in 'is the process alive', but how many of them are allowed to be alive, and how much memory is it consuming? It's probably hard to answer in a single sentence; it can be a complex topic. It is, I certainly think so. Okay, I'll repeat: you're saying that it could be expensive to put... so, if I understand your question correctly, what he's asking is: instrumenting these metrics in code, what if the code itself is unstable in some way, so that the metrics would not be gathered? In terms of collecting metrics, imagine a web application like yours: you're collecting metrics through a request-response life cycle; that's going to be expensive, so you would rather push them asynchronously to another system. Okay, so: metrics does not push events directly; it batches, and it won't push every event. So then your setup is quite stateful, right? Because... oh yeah, so there's a chance that if your application goes down you lose certain metrics; yes, that is true. The alternative strategy would be to push one metric at a time, which I'm sure would not work well, because metrics come at a very high rate and it's better to batch them. The second question on that: every time you want to collect newer metrics (let's assume for a moment you have a lot of data), you're going to instrument your code every single time to collect those newer metrics, as opposed to collecting them extraneously in some fashion? I mean, just a question. Yeah; we treated it as part of developing the feature. We did not think of it as 'I want to collect metrics from a feature that is already there'; we said every feature that gets developed has the metric-collection part in place, and that is part of the story, part of the feature. I was just trying to understand philosophically what the thought process was behind doing this. We just favoured flexibility: if we don't want to collect certain metrics, we choose not to collect them. There's also an aspect of testing; it could be frowned upon, but we tested what metrics we were gathering as well, and you can do that because they're in code. Well, let me... I guess I did not say it's not important; I said that you won't get something meaningful out of it. If you see a change in CPU usage, yes, it's important; I'm just saying that if your CPU load drops, what is that really an indication of? As I understand it, CPU load is not a function of one thing; there are multiple things that can bring CPU load down or up, so it going up or down may be caused by five things, and you're not pointed to any one of those five things.
That's why we chose to go for simpler metrics, which point to only one thing going up or down. CPU load is certainly not unimportant; it's just not easily or intuitively trackable. Next question: we have systems in production and we try to maintain them so that they don't get pushed to failure. Without pushing them to failure, how do I know what sort of capacity I can get out of them? And is there any system available that would help me identify bottlenecks in my whole architecture? Because not everything scales linearly: maybe right now I can see that everything can go to X, but one part of the whole stack will fail much before that. Is there some way I can have dynamic bottleneck identification? Starting with the last question: are you talking about multiple services making up an application in combination, having a rule for them, meaning that a service consists of others, or are you talking about the user-story question we had before? Even a very simple web application would have everything from the load balancer to the application code to the caches to the databases. Okay, one good way to do it, if you use a metrics database, for example InfluxDB or Graphite, is to put Grafana on top and then add different sources from different services. You can create a Grafana panel and add, let's say, the load from your database server, the load from your application server, and the load from the web server, and then you can see how a request impacts your infrastructure across the whole stack. It's a simple idea, but you bring all the metrics together at the same time, which is the advantage of such a setup: you can combine different things. And adding to his answer: perhaps load monitoring and process monitoring then make sense in correlation with an annotation, for example, so you can see that we changed something with Puppet and then my load is increasing. Then it's valuable information, because I have a different load profile; without any information it doesn't help you, because you just see, okay, we have more load, but what now, shut down, reboot? With the annotation you see that some code changed; you see a commit message, for example, depending on your continuous delivery workflow, and you immediately see what happened to a specific part of the infrastructure. You can see that a user request at 10 o'clock was going well, and at 11 o'clock, after deploying some new software, you see an increase in load at the application server after a request. Combining these metrics at a specific time, you can see that some performance profile changed, and that, I think, could be a way to solve it. Okay, a follow-up question: what about finding the current bottlenecks? Say the whole structure keeps changing dynamically and the code keeps changing; maybe I'm currently making one database call and then I start making three different database calls. Over the whole development cycle, can I have an automated system that tells me, in the current scenario, this is the peak load you are able to serve with your current setup? It's kind of a baselining, where the monitoring system learns what the load typically is at 5 o'clock in the morning... I just want the peak capacity of my system. Okay, so you would like to know what the maximum number of user requests is, and not on a cloud. Yeah, that depends on so many factors: it depends on whether your system scales in a linear way, for example, or whether garbage collection kills you in some way, so it scales and then it breaks down.
It really depends on the development language and so many other effects. Sometimes databases scale very well in a linear way, and at some point they are simply not able to keep up with the IOPS. To really promise you can handle a specific amount, testing helps, of course; there are thousands of testing frameworks, of which I probably know only 2%. But I would say it's not possible to state that your infrastructure can handle 3 or 4 times the users you have right now, unless you know exactly that every system involved in that process scales linearly. So I don't have an answer for that; perhaps you have? Sorry... yeah, I think that's as good an answer as I can give as well. Hello. Oh, thanks Manan, great talk, thanks for that. Quick question: I didn't know about metrics, and I'm going to look it up, but I keep hearing about tools like Zipkin and all that. Do they play in the same space, are they comparable? And maybe Bernd, you can also weigh in, because I'm confused about the whole aspect of, you know, monitoring versus application life cycle monitoring or something, so just break that down for me, thanks. So I'll start. Perhaps someone can help with Zipkin, which you mentioned; I'm honestly not aware of it. But there are alternatives to metrics: metrics follows more of a push-events sort of flow, whereas tools like Prometheus have a pull model, where your Prometheus server knows where your applications are running, across all servers, and pulls the data from them. But I don't know about that particular tool you mentioned; I'm not aware of it, sorry. I think these are a class of tools that allow you to trace a transaction across multiple systems; it's more of a...
Zipkin? Zipkin, yes. Well, metrics is more for one instance of... yeah. I'll ask a question to Pooja quickly; it's probably not related to monitoring, but if I understood the whole bot approach that you talked about correctly, you had to rely on regexes a lot to understand what the user has typed in, right? Had you by any chance looked at natural language processing, at things like api.ai, that might make it easier? Is that planned? That's the question, thanks. It's my question too; even I had thought initially that we should have some kind of natural language processing algorithm, but the initial idea was not to go into that much detail. The initial idea was to have control over the code that was messing things up, and that's the problem we focused on solving first. It isn't planned, at least no immediate plans, because there are only a handful of questions which are repeatedly asked, for example 'when is my release', 'what is in the release', 'which machine actually replies'; they're countable, so it would be overhead to do it at the moment. My priority is continuous integration first, where we can actually stop the pull-request merge button itself from going ahead; that's the immediate plan. NLP is very interesting for me as well, so I'll probably have a look at it, thank you. Hello, Manan. You said don't use an arbitrary measure like the mean, so I'll just divert here a bit. Whenever you're doing machine learning, you build models and try a couple of algorithms before you figure out which one best suits your needs or purposes. If you look at most of what we are doing here, it is collecting data and applying certain statistical analysis to that data. So do you have some heuristic method where you apply a statistical model first and see whether it works for you, whether it is actually getting the patterns out of the data that you want? Like 'step 1: use statistical model A; if not, use statistical model B', some kind of flow chart, if you could share it with us, if you're using one? Any of you could answer this. I'll try to. Essentially, there's a stream of DevOps called anomaly detection which makes heavy use of statistics. Applying models here is a fairly straightforward use case; there isn't much derivation happening. I'm not aware of many statistical models being applied, at least not standard ones; of course, people might be coming up with ad hoc statistics on top of something. But there is a stream that's getting a lot of traction, anomaly detection, where you learn how one particular statistic has behaved and you make assumptions about whether to alert or not when it happens again, because you can now correlate that this tends to happen at this time of this day of this month, something like that. That's the best answer I can give; maybe one of you can add to it. I have no answer, but I know somebody who knows it, and his name is Avishai Ish-Shalom; you can find a lot of his talks on YouTube. I think he studied math or something, and he gives crazy talks about math and metrics. I can give you the name afterwards; I don't know how to write it, I'll have to look it up, Avishai Ish-Shalom or something, and I'll give it to you, they'll share it with you, we're on Twitter, whatever. His talks are really a good point to start thinking about stats and metrics and understanding them very well; perhaps a good point to start.
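Neither speaker describes a concrete statistical model, so the following is only a toy sketch of the anomaly-detection idea mentioned above: learn how a metric normally behaves over a recent window, then flag values that deviate by more than a few standard deviations. The window size, threshold, and sample data are arbitrary assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ZScoreAnomalyDetector {
    private final int window;            // how many recent samples define "normal"
    private final double threshold;      // how many standard deviations count as anomalous
    private final Deque<Double> samples = new ArrayDeque<>();

    ZScoreAnomalyDetector(int window, double threshold) {
        this.window = window;
        this.threshold = threshold;
    }

    /** Returns true if the new value deviates strongly from the recent history. */
    boolean isAnomaly(double value) {
        if (samples.size() >= window) {
            samples.removeFirst();
        }
        boolean anomalous = false;
        if (samples.size() >= 2) {
            double mean = samples.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = samples.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean))
                    .average().orElse(0);
            double stdDev = Math.sqrt(variance);
            anomalous = stdDev > 0 && Math.abs(value - mean) / stdDev > threshold;
        }
        samples.addLast(value);
        return anomalous;
    }

    public static void main(String[] args) {
        ZScoreAnomalyDetector detector = new ZScoreAnomalyDetector(60, 3.0);
        double[] ordersPerMinute = {10, 11, 9, 10, 12, 10, 11, 2};   // last sample drops sharply
        for (double v : ordersPerMinute) {
            System.out.printf("%5.1f -> %s%n", v, detector.isAnomaly(v) ? "ANOMALY" : "ok");
        }
    }
}
```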
Okay, I'd just like to remind everyone that we're happy to take more questions if there are any, but this is eating into the lunch hour; lunch finishes at 2 pm and the talks resume here at 2:05 pm sharp, and we're going to try not to be late, so just to remind you. Now, that was it... oh, food break, I'm out. I saw your face, like you're getting nervous; story of my life. Is your mic on? It's hard to... okay, can you move the mic a little? So you have 100 nodes instead of 50? No, I got the question: you want to scale. Sure, it's possible; I think that is what providers do, they look at the traffic and scale accordingly. Of course, I have no experience doing this, and it wasn't the subject of my talk, but the tools that we wrote made use of monitoring data and alerted or didn't alert based on that, so instead of alerting you could trigger something else; I'm assuming that should be possible. But maybe one of you can... I'm sorry, your mic is off. Now it is... like this, okay. So my question is more about distributed monitoring. I have a system of tools which logically comprises one tool: my tool talks to a lot of other systems before it can serve a response, a general distributed architecture, and these systems interact over different protocols. It's not only HTTP; sometimes it may be RPC, it may be plain TCP, so I cannot really rely on HTTP headers to see how the data is flowing. Right now the solution being implemented to correlate how a request is served by different systems at a given point in time is to use overlapping groups, group names: we tag systems with a certain name, say this is product P1, this is P1's ActiveMQ, this is P1's Redis, this is P1's database, this is P1's MySQL, and we plot a graph, or we want to correlate a graph, and see how all the systems related to P1 are behaving. In that case we are bound to take care of assigning correct tags at every point in time. So is there a way I can trace a request, that this request flows from system A to B to C to D, across different protocols, and then have a holistic view without using tags, so that there's no manual dependency? Systems like APMs also do this for me, but they are mostly single-platform, say only Java-based, so if I go from Java to something different, say a console application or a Redis which is written in a different language, those systems get blacked out, and it's also very costly to put an APM into every system to monitor it. So what do you think: is grouping, overlapping groups, the only solution, or can we do something else about this? I don't think there is a tool for that; the only way I know you can do it is to enrich your code base and assign kind of a session ID, a random number or a serial number. I know Netflix is doing it that way, tracking their SMS and recommendation engine; they gave a talk at Elastic{ON} about how they trace a specific request through the infrastructure. It means assigning a serial number to every log entry they produce, manually or out of the system, and then a system outside takes all these serial numbers and weaves them together. But the problem with this approach is that we have different protocols to talk over: say, for example, the serial number is in my HTTP header, and now from HTTP I'm going to TCP, so I'm losing that serial number, because I do not know how to propagate it along with a different protocol altogether.
Assume a JMS system, for example: I have an ActiveMQ and it has its own protocol. I think if you don't enrich your protocol, how do you transfer the session ID? If you don't enrich the data payload that is transferred with that kind of a marker, it's hard to identify. I have no better idea; you seem to have already done a bit of research on this. Yeah, I would love to pick your brain on this later; I have no better solution for it. That's it, guys; I'll see you at lunch. Okay, thanks everyone for your questions, we'll close the session; enjoy lunch now, thanks, and see you again at 2:05 sharp. So, the features required by your application are met by your operating system. Sometimes that works out, sometimes it doesn't; if you decouple your application from your operating system, you have more flexibility. Everybody understands that, right? So, about me: my name is George Clifford Williams, and you can email me at GCW at 8ions.com. The G in my name stands for George; I loathe the name George, so you can call me that if you like, but I probably won't respond. I go by Cliff. I live in Chicago, married, no kids, two dogs, and I love spoiling my nieces and nephews. I'm pragmatically agnostic, which means that when it's time to discuss the merits of something, I will gladly say what I think is the best solution, but when it comes to actually getting work done, it's not time for deciding anymore; it's just time to get work done. In my day job I spend a lot of time consulting with clients to put things in the cloud and to help them develop better CI/CD pipelines; I do all the things that get labelled as DevOps, and I try to automate as much as possible. So what I talk about here is all from the perspective of someone who digs deeply into delivering packages to the cloud. Understanding the problem: if you are a developer, or if you support developers by pushing out applications, typically someone develops code and they find an operating system to deploy it on, everything works out perfectly, and there are never any problems... in Dreamland. This view of your application is not really tied to the real world. A slightly more realistic view is that you have a kernel that your application will need services from; on top of that you have a library of some sort that interfaces with the system calls; you have your userland utilities, and then packages that get installed, and then finally your application code gets deployed. This, by the way, applies to POSIX-y type systems: Linux, Solaris, FreeBSD, that sort. If you are on something that is not POSIX-y, then sorry, I'd say it doesn't really apply to you. So this is the way it really looks, as compared to the loose idea of how your application deployment might work. So you get that all set up, you write your code, you deploy your application on your operating system: what happens when you need to upgrade? Let's say you upgrade your operating system. Can someone name an operating system or distribution that they use? Just call one out. FreeBSD. So let's say you deploy your application on FreeBSD and then you need to do an upgrade of your base operating system. Hopefully everything goes well, but if it doesn't, your application may break in interesting ways, and when that happens you might have to rework your code or back out the upgrade. Now, why would it break? There could be conflicts with versions of libraries, or there could be a security fix that forces a change, or some tool gets deprecated, which is pretty common.
So what if your application isn't affected by an upgrade, but is affected by needing something different: maybe a tool that conflicts with another tool, or a newer version of something that's already on your system? There's a chance that you could use a private repository. On FreeBSD systems it's pretty easy to use the packaging system to say 'go fetch this set of packages from another place'; similarly, on Red Hat or Ubuntu you can point to other repositories, add that into the mix of repositories you use to deploy your application, and hopefully everything will be fine. If it isn't, you could download and compile something yourself. In the enterprise, Java is frequently something people need to download like this, because specific versions come with most enterprise operating systems or distributions and your application may require a different version. Downloading and compiling things yourself actually used to be the common way people deployed applications; now it's rare. Alternatively, if your operating system has a maintainer for a particular package, you can reach out to them and say 'hey, I need a new version of this'; hopefully they'll get back to you, but sometimes they don't. You might be needing something that conflicts: for example, SSL. There are many SSL libraries out there, everybody is starting to be more security-conscious in the way they develop applications, so TLS and SSL are pretty common, and your application may require a different version of a TLS library than what comes with your operating system, which can cause problems. One way to deal with that is with a chroot jail or containers (Linux containers, jails, Solaris zones), or environment management, something like virtualenv on Python or, if you're running a Java app, setting JAVA_HOME: the kind of thing that lets you set a narrow focus for your application to find the tool that it needs. Alternatively, you can just wait and hope that whatever tool you need becomes the prevalent tool for your operating system. A consideration in all of this is how you're actually deploying your code. It's not uncommon for people to write their code, tar it up, and ship it over the wire with SSH or something, or put it on a Git server, go out to the machine, pull it down, and consider that a deployment. Some people package their apps in the native format for their operating system, so on FreeBSD that would be a port, or a deb on Debian and Ubuntu systems. You could package it in the runtime for the language you're building in: if it's a JavaScript app, maybe it's an npm package, or a Ruby gem, a LuaRock, a Python package, and so on. You can also do a makefile (Maven and Ant are really just fancier make systems), so you can specify where files should go, ship that along with your tarball, and hope for the best. Then there are smarter tools for doing all of the above: Puppet, Chef, Ansible, Salt, and similarly BladeLogic. They all do configuration management, basically taking a set of files from here and putting them over there, and it's better than doing things by hand, but when it comes to deploying applications it's not the best solution. So, the problem statement, what I'm here to talk about and try to solve: when you build your applications on top of the facilities provided by your operating system, you could be locking yourself into an ecosystem that does not meet the needs of your application and/or customers. And the solution is to build your applications to be independent of the underlying operating system and its packages. What that would look like in
practice is: you have your kernel again, your library that interfaces with the system calls, your userland utilities, and alongside your system packages you have a set of packages that meet all the dependencies for your application, and then your application goes on top of those dependencies. This can be duplicated, so that you can have multiple instances of dependencies for your applications and deploy your applications on specific stacks just for those particular applications. Some of the ways this helps with application delivery, release management, and so on: you get fully autonomous applications, meaning that you can upgrade your operating system and its packages without worrying about breaking dependencies for your applications. If you're on Ubuntu and you have to do an upgrade because of some security patch, you don't have to worry about going back and reworking code to work with the new dependencies or the new stack. You can create multiple application silos that contain conflicting libraries and tools: if one application requires a particular version of LibreSSL and something else requires OpenSSL, you can install those right alongside each other and they will not conflict, because of the isolation you get from autonomous application setups. Your deployments can be standardized across multiple operating systems. A lot of the clients I've worked with have a mandate not to have any one distribution or operating system make up more than 50% of the deployment; that frequently means working with SLES and Red Hat and FreeBSD and Ubuntu and HP-UX and Solaris, in some combination, all at once. Taking the approach of building your dependencies separately from your operating system means you can deploy an application on any operating system with one set of configuration and not have to worry about special cases and conflicting file paths. You can isolate your exposure to security flaws in underlying libraries; that goes back, again, to some update that needs to happen: it can happen either in your application or in the operating system, so if your application uses a particular version of a library and you discover a vulnerability, you can fix that in your application and not worry about it causing a problem with your operating system. The features of your application can develop at your pace, not the pace of your operating system's package maintainer, so if you need new features you can go out and get them implemented in your code and not have to wait for the next version of your OS. And you still have access to all of your system packages. What this approach really does is let you decide where the delineation is between the platform on your network and the application and its dependencies, and you can swap out the application or swap out the platform; it really doesn't matter, because they're completely orthogonal to each other, and one doesn't depend on the other. So, I think that sounds great; let's presume that all of you are on board and want to get started. There are several frameworks that come out of the box with the ability to let you do this on a POSIX-y system (again, I mean Linux, HP-UX, Solaris, something like that). There's pkgsrc, 'package source', which is part of the NetBSD project; it's very similar to FreeBSD ports or any of the other BSDs' ports systems. There's OpenPKG, which was started by a former FreeBSD contributor named Ralf S.
Engelschall. He used to be a security officer, and he was also a founder of OpenSSL. And there's the Nix package manager, which is currently mostly Linux-specific and driven by the NixOS operating system. I happen to prefer pkgsrc. It has more than 17,000 packages available, including all the big things people need in most deployments: Nginx, Varnish, databases like Postgres, just about everything you would need. It offers you the choice of binary or source builds: on systems like Gentoo, FreeBSD, or OpenBSD, by default, if you want to install a package you frequently navigate to some file path, type make, wait for a while, get your package, type make install, and you're off and running. If you don't have time for that, you can do binary builds of your packages, pre-stage them, and use something simple to install them, analogous to a yum install. It's really easy to set up, and the multiple prefixes are what allow you to set up the application silos. A quick example: I had a client who had several Ruby on Rails applications they needed to install, and many of them required different versions of Ruby to run. They were all deployed on one machine using pkgsrc, with a different prefix defined for every application that needed to run; a prefix is basically a path for the dependencies, and you can have as many as you need. It's a simple, straightforward process to package your own applications. When I was talking before about how you might deploy your applications using a tarball or a makefile: if you package your applications in the pkgsrc format, you can have something like Salt, Ansible, Chef, Puppet, etc. go out to a box and install your application as a package, which also lets you roll back seamlessly. It's easy to fork the repository and add dependencies that you need: when you check out pkgsrc, it comes with those 17,000-plus packages, but if there's something you want that's not in there, you can add it very easily. And it has unprivileged operation, meaning that you can deploy pkgsrc for a given user and let that user manage their own dependencies for an application stack, and you can do that for multiple users and they'll never conflict. Pkgsrc is also portable; it gets its whole own slide for portability. Has anybody in here run... show of hands, has anyone run more than two of these? So, three people. Daily, I tend to work with at least three of these, and having pkgsrc set up for autonomous deployments makes that natural. Okay, so what does it look like? This is what it takes to get going with a very basic installation of pkgsrc. By that I mean you pull the source code down (that's what the git clone does), you then change into the directory that's created, go into bootstrap, and run the bootstrap script, and then you're all set to go. Then you can cd into an application path, pkgsrc/devel/memcached here, and type make install clean. What that will do is build the package, install the package, and then clean up all the files that were created as a result. You do this for every single part of your application stack; the idea, again, is to decide what you need for a system to be on your network, then decide what you need for your application to be able to be installed, and treat the two as completely separate. And here, with pkgsrc devel/memcached, when you do the install it will install into the prefix that's defined in the environment, so if I needed memcached installed
five different times, I could re-run this command five times with a different prefix and have five separate installations that I could then run in parallel. So, can someone name a runtime or language that they use to deploy applications, something like Python, JavaScript? You use Capistrano for actually doing the deployments? Okay, are you deploying Ruby apps? Okay, so do you ever have problems with gem versions? With a setup like this, you would do the bootstrap, go in and build Ruby, then build Capistrano, or on another system build Capistrano to do the actual deployments. For those who don't know, Capistrano is simply a tool that lets you use Ruby code to specify a set of things to happen on a remote server; similar in some respects to tools like Ansible, except that it's imperative: you actually have to spell out every step, very much like a shell script. So you would build the Ruby your application depends on; all of the gems that come along with it would also be part of that dependency bundle, and you would have a definition in Capistrano saying 'this is what's required for my application', including Ruby and every single gem, and then deploy that. When you needed to upgrade, you would simply redefine the versions of the packages required, rerun your Capistrano script, and you'd have an upgrade that doesn't care what version of Ruby is required by some utility on your system. I'm trying to think of a good Ruby utility that's commonly installed, and I can't, but in a situation like this your application is completely safe from any changes that happen to the operating system's packages. I ended up running through my slides pretty quickly, so I will now take questions if anybody has any. I just wanted to ask how this stacks up against me just pushing my application into a container: why would I not go with a Docker container, put all my dependencies in it, and ship it everywhere, instead of using one of the tools that you just mentioned? Great question, and thank you for asking it. So, how does this compare to containers? Basically, containers, if you're talking about Docker containers or LXC-type containers, exist on only one platform: Linux. There are different distributions, you can have Red Hat or Ubuntu or whatever, but it's still one platform. The clients I work with can't have just one platform, so they need a solution that works on multiple things. And because I'm known as a strong advocate for BSD systems, sometimes people think that I'm maligning other platforms, and I'm not; the fact of the matter is that containers frequently hang in ominous ways, and you don't have to worry about weird unrecoverable hangs with a system like this. It's very lightweight and straightforward. If you need network-level isolation, then it gets a little bit tricky, so containers may work for you there, or if you're in an environment where you have access to more robust container systems like zones on Solaris or jails on FreeBSD. Then there's kind of an administrative-overhead tradeoff: would you rather manage containers, or manage just the application dependencies? So that's basically the difference, thanks. So my question is regarding, let's say, when I upgrade my OS and it breaks things: suppose the TLS package has been updated, but then my code and my web server and everything all of a sudden start breaking. Now, if I use pkgsrc and install the previous version, then what's the point of upgrading or having the
So my question is regarding, let's say for example, when I upgrade my OS: suppose the TLS package has been updated, but then my code and my web server and everything is all of a sudden breaking. Now if I use pkgsrc and install the previous version, then what's the point of sort of upgrading or having the newest security update? Is this more of a temporary fix that we do so that things stay in production, and then we try to make things more compatible with the newer TLS libraries, or something like that? I'm sorry, I lost you at the end there. You said, for example, if the OS has come out with a new TLS library, and if we use pkgsrc to have the previous TLS library, which is more compatible with my web server for now, then I can still have everything in production and nothing would break; now in that case, what's the point of upgrading the OS in the first place? Is pkgsrc only a temporary measure, or what do you recommend? Okay, so if I understand your question correctly, you're saying that if I can handle all of my application dependencies without touching the operating system, what's the point in ever upgrading the operating system? Well, no, I won't upgrade the operating system so that I can have the latest TLS package, right? So if I'm going to go backward-compatible and install the previous TLS package using pkgsrc, then what's the point? Is this a temporary measure, like we use pkgsrc as a temporary measure so that everything is working fine, and once we resolve the issues then we just knock it off completely? Got it, okay. So pkgsrc in this case would not be a temporary measure; it's the way you build your application stack, right? From whatever layer forward, and that's something that needs to be decided by a dev or engineering team, saying that, okay, we have Java, let's say it's OpenJDK 1.8; everything from that point forward is part of our application stack. You may need to do rollbacks, but that's related to your application, not to your operating system. You would need to start thinking about this division between the platform and the app, you know, what needs to happen in one and what doesn't need to happen in the other, and it's not that you are using one as a short-term fix; it's just the way that things are from that point forward. Does that answer your question? Yeah, it sort of does, but again, what happens is that each application has hundreds of dependencies. Let's say, for example, I have a Node application; it has bcrypt, and bcrypt relies on these other packages which come along with it, right? And then if those get upgraded, everything starts to break, so I don't just have to fix my code, I'll probably have to dig into my dependency, that is the bcrypt library, and then modify it so that it picks the right version of the package that it needs. So yeah, of course it answers the question, but it's a lot of workaround as well. So you're talking about the administrative overhead of having pkgsrc packages? Yes. So that's a concern, just as it is with containers, right? So if you have a containerized system; which, by the way, I'm sorry, I should have mentioned before: with containers there's one additional consideration, in that many people get prepackaged containers, and there were recent audits of many prepackaged containers, and something like seventy-odd percent of them had really egregious security flaws in them. I mean, not like an app had a buffer overflow in it, but bad configuration that a systems engineer would never do. But yes, you then have at least two sets of packages to maintain, right? Your operating system's packages and the packages related to your application. But again, it's really easy to manage pkgsrc, because with pkgsrc you have things coming upstream from the NetBSD project.
You also have many other contributors to it: Joyent, who happen to run a huge cloud platform, are active contributors to the pkgsrc project, so it gets a lot of updates, you know, fairly quickly. And if you need to go out and do something on your own, it's really easy to do. But yes, you're right, there is administrative overhead there; I think it's worth the tradeoff. It may not be if you're only ever on one platform. I'm sorry, I lost you; oh, resource isolation. Yeah, pkgsrc does not do that. It's simply a way for you to install dependencies for your application, and anything that you need beyond that, you have to have either a userland facility to do it, like something that comes with daemontools or runit, or use an operating system facility like Capsicum or jails or zones or some other thing like that. Great question. So the question, to restate it, is: how do you deal with building multiple packages, and in particular the conflicts that can happen with tools like SSL or others, where your application may require one version and another application requires another, and frequently they don't play together nicely; can pkgsrc handle that? The short answer is yes, absolutely, very easily, in 99% of the cases. There are some edge cases where simply having filesystem and path isolation for your libraries or your applications isn't enough, and in those cases you might need to go to a container solution like Solaris zones or FreeBSD jails or Linux containers. But in most cases, with something like SSL, the version that's installed by your operating system is usually not an issue for anything that's an application installed separately from the packages for your operating system. So if you install pkgsrc, you install LibreSSL or OpenSSL or BoringSSL or whatever, and you install your application in the same prefix, it will use those libraries and not conflict at all with the system SSL libraries. The same thing goes for 99% of the other libraries out there. There was a problem a while ago with Postgres, but that's been resolved, and I'm not aware of any others that exist. So the question was, does pkgsrc support reproducible builds? Yes, for most of the packages. I'm not sure that they've gotten through all of them, but reproducible builds has been a goal for pkgsrc for a while and they've made really good progress on it; I know that on NetBSD and illumos, and I think FreeBSD, more than half of them can compile with reproducible builds. Which brings me to building multiple packages in general. If you're going to build a repository for your packages so that you can do a very simple install, similar to an apt install or a yum install, there is a distributed package builder that comes along with pkgsrc; actually there are three of them. There's one that's the official bulk build tool, which will build every package that you specify in an isolated environment, and you can tweak the parameters for that, so if you need reproducible builds, that's where you would go and do it. Tell it the list of packages that you want to have built and it will kick off a distributed build (even if distributed means local to one host using multiple cores), and then it will produce an index file that your package manager can use. The package manager is called pkgin, and you can install from there without needing to compile every single time. Any other questions?
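For what it's worth, consuming such a repository looks roughly like this; the repository URL is made up, and the config path assumes the default /usr/pkg prefix:

# Point pkgin at a repository of binary packages produced by a bulk
# build, then install without compiling anything locally.
echo "http://pkgs.example.internal/packages/All" \
    >> /usr/pkg/etc/pkgin/repositories.conf
pkgin update
pkgin install nginx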
Thank you very much. Just a reminder: Cliff is giving a workshop this Sunday, what is it, SaltStack and Ansible? I think there are still some seats open, so if you're interested in attending, please have a look on the rootconf website and you can sign up. Okay, I have a little bit of an announcement about this evening: there will be a party tonight, I guess you could call it a party, and there will be transportation provided from here to the party venue. The first shuttle will leave at 6.10pm and they will leave every 20 minutes after that, so people should gather near the registration desk to queue for the transport. Just so that you know, you will need your badge to get into the party tonight, and you have to be careful because you also need the same badge again the next day to get back into the conference, so don't lose it. Oh, also, today at 3.30pm, just another reminder, there will be a slacklining and yoga session in the lawn, if anyone needs to shift gears away from brain and tech conference and into your body; that's available as well. Okay, our next speaker is from Kerala, from a city that I'm not sure I can pronounce properly. Hmm, Alappuzha? Alappuzha, okay, close; the Venice of the East. So please welcome Ranjith Rajaram, who will be talking about what should be process ID number one in a container, a topic I'm sure will generate some controversy. So yes, welcome, Ranjith. Hi guys, my name is Ranjith, I work for Red Hat and I basically do product support. My session title is actually: what should be the PID 1 in a container? I hope that after 15 minutes I actually don't leave you with more questions, but with some answers which you can take forward and build on top of. Okay, so please raise your hands if you have used Docker containers, or if you are using them for production, or if you have at least tested them or checked that containers are in place. Most of you. So this is what I have for you; this is going to be very short. I will be using the first few slides to set up my main point and then we will have a demo. I will be using a demo to actually show you what problem you could face, and why it is important for you to take control of the PID 1. We will be spending a lot of time on the demo and we will be switching between the demo and the slides quite often. For this session I will be making use of Docker containers, since that is the most popular right now. So, how do you control which process becomes the PID 1 in a container? When you do a docker run or a docker start, how do you control which process becomes the PID 1? In a Dockerfile we actually have two different ways: one is the ENTRYPOINT, the other is the CMD. I will just use only the CMD option. So how do you decide which process becomes the PID 1? You mention the process which you want to become the PID 1 with the CMD directive. There are three different ways you can use this particular directive, and this is the most popular one, where you mention the application in the CMD directive. This is just a ps output from two different containers that are running on the system, and if you look at it: this is my first container, where I am starting the application directly, and here you can see that application is actually the PID 1; the second container is a sample output where bash is the PID 1 and top is actually a child of the PID 1. So now, with the help of the PID namespace, the processes running in multiple containers can have the same PID number, but when you look at it from outside the container, from a host perspective, each process has a unique PID number.
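A rough sketch of the two containers used in the demo; the image names, file names and base image are made up for illustration:

# Container 1: the application itself is the CMD, so it becomes PID 1.
cat > Dockerfile.direct <<'EOF'
FROM python:2.7
COPY app.py /app.py
CMD ["python", "/app.py"]
EOF

# Container 2: bash runs a wrapper script, so bash becomes PID 1 and
# the application runs as its child.
cat > start.sh <<'EOF'
#!/bin/bash
python /app.py
EOF

cat > Dockerfile.wrapped <<'EOF'
FROM python:2.7
COPY app.py start.sh /
CMD ["bash", "/start.sh"]
EOF

docker build -f Dockerfile.direct -t demo-direct .
docker build -f Dockerfile.wrapped -t demo-wrapped .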
So now let's take a look at the demo. For the demo I am going to use a simple Python application. This particular Python application has some known issues; this example is just taken to highlight the problem. I am going to use two containers for this, and the only difference between them is that for container one I am calling my Python application directly, and for the second container I am calling a shell script, and within the shell script I am calling the same Python application. I am just going to change which process becomes the PID 1, right? So now let's take a look at the demo. On the extreme right side you can see some lines being printed out; I will tell you in a moment why I have done that, and in the demo I will show you the two different containers. This is my first container, where I am calling the Python application directly, and I am just going to start that particular container. On my second tab, on the same system, I am going to start my second container, where I am calling the bash script first, and then within the bash script I am calling the Python application. So this is my second container, and within the bash script I am calling the Python application. Now let's try to access the application. We will use a simple curl command. You can see that I am able to connect to the container and I am getting an output, which is the hello world. I will use the same curl command for my second container as well, just that it will have a different IP; because the application running inside the container is the same, you see the same output. So now let's take a look at the ps output and see how it looks from a container perspective. This one is good, because as I said before, I am calling the bash script first and within the bash script I am calling the Python process. Now let's go to the first container and do the same thing, look at the ps output. Do you see a problem here? This is the same application; it is just that in the first container I am calling that particular application directly, and in the second container I am using a bash script and from the bash script I am calling the application. But now, whenever I access this particular application, and I am just going to show you, do you see an increase in the number of defunct processes? Each time I access my application I am getting another defunct process. At the same time, for the second container, where bash is my PID 1, when I try to access it I am not seeing any problem.
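Something along these lines reproduces what was just walked through; the container names, IPs and port are illustrative, and this assumes ps is available in the image:

# Start both containers and hit the application a few times.
docker run -d --name direct demo-direct
docker run -d --name wrapped demo-wrapped
curl http://172.17.0.2:8000/     # container IPs and port are made up
curl http://172.17.0.3:8000/

# Compare the process trees inside each container.
docker exec direct ps -ef       # <defunct> entries pile up per request
docker exec wrapped ps -ef      # bash (PID 1) reaps its children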
So now, why is a defunct process a problem? Let's take an example: you have multiple containers running on each system, the same instance of the same container on each system, and at some point, when the load of the system goes high, you would have defunct processes within each container. What really happens then? Let's take a look at the ps output of the whole system. When you look at the ps output of the whole system, you actually see the real PID number used by the process that is running inside the container; here you can see this is the actual PID number consumed by the process running inside the container. So if you have multiple containers running on the host and they all have the same sort of problem, then you end up consuming all the PID numbers on the host. By default, on a Linux system, you have 32,768 PID numbers, so at any point of time you can only have around 32,000 processes. So if you have some container that is leaking orphan processes, and those processes are not being reaped, you end up in a situation where you will not even be able to log into the system, because to log into a system it has to fork a new process, but since all the PID space is used up, you will not be able to log in. The only way out is to either reboot the system, or, if you have some other way, reduce the number of processes running on the host or take care of the orphan processes. Now, if you run the same application on a bare-metal system or on a virtual machine, you will not see this problem, because on a bare-metal system the PID 1 is your systemd or your init, and they have a special feature where they take care of processes: they actually clear up the orphan processes. So this is problem number one. Now what is problem number two? Let's take a look at it. Is bash the answer to the problem? When you have bash as the PID 1 you are not seeing any orphan processes, so is bash the answer? The answer is a yes and a no. Yes, because it has the feature to reap orphan processes. Why is it a no? Look at this. What I'm going to do is stop this particular container, the bash container. Now you would see this particular command getting paused for 10 seconds; it is actually paused for 10 seconds, it is not hung, it is just waiting. Now it has come out, so it is able to stop the container, but it took 10 seconds. So what is happening in the background? Docker stop first sends a SIGTERM, that is, it is asking the container to shut down; it waits for, by default, 10 seconds, and then it sends a SIGKILL. When the container gets the SIGKILL, it exits very quickly without cleaning everything up. If you have a DB sort of application running inside the container, and you don't handle the signals properly, you end up having a problem where your data is not consistent, where the data from the cache is not flushed to the file system. So now, how do you find what is going wrong here? Let's say, for example, I'll start the container again and find the process; the process ID is 3960. If you look at the /proc file system, for a PID you have a file called status which gives you a lot of information about the process. Here we are just interested in SigCgt. SigCgt is a hex value, so we have to convert it into something human-readable, and for that we have a small bash script. Now let's take a look at what is happening here: the bash process is registered for only two signals, one of which is SIGINT; SIGINT is nothing but a Control-C sort of thing.
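For reference, this is roughly where those numbers come from; the PID is just the one used in the demo:

# Default PID space on a Linux host (commonly 32768).
cat /proc/sys/kernel/pid_max

# Which signals a process has registered handlers for: the SigCgt
# bitmask in /proc/<pid>/status, printed in hex. Each set bit is a
# signal number; `kill -l <number>` turns a number into a name.
grep SigCgt /proc/3960/status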
And then you see why bash is able to reap the child processes: it also registers for SIGCHLD. Since bash has that particular feature, you can actually see it there among the caught signals. So now, is bash the answer to the problem? Is bash the answer? We know bash is able to reap the child processes, but it is not able to handle the signals properly. So do we have a hack; can we use bash as the answer to the question of what should be the PID 1 in a container? The answer is yes, you could use it, but you would have to rely on some sort of script. So let's see how we could do it, how we could make bash handle the signals properly as well. This is what I said before: what we could do is put the actual application in the background, and then register for the signals with trap. If you register a trap and you wait for the signals, INT and TERM, then the moment it gets the TERM it catches it, passes the TERM signal on to your child process, and then it exits; so it will do a graceful exit. So bash can be an answer to the problem, but do we have a better answer for what should be the PID 1 in a container? You have seen this: as I said before, if you run the same application on the host system it is taken care of, because your PID 1 is systemd; it passes the signals on properly and it can also reap the child processes. So now, what about having a systemd or a minimal init inside a container? To solve these two problems, process reaping as well as signal handling, we actually have a couple of open-source implementations known as tini and dumb-init. These are so small that they only deal with two things, process reaping and signal handling; they come in at around 120 KB. You can have this particular binary in your container image and then call it this way: you use it in your CMD directive, either tini or dumb-init, and then your actual application. When you start that container, this is how the process tree looks: your dumb-init becomes the PID 1 and your Python application becomes a child of it, and this minimal init can handle the two things, orphan process reaping and signal handling. Now there is a thing coming with Docker 1.13 known as the init flag. This flag is available as a daemon option and you can also pass it with the docker run command. So the moment you start a container using it; let's go to the demo. This system has Docker 1.13.1, and I am just starting the container; the only difference here is that I am adding this new flag, the flag known as init. Let's see what happens. What is happening is that I am using the same container, but since I have used the init flag, the Docker daemon starts a minimal init, which is what you see here as the init, and then it starts the Python application as a child of it. So this could be a very good answer to this problem. Let's say you are not the author of an application which you want to migrate to a container, and you don't actually know how it behaves; this could be a very easy way to handle it. With Docker 1.13 you can pass it as a daemon flag, so that every container that starts will have a minimal init as the PID 1, or you can control it per container by passing it with docker run.
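A minimal sketch of that bash wrapper, plus the two alternatives just mentioned; the application command is illustrative:

#!/bin/bash
# Wrapper that makes bash a better PID 1: start the real application
# in the background, forward TERM/INT to it, and wait for it.
python /app.py &
child=$!

forward() {
    kill -TERM "$child" 2>/dev/null
    wait "$child"
    exit 0
}
trap forward TERM INT

wait "$child"

# Alternatively, ship a purpose-built minimal init in the image, e.g.
#   CMD ["dumb-init", "python", "/app.py"]
# or, from Docker 1.13 on, let Docker inject one at start time:
#   docker run -d --init demo-direct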
So we saw the minimal init, then we saw the docker init; what about having systemd inside the container? For some of us, yes, systemd could be heavy, but there are some additional features which you get when you have systemd inside a container. With the latest version of Docker you can actually have systemd running inside a container without using privileged mode; this is done with the help of OCI hooks. So what are the additional benefits? We saw that you can deal with the signal-handling problem as well as child reaping using a minimal init, but then why would you want to use systemd? Systemd comes with some additional features which I think you should consider. For example, let's say you are migrating an application that writes its log to /dev/log, or the log is captured by a daemon running on the host system. If you are planning to migrate such an application to a container, then what you would normally end up doing is writing the logs either to stdout or to stderr; there are multiple ways, and by using log4j you can also send the logs to a remote system, so it differs, but when you are moving an application from bare metal to a container, things should be easy for you, and this is where systemd helps you out. Now let's take a look at it. I am just starting a container which is going to start systemd. Let's take a look: here you can see the PID 1 is actually systemd. So what is the benefit you are going to get? With the help of the OCI hooks, the moment you start your container (if you see here, you are telling it to run /sbin/init, and on a Fedora 25 or a CentOS 7 image /sbin/init is actually your systemd), the Docker daemon detects that your PID 1 is systemd and then it does some additional things. For example, it registers your container with machinectl, and it does this for each such container. Here you can see that right now I only have one container running, and in that container the PID 1 is systemd. Now what I can do is check the logs of the container by running this particular command on the host system itself, so whatever output your application produces can be captured here using journalctl. For example, take a very simple application, the logger command; what it does is write logs to /dev/log, and you can capture that on your host system. If I run this journalctl command again, there it is. This is a very good benefit if you are just migrating an application that was running very well on your bare metal system into a container. So I will just open it up for questions, please.
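Roughly what that host-side check looks like; the image and machine names are placeholders, and this assumes the OCI hook registration described above is in place:

# Run a systemd-based container (image name is illustrative).
docker run -d --name sysd-demo fedora-systemd-demo /sbin/init

# On the host: the container shows up as a registered machine, and
# its journal can be read from the host.
machinectl list
journalctl -M <machine-name>

# Inside the container, anything written to /dev/log lands in that
# journal, for example:
docker exec sysd-demo logger "hello from the container"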
Yes, here you could do that, but then what you are actually doing is migrating an application which works very well on your bare metal system to a container, and if you now want to handle all these things yourself, you end up making more changes to your application. If the same application is running on your bare metal system, you don't need to do anything at all. And if you are not the author, I am saying if you are not the author of the application, then, as I said before, that is exactly why I used this example. Yeah, see, that is what I am saying: if you have an application bug, and you let that bug run inside a container without you knowing what is happening, you come to know about the problem only at the last minute, when you have multiple containers running and all those containers are leaking orphan processes; then you end up with this problem. So how do we handle these things in a better way? You could have a minimal init as the PID 1, so that you have a guaranteed way that application bugs like this are handled. For example, this is how I came across this myself: one of my customers was using Tomcat within a container, and all of a sudden, just days before the actual production date, we noticed this problem. That particular application was running perfectly fine outside; it wasn't creating any zombies even if you ran it outside a container, so we never knew what was happening, whether it was an application bug or not. This can be caught at the time of testing, but unfortunately we found it at the last minute, so we had to work around it. That is how I came across this particular topic of what should be the PID 1 in a container. So your questions are very valid, but then, any other questions? We just have one minute. So, if it is just one container, that's fine; but let's say you are using hundreds of containers of the same instance, then at some point it becomes a problem, right? Then you have to choose the right orchestration tool, then you have to use container-native storage. It is not that easy to use the host mount option when you have thousands of containers and a container can run on any node, because you have to make sure that particular host mount is there on all the nodes. So instead of using host mounts you start looking at NFS or a container-native storage cluster; it becomes a little bit of a tough problem, but you can do it. You can actually do a volume, or from the host system you send the logs to your central server, that is your ELK stack, or maybe you also use Splunk; you can send the logs from systemd to that remote server. When it comes to Docker logging or container logging, you wouldn't keep it on one particular host, because you don't know whether that host will go down or not, so you send the logs to a remote server. I think we are done, so thank you for your time. Thank you very much. I'd like to take just one moment to remind everyone about the feedback form that's in your conference bag; please do take a moment to fill out the form, it's very helpful for us to improve the conference every year. You can put your completed forms in the bags just outside the auditorium. And I'm now going to turn over to Zainab Bawa, our fearless leader. Just a quick announcement: the off-the-record session on MySQL at 3.45pm has been moved to the dining area food court, because it's raining outside, so please be there downstairs; we'll also have ushers ushering people in, but if you had food downstairs, that's where the OTR session is. Before the next speaker goes up on stage, I'd like to take about four to five minutes to quickly introduce the editors of rootconf
and just walk people through the process by which we did editorial. Can I have Aditya and Philip up here? So this year, okay, so this year we have five editors for rootconf, and I was the editorial chair, which means I was just herding the cats. Aditya Patwari, who has been speaking at rootconf for, I think, several years now; it's been a great opportunity to work with him this year. Philip Paeps, who is doing his second year as editor; he will most probably retire next year, after doing editorial for the third time. We have three absentee editors, and I hope they're watching us on video. I have a lot of gratitude for Saurabh Hirani, who has taken a lot of time and has run rehearsals and sessions from Singapore and given very granular feedback to attendees; Yagnik Khanna, who's not here, who's at another conference in the US right now, and who's also been a past speaker; and Piyush Varma, who also spoke last year, who's an editor and is also in the US. So with an editorial team of five people plus myself, we started curating talks from early February. We opened up a call for proposals and received a total of 95 proposals, including talks for DEF CON. The way we structured the editorial process, we did a bunch of things differently: we did a lot of public commenting on proposals where we had questions, where things were not clear; practically every proposal submitted draft slides and a two-minute preview video for us to assess both the person's content as well as their presentation skills; and before we confirmed any talk, we did a rehearsal in order to double-check for ourselves whether we were making the right decision or not. So between March and, I think, last Sunday, we were rehearsing with speakers; practically every speaker who has gone up on stage here, and who will speak at DEF CON tomorrow, has gone through a rehearsal, in some cases more than once. Apart from myself, who is employed with HasGeek, Aditya, Trouble and the other three are full-time employed in other places; we've been doing constant calls, late nights, across continents, trying to coordinate all of this. So we would definitely appreciate it if you give us feedback on the talks, if you think anything has not worked out right or could have been done better; we always try to improve the editorial process and keep it transparent, and we'd be happy to talk to you. If you'd like to propose to speak at any point in time, at DEF CON, at meetups and at conferences, do reach out at the help desk. I'd just like the audience to give a round of applause to both of them and to the absentee editors. Thank you very much; I have a lot of gratitude and appreciation for Aditya and Trouble for doing rehearsals and for being very brutal in feedback. So don't run away, Zainab. I should point out that Aditya and I and Yagnik and Piyush and Saurabh, we are gainfully unemployed and we just idle around, we never show up for conference calls; the only thing that makes this conference work is this voice in the background, on chat and on email, asking are you coming to a rehearsal, have you read this, all throughout the day, quietly sitting on your shoulder like an external moral unit, and this lady here who makes it work. So don't applaud Zainab for making this work; complain at us, praise Zainab. Well, this was fun. That's all I have. I think it was a good learning opportunity; the best part is that now I can roam around, because my part of the work is already done. One more thing: I will be forced into editorial retirement next year, because apparently you're only allowed two or three terms. Apparently I'm not yet allowed into editorial retirement, but if you
think you have the soul-crushing ability to rehearse speakers and to deal with the voice of Zainab at all hours of the night, I would encourage you to volunteer for next year's rootconf editorial. We'll break you in gently, and then next year, or the year after rather, you'll be on your own. So if you think this is something you want to do, come and talk to us; it's a lot better than being talked into it. So volunteer now, or be volunteered. Okay, just a last quick note before I take centre stage: our editorial policy is that there are two key things that we do very diligently. One is that past speakers are eligible to become editors for future conferences, and very active community members are eligible to become editors for future conferences. Trouble has not spoken at any HasGeek conference or at rootconf, but he was pulled into it because of his involvement with FOSDEM and a lot of other conferences, and we felt it was a good decision to bring him on and to streamline the process. The second thing is that nobody continues as an editor for more than three terms: you can either do three continuous terms, or you can do one year, take a break, come back another year, take another break, come back another year; but you will be voted out after three years, because the only tyranny here is that until I'm replaced, I'm the only tyrant. Okay, so thank you very much for your time, and yes, do talk to us if you think there's anything we could do better. We already have a blog post that Philip has written, on the rootconf blog, about how we select the talks for rootconf, so do check it out. We try to be as meticulous as possible; if there have been slip-ups, you are free to fill in your feedback form, or come and tell us up front. Thank you very much. All right, our next speaker has come all the way from Mumbai, and his claim to fame is that he may in fact be the Nizam of Hyderabad; you can ask him about this later. I'm not sure how that claim to fame works, but this is what he says. He'll be talking about provisioning of bare metal servers, as soon as his microphone is working. Yes, very good, so I present to you Azhar Hussain. Right, good afternoon everyone, thanks for joining. Afternoon sessions are special, right, because you've had your food, you're settling into your spot now, and, you know, a perfect lullaby session is all you need; so you know what, let's wake up a bit. Quick show of hands: people related to IT here? Seriously? Right. So, people who have got half their wardrobe from free stuff? That's surprisingly few; I'm not going to take it back. How about people working way after 12 o'clock? Right, ladies and gentlemen, I give you the sysadmins in the room. So let's get started. A little bit about me: I'm an operations engineer at Endurance, so my day-to-day job involves working around a bunch of VPS solutions, bare metal servers and dedicated servers. When you boot a system into PXE boot, you're going to get a screen which gives you two particular options: the default option, and the other one would be CentOS 7. The default option basically means it's going to boot using the local disk of the server itself, and in case you decide to go with the CentOS 7 option, it's going to start installing the CentOS 7 ISO. Razor does this slightly differently, in the sense that it uses iPXE. Now, iPXE extends PXE in a few ways, the most important of which, I would say, is the ability to use HTTP directly instead of having to stick with TFTP.
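For context, the DHCP side of a PXE-to-iPXE setup typically looks something like this (ISC dhcpd syntax; addresses and file names are illustrative, and the iPXE script would then hand the machine off to the Razor server):

# Append a PXE/iPXE stanza to the DHCP server configuration: plain PXE
# clients chainload iPXE over TFTP, and clients already running iPXE
# fetch an iPXE script instead.
cat >> /etc/dhcp/dhcpd.conf <<'EOF'
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;
  next-server 10.0.0.2;             # TFTP server
  if exists user-class and option user-class = "iPXE" {
    filename "bootstrap.ipxe";      # iPXE script that hands off to Razor
  } else {
    filename "undionly.kpxe";       # chainload plain PXE into iPXE
  }
}
EOF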
So, finally jumping into Razor: before we get into the depths of Razor and figure out how things proceed there, we'll just take a quick look at a basic Razor installation. Basically you need a DHCP server, you need a TFTP server, you need to set up your DHCP config according to the PXE configuration, you need to put something called the microkernel into your TFTP serving directory, and finally you need the Razor server and client packages, and it's going to use Postgres as the backend store. Most of these are fairly simple and straightforward steps, and at the same time you always have Puppet modules available that will automate the entire thing for you. So, in an age where we have so many existing tools available for network booting, what was the point of creating another one, and why should one even start to think about using Razor; what is the point of Razor? The whole philosophy of Razor is the need to be able to consume physical resources as if they weren't physical: you want to consume your hardware servers, your bare metal servers, as if they weren't real, the way you consume a virtual machine. The way you have VMs in AWS and OpenStack, where you specify, you know what, give me a 4x4 machine with 4 gigs of RAM, without caring what hypervisor it's going to be on or what kind of hardware configuration is underneath. Similarly, you don't really have to know the hardware specifics, you know, what is the latest-density RAM you can plug into this particular system; as long as a system has the right configuration, give me a system which has 40 cores and more than 30 gigs of RAM and I'm happy with it, I don't really need to know whether it's node 1 or node 2. Then you have another aspect here, which is the whole cattle-versus-pets philosophy, wherein a lot of tools treat their servers as pets: you're supposed to configure them with names beforehand, the way you do it in Cobbler; you're supposed to specify, you know what, this is the system I'm going to be setting up for this particular use case, and you specify the MAC address and you specify the profile, and only after that does your system start to boot. As opposed to that, with Razor all you have to do is, given that you have systems, three of which have 4 gigs of RAM, say give me a system which has 4 gigs of RAM, and it's going to pick one of those systems and set it up to boot. So you have a bunch of components in Razor which are quite necessary to understand, to figure out how things proceed further. To start with, you have something called nodes: basically any hardware node registered under the Razor server for management is called a node. Next up you have something called tags. A tag is essentially a unique name coupled with a matching rule. Let's say you have a tag of web servers and your matching rule is something like: the number of cores should be greater than 5; then any of the nodes which have more than 5 cores are going to be attached to that tag of web servers. Next up you have repositories. A repository is nothing but an object representation of your ISO, of what your installation disk is. Next up is policies. Policies are the glue in the middle: this is what matches the tags to the repositories, and this is what is going to install the actual operating system onto the server. Next you have brokers and hooks; strictly speaking, brokers and hooks are not essential components of Razor.
What a broker essentially is: since Razor has a very clear line between the time the installation is done and the time when your config management needs to start, you have this intermediary thing called a broker. The broker hands off, once the OS has been installed, to your Puppet agents or your Chef agent, which then starts setting up the server according to your config management. Hooks basically allow you to run arbitrary scripts at different points in time. To give you an example of how hooks could be put to use: we do a lot of management on our switches. Typically you want your deployment to be done on private subnets, over private VLANs, right? And finally, when the system is going to be put into production, it's going to go over public VLANs, over public subnets. So you can have a hook fire into your Junos or your Nexus switches at the end of the OS installation, and it's going to change the VLANs present over there to the public VLANs and your public subnets, given that you have specified the IP address, and on top of that it can apply whatever shapers and filters you want. Right. So, I mentioned the microkernel a bit back. What this is, is essentially a CentOS 7 kernel loaded with Puppet's Facter, and the point of the microkernel is that any system which has been PXE booted and brought under Razor's management is going to be loaded with the microkernel. What the microkernel does: it is the idling state of any node present in the Razor system. The microkernel collects all of the facts from the hardware and dumps them into the Razor DB. Now, when the facts are present in the Razor DB, the tags can have an async matching run over them and match nodes according to the matching rules. So if you're looking at the flow of Razor here: any system which has been PXE booted is loaded with the microkernel; the microkernel goes onto the system and collects the facts from it, details about your physical hardware, of the sort of what kind of disks you have, whether you've got SSDs or hard disk drives, what kind of processors you've got and the count of processors, whether you've got hyperthreading enabled on them, details about your network card, your memory options, and so on and so forth. Finally, once all of this data has been collected, there's an async operation going on in the background which checks every single tag and its rule against the nodes, and if there is a match, the node is attached to the tag, and a sequence of steps follows on as soon as the tag is attached. If there's a policy attached to that tag, the policy is attached to that particular node, and once the policy is attached, the system reboots, and this time it's loaded with the particular operating system which needs to be installed. After that you have a bunch of kickstart files and preseed files, if you will, which are created on the fly, and finally, once the OS installation is done, the brokers are given control of the node, and at the end of the complete installation there's a flag set which basically tells the Razor node that this system has been installed and you do not need to reinstall the operating system on this particular node in case the device is set to PXE boot and rebooted again.
So I've got a fake live demo of this process here. Basically, this is how all of your nodes would show up in the Razor database; I'm firing a bunch of Razor CLI commands at the Razor server. You can see that there's a bunch of nodes already present in the system here with their MAC addresses, and a couple of them have tags already present, and corresponding policies as well. This is a brief description of one particular node: you can see that the installed flag has been set to false. If you look deeper into that particular node, you can take a look at the facts; this is the entire Facter dump for that node. You get details about your processors, whether it's a physical machine or not, your architecture, whether there is an existing operating system installed on it and what kind of operating system it is; further on you get details about your partitions as well, and whether they're in a RAID configuration or not. Moving on, next we have the definition of a Razor tag. All objects in Razor can be expressed as a JSON object and directly imported. Over here you can see that the actual check is that the number of processors should be greater than 20, and the label has been set to demo. If you list all the tags present on this system, you can see that we have one, demo, where the check is a processor count greater than 20; it has one policy attached to it, but it doesn't have any systems attached to it at this point, because there are no systems in this Razor DB which have more than 20 processors. If you look at the definition of a policy here, you can see that the repo and the task have been assigned, as well as the broker; we've assigned the noop broker at this point. And you can have fairly generic stuff present here, such as your root password. The max count is the maximum number of nodes that you want installed under this particular policy; let's say you have a caching-server policy and you don't want more than 5 servers present at any point of time, you can specify a max count of 5. Finally you have a very important thing over here, the tags array, which is basically all the tags to which this particular policy needs to be attached. Next up is a very basic definition of a repo, which is basically the endpoint from which it's going to fetch the ISO the first time; eventually it's cached on the system. Another interesting thing about Razor is that it also stores all the IPMI credentials for each and every node. The point of storing IPMI credentials for all the nodes is that Razor now has control over IPMI and provides a direct wrapper around ipmitool, and this is beneficial because you can directly extend Razor's API and call ipmitool commands from there. So basically, for power-state management, you can specify in Razor that you do not want a particular system to be powered up; you can specify that this particular system needs to be in a desired power state of off. And finally, you can see the installed flag over here, which says that it is false; in case it was set to true, this system would not be PXE booted at any point of time.
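To give a feel for those objects, a tag and a policy along the lines of the demo might look like this in JSON; the field names and client flags are approximate and may differ between Razor versions:

# Illustrative tag: matches any node with more than 20 processors.
cat > demo-tag.json <<'EOF'
{
  "name": "demo",
  "rule": [">", ["num", ["fact", "processorcount"]], 20]
}
EOF

# Illustrative policy: ties the tag to a repo, task and broker, with a
# cap on how many nodes it may install.
cat > demo-policy.json <<'EOF'
{
  "name": "demo-policy",
  "repo": "centos-7",
  "task": "centos",
  "broker": "noop",
  "hostname": "demo-${id}.example.com",
  "root_password": "secret",
  "max_count": 5,
  "tags": ["demo"]
}
EOF

razor create-tag --json demo-tag.json
razor create-policy --json demo-policy.json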
Right, so finally I'll get into what the use case was here and why we decided to go ahead with Razor. First of all, what we needed to do was install a whole bunch of operating systems, and it wasn't specific to one particular flavour, including Windows. Next was the fact that we needed to provision systems based on rules, where you can specify, you know what, this is the configuration I want, give me a system of this kind, and I don't need to figure out what kind of servers I have, then search through that list, and then start spinning up servers from there. Next was a discovery and inventory management system. Typically in a DC operations environment you end up having two systems: you have a DC system, something like RackMonkey or Device42, where your DC ops team is generally going to be filling in data, and, let's say after a hardware upgrade, the onus is on them to update it manually. As opposed to that, Razor inherently has the microkernel, so the next time a system is rebooted, the microkernel is loaded and any hardware changes that have happened over time are reflected directly. Razor also has an API which is completely RESTful and quite extensive, which was also quite useful for our purpose. And finally, the complexity and ease-of-maintenance part of it: this particular project was very operations-heavy, rather than having a lot of dev work on top of these boxes, so we typically didn't need any config management system; we just needed to install a bare-bones operating system on them and then hand them off. So here are a couple of alternatives we evaluated. We started off with Cobbler, because we've been using Cobbler for a while around here. Cobbler doesn't have a direct rule-based engine by default, and it doesn't have a discovery system of sorts either, which is why Cobbler didn't fit this particular use case. Next up was OpenStack Ironic. This is a typical example of the cost not being justified by the returns: this was a completely new DC, and having figured out the intricacies and problems of dealing with OpenStack clusters, we decided not to go ahead with OpenStack Ironic. And finally there was Foreman. Foreman practically does everything which Razor does, and a lot of other stuff as well, so this was actually a split-hair decision; the fact that Razor does just this particular part, and is based completely around the rule-based engine, was what tipped the scale in favour of Razor. But if I think about it, if we had a typical backend architecture where you have your web servers, your application servers and the whole lot over there, in that case Foreman would probably be a good option as opposed to Razor. So in the end, a bunch of these tools do a lot of things, but they each do certain things specifically well; we needed something which was quite generic, where we can basically throw anything at it and it should be able to install it. Yeah, I guess that's about it. So when you were using Razor, did you work with, I mean, did you try provisioning a UEFI machine? I don't think so, no. So you can only do it in legacy BIOS mode? Yeah, I haven't tried it, so I'm not sure at this point. I know it doesn't work, I just wanted to check; that's where Foreman would actually work better. All right, thanks everyone. I just wanted to let you know that if you have received a red lanyard, and you would prefer to have a white one, there's now an additional supply of white lanyards, so if there's a difference to you, you'll be very happy to know this. The red one is for when you don't want to be
photographed. Okay, so if you don't want to be photographed and you have a white lanyard, then get a red one. Yep. Okay, and a reminder: we're about to have our beverage break, and this is your last chance to submit your flash talk, so put it on the whiteboard out there and be ready to present at 5.20pm. And that's it; enjoy your break, and we'll be back here at, what time is it, 4.35pm. Enjoy. Mr. Philip Paeps. Am I on? Yes. All right, let's see if I can walk over any cables. All right, so this presentation is entitled "FreeBSD is not a Linux distribution", and whatever you take away from this presentation, you should be soothed by the fact that it is not a Linux distribution. Am I feeding back? It is not a Linux distribution, so you can relax, calm down; there's no anger, nothing to be afraid of. It is not a Linux distribution; everything is fine. We've had a lot of Linux presentations at this conference, I think there's a parallel track about it, and I'm just here to tell you that you don't have to live this way. It is in fact possible to find a Unix operating system which does not make you angry, comes with all sorts of nice features, and does not present any surprises along the way. So, thank you all for the introduction. My name is Philip Paeps; I wear many different hats, and the hat I wear today is the director-of-the-FreeBSD-Foundation hat. The FreeBSD Foundation supports the FreeBSD project in many interesting ways, from money to hardware. We are a charity registered in Boulder, Colorado, in the US, and we take money from companies and individuals who use FreeBSD and feel that it needs to be improved, and we spend this money on sponsoring conferences, organizing BSD conferences, and arranging travel for BSD developers to conferences around the world. We also spend our funds on hardware and on improving support in FreeBSD for different kinds of hardware and different kinds of platforms. So, do any of you actually use FreeBSD; is anyone in this room using FreeBSD in production? Yes, good, the usual suspects. Anyone else? Ah, good, very good, so at least a handful of you. If any of you happen to have too much money lying around and you have fiscal incentives to get rid of it, the FreeBSD Foundation will be happy to relieve you of this burden and use it to improve FreeBSD. So that was the FreeBSD Foundation; that was the organization. Who am I? I'm a kernel hacker; I live in the mystical world below the system call layer. I'm also a conference organizer, a repeat offender: apparently I cannot attend a conference without somehow ending up organizing it or doing something. I don't know how this happened, it just happened. I'm also a consultant: I'm in the business of telling people that they're doing it wrong and getting paid for it. But my domain, where I live: I live mostly in device drivers, I live in real-time operating systems. I also have a troubled history with electronics, particularly radios, and I'm a professional paranoid. So as we go through this presentation, you'll see the professional paranoid and, you know, the simple hardware mind creeping in from time to time, so bear that in mind. Also, this presentation was originally written by George Neville-Neil, another FreeBSD Foundation director, who has many similar properties in his background, except for the electronics. So, on with the show. What is this FreeBSD thing? You know, is this Linux? You type FreeBSD into a search engine and you see this little devil guy smirking at you; are we some weird satanic cult? No, we're not. We produce an operating system, a complete operating system.
Unlike some other open-source operating systems, we feel we take a more holistic view of what an operating system is. We don't just produce a kernel, we don't just produce a bunch of tools, we don't just produce a bunch of libraries; we actually produce an entire operating system, with all the build glue you need to build it yourself and to build your own release of it. If you're familiar with Linux, you might have heard the word distribution once or twice, and it makes you cry. A Linux distribution is a collection of tools written by different groups of people with often conflicting interests and priorities, and somehow it works. FreeBSD is a complete operating system produced by a team of people who feel that they are producing an operating system. There are no conflicting interests in the FreeBSD project: the people who maintain the kernel are also the people who maintain the C library, the people who maintain ls and stuff, all that sort of goodness which, I understand, lives above the system call layer. We produce our operating system and we also produce the tools to build it and to maintain it; I'll come back to how FreeBSD is built in a moment. You get everything you need: basically, if you have FreeBSD, you are ready to code as soon as you install it. FreeBSD is a Unix-like operating system, and at last count we support 24,000 third-party packages in our ports tree, our package system. So any piece of software you are familiar with on other Unix-like operating systems, and even Linux, is there on FreeBSD: you pkg install nginx and you have nginx, just like you would expect it on another operating system. And in addition to all these operating system tools and third-party glue, we also document everything. I was smirking at some of the presentations earlier this morning; I think Manan in particular was looking up what the CPU load average is, what that means, and typed it into some search engine, and we went onto Stack Exchange, and many layers of searching later he found out that the answer was, well, nobody really knows what the CPU load average means. The FreeBSD project feels that you should not have to go through this, so we provide documentation: if you want to know what the CPU load average is, you can do man -k uptime, or read the man page of some other tool that gives you the CPU load average, and it will tell you about the run queue and how the number is determined, et cetera, et cetera. So we have documentation that will hopefully make your life easier. And we are also an open-source community, so the operating system does not stand by itself: it's a community of like-minded individuals that produces the system, and we hang out at conferences and we turn up in places, so I'll tell you a bit about the community as well. Okay, so who uses this FreeBSD thing? There are five people in this room, or six people in this room; great. I don't know if your names are on here, but at least I hope some of these names will be familiar to you. I think particularly WhatsApp is quite popular in India; I think it's a messaging application for phones. They run on FreeBSD entirely: all of their servers, all of their stuff runs on FreeBSD, and they admit to running FreeBSD. Another example of a company running FreeBSD is NetApp, if you have a lot of data. Does anyone use NetApp? Probably not willing to admit to it. NetApp is a big company producing storage systems for people with a lot of data who care about their data; their ONTAP operating system is basically a fork of FreeBSD, and they just track FreeBSD. Panasas and Dell, well, Dell EMC, have similar sorts of use cases, but really it's just FreeBSD underneath.
A lot of these names are in the storage industry, so Panasas, Dell and NetApp, wherever they are at the top, that's all storage; that's one area where FreeBSD is very popular. Another area where FreeBSD is popular is networking. Juniper Networks is the usual example: Juniper makes routers and switches and other networking equipment, other stuff that lives there, and their Junos operating system is basically a fork of FreeBSD; the control plane of your Juniper router is basically a FreeBSD machine which talks to the actual forwarding engines. Other examples: Yahoo Mail runs on FreeBSD; it still does, and it's remarkable that it works. And then Apple is another good example of someone using FreeBSD. Apple takes the FreeBSD userspace and parts of the kernel, merges it with the Mach VM system and Mach ports and some other stuff, adds their own secret sauce, and they call it Mac OS X. I call Mac OS X FreeBSD, the Unix for the desktop. So that's, you know; oh, and Netflix of course, thank you Todd, another company in the networking industry. Netflix, I think, is responsible for, what, 30 percent or more of the bits flowing across the internet on any given day, and all of those bits flow across FreeBSD machines: Netflix's caches are basically FreeBSD machines tuned to deal with massive storage and massive amounts of moving pictures pushed out of FreeBSD machines. So all of these are big companies. FreeBSD is not this marginal thing which doesn't really exist, and it has not been dying since 1998 or whenever Slashdot came up with that; FreeBSD is very real and is being used by many companies. You might be wondering why they are using this. Well, you know why they are using this: they are using it because FreeBSD has a history of innovation, and gradual innovation over longer periods of time. The FreeBSD project will think something up and we'll develop it, and then, you know, five years or ten years later we'll find it in other operating systems like, you know, Linux. We produce great tools, and FreeBSD is more than the sum of its parts: companies just like the fact that they can take FreeBSD apart and use it in their application. People in this room, who shall remain nameless, have shot parts of FreeBSD into space; people who shall remain nameless have used parts of the FreeBSD operating system on embedded networking devices. The components of FreeBSD just find their way everywhere, and the tools are great and they stand alone well. FreeBSD also has a very mature release model. Historically we released when it was ready, and that didn't really work very well, so about ten years ago we started releasing software roughly every six months: we say, you know, maybe it's time to roll a release, so we branch the tree and we go into a code freeze, and we polish, we add fit and polish to our operating system, and then we release it and we move on. So we have a branched development model, which I'll talk about a bit later, and this model is easy for companies to follow. If you are a Juniper of the world, or you are a Netflix of the world, you want your upstream operating system to be easy to follow. Contrast this with, say, Linux, where every couple of weeks someone throws a kernel over the wall and, you know, maybe it works, maybe it doesn't, maybe it boots, maybe it doesn't, and the kernel then conflicts with the release models of, say, the C library, which has the same maybe-it's-ready, maybe-it-isn't approach. And if you've ever built an embedded Linux operating
You don't have this problem in FreeBSD, because all of us have the same goals; we all work towards the same operating system, we don't fight. Companies who use FreeBSD also really appreciate our documentation, and the fact that it's available in many languages. George originally gave this talk in China, and he said that, hey, the FreeBSD Handbook, our main corpus of documentation, has been translated into Chinese, and it's a very well maintained translation. I took a look at the Indic languages, and unfortunately the only official language of India which has a good translation of the FreeBSD Handbook happens to be English. So if any of you like translating documentation, you are encouraged to join the FreeBSD project; I'll talk about that in a moment, and you'll find it useful. The FreeBSD project also has a very business-friendly license, which I don't want to talk about for too long, but if anyone has ever read the GPL, you know that it is massive and full of scary traps and dragons. The BSD license is about 300 words, and it basically says: here is some software, use it; if it works well for you, great, we'd appreciate a note, but don't feel any pressure. Our community is also something I like to advertise. The FreeBSD project has many, many mailing lists, and when you join them you can usually find a warm welcome. You send a patch and people will thank you for the patch, and if the patch is wrong they'll shout about the patch; they're not going to shout at you. I'll talk a bit more about the community later on. So these are all very compelling reasons to use FreeBSD. So how did we get here? I've got about 3 minutes for this, so I'll keep it short. The history of Linux is well known: this guy in Finland had an itch to scratch, that's why there's an operating system, and that story has been told a little bit. The BSD history is a little bit more, let's say, measured, or a little bit more lengthy at least. So in the beginning dinosaurs roamed and chaos reigned; that was the 1960s. And then eventually people realized that we need operating systems. No sign of BSD yet, but someone came up with Multics: everything is fine, users are secure, time sharing, very good. And then some people with a strange sense of humour said, well, we've got Multics, we should have Unix as well, and then things got a little bit... still no sign of BSD. But then, after a year or two, people started using the Unix thing from the labs, and they discovered, hey, this is some stuff, but wouldn't it be nice if we had nice things? So nice people at Berkeley started producing add-ons and patches to the Unix operating system from the labs, tools people found useful, like say ex and vi, because the Unix editor was ed, and I can tell you that it's not fun; ex and vi are a lot more fun. So that was one of the things on the Berkeley tool tape. Another thing on the Berkeley tool tape was a Pascal compiler; it turns out that not everyone likes writing C, so a Pascal compiler was one of the Berkeley tools, and other tools just came up. Then in the 70s someone had the bright idea that maybe these computers could talk to each other, instead of people just being angry at computers, sincerely angry at computers. So DARPA, the Defense Advanced Research Projects Agency, threw some money at the Computer Systems Research Group in Berkeley and said, you know, build the internet. I don't think they said that, but I only have three minutes for this slide, so: build the internet. And out of the CSRG came things like TCP, IP, UDP and all of these protocols which we now know and love. Then in the 90s we had a bit of a hiccup, where the people at the labs said, well, you know, this BSD stuff is very interesting, but this is actually our stuff, and we were the first ones to be using it. So there was a lawsuit that distracted people from BSD for a while, and people started using that other system; never mind. Eventually 386BSD came up, we had BSDi, and then, finally, BSD was free after all this time. And it quickly turned out that BSD was not just one thing. Historically BSD had been a bunch of people at Berkeley saying, here's Unix, have fun with it, or here are some add-on tools for Unix, have fun with them. When BSD was finally free in the early 90s, it turned out that there were two or three competing views on how this BSD should be free, and we ended up with a couple of different BSD projects. We have the NetBSD project, which felt that BSD is very nice, but the best way to use BSD is on everything from your toaster to your mainframe, on everything. The FreeBSD project felt, well, yeah, that's nice, we like portability, but we actually would like to be fast; so they disagreed with NetBSD on where the platform focus should be, and two different BSD projects formed: the NetBSD project going off to be portable, and it's running on every printer in the world these days, and the FreeBSD project, which runs everywhere else. In 1995 the OpenBSD folks had a disagreement with the NetBSD people about priorities on security and portability, so OpenBSD forked off, and a few years later other projects forked off as well. There are projects which take the FreeBSD operating system, just like Juniper does, or just like any of the other consumers of FreeBSD do, and optimize it for running on a desktop; Apple does the same thing, and they optimize it for running on a desktop. And there are some smaller BSD distributions. All that aside, I'm not going to talk about them; I'm going to talk about FreeBSD. What do we do? We produce a whole system. What do I mean by a whole system?
I mean that we produce an operating system which comprises all of the device drivers, all of the compilers and associated tools, debugging tools as well; I don't know if anyone has ever tried to debug Linux, but you just can't, certainly not as soon as it's installed. We have editors in our base system, and we have a packaging system that allows you to install anything that is not there. So what we actually produce as the FreeBSD project is an entire operating system that is ready to code when the install is done. Obviously your definition of "code" will depend on what layer of the stack you operate on; our definition is that you can write programs in C or C++ as soon as your install is done, provided that your editor of choice is ex, vi or nano, and that's ready to code when the install is done. We also have a packaging system, so if you prefer another editor like vim, or you like to live a life of pain and you use Emacs, you can just pkg install these things and have them. The whole-system approach basically means that what we deliver is known to work, at least together with itself. You never have this awkward situation where your compiler is not willing to produce anything that your assembler likes, or your linker doesn't like the objects spat out by your compiler, or your debugger does not understand the particular variant of ELF that comes out of your linker; those are all things I've encountered on Linux, by the way. We produce this as a team, we work together on this, and we have one bug tracker for all of these components, and that allows us to deliver a polished operating system that just shines very nicely. If you write a device driver, for instance, you don't need to go and write a separate tool for that device driver to dribble some bits; you can just add it to ifconfig. Whereas in Linux, every single network device driver has some userspace tool to fiddle with bits; there's also ethtool, but there are at least four different variants of ethtool. We don't have this, because we have this holistic approach: we produce the whole stack, and it's not because you're in device-driver land that you are somehow forbidden from developing ifconfig. Sometimes I wish that were the case, ifconfig is not my favourite piece of code, but there's nothing preventing me from going into ifconfig and adding something for my particular device, provided of course that architectural constraints are met. I'll talk a bit about what parts we have in this whole operating system. We have some file systems which are real selling points for FreeBSD. We have a whole bunch of file systems, but only two of them are really relevant. We have the UFS file system, the traditional Unix file system, which is essentially the same file system that is on Solaris or any other Unix, because they are basically derived from our original UFS. UFS is a rock-solid Unix file system, so it's predictable; it's not necessarily always fast, depending on workloads, but some workloads are actually high performance, and FreeBSD's UFS is high performance on read and write under certain workloads. Our UFS also has snapshots, I think Solaris's UFS has snapshots now as well, and we have journaled soft updates, so if you have a terabyte of data on a UFS volume it doesn't take three weeks to fsck. But if you have a terabyte of data, really what you want is ZFS. ZFS was originally developed by Sun, and how it was open-sourced is an interesting history, but it is the only file system you need if you have a lot of data and you care about your data; ZFS is really the file system you want. ZFS is a file system and a volume manager combined. It has snapshots which are copy-on-write; they take no time at all. zfs destroy is instant too; you can instantly destroy your data as well. Also, unlike any other file system I've encountered, ZFS is aware of your disk and it does not trust your disk: in the background it checksums and scrubs all your data and checks that the disk is not lying to you, because disks lie, and every other file system just doesn't get this, just assumes that the disk is telling the truth and that data does not get silently corrupted. If you care about your data, you really want ZFS. I believe there is a ZFS for Linux as well; I have never heard of anyone using it successfully, or rather I've never heard anyone who is using it not complain about it. ZFS also works on Illumos, SunOS, Solaris, whatever they are called this week; of course, FreeBSD's ZFS is slowly becoming, well, not the reference implementation, but one of the more mature implementations of ZFS. So if nothing else, and you care about your data, ZFS would be one good reason for you to go and install FreeBSD. You still need backups, but you might not need to restore them quite so often. FreeBSD also has a lot of security features. There are a lot of talks about containers and container systems and namespaces and all sorts of things, and all of those are just cheap rip-offs of what FreeBSD calls jails; well, cheap rip-offs done wrong, as far as I'm concerned. So if you want lightweight virtualization and you don't want to ask the question "what should PID 1 be in my container", maybe you should look at jails. For the record, PID 1 is whatever the first process run in the jail is; as long as it catches SIGCHLD, your zombies will be reaped. Jails are lightweight virtualization: basically it takes the chroot system call from FreeBSD, or from Unix rather, and teaches it about networking, so it constrains the power of the root user beyond the file system. It's a chroot with an IP address, or any number of IP addresses.
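A minimal sketch of that "chroot with an IP address" idea, written from general FreeBSD knowledge rather than from the talk: the jail name, path and address are hypothetical, a populated jail root is assumed to already exist at that path, and the jail(8) command is simply invoked from Python.

```python
import subprocess

def run(*args):
    # Invoke a FreeBSD command and raise if it fails.
    subprocess.run(list(args), check=True)

# Create a persistent jail: essentially a chroot that also gets its own
# hostname and IP address, constraining root inside it beyond the file system.
run("jail", "-c",
    "name=demo",
    "path=/usr/jail/demo",              # hypothetical, pre-populated jail root
    "host.hostname=demo.example.org",
    "ip4.addr=192.0.2.10",              # the only address the jail may use
    "persist")                          # keep the jail around with no processes

run("jls")                              # list running jails
run("jexec", "demo", "hostname")        # run a command inside the jail
run("jail", "-r", "demo")               # remove the jail again
```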
We also have MAC and audit frameworks. If you are in an industry that cares about security, and cares about knowing what happens in an unforgeable sort of way, MAC and audit are things which should be familiar to you, or should be something you care about. FreeBSD has distributed auditing, so you can track every file system event, every network event gets an audit stamp, and that can be sent to another machine so the logs can't be tampered with. MAC can also constrain other things: I've just been reviewing a kernel module for real-time priority constraints, so that your process can use real-time priorities without having to be root on the system. So MAC can give you more privileges or remove privileges; it's like SELinux, but done right. We also have Capsicum in FreeBSD. Capsicum is a research technology from Cambridge which tries to sandbox individual applications. It works on a capability system, and a capability is a privilege you have which you can either shrink or delegate, but you can't ever expand it, so once you enter a capability sandbox you are constrained by that sandbox. Use cases for this are things like decompressors, or things like tcpdump, where if you're running tcpdump and you're getting some protocol on the wire, you would prefer not to have that protocol exploit bugs in the dissector and suddenly run random code on your system. Capsicum allows the dissector to run in a sandbox that does not have the privilege to touch files; tcpdump can still write to the terminal, but the dissector can only communicate with tcpdump through a well-known, or well-defined and predefined, IPC mechanism. With decompressors, if you have this compressed file which you downloaded from some dubious source and you start uncompressing it, you don't want a bug in zlib or a bug in LZMA or whatever compressor flavour of the week you have to suddenly go and blow up your system. So those are some of our security features. We also have good compiler technology: we moved to LLVM and Clang years ago, so GCC is no longer part of our world, and we have a modern compiler toolchain in FreeBSD, not just compilers but also debuggers. I don't think anyone here cares deeply about the toolchain, so I'll move on. DTrace: I think someone asked a question this morning about how do I know what my process is doing at any given time. FreeBSD has DTrace, which is a dynamic tracing framework which also came from Solaris originally; I think Todd gave a workshop on it here last year. Basically DTrace makes visible all this code going on and shows you what goes on inside your process: it traces system calls, it traces library calls and function boundaries, and you just see what your code is doing. If you're trying to debug on another operating system, you're basically going to be doing printf, or you're in the debugger single-stepping; with FreeBSD you don't have to live this way, we have DTrace. What on Earth is my process doing? DTrace: oh yeah, it's doing that. So DTrace is another big selling point of FreeBSD, and if you maintain complicated systems then DTrace is definitely something to look at. I believe the JVM has known about DTrace for a long time, so that might be something to look at too. I mentioned earlier that FreeBSD in general has a networking history: TCP and IP are Berkeley technologies, and FreeBSD is still the reference implementation of many networking technologies. We have pluggable TCP stacks: if you don't like the default TCP stack, use another one; someone has recently contributed the BBR TCP stack, I don't really know what it does, I know Google is very fond of it. We have RACK, and we have the usual CUBIC and New Reno TCP congestion algorithms. If you have applications which run on TCP and you care about tuning congestion control in different ways, FreeBSD makes this really easy. I think you can do this on Linux as well, but every time I've looked at the Linux TCP stack I've wanted to look away. We have three firewalls; you'll find that only two are listed on this slide, and the maintainer of one of these firewalls is right there. We have IPFW, which is a very mature, very good firewall, it's been around for ages, and the only thing that's wrong with it is that its syntax is not always friendly. And then we have PF, which we imported from OpenBSD a long time ago; it's got a much friendlier configuration syntax. I have done a PF workshop and I've had people type a PF configuration file without looking at a single manual page, by just guessing what the configuration syntax would be. Compare that to iptables: does anyone know how to add a simple network address translation rule to iptables without looking it up? Right: -j MASQUERADE, -t nat, jump through six hoops and hope for the best.
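To make that comparison concrete, here is a minimal sketch, from general PF and iptables knowledge rather than from the slides, of what a NAT rule looks like in each. The interface names and networks are hypothetical, and the Python part just writes the PF rule out and asks pfctl to load it.

```python
import subprocess

# One line of pf.conf does the NAT (FreeBSD's PF keeps the classic nat syntax):
# translate everything from the internal network to whatever address em0 has.
PF_RULES = "nat on em0 from 192.168.1.0/24 to any -> (em0)\n"

# The rough iptables equivalent, for comparison:
#   iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE

with open("/etc/pf.conf", "w") as f:       # hypothetical: a ruleset with only this rule
    f.write(PF_RULES)

subprocess.run(["pfctl", "-f", "/etc/pf.conf"], check=True)  # load the ruleset
subprocess.run(["pfctl", "-e"], check=False)                 # enable PF; errors if already enabled
```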
We've also got something called dummynet, which is very useful for people who test things that run on networks. Dummynet can pretend to be wider or narrower pipes, you can introduce latency on your network, and it can also do queuing, so if you want quality of service, dummynet can help you too, but it's primarily a network testing tool that can introduce arbitrary latency on your network. We have some other networking tools; I don't have enough time to talk about all of them in depth. We also have virtualization, things like virtual machines. Many of you use unreal hardware, Amazon and DigitalOcean and things; that's great, that's wonderful. FreeBSD releases are available for VMware, for VirtualBox, for QEMU and for Hyper-V, and they will just work on the cloud provider of your choice. Since, I think, the 10-point-something release we also have official Amazon AWS images, so if you are an Amazon consumer and you would like FreeBSD, I think it takes you five minutes; well, it takes you 20 minutes to create an account and then it takes you 5 minutes to boot your first FreeBSD machine, roughly-ish. And those are official FreeBSD images, so you can just use the usual FreeBSD tools, freebsd-update and so on, to update your system. We also have bhyve, which is a native virtualization tool. I already mentioned jails, which are lightweight virtualization; bhyve is, I would say, more the equivalent of KVM, but it's a lot lighter weight and a lot more consistent. So if you want hardware system virtualization, you should look at bhyve. I added this slide because I realized that I need to sell this thing to all these Linux victims: we also have something called system call translation, which I call Linux personality disorder. If you have a Linux binary which for some reason you cannot build from ports or from source, whatever, this is FreeBSD: you can just run that Linux binary on FreeBSD, and in the vast majority of cases it will work fine, and it might even run faster than on Linux. One of the few redeeming features of Linux is that it has a strict contract between the kernel and user space that is usually not violated, and as long as the operating system presents a system call table that smells a bit like Linux, a Linux binary will just run. FreeBSD has a system call table that smells enough like Linux to convince Linux ELF binaries that, oh yeah, hey, it's Linux, I'll just run. That allows you to run a lot of binary-only things that are inexplicably only delivered for Linux, such as Oracle's database or various CAD tools from people like Mentor or Eagle. It also works well for "you've lost the source code and you only have a Linux binary"; okay, it still works. It's an easy sort of gateway drug into FreeBSD, and you might find that your binary runs faster on FreeBSD than it does on Linux. The last time I tried, it was some game, I don't remember which game, but it ran faster on FreeBSD than it did on Linux despite being a Linux binary. Some newish features that are in the pipeline: we scale pretty well already, FreeBSD is a massively scalable operating system, we scale to many cores, but we have this 256-core ARM ThunderX machine sitting in a rack and it takes 40 hours to boot; that's being worked on. So scaling to more and more cores, 200 cores, 500 cores, whatever, and a lot of memory: it works up to reasonable limits, but there's always room for improvement, and that's being worked on as a priority. NUMA: there is support for NUMA in FreeBSD, and it is ever improving. We have ARM64 support; it runs, I've seen it work. We also support a number of experimental network technologies like multipath TCP and data-center TCP, and again this BBR congestion control algorithm from Google. Should you be running on bare metal as opposed to a virtual machine, FreeBSD also supports UEFI and Secure Boot, so you can run on constrained firmware without any difficulties. So that was what the FreeBSD project produces: we produce this operating system, all this good stuff, and you should all go and use it. But the FreeBSD Foundation also has a duty to increase the membership of the FreeBSD project, so let me tell you a bit about how the project works and how you can be drawn into contributing to our wonderful project. The FreeBSD project has a democratically elected core team: the FreeBSD developers, as in the people who have commit access to our Subversion tree, elect nine members from among ourselves to be our core team. The core team is the guidance for the FreeBSD project, and this is done every, I forget how many years, I think it was three, but every now and again we elect a new core team. So how do you become a member of the FreeBSD project? Well, we have this thing called a commit bit, and a commit bit basically means you can type svn commit and your bugs are now everyone's problem. We have a concept of mentorship, which Google stole from us about 10 years ago and called the Google Summer of Code, where experienced members of the community will notice people submitting patches to mailing lists, and people complaining and backing up their complaints, and they will mentor these new people into improving their code and improving their interaction with the community. We call this mentorship, and we ask the core team that this person should become a member of the FreeBSD project, that he or she should have a commit bit. The core team goes and looks at the track record of the developer and says, yeah, these are good patches, plays nice with others and all that sort of good thing: have a commit bit. Then your first N commits need to be approved by your mentor, after which your mentor says, okay, go off and collect your own bugs, I'm no longer responsible for any of that. The FreeBSD project also has the concept of a hat: if you feel that you need to be a release engineer, you will wear the release engineering hat, and the hat stays on your head as long as you do a good job. When you stop doing a good job, that will hopefully become apparent to you before it becomes apparent to anyone else, and you pass the hat on to someone else. The FreeBSD project has no dictator; I'll just leave that line out there. So how do you become a committer? Well, all of you have a task: you should join the mailing lists. I would recommend freebsd-hackers or freebsd-current; one of those is only a couple of hundred messages a day, it doesn't hurt much. You should check out our source code. We use Subversion as our revision control system, but we also use GitHub as a collaboration tool; it's quite good for that. So you check out the code and you discover, oh no, it's broken, I don't like it, or, you know, this needs fixing and I have too much time on my hands. So you submit a patch to a mailing list, and you keep doing this for a while, until somebody notices that this person is submitting way too many patches, and that person will then ask if you'd like to be a committer; this gentleman in the front row here has undergone this process, he's now crying. You can say no, but it turns out that people are foolish enough to say yes: okay, fine, I'll take a commit bit. Your mentor then proposes you to the core team, saying I have this new committer, can he commit please, and then you get a commit bit and you have to get your commits approved. And meanwhile this unnamed individual might be noticing someone else on a mailing list submitting many patches.
He can then go and say, okay, fine, I'm sick of committing his or her patches, can I please have a commit bit for this person, and then the cycle continues, so the FreeBSD project gets more and more members as time goes on. About 7 more minutes, and I'm almost at the end of my talk. I promise not to talk too much about it, but I'm very fond of the BSD license. If you want to join many of the popular Linux projects, you need to pass a rigorous philosophical examination and have all of your moral background and fibre examined by a team of, I don't know, high overlords of mental purity. The FreeBSD project does not have such an examination; we subscribe to the BSD license, which I have not memorized, but it's short enough to stick on a slide, and that's everything you need to agree to. Well, that, and play nice with others. But this is the BSD license: it basically says, I wrote this code, I think it is useful; if you want to use it, I'd appreciate being credited; I don't care if you ship binaries or source code, or if you print it on paper and display it somewhere, I don't care; just don't blame me if it blows up in your face. Simple enough license, I think. This is the GPL: yeah, it says all sorts of exciting things, but in particular it is viral. The BSD license is not viral at all, anyone can use it, it's business friendly: you can take a BSD binary and use it for whatever purpose you like, as long as you don't blame us when it blows up in your face; we are cool with that. The BSD license, a couple of hundred words, you know. So that's all; 5 minutes and 25 seconds to go for questions. We have a website, www.freebsd.org. As I said, if you have too much money and you find FreeBSD useful, let us know: the FreeBSD Foundation is always happy to help out with that monetary problem, we'll take your money. We are on GitHub: if you are a Git user you can just clone FreeBSD from GitHub and start hacking. Join a mailing list to become a committer. We also have forums, if you like websites. The FreeBSD Handbook, as I said: if you like translations, you can translate it into any language you like. We hang out on IRC; it's not difficult to find, ask your favourite search engine for FreeBSD IRC channels and you'll find them documented. So that's it. Questions? Oh yeah, you are now allowed questions. There are no prizes. Hey, over here, this guy, this guy. So you spoke of the file system with the snapshots, the ZFS; can you quickly breeze through exactly what its benefits are and what it is about? It's basically a different file system. ZFS is the file system to end all file systems; ZFS is a file system and a volume manager. It was originally written by Sun in, I don't know, the late 90s or something, and basically what it wants to be is the reliable file system for a lot of data. ZFS has a volume manager built in, and the pool just grows: if you have more disks, you just plug your disks in and ZFS will happily use them. It has built-in mirroring and built-in striping, so things like RAID: you just add disks and you tell ZFS, I would like these five disks I just plugged in, I'd like to see a RAID volume please, which can survive one toasted disk; or you tell it, I'd like them to be a mirror please, so that if I unplug one of the disks nothing is lost; you can tell the pool that. And the pool is low maintenance: you plug in disks and it just works, and it grows automatically; all the storage will be available, and it just works.
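As a concrete illustration of the pool behaviour described in that answer, here is a minimal sketch of creating and growing a ZFS pool. It is written from general ZFS knowledge, not from the speaker's slides; the disk device names are hypothetical, and the zpool commands are simply wrapped in Python's subprocess module.

```python
import subprocess

def zpool(*args):
    # Thin wrapper: run a zpool command and fail loudly if it errors.
    subprocess.run(["zpool", *args], check=True)

# A mirrored pool named "tank" over two (hypothetical) disks:
# losing either disk does not lose data.
zpool("create", "tank", "mirror", "da1", "da2")

# Alternatively, a RAID-Z pool over five disks that survives one failed disk:
# zpool("create", "tank", "raidz", "da1", "da2", "da3", "da4", "da5")

# Growing the pool later is just adding another vdev; the extra space
# becomes available immediately, with no resizing step required.
zpool("add", "tank", "mirror", "da3", "da4")

# Show pool health and capacity.
zpool("status", "tank")
zpool("list", "tank")
```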
On top of that is the ZFS file system, which uses the pool, and the file system has features such as snapshots, which are copy-on-write. So you say, okay, fine, I'm going to upgrade my operating system and I would like to be able to roll back if it blows up in my face: zfs snapshot, then the name of your file system and a timestamp or another identifier that makes sense to you, and you have a snapshot. It takes no time at all, it's constant time, and the snapshot is copy-on-write, so if you're done with that snapshot you just destroy it and it was never there. If it turns out that the upgrade did blow up in your face, which obviously is your own fault, you can just roll back to that snapshot and it's as if nothing happened. You can also send and receive ZFS snapshots, so there's streaming support: say you have one machine with a bunch of ZFS file systems, and you've got another machine somewhere that wants this ZFS file system. You can just zfs send on this machine and zfs receive on that machine, and the entire file system just goes across the wire; you just pipe it across SSH, or you pipe it across netcat or whatever you want. Any snapshots, if you send them recursively, just go along the wire too, and encoding rules, UTF-8, all those interesting properties travel across with your file system. So ZFS is basically: if you've ever used Veritas or another volume manager, it's that, plus a good file system, minus the anger. Does that answer your question? Yeah, thank you. Over there somewhere; one minute and eight seconds before I get shredded. Hi, it's a tough question, Philip: why is Linux more famous? Oh no. Well, I defer to my colleague over here... I'm not sure, I'm not sure if there's an answer to that. Part of the answer, historically, is that in the mid to late 90s, as the dot-com bubble was beginning to explode, the BSDs were mired in other problems, which may well have been irrelevant both at the time and with the benefit of hindsight, so I think we were just off doing other things in the late 90s and Linux just took off; I don't know. Also, Linux has a very low barrier to entry, in the sense that you can write some code and just drop it somewhere, and some project will pick it up and it will end up somewhere, because there's no sort of cohesion, there's no shared goal to work towards. You can add a bug to it, you don't have to care about the kernel people. That appeals to a certain hyperactive segment of the population who care more about instant gratification and are less, I don't know, tickled by the satisfaction of a job well done on an integrated operating system. So I think instant gratification, there's a dozen people taking my patches, and the distractions in the 90s really led to Linux's success. I used Linux for a while; I've been clean for more than a year. I just don't understand it; I just feel it's a fair operating system, and there's nothing wrong with its success. Next, the gentleman in the green shirt in the back; I'm on overtime. Okay, well, you'll have to fight whoever shouts loudest. Hello, yes: is there something like high availability? Well, it is highly available by default, but you mean things like, you have a cluster of things and one of your machines gets destroyed? Not at the ZFS layer directly, so you'd have to do something iSCSI-ish, or you'd have to expose your ZFS through some other means to have that high availability. Obviously, if you want high availability just for the checkbox, then you'll need to do something iSCSI-ish.
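A minimal sketch of the snapshot, rollback and send/receive workflow described in the ZFS answer above, again written from general ZFS knowledge rather than the speaker's slides; the dataset, pool and host names are hypothetical, and the commands are simply wrapped in Python.

```python
import subprocess

def sh(cmd):
    # Run a shell pipeline; check=True so failures are not silent.
    subprocess.run(cmd, shell=True, check=True)

# Take a constant-time, copy-on-write snapshot before an OS upgrade.
sh("zfs snapshot tank/home@pre-upgrade")

# If the upgrade blows up, roll the dataset back to that point in time.
sh("zfs rollback tank/home@pre-upgrade")

# If everything went fine, the snapshot can be destroyed as if it were never there.
# sh("zfs destroy tank/home@pre-upgrade")

# Replicate the dataset (recursively, -R) to another machine by piping
# zfs send into zfs receive over SSH; netcat would work just as well.
sh("zfs snapshot -r tank/home@replica1")
sh("zfs send -R tank/home@replica1 | ssh backup-host zfs receive -F backup/home")
```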
And I think I'm out of time. I do have a special request regarding tweeting or photographs: I'm not on social media, I'm too social, so please, if you can constrain yourself to not putting my name on any surveillance networks you might be plugged into, I'd appreciate it. Obviously I have no way to know, so use your moral judgment, but I would appreciate no pictures of me ending up on the Facebooks of the world, the Twitters of the world. Thank you. Alright, if you must, since you asked so nicely, alright, go ahead. So there is one topic: GELI versus LUKS, LUKS being the full disk encryption on Linux. GELI, however you pronounce your G's, just works. And actually, one of the nice things about the full disk encryption in FreeBSD is that the gentleman who added support for decrypting disks to the boot loader had never written a line of assembler in his life before he committed this feature. He just felt that he needed to decrypt his disks in the boot loader so he could have full disk encryption, and he made it work, with some of the most horrible code I've ever seen, but people on the mailing list patiently helped him with "yes, but that's not how Intel really works". So yes, we have full disk encryption and we have a good community: two questions answered in one go, and now I'm really out of time. Thank you very much. Alright everyone, we're now going to begin with the flash talks. Thank you to all of our submitters; we have a total of 10 talks. We're very strict about the time: they're exactly 5 minutes long, and you will be kicked or dragged off the stage if you exceed the time. There will unfortunately be no time for questions, because we have so many talks. So without further ado, let's start with Harshal, who is going to talk about tweetstormy.com. Oh, and I have a couple of other announcements to make, just quickly: there is a talk in the banquet hall downstairs by Arpita, about sleep, at 1745. There is also an ongoing off-the-record talk about AWS cost optimization, which has already started at 5pm, but if you want to catch the end of that, it's happening. One last-minute thing, ah yes: the first shuttle for the party venue leaves at 6pm, people should gather near the registration desk, and shuttles will leave every 20 minutes. The party is at Brewski pub. You must wear your badge to have access to the party, and you must not lose your badge, because you will need it again tomorrow to get back into the conference venue here. Okay, over to you, Harshal. Hi everybody. So you might have seen me through the morning, I was a volunteer, so what am I doing up here? Let me warn you, it's not related to DevOps at all. So, I am a drug addict. Okay, now that I have your attention: that drug is Twitter, and I think this is a relevant audience for that. I have been working on a very little side project. If you are a pro Twitter user, you might have seen a lot of threads on Twitter; those are called tweet storms, and like a lot of famous internet things, they were invented, or the term was coined, by Marc Andreessen of Netscape; you might have heard of his VC firm. I will show you a few famous tweet storms: this was one by a guy called Siddharth, about sexism, but not in the way you would expect. There are other famous tweet storms as well: MKBHD, a famous tech reviewer, has one, and there's another by Dan Abramov, an open-source contributor. So I will show you what I mean. Right now it's not publicly accessible and the logic doesn't fully work, so I am looking for beta testers. You type here, and instead of manually sending a tweet and then replying to it, you can see the tweets in advance. The logic is a bit more advanced than just cutting at 140 characters: it breaks before 140 characters so that each tweet in itself is coherent, and to view the tweets we will have a good mobile view as well. It will be a completely free project, just for learning. I am a computer science student, and you might have heard of the principle of least power, that everything that can be made in JavaScript will eventually be made in JavaScript, so I was fascinated: let me learn Angular, that sounds fancy, and let me learn Koa; I already knew Express. But when I made the frontend I was very tired, oh, what is all this Angular shit, so then I decided to go with Express itself, because I didn't have the patience to learn anything else. So that's it. We will be launching very soon; if you are a pro Twitter user and would be interested in this kind of thing, please go and join the waitlist or email me, the contact info is there in the photo, and sorry for putting you through this blatant advertisement. Thanks.
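The splitting logic Harshal describes, breaking before the character limit so each tweet stays coherent, is easy to sketch. Here is a minimal, hypothetical version in Python (his actual project is in JavaScript/Express, so this is not his code): it splits on word boundaries and never exceeds the limit.

```python
def split_tweetstorm(text, limit=140):
    """Split text into tweets of at most `limit` characters, breaking on
    word boundaries so each tweet reads coherently on its own.
    Simplification: assumes no single word exceeds the limit, and ignores
    the space a "1/" style counter would consume in a real client."""
    tweets, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                tweets.append(current)
            current = word
    if current:
        tweets.append(current)
    return tweets

if __name__ == "__main__":
    storm = split_tweetstorm("Monitoring is not just green checks on a dashboard. " * 10)
    for i, tweet in enumerate(storm, 1):
        print(f"{i}/ {tweet}")
```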
Okay, our next talk is going to be Docker on ARM; is he here? Okay, we skip that one, and if he shows up later, maybe. In that case we will go to the next talk, which is identifying anomalies using Graphite functions, by Aditya; just sign the disclaimer. Hi, so I wanted to talk about the way we are using Graphite to expose metrics and, basically, find out when things are going wrong, in a way where you don't need a developer to understand what is going wrong. To give you some background, I work at Capillary Technologies. We have integrations with a lot of external vendors for various things, and we also run a lot of services in-house, so we have a few modules which need to integrate with maybe 20, 30, 40 services, externally or internally. At any point of time we would like to figure out whether these services are running fine or not. The first part, as many of the talks have already covered, is collecting the data. So we started doing that: we started using the Codahale metrics library, and Graphite on top of that, to get our metrics. We quickly ran into a scenario where we were exposing upwards of 800 or 900 metrics for a couple of modules just to find out how the system is running. We drilled down, we tried our best, and we got it down to something like 200. Now, if you have something like 50 services running and you want to know the health of each service, plus some data about the module itself, you need a couple of metrics per module, so we were still looking at upwards of 100 metrics that needed to be monitored on a regular basis to find out whether things were working fine. Obviously someone who doesn't really have insight into the system can't make sense of that, so we tried using Graphite functions to make sense of what is happening. I'm not sure if everyone here has used Graphite functions; they're a little cryptic to start off with, but you can basically write regular expressions to get all your series of a specific type to show up on one dashboard. We ended up having dashboards with upwards of 50 graphs, which doesn't make sense to anyone, so then you start applying functions on top of this to start making sense of it. One of the basic things we tried was applying a maximumAbove of an averageAbove function. What would you want out of something like that? You would basically say: I have 50 graphs running over here, out of which, if any of the services I'm interacting with takes, let's say, more than a second to respond, I would like those services to show up on my graph, so that when something goes wrong I just look at the graph, I have one or two or three services showing up, and I know exactly where things are going wrong, as opposed to looking through 50. If you go through a legend of 50 items it is basically incoherent, and if you are like me, basically colourblind, it's very, very difficult to tell the difference between the shades of all those 50 graphs. So you apply something like a maximumAbove of an averageAbove, and you bring that down to two or three or four graphs, which is very, very easy to understand. Now, assuming you want to take this to the next step, you would typically apply statistical means, or a machine learning algorithm, on top of your graphs to find out whether things are working fine. Going by the typical mathematical approach, you could apply some statistics on top of it; one of the ways people do this is applying exponential smoothing functions, and Graphite by default gives you the Holt-Winters functionality. Let's say you apply that on the time series data you are getting: basically, it looks over the data you have over a certain recent window, works out the deviation in that data which is acceptable, you can set how strict you want this to be, and whenever the deviation exceeds those limits, you see a spike. This, in combination with your maximumAbove and maximumBelow graphs, basically gives you pointers to find out exactly when something is going wrong and where it is going wrong. So now we are in a position where we just tell certain people, some particular team, to look at these graphs; if anything shows up on a graph, you raise an alert. And that's how we solved a lot of our monitoring issues.
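A minimal sketch of what those Graphite queries can look like, written from general Graphite knowledge rather than Aditya's dashboards: the metric names, thresholds and Graphite URL are hypothetical. maximumAbove, averageAbove and the holtWinters* functions are standard Graphite render functions, requested here over the render HTTP API from Python.

```python
import requests

GRAPHITE = "http://graphite.example.com"  # hypothetical Graphite host

# Only keep series whose average latency exceeds 1s AND whose max exceeds 1s,
# so a 50-series dashboard collapses to the handful of misbehaving services.
slow_services = "maximumAbove(averageAbove(app.services.*.upstream_latency_seconds, 1), 1)"

# Holt-Winters aberration: how far the series deviates from the forecast
# confidence bands; non-zero values are the spikes worth alerting on.
anomalies = "holtWintersAberration(app.services.payments.upstream_latency_seconds)"

resp = requests.get(
    f"{GRAPHITE}/render",
    params={"target": [slow_services, anomalies], "from": "-1h", "format": "json"},
)
resp.raise_for_status()

for series in resp.json():
    # Each entry has a 'target' name and a list of [value, timestamp] pairs.
    recent = [v for v, _ in series["datapoints"] if v is not None]
    print(series["target"], "latest:", recent[-1] if recent else "no data")
```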
Next is Rahul Menon, speaking on self-driving Kubernetes. Hi, my name is Rahul, I work with Waysapay, and I am just going to take about 3 minutes of your time to tell you what self-driving Kubernetes is: it's basically running Kubernetes in Kubernetes. I see a lot of value in it, mainly because, as you saw in the demo this morning, you can actually scale out your Kubernetes cluster just by executing one single command, and it helps with deployment, upgrading, scaling, a lot of things. I've been trying to work on this for the last 3 months or so, trying to get this thing working and out into production; I've still not succeeded, but I can see the light at the end of the tunnel. Sorry, I lost my train of thought. So, if you have been following the Kubernetes community, there's this project in the Kubernetes incubator called bootkube. What it essentially does is bring up a temporary Kubernetes API server, a scheduler and a controller manager, which then tell your kubelet to spin up your real API server, your controller manager, even your etcd cluster; you can actually host the etcd cluster, which the Kubernetes API server stores things in, on Kubernetes itself. The people behind this have obviously been CoreOS; I've been following up with the project maintainer and trying to get bugs sorted out. And from the demo this morning: if you actually wanted to upgrade your Kubernetes cluster, you could do a live edit like was shown this morning, change the version and apply; it's that simple. It would take down your API server; when upgrading the API server you get a second or two where your API server does not respond, but your cluster stays functional and it just works. So that's pretty much what I had to say; anybody who wants to talk about it, you can find me outside. Yes, I do have a blog post; yes, sure, we can talk afterwards as well. Our next speaker will be Juber, giving "a digital transformation"; is he here? He transformed himself out of the venue, so then it's Dhananjay, with a cloud for robots. Okay guys, this is sort of a continuation of the work we did at ETH back in 2013, which was presented at PyCon, and back then Docker didn't exist, so we set out to build our own cloud platform using Linux containers, and as you may imagine, it was quite a mess. We managed to scale to about 50 nodes and that was sort of the end of it; we went back to our research and our jobs. Then we decided to take another shot at it at the beginning of last year, and the vision we have is quite simple. Robots, as you see in the slide, were supposed to be part and parcel of our lives; we all grew up with the Jetsons, and we all grew up expecting robots to surround us, and that's really not happened. I mean, this is where robots are: they're in factories, or they clean your floor, they cost a few million dollars and kill a few people, and that's not what we want from robots. Robots, the way I see it, are assistants to our day-to-day life, but building a robotics company is really, really hard: you need to put together people from so many varied disciplines to get a simple product out, and that's often really hard to do. So we want to take an approach similar to smartphones. Think about it: 10 years ago you had devices with a bunch of processing power, connected to the internet, and it was all monolithic; you had just a couple of companies like Nokia and Blackberry, and it was really hard to build a mobile phone. But today, someone sitting in Shenzhen can make a mobile phone because they know how to make great hardware, someone sitting in a garage can make apps, and all the complexities are handled by a single platform, and that allowed us to democratize mobile phones; that created the mobile revolution. We actually think there is scope to do this for robots. So in May this year, in fact in 10 days from now, we're launching the first service powered by our cloud, which is basically autonomous drones, delivery drones, that use the power of the cloud to do complex computation, storage and processing. As we progress, we see this vision in a way that allows us to orthogonalize, create and commoditize robotics. If you're someone who knows how to write a great application using JavaScript and knows nothing about robotics applications, algorithms or hardware, you can still contribute. If you're a person who is an expert in building hardware, you can focus on that and provide drones as a service. If you are a person who wants to write crazy algorithms for routing and navigation, well, you could come on board and create routing, navigation and picking algorithms. So the idea is to open all of this up to as many people as possible. And since this is Rootconf, going into the tech stack: we are working on our own fork of OpenShift and Kubernetes, and we've added a bunch of controllers, so each robot is now responsible for its own compute in a bulkheaded design, and what this allows us to do is scale in and scale out. Cloud computing does not just mean providing an API to a bunch of machines; it means consuming compute, storage and network orthogonally, as required.
And I think this is a key enablement required for robots to succeed, and for us to see them everywhere. Yeah, that's all I have to say. So look out on raputa.org; we'll probably open-source components of a lot of these things, and we hope to push back to the community, and if you find this interesting, hit me up and have a chat. Thank you. I've got a minute, so I am going to play a little video. Basically, right now we are full stack, and our full stack extends to designing our own hardware, designing our own chips, designing our own devices, and also writing front-end code, writing all of that in one piece. So, well, that's the vision: imagine the potential of connecting these agile machines and giving them infinite computation and storage. Thank you. I am sure you are pretty... aha, here he comes; you need to just sign this. Also, if Sri Ram is here, maybe you can sign the form ahead of time. Hey guys. So someone, I think Aditya, already talked about using Graphite functions, so this slightly builds on top of that. Essentially my problem was: we had outages, right, everyone has outages, and when we go and look at the post-mortem of the outages, yes, there was a graph, there was an alert, there was everything; everything which was required to tell people that there was going to be an outage was there, but still we would only come to know about it in the post-mortem. Everything was there, but still we missed it. The problem was there was too much noise: okay, this is alerting, that is alerting, how do we get around it? Let's say we want to alert around latency; this is how the typical latency graph looks. When it comes to monitoring, do I really care whether it's one second, one millisecond, 200 milliseconds, 900 milliseconds? No. What I care about is: what was it in the last minute, what was it five minutes ago, and what is it now? As long as it's the same as where it was, let's say, a few minutes ago, I'm all fine; be it one millisecond, be it fifteen hundred milliseconds, I do not care. The other thing is, if you want to put monitoring in place, especially in AWS, you have to put a CloudWatch alert, or whatever monitoring tool you use, on each and every specific resource, and that becomes time consuming. So what we did was import all of the data into Prometheus; Prometheus is the time series database I think we spoke about earlier today. Out here, these are, let's say, two specific pieces we started off with: for my two regions I have the unhealthy-host graph, and out here it's all zero, this is all zero; this was a deployment that was happening a few minutes back. This essentially filters out the noise: the moment there's anything non-zero, I know I should get alerted and I should act upon it. This is what I was talking about specifically in terms of latency: we just saw how the typical latency graph looks, and this is my graph specifically used as the alerting basis. Here, if you look at it, everything is between zero and one, and I have set my threshold to three. If anything goes above three, that's the only time I care: okay, there's an issue and it should alert. Other than that, as long as it's all fine, yeah, it's going up and down, basically the delta is going up and down: last second it was 200 milliseconds, right now it's 190, next second it's 320; I don't really care about that, as long as it's within the threshold. So this is what we do. Essentially, this is a delta function. If we take a quick look at it: drop the common labels, which removes the rest of the stuff in Prometheus that I don't care about; a delta function over all of my ELB latency metrics, and here I have removed the other ELBs that I don't care about, and I'm doing it over five minutes. So this is one way of doing it.
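A minimal sketch of that kind of delta-based check, written from general Prometheus knowledge rather than the speaker's actual rules: the metric name, label filter and threshold are hypothetical, and the query is sent to Prometheus's standard HTTP query API from Python.

```python
import requests

PROMETHEUS = "http://prometheus.example.com:9090"  # hypothetical Prometheus server

# Alert on *change* in latency rather than its absolute value:
# delta() over a 5-minute window, keeping only the ELBs we care about,
# and flagging anything whose swing exceeds the threshold of 3.
query = 'delta(aws_elb_latency_average{load_balancer_name=~"prod-.*"}[5m]) > 3'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
resp.raise_for_status()

result = resp.json()["data"]["result"]
if result:
    for series in result:
        # series["metric"] holds the labels, series["value"] is [timestamp, value].
        print("latency swing on", series["metric"], "=", series["value"][1])
else:
    print("all deltas within threshold, nothing to alert on")
```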
The way it started off: this is built on top of Prometheus, so let's say this is the same graph in Prometheus. Right now the alerts sit on top of Grafana; Grafana 4-point-something, the latest Grafana, basically lets you put alerts there, which wasn't available when I started off, so you can put alerts either in Prometheus or in Grafana, and now we do it in Grafana. And let's talk about noise: these are all the Grafana alerts I now get from this system, and if you look at the count, let's say May 3rd, May 4th, the number of alerts every day is in single digits, except when there's a specific outage; let's say this was the day I had 12 alerts, and if you talk to any monitoring person, that is pretty, pretty, pretty low; otherwise you are drowning in alerts. We are planning to expand this to more things. Right now these are the four basic alerts that cover most of my things at a high level; for everything else we still have the rest of the monitoring in place. This is just to ensure that my downtime is minimal. In fact, with the help of this, especially the latency alerts, we are now able to avoid downtime and outages, because the moment any of them starts spiking, I know immediately something is going wrong, and we can act on it even before the outage starts. Since we put this in place about two weeks back, we actually caught one particular outage in progress, during a deployment, and we were able to avoid it. So what we are trending towards is detecting these anomalies and preventing the outages, rather than fixing the outages in the post-mortem. So yeah, that's what it is, thank you. Thank you very much. Our next talk is by Sreegam, and it's entitled Restful Email. My colleagues actually just put my name on the board, so I don't have much choice but to speak about it. My name is Sreegam, I work for Endurance, and I don't have a flash talk as such, but you can go ahead and sign up on our platform, Bluehost, and we're launching a new product called restful.email. Basically we give developers the power to send emails via API calls, track and get their quota, and also find out whether the recipient has successfully opened the mail, and click-through rates and such. This is launching pretty soon; go ahead and explore the tool, and I encourage all of you to check out DevCloud and use it. I think DigitalOcean doesn't have this integration, so that's one of our selling points. So just go ahead and check it out. Anything else you guys want to know? Any questions? Because I think there's just like 50 seconds. That's it, okay, go check it out, yeah, thanks. Alright, the next talk is nursery rhymes as applied to DevOps, by Shakti. Yeah, so the motivation is to use nursery rhymes that everyone knows to share DevOps experiences and best practices. I've tried to put this together; I hope you'll enjoy it. Please read along, agreed? Let's begin. Jack and Jill went up to the server, to run the test with Docker; Jack pushed code and broke his test, and Jill never spoke to him thereafter. Tester, tester, have you any bugs?
Yes sir, yes sir, three core dumps: one for my master Scrum, one for my lead, one for my manager who's a friend indeed. One little, two little, three little containers, four little, five little, six little containers, seven little, eight little, nine little containers; oh, BSD's had jails forever. Humpty Dumpty sat down to debug, Humpty Dumpty squashed a bug; all the team members and stakeholders gave Humpty a big tight hug. Goosey goosey DevOps, sir, where shall I wander? GitHub or Bitbucket, I'd better decide sooner; clone a project folder, code to make it better; collaboration, working closer, it shall be an eye-opener. Twinkle twinkle unit test, how I wonder where you exist; I will write unit tests until the project is laid to rest. Code, code, code your way, gently down the screen; commit early, commit often, and life is but a dream. See my little hands go hack hack hack, and my little tests run back to back; I just have one word to say to you: come learn DevOps, and I'm happy for you. One, two, pick your crew; three, four, shut the door; five, six, write your scripts; seven, eight, test them straight; nine, ten, make them your zen; eleven, twelve, time to sell; thirteen, fourteen, customers are keen; fifteen, sixteen, customers are seen; seventeen, eighteen, defeat your routine; nineteen, twenty, get paid a-plenty. Project issues in the way, in the way, in the way, project issues in the way, my fair user; fixing bugs right away, right away, right away, fixing bugs right away, my fair user; merging PRs as I say, as I say, as I say, merging PRs as I say, my fair user; all the tests are passing, hey, passing, hey, passing, hey, all the tests are passing, hey, my fair user; as a client you should pay, you should pay, as a client you should pay, my fair user; DevOps really saves the day, saves the day, saves the day, DevOps really saves the day, my fair user. Last but not the least: when I say clap your hands, give me two claps. If you're happy and you know it, clap your hands; if you're happy and you know it, clap your hands; if you're happy and you know it, then your face will surely show it; if you're happy and you know it, clap your hands. If you apply software patches, clap your hands; if you apply security updates, clap your hands; if you apply software patches, and you apply security updates, and your application still survives, clap your hands. If your package install works, clap your hands; if your gem install works, clap your hands; if your package install works, and your gem install works, and your npm install also works, clap your hands. If your CI tool is running, clap your hands; if your code is compiling, clap your hands; if your CI tool is running, and your build is always passing, and you're happy and smiling, clap your hands. If you have a plan A, clap your hands; if you have a plan B, clap your hands; if you have a plan A, and if you have a plan B, and you always use plan C, clap your hands. If you notice the containers crash, clap your hands; if you build the containers from cache, clap your hands; if you fix them in a flash, and you really put them in a bash, and your manager didn't give a trash, clap your hands. Okay, the next talk we have is SSH key management with Python and Jenkins, by Mehul. I want to show a small little script that I've written. One of the problem points I usually face is that, even though we are a small team, managing the SSH keys was becoming a problem: the number of servers in our inventory kept increasing, at times people would change their SSH key and we didn't really have it available all the time, and sharing of the SSH keys was becoming a problem. I tried to look for quite a few scripts around which would allow me to update the SSH keys and manage them, and I didn't find anything useful. So we had two requirements. One is that the user should be able to upload the SSH key by themselves. And second, we work with Rackspace Cloud and Google Cloud, and Google Cloud especially has provisions in its API so that you can push the SSH key through the API, which makes some things very simple. So I just sat down one evening and wrote a simple Python script using a couple of packages that are already available in Python. One is called sshpubkeys: you give your SSH key to this library and it parses the key and separates it out in a way that you can use. So what I was doing is take the SSH key from the user and first take the username part, and since I tied it with Jenkins, I was able to validate whether the person is uploading an SSH key for their own name. What we had was that the username on the server, the username in Jenkins and the email address for the user would always be the same, so whatever SSH key you uploaded could only be yours; that validation was done whenever an SSH key was uploaded. Then I used a library called ssh-authorizer: basically it takes the list of hosts that you have and pushes the key that you pass to it to the given hosts. At first I thought, I have the script, how do I give it out to the users? I thought about building a small web API for that, but then I realized we already have Jenkins, and I didn't want to build an authorization layer, which would be a lot of work. We already have Jenkins, and Jenkins can handle this: all the developers in our company have access to Jenkins. So I wrote a Jenkins task where the user can pass the SSH key they have; it gets parsed by the script, and if everything validates correctly, it gets pushed to the servers. The next step I'm looking to do is to pass it on to Google Cloud using their APIs.
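A minimal sketch of that validation step, not Mehul's actual script: it assumes the sshpubkeys package he mentions and a hypothetical Jenkins-supplied username, and leaves the actual push to the hosts (which he does with the ssh-authorizer library) out of scope.

```python
from sshpubkeys import SSHKey  # pip install sshpubkeys

def validate_uploaded_key(key_text: str, jenkins_user: str) -> str:
    """Check that an uploaded public key is well formed and that its
    comment (user@host / email) matches the Jenkins user uploading it."""
    key = SSHKey(key_text, strict_mode=True)
    key.parse()  # raises one of the library's exceptions if the key is malformed

    # By convention here, the third field of an OpenSSH public key line is the
    # comment, e.g. "alice@laptop"; we require it to start with the Jenkins username.
    parts = key_text.strip().split(None, 2)
    comment = parts[2] if len(parts) == 3 else ""
    if comment.split("@")[0] != jenkins_user:
        raise ValueError(f"key comment {comment!r} does not match user {jenkins_user!r}")

    return key_text  # safe to hand over to whatever pushes it to the servers

# Example: in the Jenkins job, the username and submitted key would be job parameters.
# validate_uploaded_key("ssh-ed25519 AAAAC3... alice@laptop", "alice")
```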
The good part about it is that while the machine is running, you can replace the key even if you don't have access to the machine; you just need access to the API. So yeah, that would be it. Sorry, it would make a little more sense if I had shown you what I'm doing. Thank you. Thank you everyone for staying until the end of all the flash talks; that's real dedication on your part, very good. So I'll just remind you about the party tonight. The first bus will be leaving, I think it's a bus, they said transportation, in about eight minutes, starting from the registration desk. I'm not sure how many people they can take at one time, but the transports will be leaving every 20 minutes thereafter. And I think with this we close the first day of Root Conf. Thank you everyone, and we'll see you tomorrow.