Good afternoon, folks. My name is Shailesh and I work as a principal test engineer at Blue Jeans Network, which is a cloud-based enterprise video communications company. Despite the name, we don't sell jeans; it's just video. I'd like to share my experience, and my team's experience, of building an automated resiliency testing tool which we named Goblin.

Last night I was having a talk with my dad about the conference, and he was flipping through the pages of the speaker bios. As he came to my page, he read it for a minute and then he went, "So I'm driving my car and it runs out of petrol, but then the reserve kicks in. Looks like you're working on the reserve-kicking-in part." I was like, yeah, that's what I'm working on. I really liked how he simplified the problem there.

A bit about Blue Jeans: to date we've done about 1.4 billion minutes of video conferencing. What that means is that for the past five-plus years, for every minute of the day, we've done 500 minutes of video across the globe.
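That back-of-the-envelope figure checks out: 1.4 billion conference minutes spread over five years of wall-clock time comes to roughly 500 video minutes per elapsed minute. A quick sanity check in Ruby (the language Goblin itself is written in, per later in the talk):

```ruby
# Sanity check on the figure above: 1.4 billion conference minutes over
# roughly five years of wall-clock time.
total_video_minutes = 1_400_000_000
elapsed_minutes     = 5 * 365 * 24 * 60   # minutes in five years

rate = total_video_minutes.to_f / elapsed_minutes
puts rate.round   # ~533, i.e. on the order of 500 minutes of video per minute
```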
So quite naturally, uptime is our number one priority. Our users access the Blue Jeans service on their mobile phones, iPads, tablets, desktops, and existing room systems, and we want to make sure that every time they have a video conference, they have the best experience. So resiliency of our system is really important to us.

When we set out to measure the resiliency of our system, we wanted to do two things. We use a bunch of third-party components in our system, like ZooKeeper, Cassandra, etc., and we wanted to get a deep understanding of what happens to our system when any of these components fail. The second thing was that we wanted to measure the resilience of our own in-house software components. We went out looking for tools and came across things like Chaos Monkey, which is very impressive, but what we realized is that there was no real solution for the set of challenges and use cases we were trying to solve.

What were those challenges? As we decided to look at each part of our system and try to measure its resiliency, we put together a set of test cases and failure scenarios for the interactions between all the components. As you can imagine, as the system grew and became more complex, the sheer number of these manual test cases became humongous, and manual testing would no longer cut it for us.
So we needed some automated solution. The second point is that we wanted to introduce controlled failures. That was important for us at the time because our system wasn't really mature from a resiliency perspective, so if you went and created random chaos, it would have been very difficult to debug those failures. What we wanted is what I call orderly chaos, where you know what you're failing and how, and when a side effect happens due to that failure, you can go debug it, fix it, get your system to a good state, and then maybe do random failures.

We also wanted something fast: a tool that would generate some sort of load on the system (in our case, start a few meetings with a few simulated participants), run some failure scenarios, validate the results, record them, and do all of this within minutes.

The other thing that we do, since video conferencing is all about the audio and visual experience for the end user, is get a bunch of people together and run some tests with them, live. What our team would do is, at 3 p.m.
in the afternoon, just like right now when people are dozing off, we would go and ask these people to join a test, then go fail a component, record the results, and so on. But soon our team started becoming the most hated team, because we were disturbing people in the afternoon. So we thought we'd pass on the dirty work to a chat bot of sorts: the chat bot would bring these people together, they would join a meeting, and we would run our tests. We wanted this integrated into a tool. So that was the other requirement: we wanted to make the testing live and a bit more fun.

That is how Goblin was born. Essentially, what Goblin does is take a bunch of components, Linux processes: these could be third-party processes, standard Linux processes that you use, or anything that you build on your own. So you have a bunch of components like ZooKeeper, RabbitMQ, Couchbase, MySQL, Cassandra; Goblin comes with support for these, and you can extend it to any component. Given a list of components, you can apply a list of failures to any of them: you can kill processes, start or restart services, exhaust disk, memory, CPU, or connections on the node, introduce latency between any two nodes, or even cause packet loss. When you have n-by-n combinations, you get a pretty long list of effective tests that you can run against the resilience of your system.

This is how Goblin looks in action. You have Goblin on your local machine, with a configuration file that describes the system you have, and then, assuming you've got tests written, it goes and kills stuff in the network lab, records your results, and publishes them.

There are two essential modes in which Goblin runs. The first is the nightly run mode. So what's the nightly run mode? Think of it as an overnight regression suite that's running.
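The n-by-n combinations mentioned above are just the cross product of components and failure actions. A minimal sketch of how that matrix blows up (the component and failure names here are illustrative, not Goblin's actual identifiers):

```ruby
# Enumerate every (component, failure) pair, as the talk describes.
# Names are illustrative, not Goblin's real API.
COMPONENTS = %w[zookeeper rabbitmq couchbase mysql cassandra]
FAILURES   = %w[kill_process restart_service exhaust_disk exhaust_memory
                exhaust_cpu introduce_latency cause_packet_loss]

test_matrix = COMPONENTS.product(FAILURES).map do |component, failure|
  "#{failure} on #{component}"
end

puts test_matrix.size    # 5 components x 7 failures = 35 scenarios
puts test_matrix.first   # "kill_process on zookeeper"
```

Even with a handful of components the list grows quickly, which is why the manual test cases became, as the talk puts it, humongous.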
It's just running resiliency tests for you. Goblin gives you the ability to prepare the environment the way you would like, and then it goes and causes some failure; for example, I want to go stop my ZooKeeper service. It then allows you to hook in your own methods so that you can validate the result. In our case the validation would be: did any participants drop from the meeting, did any participants lose video, etc. Then it recovers the system from the failure and reports the results in JUnit style. The picture here shows a Jenkins job that has finished, with reports on the failed tests and the passed tests.

The other mode is the automated test bash. We are heavy users of HipChat, which is a well-known chat application like Slack, and we integrated Goblin to be a HipChat bot as well. It goes and pings a few participants to join a meeting, and it communicates the test steps we are going to run so that everyone is on the same page and knows the expectations of the test. Then it runs the test, kills something, and asks for feedback from the users in the same HipChat room, so you have a documented result. Then it recovers the system and goes and runs other tests.

Chaos Monkey lovers, don't kill me for putting this slide up, but I really wanted to focus on the different use cases that these two tools solve. To be frank, Chaos Monkey was the inspiration we had when we started with Goblin, and we learned a lot from it. Chaos Monkey is designed to run in AWS and to generate random chaos, because Netflix believed that it's okay for things to fail and you want to build a system that can auto-recover. It runs on non-holiday work days so that engineers are around to fix the problem. It's not a test framework; it's a collection of really good tools. Goblin, on the other hand, was designed to be a test framework from the get-go, and since we don't run on AWS, we have our own data center.
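The nightly-run lifecycle described above (prepare, cause a failure, validate, recover, report) can be sketched as a small driver. Everything here, class and method names included, is an assumption for illustration rather than Goblin's actual code:

```ruby
# Sketch of the nightly-run lifecycle: prepare -> fail -> validate ->
# recover -> report. Class and method names are illustrative only.
class NightlyRunner
  attr_reader :results

  def initialize(tests)
    @tests   = tests
    @results = []
  end

  def execute_all
    @tests.each do |test|
      test.prepare                 # set the environment up as you like
      begin
        test.simulate_failure      # e.g. stop the ZooKeeper service
        @results << { name: test.name, passed: test.validate }
      ensure
        test.recover               # always restore the system afterwards
      end
    end
    @results                       # feed these into a JUnit-style report
  end
end

# A trivial stand-in so the runner can be exercised without a real cluster.
StubTest = Struct.new(:name) do
  def prepare; end
  def simulate_failure; end
  def validate; true; end
  def recover; end
end

runner = NightlyRunner.new([StubTest.new("zookeeper_stop")])
puts runner.execute_all.inspect   # one {name, passed} entry per test
```

The `ensure` block mirrors the point made in the talk: recovery always runs, so a failed validation never leaves the environment broken for the next test.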
We wanted the ability for it to run in any environment. We wanted to use it for regression testing: I want to know how my resilience has changed build over build, or release over release. And the live testing is also something that I think a lot of you might like. So it was designed to be a test framework.

Let me do a sneak peek of some of the Ruby code that Goblin is written in. Here's a Cassandra library written for Goblin, and you can see there's a bunch of methods that allow you to stop a Cassandra service or restart the Cassandra cluster itself. Typically you would import this library into your own test class and then call these methods. A typical test class would look like this, where you have to override a bunch of mandatory methods. For example, take simulate failure: here I want to get all my ZooKeeper nodes and then run an Exhibitor process restart on all of those nodes. I can also hook in my validate method and a recover method, as I see suitable.

This is where you can get Goblin: it's up on GitHub at github.com/bluejeans/goblin. We are definitely looking for contributors, so please try it out and give us feedback on what you thought about it. Also, please check us out at bluejeans.com. On the home page there is a Try button, which leads you to a page where, when you click Start Now, you get into an instant meeting; you can share that URL with a friend or a colleague and have a video conference right there. I hope you enjoy that experience. Thanks for having me here. You might be dying to get out, but if there are any questions, I'd love to answer them. Thank you.

Audience: Is Goblin configurable for any environment, or is it basically customized for your environment?
Shailesh: We've built it so that it can be applicable to any environment. All you need is a configuration file with a bunch of nodes. If you have, say, a ZooKeeper cluster, you put those IPs in there and mark them as a zookeeper category, and you can have several such categories. As long as the test machine where Goblin is running can access the cluster you want to run these tests on, Goblin should just work.

Audience: My other question was about running something like introducing latency. If a person then needs to correct that, do we have to manually go and resolve the issue?

Shailesh: You're asking how we undo the latency? That's built in as well. There's an apply latency and a remove latency method that you can use; it basically uses iptables under the hood. Any other questions? All right, thank you, guys.
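Tying the Q&A answer back to the test class shown on the slide: the configuration maps category names to node IPs, and a test pulls its nodes from a category before overriding the mandatory hooks. Everything below (class names, key names, and the recorded-rather-than-executed node actions) is an assumption sketched from the talk, not Goblin's verified API:

```ruby
# Hypothetical config: node IPs grouped under named categories, as
# described in the Q&A above. Key names are assumptions.
CONFIG = {
  "zookeeper" => %w[10.10.1.11 10.10.1.12 10.10.1.13],
  "cassandra" => %w[10.10.2.21 10.10.2.22],
}

# The ZooKeeper example from the slide, with the mandatory hooks:
# simulate_failure, validate, recover.
class ZookeeperExhibitorRestart
  attr_reader :actions

  def initialize(config)
    @nodes   = config.fetch("zookeeper")
    @actions = []     # recorded instead of executed, for this sketch
  end

  def simulate_failure
    # Restart the Exhibitor process on every ZooKeeper node.
    @nodes.each { |node| @actions << "restart exhibitor on #{node}" }
  end

  def validate
    # Real check: did any participant drop, or lose video? Stubbed here.
    true
  end

  def recover
    @nodes.each { |node| @actions << "ensure zookeeper up on #{node}" }
  end
end

test = ZookeeperExhibitorRestart.new(CONFIG)
test.simulate_failure
test.recover
puts test.actions.length   # 6: one restart plus one recovery per node
```

Because the test only knows its category name, pointing Goblin at a different environment should just mean swapping the IP lists in the configuration file, which is the portability point made in the answer above.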