Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online with all of our old shows at www.rce-cast.com. I also again have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks again. Sure, Brock. Always good to be here. Like I say almost every episode, this is where we get to learn fun new stuff, and today we're actually talking to a partner of Cisco. So this is good stuff. I also feel obligated by my corporate overlords to mention my blog out there at blog.cisco.com, and I think it's linked from rce-cast.com as well. Yes, and so our guests today are going to be talking about Zenoss, a monitoring system, but we have two guests. We have Simon Yakish, who's a principal engineer on the project, and we also have Randall Reinheimer, who is at Los Alamos and has adapted Zenoss for use in the HPC environment. So guys, welcome to the show, and take a moment to introduce yourselves. Hi, this is Randall Reinheimer. I'm the deputy group leader of the HPC support group here at Los Alamos. We have about 60 people and two and a half petaflops or something like that of HPC on our floor, and a couple of years ago we took Zenoss as a great starting point for a monitoring solution that we needed and adapted the standard software to do HPC. So that's where I fit in. Yeah, and hello everybody. This is Simon. I work for Zenoss and I've been with the company since the very beginning. I'm basically responsible for assisting large customers, tricky installations, features, basically everything that needs somebody to get stuff done. You can think of me as a Swiss Army knife, basically. Okay, well, why don't one of you explain exactly what Zenoss is? I think it kind of came out a little bit already, but what's the overview and what does it aim to do? Yeah, so I'll take that question. So Zenoss is, basically, what we call a cloud monitoring solution.
It's a distributed monitoring solution that allows you to dramatically scale and monitor all aspects of your infrastructure or data center or cloud, whichever term you prefer. So what's a little bit of the background on you? I've seen it pop up in a number of places. It seems like you guys have been growing. Where did it come from? It was originally developed in some of the first software-as-a-service type environments, probably around back in 2003. It was started by a gentleman called Erik Dahl. He was basically using a traditional set of software tools to monitor and manage their infrastructure back then, and he just grew sick of the big, unwieldy, and very expensive systems that they had. So he set out and basically wrote Zenoss. We have since turned into an open source project that has spread around the globe. We have installations on all seven continents, we're in somewhere like 120 different countries, and we have installations right around the 12,000 to 15,000 mark that literally call home every day, basically. So, yeah. So what kind of things does Zenoss monitor? So hardware, software, infrastructure, what can you watch? Basically, the idea of Zenoss is to provide you a unified management solution. So it really has the ability to watch essentially everything that you can imagine. We watch everything from the very hardware details of your blades and uninterruptible power supplies to the transactions that your Apache web server, your queuing server, or your middleware server are performing. So really the entire stack. The idea is that we provide you a holistic view that allows you to look at everything anywhere inside your data center. Now, it is my understanding too that you're also a partner with Cisco.
I haven't been involved personally in your work, but I see that you can also directly monitor Cisco's line of servers, the whole UCS line. So could you tell us a little bit about that? Yeah, sure. So about one and a half years ago, Cisco was on the verge of really starting to push Cisco UCS. We already had support for VMware, and we found UCS to be doing essentially the exact same thing for hardware that VMware did for the OS, really abstracting away the hardware. And since we had support for VMware, the logical conclusion was to take the next step and add support for the perfect hardware counterpart, the perfect VMware-plus-hardware combination, essentially. And because Cisco was smart enough to add a very good API from the very beginning, we tied into that API and we can now give you an entire view of your UCS stack, meaning we discover all your blades, all your chassis, all your service profiles, and then we can actually tell you which VMware host is on which hardware host, and which operating system is running inside your host. So we can really give you a holistic view, business to blade, essentially, from the very chassis up to which application is actually depending on that chassis. So what does the architecture of Zenoss look like? If I get it and I set it up on a box, where do I go from there and what do I see? So setup is really easy. Obviously there are a lot of open source tools available in the market in that space, and what we've really found is that a lot of them are unwieldy and difficult to deal with. What we try to do is make the entry as easy as possible. I always say I want my grandma to be able to use the software. And really, we make it very easy. You can download a VMware image, you can download an ISO image, you can download RPMs. You can essentially fire up the VMware image on your desktop.
It starts up the application and provides a web interface. When the web interface is provided to you, you have the ability to add devices. It simply asks you for the type of device, be it a Linux server, be it a UCS chassis or manager, be it a VMware vCenter instance, be it a network device. And then it asks you for credentials and the protocol to use to actually discover and monitor the device. You can get a system going within a few minutes, essentially. And from that point you can just grow. There are obviously methods of auto-discovering your infrastructure, methods of automatically loading spreadsheets of your infrastructure into the system. And once you get the system started, we really want everything to work automatically. So after you give us the details of your devices and enter their specifics, we automatically discover and, what we call, model your environment and create a kind of view. So we can actually see the interfaces and file systems, and we can process the events coming from the various logs and so on and so forth. So this is maybe where HPC and LANL come in, because our architecture, as we use Zenoss, is different from Zenoss as designed. We stumbled across Zenoss a couple of years ago, in about 2007, looking at some of those unwieldy and expensive solutions that Simon referred to, trying to find a basis on which we could build an HPC monitoring system. And the things that we liked about Zenoss were the community model, the initial architecture itself, and the model of the system that sits sort of in the middle of Zenoss. But we also recognized that there were differences between a data center, say a web-farm kind of place, and HPC. And so one of the architectural modifications that we made is we made a hierarchical system. So where Zenoss as designed and as deployed is sort of a single flat thing.
Zenoss HPC (Simon can talk about this, but it's our contribution to the Zenoss open source) actually has a hierarchical structure where you can pass information up and down a set of Zenoss instances. Yeah, so Randall actually points out a nice fact here: while the system is easy to use initially and covers what 80% of the population wants to do, we are really focused on providing you, if you want it, a complete framework to manipulate and change the behavior of the system at a very fundamental level, and Los Alamos, Randall and his team, have taken great advantage of that. So Randall, I wonder if you could give us a few other details. A lot of the listeners here are going to be quite familiar with HPC types of setups. Did you change Zenoss to, say, you know, watch the resource manager and look at utilization and memory, the kinds of things HPC cares about? Well, our major interest was not necessarily in trying to cover all of the individual elements in an HPC environment. That's almost the easy part, you know; a node goes down, who cares, right? But the interactions and combined failures between elements, across nodes in a cluster, through the interconnect, out to the external network, tied together with resource manager operation, all that kind of stuff as a whole picture was what we were after. Not so much individual failures, but what do things look like if you could look at them all together, which is where we started with Zenoss. So it sort of went out and looked at everything, and the differences again: we're very tightly coupled, so combined failures are a big deal, and we're also very hierarchical, so a hierarchical architecture suits HPC a little bit better.
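The hierarchical roll-up Randall describes, with detailed instances at the bottom and traffic-light summaries passed upward, can be sketched roughly like this. All names and the data structure here are illustrative assumptions, not the actual Zenoss HPC code:

```python
# Sketch of hierarchical status roll-up: lower levels hold full
# detail; each level passes one summarized "traffic light" color
# upward to the next. Hypothetical names only.

SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def summarize(child_statuses):
    """Roll a list of child statuses up to one traffic-light color."""
    if not child_statuses:
        return "green"
    return max(child_statuses, key=lambda s: SEVERITY[s])

def roll_up(tree):
    """Recursively summarize a hierarchy of {name: subtree-or-status}."""
    if isinstance(tree, str):  # leaf: already a status string
        return tree
    return summarize([roll_up(child) for child in tree.values()])

# One site with two clusters, as the ops-floor level might see it:
site = {
    "cluster-a": {"node-001": "green", "node-002": "red"},
    "cluster-b": {"node-101": "green", "node-102": "yellow"},
}
print(roll_up(site))  # whole-site status: worst child wins
```

The operations staff looks only at the top-level color; diving down a level recovers the per-cluster and per-node detail, which matches the "relevant pieces at the relevant level" idea.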
So when you say that you have a hierarchical thing: if you have a core switch out, you're not going to get a notification for every single leaf switch and every single node that sits behind it. You just know that core switch is out, and the system knows the topology. Well, that's true. Zenoss actually takes care of that more by itself. What I'm talking about is that you have a set of Zenoss instances at sort of the lower layer, where you can really dive into the details of errors that happen on nodes on a particular cluster and the interconnect on a particular cluster. Then you can pass up some of that information and interject information about, say, the DRM server and the job journey across that particular cluster. And then you can pass it up yet another level and have an operations staff just looking at one big set of green, yellow, red, you know, kind of traffic-light boxes, and be able to move up and down that hierarchy, because, well, different information is relevant at different levels. The interconnect information is a separate set of information from the networking that connects the clusters, and you want to be able to not have to look at everything in one place, but just look at the relevant pieces together, passed up and down. So at the site I work at, we use TORQUE as our resource manager, and we use TORQUE's health check script to set nodes offline when error counters on a network interface get above a certain amount, a machine has been OOMed and a process has been killed, certain things like that, to kind of notify us of issues. Why go with a completely separate infrastructure with Zenoss rather than using some built-in functionality in your resource manager? Well, because that only works for the resource manager. We were looking for an overarching tool. We actually had a very interesting session.
We had a birds-of-a-feather session at Supercomputing '10 down in New Orleans that was much better attended than we expected. And one of the interesting discussions there was: should the configuration management define the system, or should the monitoring system enforce a configuration? Right now we have those things separate; we use CFEngine for one, and we use Zenoss for the other. I think it's a really interesting question whether you could bring that concept full circle, to either have configuration management that does monitoring or monitoring that does configuration management, sort of all together. I think right now there are things that attempt to do that, but not in a fully featured kind of way. And again, just to answer the original question: as I've talked to a lot of people in HPC, they have a lot of good tools, but they're good individual tools. They have tools that work at their specific site. Almost everybody has a script like you're talking about that looks at node health and offlines nodes. But there's no cohesiveness to that. And one of our hopes, at least, is that if we could form a community around HPC monitoring, maybe we could get some best practices going in those kinds of areas. Also, I think one thing that you might want to consider here is that while TORQUE is great to actually run your jobs, it's really not necessarily designed to also do your monitoring. So there are concepts that probably completely pass you by with TORQUE. For example, Zenoss is a great system to capture all your events, be it syslogs, be it SNMP traps, across your infrastructure, and then overlay that with what the impact on your various subsystems is, that kind of thing. That's something very difficult to achieve with a resource manager or a job scheduler. So this is a perfect tie-in here.
Can you give us the laundry list of inputs that you can accept, or systems that you can monitor, including software systems? We already talked about Cisco UCS, but what other hardware platforms do you support as well? I think you probably don't have the time in this podcast to actually cover all of the devices and hardware and protocols that Zenoss as a whole covers. The system is very extensible. We have plugins that we call ZenPacks that are extremely powerful. They can do anything from adding new processes that perform a totally new type of collection, that bring in Cisco MDS events or accept events from your security infrastructure, down to small scripts that essentially run a check against some sort of device and basically return a number. So we really have about 150 to 200 ZenPacks that are either internal to us or contributed through the community. Just a high-level view: we support SNMP, we support SSH, we support arbitrary commands, we support the reception of UCS events, and we receive VMware events, SNMP traps, syslogs, and Windows event logs. We support Windows performance collection, RPC, SOAP, REST. All of those types of data sources are essentially things that we can tie into and fetch information from. So these plugins that you mentioned, what language does one have to write those in? It really depends on what level of integration you're looking for. A very easy plugin, which is essentially nothing but a command and a predefined, what we call, template, is something that you can click and assemble in our web UI and then export as a ZenPack. Then you basically have a small command, written in anything, be it Java, be it Perl, be it Python, be it Bash, that performs something on the host, against the device or locally, and brings in that data. So that's kind of a very simplistic plugin.
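A "simplistic plugin" of the kind Simon describes, a small command that runs a check and basically returns a number, might look something like this sketch. It is hypothetical, not a packaged ZenPack; it reports the 1-minute load average in a Nagios-style name=value line with an exit status:

```python
# Sketch of a command-style check: probe one thing, print one
# metric, signal ok/warn via exit status. Illustrative only;
# a real ZenPack wraps such a command in a monitoring template.
import os

def collect_load():
    """Return the 1-minute load average as the check's single metric."""
    one_min, _, _ = os.getloadavg()  # Unix-only
    return one_min

def format_check(value, warn=8.0):
    """Render the output line and an exit status (0 = ok, 1 = warn)."""
    status = 0 if value < warn else 1
    return "load1=%.2f" % value, status

line, status = format_check(collect_load())
print(line)
```

The monitoring system parses the printed value into its time-series database and thresholds on it, which is why "return a number" is really all a plugin like this has to do.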
More complicated plugins are generally written in Python. They integrate with our application server more closely and bring in model data, meaning you actually go out, you query something, you discover the structure and the relationships between things, and then you bring it in. So for example, for UCS, where you go out, you discover the blades, you discover the chassis, you discover the service profiles, you actually write something we call a modeling plugin that goes out and brings the relational model data into our database and into our web view, and those are written in Python. So how many hosts or devices or switches is Zenoss designed to support? Well, as a cloud solution, that obviously needs to be a very flexible number. We have customers and community users that use Zenoss to monitor two servers in the basement. And obviously we are also deployed in some of the largest cloud providers and service providers and data centers in the world. For example, Rackspace is a customer of ours, and all of their global network operations use Zenoss to monitor all aspects of their network; that's about 30,000 network devices. Accenture is another customer of ours, where they monitor all internal IT infrastructure. We obviously have various projects going on at Cisco. And then obviously Los Alamos also has a large number of nodes that in one way or another we tie in with. So really, the system is designed, and it's a very risky thing to say, but essentially to scale infinitely, because it has the ability to be distributed, right? The distributed architecture allows us to scale horizontally. As your infrastructure grows, you keep adding more and more hosts to our system, allowing you to scale the system larger. So a lot of monitoring systems literally do just that. They just monitor and notify you by page or email.
Can you actually have Zenoss take an action? Can you have almost like a flow chart built in: do this if you see this, do that if you see that? Yep, as you point out, we actually have an event management system built in. So things such as Netcool or HP OpenView, those types of things, have been replaced very effectively by our customers. As such, we not only watch, we also notify you. Almost anything inside the system triggers an event, and we can act upon those events as well. So we could try to restart your Windows service, we can bounce a process, or something along those lines. So we have something that can perform actions based on the events that are coming in inside the system. We actually take advantage of that here at Los Alamos. We have, you know, several thousand nodes and probably a few dozen network devices, plus interconnects, that we're monitoring on our classified networks. And one of the things that we do is some automated event handling: off-lining nodes, on-lining nodes, that sort of thing. And we've also taken a little bit of the possibility of error out of commonly repeated tasks by our operations staff. They used to have a process in a notebook: you know, do this, SSH to this node, all of these things. And now with Zenoss, it's push this button and away you go. It's all scripted. So we took a lot of the uncertainty out of how those processes got executed by using the software. So you've mentioned a couple of times in the conversation so far that Zenoss is open source. How does your community work? Who's involved, how does the contribution model work, and who makes the decisions, things like that? Yes. So Zenoss basically, you know, we have a series of users; I think the community count stands anywhere between 20,000 and 30,000, something like that.
With 10,000 or 15,000 deployments, you know, the numbers might be off, but deployments that call in daily. And obviously users are involved to varying degrees. We have a core set of users, numbering maybe in the dozens, that contribute really effectively to the project. They contribute ZenPacks. We have documentation out there on how to develop ZenPacks and contribute them to us. We have a community manager that takes the input and grooms the ZenPacks, makes sure that they're okay, publishes the screenshots and instructions, and makes sure that they are kept up to date as we go along. The big decisions on the infrastructure and the core of the project, so to say, are usually made by us. We obviously take community input on best practices and how the product is used, but the piping is generally handled by us, essentially. The Zenoss HPC work, for instance, is separate. I don't expect that it's going to make it much into the core, because what's right for HPC isn't necessarily right for the bulk of the customers, and yet we're able to leverage the core software. So we have sort of our own community area off to the side. The vast majority of the software is the core software, and then you just have, you know, a set of ZenPacks. And it's probably worth mentioning what else we added. I mentioned the hierarchy. We also have a history of the model. There's a model of what you're monitoring in Zenoss that, at least the last I knew, just changes as your infrastructure changes. For us, we need to maybe go back and see what it looked like last week, when everything went to pieces. So we have that history stored. We also have a roll-up of events, so that if you have a whole bunch of nodes go down at once, and you can tie that back to one single event, we have a mechanism to tie all those child events to a parent event.
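That child-to-parent event tie can be sketched as a topology-aware correlation: if a device's upstream dependency is already down, its own "down" event is filed under the parent failure instead of raising a separate root alert. The topology and names here are hypothetical, not the LANL ZenPack implementation:

```python
# Sketch of parent/child event roll-up over a dependency topology.
# Illustrative only: device -> the upstream parent it depends on.
TOPOLOGY = {
    "leaf-sw-1": "core-sw",
    "leaf-sw-2": "core-sw",
    "node-001": "leaf-sw-1",
    "node-002": "leaf-sw-1",
}

def correlate(down_devices):
    """Split failures into root causes and suppressed child events."""
    down = set(down_devices)
    roots, suppressed = [], []
    for dev in sorted(down):
        # Walk up the topology: if any ancestor is also down,
        # this event is a symptom, not a root cause.
        parent = TOPOLOGY.get(dev)
        ancestor_down = False
        while parent is not None:
            if parent in down:
                ancestor_down = True
                break
            parent = TOPOLOGY.get(parent)
        (suppressed if ancestor_down else roots).append(dev)
    return roots, suppressed

roots, suppressed = correlate(["core-sw", "leaf-sw-1", "node-001", "node-002"])
print("root causes:", roots)
print("rolled up under a parent:", suppressed)
```

With a core switch and everything behind it down, only the switch surfaces as a root cause; the leaf switch and node events are tied to it as children, which is the behavior described above.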
We do issue tracking that we report out on. We do asset tracking, so you can follow a node around: if it gets fixed, gets put back in somewhere else, and exhibits the same problem, you can see that in the monitoring system. And I don't think any of those are in the core. I think those are all in HPC ZenPacks, but Simon can correct me if I'm wrong about that. Yeah, you're actually correct here. That's why Zenoss, the community version, is also called Core. It basically allows us to abstract away all the functionality that we keep adding to the product and that community members might add to the product, and then people like Randall and so on can add ZenPacks that can be optionally installed by users. So we have a forum where the ZenPacks are kept; community members give feedback and rate the ZenPacks. That's how the work in the community interacts as well. So what's coming in the future for Zenoss? New features, new versions? Yeah, certainly. We're always working on bigger and better things. Really, the next level for us is to take it up a notch, right? So far we've managed systems and environments that have tens of thousands of nodes, but we really want to get to a level where we literally manage hundreds of thousands or millions of nodes, essentially. So we've done a lot of impressive work on refining some of the elements that we found to be problematic at scale. The event system has really been reworked in our next release, which is called Avalon. It will really bring to the table a system that can scale at cloud scale, right? So we have the ability to process literally thousands of events per second. We have the ability to keep model information that is very rich and literally includes, you know, billions of nodes and objects. So that's really the next level.
And we want to provide a view that is much more intelligent than what we have at the moment. At the moment we're very good at presenting views, presenting data, so to say. But where we're lacking is in taking the data and turning it into information that is actionable. And that's really where we want to go next. We want to provide a layer, which we call Impact, that gives us the ability to turn the data that we have into information. That's probably already too much I said there anyway, so I'm going to stop now, otherwise the higher-ups are probably going to get upset with me. So then let me ask you, what is the difference between what is developed in the community as the community product and what you guys put out as your enterprise product? Yeah, so, what do the enterprise customers get when they decide to subscribe with us? Obviously, one of the elements is support. We also take the series of ZenPacks that we have that are very enterprise-focused, right? So VMware vCenter, Cisco UCS. Those are technologies that you're generally not so interested in if you have a small environment, but only when you're an enterprise customer. So those are the value-add packs that we put on top of the core product that really make our product valuable to a large enterprise. Also, we bring the system up to scale, so we have various tuning capabilities and so on and so forth that we put in place as well. So really, the system performs at enterprise scale with higher reliability and so on and so forth. So what's the contact information, the point of entry, for somebody who wants to get involved with Zenoss or Zenoss HPC?
Well, for Zenoss, I can answer for Simon, there's zenoss.com. But for the HPC side, in the community area of Zenoss there's a specific HPC area, which is actually going to get updated when the newest version of Zenoss comes out. We're going to do a fairly major revision at that point. And I think we can refer to that link someplace on your page. Yep, yep, we'll make sure any links you guys have for us are in the show notes on our website. Obviously, I'll have a bunch of stuff for the show notes as well. But as Randall said, the easiest way is to hit www.zenoss.com. There's a lot of video content. There's a link into the community pages. You can download the product if you want, and get started with the VMware edition. We have a quick start guide that really is no more than five pages, and you basically have your first server in there as well. So it's very easy to get started and there's lots of information out there. Okay guys, well, thank you very much for your time. This show will be up soon and we'll get those links from you and put them in the show notes. Alright, thanks everybody. Thank you.