I hope you will be impressed. So let's begin with the agenda. I will talk a little bit about the history, and about what makes Munin unique; we'll see that it is also unique in some not very good ways. Then, what makes Munin 2.0 unique: it's the version that got into Wheezy, and it's very interesting. We will also see how, thanks to the new features of 2.0, you can scale a Munin install much further than with the 1.4 package. We'll also see the limitations of 2.0. You can now scale quite well (in theory really well; in practice, we'll see), but we still have some big issues, very different from the ones in 1.4. I will briefly present the roadmap for 2.2; hopefully it's released this year, although that's a challenge. I will stop at ten to the hour, and we will have 15 minutes for questions just after.

So, a brief history. Munin was born in 2002, under the name LRRD. I didn't know that before; I only learned it while researching this presentation. It's not a well-known fact, but some code, most code actually, still dates from that time. That's important for understanding the issues we hit when changing the code. It's more like geology: you have layers. You want to add functionality, that's one layer; new functionality, another layer; but you don't own the base. I hacked zooming into 1.2 in 2007 (1.2 was very static) and maintained it in my own private tree. In 2009, 1.4 came out, and I asked if I could send my patches to Munin. Well, they got accepted. From 2009 until 2011 I was slowly gaining ground in the Munin community, up to the point where I took over the leadership from the previous team. It didn't happen officially; it's just the way it is. And in 2012 I released 2.0, thanks to Holger, who said: you have to release now, otherwise you will release in ten years. So thanks to him.
Things were very hectic in the early days of 2.0, because I realized the biggest problem was this: since it wasn't released, we didn't have many testers; and since we didn't have many testers, I didn't want to release, since bugs kept coming out, and so on. Thanks to Holger we broke that cycle, and we released in 2012. It's interesting, since that's ten years after it was born; someone said every piece of software gets good after ten years. Well, maybe. It has been in Wheezy since September 2012, and in stable since Wheezy came out. In 2013 I released 2.1. It's an unstable branch, because I didn't want to have the same problem as with 2.0, the lack of testers. So I just packaged the development branch and released it. It's unstable. Normally it works, but, well, you know what unstable means. The biggest thing is that the internals will change during the 2.1 lifetime. I said October 2013 is the target for releasing 2.2, but time will tell. If you don't fix deadlines, you never release anything; better late than never.

The very simple design principle of Munin: I really love this quote from Alan Kay, "Simple things should be simple and complex things should be possible." That's exactly the motto of Munin: it makes simple things simple and complex things possible. It's very easy to use. It has sane out-of-the-box behavior: when you install it on a server, it automatically starts monitoring, and if it doesn't, it's a bug, please report it. And it has a complete plug-and-play infrastructure compared to others. The only thing you need to do is declare the node, because broadcasting on a local network is not very practical in my point of view. That's the only thing you have to say: poll this node. The node will just send all its config to the master, and the graphs get drawn.
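For reference, declaring a node on the master is only a few lines in munin.conf; the hostname and address below are placeholders:

```
# /etc/munin/munin.conf on the master
[web01.example.com]          # name shown in the web UI
    address 192.0.2.10       # where munin-node listens (TCP port 4949)
    use_node_name yes        # trust the name the node reports about itself
```

Everything else (plugin list, graph titles, labels) is fetched from the node itself when the master polls it.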
The thing is, the vast majority of our users have just one server to monitor, and the Munin install is on that same server. That's why the Munin defaults are always targeted at these users; if you have a bigger install, you usually already know how to change config files. That said, some people are running bigger installs, and those are the ones that interest me very much for 2.2, since we address the one-node install type really well, but for bigger installs we have a lot of problems. We improved a lot from 1.4 to 2.0, but now we hit other limits, which we will discuss just after.

OK, so, new features. We now have a full CGI implementation; the old one, you should not use it. It works sometimes, and it's buggy all the time. There is also a full FastCGI implementation, which is very important for adequate performance, so you don't reload everything on each request. The biggest selling point is complete integration with rrdcached. We will talk more about that later, but this is the main issue when scaling, because RRDtool is very nice but doesn't scale very well without rrdcached. And when you use rrdcached there are some guidelines, which I will describe later; you should not just do whatever you want. It has native SSH transport. Before, you used plain TCP on port 4949; you could use TLS, but most people didn't. With native SSH, people usually already use SSH on their installs, so setting up the SSH transport for them is quite easy, whereas with TLS you have to have a certificate and so on; it's much more complicated. It also avoids opening a new port, it's secure, and it's usually better integrated into existing setups. The other very big feature is the async proxy: something that sits on the node, polls the node autonomously, and spools the results locally on the node.
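As a sketch (hostnames are placeholders, and the exact munin-async path varies by distribution), switching a node to the SSH transport is done on the master by giving it an ssh:// address, optionally replaying the async spool over the tunnel:

```
# plain TCP transport, the pre-2.0 way (port 4949):
[db01.example.com]
    address 192.0.2.20

# SSH transport, fetching the munin-async spool through the tunnel:
[db02.example.com]
    address ssh://munin@db02.example.com /usr/lib/munin/munin-async --spoolfetch
```

No new open port, and key management rides on whatever SSH setup the site already has.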
munin-update will then connect to the async client part and just replay the spool that was collected before. That has very interesting properties. If you have nodes with flaky connectivity, for example a remote location that sometimes loses its network, then, since it spools locally, when you reconnect you will gather everything that was collected while you weren't connected. So those little white bars you are accustomed to are gone. It also speeds up polling, even on a local network: since all the polling and the waiting for plugins is delegated to the async proxy, collection from munin-update goes much faster; it only replays a plain-text log file. When you have a big cluster, it sometimes makes sense to use async anyway, since the fixed five minutes for a munin-update run is still a hard limit, and you cannot go beyond it. One lesser-known thing about the async proxy is that it can poll at various update rates. If a plugin says "I want to be polled every hour," async will only poll it hourly. And the most interesting part: if a plugin says "I want to be polled every 10 seconds," async will poll it every 10 seconds and still send all the data back to munin-update every five minutes. So you won't have real-time information, but you will have very precise information.

Now, scalability. That was the biggest focus of 2.0; the first one was the zooming part. And zooming showed that you can have huge data files, since it's not very useful to zoom into one year of history if you don't keep the finer granularity in the RRDs one year back. That's scaling the data; we'll come back to it. But what people really want is adding more nodes; that's the most common scaling axis. And inside a node, you can also have a huge number of plugins.
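The per-plugin rate is something the plugin announces itself, in its config output. A hypothetical sketch of such a plugin (the graph name and field are made up for illustration):

```shell
#!/bin/sh
# Hypothetical plugin sketch: it announces a 10-second update_rate, so
# munin-async polls it every 10 s while munin-update still collects the
# spooled values every 5 minutes.
plugin() {
    case "$1" in
        config)
            echo "graph_title Interrupts (fine-grained)"
            echo "graph_category system"
            echo "update_rate 10"       # honored by munin-async in 2.0
            echo "intr.label interrupts"
            echo "intr.type DERIVE"
            echo "intr.min 0"
            ;;
        *)
            # value printed on each poll, read from the kernel counters
            printf 'intr.value %s\n' "$(awk '/^intr / {print $2; exit}' /proc/stat)"
            ;;
    esac
}
plugin "$@"
```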
Some people have very large installations, especially once they start to use SNMP, because SNMP lets one host monitor many, many remote routers or SNMP agents. And some also have slow plugins. We already discussed that munin-update should take less than five minutes; otherwise bad things happen. That's a hard rule: if munin-update takes more than five minutes, really bad things happen, mostly white bars. So if you have many plugins, and many of them take quite a long time to poll, then since it's all synchronous, even if you parallelize a lot, it sometimes still goes quite slowly. If you multiply the number of slow plugins by many nodes, you usually pass the five-minute bar. Then there is scaling the data, which usually comes with the zooming part. Many people ask: I can zoom one year back, but I only get one bar per day, and I don't care about the daily average. Here you can natively keep much more data; more on that later.

So, scaling the master. For a big install, the first thing is: use FastCGI. The default is cron-based; remember, the defaults are for the typical user who has only one node and one server. Anyone who has more than, let's say, five nodes should really go the CGI road, and not plain CGI but FastCGI, because the cron road regenerates every graph whether anyone looks at it or not, and that's just pointless. It's very simple, but it's pointless. As I said, you have to use rrdcached. RRDtool is a very nice piece of software, but it has one main problem: it's so efficient that each update writes only a very small part of the file, and to the underlying I/O subsystem, all those updates on a big install just look like random I/O. And when I say random I/O, it's real random I/O; almost cryptographically secure random.
When I asked a storage vendor about it, he said: OK, no problem, we can handle that. Then I plugged a big install into it, and he said: what is that? Random I/O is usually not that random. The RRDtool people are well aware of this, and they designed rrdcached specifically to buffer this random I/O and turn it into something the storage can handle. There is a slide deck you can Google, "rrdtool: escape from I/O hell," that describes really well what's behind rrdcached. And it matters even on SSDs, because the usual vendor answer is: no problem, just use SSDs. The thing is, in my tests, after four hours with a big install all the SSDs were simply offline, because of too many I/Os. It writes, and writes, and writes a lot. So SSDs are interesting, but not sufficient on their own for us. rrdcached has one very big rule: you should never, ever read directly from the RRD files, especially from cron. If you read on demand, it's perfect: it only flushes the files you're reading. If you read from cron, by default you read the whole install, and that's exactly the same as not using rrdcached at all, so it's useless.

Also, for Munin you need lots of RAM. With rrdcached, the more RAM you give it, the longer it can keep its spool, and so the less often it writes. That's very interesting. And if you have lots of RAM, you can multiply the number of workers. Since Munin is very much I/O bound, either waiting for a node or waiting for the I/O subsystem, having many workers usually helps a lot, because every worker is single-threaded. But never, ever swap. That sounds obvious, but Munin is designed to use all the memory of its workers, so if you swap even a little, there is no unused memory to reclaim.
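As a sketch of the wiring (socket and journal paths are placeholders, and the timer values are just examples, not recommendations from the talk): rrdcached is started with long write delays and a journal, and Munin 2.0 is pointed at its socket:

```
# start rrdcached with a long write delay, jitter, and a crash-safety journal
rrdcached -l unix:/var/run/rrdcached.sock \
          -j /var/lib/rrdcached/journal \
          -w 1800 -z 1800 -f 3600

# /etc/munin/munin.conf: make munin-update and munin-cgi-graph go through it
rrdcached_socket /var/run/rrdcached.sock
```

With this in place, reads issued through the daemon flush only the files being graphed, which is exactly the on-demand pattern described above.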
You cannot swap. For people who know the swappiness setting: it means trading application memory for file cache memory. That's not a good idea with Munin, since all the application memory is useful at some point. On the master, you really have to watch out for shared hardware. Munin is very nice, and it loves to annihilate any hardware you put it on, because it's designed to be very scalable: you can launch as many processes as you want; we'll see some limitations just after. But the contract is not that it does so efficiently. It's not very clever; it just uses your system and goes on. For the record, I saw a storage array that was shared by all the applications of a site, and when we measured it, 99% of the IOPS were being delivered to the Munin server. Imagine what stays for the others: not much. So we put Munin on dedicated hardware. It goes slower, but the other applications are happier.

As I said before: use the async proxy even if you don't have a special need for it. It enables very fast collection, as all the I/O time, all the wait time, is absorbed directly by the async daemon. Your munin-update almost doesn't wait at all: it only connects, reads a file on the server, and disconnects. A typical polling time is about 10 to 15 seconds; with munin-async, the typical time is about one to two seconds, depending. So you have a factor of ten, and that's very interesting when you want to scale, because it lowers the number of update workers needed. As I said, Munin uses lots of RAM, but you usually don't want to spend RAM on munin-update. You'd rather spend it on the presentation part, the graphs, the HTML, which we will talk about later. You just want munin-update to be very quick. And once it's no longer bound on network I/O, it's only CPU bound.
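As an illustration of the FastCGI road (paths follow the Debian packaging; adjust them for your distribution), an Apache setup rendering graphs and HTML on demand looks roughly like this:

```
# Apache + mod_fcgid sketch for Munin 2.0
ScriptAlias /munin-cgi/munin-cgi-graph /usr/lib/munin/cgi/munin-cgi-graph
ScriptAlias /munin-cgi/munin-cgi-html  /usr/lib/munin/cgi/munin-cgi-html
<Directory /usr/lib/munin/cgi>
    Require all granted
    <IfModule mod_fcgid.c>
        SetHandler fcgid-script
    </IfModule>
</Directory>

# and in /etc/munin/munin.conf, switch off the cron-based generation:
graph_strategy cgi
html_strategy  cgi
```

Graphs then only get drawn when someone actually looks at them, instead of all of them every five minutes.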
And you don't want more workers than you have CPUs on your hardware, since beyond that it's useless anyway. A nice side effect: if your munin-update is very slow (it happens, and we talked about the five-minute hard limit), all the async-enabled nodes will not have any data loss. You will have delays in integrating the data, but you won't have those infamous white bars that most of you have already experienced at least once.

Now for the node. As I said, some installs have a huge number of plugins; the biggest install I saw had about 1,000 plugins on one node. Wow. Async is very interesting here too, because it has a fork option. Prior to async, munin-node was handling plugins quite sequentially, and each plugin, to keep the whole run under five minutes, had to be quite fast, since it's not the only node being polled. With the fork option, async runs each plugin in its own process. So if you have long-running plugins, use the fork option. Before, such plugins usually polled by themselves, typically from cron, and the plugin just read the status back; that was the official way of doing it in 1.4. But since async does exactly that in 2.0, just use async: it's standard, and it works with whatever you use. Apart from that, the only real problem the node has with many, many plugins is that the startup of the node is serialized, and with 1,000 plugins, that's a big problem.

OK, so now, scaling the data. As I said before, zooming brought the need for precise data far back in time. Keeping more data in the RRDs is very easy: in 2.0 there is the graph_data_size option. You already had it in 1.4, but it was global; now it's per plugin. It's still available globally, but you can specify it per plugin. It's actually designed to be per field, but that part is buggy; it mostly works per plugin.
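Setting it per plugin happens in the master config. A sketch with placeholder names (the plugin-level override syntax shown here follows the usual munin.conf plugin.directive form; check the munin.conf documentation for the exact custom size grammar):

```
# /etc/munin/munin.conf
graph_data_size normal            # global default, as in 1.4

[web01.example.com]
    # keep full resolution much longer, but only for this plugin
    load.graph_data_size huge
```

This keeps the default install small while letting the one graph you zoom into stay precise.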
That works well, but it only applies at RRD creation time. There is an external tool to migrate: I wrote a tool called rrdcopy to move data from a small RRD into a bigger one, but that's not part of core Munin. Once the RRD is created, everything is handled automatically by RRDtool, and it's very fast. RRDtool, as I said, is very, very efficient. But beware, this can use a lot of space. I had one person who wanted 10-second precision for two years. Wow. That's about 500 megabytes of data, and in Munin that's per line on a graph: big data. You can also increase the effective RRD precision; it's called supersampling. That works even without munin-async, but if you use munin-async, it will do the job for you. I will go a little faster, since my time is almost up.

One bigger thing: if you increase the RRD size, always increase the RRAs as well. There is a graph_data_size setting called huge, and it's very nice, because it keeps the maximum precision for two years, but it doesn't have any other RRAs. And RRAs are a key part of Munin's ability to reply very fast on a yearly graph, for example: they are pre-consolidations for the yearly values. The ideal setup is this: you know the size of the graph in your templates, and if you have one RRA row per pixel in the output graph, it goes the fastest, since RRDtool doesn't even have to consolidate the data at graph time.

Now, the limitations of 2.0. The CGI for HTML is very, very ugly. I don't know how many of you have tried it with a big install, but the practical limit is somewhere between 150 and 200 nodes. After that, it's very slow, and it's slow on every reload, because the whole configuration is stored in one big Perl Storable file that gets reloaded, and most of the time is taken by Storable's retrieve. So I can't do much about it as it is; we'll see how I plan to fix it. And the UI itself is not very scalable. You all know the default UI.
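To get a feel for those numbers, here is a quick back-of-the-envelope calculation (pure arithmetic, nothing Munin-specific): one data source sampled every 10 seconds for two years, at 8 bytes per stored value in the primary RRA.

```shell
step=10                          # seconds between samples
span=$((2 * 365 * 24 * 3600))    # two years, in seconds
rows=$((span / step))            # number of stored samples
bytes=$((rows * 8))              # one 8-byte double per row
echo "$rows rows, $((bytes / 1000000)) MB"   # prints: 6307200 rows, 50 MB
```

And that is a single field with a single RRA; add the consolidated RRAs on top and multiply by every field of every plugin, and the disk usage adds up very fast.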
So now take your cluster and imagine 1,000 nodes in it. Well, the overview is a bit flat: all the nodes are essentially on one page. It's very static, and it's not what one expects in 2013; we all have these shiny, very dynamic web apps, and ours is not very dynamic, I agree. Same for the comparison page: it shows every node of a group and every graph. Just imagine that with 1,000 nodes and 1,000 plugins; your Firefox won't have any memory left. And the last thing is that it lacks proper ACLs. For bigger installs, you usually want to delegate monitoring to sub-teams, and you don't want everyone to see everything, because they will be overwhelmed by information. That's a problem.

So I'll go very fast; that's my last slide. For 2.2: the work will be integrated in 2.1, and when it's stable, it will become 2.2. We are moving from the whole Storable file to SQL. The SQL layer will be DBI-based, because we are still in Perl, and it will be SQLite by default, because we really want the nice out-of-the-box install; remember, our users are mostly of the one-node type. If you want, you can use PostgreSQL, or whatever DBI supports. It's up to you. It will enable dynamic HTML, because, well, we're not in 2001 anymore. But that requires a deep rewrite of the code. As I said before, there are many accessors to the Storable data inside the core, but since it was one big Storable, it was a native Perl data structure, and for whatever reason a lot of code doesn't use the accessors. It accesses the structure in a typical Perl way that makes it very difficult to translate to SQL. That's the challenge. And just to be completely crystal clear: the data that is in RRD will stay in RRD. I don't want to put timestamps and values inside SQL; that's not the point. We will also have a complete node push feature.
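Purely as an illustration of the direction (none of this schema exists yet; every table and column name here is invented for the sketch), the Storable blob could become something like:

```
-- hypothetical 2.x state store: configuration and metadata only
CREATE TABLE node (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    address TEXT NOT NULL
);
CREATE TABLE service (
    id      INTEGER PRIMARY KEY,
    node_id INTEGER NOT NULL REFERENCES node(id),
    name    TEXT NOT NULL     -- the plugin name; samples stay in RRD files
);
```

The point of SQLite-by-default is that a query like "the services of one node" becomes an indexed lookup instead of deserializing the whole install on every page load.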
The node will be able to push to the master, in order to break this five-minute polling standard, so you can push whenever you want, every second if you want. I think it will enable very fine precision, and my goal is to be as good as collectd. There is also a little blurb on the new HTML5 UI. I sped up at the end, so you have a little time for questions if you want.

[Audience] With this new architecture, is it possible to...

Sorry, I missed the question. You mean the SQL part?

[Audience] Does the architecture still fork the plugins every time? Or is it possible to run a plugin long-running and just feed back some values? You also mentioned collectd, which builds on that architecture.

OK. I designed a new extension, a new verb, for plugins. It's called stream. You just launch the plugin, you ask for a config, and then you ask for stream. The plugin doesn't quit; it just sends values periodically, at whatever rate it wants. It's designed to capture, for example, the output of vmstat: you can do vmstat piped into awk, and that's your plugin output. It stays in memory, and the plugin kills itself when the configuration changes. The problem is, I didn't put it in 2.2, because I won't have time to do it, but that's the way it will be done. Fundamentally, though, the fork-and-exec architecture for plugins will stay at the core of Munin. It won't be, for example, a .so or a .pm that you load into the Munin memory space. That's not something I want.

[Audience] This was the thing I really liked about Munin, and I used it in 1.2 or whatever, but it had scaling problems with regard to forking. That was one of the reasons I had to change to another system.
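To make the idea concrete, here is a sketch of what such a stream plugin could look like. Note that the stream verb is a proposal described in the talk, not part of any released Munin, and all the names below are illustrative:

```shell
#!/bin/sh
# Hypothetical "stream" plugin: the node launches it once, reads "config",
# then reads values from "stream" until the plugin exits on its own.
plugin() {
    case "$1" in
        config)
            echo "graph_title Context switches (streamed)"
            echo "update_rate 10"
            echo "cs.label context switches per second"
            ;;
        stream)
            # vmstat emits one line every 10 s; awk turns each line into a
            # Munin value and flushes so the node sees it immediately
            vmstat 10 | awk 'NR > 2 { print "cs.value", $12; fflush() }'
            ;;
    esac
}
plugin "$@"
```

The long-running pipe is the whole trick: the fork/exec model is kept, but one exec now covers many samples instead of one.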
[Audience] Hi. I was a happy Munin user, and then I fell off because of the scaling issues and moved to PNP4Nagios. One question I want to ask: with all the data that is aggregated in Munin, you could do proactive monitoring, in a sense, like sending alerts if a trend goes one way or the other, or if you hit a certain threshold. Do you plan on having better integration with alerting systems than you currently have?

Actually, on the Nagios side, we had quite a few problems, because NSCA changed its interface lately. We have something called munin-limits; it sends a lot of things, but it doesn't do it very well. So integration with other systems, such as Nagios, Icinga, or whatever, is very high on my list, because I don't want to reimplement Nagios. I want to focus on data gathering and data keeping. I'm more interested in replacing something like PNP4Nagios than Nagios itself.

[Audience] Because munin-limits, for example, only does thresholds: if this hits a certain value, then alert. Whereas I'm also interested in questions like: usually this file system grows at a 1% rate every day, and suddenly it grows at 50%. I want a warning there.

Exactly. That's something RRDtool even offers right now, with its aberrant behavior detection, and I also have it on my future roadmap. But, well, I'm taking on the problems that are user-facing right now. Everyone is welcome to help; I'm looking forward to it.

OK, time is up, so I guess you'll have to ask your questions after the talk. Thank you. There is a BOF session this afternoon; if you have more specific questions, just come by, and I'll be glad to answer.

[Host] Thanks, Mr. Schnepp.