All right, thank you guys for coming out. We're going to go witch hunting today. I'm hoping at least some of you enjoy the same silly movies that I do; this will make a lot more sense if you do. My name is Mike Smith. I'm a Cloud Systems Architect for Overstock.com. I do want to take just a minute to introduce Overstock in case you don't know who we are. We are online shopping, online retail. Most people who know us know us for what we call our shopping site: a couple million products on site, more every day. We also have a few other faces that you may not know about. For example, our Worldstock initiative, which is all about giving a fair shake to craftsmen and artisans in developing nations. That money funnels back to their local areas; Overstock money is used to build schools in areas like this. We also do some other initiatives, such as our pet adoptions, where we take the search engine capability that we offer on our shopping site and use it for the good of connecting people with pets that need homes. Here are some here in Austin, actually. We have an initiative called Main Street Revolution that is all about giving mom-and-pop small businesses bigger exposure on the internet through our infrastructure. We introduced a Farmers Market initiative not too long ago; it's all about fresh produce near you, rather than trucking it in from across the country. And finally, we're also known for being big fans of Bitcoin. We were the first major retailer to accept Bitcoin for online shopping, and we also have a Bitcoin ATM in our lobby, which is kind of fun to watch people use. But from a technology standpoint, we are operators of six separate OpenStack clouds. Three of those are a little bit older, three of them are a bit newer, and we're in a transition stage between them. As a company, we were founded in 1999. We have some 500 to 550 applications behind the scenes of all of the interfaces I just showed you there.
Different applications responsible for shipping and rewards points and inventory and all of these kinds of things that go on behind the scenes. Those applications are mostly Java with Oracle backends. And because we've been around for a while, these applications aren't necessarily born in the cloud; they're more traditional applications that have transitioned over time into more cloud-friendly applications. Although we were originally on physical hardware for everything we do, these days OpenStack plays a huge role in not only our dev/test environments, of which we have many, but also production. And by default, every single new app at Overstock goes on our OpenStack clouds. As I mentioned, we operate six different clouds. We have one that we're finally phasing out that goes all the way back to the Folsom days. Our newer stuff is on Kilo, going to Liberty this month. So we are going to talk about witch hunting. And what do I mean by witch hunting? Witch hunting, to me, is something like this, where things are great, you have a nice normal known pattern, and then something changes, and you have no idea what. Something clearly is going on here, and in this case, it spiked the CPU load average of a number of hypervisors to two to three times what they would normally do. When this happens, you get this kind of a reaction. You have some performance problems, so of course you turn to some big data analysis and decide who's the guilty party here. We're going to talk about using Ceilometer to do that witch hunting for us. How many of you out there use Ceilometer today? Could you raise your hand? Okay, cool. Raise your hand if you use Ceilometer for showback, billback, that kind of thing. How about for auto-scaling, provisioning, that kind of stuff? Okay, so it looks like we've got a mix. There are some people here that use Ceilometer, and some that don't. Let me know if you've never installed Ceilometer, never played with it before. Can you raise your hands there?
Okay, we have a fair number of those folks as well. So I do want to just give a little bit of Ceilometer basics. If you don't know, in the real world a ceilometer is a meteorological device for measuring the height of a cloud layer, so it's pretty clever that they picked that name. Technically, I think the project likes to be referred to as Telemetry, and Ceilometer is one of its components. In the OpenStack world, Ceilometer is a set of APIs and publishers and collectors and tools that essentially collect data from various resources (your Nova deployment, your Cinder deployment, Neutron, all of these things), storing that away in a database and making that data accessible through APIs. Some common uses for Ceilometer: first and foremost, I think most people use it for metering the resource usage of their customers, their users. Ceilometer itself is not a billing or rating engine; there are projects like CloudKitty that will do that, which tie into Ceilometer. At Overstock, we do some showback and billback to departments and things like this, but we're actually happy just using the Nova tools that do this outside of Ceilometer. If you're getting into the nitty-gritty of charging exactly like an AWS cloud, for how many bytes in and out of the cloud and how many disk I/O reads and writes you do, then you definitely want to use something like Ceilometer to gather that information. It's also used, as some of you raised your hands for, as an alarm condition that can trigger external actions. This could be monitoring things like Nagios alarms or other things, but you can also use it in a cool way to trigger expansion of your cloud, provision new instances, and I'll walk through a little bit of that too. So here's an example of using Ceilometer to measure, record, and collect CPU utilization.
Once you have Ceilometer installed, there's kind of what I call Ceilometer classic, and then there are some newer components like Gnocchi and those kinds of tools, and it's really a difference in how the data is stored. We're still using what I'm calling classic Ceilometer, where it's storing this data in a Mongo database. The Gnocchi stuff that has come out more recently stores it a little differently, in a more efficient way that gives you better performance, so look into both of those, but know that for the purposes of what I'm doing here, we're using MongoDB. After you install Ceilometer, it has this idea of meters, and it automatically starts grabbing all of this information about each particular instance. In our case, for example, it does polling every 10 minutes; in some cases, we have it set to every five minutes. It's grabbing all of this information, like cpu, which is a cumulative count of nanoseconds of CPU time used by a particular instance. It also has a neat way of transforming the data into something more usable, like cpu_util, which we're highlighting here. This is a gauge of how much CPU utilization the instance is doing, which is different from load. They have formulas that say, okay, given the number of cores you've got, I can derive a CPU utilization factor, which is a little different than just a raw load. You also see in there gauges and cumulative counters for things like disk reads and writes, the number of vCPUs an instance is using, et cetera. Okay, so that was a meter list. This is a sample list, the next step along. For every instance, it's going to go and pull this data every so often, and here, with the -l 5 at the end, I'm asking Ceilometer to show me the cpu_util meter for this particular instance and show me just the last five samples.
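To make the difference between the cumulative cpu counter and the derived cpu_util gauge concrete, here is a rough sketch of the kind of normalization involved. The function is my own illustration, not Ceilometer's actual transformer code:

```python
def cpu_util_from_samples(prev_cpu_ns, curr_cpu_ns, period_s, vcpus):
    """Turn two samples of the cumulative 'cpu' meter (nanoseconds of CPU
    time consumed) into a utilization percentage for the interval,
    normalized by the number of cores."""
    delta_ns = curr_cpu_ns - prev_cpu_ns
    available_ns = period_s * 1e9 * vcpus  # total CPU time on offer in the period
    return 100.0 * delta_ns / available_ns

# A 2-vCPU instance that burned 120 s of CPU time over a 10-minute
# (600 s) polling interval is at 10% utilization.
print(cpu_util_from_samples(0, 120e9, 600, 2))  # -> 10.0
```

This is why cpu_util can read under 100% even when raw load is high: it divides by the core count.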
So here I am, I've got 10 instances, and this is a pretty sleepy VM at the time, so it's really not using any CPU. Okay, going one level beyond a sample, you can then show a statistic. A statistic takes multiple samples in a given time range and shows you some interesting information like the max, min, average, and sum of those numbers. You can specify the period of time to consider in that statistic, and so on. In this case, we're sampling every 10 minutes and I asked for a 10-minute period, so the min, max, average, and sum are the same, but if you were to select a different time period, you would see those numbers actually start to mean something. Okay, so then we get into the alarming feature of Ceilometer, where you create these alarm thresholds. In this case, I've created an alarm called cpu_high for when things start to heat up on this VM. I'm using the cpu_util meter, setting a threshold of 20%, and I'm saying that if this VM has a CPU utilization greater than 20 on average for a period of 10 minutes, and in this case for two evaluation periods, meaning two 10-minute periods, then I want to trigger an action. I've just made up an API here that it might hit, and I could use that to know that, hey, this VM is asking for help. It wants a buddy; it wants to share load across multiple instances, for example. At some point, once it crosses that threshold, this condition will actually trigger, and we see that here in the Ceilometer alarm history. At this particular time, we crossed that threshold and sent an alarm out, and it's nice that it shows you all the conditions it's acting on. This is what it looks like when Ceilometer alarms post data to wherever you point them: it gives you some information in the post, including how many polls have seen this threshold passed, and so on. So you could use this post request to do some horizontal scaling.
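As a minimal sketch in plain Python of the evaluation behavior behind a threshold alarm like that cpu_high one (20% average, two consecutive 10-minute evaluation periods); the function name and numbers are my own illustration, not Ceilometer's evaluator code:

```python
def alarm_should_fire(period_averages, threshold=20.0, evaluation_periods=2):
    """Fire only when the average exceeds the threshold for the last
    N consecutive evaluation periods (here, two 10-minute windows)."""
    if len(period_averages) < evaluation_periods:
        return False  # not enough data yet; Ceilometer calls this 'insufficient data'
    return all(avg > threshold for avg in period_averages[-evaluation_periods:])

# One quiet window, then two hot 10-minute windows: the alarm trips.
print(alarm_should_fire([5.0, 34.2, 41.7]))  # -> True
# The most recent window cooled off, so no alarm.
print(alarm_should_fire([5.0, 41.7, 12.0]))  # -> False
```

Requiring multiple consecutive periods is what keeps a single noisy sample from waking anything up.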
We use it at Overstock to say: we have different server types, and we'll say, okay, this instance is representative of the whole of this cluster, so when it gets burdened with CPU load, it's time to create another one. Then we have some timers in place to say, okay, give it five minutes, and if it still needs help, let's create another one. Some people use this to do more vertical scaling: this VM needs more CPUs, not more members in a cluster, for example. And we also use alarms like this to trigger reductions in that cluster, so it's the same idea. The nice thing is you can set different thresholds. It's like, okay, I want to allow it to build something if there's been 10 minutes of high CPU load, but I want to wait 20 minutes before I start reducing the cluster, because I don't want it to go up and down, up and down all day. So it's very flexible in that kind of way. You can do the same thing with memory utilization. There's an example of showing meters for memory usage and getting the statistics on memory usage. There are several other meters available; for witch hunting purposes, we use a lot of storage I/O and network utilization. You can look at that in terms of both packets and bytes, and other measures as well. So back to witch hunting. We have a little script, and this is my disclaimer: I'm a Python hack. There are certainly much better ways to do this than what I'm doing. I looked around for some kind of native Ceilometer way to do it; if any of you know how to do this natively in Ceilometer, to get a top 10 bubbled up to the top, or sorting, I would love to have you talk to me afterwards. But this is an example of the kind of tools that we use. So when we are in normal conditions, it might look something like this. With this particular script, and I'll walk through what I'm doing in it at the end, we can say: how many results do you want to show? How many minutes do you want to look back? Where do you want to start that time period of looking back?
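The up-fast, down-slow scaling behavior described a moment ago boils down to asymmetric thresholds. Here's a minimal sketch of that hysteresis idea; the function and its names are illustrative, not our production code:

```python
def scaling_action(minutes_above, minutes_below, up_after=10, down_after=20):
    """Asymmetric thresholds to keep a cluster from flapping: grow after a
    short sustained spike, shrink only after a longer sustained lull.
    minutes_above/minutes_below are how long cpu_util has been continuously
    above or below the alarm threshold."""
    if minutes_above >= up_after:
        return "add_instance"
    if minutes_below >= down_after:
        return "remove_instance"
    return "hold"

print(scaling_action(12, 0))  # -> add_instance (10 minutes of load is enough to grow)
print(scaling_action(0, 15))  # -> hold (quiet, but not for the full 20 minutes yet)
print(scaling_action(0, 25))  # -> remove_instance
```

The gap between the two timers is the hysteresis band that stops the cluster from bouncing up and down all day.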
We keep data for only about three days. To go beyond that would require a lot more Mongo build-out at our scale, and we don't currently have the need to do that. So this is showing us that, yeah, we've got one VM here at the top that's a little bit loaded, but for most everything else, we're still within some good thresholds. This happens to be our dev/test cloud rather than our production one, but we use the same tool in production as well. So back to this graph. Things are humming along, and something is amiss; we don't know why. Well, this top 10 script shows us this one particular type of Java web server, across the board, multiple tenants, all of them pegged above 100%. What was cool is that when this happened, we knew exactly where to start looking. So we took this jweb02 Tiger test instance, pulled it up in top, and here's what we see: Splunk is our witch in this case. You've probably seen Splunk t-shirts around. If you don't know what Splunk is, it's a log aggregation tool, and usually it's very, very well behaved. The thing is, Splunk is licensed on how much of all that incoming log data you can index, and if you violate that limit three times in a month, you essentially get disabled for a period of days. To protect ourselves from that situation, we block our Splunk indexers with iptables if we're about to cross that threshold in certain cases. So this happened in dev/test because somebody had their apps in debug mode and spewed all kinds of logs to Splunk. We blocked it with our iptables, so we didn't go over that threshold. The next day we dropped those shields, and all of these servers that had been queuing up the data all wanted to send their data to Splunk at once. It wouldn't have been my first thought when I saw that graph, and that's why it was so nice to have something like Ceilometer that could point us at the specific problem. So clearly, if she weighs the same as a duck, she's made of wood.
And knowing that things that float include ducks and witches, and things that burn include wood and witches, then we know she's a witch. It's just science. The top 10 tool that we use also allows us to ask for disk I/O. Here are some examples in the normal case: those Hadoop boys like to use their disk I/O, and things like Grafana and Graphite tend to be heavy on disk. Here's one that shows you a network view in terms of packets, and we can see these gateway-async customer search guys are using more packets at this particular point in time than your average bear. Same thing showing us network reads and writes. So all of this is super useful, and it comes out of our Ceilometer data. I'll just walk you through how this particular script grabs data from the Ceilometer API. This presentation will be posted as part of the videos, so you can grab it there; you're welcome to take pictures if you want, and I will look into pushing this to the operators' GitHub as well, in case it's valuable to anybody else. So, Python: we're including the Ceilometer client and the Nova client, and I have a credentials file that I store things like passwords and whatnot in. You'll see some defaults for the number of minutes to go back, as well as how many minutes back to start from and how many results I want to get back. And then we're listing meters. If you remember that Ceilometer meter list, you see a whole lot of meters that are available. Some of those are useful to me, some of them are not. So this script allows me to ask for network, CPU, disk, memory, or everything, and this is just an easy way for me to include which specific meters I'm interested in without having to remember their full distinguished names. So that's what the first part of this script is. Then we get into a couple of little functions. Not every resource in Ceilometer is stored with a UUID alone.
Instances from Nova are, but there are things like Neutron networks and Cinder volumes that tack other things onto their resource IDs. So there's a little function to help me derive the instance ID out of the data Ceilometer has stored, and a way of getting a Nova client so that I can look up names, because the UUIDs Ceilometer records for instances aren't immediately useful on their own. This getStats function is probably the more interesting one here. I pass it a meter, a start time, and an end time. It gets a Ceilometer client and forms the query that needs to be passed to the Ceilometer API. In this case, I'm passing a list of two dictionaries that specify a timestamp greater than one time and a timestamp less than another time, and that gives me a time range. I do that query for a specific resource ID I'm interested in, and I get back the stats. There's also a check in there because, it looks like, one time I was paranoid about getting duplicates and freaking out about it; it has never triggered, so we must be in a good state. Then we do some sorting, and then we reverse that sorting. This is the part where I'm sure there are many, many ways, which I would like to learn about, to not rip through my entire Ceilometer data set just to get the top 10, but this is what we ended up doing: we return the number of results asked for, in sorted order. There's some stuff about parsing what I'm putting in on the command line and figuring out how many minutes to go back, and then, really, here is the part that gives me the output. From what I experimented with, it seems like Ceilometer always wants UTC time, and my servers are not necessarily in UTC time, so we do some conversion. You're supposed to be able to specify a time zone; it didn't work for me, so I'm doing it this way and just passing through the UTC time. Then we're just looping through those meters, sorting them, and printing out the instance names that match. Okay, so hopefully that's fairly straightforward. That's all we're doing. It's not anything magical.
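Since the script itself lives on the slides, here is a self-contained sketch of the three pieces just described: the two-dictionary time-range query, the UTC conversion, and the brute-force sort-and-slice that produces the top 10. The instance names in the example are made up:

```python
from datetime import datetime, timedelta, timezone

ISO = "%Y-%m-%dT%H:%M:%S"

def build_time_query(start_utc, end_utc):
    """Ceilometer statistics queries take a list of field/op/value dicts;
    two of them, 'gt' and 'lt' on timestamp, bound a time range."""
    return [
        {"field": "timestamp", "op": "gt", "value": start_utc.strftime(ISO)},
        {"field": "timestamp", "op": "lt", "value": end_utc.strftime(ISO)},
    ]

def top_n(avg_by_instance, n=10):
    """The brute-force client-side step: rank every instance by its
    average for a meter, keep only the heaviest n."""
    return sorted(avg_by_instance.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Ceilometer wants UTC, so convert before building the query.
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=60)
query = build_time_query(start, end)

print(top_n({"jweb01": 101.2, "batch03": 4.5, "jweb02": 118.9}, n=2))
# -> [('jweb02', 118.9), ('jweb01', 101.2)]
```

The query list would be handed to the statistics call of the Ceilometer client; everything after that is ordinary Python sorting, which is exactly the inefficiency the talk laments.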
It's just leveraging the Ceilometer APIs and the Nova APIs in order to get that kind of result back for troubleshooting. I think whoever started this meme of LEGO kids is fantastic. I like it; there are a bunch of them out there for various Monty Python things. But that is what we use for witch hunting, and I just want to thank you all for coming, especially any of you that participate in the work on Ceilometer and Telemetry. Thank you for making this tool and making it useful for us. If you do have questions, I'd be happy to talk to you or take those questions now. If you would please step to the mic so that the folks watching this later on the video can hear, that'd be great. So this was mostly for instances. Did you do anything for looking at your computes, to figure out which of your hosts are also causing problems? Sure, for our purposes we see that in our Nagios monitoring, and so we really didn't have a need. There are certainly ways to do that. We have looked at creating custom Ceilometer events that aren't polled. So for example, and this goes back to usage and billback, I mentioned we use Oracle a lot. That's kind of where the expense in our resources comes from; the rest of it's cheap by comparison. So we have this pool of Oracle databases that we essentially check out to a development team as part of their environment. We have ways of tracking that, but we've talked about pushing Ceilometer events into it so that we can use that for billback and showback, as kind of a pretty cool way of putting your own events within it. But as far as monitoring our infrastructure itself, the hypervisors, the controllers, we just rely on Nagios for that. Yeah, you have a question. Hi, I have a question on your graphing. What do you use, do you use Grafana hooked into Ceilometer? It doesn't necessarily hook into Ceilometer.
The graph you're seeing there is, we're using Puppet, so we build instances, and based on the type of name an instance has, it's going to get a certain Puppet policy. One of the things that we do across the board in our Puppet policy is have it run Diamond. Diamond is the collector that shoves the data in, Graphite is the receiver of that data, and Grafana is our UI for displaying it as graphs. So then you basically pump the Ceilometer data into Graphite? Actually, no. We just have Diamond pushing every metric we want into Graphite and viewing that through Grafana. Ceilometer is just a place I go when I see some anomaly on my graphs and I don't know why, right? I don't necessarily monitor, through Nagios, the hundreds or thousands of VMs that we have for everything like disk I/O and things like that; it gets kind of overwhelming. But we're already measuring all of that through Ceilometer, so we use that data. Does that make sense? Cool, thanks. Cool. Yeah, you've got a question. I'm sorry if I missed this, but this Python code that you just showed us, is that all on GitHub somewhere? I'm sorry, is it on GitHub yet? It's not yet. I will put it up there. It'll probably end up in the operators' area, kind of miscellaneous, the tools that people use. If you're on the operators mailing list, we'll put an announcement out once we put it there, in case it's useful. All right, good stuff, thanks. Cool. Hello, hi. I saw your Python code showed network statistics. Do you know if Ceilometer also supports per-flow statistics? Per-flow statistics? Yes, do you know if it supports that? I haven't seen that in Ceilometer. We did a presentation on Monday on our networking component, so once this is posted, you could take a look at that. We're a Midokura customer, and so we're pretty sensitive to flows, because it scales based on the number of flows.
For us, we use their JMX interface into their process to account for those flows. That is something we will probably write a Ceilometer plug-in for, to shove those things into Ceilometer so that we know about them and can take action on that kind of scaling as well. But I haven't seen anything flow-related; somebody who knows more about Ceilometer than I do might be able to help with that. Thank you. Sure. Any other questions? If not, I'll ask you one. If there are any Ceilometer experts out there, do you know of any way to query Ceilometer directly to get things like just show me the top 10, or sort them? There's some group-by aggregation stuff, and we're kind of using that in our scripts, but I would love to hear from any of the Ceilometer folks out there, and there's my email address on the screen, if there's a way to do that natively. As you saw, the Python script I'm using is essentially pulling the statistics for everything at once for a time period, sorting through it, and returning just the top ones. There's got to be a more efficient way. But cool. All right, well, if that's it, thank you very much for coming. I appreciate it.