I guess the lights are off on purpose, right? So we can see the screen. OK, morning, everybody. Welcome to the Collaboratory Track. I'll be your emcee for the Collaboratory Track. Unfortunately, I did not bring any dancing monkeys with me; I was unable to get them from the rental company in time. However, we have an excellent roster of talks today, and I know you'll all enjoy them very much. First up is Patrick Dillon, who's going to talk about vertical scaling of JVMs in OpenShift. Take it away, Patrick.

All right, good morning, everyone. My name is Patrick Dillon. This is my second summer interning at Red Hat on the OpenShift team. I'm a master's student in computer engineering here at BU, and I'll be graduating this summer. This summer I've been working on vertical scaling of JVMs in OpenShift.

Java is an extremely popular programming language, especially among enterprise developers, but it has a reputation for not playing very nicely in the cloud. One of the main benefits you get in a cloud environment is vertical scaling, and we'll see that the way the JVM manages memory can cause some headaches when you're trying to vertically scale your containers. So my project this summer has been to research this problem and help OpenShift developers better achieve their goal of vertically scaling their Java applications.

First, I'll introduce the problem with some background on OpenShift and the JVM. OpenShift allows a developer to declare an intended state for their application. In this example we have a Java app, and the developer is declaring that it should get two gigabytes of memory. That configuration is passed to the API server, and the containerized application is deployed in a pod. On the right-hand side we have an example of a server node which is running a Python app and a Go application and has some free space. The Kubernetes API server will schedule the Java application wherever it can find free space for it; in this case it would be scheduled on that server, because there's sufficient free space. And you can imagine that if you had a fleet of servers running in a cluster, one of the main advantages OpenShift offers is maximizing the resource utilization of those servers, because whenever you have an application, it can match that application to any available free space.
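To make that declared state concrete, here is a minimal sketch of what such a declaration might look like as a Kubernetes manifest; the names and image are illustrative, not from Patrick's slides:

```yaml
# Hypothetical pod declaring the intended state from the example:
# a Java app that should get two gigabytes of memory.
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
  - name: java-app
    image: example/java-app:latest   # placeholder image
    resources:
      requests:
        memory: "2Gi"   # what the scheduler uses to find free space
      limits:
        memory: "2Gi"   # hard cap enforced on the container
```

The scheduler places the pod on a node whose free memory covers the request, which is exactly the matching step described above.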
Another main benefit you get with OpenShift is scaling, and when we talk about scaling there are two main types. There's horizontal scaling: when load increases, you add more instances of an application behind a load balancer. That's a very well-established feature in Kubernetes. And then there's vertical scaling, the subject of this project, which is adding more resources directly to an instance of an application. The two aren't mutually exclusive; you can combine horizontal and vertical scaling. Vertical pod autoscaling is an alpha feature in Kubernetes.

So let's take a closer look at how the JVM manages memory, to see why that causes problems when we try to do vertical scaling. The JVM creates a heap to hold all of the objects created by your application; all the strings, arrays, and so forth that your application uses go on the heap. In this diagram, the blue arrow represents the memory usage of the application over time, and the top black line is the amount of heap memory the JVM has allocated from the host to hold those objects. As the application's usage increases, the heap naturally grows to contain those objects. But what we sometimes see is that when the application no longer needs as much memory, the JVM holds on to that committed heap memory. One reason it does this is that allocating memory from the host can be an expensive operation, and if you needed memory in the past, you might well need it again in the future. That's fine when you're running on a dedicated bare-metal server. But in an elastic cloud environment, this free memory might just as well be considered wasted memory, because, as we saw earlier, another application could be scheduled into it.

Furthermore, committed memory is essentially a black box to the host. When a vertical pod autoscaler is trying to determine whether it should scale an application, it does so based on whether certain thresholds are crossed. But the metrics reported to the vertical pod autoscaler only reflect the amount of committed heap memory; the autoscaler has no idea how much memory the application inside the JVM is actually using. So when you're deciding whether to scale based on a threshold, you can perhaps see the problem: in this case the autoscaler would decide to scale up, even though, as we saw before, the application inside the JVM is not actually using that much memory. I hope that better explains the problem I'm trying to help people with.

The first approach I took was to design a system to introspect into the JVM containers and report out the relevant metrics, so we could say: the application is actually using this much memory. We'd pass that to Prometheus, and then we could customize the vertical pod autoscaler to use those custom metrics. As I was researching which metrics we should actually be looking at, I came across something called the max heap free ratio (the HotSpot flag -XX:MaxHeapFreeRatio). It says that any time the proportion of free heap memory crosses a certain threshold, a certain percentage, the heap should be resized down to maintain that ratio. I hope I've explained this clearly enough that you can see where this is going: if we tune this parameter, the heap will actually be resized to more closely reflect how much memory the application is using. By default it's set to 70%, which is a pretty large percentage. But say we decrease it to 40%: then the committed heap memory will track the usage of the application much more closely, with a little overhead.
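As a rough illustration of that tuning, here is how the flags might be passed to a containerized JVM. The flags themselves are real HotSpot options, but the environment variable name depends on the base image (many OpenShift Java images honor a JAVA_OPTS-style variable), so treat this as an assumption rather than the project's actual configuration:

```yaml
# Hypothetical container fragment: make the JVM shrink its heap sooner.
# -XX:MaxHeapFreeRatio defaults to 70; lowering it tells the GC to return
# committed heap to the host sooner after application usage drops.
env:
- name: JAVA_OPTS   # variable name is image-dependent
  value: "-XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40"
```

Note that MaxHeapFreeRatio must stay at or above MinHeapFreeRatio, so lowering the maximum usually means lowering the minimum too.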
I took this idea to my mentors, and they said: you should test it out. And we did test it out, and by and large it worked. Originally I was a little disappointed with this outcome, because I thought I'd no longer get to help design a system touching all these different Kubernetes components, which would have been really cool. But then I realized this is actually very exciting, because we're much closer to the goal of helping people tune their applications to use the VPA. It turns out the problem is not really with the JVM; it's really just with how we're using it, and how we need to change that to work in this new cloud environment. Thank you very much.

Next up, Natasha Frumken, on data science on Prometheus metrics. And by the way, I forgot to mention, because I thought it was so obvious to everyone, but it's not: all four of the presenters you're going to see in this session were winners of the Boston office intern lightning talks competition, which was a very, very highly sought-after prize. It was tooth and nail; there was blood on the floor. But in the end, Patrick, Natasha, and the remaining two presenters you will see were the winners. And with that, take it away, Natasha.

Hi, I'm Natasha. I'm an intern on the AICoE team, and I'm also a junior at BU. My project is data science on Prometheus metrics, and its goal is to provide automatic alerting to developers when anomalies are detected in Prometheus data. Prometheus is a monitoring and collection system for time series data. It takes data from the many jobs you have running on different servers or different computers and streamlines them, and we can take all that time series data and do anomaly detection on it: first a training step, then anomaly detection, and ultimately alerting to developers.

Here's an overview of our data pipeline. We first collect the time series from Prometheus, then we train the model and forecast new data points. We take those forecasted points, compare them against the data coming in from Prometheus in real time, run anomaly detection on that data, and then send alerts to developers.

Here's an example of a metric we have from Prometheus, HTTP request duration. Again, this comes from many different computers and servers, so here are three different examples. What we can see, at least at first glance, is that the red time series has an irregularity; it's very high there, and that may be an indicator of something wrong with the system. But without domain knowledge we don't know for sure; maybe it is, maybe it isn't. We also see problems with seasonality in some places. Seasonality is when the time series oscillation changes over time, so we can see it going up and down regularly in some places and then showing a big difference somewhere else.

One of the training models we use is Fourier analysis; I have an example here because it's super fast, and we found it also has very good performance. Fourier analysis takes the training data, finds its dominant frequencies, and then extrapolates those same frequencies into the future as a forecast. You can see here that the yellow part is the forecast, and it's pretty close to the actual values that come in from Prometheus over those points. There are some differences, but we can do anomaly detection on that: look at the difference between the forecast and the real data and how similar or different they are.
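As a sketch of the idea (not Natasha's actual code), extrapolating the dominant FFT frequencies of a training window might look like this in Python; the detrending step and the number of kept frequencies are assumptions:

```python
import numpy as np

def fourier_forecast(train, n_ahead, n_freqs=10):
    """Extrapolate a series by extending its strongest FFT frequencies."""
    train = np.asarray(train, dtype=float)
    n = len(train)

    # Remove the linear trend so the FFT only has to model oscillations.
    t = np.arange(n)
    slope, intercept = np.polyfit(t, train, 1)
    detrended = train - (slope * t + intercept)

    freqs = np.fft.fftfreq(n)
    coeffs = np.fft.fft(detrended)

    # Keep only the n_freqs strongest frequency components.
    top = np.argsort(np.abs(coeffs))[-n_freqs:]

    # Rebuild trend + dominant oscillations over an extended time axis.
    t_ext = np.arange(n + n_ahead)
    forecast = slope * t_ext + intercept
    for k in top:
        amp = np.abs(coeffs[k]) / n
        phase = np.angle(coeffs[k])
        forecast += amp * np.cos(2 * np.pi * freqs[k] * t_ext + phase)
    return forecast[n:]

# Example: forecast 50 points of a noisy seasonal series.
series = np.sin(np.arange(300) * 0.1) + 0.1 * np.random.randn(300)
print(fourier_forecast(series, n_ahead=50)[:5])
```

Keeping only the strongest components acts as a noise filter; because the FFT of a real signal keeps conjugate frequency pairs together among the top magnitudes, the cosine reconstruction stays real-valued.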
But we run into one problem: what actually counts as an anomaly? As I mentioned before, what we would flag as an anomaly and what we would not is very domain specific, but the goal is to look at the time series, see what has happened before, and make good assumptions about what might be an anomaly in the future. So here's an example of a possible seasonal change; there might be other seasonal issues, or point-wise anomalies.

We've come up with a few anomaly detection techniques. One is the Gaussian tail probability, which looks at how the Gaussian noise in the previous window compares to the current window of time; if it's very different, we flag an anomaly and send an alert. Another technique is the accumulator function, which counts how many anomalous data points we've seen in a period of time versus how many normal points, and detects anomalies based on that. We then take those two techniques, look at which data points both of them flag as anomalous, and alert on those. You can see here that the red part is anomalous under both detection techniques, and that's pretty similar to what I, at least, would expect the anomalous points in this data to be. This still requires more fine-tuning over time, and we're working on making it better.
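Here is a minimal Python sketch of how those two rules could be combined; the window sizes, the tail-probability cutoff, and the accumulator vote count are illustrative assumptions, not the project's tuned values:

```python
import numpy as np
from scipy.stats import norm

def gaussian_tail_flags(series, window=60, eps=0.01):
    """Flag points in the far tail of a Gaussian fitted to the preceding
    window (a sketch of the 'Gaussian tail probability' rule)."""
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        prev = series[i - window:i]
        mu, sigma = prev.mean(), prev.std() + 1e-9      # avoid divide-by-zero
        tail = 2 * norm.sf(abs(series[i] - mu) / sigma)  # two-sided tail prob
        flags[i] = tail < eps
    return flags

def accumulator_flags(point_flags, window=10, min_count=4):
    """Confirm an anomaly only when enough recent points were flagged,
    filtering out isolated blips (a sketch of the 'accumulator' rule)."""
    flags = np.zeros(len(point_flags), dtype=bool)
    for i in range(window, len(point_flags)):
        flags[i] = point_flags[i - window:i + 1].sum() >= min_count
    return flags

# Alert only where both rules agree, as described in the talk.
series = np.sin(np.linspace(0, 20, 500)) + np.random.normal(0, 0.1, 500)
series[400:410] += 3                      # inject an obvious anomaly
point = gaussian_tail_flags(series)
combined = point & accumulator_flags(point)
print(np.where(combined)[0])              # indices flagged by both rules
```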
Here's a summary of the techniques we've been using this summer. I've been working on Fourier analysis, exponential smoothing, and ARIMA regression models, plus Prophet modeling, which was developed by Facebook; Subagit and I have been looking at these models, trying to see whether they work for our data, and Subagit will give a nice talk about RNNs later this afternoon. We've also looked at anomaly detection rules, and we found that regular thresholding is very time-series specific; it's not really that good for complex anomalies, and you have to train for it, so we've found it's maybe not ideal. We're looking into other anomaly detection techniques to see if we can better figure out where those anomalies are. Okay, thank you for listening.

Thanks, Natasha. Next up we have Hema Varati, who's also going to talk about AI and metrics, but with different stuff. Give a big hand for Hema.

Good morning, everyone. I'm Hema, and I'm a computer science master's student at Boston University. As a summer intern I've been working with the AI operations team, which falls under the AI Center of Excellence at Red Hat, and my summer internship project was the integration of different AI-enhanced metrics. Let me start off by explaining what exactly we do in AI operations. Are we building scary robots that are waiting to take over our jobs and tasks? Well, luckily, as of today that is not the case, and thankfully we're all saved from this. In AI operations we combine big data with the latest machine learning and AI technologies to enhance existing IT operations such as performance monitoring and correlation analysis, and even to introduce new features like anomaly detection. My project is closely related to the performance monitoring operation.

So what exactly are we doing to integrate AI into our projects? We've identified three different areas where AI has scope in our projects. The first is long-term storage of metrics, which my teammate Anand has been working on over the summer. By metrics I'm referring to Prometheus metrics, which are basically performance metrics; I'll talk more about them later. These metrics are scraped and collected on a daily basis, so we need efficient storage to collect them over a period of time. Secondly, now that we have this collection of historical data, we can train different machine learning models on it, and, as Natasha has already explained with the different models we're running, we can do various forecasting and prediction on the time series data we have. Thirdly, now that we have these predicted metrics and outputs, we can connect them to live Prometheus metric data; this is easily deployable onto OpenShift, and we can integrate alerting mechanisms as well.

This is the overall architecture of our project. As you can see, Prometheus is the major component, scraping and collecting metrics from different servers and hosts. We use Ceph as our storage for all these metrics, and from Ceph we pull the metrics into the Jupyter notebooks where the models Natasha and Subagit have deployed run. Once we have the output of these machine learning models, we register them as new Prometheus metrics, so that we can trigger alerts based on the anomalies we've detected.

So what exactly is Prometheus? Prometheus is an open source alerting and monitoring tool. It collects and generates a vast amount of time series data, and it supports four different types of metrics. A counter is a monotonically increasing value. A gauge is a metric value that can both increase and decrease. And then there are summaries and histograms, which are similar in the sense that they divide your time series data into specific buckets and quantiles; these are especially useful if you want to look at a particular window of your entire time series.

Why do we care about metrics, and why am I talking so much about them? Metrics are numerical information that help you understand how a particular job or task is performing, and by doing data analysis on these metrics we can drive better decision-making to improve the performance of our existing systems and services.

This is the current implementation of our project. As I mentioned, we're running different machine learning models, two of them being Prophet and Fourier, which Natasha talked about earlier. The outputs of these Prophet and Fourier predictive models need to be exposed as new Prometheus metrics, and for that I developed a Python Flask application. It's a simple web server application that serves and exposes the metrics on the default path that Prometheus listens to; it's easily deployable onto OpenShift, and the application is currently live and running; this is the URL, along with the source code for it.
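A minimal sketch of such an exporter, assuming the prometheus_client library and hypothetical metric names (the real service's names and update logic will differ):

```python
from flask import Flask, Response
from prometheus_client import (CollectorRegistry, Gauge,
                               generate_latest, CONTENT_TYPE_LATEST)

app = Flask(__name__)
registry = CollectorRegistry()

# Forecast bounds produced by the Prophet/Fourier models;
# the metric names here are illustrative.
predicted_upper = Gauge("predicted_metric_upper",
                        "Forecast upper bound", registry=registry)
predicted_lower = Gauge("predicted_metric_lower",
                        "Forecast lower bound", registry=registry)

@app.route("/metrics")  # the default path Prometheus scrapes
def metrics():
    # In the real service these values would be refreshed from model output.
    predicted_upper.set(123.0)
    predicted_lower.set(98.0)
    return Response(generate_latest(registry),
                    content_type=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Prometheus then scrapes /metrics on this service like any other target, which is what lets the model output be registered as new metrics.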
Once these metrics are exposed and scraped by Prometheus, we can hook into the alerting functionality of Prometheus, which is basically an alert manager where you specify your alerting rules in simple YAML files. These alerting expressions are like your if and for statements, and they support different mathematical and logical operators. For our project, the expression we evaluate compares your current metric value with the predicted Fourier or Prophet values. If the metric value is within the bounds of the predicted values, we say this is not an anomaly, this is normal behavior, and we set a gauge to zero; whereas if the metric value is out of bounds, outside the range of the predicted values, we set the gauge to one. That is how we detect the anomaly.
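A rough sketch of what such a rule file could look like, assuming the 0/1 gauge is exported under a hypothetical name like anomaly_detected (the actual metric names and durations are not from the talk):

```yaml
groups:
- name: aiops-anomaly-rules
  rules:
  - alert: MetricOutsideForecastBounds
    # Fires when the 0/1 anomaly gauge has been set to 1 (current value
    # outside the Prophet/Fourier bounds) continuously for 5 minutes.
    expr: anomaly_detected == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Metric is outside its forecast bounds"
```

The expr line is the "if" and the for line is the "for" mentioned in the talk: the condition must hold for the whole duration before a notification is routed.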
Once the rule is set up, you can trigger your notifications. We send notifications via email, so they can go to any user who wants to be notified, and we've also set up a separate Mattermost channel called AIOps Alerts: a bot that alerts you whenever anomalous behavior has been detected in your metrics. Here are some of the results. The first image is how the email looks; it gives a basic description of the alert, along with details on the metric it's alerting on and the metric's metadata label configuration. The second image is the gauge we set: it points to zero or one, one meaning an anomaly and zero meaning everything is perfectly fine and working normally. We've also used Grafana to create new dashboards and visualizations so that we can look at the anomalous behavior and get an idea of how our metrics flow over a period of time.

I'd like to conclude by saying that at AI operations we're not creating scary robots, but rather collaborating with the latest machine learning and AI tools so that we can enhance the performance of existing applications and introduce new, faster, more efficient technologies as well. Thank you.

Thanks, Hema. This is cool stuff, right? Not bad for nine minutes. All right, next up we have Michael. I'm trying to remember; Michael, what are you talking about? Natural language processing, unsupervised learning, and real-time analysis of logs; the natural language which I just failed to process. That's what Michael's going to talk about. Take it away, Michael.

Great. Thank you all for coming today. My name is Michael Clifford. I just graduated from BU with a master's in computer engineering and have been spending the summer at Red Hat as an intern in the AIOps department. The project I've been working on is titled, as you see here, natural language processing, unsupervised learning, and real-time analysis of logs. In short, the idea is that logs are generated at what seems like a billion a second, and we need some way to analyze them and detect which are anomalous.

Why would you want to do something like this? What is the point of it? Essentially, when something goes wrong with an application, there are two significant costs to your business or application, whatever it might be: developers and administrators have to spend loads of time going through logs trying to ascertain exactly what went wrong, and a temporary loss of service can be problematic, blocking transactions from going through or simply eroding the confidence of users. So the main question of this whole project is: is there a way to actually minimize the downtime of applications when something goes awry? And the answer is yes, of course: with automated anomaly detection.

So, a quick summary of what anomaly detection is. Here we have a data set of points in two dimensions, and you can see that in the far corner there's an anomalous data point; it just doesn't fit with the rest. We're all very capable of doing this particular task: humans, in two or even three dimensions, looking at something by inspection, can determine what an anomaly is fairly quickly. Even on a large data set, subject matter experts in a specific domain, for example English, can determine what an anomaly is: given the data set green, red, blue, and chair, any English subject matter expert will notice that chair is the anomalous data point. That's all fine and well, but once you start to deal with high-dimensional log data in natural language, maybe 80 or 100 different dimensions, or data that's not numerically based at all, how do you automate this process? How do you get a computer to figure out what the heck an anomaly is?

The way I've been working on this problem this summer really divides into three parts. You have a stream of semi-structured log data coming in; I say semi-structured because the data we work with is typically in JSON format, so there are fields and some organization to the data, but within each field, while there are some numerical values, a lot of it is text of one kind or another. That text needs to be encoded in a way that's meaningful and can be ingested into a machine learning algorithm of any kind. And particularly with the log problem, there are no labels; logs are generated so quickly that having a human-annotated data set of logs is, at this point at least, a little challenging. So we need some kind of unsupervised learning model and inference technique we can use.

Over the summer I addressed each of these problems with three tools. For our stream of real-time, or at least near-real-time, data, for the purposes of the proof of concept I'm working on, we're using Elasticsearch. There's a little bit of delay with Elasticsearch, so it's not true real-time, but for our purposes it suffices; for an actual production version of what I've developed this summer we'd probably want to move to something more like Kafka, but right now Elasticsearch has been good. We use it to monitor, basically in real time, as logs are sent to Elasticsearch: the application I've developed pulls down the Elasticsearch data and then uses Word2Vec to convert it into a vectorized representation.

One of the reasons we chose Word2Vec is that it's fairly popular right now, but another is that it's very good at maintaining the semantic meaning of different words. I'm not sure how familiar everyone is with Word2Vec, but one of the classic examples is that once you train it on a certain corpus of language, you can do something like: the vectorized representation of king, minus the vectorized representation of man, plus the vectorized representation of woman, is very close, in a cosine similarity sense, to the vectorized representation of queen. In this way it converts natural language into vectorizations that seem non-human-readable, but the semantic meaning is retained in a cosine similarity sense.
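For that classic analogy, a gensim-based sketch might look like this; the pretrained-vector file is a common public example, not necessarily the model the project used:

```python
from gensim.models import KeyedVectors

# Load pretrained word vectors (illustrative file choice).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~ queen, where "~" means high cosine similarity.
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))
# Typically prints something like [('queen', 0.71)]
```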
Once we do that, we take all the natural language portions of our logs and turn them into these vectorized representations, which can now be ingested into an unsupervised, or really any type of, machine learning application. We decided to use a self-organizing map for our particular purposes. A self-organizing map, also referred to as a Kohonen map, has been around for a while, but it's a pretty nifty way of mapping the different elements of your training set into a high-dimensional space. Essentially, it's a user-defined grid where each node on the grid has the dimensionality of the data set. For our purposes we're using a 24-by-24 grid, where each of our log vectors is 800-dimensional, so each node on the 24-by-24 grid is itself an 800-dimensional object. With our training data, we first initialize the map; there are a few different ways to initialize it, but we've decided to do it randomly. Then we take the training data and run it across the map, and as we go through this process, the map slowly conforms to, and generates a generalized idea of, what the underlying training data set looks like.

Once the map is trained, we can take new logs, ingest them from Elasticsearch, convert them with Word2Vec, and then compare them to our self-organizing map. What's great about that is we can compare them to the map using just the L2-norm distance, and that distance from the map to the log gives you a real-valued number for how similar your log is to the logs the map has seen before. We can use this number to decide: as logs come through, this one has an anomaly score of zero, it's basically a copy of something that already exists on the map, so we've seen this before and it's nothing to worry about; something else comes through and gets a score of 5,000 or whatever, which is pretty outrageous, and we say this is completely unlike anything we've ever seen before, so it can be flagged as an anomaly.

And once we do that, we need to tell somebody about it; just generating that number is kind of pointless. So we take the anomalous logs and push them back up to Elasticsearch, and, this is a prototype image, so it doesn't have full data, but we basically report the logs, their anomaly scores, when they occurred, what the message was, and all that good stuff.
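Here is a compact sketch of that training-and-scoring loop using MiniSom, one common Python implementation (the talk doesn't name a library, and the random training data stands in for the Word2Vec-encoded logs):

```python
import numpy as np
from minisom import MiniSom

# Dimensions from the talk: a 24x24 grid of nodes, each node the same
# dimensionality as the log vectors (800-d Word2Vec encodings).
GRID, DIM = 24, 800

# Placeholder for (n_logs, 800) Word2Vec-encoded training logs;
# in the real pipeline these come from Elasticsearch.
train_vectors = np.random.rand(1000, DIM)

som = MiniSom(GRID, GRID, DIM, sigma=1.0, learning_rate=0.5)
som.random_weights_init(train_vectors)   # random init, as in the talk
som.train_random(train_vectors, 10000)   # map slowly conforms to the data

def anomaly_score(log_vector):
    """L2 distance from a new log's vector to its best-matching node:
    ~0 means 'seen something like this before', large means novel."""
    bmu = som.get_weights()[som.winner(log_vector)]
    return float(np.linalg.norm(log_vector - bmu))

print(anomaly_score(train_vectors[0]))       # low: the map knows this log
print(anomaly_score(np.random.rand(DIM) * 50))  # high: nothing like it seen
```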
So, were we able to show with this particular proof of concept that we could somehow minimize downtime in downed applications? Hopefully. At this point we're able to generate a continuous ranking: any log that comes through gets a continuously ranked value for how anomalous we consider it. The output can be put in Elasticsearch and reviewed quickly in Kibana by developers. So: we generated a proof of concept for a tool that can automatically highlight the more problematic logs for developers and decrease mean time to recovery. Thank you.

Thank you very much, Michael. Oh geez, this is really loud. Okay, so in addition to these four excellent lightning talk winners, I would be remiss if I did not mention Amit Sonala, who is sitting back in the back here, who was our number one finisher; he will have an entire separate talk in this room a bit later today. Pearl Singh and Chloe Kabush were also winners; they'll be coming up next in our next 30-minute session. And finally, Lily Sturman, who is here somewhere, but I haven't seen her yet today, whose talk on tracing was also a winner; she'll have another session either later today or tomorrow. I think it's later this afternoon. So we've got a three- or four-minute break while we get the next crew miked up; they're going to talk about the ChRIS project and wrangling Kubernetes into doing what they wanted it to do. Thanks all very much. Quickly grab a coffee if you really want to, but don't be late.