took some precedence. My name is Daniel Riek. I work in the Office of the CTO at Red Hat, where, on the topic of AI, I run the Artificial Intelligence Center of Excellence, and we're talking about our open source AI vision. All right, so what are we talking about here? Oh, great, now it works. I'll use AI as a shorthand for the whole breadth of the buzzword: big data analytics, machine learning, cognitive systems. We'll talk primarily about machine learning, but I'll simplify by calling it AI.

Overall, we at Red Hat see AI as one of the biggest changes in the industry, probably the biggest since the industrial revolution, because a new kind of automation is coming, one that applies to IT itself. The key is that every human task can be automated if you can describe the mapping from input to output in a probabilistic way. In the software industry specifically, that means moving from humans understanding data and relationships and then encoding their conclusions in code, to machine learning models learning from the data and coming to conclusions autonomously. Humans are still involved in defining the models and getting the data, at least in the initial data engineering, but deriving decisions from the data is something the software does on its own. We call that a dynamic rules model, and you might not want it everywhere. But even in areas where you don't want that, because you need an auditable, very strict rule system, what we see is that the rule development itself employs AI technologies to make sense of the data. So even if humans are writing the rules, they will often use AI tools for the data engineering and the data analysis. In short, we see a lot of AI showing up in many of the things computers do in automation, and we'll go through some examples.

This transformation changes how we interact with software; it changes the role of software and code itself, and the role of data science. We have a name for these kinds of applications: intelligent applications. They are applications that collect and learn from data, and typically gather more data as you use them. There are common examples all of us use today, across the software industry and every industry that is software-bound, which is most industries: banking, anything with security, production automation, all of that.

Interestingly, most of the recent big advances in artificial intelligence come from a business use case-driven application of AI, combined with the compute power we now have and the availability of real-life data. There are a couple of examples where big advances were made this way. It was well understood that handwriting recognition could be done, but it became really usable when the post office provided large amounts of data to researchers to solve its problem of recognizing addresses on letters. And then it was solved. We all know the rating and recommendation systems; machine learning is behind those more and more, and it's a lot of data.
Image recognition is another example. In principle it was well understood that it could be done; in the last five to eight years we suddenly had the compute power, and then the data became available and was made available. It was driven by business use cases applying the understood theoretical capabilities, and that got us to where we are today. You have to see things in that light when Amazon offers a service like Rekognition, a facial recognition service you can use ready-made, and then goes to towns and police departments trying to get them to connect their surveillance cameras to that facial recognition system. They provide a free service and get free training data back. That's the dynamic driving the advancement of the technology here, and it's incredible how fast this is moving right now: how fast new applications appear, how fast boundaries get pushed, how fast algorithms and trained models get better.

We see this as the continuation of the recent changes in the software industry. From a hardware-centric model we went to "it's about the software", software is eating the world, then to cloud native, and from cloud native we are going to "data rules the world". An important point here is that in AI, code, models, and algorithms are really important, but because everything is driven by the data, no amount of algorithmic sophistication can overcome a lack of data. Data ends up being an equal to code: you now need two halves to make a program function. If I use AI in my development process, say for an open source project, I now need both the training data and the code to make it functional. So it's no longer software eating the world, it's AI eating software. Everyone with a business use case where you have any kind of automation, you have data, and you can derive the action the software has to take probabilistically from that data, will see trained models being used, either indirectly or directly taking action.

An interesting aspect of AI is that it's very open source friendly on the technology side. Most of the big things people are using are available as open source, and most of the research, even from the leading companies in the space like Google, Facebook, Microsoft, Tesla and others, is being published and contributed to open source. The problem is that if you don't have the data, that doesn't help you, because the code is non-functional without the training data. That's an important aspect when we think about what this means for open source, for example from a copyleft point of view: if you publish software but it's not functional, you are potentially in conflict, at least with the spirit of a copyleft license. And we see right now that a lot of companies are willing to share the code but see their proprietary differentiation in the data. That's a shift.

So what does this mean from a Red Hat point of view? We at Red Hat see AI from five perspectives. First, it is of course a workload for a platform. We are, to a large degree, an operating system and platform company, and with AI being such a big transformation, something every business is dealing with, we of course want to provide the best platform for that use case.
We're also applying AI in our internal processes, specifically in our software development process. A lot of the talks you'll see today look at how we can use AI, for example, to improve software quality through flake analysis, or how we can use AI in systems operations. You could point to a fancy, very flashy use case like self-driving cars, which is very interesting and a big deal, but it's also very hard, because the challenge is interacting with the physical world, and it really matters whether that white thing in front of you is a truck cutting you off or a traffic sign. One system got that wrong, and the driver unfortunately died in that scenario. So that's a really hard problem. At the same time, that car is basically a data center on wheels, with all of this AI running in the car, and you probably don't want a sysadmin in the trunk. So you want a self-driving cluster to run your self-driving car. And that is low-hanging fruit, because it's all supposed to be machine-readable already: we're dealing with machine-generated information, interpreted by machines, with machines taking the action.

Today, if you're in systems automation, if you're doing anything in SRE, you're writing a lot of automation scripts. A lot of what we are looking at with AI right now is how we can automate IT processes, the automation of computers: how we can move from static heuristics to learned models in core products, eventually in basically every device. And then in systems management, log analysis. Interestingly, you can apply a lot of the same algorithms everyone else uses: anomaly detection, clustering, the same things people in trading and finance are doing. When we use AI internally we of course follow our open source development model, so it will be available to the community and in general, but the customers don't necessarily know we're using AI. The products stay the same: you get your RHEL subscription, you get your security fixes. Code quality should improve, responsiveness should improve, things like that.

Then we embed it in the product and create services that are AI-based, at which point the customer actually becomes aware that they are being supported by AI. An example is Red Hat Insights, a predictive support service that looks at your system configuration and tells you if you have problems. There are rules in there that are human-generated, there are rules that are human-generated with AI enablement, and there are automatic decisions that happen based on AI. Something we demoed at Red Hat Summit, for example: Insights aggregates data from customers and reports your system status, and what we showed was the system comparing your configuration and your performance data against all the other customers we see. The example was degraded performance in a cluster, and the system told you: your performance is out of bounds compared to everyone else, you're an outlier, and your configuration is an outlier; why don't you change this, and your performance might get better. Then we pressed the button, it changed, and the performance got better. It's basically using AI, and the key is that it's not fundamentally different from our support service today.
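To make that outlier comparison concrete, here is a minimal sketch, assuming scikit-learn and entirely made-up per-cluster features; it illustrates the idea, not the actual Insights implementation.

```python
# Hypothetical sketch: flag clusters whose configuration/performance profile is an
# outlier compared to the rest of the fleet (illustration only, not Insights code).
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one customer cluster: [worker_count, heap_gb, io_threads, p99_latency_ms]
fleet = np.array([
    [3, 8, 4, 120], [3, 8, 4, 110], [5, 16, 8, 95], [3, 8, 4, 130],
    [5, 16, 8, 100], [3, 8, 2, 480],   # last row: unusual config, degraded latency
])

model = IsolationForest(contamination=0.1, random_state=0).fit(fleet)
labels = model.predict(fleet)           # -1 = outlier, 1 = inlier

for row, label in zip(fleet, labels):
    if label == -1:
        print("outlier vs. fleet, suggest reviewing configuration:", row)
```

In the demo, that kind of comparison is what triggers the "your configuration is an outlier, try changing this" suggestion.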
If you have a problem in production, you call Red Hat support. They have a lot of knowledge, they have our knowledge base, we have really good support people who know how to diagnose systems. But we are trying to augment them with AI that can process more data, along more vectors, more quickly, and give you a faster, or even predictive, before-you-call answer to "you have a problem there". It doesn't replace human support, but it gives you a lot of value before you have to escalate to a human, and human support can focus on the really hard problems where you can't extrapolate from statistical input. A lot of this is basically herd immunity: we'll be able to predict problems based on what we have seen before. For the things we haven't seen before, we still need human creativity to figure out what's really going on, but again, we can augment the humans with a lot of context.

So we'll put AI into the platform products themselves and into our support services, to turn our platforms into intelligent platforms. And then, for intelligent applications, we're providing the same capabilities we are using ourselves. We are running this on our own stack, with our own toolchain, and we're working with our ecosystem, the broader open source community and the commercial ecosystem, to create end-to-end AI solutions that we use to build this, and we make them available to our customers so they can build their own intelligent applications to serve their customers. So we are a user of AI, and we then become not an AI vendor in the classic sense, but an enabler, by providing open source technology that lets you build AI.

All of that is based on the foundation of data. It's the recognition, and the culture shift, away from being code-centric, which we are, like most software companies and the open source community today: it's all about code, our value is in the code, that's our mindset. That has to shift to treating data as an equal to code. It's data and code going forward.

What we're doing there starts with the usual enablement work. AI is an interesting topic because it's very hardware-bound; suddenly hardware performance matters again. In the cloud everything was about scale, and the vertical performance of an individual system wasn't the key differentiator. Now it becomes a key differentiator again, which is great for us because we have this hardware-enablement capability end to end. There's a lot of integration work, because AI rides on the shoulders of cloud, of microservices, of containers, and I'll talk a little more about how that looks in detail. So there's a lot of work going into integrating and enabling an ecosystem so things can work together, and we have a talk later where I'll go into more detail. We have a project called the Data Hub, which is basically a reference architecture for an end-to-end AI platform built on top of Kubernetes and Kafka, with S3 as the storage interface and Ceph as the storage in the Red Hat world. We're actually operating that inside Red Hat, and we're going to announce operating it in the Massachusetts Open Cloud for the community as a project.

So AI as a workload, as I said, is really interesting from a hardware point of view: you need end-to-end enablement, and not only performance. Depending on what you want to do, you also need data security end to end, because data becomes important.
In DevOps we have figured out how to manage the application lifecycle, the code lifecycle; that's where Kubernetes-based container platforms are thriving. Now we have to figure out how to manage the data lifecycle, because when you're training a model you need to keep the training data, for the completeness of your source and for reproducibility, but you also need it for audit, because the code on its own no longer explains what the software is doing. A lot of our customers are in regulated industries, have accountability requirements, and need auditability; for them it's really important that their application lifecycle management also captures the training data. You also need compliance when you're feeding data back. For example, when you train a model with data, that model inherits knowledge from the data, so if there's confidential information in the data, it can potentially end up in your model and be disclosed that way. So you need proper separation, with compliance and access handled consistently between data and trained models. An example: say you're a stock trading company, you have general market data you train a model on, and then you have customer-specific data, and for your large customers you provide customized models trained on their specific data and the behavior of their data. That model will be considered proprietary, confidential information by your customer; they don't want you to share it with other customers, because it might disclose aspects of their portfolio. So you need a fairly complex, consistent compliance model to make sure the entanglement of data and code gets handled. These are really interesting problems you have to manage throughout the whole stack.

On the hardware side, a quick announcement that is a good example: in OpenShift 3.10 we just started supporting device plugins, which is what you need to get at GPUs right now, for NVIDIA, but it's the generic feature in Kubernetes that lets you expose hardware capabilities so the scheduler is aware of them and also figures out how to make the right drivers available to the application, which is a bit challenging in containers: you need to know which version of the drivers is on the host to load the right version of the low-level user space. That's really important. It was a blocker: if you want to run machine learning at performance, for complex machine learning, you will need GPU offloading for most use cases, and that's possible now in Kubernetes, which is a big deal from our point of view because it gives you the ability to do the integrated lifecycle in the DevOps model.

On the core system, as I said, we're looking at enabling AI in the core system itself. Basically, that means we're trying to get developers to look at learned models instead of static heuristics. That's a research area, and I think we're going to talk about that a little today or tomorrow. Think of augmenting the scheduler; it's a good example. One thing being discussed is that in Kubernetes you have a scheduler, and you have a de-scheduler that will evacuate nodes if there's a performance issue or something. Right now they don't know about each other, so you can have a situation where something gets de-scheduled and then the scheduler puts it right back on the same node.
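Coming back to the device-plugin point for a second: as a rough illustration of what that support means from the workload side, here is a sketch using the Kubernetes Python client to request an NVIDIA GPU for a training pod. The pod name, image, and namespace are placeholders rather than a reference OpenShift setup, and the nodes would still need the NVIDIA device plugin and drivers installed.

```python
# Minimal sketch: request a GPU through the device-plugin resource "nvidia.com/gpu".
# Pod name, image, and namespace are placeholders, not a Red Hat reference setup.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a cluster

container = client.V1Container(
    name="train",
    image="example.com/my-training-image:latest",      # placeholder image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}                  # scheduler places the pod on a GPU node
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-train-demo"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```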
The scheduler and the de-scheduler are a perfect example where you can apply a statistical model, but you can also very quickly get benefits from learned models, because they can factor in more vectors than static rules can. You can have a scheduler that, for example, learns from the behavior of the de-scheduler. It's pretty low-hanging fruit, a pretty straightforward idea, that we think will improve system performance significantly over time. That's where we would move from static heuristics to a learned model.

Another aspect is AIOps and AIDev, where complexity just grows so much. An example would be flake analysis. If you're running a DevOps CI/CD system, you have a lot of tests, and what happens in many cases is that you see a failure, and the next time the test runs it goes green again, it works. Most developers will just call that a flake: it's a test flake, something in the system or the test was broken, let's move on, because it's working again. But there are situations where these are actually not flakes, not one-offs, not just a weird failure of the system. It could be a hard, deep-sitting problem in your code that only happens on a full moon, if someone tortured a black cat on a graveyard at midnight (not endorsing that, I like black cats). These weird situations are really hard for humans to reproduce, because they depend on complex intersections of data streams and system failures, in complex microservice systems and in the underlying infrastructure. With AI, we are able to process more of these vectors and see whether there's clustering: the same thing keeps happening, and somewhere deep in this vector of inputs you see a cluster of situations that coincide with the failures. It will tell you these are not actually flakes, because it really is the full moon; the log says it was a full moon every time this happened. Then we can tell the developers: this is actually a problem in your code, in the moon-cycle algorithm, and it goes off and does something stupid. So we're helping deal with complexity.

In the intelligent application space, what we're doing right now is primarily integration work: integrating existing Red Hat projects as well as community projects and ISVs. There's a website, radanalytics.io, where you can find some pointers to what we're doing there. One example where we're doing some work today is business process automation. There's the idea of robotic process automation, the "I'll learn what a human is doing and repeat it" approach, but we're looking at whether you can put artificial intelligence into business process automation; we think that's a valuable target.

I talked about data already. One of the key things we are seeing is the change in mindset, and in that context, if you want to change the mindset and you want to give people the ability to, for example, treat data like code, you need to figure out how to manage that, and you need to give them tooling and workflows, which today don't exist. There are a lot of startups and projects in the space, but there is no common workflow. So we are putting up the Data Hub, opendatahub.io, as a project to incubate and integrate technologies to get to an end-to-end solution that's completely open source.
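Before moving on, to ground the flake-analysis idea: here is a minimal sketch, assuming scikit-learn and an entirely made-up feature encoding of CI failures, of how clustering can separate correlated failures from true one-off flakes. It is an illustration of the approach, not our pipeline.

```python
# Hypothetical sketch: cluster CI test failures by their context vectors and treat
# dense clusters as "not actually flakes" (made-up features, not a real pipeline).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# One row per failed test run: [hour_of_day, node_load, queue_depth, retries_before_fail]
failures = np.array([
    [0, 0.90, 120, 3],
    [1, 0.92, 118, 3],
    [0, 0.88, 125, 3],   # three failures around midnight, high load, deep queue
    [14, 0.30, 10, 0],
    [9, 0.20, 5, 1],     # two unrelated one-off failures
])

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(
    StandardScaler().fit_transform(failures)
)

for run, label in zip(failures, labels):
    kind = "correlated failure, investigate" if label != -1 else "likely flake"
    print(kind, run)
```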
We're going to operate that Data Hub for the community, to give you a place to experiment, to put data, and to exchange data in the open source community, so we can start seriously putting AI into open source projects. The problem we are trying to solve here is the complexity threshold. You could say, well, all the code is open source, so I'll just grab a lot of AI tools, put them together, and then use my data. The problem is that in order to do that, you have to run a very complex stack. If I just want to use AI, I probably don't want to stand up the underlying infrastructure, figure out how to manage the GPU hardware, and things like that. It's a pretty high threshold to quickly put up a build service and application platform for AI. It's very easy to put up a Jupyter notebook on your laptop and do something once, but the moment you want to run an actual project on it, you run into this. And this slide is just to show the complexity: it's a simplified view of the internal stacks we are running to do AI experiments. Putting all of that together for an open source project is interesting, but it's not what you want to do if your goal is to be up there at the top, defining and researching AI models. Or maybe you just want to apply some well-understood anomaly detection. Say I'm the Home Assistant project and I want to build a Nest-like, open source alternative to Nest that learns my users' behavior for climate and HVAC control in open source home automation. Do you want to have to put all of that up first? No. So at Red Hat we're trying to provide this as an open source platform that projects can use to develop in that space.

I'll skip that one. Well, I'll go over it quickly. The problem we have right now is that everyone just goes to the cloud, because it overcomes the threshold. Amazon SageMaker is awesome: it gives you everything you need if you just want to do an AI experiment. You can put your data there (you'll be giving it up), and you can even put data there and share it with the world. But the problem is that, in the end, it's a black box; they reinvented the mainframe. You're running black-box services on leased hardware, and you have very limited reproducibility. You don't know what they're doing. The service abstraction is awesome because you don't have to learn how to do any of it, but you will never be able to actually do it on your own. For open source, that's a problem, for all the reasons I talked about earlier. So the risk, and that's a starved GNU on the slide, is that we are starving open source if we follow the black-box service abstraction model. We want to enable open source to run this with a commitment to full transparency on the code and on the data used in open source development, and a place to do that without having to do everything on your own. That's what we're trying to incubate with the Data Hub.

Where do we expect reuse? I'll go bottom to top. The bottom tier is the platform I showed: building and running containers, Jenkins pipelines, Kafka for streaming, access control, policy, all of that. Right now, everyone who is building an AI platform is building the same thing.
Back in the day, well, until recently, the standard was a Hadoop-style model where data analytics was a special case, a standalone cluster. We are moving very quickly to a converged application platform, because everything is AI-based going forward: all applications are intelligent applications, so you'll have a converged platform, which means you're running your applications and your data analytics in the same cluster. Kubernetes has already established itself as the default solution for that use case, so most people doing this are doing it with Kubernetes. Kafka is the data transport, S3 is the protocol for data at rest, Spark is pretty common. These are really common things that everyone in the space is doing, and we expect a lot of collaboration on integrating them, creating operators to make them easy to run on Kubernetes, and a lot of standardization into standard products in the space.

The second tier is well-understood AI functions, like anomaly detection or clustering, where we think you can actually pre-train models generically enough that they're useful for many people. So there will be a kind of predefined library of AI functions you can directly connect to. You can still take a model and train it further for your own purposes, or retrain it completely, but they're going to be pre-trained models, and we see a lot of interest, even in the industry, to collaborate on some of that.

One level up, you have the actual business use case, where you aggregate different models into an actual customer-facing function. That's where people usually see their differentiation. An example would be fraud detection in a financial business: if you're a big bank, you have your own fraud detection service and you're going to treat it as a trade secret, as a differentiator; if you're a small bank, you're probably contracting an external service to do it for you, but they will treat the model itself and the data as a trade secret. So we see less collaboration up there, but there are still options for it. In a way, you can look at Amazon Rekognition, the facial recognition service I mentioned earlier, as an example of something where they're not providing transparency, they're not doing it all in open source and publishing it back like we would, but they do get collaboration from all the people who send their data in to make the model better. So you see some generalization and commoditization of use cases even at that level, but much less than lower in the stack.

The common pattern we see: you have a secure data platform that abstracts from the hybrid cloud, basically Kubernetes, Linux, S3 and Kafka. We are using Ceph, OpenShift and our Kafka to provide that. On top of that you have DevOps application lifecycle management that expands to data. You're going to treat data as an equivalent to code here, and, for example, when you train or retrain a model, you'll package up not only the software, you'll package up the data with the software. Today we basically do that for reproducibility: most projects, when you build code, store the source code you built from, or store a reference, or package up a source RPM or a Debian source package, so you can recreate it.
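As a rough sketch of what packaging the data with the software could mean in practice, here is a hypothetical provenance manifest, not any specific Red Hat tooling: record a content hash and a reference to the training data next to the model artifact, so the pair can be reproduced and audited later.

```python
# Hypothetical provenance manifest: tie a trained model artifact to the exact
# training data snapshot and code revision it came from (illustration only).
import hashlib
import json
import pathlib


def sha256_of(path: pathlib.Path) -> str:
    """Content hash of a file, so the training set can be verified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


TRAINING_SNAPSHOT = pathlib.Path("data/metrics-2018-06.parquet")   # placeholder path

manifest = {
    "model_artifact": "models/anomaly-detector-v3.pkl",             # placeholder artifact
    "training_data": {
        "uri": "s3://example-bucket/metrics/2018-06.parquet",       # placeholder location
        "sha256": sha256_of(TRAINING_SNAPSHOT) if TRAINING_SNAPSHOT.exists() else None,
    },
    "code_revision": "git:0123abc",                                  # placeholder commit
}

print(json.dumps(manifest, indent=2))
```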
We're expanding that approach into containers now with a concept called source containers, and I think someone's talking about that at DevConf, because you want to have the aggregate concept. Then we are going to add the model for tracing data, so you can reproduce the exact function based on the training data used originally. On top of that you have common things like language runtimes, processing toolkits like Spark or Flink, and then some common services everyone uses, like messaging. We see most customers and most larger projects having a predefined library of AI functions; we do that internally. For example, flake detection is a service we have inside Red Hat that teams can just use: you connect to an endpoint, it's a REST API, you send your data, and it gives you results back, an analysis of your data. The use case here is that you're a developer or a QA person and you don't want to learn AI, you just want to benefit from it, so you use this predefined service. It's the equivalent of Amazon Rekognition: you can use it because you have pictures of faces; you don't have to learn how it does what it does. So, very simple enablement.

The analytics and private microservices layer is where, if you're a data scientist or AI developer, you create your own services: generalized services, or your own private microservices. And then there's a data science and developer toolchain that you would use to do that, Jupyter and so on. On top of that, general API routing, and then of course identity and access control. Talking to customers, looking at what the startups are doing and what's happening in the community, this is pretty much the pattern everyone follows and where we think we can get collaboration.

The idea is to create a meta-project. In the Open Data Hub project we're not going to drive individual deep AI projects; it's focused on making this easily accessible in an open source context, operating it, building up operational knowledge about it, and feeding that back into the projects. The goal is to give access to the community, and to support the academic community that today uses the Massachusetts Open Cloud, and the goal for Red Hat is to expand it to the open source community: to provide the Fedora and CentOS communities to begin with, and other Red Hat communities, with the capability to use AI tools at different entry points. Flake analysis for anyone in the Fedora community would be something pretty close, but then also the ability for you to do your own experiments and develop your own AI solutions. The entry point could be data only, an S3 entry point with some governance around it, like a GitHub for data, and then you build up from there: a container platform so you can run things, streaming services, an AI toolchain, TensorFlow availability, other AI toolchains, up to a full workflow or predefined services.

Just a quick example; I talked a little about this already, and there are deeper talks during the day and tomorrow by the people doing these experiments. For our own operations teams that run our cloud services, we are working with them to look at anomalies in their clusters. The problem is that traditional systems monitoring doesn't help you anymore.
You can try to create manual rules to filter things down, but either you get too many alerts or you don't get the alert at all; either way you miss the alert you were actually waiting for, because it's lost in the noise or your filter is too tight. That's a common problem. It's just too complex: there are too many things, too many interdependencies. So the idea is: let's take some well-understood AI algorithms, machine learning based tools, and train them on the data to, for example, find anomalies. If I train on the data of a cluster, and I have enough data, I can identify when unusual things are happening and alert on those, which is much more powerful than a static rule system for alerting. Flake analysis, I talked about that. The complexity of these systems is just too much for a human: it's hard to understand the complex chain of why something happened, or even to recognize that two failures are related, because the relation is buried in some condition that both share and that is not obvious to a human. A computer can find these things because it can just sift through all the data. Similarly, we're doing association rule learning. This is for the humans who write rules: we help them visualize relationships between things and give them candidate rules, so they can more efficiently derive knowledge, derive rules. Another example is that when you open a support ticket going forward, your support person will know your state of mind, because we're experimenting with sentiment analysis. It's pretty obvious that you would do that; it's really important to know how pissed off the customer is. The idea is to derive more information from the information we are already given, to go deeper into it.

So, to recap: for Red Hat overall, we think AI is an extremely important trend; more than that, it's a fundamental paradigm shift. I'm not joking when I say it's as big as the industrial revolution, and maybe beyond. We are focusing very much on the technology aspect and the application of AI, but overall it's a big shift because it's a different kind of automation. Traditional automation is: instead of swinging a hammer myself, I now have Ansible to swing many hammers at the same time on many nodes. I press a button instead of swinging the hammer myself, but it's still a hammer being swung in the same predictable way; it's just doing what I told it to do. Here, the machine is learning to do something based on data I give it and parameters I give it, and that means I don't exactly know how it's swinging the hammer, and I don't necessarily know why it chose that specific hammer. It's not 100% predictable for me anymore. I'm not between the machine and the action anymore; I'm just on the outside, putting input into the machine, and that's a big change. It also means there are certain types of tasks we probably won't have to do anymore at all, so on a societal level it's a big change, because in the past you could always move from swinging the hammer to pushing the button, but now there might not be a button to push; the button is automated away. So it's a big deal.
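Going back to the anomaly-detection example for a moment, here is a minimal sketch of the idea, with made-up metrics and scikit-learn: learn what normal looks like from historical cluster data, then alert on new samples that deviate. It is an illustration, not a production monitoring setup.

```python
# Sketch: learn "normal" from historical cluster metrics, then flag deviations
# (toy data and thresholds; a real setup would use far richer features).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Historical "normal" samples: [cpu_util, p99_latency_ms, error_rate]
normal = np.column_stack([
    rng.normal(0.55, 0.05, 500),     # CPU utilization
    rng.normal(90, 10, 500),         # p99 latency in ms
    rng.normal(0.002, 0.0005, 500),  # error rate
])

detector = LocalOutlierFactor(novelty=True).fit(normal)

new_samples = np.array([
    [0.57, 95, 0.002],    # looks like business as usual
    [0.58, 310, 0.020],   # latency and errors spike together -> alert
])
print(detector.predict(new_samples))   # 1 = normal, -1 = anomaly
```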
We think that everyone has to be aware of this shift, as a business, as a software project, as a developer, because it's changing how you interact with systems and with software, and it's going to change what your customers and users expect from the software. We see a very strong trend towards this hybrid cloud container platform, which we are fairly happy about, because with Kubernetes and OpenShift we've done a lot in that space. So the biggest priority for us is to make OpenShift the ideal platform to run AI and machine learning workloads, and to enable the broader ecosystem and the open source community, because Red Hat, as an open source company, depends on the open source community picking this up. We are not going to do this on our own; we are only one part of the overall equation, so we want to enable the open source community to get on board with AI and with the application of AI within the scope of open source projects, to make them better.

So that's it. If you want to find out more, we are going to publish a blog at next.redhat.com, and you can go to opendatahub.io; we should have updated that page. And radanalytics.io is a good place to find out what's going on with AI at Red Hat and get some quick starts and tools. Today in this room, and tomorrow in Metcalf Small, I think, there's an AI track with a whole bunch of good talks, and on Saturday there is also a data science workshop run by Mike McEwen and Will Benton; Mike is in the back there. I can really recommend it; it's a great workshop, built around Spark, on how to run data science and machine learning on top of Kubernetes and Spark. So, really good.

I don't know how much time we have for questions. I think we have five minutes, is that right? Ten minutes, excellent. We have ten minutes for questions and discussion. Any questions? In the back. I think we should have a mic; do we have the second microphone back there? Note to self for the next talk: get out the hand microphone.

Q: Hi. I'm just wondering, will you make this presentation available?

A: Yes, we actually live streamed it, the recording will be available, and we will make the slides available. Any other questions? Let me ask again: how many of you have done anything with machine learning? Have you used it just on your laptop, or done it in a workflow already? Both. How many of you are using Amazon for that? Maybe it's not as bad as I thought. Why are the Red Hat people raising their hands for Amazon? So, well, if there are no questions... I hope... ah, one more.

Q: Hi, I found your perspective quite refreshing. I have seen some quite chilling things coming out of the land of proprietary software, from companies such as Salesforce, when it comes to AI. Can you expand on the importance of doing this the free software way? Thank you.

A: So, why does free software matter here? I think there are multiple layers to that discussion. The first one is: if we want to do it in open source, if we want to enable open source to be AI-enabled, if Kubernetes becomes a self-driving cluster, we need to do it all in open source, because otherwise it's not open source. It's self-evident that for applying it to open source software we have to do it in open source. That goes for the code and the data, which, when you go into it, needs to be functionally complete. If we haven't provided the training data, then we haven't provided an open source solution to the problem.
So I think in our world we'll see a separation of data domains. There will be community data, and community data will always go back to the community; we'll see the development of licensing schemes for data similar to copyleft. Then there's your own data: Red Hat has created a lot of data, some of that will be open source data, some will not, and that depends on a lot of factors. Then there's customer data that belongs to the customer and is always going to be private, in many cases by regulatory requirement, unless the customer donates the data into open source, which of course is a possibility for them. For us it's really important that this is complete. Open source needs to be complete, self-hosting, reproducible. That means the data you use to develop open source, to put AI into open source, needs to be part of open source, otherwise you haven't done open source. It's very straightforward there.

Now, you can take it a step further; it's a general trend. If you're using black-box services, you're giving up control. Anything you do in proprietary software, you have less control than in open source; that's why open source matters. Open source is about enabling you to understand what software is doing, reproduce what the software is doing, and then redistribute that software. If you go into cloud services, black-box services operated in the cloud, that's the extreme case of proprietary software, because not only can you not change the software, you don't even know what it is, and you have no input on the operations side. If you then go into data services, where decisions are being taken based on data and you don't control the data, you can put your own data in, but you lose control over your own data when you do that; it compounds. So it's a control problem: you get into a very strong dependency when you use these kinds of black-box services, and you lose reproducibility. On top of that, say you're doing scientific research and you want to publish: how do you do peer review if you depend on a black-box service and you don't know what it's doing? You don't have reproducibility anymore. It's a business problem, a regulatory problem, an audit problem. In many cases you can offload it; if your service provider is HIPAA compliant, you don't have a HIPAA compliance problem anymore, but you're also completely dependent on that. That doesn't work everywhere; it doesn't work in research.

And it goes deeper, because your results of course depend on what goes into it. I don't like the term algorithmic bias, because I think it's misleading; it's primarily just the old garbage-in, garbage-out problem. It's not the algorithm; the math doesn't have a bias. But if you put data into a training model, the statistical characteristics of that data will show up in the model. So if there's an imbalance in the data you put in, you will most likely see that imbalance in the decisions the model takes. The problem is that if a model is part of your decision process and you can't validate the data, you're fully dependent on the people who select the training data to do it right. So even the correctness of the results depends on training the model right. And if you're using someone else's trained model, it could depend on weird things in the platform.
If I can't reproduce the platform I trained a model on, I can't guarantee that I can recreate the same behavior. Let's say I validated that it was correct, but there was a rounding error somewhere in the GPU because of a microcode problem, or in the driver, or it's just a parameter setting; if I have no control over that, I can't reproduce it, and I might not be able to reproduce the results. That's why I think it's extremely important, just for consistency, both for businesses and for researchers, to control, or at least have transparency into, the full stack, and to have the ability to get control over the full stack. Don't use services that won't tell you what they're doing, would be my advice.

And then you can take it a step further, because these systems become more and more important, and you can get quite philosophical or political about it. It's the old Lawrence Lessig argument; he's the Harvard professor who wrote about "code is law", probably 15 or 20 years ago, about how code becomes law, because the way we interact with the world today is limited by the code we use to interact with it. We're all in this room, that's great, we can talk unfiltered; the moment I post this on Twitter or on Facebook, there are already algorithms between us. If protocols don't talk to each other, they don't talk to each other; if algorithms filter things and don't show them, they don't show up. That's another, bigger transparency problem that I think only open source can overcome in the end: you want to make sure there's transparency in these kinds of decisions. And you don't even have to go to the big question for self-driving cars, the dilemma of whether I kill one or five people, which is hard for humans; I just listened to Sam Harris's book on that, and for humans the answer depends on how you ask the question in lab experiments, psychologically. For machines, we will have to figure that out; self-driving cars are not possible without taking that decision, they will take that decision, and I think that kind of decision needs to be transparent, and the only way to do that is with open source, and open source treating data as part of the code. That was a long answer; I hope it was helpful. One more question. Two more questions. Three.

Q: I was very interested in the idea of a trusted aggregator of data being a third party. Can you talk about some examples, or some of the really promising areas that might be there?

A: So right now the focus is on open source projects sharing data; that's where we're starting. Right now it would be, for example, operational data out of your cluster, or, my favorite example (if no one is doing it, I will do it): I'm running my home automation with Home Assistant, which is awesome, on a Raspberry Pi, on Fedora on the Raspberry Pi, which is important in this context, and it's great. Right now, for my climate control, I have to write manual rules, and with three thermostats and a bunch of factors like day, night, season, people at home, it's already too much; I can't keep up with it. So it's very low-hanging fruit to just train that, and that's what Nest does, it learns; I have no idea how they do it, I never owned a Nest, because it's a black-box service. So I'm doing it myself, and with the Open Data Hub you can very easily create
a place where you can put the data with trust: we can ensure compliance around it to make sure the data gets managed properly, and the open source project can now collaborate on it without having to build up the whole stack first. It would probably fall under GDPR because of personally identifying information, so you need compliance, you need the organization around it (I can't quite say that in English, but you know what I mean), and so we can work together to just make that more easily available. We're going to enable you to do it on your own if you want to; you can participate in the project, and it's going to provide all the tools to make this very easy. But if you just want to use it, we're trying to create an implementation that you can just use, while being transparent. I think we're out of time; two minutes, so one more question. There was someone in the back who wanted to ask one.

Q: I had a question about the practical approach of dealing with these large datasets. I can use AWS, I can write my code, which is open source, I can use PyTorch, which is also open source, and I can provide my data and just use AWS as a service and still meet all of your requirements. But when we're talking about Nest data, and self-driving cars, and medical data: we've seen from a very simple example, Netflix a couple of years ago with their million-dollar recommendation challenge, that even de-identified data from a very limited source is often enough to re-identify people. So when we're dealing with petabytes of data, how do you envision that people, without encroaching on the privacy of whoever is contained in that dataset, would even start to think about making that data open source? Because the whole premise of machine learning is: the more data you have, the better. Data is far more important than the model you throw at it; if you have enough data, you'll get somewhere. But that inherently seems to conflict with the idea of making that data open source, specifically for these kinds of models, text-to-speech or speech-to-text and everything else.
A: I don't know if the audio was strong enough, so I'll summarize: data is the key, we all agree, you need more and more data; the problem is that you can identify individuals even with very little data; we're going to try to aggregate a lot of data wherever we're doing AI; how can you do that in open source without compromising privacy? I think we don't have a complete answer to that yet. There are some techniques, like pseudonymization, that get you part of the way, and ultimately it's going to be hard; it depends on what you're doing. When we start with IT data, it's still fairly simple, and that's where we're starting out. When you get to medical data, it's already a big deal; there's a lot of medical data out there already being shared, and people are actually not aware how identifying it is. MRIs are actually identifying information, and I don't think they're covered in the regulation to that degree yet. There are going to be techniques to improve how you separate the identification from the data; that works for a lot of the simple use cases. For the harder use cases, I think we have to get better at secure compute capabilities, things like multi-party computation, research like that, so you get to an escrow model for data, where data is stored but not disseminated to everyone while something is done with it. That is in conflict with the concept of open source; there's going to be a compromise somewhere, and in some areas you will probably have to decide whether you go with privacy or with transparency. In a way, I think it could be an opt-in model. For something like a self-driving car, I don't see a reason to have private data there; I think you can probably make it anonymous enough to train the decision systems, or it becomes irrelevant enough because there's so much data about who was where. There might be things in that data that compromise individual privacy, but if it's everyone, I think that washes out a little bit. With medical data it gets more problematic: location data is one thing, but medical data, I think, is a harder problem. So, well, thank you very much.

We'll talk a little bit about the Massachusetts Open Cloud, then we're going to talk about which components of the Data Hub we put in there, and then we'll do a demo. For those of you that are in academia or run an open source community, this will be directly applicable to you: how you can put data into that environment, analyze that data, and feed it back into your applications. To start, we have Orran with us here today, and he's going to talk, for those of you that aren't familiar, about what the MOC is and what it's been designed to do.

All right, so I stole this chart from Dan, I love it. What is today's cloud about? It's basically a reinvention of the mainframe. It's got all these benefits, but what you've got is that you're leasing compute power in an environment you're locked into; it's incredibly expensive, you can put your data up into the cloud but, for instance, pulling your data out of the cloud costs you; and it's using open source but it's actually not really contributing back to the open source community. We know that the cloud gives you enormous advantages: elasticity for users, better operations for data centers, and we can locate these data centers where power is cheap. So a lot of people feel that the future of computing is in the cloud, but is it really going to be in these proprietary clouds?
(The slide lags for about a minute there.) We don't think so. We think an alternative model of the cloud is possible, what we call an open cloud exchange, where multiple different entities can stand up infrastructure and compete and collaborate with each other, where multiple different entities can stand up larger cloud services on top of that, and where we can stand up research offerings alongside these production offerings. When we first started having this idea of creating an open cloud, and I'm not going to talk about the technologies we've built to do this, we started talking to economists. There are already 48 types of VMs on Amazon; what's going to happen if we have something with thousands of types of VMs? That seems crazy and complicated. But what they told us was: you're going to get intermediaries, platforms like big data platforms, web platforms, HPC platforms, and in fact we're going to be talking about one of those intermediaries today, that can select between these thousands of different options in an open cloud where there's lots of competition and variety. So this is our vision of an open cloud exchange, and the idea is that once we create this in one data center, we can actually replicate it out to other data centers.

This isn't crazy. Current clouds are incredibly expensive. Since this is actually being filmed and put on YouTube, I can't tell you the numbers, but let's say it's well over 20 times as expensive to use one of today's clouds than if you took a modern data center, at least our data center at the MGHPCC, and amortized the cost of the data center over 20 years, the cost of the computers over 3 years, and the cost of the operations staff. It's incredibly expensive: the on-demand pricing, and even the lease pricing, is a lot more expensive than our cost to operate these facilities. Much of industry is locked out of today's clouds; there's obviously lots of great software to develop with, and there are lots of people that don't want to be locked into the cloud.

So we had an incredible opportunity in this region: the universities in the surrounding area had actually built a new data center. It's really up there, there we go, the MGHPCC data center. This is an incredible facility, built by a combination of MIT and the UMass system. It's 15 megawatts; I don't know if that means anything to you, but that's like the power requirements of a large town. It has two acres of space for computers; it's the only place I've seen where you measure computer space in acres. An incredible facility, located right next to a hydroelectric dam, so 70% of the power is green, and because of the nature of this, prices are incredibly cheaper than in Boston. So it was an opportunity to do this. There's a picture of us with the governor, announcing the project to create an open cloud; it was renamed the Mass Open Cloud from the Massachusetts Open Cloud. It's got all these universities participating in the effort, the Air Force, the state, our core partners; Red Hat has been a really key partner from the beginning, and a lot of other partners have contributed in various ways to the project. It's real; it's an operating cloud today. These numbers were as of last week, so it's a functioning cloud at a relatively modest scale: we have about 400 users directly, and the last time we figured it out, over 10,000 users indirectly using the service in various ways. This is a different chart, and it's resulted in tens of millions of dollars of grants, because this is one of the
few clouds out there where researchers can get involved, do innovation, and change things. And it's moving from a project that a small team developed as its own kind of isolated effort into something the IT departments are actually working on as a production service, because it's being used by real users that want to get their work done. Increasingly, what's happened is that people don't just want to use the data center for high-performance computing; they want to do data analytics, they want to do machine learning, the data science initiatives from all of our different universities. Two of the projects I'd like to mention that are just coming up now: the Northeast Storage Exchange, which got funded for 10 petabytes of storage to start off with, but doubled in size before it even started, so it's now 20 petabytes of storage that will be available for people to host their datasets from all over the region; and a dataset repository, the largest in the world, that today runs on AWS and is being shifted and moved to the MOC. So it's a going concern, and we're really excited about this project because it is our realization of an intermediary, and it's also something which is really in high demand by the users of the MOC.

All right, so like we talked about earlier, the Data Hub is based on a set of common services and common applications that are out there in the data science and data management world, and we see that breaking into three themes. We have our platform and workflow theme: these are what the operators are concerned with when it comes to how do I operate the data platform here at the bottom, how do I make sure I have my identity policies defined, how are we doing the management, operations, monitoring and alerting of all of those systems, and then how are we making this a self-service system so that our users can come in and do this on their own without constantly having to interact with the administrators themselves. From a workflow lifecycle standpoint, this is something DevOps has down pretty well, so this is based on Kubernetes, Jenkins, things like that, for managing the code pushes and the application lifecycles at that layer. Moving further up the stack, that's where we start to put in our reusable models and modules. This is where you have a team of folks putting in things like common libraries and common services that have reusability, whether they're in-house applications, customer applications, or someone else's bespoke application. This is where we look to draw a lot of the community investment and intellect, in how we deploy what we call our AI Library: a series of common analytics that folks can then use in their applications. Think of things like anomaly detection or flake analysis, anything a CI developer or an application developer can work into their CI workflow to help make them more productive, help them get to the root cause of failures quicker, make their code quality better, and help them iterate over the releases of that software faster. And then the third layer is at the top, which is basically where the custom development comes in. This is where communities, businesses and data scientists come in, take advantage of those shared services, write them into applications or chain them together in such a way that they're getting added value out of them in that environment. That's also where they may need to do their own data science experiments. So we can give you
So we can give you something common, a commodity, like flake analysis or correlation analysis — things that are pretty well defined — but a custom-trained model for fraud detection, or for natural language processing, or for image detection is not something we can give you at the commodity level; that's something you may want to train specific to your data. And that's where we want to enable that toolchain and that data science workflow: users can come in, easily get access to the environment, put data into the environment, analyze their data, iterate on that, and then publish that information out — whether it's a stored model that some application surfaces further downstream, or whether it gets bundled up into another service with an endpoint that other users can call into. That entire model lifecycle management is something we're looking to enable here. And now it updates — the bits take a little while, I guess, to get from back there up here. There we go. So these are the concepts behind the Open Data Hub in the Massachusetts Open Cloud. What the Open Data Hub project in general is designed to do is give you capabilities around data ingestion, normalization and storage; data exploration around reporting and analysis; and then analytics and lifecycle management around data science experimentation, publishing that into services, and managing the workflow therein. The Massachusetts Open Cloud piece is basically a meta-project around bringing together the technologies that comprise this platform. When you're out there in the industry, or when you're talking with folks, a lot of the technologies used to enable these capabilities are pretty common: everywhere you go you hear things like Kafka, you hear things like Jenkins; those have become commodities. What we're trying to do is take away the pain, for individual users or companies who want to stand this up, of actually having to host that infrastructure themselves. It's going to be a self-service model: you come in and take advantage of it. So what we're bringing together is a set of communities, vendors, users, operators and academics to do that in a fully open source way, where we get the benefit of everyone's intellect and everyone's experience — users know what they want to do with the system, operators know what it takes to really operate the system. Red Hat traditionally has been really great in the open source world at knowing how to write code, package code, and push that code; this is an evolution for Red Hat in terms of how we're going to open source operations and how we're going to open source the data management lifecycle entirely. And I'll hit next on that — the focus here is on reproducibility. That's why we're using open source projects feeding into this: the Open Data Hub is not writing a new Kafka, we're not writing a new service broker; we're using those and putting them together in such a way that it makes for a more usable platform. Things are going to be preconfigured, so a user coming in doesn't have to worry about where their Spark instances are or where their data storage is — that's going to be in there; all you need to worry about is coming in and actually doing your analysis. The other thing we want to do is let projects pick and choose which services they actually want to be encumbered with.
Not everybody needs the full stack. You may have data that's already sitting out there, hosted somewhere, and you just need to bring that data in and analyze it — because maybe you don't buy the expensive GPUs, but the MOC has them. So that's great: take your data, run it in the MOC environment, take advantage of that GPU horsepower, and then move on to whatever the next thing in your value chain is. You can come in at any layer of the stack you wish. And so, of all of those things we just talked about, the first use case we have targeted here — assuming it shows up; it's a brilliant use case — is basically around data science experimentation. This is an early-adopter environment right now, but it's geared toward the data scientist who has data, wants to come in and analyze that data, needs access to Spark or TensorFlow, and is comfortable with Jupyter notebooks. So what we've enabled is Ceph for our storage — with S3 — Apache Spark for how we do data management, TensorFlow, and then a series of Jupyter notebooks out there to take advantage of these capabilities. And so, enough talking — you actually want to see the thing work and how easy this really is, so that's what we'll do next. Let me put this up here; I'm kind of worried, we'll see how well this does. Alright, so the MOC environment: where this picks up is at the point where you have already gotten a login from the MOC team and you can actually get into the system — there's a request for that, you get a login and password, and you can log in. I'm going to hit refresh on this because it's probably going to take a minute. Alright, so I'm going to log in to the MOC environment. The first thing we're going to want to do is get some data up here — oh, invalid, wrong one, there we go, now we have valid credentials. So the first thing I'm going to do is upload data, just so you can see how easy it is if you're bringing your own data to the table; if you have data that's already hosted out there somewhere you can always point to it, but for the purposes of this we're going to add a new container — a test bucket, something like test-0817 — and submit it. Great, so now we've got an S3 bucket; everyone who's used Amazon or used S3 is familiar with what an S3 bucket is. Alright, we'll go verify I'm not lying to everyone here. Some things have already been done here: again, I've already logged into the system, and from that system you'll obviously need access to your individual API keys in order to get access to those buckets — that's all taken care of, and because this is live-streamed I'm not actually going to click and show you my API access keys, but understand that you click that button, it gives you your keys, and you can copy them and use them how you need to. The next thing you'll want to do is download the AWS S3 client so you can upload your data from the command line. From there you would do an aws configure — again, I've already pasted my keys in here, but you would paste in your access key, enter, paste in your secret key, enter, and we're just going to leave the region and format defaults for now. That tells the command-line client on your machine how to get access to the S3 storage. Now we're going to go in and do something with that bucket.
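(For reference, the same create-and-upload step could also be scripted instead of going through the web console and the AWS CLI. This is only a minimal sketch: the endpoint URL, bucket name, file name and credentials are placeholders, not the real MOC values.)

```python
import boto3

# Placeholder endpoint and credentials -- in the demo these come from the
# dashboard's API-key button and the cloud's S3-compatible (Ceph) gateway.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-moc.org",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="test-0817")                              # same as adding a container in the UI
s3.upload_file("weather.json", "test-0817", "data/weather.json")  # push a local data file into the bucket

# verify the object actually landed in the bucket
for obj in s3.list_objects_v2(Bucket="test-0817").get("Contents", []):
    print(obj["Key"], obj["Size"])
```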
Alright, so the first thing we're going to do — and just so you don't have to watch me type over and over again, we're just going to cut and paste these commands — is go look and see that our bucket actually did get created out there, and sure enough, there we see the test bucket sitting up there. The next thing I want to do is upload my data. In this directory here I have a data file, a JSON-formatted file — I think this is actually weather data — so we will upload that. For those who aren't familiar with the AWS commands, it's pretty basic stuff: you're uploading it to S3, you're telling it what bucket to put it in and, if you want it in a subdirectory under that bucket, that path, and you're giving it the endpoint you're uploading to — in this case the MOC environment. (Question from the audience — yes, this is S3 backed by Ceph.) Alright, so we have uploaded it, and just to prove to you that it actually is uploaded, we will view it: here we are looking again at that test bucket, and we can see the data has been uploaded. If you have more data it would take a little more time, but it's as simple as that to get data into the environment. So from here, this is where we're going to go into JupyterHub and start to analyze some data — and I'm going to bring up a fresh one just so you can see it from the get-go; let me log out so you can see it from the start. Okay, so this is what it would look like when you first come to the Open Data Hub JupyterHub. (Yes — it's the S3 protocol using object storage in the MOC environment, so the storage is all in the MOC environment; it's just using the S3 protocol for the storage.) Alright, when you first come in we'll click sign in — again, it assumes we actually have access, and you get that credential at the same time you sign up for access to the MOC; it's the same set of credentials that gets you into this project. Alright, by default when we come in — let's run the Spark one first, so let me stop my server and start a new one, and we'll start with the Spark example that we have out here. When you first come in you should be presented with which notebook image you want to run. How many people are familiar with Jupyter? Fantastic, then I won't bore you with what these things are. All of the notebook images you see listed here are things we're hosting out in the community; as we need to grow and support more things, we'll add those types of notebooks and make them available. We've started with some simple ones around using Spark and TensorFlow, and there are a couple of more generic ones in there, I think just using scikit-learn. So this should spawn — it's pending, let's go back to it, there we go, now we're up and running. Alright, we will open the Spark MOC notebook — can you read this? Let me make this a little bigger; is that better? Okay. I'm not going to go into the details of what's in this notebook; it was written by another individual, who might be in the room and who's giving a talk later about analyzing time series data. This is data that's monitoring Kubernetes operations on a running cluster, and it reports various statistics on those operations — I think he's giving that talk later today. The important thing in here is that, for your Spark configuration, Spark is already up and running inside the MOC environment: you don't have to worry about hosting Spark yourself, standing it up yourself, or configuring it yourself.
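(The notebook cell being described looks roughly like the sketch below. The environment variable name, endpoint and credentials here are placeholders — the point is only that the Spark master and the S3 settings are already provided for you by the environment.)

```python
import os
from pyspark.sql import SparkSession

# "SPARK_CLUSTER" stands in for whatever variable the notebook image exposes
# for the preconfigured Spark master; fall back to local mode outside the MOC.
spark = (SparkSession.builder
         .master(os.environ.get("SPARK_CLUSTER", "local[*]"))
         .appName("moc-demo")
         .getOrCreate())

# point the s3a connector at the same Ceph/S3 endpoint used for the upload
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "https://s3.example-moc.org")
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
hconf.set("fs.s3a.path.style.access", "true")

# read the JSON file uploaded earlier, straight from the bucket
df = spark.read.json("s3a://test-0817/data/weather.json")
df.printSchema()
```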
You can override configurations if you want more memory or you need special settings, but by default all you have to do is take advantage of an environment variable that's been set up for you — that's your Spark server, that's been done for you (someone's going to edit that out of the recording, sorry) — and then you just point it at the data, which is again preconfigured: we know where the MOC data is and we can execute against it. So now we'll just run all of it, and this takes about a minute or so to get all the way through, so we'll give it a minute. Again, all this stuff is built in so that we're easing the operations. I don't know how many of you have ever tried to stand up Spark, stand up all of your storage, get all of this configured, and only then get to the value-add part, which is writing the notebook and running it — this makes that process a lot faster. And again, if you had data already out there in Amazon S3 you can point to it; it's the same set of keys, the same access. And this is where we just wait and see how fast the machine runs — should take less than a minute. So I guess while that's going, any questions on anything you've seen thus far? Not even one question to help me with the time this is taking? Sorry, a little bit louder — yes, the Rados Gateway service, we have that; it's a pretty standard thing. And it's done, it worked — thank you for the questions, it helped. Now everything has run, and here are the results from all of our analysis; again, Shubhajit is going to talk about this later in the week, but the basic point is proven: Spark's out there running, the data was processed, everything worked just as designed. The next one we're going to talk about here is TensorFlow. Oops — that's fine, stop my server, alright, now we'll restart it and we'll do TensorFlow this go-round. The TensorFlow we're going to show right now is non-GPU, but we have GPUs in the environment and we can take advantage of those as well; again, it's just as simple if you're coming in to do a TensorFlow project — once this list shows up we will select the TensorFlow image and execute against that. It's normally a lot faster — must be this room; the experiments were going on at night, there weren't that many students around, is that what it was? We have this on a relatively modest-scale environment right now — do you want to talk about the scale we're rolling out to in the next while? Should I speak to that? So right now this is running on a limited number of nodes; a lot of the effort has been just getting this up and running and making it available. We've got a new environment coming up which will be a couple hundred nodes by the end, and this will be rolled out — we'll make it available to a broader community when we have that larger scale, so in some sense this is a proof of concept. Do you want to — exactly, so you'll see, and I'll put the slide up at the end: like that first slide said, right now we're in kind of the early-adopter phase, not opened up for the masses to just sign up and move forward. What we'd like to do is, if you have an interest — if your community has an interest, academia, whatever — we'd like to talk to you first, make sure it's a good fit from your perspective, that you understand what you're going to get out of the environment.
We want to make sure we understand exactly what we're going to expect from you as an early adopter, and then we can get you plugged into the environment and start working on it. So it's out there and running, but right now we're gating access to it until, one, we make sure we have all the right requirements for what most people are asking for — we've made some assumptions based on the common use cases we've been presented with to date around data science and the kind of work people want to see supported, and we want to validate that before we just say, hey, here it is. The second part is that we do want to put it on a larger-scale environment, knowing that once you open it up you're going to get a lot of things coming in and hammering away on it, and the current environment is not designed for that level of interactivity — so that upgrade is going to be going on. So again, the MOC is not intended to compete with Amazon or anything; what we want is an environment where, first of all, we can support all these research uses, and secondly, where the open source community and the research community can actually work together on things. This is going to be the one platform like this where the information about how it runs can come back to all the different open source communities. So yes, we are building a charging model; right now we don't actually have that integrated, because at the end of the day you have to charge for things if you're going to open up to a large population, and we will be opening it up — but we intend it more for the open source community, the research community, and startups in the region; our goal is not to be a competitor to Amazon or anything like that. Does that make sense? Alright, and so this is now loaded. This sample you can download from TensorFlow — again, no magic in the code; if you want to know what's in it I can give you the code and you can read through it — but here we will just do a run-all. This is going to download some data, and it runs through pretty quickly; at the end we'll see some totals. Which one are we doing — are we still running? Oh, downloading. Alright, importing. Alright, so we're training models right now, and it goes through pretty quickly, and at the bottom — here are the various runs through the neural network and the training accuracy, and voila, it's finished. So, access to TensorFlow. Sure, we're going to start again — so ultimately, once you wind up with something you're happy with, let's say somebody wanted to do this in Alabama: is everything going to be open source clearly enough that somebody can replicate it? Absolutely, yeah — fantastic question. So we've seen that you all believe me that it works and it's real; I will go back to this real quick — maybe present, there we go. Okay, so to your point: if you want to stand the same thing up, opendatahub.io is the upstream community where all of the APBs and operators to actually deploy this are being pushed. The stack that's up there today is the one we just demonstrated; you can take that and deploy it on OpenShift, and that's going to continue to grow. We're looking for people to help contribute to that and collaborate with us — there's a lot that goes into what it takes to actually run this at scale, so there's a lot of trial. We have it running internally at Red Hat — I think you saw Daniel talk about that — we have a specific scale we're operating at, and that's continuing to grow and change.
We're adapting that to another scale compared to what we're running internally, and we're learning from that — and I'm sure there's lots of experience elsewhere, in the audience and out there, that we want to take advantage of to make sure that what we have is a truly hardened environment that can stand up to everyone poking at it. But then yes, you can take those APBs, hit the go button, and it deploys. So maybe a good thing to add: we've been working really closely with Red Hat, and we have been finding a lot of problems as we stand up OpenStack, and OpenShift on OpenStack, and have users using this — because a lot of the time the development community, the open source community, isn't in a position to operate things themselves at that kind of scale. The experience of doing that at those layers has led to the data hub now going forward, and that's a really important initiative: instead of having a decoupled effort, Red Hat and the MOC are working together to make it one deployment, with a much tighter loop on the feedback about what has to change. If you're interested in being an early adopter, here's my information; just contact me and we'll start the conversation about what's involved in the early-adopter cycle and get you access. One of the other things — and I didn't mean to gloss over it — is that the data hub is bringing together all the other open source communities. We're actually taking these bits from elsewhere: one of the Spark components comes from radanalytics.io, which I think you're probably familiar with, or have at least heard mentioned a number of times. For anyone sitting in those workshops this week, that's the exact same set of bits we're deploying up here, so we're consuming all the same stuff being talked about this week. Alright, and that is all I have for slides and demo, so we finished perfectly on time for questions, if there are more questions. Do you have a mic? Sorry, I couldn't quite hear the question — oh, well, OpenStack is a virtualized environment with KVM, and you run VMs on top of it; OpenShift is a Kubernetes environment which runs containers; and we actually run OpenShift on top of OpenStack, and the data hub on top of OpenShift, on top of the hardware. Any other questions? No? Great — oh, yes. Yeah, correct, but by having this here, instead of Red Hat only gaining the experience of their internal users, they'll have gotten experience from a whole bunch of outside users using this in ways they didn't envision, which will be a lot better for companies later on deploying it, because the project will be much more evolved. And I don't want to understate that point: I don't know how many folks in here work in operations or IT, where you've tried to stand one of these platforms up as a service for your users and have gone through the growing pains that come along with that. The data hub you just saw demoed is not at all what the data hub looked like a year and a half ago when we first started standing it up internally. There is a lot involved, a lot of learning that goes into it — so yes, you certainly could just take it and do it yourself, but it really does benefit from the masses of people contributing to it, to help harden it, make it more scalable, make it reliable, and work toward even that self-healing model. We're going to be deploying the AI library that Daniel referenced in here as well.
So there's going to be a set of preconfigured models that you can just download, and they'll be available for you to start calling into — all of that is going to be in here, and it all benefits from the open source community approach. Just to go back to what I said at the very beginning: open source — we've all been part of that community, at least many of us, for many, many years — but open source isn't enough anymore. The clouds are deploying a lot of open source software, but the learnings of how to deploy these things and how to operate them, and the diversity of services, is something which is actually locked into those clouds. What we're trying to do is get to a model where we can actually start offering these things at scale — the open source community offering them at scale, with real users — so that we can replicate them to other regional data centers and even back to the enterprise. Without that, these clouds are going to be the way we lock ourselves into the big proprietary clouds. Alright, well, thank you. You want to turn it on for a second so it doesn't miss that — you wanted me to come on — this goes back in — hi guys, this is Shubhajit here, am I audible? Okay, so today I will describe how to forecast time series with machine learning and deep learning algorithms; we use this in our production system for alerting. First, what is a time series, and why is it so important to know them and to forecast them? A time series is a recording of a system's behavior as it changes over time, and it records abnormal behavior as well. We use this recording structure for any series-based application: financial trading systems, medical devices, monitoring of software and hardware systems, cryptocurrencies, up to autonomous cars. The important part is that, say for a financial trading system, we are recording stock market data, which has recorded all the abnormalities of the stock going up and down. At Red Hat our use case is monitoring software and hardware systems. The image shows the ideal time series, a sine wave — unfortunately in the real world that doesn't really exist. So why is time series forecasting required for metric analysis? First of all, health and reliability: a time series is recorded over time, at every time unit — say it is recorded at one-minute intervals — so it can record all the abnormal behaviors of the system; you just have to find them and build some smart alerting for the future. Second, reducing false alarms: system admins get emails like "the system is down, maybe some abnormality is going on", or in the stock market case, fund managers get alerts like "this fund is going downwards". There can be false alarms — they fall under the false-positive category of a classification problem — and we don't want that: we don't want any fund manager to take a selling decision because of a false alarm, because that would be a loss, and in our systems domain we don't want to disturb our system admins when the system is working perfectly normally but our alerting just hasn't adapted to the current trend of the data. That is reducing false alarms. And third, a glance at the future: with a good forecasting algorithm we will at least know the direction of the data — so suppose that in the upcoming week the system will be heavily used by some operation; then some higher values of the metric are not an anomaly.
Or, say, for a stock: in the upcoming week the market for that stock is in an uptrend or a downtrend, so a normal dip is not an anomaly there. That is the glance at the future. The main challenge in time series forecasting is stationary versus non-stationary series. Stationary is the green one — that sine-wave kind of signal where you find a uniform mean and standard deviation throughout the series — but as I was saying, that ideal is not normal in the real world, so we get the red one, where the time series has an increasing or decreasing trend over time, and in those cases you cannot depend on the mean and standard deviation you compute from the data, because they change over time. This is the abstract of the system we are planning to build — some parts of it have been completed by our interns as well. We get the data from Ceph storage — the Prometheus metrics data; previously it was in InfluxDB — and we created data connectors. After passing through the data connectors we have a structured version of the data, like a PySpark DataFrame or a pandas DataFrame: before the connector it is storage-specific, after the connector it is the same. After that there are some transformation steps before we can run the forecasting algorithm — we will see shortly what those transformation steps are — and then the machine learning module comes into the picture. We used various kinds of models: the first was Prophet from Facebook, the second was TensorFlow and Keras, and Natasha is also working on some different models. After model building is completed we extrapolate into the future — that is the prediction part — and then we store that data back to some secondary data storage, which can be Ceph or anything else. Then we pull the data from the secondary storage and spin up alerts. The data science part runs up to the future prediction; after that it is mainly the application layer, so I will describe the data science part in more detail. What are the components of a time series? A "normal" time series is not actually normal — it has lots of abnormality — and there are mainly four parts: trend, seasonality, cyclicity and irregularity. Trend is visible to the naked eye: you can see the time series pattern increasing over the window, and it does not decrease in that window — it is always increasing. Seasonality is the values increasing and then decreasing again after some finite time unit; it may not be visible to the naked eye, but you can detect it with statistical methods. Cyclicity is like seasonality, but irregular — seasonality normally happens at a regular interval, cyclicity does not. And irregularity is absolutely irregular behavior: you can see there is a unique spike that does not occur anywhere else in the future or the past. With any forecasting algorithm you can predict the trend and the seasonality, maybe some part of the cyclicity, but never the irregularity — irregularity is automatically abnormal behavior. So let's see what the forecasting algorithms are. The first one is linear regression — every data scientist starts from it — but unfortunately it is not that useful for this kind of series where there is a trend in the data; if it were a normal regression problem maybe you could fit a quadratic model, but not with a time series. Next is exponential smoothing: as I showed, in real time series data there is an increasing trend, so the mean and standard deviation change over time.
What exponential smoothing does is assign weights to the current data versus the past data, so the next value gives more weight to the recent past than to the distant past: your distant data points fade out and the recent data points have the most effect on the future values. But exponential smoothing also has a problem: it can forecast at most one or two steps into the future, you have to recompute it every time, and it is not that useful when the data size is very big. The Holt-Winters model adds the trend and seasonal parts to exponential smoothing, but the forecast it produces we cannot use in real time, because it lags behind the real data by some lag units, like 2 or 3 lags. And ARIMA was the most sophisticated model until about five years ago — now it is not. ARIMA has three parameters to tune in the normal version, and there is seasonal ARIMA as well, but all of these models don't cope with the big data situation; they fail to leverage big data. Okay, so what are the more recent approaches, from the machine learning world and, let's say, deep learning for time series? The first is the Prophet model. It is good for quick prototyping, a kind of revolutionary model that was created by Facebook, and it is totally open source — there are lots of chances to contribute, you can find that. Prophet is an additive regression: it models and predicts at the same time. The second is the recurrent neural net. It is part of the neural net family and it learns the temporal features of the data, because it is recurrent in nature: whatever you feed into the model, it keeps track of what to forget and what to remember in order to predict the next output. Now you will see in more detail what Prophet gives us. Prophet can cope with the trend and seasonalities in the data; it fits the data and extrapolates at the same time, and it gives us three or four kinds of plots. On the left, it is fitting and extrapolating at the same time: the black dots are the actual data points in our time series, the continuous blue line is the predicted value, and the light sky-blue band is the confidence boundary. You can see that where the anomaly was found, the confidence interval also jumped a little bit, so it keeps track of those anomalous data points. On the right side you can see three plots; these are mainly the trend plots. As we can see in the real data, it has an increasing trend — maybe at the end of the data there is some decrease — and the trend graph shows the same: the first one on the right is increasing throughout the time. The second one is the weekly seasonality: you can see the weekdays are much higher valued than Sunday and Saturday. When I made this graph, the data I had showed higher values on weekdays — meaning the system metric we were capturing got higher values on weekdays, because the system was probably at rest on the weekends. The last graph is the daily seasonality: you can see it forms something like a sine wave, but don't think the data actually looks like that — that is the seasonality component of the daily trend. What is happening is that every four hours the values shift, so it is higher valued at 4 in the morning — or maybe the afternoon, I can't remember — but it is shifting every four hours.
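(For anyone who wants to reproduce this kind of fit-plus-components plot, the Prophet usage being described is roughly the following — a minimal sketch that generates a synthetic minute-by-minute metric instead of the real Prometheus data.)

```python
import numpy as np
import pandas as pd
from fbprophet import Prophet   # the pip package was named "fbprophet" at the time of this talk

# stand-in for a minute-by-minute metric: ds = timestamp, y = value
ds = pd.date_range("2018-08-01", periods=7 * 24 * 60, freq="T")
y = 10 + np.sin(np.arange(len(ds)) * 2 * np.pi / (24 * 60)) + np.random.randn(len(ds)) * 0.1
df = pd.DataFrame({"ds": ds, "y": y})

m = Prophet(interval_width=0.95)                               # the light blue confidence band
m.fit(df)
future = m.make_future_dataframe(periods=24 * 60, freq="T")    # extrapolate one more day, minute by minute
forecast = m.predict(future)                                   # yhat, yhat_lower, yhat_upper per timestamp

m.plot(forecast)                                               # black dots = data, blue line = prediction
m.plot_components(forecast)                                    # trend, weekly and daily seasonality panels
```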
These are all very useful points for a time series, because when we look at the real data we don't see these trends — we cannot see them with the naked eye — so these three plots help us identify them. And for a system admin we can say: suppose on a weekday they are getting some higher values — that may not be an error situation or an anomalous situation, because you usually should get higher values on weekdays; but if you get higher values on Saturday or Sunday, maybe that is something to look at. Now you will see the prediction part of Prophet. There is a lot more data here: the blue spikes are the real data; the green line, if you can see it, is the actual predicted value, the y-hat; the red one is the lower confidence boundary and the purple one is the upper confidence boundary. The problem is that Prophet doesn't predict very well when the prediction interval is very small, like a one-minute interval — this data is taken at one-to-two-minute intervals — and you can see it has missed almost all the spikes. That is the problem with Prophet, and that is why it is still an experimental model for us. We can try different Prophet parameters, but where Prophet is very useful is in finding the trends and seasonality in the data. Next we will see the residual analysis. Residual plots are created from the actual minus the predicted value, and the main hypothesis of residual analysis for time series is that the residuals should follow a Gaussian distribution after modeling. If you cannot find a Gaussian distribution in your residual data, then you are doing something wrong — very wrong. But if you can find the Gaussian pattern, the bell-curve pattern, then you are probably doing something right: your model has found the seasonality and maybe the trend; there may still be some problems you need to work on, but the assumptions are right. The left one is the actual residual, which I zero-centered by subtracting the mean and dividing by the standard deviation, and it shows the bell curve — so that means we can use the Prophet model for time series, at least for finding the trend and seasonality. The shortcomings: there are some special spike lists you can give to Prophet, like a holiday list. Suppose you are predicting something like hotel reservations: on holiday dates, and maybe special event dates, there will surely be some high or low spike in the reservation series, and you have to give that data to Prophet manually — there are, say, 10 days in the event list and you have to adjust the model for them. That is a manual process, and we can't do that in a production system — we should not do that — and it is still experimental and may not be very accurate for minute-by-minute prediction. So, what is an RNN now?
The RNN gives us good promise over all those previous models: it learns the temporal context better than any of them, and it can work with big data. Prerequisites: I have used Python 3.6+, TensorFlow 1.9 and Keras 2.2. Why Keras and not TensorFlow directly? For whoever is starting with deep learning: TensorFlow is the widely used machine learning framework because of the community support and the examples, but it has some complexity around data dimensions — when you have multi-dimensional data you have to specify those dimensions correctly in the TensorFlow API, otherwise the model will not work. Keras does all of that automatically through its own internal layers and gives you a very simple architecture to implement any model; you can stack layer on layer very easily, and if there is time I will show how to do that. RNN architecture: there are mainly two types of RNN cells, GRUs and LSTMs. GRU stands for gated recurrent unit and LSTM for long short-term memory. LSTM is better for time series because it can forget — it's a little contradictory to what I was saying, but truly it can forget what it needs to forget, it has a very nice mechanism to do that, and it learns over time what to forget and what to remember. Next, putting the data in a supervised learning format: what we saw in the previous slide is a continuous numeric series, maybe with float values, and we need to convert it into a supervised learning format. Say we have a series X1, X2, X3, X4, ... up to Xn. We take X1, X2 as the past and predict X3 as the future; then X2, X3 as the past and X4 as the future. So it takes consecutive items from the past and predicts the next one — I'm not saying I used only two values from the past, it was more like 50 or 60, but that is the technique for converting a time series into a supervised learning format. Train, validation and test split: whatever forecasting you are doing, whatever model you are applying, you always divide the data, say 80/10/10, or 95/5 if you have big data. Stationarity: as I showed on the first slides, real time series are non-stationary — there is no constant mean or standard deviation over time — so you have to convert them somehow into a stationary series, and we do that by taking the one-lag difference: X2-X1, X3-X2, and so on. These are the transformation steps in our data pipeline. In the image of the recurrent neural net you can see it is an N-to-one recurrent net: it takes N inputs and gives us one value, just what we want for our time series. X1, X2 and so on are the time steps — you could say the features — going through the different layers, and then it predicts y. Now the LSTM part. The LSTM is a little more sophisticated than a plain RNN: on the left you can see the picture of a plain RNN, and the LSTM is a bit more complex. The complexity comes from the input gate, forget gate and output gate — I told you it learns how to forget, so the previous time steps are forgotten according to its optimized parameters. In these mathematical equations, f_t is mainly the forget equation, and W_xf is the weight going from x to f — weights in a neural network are named like that: the suffix xf means it goes from the x layer to the f layer, and the connection is the weight matrix W_xf. When the algorithm learns, it optimizes W. The forget gate is responsible for removing information from the cell state.
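(The differencing and windowing transformation just described might look like this — a minimal sketch assuming a one-dimensional numpy array of metric values sampled once per minute.)

```python
import numpy as np

def to_supervised(series, seq_len=50):
    """Difference the series once (x[t] - x[t-1]) to make it roughly stationary,
    then slice it into (past window, next value) pairs for supervised learning."""
    diffed = np.diff(series)
    X, y = [], []
    for i in range(len(diffed) - seq_len):
        X.append(diffed[i:i + seq_len])
        y.append(diffed[i + seq_len])
    X = np.array(X)[..., np.newaxis]     # shape: (samples, timesteps, 1 feature)
    return X, np.array(y)

# stand-in for the real metric series
raw_values = np.sin(np.arange(5000) / 20.0) + np.random.randn(5000) * 0.05

# 80/10/10 split, keeping time order (no shuffling for time series)
X, y = to_supervised(raw_values, seq_len=50)
n = len(X)
X_train, y_train = X[:int(0.8 * n)], y[:int(0.8 * n)]
X_val, y_val = X[int(0.8 * n):int(0.9 * n)], y[int(0.8 * n):int(0.9 * n)]
X_test, y_test = X[int(0.9 * n):], y[int(0.9 * n):]
```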
What you are seeing on the left and the right is a single cell of an LSTM; multiple cells are connected — multiple neurons are connected — and one neuron consists of this cell. It depends on the previous time step, which means a single cell is updated from the previous time step, the current input, and the weights that have been optimized. These five equations are mainly the forward propagation part of the algorithm; I did not show the backward propagation, but that is how it optimizes these equations, by partial derivatives. Now you will see some hyperparameters to tune before using an LSTM — if you just put an LSTM block in front of a time series it may not work well, because there are lots of knobs to be tuned, and among them I have mentioned six major ones; you always have to do some cross-validation over these six. Sequence length is how many past time steps are required to predict the future — I have used 50 time steps, which means a 50-minute trend is used to predict the next time step. Next is mini-batch size. If you don't use mini-batches and you have, say, 1 billion records, then before doing any optimization of the weights it will see all of those 1 billion records and only then do the first epoch and optimize the weights — that slows down the learning, and the weights don't learn enough to give us optimized values. So we should always use a mini-batch size; it has multiple good points, like being able to use data much bigger than our RAM, and the learning algorithm optimizes much faster. And we should always use a power-of-two batch size, because our computer architecture is built that way — it works best with powers of two. Optimizers are the optimization functions for the learning part: mainly we have SGD, RMSprop, Adam and Adagrad. The basic one is gradient descent — stochastic gradient descent — and RMSprop, Adam and the rest are all built on top of that with some modifications. Most of the time, from my experience, Adam does very well compared to vanilla SGD, but RMSprop is also very good for regression problems; I did the cross-validation and Adam won that time. Activation functions: what is an activation function? Normally any neural network does only linear transformations — multiplying some numbers with other numbers — and that doesn't learn anything. Without activation functions, whatever layers you build, however many hidden layers you give it, it will always be a linear transformation; when you put an activation before going to the next layer, then it can learn some complex pattern, because that is the non-linearity. We have mainly five activation functions: tanh, ReLU, Leaky ReLU, sigmoid and the linear function. I will only describe ReLU and Leaky ReLU, because sigmoid you all probably know — that is the activation function for binary classification problems — and linear is nothing: whatever the value of x, it gives the output as x. ReLU and Leaky ReLU are a little interesting: ReLU will not give you any negative value, so if after the transformation with the weights your values come out negative, it will dump them — it shrinks the feature space — whereas Leaky ReLU gives us a very small number when the values are negative. I have used Leaky ReLU in my study, because it does not dump the input: if you use plain ReLU for a regression problem it just reduces your feature size. In image processing you should always use ReLU, because with images we need to throw away a lot of the data — apart from the object there is a lot of noise in the image — but in regression we don't.
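(The difference between the two activations, as a tiny illustrative sketch in numpy terms:)

```python
import numpy as np

def relu(x):
    # negative activations are dropped entirely
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # negative activations are kept, just scaled down by a small factor
    return np.where(x > 0, x, alpha * x)

print(relu(np.array([-2.0, 3.0])))        # [0. 3.]
print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.02  3.  ]
```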
For regularization I have used dropout, so I will quickly show what dropout is. The left one is a standard neural net where all layers are fully connected to each other; the right one has dropout layers. Dropout means the framework will drop some neurons with some probability — if you give a dropout of 0.2, it will drop 20% of the neurons from every layer — and that gives us better regularization. Okay, now we will see the architecture of the LSTM — is this visible? It starts from the first LSTM unit, which is the input unit, and as shown there is a dropout layer between every two connected layers: LSTM 1, LSTM 2 and LSTM 3. What you are seeing is the forward pass of the algorithm: it goes from LSTM 1, then dropout, then LSTM 2, then Leaky ReLU — that is the activation function — then again a dropout layer, then the third LSTM unit. The green ones are the hidden units, and at the end there is a dense layer, meaning a fully connected layer, from which the regression output comes, and the loss function is calculated at the end of the forward propagation. Then the backward propagation starts — the yellow lines are the backward propagation — and that is the learning part of the algorithm. This is the model training part. The first plot is the training loss: as you can see, in an ideal world it always decreases as the number of epochs increases. The plot below is the loss on the validation data. The training loss will always look this decent, but the validation loss will not — there are some global and some local minima in the data — and as you can see, both graphs move around a lot, so when the training loss is at its minimum, the validation loss may not be behaving the same way. That is why we always need multiple iterations of the algorithm as well as the regularization part. These plots are made with TensorBoard. Now we come to the prediction part. As we saw with Prophet, it was not predicting the spikes well at one-minute granularity; here, as you can see, the LSTM could find the spikes at a one-minute interval. It did still miss some of the higher towers — those feed our alert distribution — so we can group them into, say, five-minute intervals, club those towers together and compare with the predicted values to generate alerts. On the right is the residual analysis: as with Prophet, it is giving us a normal distribution in the residual plot — that is, actual minus predicted — and the ribbon in the plot is the data density; the bulk of the data should lie within about two standard deviations of the mean, and it is giving that kind of data density in the residual plot. So at least we can say our implementation is finding the right tune to forecast the future. What can we do from here? As we were saying, we have a common AI library where you just put in your data to get some result out — you don't have to build the model from scratch just to get some ideas — and we can use this model for that library of common services. Then we can improve the modeling as well: sequence-to-sequence prediction instead of the N-to-one prediction I did — there are N inputs and it predicts just one value; predicting sequence to sequence instead requires an autoencoder architecture. We can also use deeper layers — I used only two hidden layers because of the computing cost; even with two hidden layers it took a lot of time, like three hours, to run the whole model and predict the output.
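(The stacked architecture described above — LSTM, dropout, LSTM, Leaky ReLU, dropout, LSTM, dense — might look roughly like this in Keras 2.2. The layer sizes, the stateful setup and the predict flag are assumptions for illustration, not the speaker's exact code.)

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, LeakyReLU

def build_model(seq_len=50, n_features=1, batch_size=256, predict=False):
    """Stacked LSTM regressor; dropout layers are only added at training time."""
    model = Sequential()
    model.add(LSTM(64, batch_input_shape=(batch_size, seq_len, n_features),
                   return_sequences=True, stateful=True))        # LSTM 1
    if not predict:
        model.add(Dropout(0.2))                                   # drop 20% of units
    model.add(LSTM(128, return_sequences=True, stateful=True))    # LSTM 2 (widened hidden layer)
    model.add(LeakyReLU(alpha=0.01))                               # the Leaky ReLU activation
    if not predict:
        model.add(Dropout(0.2))
    model.add(LSTM(64, stateful=True))                             # LSTM 3
    model.add(Dense(1, activation='linear'))                       # fully connected regression output
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
```

At prediction time the same builder could be called with predict=True and batch_size=1, which is essentially the "prediction net" described a bit later.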
So for deeper layers we require more complex hardware. We could also increase the feature space: I have used only the single time series and created the past data from it, but we could use correlated data points from other time series as well — some operations have an impact on other time series — so we can take that and make better predictions from here. If we have some time left I can show the implementation part — do we have time left? Okay. This whole code is hosted on GitHub, so you can get it from the presentation; I want to quickly show something here. This is the model architecture: as I said, Keras does it very efficiently, so you can stack layer on layer, and it is as simple as that — you obviously need some idea of what you're doing, but the implementation is very easy. You can see that after every layer there is a dropout layer with 0.2, which means it will drop 20% of the neurons from every layer to reduce the overfitting problem. But you should only use dropout at model-fitting time: when you are predicting the future you should not use dropout, because you have no idea which neurons it will drop — it is just random. So I have kept a parameter, predict, defaulting to training mode: at training time it adds the dropout layers, and at prediction time it removes all those dropout layers, so only the LSTM units are there. The next part is callbacks. Callbacks are a special kind of method in Keras; they help us store the best result during model training, as well as model metadata and logs. I have used mainly three or four callbacks here. The first one is reset states: as I was saying, we are doing mini-batch gradient descent for learning, so within a single epoch it divides the data by the batch size; for the first batch it initializes the network with random weights, and those weights are then updated by the learning algorithm — for the next batch it does not re-initialize with random weights, it carries over from the first batch and continues into the second. So by the end of an epoch it has updated the weights as many times as there are batches, but after the completion of an epoch we should reset the carried-over state in the model, because otherwise it will overfit — reset states does that, and I created a custom class that is called after every epoch. The next one is the model checkpoint. The checkpoint is like the heart of deep learning: when you are doing model training, it is clear that not every epoch will have the lowest validation loss, so this checkpoint makes sure we store the best weights from the run for future use — maybe the best came around epoch 50, so it will store the best result from those 50 epochs and discard the worse weights. TensorBoard is for log visualization — the architecture diagram I showed was created with TensorBoard. And the last one reduces the learning rate: when you are training, maybe for some iterations the weight matrix will not improve, so if the validation loss has not been reducing for, let's say, 5 iterations, then we should reduce the learning rate — that is part of hyperparameter tuning as well. The parameters here are the monitor, which is the validation loss; the factor, which is 0.9, meaning that if the validation loss is not improving after 5 epochs it will reduce the learning rate by 10%; the patience, which is those 5 iterations; and the minimum learning rate, meaning it should not go below that.
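(A minimal sketch of what that callback setup might look like in Keras, building on the earlier sketches — the file paths and exact values are placeholders, and the custom reset-states class is an assumption about how the described behavior could be written.)

```python
from keras.callbacks import Callback, ModelCheckpoint, TensorBoard, ReduceLROnPlateau

class ResetStatesCallback(Callback):
    """Clear the stateful LSTM's carried-over internal state after every epoch."""
    def on_epoch_end(self, epoch, logs=None):
        self.model.reset_states()

callbacks = [
    ResetStatesCallback(),
    ModelCheckpoint('best_weights.h5', monitor='val_loss', save_best_only=True),
    TensorBoard(log_dir='./logs'),
    ReduceLROnPlateau(monitor='val_loss', factor=0.9, patience=5, min_lr=1e-6),
]

model = build_model(seq_len=50, batch_size=256)      # from the earlier architecture sketch
model.fit(X_train, y_train,
          epochs=50, batch_size=256, shuffle=False,  # stateful LSTM: keep the time order
          validation_data=(X_val, y_val), callbacks=callbacks)
```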
So this is the model architecture at training time: as we can see, 256 is the batch size and 50 is the input sequence size. When you are choosing the model architecture you should make sure the feature space is doubled in the hidden layers and then reduced at the end. Dense means the fully connected layer from which we get the outputs; before that, all the layers are hidden layers — LSTM 2, LSTM 3 — and we cannot get at the state of the model directly. Now we will see the prediction net. The prediction net has exactly the same architecture as the training model except for the dropout layers — as I told you, you cannot use the dropout layers in the prediction net. And why do we need a separate prediction net? Because at training time we used a batch size of 256, and if you use the same model to predict the future it will always require test data that is perfectly divisible by 256 — if you give it 257 test samples it will throw an error. That is why I created a new net with the same architecture and a batch size of 1; in that block you can see the batch size is 1, and here you can give one sequence at a time. So that's more or less what I had to show — any questions in between? The code is available; I have given the link with the presentation, so you can access the code and run it. That's it, thank you. Thank you, Shubhajit. (At the side, don't put it in the middle — maybe here, this works — and then you have it from behind; so now from here, from the top, I'll help you. Sorry, great.) Ladies and gentlemen, next up we have with us Vue, a software engineer at radanalytics.io, and he's going to be talking about building streaming recommendation engines on Spark. I request you all to have a seat and welcome Vue. Hi everyone, thank you very much — can you hear me? No? Hello — yeah, I'll use this one, thank you. Thank you very much for coming. My name is Vue, as you just heard, and I'd like to talk a little bit today about building streaming recommendation engines on Spark. I'd like to talk about batch recommendation engines, which are the common approach to doing recommendation engines, and about how easy it is, in one way, to build this kind of distributed recommendation engine — but on the other hand, building them in a streaming and distributed way can be tricky, and there are some roadblocks you might stumble upon. I'll introduce the concept of collaborative filtering, which is the most common way of producing recommendation engines, and I'll talk about two variants: batch alternating least squares and streaming alternating least squares. I'll also introduce Apache Spark a little bit — I'm assuming some of you are familiar with it — and I'll talk about an implementation on top of Apache Spark called distributed streaming ALS, and finally I'll talk a bit about how it works in a modern cloud environment such as OpenShift. So what is collaborative filtering? First let's talk about recommender systems. Recommender systems are a popular method of matching the historical data you have from users, products and ratings — the connection between those users and products: usually you have a unique relation between a user, a product and a rating.
Say you go to a website where you want to see new movies or buy a new movie; you might see some movies there and it recommends one, and if you give it five stars you've created a unique relation between yourself — the user — the product — the movie — and the rating you just gave. In this jargon, "collaborative" just means using all of the data you have globally, from all the users, and "filtering" is basically predicting: you're doing predictions on the data you already have. In a way, we use collaborative filtering in our everyday lives, and it's quite common sense if you think about it. The main idea behind it: let's assume you have two groups of people. Group A is a group of people with whom you share musical taste — everything they like, you usually like — and group B is people with whom you don't share any musical taste — everything they like in music, you hate. If group A recommends you an album and group B recommends you another album, which one are you probably going to buy? You're probably going to buy group A's, right? So that's basically collaborative filtering, in a way. As a bonus question: if group B says an album is really good — or really bad, sorry — does that mean you're going to like it? You really don't have an informative relation there between the data and that group; it's a different kind of relation, so you can't really say whether you're going to like that album. One of the most popular methods for collaborative filtering is alternating least squares, and in ALS we assume we have all of the data organized as a sequential ordering of users and products, so we can build a matrix — it's a natural way of displaying this data. We have a matrix representing all the ratings; it's a sparse matrix with some ratings missing, because not all of the users rate all of the products, and each entry represents a unique relation between a user and a product. What we're doing with ALS, in a nutshell, is trying to factorize this big ratings matrix into two latent factor matrices — we'll just call them U and P here — and when these two factors are multiplied back together they give an approximation of the ratings matrix, and that approximation includes the ratings that were missing: it's a prediction of the missing ratings. One classical way of doing this is a batch method. In a batch method, the factorization is done by defining a loss function: it has an error term, which is the difference between the actual ratings you have and the predicted ratings, plus some regularization terms; and setting the derivatives of the loss function with respect to U and P to zero gives you a nice set of linear equations which you can solve by iteration. That's quite handy. The way we do it is: we fix one of the factor matrices, we solve the estimator for the other one, and then we just iterate the process, going back and forth, with one fixed and then the other fixed. Eventually this process converges and you have a very good approximation of the ratings matrix. Finally, in the end, what you have is something like this: you still have the data that you actually have, and in red you have the approximation — and what this approximation means mathematically is that these values are the ones which actually minimize that ALS loss function.
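(A toy, non-distributed sketch of that alternating procedure — fix P, solve a small regularized least-squares problem per user, then fix U and do the same per product. R, the mask, the rank and the regularization value are all made-up placeholders.)

```python
import numpy as np

def als(R, mask, rank=8, n_iters=20, lam=0.1):
    """Batch ALS on a small dense matrix: R is (users x products),
    mask marks which entries are actually observed."""
    n_users, n_items = R.shape
    U = np.random.rand(n_users, rank)
    P = np.random.rand(n_items, rank)
    for _ in range(n_iters):
        # fix P, solve for each user's latent factors
        for u in range(n_users):
            obs = mask[u] > 0
            A = P[obs].T @ P[obs] + lam * np.eye(rank)
            U[u] = np.linalg.solve(A, P[obs].T @ R[u, obs])
        # fix U, solve for each product's latent factors
        for i in range(n_items):
            obs = mask[:, i] > 0
            A = U[obs].T @ U[obs] + lam * np.eye(rank)
            P[i] = np.linalg.solve(A, U[obs].T @ R[obs, i])
    return U, P   # U @ P.T approximates R, filling in the missing entries

# toy usage: observe ~30% of a random 30 x 40 ratings matrix
rng = np.random.RandomState(0)
R = rng.randint(1, 6, size=(30, 40)).astype(float)
mask = (rng.rand(30, 40) < 0.3).astype(float)
U, P = als(R * mask, mask)
```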
So if you have enough data and enough iterations, it's probably going to be a good approximation. To visualize this, let's imagine a very quirky shop: they only sell 300 products and they only have 300 customers, and to be even more quirky, each customer gives a rating in 8 bits, so you can give any rating from 1 to 256. And since we're humans and we visualize patterns and colors better than we do numbers, we assign a palette to these numbers — I think you know where this is going — so you can build a ratings matrix you can actually see. This is going to be our ratings matrix. If we use ALS to solve this, how do we go about it? First we fill the latent factors with random numbers — obviously, if you fill them with random numbers, your initial approximation is going to be a random matrix, which makes sense — and then you start the iteration, and as you go through the iterations you can see that it's doing quite a good job: after a few iterations it's actually approximating the matrix we had originally. Okay, it works — but it's expected to work in this case. This is probably the simplest case of ALS you can imagine: it's a very small data set, you know all the ratings, and it's not a distributed system, so there are no traps to fall into — of course it's going to work. Now, you might be thinking: well, this is all nice, but we can do this in a streaming way, right? If we're approximating that matrix and we get new observations — new ratings, new users, or a new product is released — we can just recalculate the whole thing, right? Well, yes you can, but in a way that's not going to be streaming, because of two technicalities. One is that you have to keep the whole of the data: you're not really using a streaming implementation, because although you get new data, you have to store the entirety of the historical data — you can't approximate the new matrix from just the one new rating you received. Secondly, if you want to do this in real time it might be a bit problematic, because imagine you're a shop or a company that has millions of users and millions of products: obviously it's going to be tricky to recalculate the whole thing in real time. So how do we go about it? We want a method that lets us do this factorization with just one rating, or a few ratings, at a time — and fortunately there is a method for that: stochastic gradient matrix factorization, and the specific variant we're going to use is biased stochastic gradient descent to factorize the ratings matrix. So what's the difference between stochastic gradient descent and the batch method? Basically, we introduce a new concept, the concept of bias. Here we have a bias for x and y, where x is the user and y is the product: there's mu, which is the global average of the ratings you have, and b_x, which is basically how much the ratings deviate from that average for a given user — and likewise for a given product. The new approximated rating is then the batch-case prediction plus these new terms. If we plug this into the loss function, we can devise a new loss function for the streaming case, which now just has the new predictions in it, plus some regularization terms that we're not going to go into.
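(The per-observation update that's about to be described could be sketched like this — illustrative only, with made-up learning-rate and regularization values. U and P are the latent factor matrices, bu and bp the user and product biases, mu the global average rating.)

```python
import numpy as np

def sgd_update(U, P, bu, bp, mu, x, y, r, lr=0.005, lam=0.02):
    """One streaming update from a single observed rating r, given by user x to product y.
    Prediction = global mean + user bias + product bias + dot(user factors, product factors)."""
    pred = mu + bu[x] + bp[y] + U[x] @ P[y]
    err = r - pred
    bu[x] += lr * (err - lam * bu[x])
    bp[y] += lr * (err - lam * bp[y])
    U[x], P[y] = (U[x] + lr * (err * P[y] - lam * U[x]),
                  P[y] + lr * (err * U[x] - lam * P[y]))
    return err

# toy usage: 5 users, 7 products, rank-3 factors, one incoming rating
rank, n_users, n_items = 3, 5, 7
U, P = np.random.rand(n_users, rank), np.random.rand(n_items, rank)
bu, bp, mu = np.zeros(n_users), np.zeros(n_items), 3.0
sgd_update(U, P, bu, bp, mu, x=2, y=4, r=5.0)
```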
Calculating those updates over all of the ratings at once is quite expensive, so instead we calculate them for a single observation. As you can see, we update the biases and the latent factors iteratively, but with just one observation at a time, and that's exactly what we want. Provided we have a single rating — user x rating product y — we can update the biases as well as the latent factors, and that allows us to do the factorization in real time. An interesting thing is that this method also has a convergence property, like the batch method does. The practical difference, which I hope is obvious by now, is this: in both methods the objective is to estimate the latent factors, but in the batch method, whenever you get a new observation you have to recalculate the factors with the entirety of the data, whereas in the streaming version, whenever you get a new observation you just update the rows that relate to that specific user and that specific product. It's important to note that, from a certain point of view, these methods aim at exactly the same thing — they both calculate the latent factors and make predictions from them — it's just the way they use the data that's significantly different. So let's look at an illustration of the streaming case with the same manufactured ratings data. You can see the convergence here seems to be happening, but more slowly. That is to be expected, because now we're not using the entirety of the data, just one observation at a time, but in the end we converge to a similar result: a good approximation of the ratings matrix. Again, this is a simple example — a small data set on a local machine, with no distribution happening — and that's not what we want. We want to try this with big data sets and actually implement something that works at scale, and to do that we're going to use Spark. I'm assuming some of you are familiar with Spark — who here has worked with Apache Spark? Not that many people, okay, so I'll do the mandatory ten-second introduction of Apache Spark, and I hope my description does it justice. Spark is a framework that allows you to distribute calculations at scale, and it provides several core data structures, such as resilient distributed datasets, DataFrames, and Datasets. The RDD, the resilient distributed dataset, is an immutable, distributed, typed collection of objects. What does that mean? It means that when you create one of these RDDs, it is actually partitioned across your cluster, and because it is immutable you can map your computations onto each of those partitions, so the calculations are done in parallel across the cluster and then you just aggregate the results back. This allows for a very natural way of distributing computations, if you can translate your algorithm into this kind of distributed, immutable operation. For our ALS application we're going to use RDDs as the core data structure. Now, Apache Spark already provides, in its MLlib library, an implementation of ALS — of alternating least squares. It's a batch implementation, but it is a very performant one; it works very well, and if it works for you, by all means use it. And it has a very simple API.
To train a model, you just need a few quantities. You need what is called a Rating, which is just a wrapper around the quantities we mentioned: the user ID, the product ID, and the rating itself. Your ratings matrix is then just an RDD of Ratings. You also need a rank, which corresponds to the number of elements in each row of the latent factor matrices we mentioned, and you need a number of iterations, which is basically a hard stop on when the iterative process should end. That's quite useful, because it means the problem is computationally bounded — you know it's not going to run forever; you can say that after 100 iterations the approximation is good enough and stop. And it also lets you pass some regularization terms, as we've seen, such as lambda. The way to train a model is quite straightforward: you just pass the ratings, the rank, the iterations, and the lambda to the ALS object, and what you get back is a class called MatrixFactorizationModel, which is just a wrapper around the two latent factor matrices we've seen. And that works in a batch setting.
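To make that concrete, here is a minimal sketch of the batch MLlib API just described. The file name and the existing SparkContext `sc` are assumptions for illustration; `ALS.train` and `MatrixFactorizationModel` are the actual MLlib classes:

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// One Rating(user, product, rating) per observation; "ratings.csv" and `sc`
// (an existing SparkContext) are assumed here purely for illustration.
val ratings: RDD[Rating] = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val rank       = 10    // length of each latent factor vector
val iterations = 20    // hard stop for the iterative process
val lambda     = 0.01  // regularization term

// Returns a wrapper around the two latent factor matrices.
val model: MatrixFactorizationModel = ALS.train(ratings, rank, iterations, lambda)

// Predicted rating for one (user, product) pair.
val predicted: Double = model.predict(1, 42)
```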
But to have a streaming implementation, we need a streaming data source, and the one we decided to use is Spark's discretized streams, or DStreams. They basically work as mini-batches of RDDs over a certain time window: you get batches of resilient distributed datasets at a certain frequency, and then you can apply your processing to each of these mini-batches. An important thing to notice — an advantage of DStreams — is that if we use them for a streaming ALS, we no longer need to keep the entirety of the data in memory, or even access it. Imagine again the case with millions of products and millions of users: if you get a mini-batch with just a few ratings, you don't need to read a database with several hundred million ratings and redo the whole process; you can just use the data in that mini-batch. Since the MLlib API is quite straightforward and intuitive, we wanted the streaming ALS to have the same kind of API, so we use the same type of calls. The way we do it is: initially, when we have no model and no data, we create a model from the initial RDD, the initial mini-batch, and from then on we update that model with the mini-batches that arrive afterwards, so the model is continuously updated as new mini-batches of data come in. So what do we need to train a model? I'm going to walk you through the algorithmic steps of going from that initial mini-batch to a trained model. There are a few of them, and I'll go into some detail, but hopefully it gives you an idea of how tricky it is to implement this kind of algorithm in a distributed way — and the flip side is that you get a distributed recommendation engine which is quite performant. So what do we need to say we have a trained model? If you remember the formulas from the initial slides, we need exactly those quantities; once we have them, we have a trained model and we can perform predictions. Let's start by looking at the user latent factors. These operations are identical from one mini-batch to the next: the same set of operations you do on the first mini-batch, you repeat with the new data, and that's how you continuously update the model. To calculate the user latent factors, we start, as in batch ALS, from an RDD of ratings — the ratings each user gave each product. The first thing we do is split this RDD into two RDDs, one keyed by user and one keyed by product, and this will allow us to compute the latent factors. Keep in mind this is the very first step: we have no model, no latent factors, nothing. So for each of these two RDDs we generate a random feature vector — a feature vector of rank R, the rank we decided on, seeded with uniformly random values — and with each of those feature vectors we also associate a random bias, which is quite easy. Next: because we split the RDD into one keyed by users and one keyed by products, we might have duplicated users or products in these RDDs — imagine you rated two movies; you're going to show up in two entries, twice on the user side. So we join these with the ratings, which returns a dataset consisting of product IDs, user IDs, ratings, and user factors: we now have the mappings between the user, the product, the bias, and the feature vectors. Finally, it's just a couple more steps: we swap the RDD keys, so instead of being keyed by user it's keyed by product, and we join this intermediate dataset with the other feature vectors. Now we have a complete RDD in which each row includes the biases and the feature vectors for each combination of product ID, user ID, and rating. Now we can calculate the global bias: if you remember, the global bias is simply the average of all the ratings we have, and we can do that easily in Spark. Finally, we need to calculate the user-specific and product-specific biases, and that's quite simple as well. As you've seen before, we can update a bias using just this gradient term: the new bias is simply the old bias plus the gradient term. To calculate the gradient we need the error — the difference between the rating and the prediction — and we need gamma and lambda, which we have, since they're model parameters. So we have everything we need. For each item of this RDD we calculate the prediction — we can, because we have the feature vectors — and then the error, which is simply the rating minus the prediction. Now that we have these quantities, if you remember the expression for the gradients, they are quite straightforward to calculate.
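As a rough Spark sketch of the steps just described — not the speaker's actual code — assuming an existing RDD[Rating] called `miniBatch` and a chosen rank `r`:

```scala
import org.apache.spark.mllib.recommendation.Rating
import scala.util.Random

// One random feature vector of length `rank`, plus a random bias.
def randomFactor(rank: Int): (Array[Double], Double) =
  (Array.fill(rank)(Random.nextDouble()), Random.nextDouble())

// Key the ratings by user and by product.
val byUser    = miniBatch.map(rt => (rt.user, rt))
val byProduct = miniBatch.map(rt => (rt.product, rt))

// One random feature vector and bias per distinct user and product.
val userFactors    = byUser.keys.distinct().map(u => (u, randomFactor(r)))
val productFactors = byProduct.keys.distinct().map(p => (p, randomFactor(r)))

// Global bias: the mean of all ratings in this mini-batch.
val mu = miniBatch.map(_.rating).mean()

// Join ratings with the user factors, re-key by product, join with the
// product factors, so each row carries the rating plus both factor vectors.
val joined = byUser.join(userFactors)
  .map { case (u, (rt, uf)) => (rt.product, (u, rt.rating, uf)) }
  .join(productFactors)

// Prediction and error for every observation in the mini-batch.
val errors = joined.map { case (p, ((u, rating, (uf, ub)), (pf, pb))) =>
  val prediction = mu + ub + pb + uf.zip(pf).map { case (a, b) => a * b }.sum
  ((u, p), rating - prediction)
}
```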
Basically, we take the RDDs we have for the latent factors of the users and of the products, key them by user and by product, and do an aggregated sum on each of those split RDDs. That gives us the gradients for the user biases and the user latent factors, and the gradients for the product biases and the product latent factors. We sum the gradients so that we end up with one gradient per user and per product, and that's it — we now have all the quantities we need, so we can say we have a model. You might say that seems like a lot of steps, but that's the price you pay for doing this computation in a distributed way. So you might think: now that we have the latent factors, we can easily make predictions. But what if you get new observations? What if you get new data for a user we've never seen before, or a product we've never seen before? How do we deal with that? Remember that we trained the model from nothing on the first, initial mini-batch; after that we just carry the model over to the next time window, to the next mini-batch, and train it with the ratings we get there. So let's look at the case where we get a mixture of data: some ratings for users and products we've seen before, and some not. Imagine you rated a movie last week and now you rate a new one; or a new user enters the system and rates something for the first time; or someone rates a movie that didn't exist before. How do we deal with it? In the illustration, assume the cells in red are ratings for users or products we haven't seen before, and the other cells are for users and products we have seen. What we do now is, instead of assigning random feature vectors to everything we get, we do a full outer join between the RDDs we produce from the new data and the latent factor RDDs we already have. That lets us keep the feature vectors we already have, and for the users and products we've never seen before, we do exactly the same steps as before and create new random feature vectors and random biases. So in this way we can deal with any situation that arises.
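A sketch of that full outer join, under the same assumptions as the previous snippet (`existingUserFactors` holds the factors carried over from earlier mini-batches, and `randomFactor` is the helper defined above):

```scala
// Users appearing in the new mini-batch, as a keyed RDD.
val newBatchUsers = miniBatch.map(rt => (rt.user, ())).distinct()

// Keep the factors we already have; initialize random ones only for users
// we have never seen before. The same pattern applies to the products.
val updatedUserFactors = existingUserFactors
  .fullOuterJoin(newBatchUsers)
  .map { case (user, (maybeFactors, _)) =>
    (user, maybeFactors.getOrElse(randomFactor(r)))
  }
```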
So how does this behave in the real world? We decided to test it with some real data, and for that we use the MovieLens dataset, which is very widely used in recommendation engine research. It comes in two variants: a small one, which is quite good for prototyping — about 100,000 real ratings that users actually gave to movies, in a small file, good for quick prototyping of algorithms — and a full one with 26 million ratings, which is good if you want a more in-depth analysis of an algorithm's performance. The data has several fields of interest, but we're only going to use a few of them: the user ID, the movie ID, and the rating. Sorry — that's a good question; I was just about to go into how we set up the whole thing to train the streaming case, so I'll answer it in a second. So how are we going to train this? First, we train a batch model so we have a baseline to compare the streaming version against, and for that we just use Spark MLlib's out-of-the-box batch ALS. We split the data 80/20 and train on one of the splits, keeping the other part of the data set aside, so that we use exactly the same two splits for both methods and get a fair comparison. Then we calculate an error measure, to have some metric of how well the model is performing. In this case we use the root mean squared error, which is easy to calculate in Spark if you have the predictions and the original data. We measure the root mean squared error of the batch model, and then we compare the root mean squared error of the streaming version against it. How do we set up the actual streaming test? We ran this on OpenShift. The idea was to use a message broker to simulate the data stream, and we used Kafka, deployed with Strimzi, a project that lets you run Kafka on OpenShift, and we used Oshinko, a tool from radanalytics.io that lets you easily deploy Spark clusters on OpenShift. What we did — and this is the answer to the question — is read the entirety of the data and then simply replay it through Kafka, to simulate the stream. We used windows of five seconds with 1,000 observations each, but that's just for convenience, for practical purposes; you can use whatever you want. Realistically, if you used one observation per minute you'd be waiting months or years for this to finish, so we chose those numbers for practical reasons. An important note: the best parameters for the batch model are not necessarily the best parameters for the streaming model — they are quite different algorithms — but again for convenience, we used the same parameters in both. In the next slides I'll go into how you might estimate hyperparameters for a streaming ALS. So this is the result we got. The horizontal dashed line is the root mean squared error of the batch version, and the blue line is the root mean squared error of the streaming version. You can see it's quite good, in the sense that it's what we were expecting: in the beginning the streaming version doesn't have much data, so it's all over the place, but as time goes by it seems to converge to a value in the same region as the batch version. In the end, both the batch and the streaming process have seen the same amount of data, so it's a reasonable result.
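As a sketch of how that baseline comparison can be computed with the batch model — assuming the 80/20 split already produced `training` and `test` RDD[Rating]s, with illustrative parameter values:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Batch baseline trained on the 80% split.
val model = ALS.train(training, 10, 20, 0.01)

// Predict every (user, product) pair in the held-out 20% split.
val predictions = model
  .predict(test.map(rt => (rt.user, rt.product)))
  .map(p => ((p.user, p.product), p.rating))

val actuals = test.map(rt => ((rt.user, rt.product), rt.rating))

// Root mean squared error between held-out ratings and predictions.
val rmse = math.sqrt(
  actuals.join(predictions).values
    .map { case (a, p) => (a - p) * (a - p) }
    .mean()
)
```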
But you might think, well, this is all very good — streaming ALS is a silver bullet that magically solves every problem we might have. That's obviously not the case, so there are some things which are very important to consider when using it. The first — and this is not particular to streaming ALS; it applies to ALS and to many machine learning models — is the cold start problem. The cold start problem refers to the initial phase of model training where you don't have enough data to make any kind of insightful inference or prediction. Think of the visualization I showed in the beginning: at the start, the latent factors are filled entirely with random data, so the approximation looks completely random, and having just a few observations is not going to change that — it's still going to look pretty much random. So always be careful when serving predictions. You might feel tempted, since you have a streaming version, to start serving predictions immediately, and that might not be the best idea. Something you can do to mitigate it: if you're a big company and you have lots of historical data, first train the streaming model offline with a big chunk of data and then start serving in a streaming way. Bootstrap the model with a big chunk of data; don't start from zero and immediately serve predictions based on, say, five ratings — those predictions might be rubbish, and you can't take them back. Another thing to consider is hyperparameter estimation. In batch ALS it's quite straightforward: you do a grid search over several sets of parameters and decide that parameter set D is the best for your data, and at some point in the future, if you want to retrain the whole thing, you can. Say after two months of having the model running it's not behaving very well and you want to retrain it with a rank of double the size — you can do that perfectly well: you take all of the data, you retrain the model, and that's it, because you have all of the data. In the streaming case that's not true, because once you've processed the data, you discard it — in theory, at least. What I mean is that if you're in a position where your streaming ALS needs to refer back to the entirety of the data to retrain the model, it's not really a streaming version anymore; it's some kind of batch/streaming hybrid. So what you actually have is a set of parameters, you get the data, and if you then want to try a new set of parameters, you can only train the model with the new data that arrives, because you discarded all the previous data — you can't do what you do in batch ALS. A possible way around that is to perform a parallel grid search: you start with a bunch of models, each trained with a different set of parameters, and as time goes by you see which of those sets of parameters gives the least performant model and you prune that model from your search — you say, I'm just going to keep these three models — and you keep doing that. This has an obvious drawback: it can be computationally expensive to train lots of models simultaneously. And another thing is that there's no theoretical result guaranteeing that a model you discarded at the very beginning might not actually have been the best model later on, when you had more data — you might be throwing away the best set of parameters because of the specific small chunk of data you happened to see. So estimating hyperparameters is quite a tricky problem in the streaming version.
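Purely as an illustration of that pruning strategy — none of this is Spark API, and the candidate-model interface is invented — the bookkeeping might look something like this:

```scala
// Hypothetical hyperparameter candidates for a streaming model.
final case class Params(rank: Int, gamma: Double, lambda: Double)

// A candidate: its parameters plus a running error estimate, refreshed
// from each mini-batch by whatever streaming trainer is in use.
final case class Candidate(params: Params, runningRmse: Double)

// Start with one candidate per combination in the grid.
var candidates: Seq[Candidate] =
  for {
    rank   <- Seq(5, 10, 20)
    gamma  <- Seq(0.01, 0.001)
    lambda <- Seq(0.1, 0.01)
  } yield Candidate(Params(rank, gamma, lambda), runningRmse = Double.MaxValue)

// After each mini-batch: refresh every candidate's error, then periodically
// drop the worst performer until only a few survivors remain.
def onMiniBatch(updatedRmse: Params => Double, prune: Boolean): Unit = {
  candidates = candidates.map(c => c.copy(runningRmse = updatedRmse(c.params)))
  if (prune && candidates.size > 3)
    candidates = candidates.sortBy(_.runningRmse).init
}
```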
Finally, there's the question of performance. In these kinds of models you do a lot of joins, and a lot of data gets shuffled between partitions, so you have to be really careful about optimizing these algorithms. Apache Spark does something very clever in the batch version, called blocked ALS: it pre-calculates the outgoing and incoming connections between the blocks of RDDs holding the latent factors, so it can minimize the amount of data shuffling that happens. It's quite a clever algorithm — but a naive implementation of streaming ALS gives you nothing like that, so on top of this algorithm you have to think of clever strategies of your own to minimize data shuffling. Also, something that might make alarm bells ring for the more seasoned Apache Spark developer is that you can end up doing ad hoc, random-access fetching of RDD entries to calculate some quantities. Say you find yourself calculating the predicted rating for a specific user and product many times: to do that, you have to access a specific "row" or "column" of an RDD, and that is really an anti-pattern in Spark. So if your code ends up looking like that, doing lots of lookups and so on, possibly the answer is to rethink your strategy for making predictions. So that's basically the explanation of a generic streaming ALS algorithm on Spark. If you want to read more about streaming algorithms, Apache Spark, or OpenShift, I invite you to take a look at my blog — there's a specific post on streaming ALS if you want to see it. And if you want to play around with distributed algorithms on Apache Spark, on OpenShift, or in the cloud, I strongly recommend the radanalytics.io website. It has several use cases for intelligent applications and machine learning at scale, which are very well documented — you can learn how these things work just by reading the source code and the documentation — and it has some ready-made use cases you can actually deploy, like a microservice-oriented recommendation engine built on Apache Spark, which is very interesting. So I strongly recommend it. That's it for me, and thank you very much for your time.

So welcome to the ML/AI track at DevConf US. Our next speaker is Anton, who's going to be talking about the data hub for CI, and I'll let him take over. Thanks. So I'll talk a bit about the OpenShift roots of the data hub — where it comes from and how it evolved from plain logging for OpenShift into what we now have as the data hub — and I'll also talk about how we use the data hub specifically to get data from various CI systems and what we do with that data. It all started with logging on OpenShift. This is the logging architecture on OpenShift — for those who don't know, OpenShift is Red Hat's distribution of Kubernetes. OpenShift includes logging capabilities, meaning that it can collect logs from all the containers running inside it and from the hosts OpenShift runs on, and this logging capability ships with OpenShift itself. It lives in the logging namespace — a namespace is basically a named collection of pods and services within OpenShift — and it consists of three major components. Fluentd is the collector.
Elasticsearch is used as the storage for logs, and Kibana is the visualization and reporting layer. There are some other, minor services in the picture. We have the service — a service is a Kubernetes concept which allows you to reach the appropriate pod. We have the curator, a component which curates indices: it discards old indices and optimizes older ones. We have Prometheus and Cerebro: Prometheus is responsible for monitoring, for getting metrics from the infrastructure — specifically, here we can monitor Elasticsearch — and Cerebro is another, optional component which gives you specific Elasticsearch-related metrics. And what does OpenShift logging collect? It collects logs from the various pods and containers running in the cluster, it collects audit logs, and it collects system logs. If we look deeper inside the server part of the logging infrastructure, we can see that Kibana consists of two containers and Elasticsearch consists of two containers, and each is fronted by an authentication proxy. So if we follow a user request coming through: the user goes to the browser, the query goes to Kibana, which is exposed as a route in OpenShift; it reaches the authentication proxy, and the authentication proxy redirects the user to the OpenShift API so that the user can authenticate with their OpenShift credentials. If the user authenticates successfully, they are redirected to the Kibana container. Kibana then uses the user ID and the header from the OpenShift authentication to query Elasticsearch. That way we achieve multi-tenancy, because the user who came in is known in this request, so we can return query results based on the user ID and headers. Kibana sends the request to Elasticsearch, and in Elasticsearch there is a special plugin, developed partly by the OpenShift folks and partly by Search Guard, that allows for multi-tenancy — we use it heavily in the data hub. It lets you protect indices from unauthorized access: a user will only be able to access the indices they have explicit permission for. On the ingest side, it works like this: we have a service exposed — via an external IP or a node port — and the request comes in, goes directly into Elasticsearch, and the document gets indexed. Fluentd, or any other ingester, must have the appropriate certificates to get a document ingested into Elasticsearch. At the moment Fluentd can ingest into any index, but this can also be tuned. And for monitoring we use Prometheus, which queries the authentication proxy and gets a token from OpenShift; that way Prometheus has authorized access to the statistics of a specific Elasticsearch node. So that's how secure authentication works in OpenShift logging, and it's what we use in the data hub to ensure we have secure, multi-tenant access there as well. So far, this picture was pretty much a stock deployment of OpenShift with logging. What we then added were two major components. One is Kafka — Kafka is a message bus which gives you strongly ordered messages, is very resilient, and has very high throughput — and we also added Ceph, which provides an S3 interface.
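As a tiny illustration of what getting a document onto that bus looks like from a client's point of view — the broker address and topic name here are invented, and the real deployment sits behind the certificates described above:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "datahub-kafka.example.com:9092")  // invented address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// A log document, serialized as JSON, sent to a hypothetical raw-logs topic.
val doc = """{"timestamp":"2018-08-01T12:34:56Z","message":"build finished"}"""
producer.send(new ProducerRecord[String, String]("datahub.logs.raw", doc))
producer.close()
```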
So three major components got added to the stock logging included in OpenShift to get to what we call the data hub: Elasticsearch, Kafka, and Ceph are the key components in the picture. The other components are various data ingestion components, then various normalizers, and other applications that allow for different kinds of data manipulation. Everything here runs on top of OpenShift, except for the S3 part at the moment — S3 runs on bare-metal Ceph — but all the other components run on OpenShift. So, data ingestion. One of the key reasons we have the data hub is to be able to get logs, and data from other sources, from outside the OpenShift cluster. Basically, we took what was stock logging on OpenShift and exposed it to the outside world. The way we did it was to introduce, as I said earlier, a message bus, which is Kafka, and to introduce normalizers, which are special components. We have three different normalizers right now — Fluentd, rsyslog, and Logstash — and these pretty much cover everything that's needed to get your data into the cluster. Any host on the Red Hat network can be set up to send logs into the data hub; as a collector you can use any of those three — Fluentd, Logstash (or alternatively Elastic Beats), or rsyslog — or you can simply point your syslog at a remote syslog server, and the rsyslog on our side acts as that remote syslog server, so your data gets into the data hub. Why do we need the message bus? The message bus gives us a very robust layer: if we have any spikes, Kafka evens them out and we don't have problems with ingestion into Elasticsearch — so one reason is to even out the data flow. Another is communication with other buses. We have multiple other buses within Red Hat; one that is often used is the unified message bus, and we have many different build systems within Red Hat, and we also talk to some Fedora build systems. The Fedora systems mostly use FedMessage as their bus of preference; FedMessage is a message bus based on 0MQ, and it is publicly accessible over the internet. We subscribe to some topics on FedMessage and channel a lot of that data into our internal data hub message bus, and on the Red Hat build system side we subscribe to the unified message bus and channel most of those messages into our Kafka bus. The idea is that this is how the build pipeline works upstream and downstream — upstream meaning in Fedora and downstream meaning within Red Hat. Something gets built in Fedora, and you get the appropriate message on FedMessage saying it was built; then the Red Hat build system kicks in, takes the pieces from upstream, and builds the respective component downstream. One of our tasks is to reconcile what is happening upstream with what is happening downstream, and we do this by tracking the message on FedMessage and the respective message on UMB, the unified message bus. A lot of systems within Red Hat send data to the unified message bus — there are Jenkins plugins that do it, and a lot of Factory 2.0 services do it as well. Also, we have an external data hub in a different cloud.
And we also have the ability to replicate the data. What we did was deploy the data hub on the external cloud — that's also a full OpenShift deployment, with a Kafka message bus. We exposed Kafka as an OpenShift route to the internet over a secure protocol, and internally we deployed MirrorMaker. MirrorMaker is a Kafka component which allows for data replication between geographically distributed Kafka buses, and what it does is simply act as a consumer and a producer. Kafka has the concepts of consumers and producers: a consumer subscribes to a topic and gets the data from it, and a producer simply produces data into another topic. So what we have here is a very simple structure — it was not that simple to implement, but the picture is very simple: MirrorMaker mirrors topics from the data hub located in the other data center into the internal data hub. There are no open ports in the internal data hub; it sits completely inside the Red Hat firewall. The external data hub only has one port open, for the secure connection to the external Kafka bus. The other component we have is ShipShift, an internal tool that we use to gather logs from various Jenkins servers. What it does is wait for a notification, either on UMB or on 0MQ/FedMessage, and based on the notification, ShipShift goes and grabs the logs and build artifacts from the appropriate Jenkins job. That way we not only get the messages from UMB and FedMessage, but also the full logs from Jenkins, which includes the Jenkins console logs as well as any logs produced within the job. The logs are then sent to the message bus and also stored in S3, which in our case is backed by Ceph. Now, data processing. On the processing side, everything we have is Kafka-centric, and we want everything to stay within Kafka. We run multiple applications: several rule-based normalizers that get data from one topic and send data to another topic, and several Kafka Streams apps that do the same thing. We don't have Spark yet, but the idea is that we will have Spark doing the same kind of processing. So all the processing and all the data normalization happens on top of Kafka; that ensures consistency and it ensures delivery of all the messages. As for the presentation layer, it's very simple: the message bus contains certain topics which are mirrored into Elasticsearch. A topic in the message bus corresponds exactly to an index pattern in Elasticsearch — another topic, another index pattern. These indices are presented to the end user via Kibana, and we have a separate repository of reports which we upload to Kibana, so the user can view various saved reports there. So one of the key things we have is a common data model: a rigorous definition of all the fields within Kafka and Elasticsearch. What it allows us to do is make sure we don't have any inconsistencies, and that we have very good correlation between different entities and between similar entities. If we have, for example, a field called timestamp, it will be the timestamp of the log entry throughout everything, in any kind of document; if we have a field called job name, it will be the name of the build job for any kind of document.
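Purely as an illustration of that idea — the field names and values below are invented, not the actual schema — a document following such a common data model, including the kind of per-product nesting the speaker describes next, might look like this:

```scala
// Invented example: shared fields mean the same thing in every document,
// and job-specific data sits in its own nested structure.
val ciJobDocument = Map(
  "timestamp" -> "2018-08-01T12:34:56Z",  // always the timestamp of the log entry
  "ci_job" -> Map(                        // everything about the build job nests here
    "name"   -> "example-component-build",
    "build"  -> 1234,
    "result" -> "SUCCESS"
  )
)
```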
The other thing is that we use hierarchical namespaces, which means that if you have a product — for example, OpenStack — it gets a completely separate namespace within the JSON, such that anything produced within OpenStack can simply go into that JSON namespace, and nothing outside of OpenStack can conflict with that specific product. So hierarchical namespaces mean no conflicts between Red Hat products. This common data model also allows us to easily generate templates for Elasticsearch. Specifically for CI, we built several namespaces, such as CI job. The CI job namespace gives us the definition of a specific Jenkins job (or a job in some other system) that we can use within Elasticsearch to correlate across any documents. Test case is another namespace that lets us correlate any test case: any data about test cases is nested under test case, any data about a CI job is nested under CI job, and the same goes for Fedora CI, which is specifically for Fedora infrastructure messages — any Fedora infrastructure message is nested under Fedora CI, so nothing else can conflict with it and we get very good correlations in the end. The basic artifact workflow is very simple. This picture does not include ingestion of messages via the message bus, only ingestion of Jenkins artifacts: four major stages — collection, normalization, storage, and visualization. Collection is done via ShipShift; normalization is mostly done via Logstash, though we now also have other Kafka applications that can do normalization; storage, at the moment, is Elasticsearch; and visualization is done with Kibana. This is the layout of the data hub from the OpenShift point of view. OpenShift consists of various namespaces, and we view each namespace as an application, from a broad point of view: we have a namespace responsible for all the ingestion, a namespace for the message bus, a namespace for normalization, a namespace for Elasticsearch, et cetera. This namespacing in OpenShift gives us a very modular approach to the data hub: if we want to add or remove a namespace, it's very easy to do, and it doesn't take much effort to remove or recreate one. We can also control which namespaces are allowed to talk to each other, so we disallow communication from non-data-hub namespaces — which we do — and only allow communication between data hub namespaces. And this is just a snapshot of the various tools and components we have in the data hub; I've already talked about most of them. Elasticsearch, Ceph, Spark, and Kafka are the key components, and we have a lot of other tools here as well. I think that's it. Any questions? Thank you. I'm curious how you deal with errors during normalization — say someone changes the format of a message bus message so that your normalization can no longer convert it into your standardized format. Do you error out at that point, or what do you do? So at the moment, we don't have a great way to deal with that: we do error out. It really depends on the tool that's handling it — some tools deal with this better, and some are not that great at dealing with this kind of thing.
If there is — depending on the change — there might not be an error at all. But if there is an error, then we error out. At the moment we don't have specific validation saying a message must be in a given format; we want to get to that point, but we don't have it yet. Thank you. Yeah, you had a question? I noticed you have it deployed in many places — is CentOS also a client there? So CentOS also uses FedMessage, right? So yes, in that sense, and Fedora uses CentOS CI. I'm not 100% sure what all the relationships between Fedora and CentOS are, but they share a code base, and one of the external instances of the data hub is in CentOS CI. Another question: I appreciated looking under the covers of the design, but do you have anything to show, like a dashboard, of how you would use the information? I would, but I'm afraid I don't have a VPN connection from here. Just one final question: this looks very close to a product my company has decided to use, called Splunk. Have you heard of it? Do you know of a relationship between them? Right — Splunk is proprietary software; we use Elasticsearch, not Splunk. Thank you. Any more questions? We use something similar to Elasticsearch — Graylog, if you're aware of that, the log aggregation platform — sorry, which one? Graylog. Okay, Graylog. So I was curious whether you have any monitoring, alerts, or notifications built on top of the Kibana data. So Malik can answer this question — that was my summer project: monitoring each of the data hub components and sending out alerts if there is unusual log activity, things like that. We are using Prometheus for monitoring and Grafana for visualizations. Any other questions? Thank you, Anton.

We have some trick questions lined up, like what is Uli's coffee name — I don't know Uli's coffee name. I think we're ready. All right, we're good to go. Hello everyone, good afternoon. You've made it this far — first day of DevConf, that's pretty good. So what we're going to do today is talk about the impact of AI, and we have with us Jared, Uli, and Daniel as our panelists. I am Gerard Griffin, the moderator; I'm here to keep these guys honest and make sure they tell you the truth and nothing but the truth. So first off, some introductions. Jared, maybe you want to kick us off? Sure, I'm Jared Floyd — is the microphone on? I think so; maybe I'll move it a little higher. There we go. I'm Jared Floyd, I'm in the office of the CTO working on technology strategy. I'm Uli Drepper, I also work for the CTO, Chris Wright, and no one knows what I'm doing — I'm mostly looking at all kinds of compute things, among them machine learning. Daniel Riek, also office of the CTO; I manage the AI Center of Excellence, working on Red Hat's AI strategy. So we're going to start off with a pretty generic question to get the ball rolling: what do we think about the impact of AI as it relates to software development, IT operations, DevOps, those types of things? I'm curious, Daniel, what are your thoughts on that?
So I think that, overall, AI is this big transformation, and many of the shiny topics we see in the media, like self-driving cars, are hard problems to solve — but there is a lot of low-hanging fruit in applying machine learning to improve IT itself: software development, operations, and embedding AI and machine learning capabilities in our software products and platforms. To use the picture from earlier: if you're riding in a self-driving car, you probably don't want a sysadmin in the trunk, but that car is basically a data center on wheels. So to get there, you want to automate things to a degree beyond what we've done so far with IT, and I think that applies generally to what we do with software. Jared, in your line of work, what do you think the impact of AI will be? Well, AI is interesting because it covers a lot of traditional analytics as well as machine learning techniques, and I'll talk a bit more about this later, but I think there's a very interesting set of patterns in newer applications that are built around data flows: bringing data into the system, doing some sort of processing on it — which may include AI processes — and then taking action on it, which may mean taking action locally or at a different point. Daniel just mentioned autonomous driving as an example. Autonomous driving has the full suite of AI and ML techniques at play, but most of them aren't happening on the car; they're happening in a training process running on a very large data center infrastructure. There's a flow where data comes in from vehicles for asynchronous training purposes, it gets analyzed, and that creates updates that are then delivered, again asynchronously, to the vehicle to improve its behavior. So managing how that data flow happens, managing its security, the reliability of that overall application process, and the general assurance that it's doing the right thing is going to be interesting, because the same pattern occurs across use cases such as IoT, the Internet of Things — another large, vague, but complex area of use cases. Overall, I think we're going to see a shift to these data-centric applications, of which AI is a core component for doing the analysis. When you mention security and things like that, do you see a place for AI to mitigate risk for businesses, or at least expose potential issues? I think, from a security perspective, in that case I was talking more about making sure you have templates for your application lifecycle, so that you know you've done appropriate integration testing and appropriate regression testing before you put something into what can be a safety-critical situation. But AI techniques also apply to security use cases — there was a talk two sessions ago about using AI for anomaly detection and intrusion detection, all the things we've traditionally built rules-based systems around, where AI techniques allow us to create much more flexible, much more reactive environments. Cool. All right. Now, given that Linux is the biggest open source operating system, there are a lot of Linux developers out there. Uli, can you speak to what you think AI means for a Linux and systems developer, and how it might change their paradigm or their approach to development?
So I don't necessarily think it's limited to Linux developers, but there are big opportunities when it comes to developing any kind of complex system nowadays, because we've reached the point where an individual is not actually able to understand every single aspect of it. They're not able to understand all the signals the system puts out and recognize whether a signal is something positive or negative or something in between — and even when they do recognize them, they might be biased in a certain direction, so that they don't want to recognize a signal as good or bad. Getting mathematical logic in place to do this kind of work, to do the analysis, will open up completely new avenues. To give an example: I gave a talk earlier today on the micro-architecture of CPUs. That's something where, if I filled a room with a hundred people and threw a stone, I probably couldn't hit a single one who actually knows anything about micro-architecture — it's a very esoteric topic. At the same time, anyone who wants to do performance analysis using today's performance counters, which are available in the CPU, can't do anything without understanding what they mean. But if we write some logic, some mathematical logic, around the analysis, we can learn which measurements actually describe a good workload as opposed to a bad workload, and have the AI learn these things without the user having to be able to describe what makes it good or bad. So we can learn from the systems directly and work around the lack of understanding that the user, or the programmer, might have. And this extends well beyond just the Linux world — it's for everyone. It's a big advantage going forward; you just have to build these kinds of systems. Okay. Daniel, do you have any thoughts on that as well? Well, I think this is the general pattern: we can use AI broadly to derive knowledge directly from data. We now have the ability to generate a whole lot of data, and the complexity of our systems is beyond the capability of humans to understand; AI can help us make sense of things, or it can autonomously derive meaningful information from a very complex set of vectors and a huge amount of data. The problem with the data is not just its volume but also its complexity — it's beyond what anyone can still grasp as a human; there's a limit on how many variables you can consider when trying to understand the root cause of a problem, and a machine can help us there. And the other aspect — something that has also been worked on — is that some of these tasks are mundane, and programmers are not likely to take them on voluntarily. Monitoring a CI system of some sort, or recognizing very simple faults — well, you didn't commit this message, this check-in here is missing something — this is being done, and Jared can talk about this as well: there's work using bots, often in some form, where simple tasks might not even have to be handled by humans anymore. And so we get much more reliable systems, because these kinds of mistakes, when they pop up, can be rectified very quickly. So with that automation, it sounds like it's a little bit about removing the manual tasks a person might have to do.
Have you guys experienced any examples of that — things maybe we're doing at Red Hat in that space, or other tools, where you've seen it do a good job of eliminating those manual, mundane tasks? I've certainly interacted — I don't know the project or product — with GitHub projects where I've reported bugs and immediately had a bot come back with "you might find these other issues relevant." That's a conceptually very simple piece of code, but actually having it produce useful results was enormously helpful. And we have examples like that, in flake analysis in CI for example, where somewhere deep in the many ways a system can fail, the failures look random to a human observer because you don't see the depth of it — it was a full moon and someone tortured a black Schrödinger's cat in the graveyard — and you don't know that, but it's somewhere in the logs and in the metrics of the cluster where you run your tests. And after it has happened five times, there's a pattern, because somewhere deep in the vectors of the data there's a clustering of issues, and the system can identify that it's actually not a flake: there is a problem hidden deep in the code that, as a human, you'd miss because it doesn't happen often enough or it's too hard to understand what's going on. We have that today, and it's improving our quality as we speak. Although the phase of the moon is not a common build contingency. You'd be surprised. You would be surprised. On the other side, this is also an interesting area where we need more work, and a lot of research is going on: most of the work in machine learning nowadays is built on the law of large numbers, where you need statistics to actually catch a relevant result. What is happening to some extent now, and will hopefully be the focus of more work, is that we can actually work with very small data sets and derive patterns from them in a reliable way. So for the thing Daniel mentioned — that we might hit the same error situation five times — the sixth time we can actually do something about it ahead of time. That's very, very useful, and we're trying to do some work in that area; I have my personal helper on that kind of thing to do the math for me, and I'm looking forward to seeing the results of that effort. Jared, I want to go back to something Daniel mentioned earlier, in his presentation, about data being the differentiator: having all of this information, all of this data, the way you use it, how you make it available both internally and externally to your team. With data being the differentiator, what does that mean in the AI space, and how does that impact AI in general? That's a very interesting question — hopefully I can give an answer that won't get me into trouble. One thing that's very important about AI is that it's really driving value further up the stack. We've already driven value from lower-level software to higher-level application infrastructures; what AI techniques do is drive value from your AI infrastructure to the data, and to the models you're building with that data. And this can be very challenging, because you can ship software that's open source and fully capable of solving a particular problem, and may even have the input and output hooks to solve it, but if you don't ship that with a working model, then the software itself is useless.
And if you don't ship it with the data to train that model, then you'll potentially be limited in what an open source user can do to enhance it. Similarly, shipping a model in and of itself can be challenging, because it depends on what data was used to train it: if that's confidential, proprietary data, your model can potentially leak information you don't want to leak. So I think that as ML grows in importance in the open source community, there are going to be some very interesting conversations: what license do we make these models available under? What about the data sets? Do we have open repositories for the data sets and the models that are training these newly critical systems? A lot of the discussions we had many years ago around licensing and software — we're going to have to have those same conversations again for the data and the models in ML-based products. There are certainly data licenses and content licenses that exist, but they may or may not be appropriate for these particular types of data, or may not be nuanced enough to describe the variants of licensing we may want to allow. That's interesting. Daniel, what are your thoughts about the difference between open source data and these closed-off black-box systems? Obviously you have someone like Amazon or Google, where they've collected all of their own data and therefore have the power with their own algorithms, but as you mentioned earlier, that's a little different with open source — can you dive into that a bit more? Well, there are two sides to this. I think that for a lot of the things where we're going to use AI, we need open source even more than we needed it with pure software. The difference is that traditionally it's all about code — we're code-centric, and the source code is complete and describes the functionality of the software. With ML, suddenly you need the data to describe the full functionality — the training data describes the full functionality — and depending on what you're doing, that might change while you're running the software. One problem is that if you have a machine that derives information and draws conclusions directly from data and takes action based on that, there's no human in between anymore, and I think that drives an even greater need for the transparency that only open source can provide — that's one aspect. Trusting a black-box service is getting harder and harder as more knowledge, more intelligence, goes into that service. And on the other hand — I agree with what Jared said, and I'll be curious what Uli's perspective is on open source versus AI and how we keep that consistent. We've reached a kind of detente with firmware blobs, which are opaque binary data necessary to make a system operate, but I don't think we can stay static at that same level of acceptance as ML models become more critical to the pieces of software we're using. Yeah, so the data aspect certainly makes up part of the solution, but beyond that — David Penner mentioned this also — the implementation is important as well: it has to be freely available and inspectable, because the software built for modeling, say a deep neural network or other techniques — well, why would you trust it to actually do what it says?
So something could be hidden somewhere in there — if the user is Greek, then spit out a thousand dollars, otherwise zero — I could hide that in there, so it has to be inspectable. And this also means that, for the data itself, even if the implementation is right and someone is delivering the trained model with it, it has to be replicable. Machine learning really is nothing but the scientific process put into practice for everyone to use, and one of the aspects of science is that you have to design an experiment and make it reproducible, because otherwise no one trusts it — and the same will now be true for every single program out there. Therefore, as Jared mentioned, at some point we have to think about how we deliver models along with the software, but also the description: things like notebooks, which have been available for a long time — Mathematica has had them forever — are going to be the way you describe how you arrived at the state your model is in right now, so people can say, yes, I see this, I can replicate this, and therefore I can trust the service. And I think firmware is a compromise people are willing to make because hardware is proprietary — hardware is physical, so it's somehow an anchor to something historically proprietary — but that's changing, right? I wouldn't say firmware is trusted — just look at Intel. I didn't say it's trusted; people dealt with it, and we're actively working to do away with that stuff, because we lack the ability to introspect and audit the hardware we run on to the same extent. I think what Daniel was saying is that the firmware is abstracted to the level where I don't de-cap my CPU and use a scanning electron microscope to validate that it's doing what it says it does — and there could be all sorts of malicious things happening — but at the point where whole operating systems are embedded in the CPU, and not the best operating systems — I think it was Minix, and an obsolete version at that, many versions behind — at that point it's a bit different: it's time for open source hardware, and we're getting there; we're working on that as well. And the more autonomy machines get, the more important this is going to be, because the more they take over, the more we have to trust them, and that becomes a very fundamental question for business and for society. Well, and I think there's a very related security issue you raise that goes back to the training data for models and the necessity for it to be replicable in an open source product, because one of the big challenges with machine learning systems is that the ability to introspect the reasons why they make the decisions they do is a much younger area of research than the learning models themselves. So we need to be able to ask on what basis these decisions were made, and also to make sure there isn't a trojan horse in the model — as we suggested, if you're the right person, you can take all the money out of the ATM, because it has an ML model deciding what it should give you — which is not a very good use case for a model.
All right, so we talked a lot about the hardware part of it, and hardware versus software. What about software, in terms of the data versus the source code? When we start to talk about the entanglement of needing to have enough data for the algorithms, but also having to have the right algorithms that can analyze the data, Daniel, what are your thoughts on that?

That's a very broad question. Yeah, but the main thing is that data science, machine learning, AI, whatever people call it, is not something you should think you can apply just by knowing, oh yeah, I can call this library and be done. You have to think about a model, as produced by a machine learning algorithm, like a function: if you put in good data, you get good results out, perhaps, but if you put in garbage, you get garbage out. So to be able to successfully use any of the machine learning techniques, you have to be able to distinguish whether the model is garbage or not, and that's not that easy. This is something which requires you to understand the math, to know what the weaknesses of the various techniques are, and so on. So this is not as easy as it looks. There will never be black boxes where you can say, well, here's my data, do something with it, and in the end it will always work. That's never going to be the case, in my opinion, and therefore you have to be careful with what you actually do and who is actually doing the work. Again, anyone can produce a model, but whether the model is useful depends on a lot of quality factors.

Do you see that as a challenge for organizations? Maybe Jared, I don't know if you've come across this, but companies want to move towards AI and ML and they implement these, but it does seem to require quite a bit of work to fine-tune, to get the feedback from whatever service is being used, and to make sure that the model is accurate. Are companies prepared for the workload that it takes to actually tune these models on an ongoing basis?

Well, I think what Uli was getting at is actually just a microcosm of software development as a whole, which is that many businesses are very poor at understanding what software they want to develop, and therefore the software that they produce doesn't do what they want. We don't have the ability to open up a text buffer, write a short description of what the software should do, and have a system generate it automatically, and as Uli points out, we never will, because half of the hard part of software, or at least a large percentage of it, is deciding and determining what it's supposed to look like in the first place. I'm not suggesting that the right way to go about this is big design up front, multi-thousand-page specifications, but understanding what problems you're trying to solve is critical to selecting the tools that you're going to use to solve them, validating that you're solving the correct problems, or at least the problems you've set out to solve, and then deploying that software. So AI and ML techniques are just another set of tools in the box for solving business problems, personal problems, educational problems, societal problems that we want to influence with systems. In the business framework, you have to know what your problem is before you know what tools you're going to use to solve it, and you have to know what those tools are capable of before you can select them. And so there's a whole spectrum of tools for data analysis, from traditional
statistical techniques, which is a lot of what these data science workbench products allow you to work with, to the newer ML techniques. It largely comes down, again, to knowing what problem you're trying to solve. Can you clearly define the inputs and the outputs that you want out of your process, whether it's statistical or ML? If you can't, then you need to go back and figure that out, because none of the tools are going to solve the problem for you. Then it's simply a matter of selecting. Is this something where, when looking at the problem, I can describe what the rules look like? If there's a very large space of possible input scenarios but I have a very easy way of describing what rules apply for which actions I want to take, then a neural network isn't necessarily the right tool; there are other rules-based systems you can use. If, on the other hand, it's really easy for me to create a huge set of example data to feed into an algorithm, where I can define very clearly 10 or 100 or a thousand parameters from that input data set and can clearly define what outputs I want, but it's really hard for me to describe, in relation to those inputs, what rules need to be applied, that's where a machine learning or trained neural network approach makes sense. So these are all tools that you have, and yes, lots of people are using them wrong, and lots of people are using them because they're buzzwords. We can have our AI IoT blockchain talk after this and file for an IPO for a startup that's going to do blockchain neural networks.

Can you do that with blockchain? I think you can do it all with blockchain, that's my understanding. Or is it an ICO rather than an IPO? We'll do an ICO for our AI IoT blockchain startup afterwards, and we'll have used all the right tools in all the wrong ways.

So there's one other thing, and this confuses many people who don't necessarily have insight into it. Oftentimes people are using machine learning, especially deep neural network learning mechanisms, in what are called unsupervised learning situations. This is the entire business model of the likes of Google, Facebook, et cetera: they're trying to learn, in an unsupervised way, from massive amounts of data, something about you that they don't actually know. They don't know what it is; they're just looking at similarities of some sort, in a completely unsupervised way, with no human intervening. So why does it work? Why do I say, well, this doesn't work in general, yet it works for them? The reason is simply that the impact of them getting the model wrong is that you see an ad you don't like. Big whoop, who cares? If, on the other hand, you're doing these kinds of things and writing a piece of software which decides whether to shut down the cooling in a nuclear reactor, guess what, it's a little bit more important that you don't get it wrong. There's a whole spectrum in the middle, of course, but I would argue that pretty much everyone who wants to be taken seriously doesn't have the freedom to just ignore the quality of the model. Most of the time you really want something that is solid in its results. In my previous life I worked for a financial company, and there they could not do these kinds of things automatically, because the result might have been losing a couple of hundred million dollars. That's not doable.
It's not just human life that might be at stake. Therefore I would really argue that these nice-looking prospects of machine learning that come from the use of unsupervised learning might be good for things like marketing et cetera, but for pretty much no other area of life where you actually depend on the result.

All right, so we're just about out of time. Do we have time for any questions? Okay, so we'll take a couple of questions. Anyone have any questions? Sure, over there. You have another mic? Just a second, they'll come with the mic.

Hi, hello. I just wanted to know your views about AI and gaming, where the AI interacts with humans constantly and there is a lot of criticism about it. AI and gaming, I wanted to know your views about AI and gaming.

Gaming? Who's a gamer here? So I did write my own games for a long, long time, and what we called AI back then doesn't really compare to what it is today. I think you're mostly referring to OpenAI and the like trying to solve games using AI. That's a completely different thing. They're using games mostly because it can be done 100% inside the computer without tiring anyone out. It's just an example of a problem where the input is not a binary one or zero; it's visual input. There's a similar experiment where, if you own a GTA license, you can write your own driving simulator using GTA on a machine that just does a screen grab, which would be the equivalent of a camera in the windshield. That's the only reason these companies are using games for this: games provide something where they can cheaply create situations with extremely complicated inputs and try to build something we thought we couldn't do before. That it happens to be games is beside the point; that's just a coincidence.

Great. I think we're just about out of time here. We want to thank the panelists for joining us, and they did say you get a special door prize if you guess who's been at Red Hat the longest. Hey everyone. So the next topic is Armed and Loaded: using R and OpenShift. Thank you, and over to Ricardo. Thank you.

All right, thank you. So, good afternoon everyone. I'm glad to be here at DevConf. I had the opportunity to go to the last two DevConfs in Brno, and I think it's a very good conference to share experience with developers from other areas. And I'm very glad to be here. Speaking of experience, what I'm going to show is a bit of my experience with data visualization and data manipulation, and a bit of my personal initiative to bring R onto OpenShift. So let's get started.

Well, I think most of you know what the R language is, but for those who don't: believe it or not, R is a 25-year-old language. So it's pretty old, and its main use is data manipulation, calculation, and graphics display. It's very useful now that everyone is talking about big data and machine learning; we're talking about using R for statistics, data exploration, data visualization, and machine learning. I personally love the language, but why would someone use R for data manipulation? Well, R is a very useful language not only for software developers; those who are not software developers can use R too.
It's a language with a very easy syntax, and on top of that there are many libraries covering many, many applications, which you can download through CRAN. CRAN stands for the Comprehensive R Archive Network, and personally I'm still trying to find the "comprehensible" in CRAN, but that's just a personal opinion. However, be careful when you try to load a huge amount of data, because R uses in-memory calculation, so if you have a huge data set to manipulate, R is not recommended for that kind of situation.

All right, just to show a little of how easy R is for data manipulation, I'll start an R session here. This is the R prompt. It's like a bash shell: you can run commands here. I'll start by reading a CSV file, and as you can see I have type completion not only for commands but also for paths. I'll use the US elections CSV, and since I want to handle this data, I'll put it in a variable. All right. Now I can check the first six lines of this data, and I can check the last six too. With the summary command I can look at my data a little: for example, total votes 2008 is a numeric column, so I get some measurements of the data distribution like minimum, maximum, the quartiles, mean and median, and in the case of a text column, like for example county name, I get the frequency of each value in my data set. As I said, R is for statistics, so I can also calculate the standard deviation of, for example, total votes 2008, and the value is printed right there. And what else? Well, I can also plot the data, so let me plot the distribution of total votes for 2008, and I get a very simple graph showing the spread of my data.

All right, so what about companies? How many companies are using R today? Well, there's Facebook, for behavioral analysis; they use a lot of it for things related to status updates and profile pictures. Google uses it for advertising, Twitter for data visualization. Microsoft and IBM are part of the R Consortium, the group that supports the R Foundation. Uber uses it for statistical analysis, and there's Airbnb, ANZ, and so on. The source for the list of those companies is below.

And what else do we have for R? As I said, R has an extensive repository of libraries, and one of the things that makes R very good is creating quick data visualizations around your data. In particular, there's a special library called Shiny, which is very useful for creating dashboards in a web application style. Let me show the Shiny web page. You can create dashboards using your data, create visualizations, create interactions with them using the standard Shiny components, and you can also change the styling using CSS, themes, and so on. So a Shiny dashboard is, in my opinion, a very good option for making quick visualizations and publishing them in a web application style.

All right, so now we know what R is. Let's talk a little bit about cloud and containers. We currently have this way of using cloud computing and containers, and most of the motivation around it comes from what NIST defined as cloud computing. They set out five requirements for cloud: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
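For readers following along, here is a minimal sketch of the kind of interactive session Ricardo walked through a moment ago; the file name and column names are assumptions for illustration, not the exact ones from his demo.

    # Load the data set into a data frame (file and column names are assumed).
    elections <- read.csv("us-elections.csv")

    head(elections)     # first six rows
    tail(elections)     # last six rows
    summary(elections)  # numeric columns: min, max, quartiles, mean, median;
                        # factor columns: counts per value

    # R is built for statistics, so summary measures are one call away.
    sd(elections$total_votes_2008)    # standard deviation of a numeric column

    # A quick base-R look at how the values are spread (hist() or plot() both work).
    hist(elections$total_votes_2008)

Everything above is base R, so it runs in a plain R prompt with no extra libraries installed.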
So everything we didn't have in standard, data-center-style architectures we have in cloud, and that is where most of the motivation around the technology itself comes from.

All right, and why did I choose OpenShift to run R? Well, I've been using OpenShift since the alpha version, which was, let me see, four and a half years ago, and one of the things that makes OpenShift, in my opinion, a very good platform is that it's a developer-friendly environment. With just a few clicks you can publish your application, and you get a DNS name to access it externally. You can also have pipelines, so you can split your workflow into stages where you test and then deploy your application into production. And with OpenShift 3 you have all the capabilities to build your applications in a microservices style.

All right, so mixing cloud computing and R gives us the R image, which is published and available through the RedNet.io project. It's a project aimed at bringing machine learning and big data capabilities to OpenShift, with Apache Spark as its central component; Spark is one of the biggest projects in the big data area and a very useful tool for large-scale computing on your data. R is among the languages supported by Spark, which also supports Java, Scala, and Python.

The R image in OpenShift also offers the S2I-style workflow to build your application. For those who don't know what S2I is, it's the standard workflow in OpenShift for creating your own application image by providing just a base image and the source code in a Git repository. The R image also has its own dependency management. Although we have the CRAN repository, which hosts all the libraries for R, there is no standard way to provide metadata where you simply declare the dependencies of your application. So this was one of the challenges with the R image, and I needed to create my own dependency management mechanism using a kind of metadata file, a very simple text file.

All right, so let me show a quick demo with the R image. This is the repository where I keep my Shiny application, and the thing I'd like to draw attention to is the dot R libraries file. What does that file mean? Well, this is the file I mentioned on the last slide that makes the dependency management mechanism work in the R image. As you can see, it's a very simple file, with each line naming one of the libraries required to build your R application. And this is honestly the only difference, the only capability I created, to make S2I work for R on OpenShift. We also have the main entry point, app.R, which calls the server and the UI, the back end and front end of a Shiny application. I'm using a custom CSS and JavaScript file, and in this example I keep my data files inside my repository, but I could also use external storage to load the data. Well, I think this is basically what I have in my repository.

So going back to OpenShift, this is my project with the US elections data set, which contains the data for the last three elections: the number of voters in each county. All right, so this is the visualization, one second. Okay. I created a JSON file with all the county boundaries and then merged the US elections data into that JSON file.
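As a rough, hedged illustration of the two files Ricardo just described, here is what the dependency metadata file and the app.R entry point might look like; the dependency file name, the package names, and the data columns are assumptions for illustration, not the actual contents of his repository.

    # Hypothetical dependency file (one CRAN package name per line), read during the
    # S2I build so the libraries get installed into the application image:
    #
    #   shiny
    #   dplyr
    #   ggplot2

    # app.R: a minimal Shiny entry point wiring the UI and the server together.
    library(shiny)

    elections <- read.csv("data/us-elections.csv")   # data shipped inside the repository

    ui <- fluidPage(
      selectInput("year", "Election year", choices = c("2008", "2012", "2016")),
      plotOutput("votes")
    )

    server <- function(input, output) {
      output$votes <- renderPlot({
        column <- paste0("total_votes_", input$year)      # pick the column for the chosen year
        barplot(elections[[column]], main = paste("Votes by county,", input$year))
      })
    }

    shinyApp(ui = ui, server = server)

With a layout like this, the S2I build only needs the Git repository URL: it installs the listed packages, copies the sources in, and runs app.R as the container entry point.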
So what you're seeing here is the map of the US divided by counties, and the color shows the number of voters in each county. There are many areas with the same color; the problem with this data set is that most of the values are below 4,000 voters, so most of the counties end up with the same color. But there are also, let me check, some counties here with more than 10,000, maybe here. Yeah, there are just a very few of those. As you can see, if you point at a county, you can see the county's name and the number of voters in that specific county.

I also created a very simple page to explore the data. There's a table here with the state name, county name, the party which received the votes, and the number of votes in each election, so for 2008, 2012 and 2016. I used the Shiny components to create a very simple filter, so I can look at the state of Massachusetts and Essex county, and I get only the data related to that state and county. You can also use the other filter here, for example Essex, right, and it shows all the matches in the data.

Okay, lastly, I have this other visualization, a bar graph with the votes by state in the 2008 election, but I can also choose 2016, so the data changes as I change the parameters. And I have a checkbox to color the votes by party; that's what the colors show, and there's the legend of the graph here.

All right, so this is a very simple dashboard I created with the US elections data, just a demonstration of the capabilities of the R image inside OpenShift. And I know there are lots of improvements I need to make. For example, I'm going to prepare the base S2I image better to support machine learning features. Maybe I need to do some more research on things like GPU scheduling, add full Apache Spark support (I know there's support through sparklyr, but I'd like to add SparkR as well), and improve build times, because the CRAN repository only stores the source code for each library, and the build process basically downloads the source and compiles everything inside your image, your container. That makes the build times long; the US elections application, for example, takes about 16 minutes to build. So I'll try to come up with a better way to improve the build times. Also, I only have the image streams for R so far, so I'm going to create some templates to make it faster to create applications using R, and try to find a better way to handle dependency management.

All right, so that's all I have so far. Before I finish the talk, I'd like to thank everyone who attended. It's my first time talking about R, so I was a bit nervous. If anyone has any questions, feel free to ask.

All right, thanks Ricardo, that was really nice. When you were showing the R interface, one of the commands you ran was a summary, and it showed statistics about the various pieces of information in your data source. Yeah, yeah, the R prompt, yeah. So in that summary, can you actually index each of those pieces of information to pull it out programmatically? Could you, if you wanted some of that information... I don't think I follow your question. Well, could I run the summary command and then say I want the top entry for county name, or would I have to run a different query to do that? Like, would I ever use the summary in my application?
Well, let me see if I understood, because you're speaking a little far from the microphone and I'm not hearing you well. So, summary: would you ever use this command inside an application to get the summary data? If I understand your question, would summary be part of my application, is that it? Would you use it in an application? Oh, okay, right. Well, to be honest, summary and some of the other commands I ran here are just commands for data exploration; you have other commands for building applications. The idea behind these base commands is to get a very quick insight into all of your data. For example, summary is just a base command to see what your data is, how your data looks. Since R is mostly used for statistical calculations, summary usually brings up all the statistical information about your data. Here I'm running it on the whole data set, but I can also choose a specific column, for example, which is often better.

Also, and for this I need a special library called dplyr, which is kind of an advanced data manipulation tool: you know how in the shell you have the pipe, where you take the output of one command and pass it as the input of the next command? We have the same in dplyr, except it's not the pipe character; it's this strange symbol, which they also call the pipe. So with the elections data, I can filter by total votes in 2008 below 4,000, and there's the data. Let's do another thing, sorry. With the output of that command, I'd like to select only the total votes in 2008, and with the output of that, I can call a box plot. And there it is. So what I did: if I had used the same command on the raw data, the box plot would be very hard to read because of the outliers, but instead I filtered the data to keep the rows below 4,000, then selected only the column I wanted to visualize, and then called the box plot, all in a single pipeline. So all these commands are mostly for data exploration, and Shiny can then be used to build the visualizations you publish as a web page. Does that answer your question? Yeah, it does. Thank you. Cool.

Hi, can you hear me? A little bit. How about now? Okay. So you started by saying this tool is easy, that it's not just for developers, that you don't have to be a developer to learn it. You were talking to me. So, a few questions. If you use a comma-separated CSV file, do you have to format it in any particular way for R, or is it just any CSV file? That's number one. Can it be any CSV file?

Actually, a CSV file is just one of the sources I could use. I was handling another data set, atmospheric data for example, which comes in a special format called NetCDF. It's binary data, so it's very hard to work with in standard text editors, but there's a library in R called RNetCDF that can read NetCDF data, and I wrote a function to read the NetCDF data and export it to CSV. But it can be pretty much any CSV: it can use different separators, and you don't need the first line to be the column names, though then you have to specify that the first line is not the column names. You can also have row names, but you need to specify whether or not to read the first column as the row names. So there are lots of options for reading, and it handles pretty much any CSV data.
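A rough sketch of the pipeline and the CSV-reading options described in the answers above; the "strange symbol" is the %>% pipe that dplyr re-exports, and the file and column names are again assumptions for illustration.

    library(dplyr)

    # read.csv options: header says whether the first line holds column names;
    # row.names can point at a column to use as row names.
    elections <- read.csv("us-elections.csv", header = TRUE)

    # Filter out the large counties, keep one column, and hand it to boxplot(),
    # chained with the %>% pipe instead of nesting the calls.
    elections %>%
      filter(total_votes_2008 < 4000) %>%
      select(total_votes_2008) %>%
      boxplot()

The same result could be written as nested base-R calls; the pipe just keeps the exploration readable, one step per line, which is the point Ricardo makes.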
And not only CSV data, but any other format that can contain data.

I'm intrigued by this, because I just had a project where I had CSV files all over the place and I had to use grep, sed and awk to get what I needed and print it out. And the R language, as you presented it, looks like it would have been great if I had known it two weeks ago. So for a newbie, how much effort would you say it takes to get to the point where you can create reports or something like what you showed today? Thank you.

Well, I'm going to close that R session and start it again. Can you see there, in that last paragraph of the startup message, there are two very helpful commands? There's the demo command, which gives you a list of very simple examples of how to use R, and there's the help command for the online help. When you run it, it opens a very simple HTTP server and launches your web browser with all the online documentation for R, including some very beginner-friendly R tutorials. So for me, if I were starting to learn R, I would begin with one of those two commands.

All right. So thank you, everyone. Thank you, Ricardo. Thank you, everyone, for attending today, and we hope to see you tomorrow.