So I apologize for the delay. Thank you all for coming. As I said, we are still figuring out the video streaming, and that took some precedence. My name is Daniel Riek. I work in the Office of the CTO at Red Hat, where I run the Artificial Intelligence Center of Excellence, and today we're talking about our open source AI vision. So what are we talking about here? Oh, great, now it works. So what are we talking about? AI. I'll use AI as a shorthand for the whole broader cluster of buzzwords: big data, analytics, machine learning, cognitive systems. We'll talk primarily about machine learning, but I'll simplify by calling it all AI. Overall, we at Red Hat see AI as one of the biggest changes in the industry, probably the biggest change since the industrial revolution, because we see a new kind of automation coming, one that applies specifically to IT itself. The key is that every human task can be automated if you can describe the mapping from input to output in a probabilistic way. In the software industry specifically, that means moving from humans understanding data and relationships and then encoding their conclusions in code, to machine learning models learning from the data and coming to conclusions autonomously. There are still humans involved in defining the models and getting the data, at least in the initial data engineering, setting it up. But deriving decisions from the data is something the software now does autonomously. And even in areas where you still want an explicit, static rules model, and you might not want learned models everywhere, what we are seeing, and we already see this in many areas today, is that even if you need an auditable, very strict rule system, the rule development itself employs AI technologies to make sense of the data.
So even if humans are writing the rules, they will often use AI tools to make sense of the data, to do the data engineering and the data analysis. We see a lot of AI being used in many of the things that computers do in automation, and we'll go into some examples of that. This transformation changes how we interact with software. It changes the role of software itself, of code itself, and the role of data science. Overall, we have a name for this kind of application: we call them intelligent applications. Applications that collect and learn from data, and typically gather more data as you use them. There are a couple of common examples that all of us use today. That's how AI and ML are used today throughout the software industry and all the industries that are software bound, which is most industries, from banking through anything with security to production automation. Interestingly, most of the big recent advances in artificial intelligence are based on a business use-case-driven application of AI, combined with the compute power we now have and the availability of real-life data. There are a couple of examples where big advances were made that way. It was well understood that you could do handwriting recognition, but it became really usable when the post office provided large amounts of data to researchers to solve its problem of recognizing addresses on letters, and then it was solved. We all know the rating and recommendation systems; there is machine learning behind those, more and more, and it takes a lot of data. Image recognition is another example where in principle it was well understood that it could be done.
We only got the compute power to do it in the last five to eight years, and then the data became available and was made available, and it was driven by business use cases applying the well-understood theoretical capabilities. That's what got us to where we are today. And you have to see, in that light, things like Amazon offering a service like Rekognition, a facial recognition service that you can use. They provide it ready to use, and they're going to towns and police departments, trying to get them to connect their surveillance cameras to that facial recognition system. They provide a free service, and they get free training data back. That's the dynamic driving the advancement of the technology here, and it's incredible how fast this is moving right now: how fast new applications appear, how fast the boundaries get pushed, how fast algorithms and trained models get better. We see this as a continuation of the recent changes in the software industry: from a hardware-centric model we went to "it's about the software", software is eating the world; then to cloud-native; and from cloud-native we are going to "data rules the world". An important point here is that in AI, code, models, and algorithms are really important, but because it's all driven by the data, no amount of algorithmic sophistication can overcome a lack of data. Data is, in the end, an equal to code. You now need two halves to make a program functional. If I write software, say I use machine learning in the development process for an open source project, I now need the training data and the code to make it functional. So now it's not software eating the world, it's AI eating software.
That's what we're seeing right now, because everyone with a business use case, where you have some kind of automation, you have data, and you can derive the action the software has to take probabilistically from that data, will see trained models being used, either indirectly, or directly in the user interaction, taking direct action. An interesting aspect of AI is that it's very open-source-friendly on the technology side. Most of the big things people are using are available as open source. Most of the research, even from the leading companies in the space like Google, Facebook, Microsoft, Tesla, and some others, is being published and contributed to open source. The problem is, if you don't have the data, that doesn't help you, because the code is non-functional without the training data. That's an important aspect we have to think about: what does it mean for open source? What does it mean, for example, from a copyleft point of view, if you're publishing software but it's not functional? You are then potentially in conflict, at least with the spirit of a copyleft license. And we see right now that a lot of companies are willing to share the code, but they see their proprietary differentiation in the data. That's a shift. So what does it mean from a Red Hat point of view? We at Red Hat see AI from roughly five perspectives. First, it's of course a workload for a platform. We're an operating system and platform company to a large degree, and with AI being such a big transformation, something every business is dealing with, we of course want to provide the best platform for that use case. We are also applying AI in our internal processes, specifically in our software development process. A lot of the talks you're going to see today look at how we can use AI, for example, to improve software quality by doing flake analysis, or how we can use AI in systems operations, right?
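To make that flake-analysis idea concrete, here is a toy sketch, not our actual tooling: for one intermittently failing test, collect the context attributes of each failure and flag any attribute whose single most common value covers almost all of the failures. That correlation hints the "flake" is actually condition-dependent. All names and data below are invented.

```python
from collections import Counter

def suspicious_attributes(failures, min_share=0.8):
    """Given failure records (dicts of context attributes) for one test,
    return attributes whose most common value covers at least `min_share`
    of the failures -- a hint the 'flake' is correlated with that
    condition rather than being random."""
    if not failures:
        return {}
    hints = {}
    for attr in failures[0]:
        top_value, count = Counter(f[attr] for f in failures).most_common(1)[0]
        if count / len(failures) >= min_share:
            hints[attr] = top_value
    return hints

# Hypothetical failure contexts for one "flaky" test
failures = [
    {"node": "n1", "hour": 0, "db_version": "9.6"},
    {"node": "n3", "hour": 0, "db_version": "9.6"},
    {"node": "n2", "hour": 0, "db_version": "10.1"},
    {"node": "n1", "hour": 0, "db_version": "9.6"},
    {"node": "n4", "hour": 0, "db_version": "9.6"},
]
print(suspicious_attributes(failures))
# -> {'hour': 0, 'db_version': '9.6'}  (always fails at midnight on 9.6)
```

A real system would look at many more vectors than this, which is exactly the point: the machine can check all of them at once.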
You know, you could say that there are fancy, very flashy use cases like self-driving cars, which are very interesting. It's a big deal, but it's also very hard, because the challenge is interacting with the physical world, and it really matters whether that white thing in front of you is a truck cutting you off or a traffic sign, right? One system got that wrong, and the driver unfortunately died in that scenario. So that's a really hard problem. At the same time, that car is basically a data center on wheels, all this AI is running in the car, and you probably don't want a sysadmin in the trunk. So you want a self-driving cluster to run your self-driving car. And that's low-hanging fruit, because it's all supposed to be machine-readable already, right? We're talking about machine-generated information, interpreted by machines, with machines then taking action. Today, if you're doing systems automation, if you're doing anything in SRE, you're writing a lot of automation scripts. A lot of what we are looking at with AI right now is how we can automate IT processes, the automation of the computers themselves. How can we move from static heuristics to learned models in core products, basically in every device, down to the kernel eventually? We're not there yet, but that's going to happen. And then in systems management, log analysis. Interestingly, you can apply a lot of the same algorithms that everyone else uses: anomaly detection, clustering. You can use the same things that people in trading and finance are doing. Now, when we use AI internally, we of course follow our open source development model, so it will be available to the community and in general. But customers don't necessarily know that we're using AI. The products are still the same: you get your RHEL subscription, you get your security fixes; code quality should improve, responsiveness should improve, things like that.
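The anomaly detection for log analysis I just mentioned can start out very simple: learn which log message shapes are normal, then flag anything you have never seen. A toy sketch under that assumption; the normalization and all log lines here are invented, real systems use far better templating:

```python
from collections import Counter

def template(line):
    """Crude normalization: strip digits so variable values
    (timings, node numbers) collapse into one message template."""
    return "".join(ch for ch in line if not ch.isdigit())

def train_baseline(log_lines):
    """Learn how often each message template occurs under normal operation."""
    return Counter(template(line) for line in log_lines)

def anomalies(baseline, new_lines, min_seen=1):
    """Flag lines whose template was (almost) never seen during training."""
    return [l for l in new_lines if baseline[template(l)] < min_seen]

normal = ["disk io ok 12ms", "disk io ok 9ms", "heartbeat from node 3",
          "heartbeat from node 7", "disk io ok 15ms"]
baseline = train_baseline(normal)
fresh = ["heartbeat from node 5", "kernel oops at 0xdeadbeef", "disk io ok 11ms"]
print(anomalies(baseline, fresh))  # -> ['kernel oops at 0xdeadbeef']
```

The point is the shape of the workflow: train on observed behavior, alert on deviation, rather than hand-writing a rule per failure mode.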
But then we embed that in the product, and we create services that are AI-based, at which point the customer actually becomes aware that they are being supported by AI. An example is Red Hat Insights. It's a predictive support service that looks at your system configuration and tells you if you have problems. There are rules in there that are human-generated with AI enablement, and there are automatic decisions that happen based on AI. It aggregates data from customers: it reports your system status, and what we demoed at Red Hat Summit, for example, was the system looking at your configuration and your performance data versus all the other customers we see. The example was degraded performance in a cluster, and the system told you: your performance is out of bounds compared to everyone else, you're an outlier, and your configuration is an outlier; why don't you change this, and then your performance might get better. And then we pressed the button, it changed the configuration, and the performance got better. It's basically using AI, and the key here is that it's not fundamentally different from our support service today. If you have a problem in production, you call Red Hat Support; they have a lot of knowledge, they have our knowledge base, we have really good support people, and they know how to diagnose systems. But we're trying to augment them with AI that can process more data, along more vectors, more quickly, and then give you a faster or even predictive answer, before you call, that you have a problem there. It doesn't replace human support, but it gives you a lot of value before you have to escalate to human support, and human support can focus on the really hard problems where you can't extrapolate from statistical input. A lot of this is basically herd immunity, right?
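The outlier check in that demo can be sketched in a few lines: compare each system's metric against the whole fleet and flag the ones that sit far from the mean. This is a minimal z-score sketch, not how Insights actually works, and every host name and number below is made up:

```python
import statistics

def find_outliers(fleet_metrics, threshold=2.0):
    """Flag systems whose metric deviates more than `threshold`
    standard deviations from the fleet mean (simple z-score check)."""
    values = list(fleet_metrics.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [host for host, v in fleet_metrics.items()
            if abs(v - mean) / stdev > threshold]

# Hypothetical per-host throughput numbers (MB/s)
fleet = {"node-a": 480, "node-b": 495, "node-c": 510, "node-d": 502,
         "node-e": 505, "node-f": 490, "node-g": 498, "node-h": 130}
print(find_outliers(fleet))  # -> ['node-h']
```

The real service then goes a step further: it compares the outlier's configuration against the non-outliers to suggest what to change.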
We're going to be able to predict problems based on what we have seen before. For the things we haven't seen before, we still need human creativity to figure out what's really going on. But again, we can augment the humans with a lot of context. So we'll put AI into the platform products themselves and into our support services to turn our platforms into intelligent platforms, and then, for intelligent apps, we are providing the same capabilities that we are using ourselves. We are running this on our own stack, with our own toolchain, and we're working with our ecosystem, the broader open source community and the commercial ecosystem, to create the end-to-end AI solutions that we use to build this, and we make them available to our customers so they can go and build their own intelligent applications to serve their customers. So in a way, we are a user of AI, and then we become, not a vendor of AI in the sense of an AI vendor, but an enabler, by providing open source technology that lets you build AI. And all of that is based on the foundation of data. It's the recognition of, and the culture shift away from, being code-centric, which we are, like most software companies and the open source community today. It's all about code; our value is in the code; that's our mindset. And it has to shift to treating data as an equal to code. It's data and code going forward. So what we're doing there is basically the usual things in enablement. AI is an interesting topic because it's very hardware-bound. Suddenly hardware performance matters again. In the cloud, everything was about scale, and individual vertical system performance wasn't the key differentiator. Now it becomes a key differentiator again, which is great for us, because we have this hardware enablement capability end to end.
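One concrete way to treat data as an equal to code is to pin training data by content hash next to the code revision, the way a source RPM pins the sources a binary was built from. A minimal sketch of that idea; the file names and contents are hypothetical:

```python
import hashlib
import json

def data_manifest(datasets):
    """Record a content hash per training input so a model build can pin
    its data exactly the way source packages pin source code.
    `datasets` maps names to raw bytes (in practice: file contents)."""
    return {name: hashlib.sha256(blob).hexdigest()
            for name, blob in datasets.items()}

# Hypothetical training inputs for a model build
manifest = data_manifest({
    "telemetry-2018-06.csv": b"host,cpu\nn1,0.42\n",
    "labels.json": b'{"n1": "ok"}',
})
print(json.dumps(manifest, indent=2))
# Ship the manifest alongside the image/source so retraining is reproducible
# and auditors can verify exactly which data a model learned from.
```

This is the reproducibility half; the compliance half (who may see which dataset, and therefore which trained model) layers access control on top of the same identifiers.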
There is a lot of work to integrate, because AI rides on the shoulders of cloud, of microservices, of containers, and I'll talk a little more about how that looks in detail. So there's a lot of work going into integrating and enabling an ecosystem so things can work together, and we'll get to that; we have a talk later where I'll go into a little more detail. We have a project called the Data Hub, which is basically a reference architecture for an end-to-end AI platform built on top of Kubernetes and Kafka, with S3 as the storage interface and Ceph as the storage in the Red Hat world. We're actually operating that inside Red Hat, and we're going to announce operating it in the Massachusetts Open Cloud for the community as a project. As a workload, AI is really interesting from a hardware point of view: you need end-to-end enablement, and you need not only performance, you need data security end to end, because data becomes important. In DevOps we have figured out how to manage the application lifecycle, the code lifecycle, which is where Kubernetes-based platforms and containers are thriving; now we have to figure out how to manage the data lifecycle. When you're training a model, you need to keep the training data for completeness of your source code, for reproducibility, but you also need it for audit, because the code on its own doesn't explain anymore what the software is doing. A lot of our customers that are in regulated industries, or that have potential liabilities, need auditability, so for them it's really important that their application lifecycle management also captures the training data. And you need compliance when you're feeding data back: for example, if you're training a model with data, that model inherits knowledge from the data, so if there's confidential information in that data, it can potentially end up in your model and be disclosed that way. So you need proper separation, with compliance and access consistent between data and
trained models. An example: say you're a stock trading company. You have general market data you train a model on, and then you have customer-specific data for your large customers, for whom you want to provide customized models trained on their specific data and the behavior of their data. Such a model will be considered proprietary information by that customer; they don't want you to share it with other customers, because it might disclose aspects of their portfolio. So you need a fairly complex, consistent compliance model to ensure that this entanglement of data and code gets handled. These are really interesting problems that you have to manage throughout the whole stack. On the hardware side, a quick announcement that is a good example: in OpenShift 3.10 we just started supporting device plugins, which is what you need to get at GPUs, right now for NVIDIA, but it's the generic feature in Kubernetes that lets you expose hardware capabilities to Kubernetes, so the scheduler is aware of them, and it also figures out how to make the right drivers available to the application. That's a bit challenging in containers, because you need to know which version of the driver is on the host in order to load the right version of the low-level user space. So that's really important; it's a blocker if you want to run machine learning at performance, because for complex machine learning, which you will need for most use cases, you need GPU offloading, and that's possible now in Kubernetes. That's a big deal from our point of view, because it gives you the ability to do the integrated lifecycle in the DevOps model. On the core system side, we're looking at enabling AI in the core system itself. What that means is that we're trying to get developers to look at learned models instead of static heuristics. That's a research area, and I think we're going to talk about that a little today or tomorrow. Think of augmenting the scheduler; a good example, one thing that's being discussed, is this:
in Kubernetes, you have a scheduler, and you have a de-scheduler that will evacuate nodes if there's a performance issue or something like that. Right now they don't know about each other, so you can have a situation where something gets de-scheduled and then the scheduler puts it right back on the same node. That's a perfect example where you could apply a statistical model, but you can also very quickly get benefits from learned models, because they can factor in more vectors than static rules can, and you could have a scheduler that, for example, learns from the behavior of the de-scheduler. It's pretty low-hanging fruit, a pretty straightforward idea, and we think it will improve system performance significantly over time. That's where we would move from a static heuristic to a learned model. Another aspect is AIOps and AIDev: complexity just keeps growing. An example would be flake analysis. If you're running a DevOps CI/CD system, you have a lot of tests, and what happens in many cases is that you'll see a failure, and the next time the test is run it goes green again; it works. Most developers will just call that a flake: it's a test flake, something in the system or the test was broken, so let's move on, because it's working again. Now, there are situations where these are actually not flakes, not one-off weird failures of the system. It could be a hard, deep-sitting problem in your code that only happens on a full moon, at midnight, if someone tortured a black cat in a graveyard. Not endorsing that, I like black cats. It's just that these weird situations are really hard for humans to reproduce, because they depend on complex intersections of the data streams the system fails on, in complex microservice systems, in the underlying infrastructure. With AI we are able to process more of these vectors and see if there's a clustering, where the same thing happens and somewhere deep in this vector of inputs you see a
clustering of situations that coincide with these failures, and it will tell you: they're not actually flakes, because the log shows it really was a full moon every time this happened. And then we can inform the developers: this is actually a problem in your code, there is a bug in the moon-cycle algorithm, and it goes off and does something stupid. So we're helping deal with complexity. In the intelligent application space, what we're doing primarily right now is integration work: integrating existing Red Hat projects as well as community projects and ISVs. There's a website called radanalytics.io where you can find some pointers to what we're doing there. An example where we are doing some work today is business process automation. There's the idea of robotic process automation, which is kind of "I'll learn what a human is doing and I'll repeat it", but we're trying to help with putting artificial intelligence into business process automation to enable that; we think that's a valuable target. I talked about data already. One of the key things we are seeing is the change in mindset, and in that context, if you want to change the mindset and give people the ability to, for example, treat data like code, you need to figure out a way to manage that, and you need to give them tooling and workflows which today don't exist. This is new: there are a lot of startups in the space, a lot of projects in the space, but there is no common workflow. So we are putting up the Data Hub, it's opendatahub.io, as a project to incubate and integrate technologies to get to an end-to-end solution that's completely open source, and we're going to operate that for the community, to give you a place to experiment, to put data, and also to exchange data in the open source community, so we can start seriously putting AI into open source projects. The problem we are trying to solve here is the
complexity threshold. Everyone can get all the code, it's open source, so I can get a lot of AI tools, put them together, and then use my data. The problem is that in order to do that, you have to run a very complex stack. If I just want to use AI, I probably don't want to stand up the underlying infrastructure, figure out how to manage the GPU hardware, things like that. It's a pretty high threshold to quickly put up a build service and application platform for AI. It's very easy to put up a Jupyter notebook on your laptop and do it once, but if you want to run an actual project on it, you run into this. And this is just a simplified view of the internal stacks we are running to do AI experiments. Putting all of that together for an open source project is interesting, but it's not what you want to do if you want to be up there at the top, defining research AI models. Or maybe you just want to apply some well-understood anomaly detection. Say I'm the Home Assistant project and I want to put in an open source alternative to Nest, something that learns my users' behavior with their climate and HVAC control, in open source home automation. Do you want to have to put all of that up?
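The learning in that hypothetical open source thermostat can itself be trivial; the hard part is the platform around it. A toy sketch of learning a user's setpoint schedule, everything here invented: record what the user sets at each hour, then predict the average of past choices.

```python
from collections import defaultdict
from statistics import mean

class LearnedSchedule:
    """Toy Nest-style learner: remember the temperature the user set at
    each hour of the day; predict the mean of past choices for that hour."""
    def __init__(self, default=20.0):
        self.default = default            # fallback before any observations
        self.history = defaultdict(list)  # hour -> list of observed setpoints

    def observe(self, hour, setpoint):
        self.history[hour].append(setpoint)

    def predict(self, hour):
        past = self.history[hour]
        return mean(past) if past else self.default

sched = LearnedSchedule()
for temp in (18.0, 18.5, 18.0):   # user likes it cool late at night
    sched.observe(23, temp)
for temp in (21.0, 22.0):         # warmer in the evening
    sched.observe(19, temp)
print(sched.predict(23))  # -> ~18.17, the average night setpoint
print(sched.predict(9))   # -> 20.0, no data yet, fall back to default
```

Trivial as code, but to run it for real users you still need data storage, a training pipeline, deployment, and governance, which is exactly the stack a small project doesn't want to operate itself.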
No. So at Red Hat we are trying to provide this as an open source platform that projects can use to develop in that space. I'll skip that one. So the problem, and I'll go over this quickly, is that right now everyone just goes to the cloud to overcome that threshold. Amazon SageMaker is awesome; it gives you everything you need if you just want to do an AI experiment. You can put your data there, you give it up, but you can even put data there and share it with the world. The problem is that in the end, that's a black box. They reinvented the mainframe: you're running black-box services on leased hardware, and you have very limited reproducibility. You don't know what they're doing. The service abstraction is awesome, because you don't have to learn how to do it, but you will never be able to actually do it on your own, which for open source is the problem, for all the reasons I talked about earlier. So the risk, and that's a starving GNU on the slide, the risk is that we are starving open source if we follow the black-box service abstraction model. And so we want to enable open source to run this, with a commitment to full transparency on the code and on the data that's used in open source development, and a place to do that without having to do everything on your own. That's what we're trying to incubate with the Data Hub. Where do we expect reuse? I'll go bottom to top. If you look at the bottom tier, it's the platform: building and running containers, Jenkins pipelines, Kafka for streaming, access control, policy, all of that. Right now, everyone who is doing AI platforms is building the same thing. Until recently the standard was a Hadoop-style model, where data analytics was a special case, a standalone cluster. We are moving very quickly to a converged application platform, because everything is AI-based; all applications are intelligent applications going forward. So
we'll have a converged platform, which means you're running your applications and your data analytics in the same cluster. Kubernetes has already established itself as the default solution for that use case, so most people who are doing this are doing it with Kubernetes. Kafka is the data transport, S3 is the protocol for data access, Spark is pretty common. So there are really common things that everyone is doing in that space, and we expect a lot of collaboration there: integrating that, creating operators to make it easy to run on Kubernetes, a lot of standardization into standard products in the space. The second tier is well-understood AI functions like anomaly detection and clustering, where we think you can pre-train models generically enough that they're useful for many people. So there will be kind of a predefined library of AI functions that you can directly connect to; you can still take a model and train it further for your own purposes, or retrain it completely, but they're going to be pre-trained models, and we see a lot of interest, even in the industry, to collaborate on some of that. Then one level up you have the actual business use case, where you aggregate different models into an actual customer function. That's where people usually see their differentiation. An example would be fraud detection in a financial business: if you're a big bank, you have your own fraud detection service, and you're going to treat it as a trade secret, as a differentiator. If you're a small bank, you're probably contracting an external service that does it for you, but they will treat the model itself and the data as a trade secret. So we see less collaboration up there, but there are still options for collaboration, and in a way you can look at Amazon Rekognition, the facial recognition service I mentioned earlier, as an example of something where they're not providing transparency, they're not doing it all in open source and publishing it back like we would, but they
do get collaboration from all the people who make that model better. So you see some generalization and commoditization of use cases even at that level, but much less than lower in the stack. The common pattern we see: you have a secure data platform that abstracts from the hybrid cloud, basically Kubernetes, Linux, S3, and Kafka; we're using Ceph, OpenShift, and our Kafka offering to provide that. On top of that you have DevOps application lifecycle management that expands to data, so you treat data as an equivalent to code, and, for example, when you train or retrain a model, you package up not only the software, you package up the data with the software. Today we do something like that for reproducibility: most projects, when they build code, store the source code the build came from, or store a reference, or package up a source RPM or a Debian source package, so you can recreate the build. We're expanding that into containers now with a concept called source containers, I think someone's talking at DevConf about that, because you want the aggregate concept, and then we are going to add a model for tracing data, so you can reproduce the exact function based on the training data used originally. On top of that you have common things like language runtimes, processing toolkits like Spark or Flink, and then some common services that everyone uses, like messaging. We usually see most customers and most larger projects having a predefined library of AI functions, like we are building internally. For example, flake detection is a service we have inside Red Hat that teams can just use: you connect to an endpoint, it's a REST API, you put in your data, and it gives you results and an analysis of your data. The use case here is that you're a developer or a QA person and you don't want to learn AI, you just want to benefit from it, so you use a predefined service, equivalent to
Amazon Rekognition: you can use that because you have data, pictures of faces, and you don't have to learn how it does what it does. Very simple. Then there are the analytics private microservices: if you are a data scientist or an AI developer, you create your own services, either generalized services or your own private microservices. And then there is a data science and developer toolchain that you would use, Jupyter and so on; on top of that, general API routing, and then of course identity and access control. So from talking to customers, looking at what the startups are doing, and watching what's happening in the community, this is pretty much a pattern everyone follows, and it's where we think we can get collaboration. The idea is to create a meta-project: the Open Data Hub project is not going to drive the individual deep-AI projects; it's going to be focused on making them easily accessible in an open source context, and then operating them, in the sense of building up operational knowledge and feeding that back into the projects. The goal is to give communities access: the academic community that's using the Massachusetts Open Cloud today, and, as the goal for Red Hat, expanding it to the open source community to begin with, and to other Red Hat communities, with the capability to use AI tools at different entry points. So flake analysis for anyone in the Fedora community would be something that's pretty close, but then also the ability to do your own experiments and develop your own AI solutions. The entry point could be data only, so an S3 entry point with some governance around it, like a GitHub for data, and then you build up from there: a container platform where you can run things, streaming services, an AI toolchain, TensorFlow availability, other AI toolchains, up to a full workflow or predefined services. So I talked a little bit about that already, but here's a quick example; we have deeper talks
during the day and tomorrow by the people who are doing these experiments. For example, for our own operations teams that run our cloud services, we are working with them to look at anomalies in their clusters. The problem is that traditional systems monitoring doesn't help you anymore. You can try to create manual rules to filter things down, but either you get too many alerts or you don't get the alerts at all; either way, you are going to miss the alerts you were actually waiting for, because they're lost in the noise or your filter is too tight and you don't get them. That's a common problem: it's just too complex, there are too many interdependencies. So let's take some well-understood machine-learning-based tools and train them on the data, for example to find anomalies. If I train on the data of a cluster and I have enough data, I can identify when unusual things are happening, and I can alert on those, which is much more powerful than a static rule system. Flake analysis, I talked about that. The complexity of these systems is such that it is hard for humans to understand the complex chain of why something happened, or even to recognize that two failures are related, because the relation is deep in some condition that both share and that is not obvious to a human. A computer can find these things because it can just sift through all the data. Similarly, we are doing association rule learning: here it is still humans writing the rules, but we are helping them visualize relationships between things and giving them rule examples, so they can more efficiently derive knowledge, derive rules. Another example: when you open a support ticket, going forward your support engineer will know your state of mind, because we are experimenting with sentiment analysis. It is pretty obvious that you would do that; it is really important to know how pissed off the customer is. So the idea is to derive more
information from the information we were already given, and go deeper into that. So, to recap: for RATED overall, we think AI is an extremely important trend. It is a fundamental shift of paradigm; I am not joking, it is as big as the industrial revolution, and you can go beyond that. We are focusing very much on the technology aspect and the application of AI, but overall it is a big shift, because it is a different kind of automation. With traditional automation, instead of swinging a hammer on my kernel myself, I swing many hammers at the same time on many nodes; I press a button instead of swinging the hammer myself, but it is still a hammer being swung in the same predictable setup; it is just doing what I told it to do. Here, the machine is learning to do something based on data I give it and parameters I give it, and that means I don't exactly know how it is swinging the hammer, and I don't necessarily know exactly why it chose that specific hammer. It is not 100% predictable for me anymore. I am not between the machine and the action anymore; I am just on the outside, putting input into the machine. It is a big change. It also means that certain types of tasks we probably won't have to do anymore. Fundamentally, on a societal level, it is a big change, because in the past you could always just move from swinging the hammer to pushing the button, but now there might not be a button to push; the button is automated away. So it is a big deal, and we think everyone has to be aware of it, as a business, as a software project, as a developer, because it is changing how you interact with systems, how you interact with software, what customers expect, what users expect from the software. Now, we see a very strong trend towards this hybrid cloud container platform, which we are fairly happy about, because with Kubernetes and OpenShift we have done a lot in that space. And so the biggest priority for us is to make OpenShift the ideal platform to run AI and machine learning workloads
and enable the broader ecosystem of the open source community, because RATED, as an open source company, depends on the open source community picking this up. We are not going to do this on our own; we are only one part of the overall equation. And so we want to enable the open source community to get on board with AI and the application of AI within the scope of open source projects. If you want to find out more, we are going to publish a blog at next.reted.com, you can go to OpenDataHub.io (we should have updated that page), and Reted Analytics.io is a good place to find out what's going on with AI at RATED and get some quick starts and tools. We have, today in this room and tomorrow in Metcalf Small, I think, an AI track with a whole bunch of good talks, and on Saturday there is also a data science workshop run by Mike McEwen and Will Benton (Mike is in the back there), which I can really recommend. It's a really great workshop; it's built around Spark and how to run data science and machine learning on top of Kubernetes. So, really good.

I don't know how much time we have for questions; I think we have 5 minutes, is that right? 10 minutes? Excellent, we have 10 minutes for questions and discussion. Any questions? In the back. Do we have the second microphone back there? Note to self for the next talk: get out the hand microphone.

Hi, I'm just wondering, will you make this presentation available?

Yes, we actually live streamed it, the recording will be available, and we will make the slides available. Any other questions? Let me ask you: who of you has done anything with machine learning? Have you used it on your laptop, or done it in a workflow already? Both. Who of you is using Amazon for that?
Maybe it's not as bad as I thought, while the Reddit people are saying Amazon. Well, if there are no questions... one more.

Hi, I found your perspective quite refreshing. I have seen some quite chilling things coming out of the land of proprietary software, from companies such as Salesforce, when it comes to AI. Can you expand more on the importance of doing this the free software way?

So, why does free software matter here? I think there are multiple layers to that discussion. The first one is: if we want to do it in open source, if we want to enable open source to be AI enabled, if Kubernetes becomes a self-driving cluster, we need to do it all in open source, otherwise it's not open source. So it's self-evident that for applying it to open source software, we have to do it in open source, and that goes for the code and the data. And it needs to be functionally complete: if we don't have the training data, then we haven't provided an open source solution to the problem. I think in our world we'll see a separation of data domains. There is community data, and community data will always go back to the community; we'll see the development of some licensing schemes for data similar to copyleft. Then there is your own data; you have created a lot of data, some of that will be open source data, some of that will not, and that depends on a lot of factors. And then we have customer data, which is always going to be private, just by requirement, in many cases even by regulatory requirement, unless the customer donates the data into open source, which of course is a possibility. But for us it's really important that this is complete: open source needs to be complete, self-hosting, reproducible. That means the data you use to develop open source, to put AI into open source, needs to be part of open source, otherwise you haven't done open source. It's very straightforward there. Now you can take it a step further and look at the general trend: if you're using black box services,
you're giving up control, and that's true with anything you do in proprietary software: you have less control than in open source. That's why open source matters; open source is about enabling you to understand what software is doing and then redistribute that software. Now, if you go into cloud services, black box services that operate in the cloud, that's the extreme case of proprietary software, because not only can't you change the software, you have no input on the operation side either. Now, if you go into data services, where decisions are being taken based on data, and you don't even control the data in them, you lose control over your own data when you do that; it extrapolates. So it's a control problem; you get into a very strong dependency when you use these kinds of black box services. You lose reproducibility on top of that. Let's say you're in research, you're doing scientific research, you want to publish: how do you do peer review if you are dependent on a black box service? You don't know how it works; you don't have reproducibility anymore. It's a business problem, a regulatory problem, an audit problem. In many cases you can probably offload it: if your service provider is HIPAA compliant, you don't have a problem with HIPAA compliance anymore, but you're also completely dependent on that. That doesn't work everywhere; it doesn't work in research. It goes deeper, where of course your results depend on what goes into the model. I don't like the term algorithmic bias, because I think it's misleading; I think it's primarily just the old garbage in, garbage out problem. It's not the algorithm, the math, that has a bias; but if you put data into a training model and the data has a certain statistical characteristic, that characteristic will show up in the model. So if you have an imbalance in the data you put in, you will most likely see that imbalance in the decisions the model takes. And the problem is that if you have a model as part of your
decision process, and you don't know, you can't validate the data, then you're fully dependent on the people who select the data to train the model to do it right. So even the correctness of the results depends on the training of the model, if you're using someone else's trained model. Or it could depend on weird things in the platform: if I can't reproduce the binary platform that I trained the model on, I can't guarantee that I can recreate the same behavior. Let's say I validated that it was correct, but there was a rounding error somewhere in the GPU because of a microcode problem, or (that's actually a hard one) in a driver, or just a parameter setting; if I have no control over that, I can't reproduce it, and I might not be able to reproduce the results. So that's why we think it's extremely important, just for consistency, both for businesses and for researchers, to have transparency in the full stack and the ability to get control over the full stack. You can use services, but don't use services that will not tell you what they're doing. And then you can take it a step further, as these systems become more important, and you can get quite philosophical or political about it. Arguably, and I think it was Lawrence Lessig, the Harvard professor, who wrote the book Code, probably 15 or 20 years ago, it's all about how code becomes law. In the world today, our interaction with the world is limited by the code that we use to interact with it. A common example: if we are all in this room, that's great, we can talk unfiltered; the moment I'm posting this on Twitter or on Facebook, there are already algorithms between us, and if protocols don't talk to each other, they filter things and don't show them, so they don't show up. And that's another, bigger transparency problem that I think only open source can overcome. At the
end, you want to make sure that there's transparency in these kinds of decisions. And you don't even have to go to the big question for self-driving cars, the dilemma of killing one or five people, which is hard for humans; there's a Sam Harris book on that. For humans, the answer to whether you kill one or five people depends, in a lab environment, in a psychological test, on how you ask the question. And for machines, we will have to figure that out; self-driving cars are not possible without taking that decision. They will take that decision, and I think that kind of decision needs to be transparent, and the only way to do that is with open source, and open source treating data as part of the code. So, that was a long answer; I hope it was helpful. One more question. Two more questions. Okay, three.

I was very interested in the idea of a trusted aggregator of data being a third party. Can you talk about some examples, some really promising areas?

So, right now the focus is for open source projects to share data; that's where we're starting. Right now it would be operational data, for example out of your cluster. Or, let's take my favorite example, and if no one is doing it, I will do it: I'm running my home automation with Home Assistant, which is awesome, on a Raspberry Pi, on Fedora on the Raspberry Pi (which is important in this context; Debian is great too). And right now, for my climate control, I have to write manual rules, with three thermostats and a bunch of parameters: day, night, season, people home. It's already too much; I can't keep up with that. So it's very low-hanging fruit to just train that, and that's what Nest does, right? It learns from... I have no idea how they do it; I never owned a Nest, because it's a black box service, so I'm doing it myself. And with the Open Data Hub you can very easily create a place where you can put the data with trust; we can ensure compliance around it, you know, to make sure that data
gets managed properly, and the open source project can collaborate on it without having to build up the whole stack on its own. It would probably fall under GDPR, because of personally identifying information, so you need compliance there; I can't say that in English, but you know what I mean. And so we can work together to just make that more easily available. We're going to enable you to do it on your own if you want to, and you can participate in the project, and it's going to provide all the tools to make this very easy; but if you just want to use it, we are trying to create an implementation that you can just use, while being transparent. I think we're out of time... two minutes, so one more question.

I had a question about the practical approach of dealing with these large datasets. I can use AWS, and I can write my code, which is open source, and I can use PyTorch, which is also open source, and I can provide my data and just use AWS as a service, and still meet all of your requirements. But when we're talking about Nest data, and self-driving cars, and medical data: we've seen from a very simple example, Netflix a couple of years ago with their million dollar recommendation challenge, that even de-identified data from a very limited source is oftentimes enough to re-identify people. So when we're dealing with petabytes of data, how do you envision that people would even start to think about making that data open source without encroaching on the privacy of whoever is contained in that dataset? Because the whole premise of machine learning is that the more data you have, the better; data is far more important than the model that you throw at it, if you have enough data to get somewhere. But that inherently seems to conflict with the idea of making the data open source, specifically for these kinds of models, like text to speech or speech to text and everything else.

I don't know if the audio was strong enough, so
I'll repeat, I'll summarize: data is the key, we all agree, and you need more and more data. The problem is that often you can identify individuals with very little data, and for what we're trying to do, we're going to try to aggregate a lot of data wherever we're doing it with AI. And the problem is: how can you do that with open source without compromising privacy? We don't have a full answer to that yet. There are some techniques, like pseudonymization, that get you part of the way there; ultimately it's going to be hard, and it depends on what you're doing. When we start with IT data, it's fairly simple, and that's where we're starting out now. When you get to medical data, it's already a big deal. Medical data is out there already, and it's being shared, and people are actually not aware how identifying it is today; it's a problem today that MRIs are actually identifying information, and I don't think they're covered in the regulation yet to that degree. I think there are going to be techniques to improve how you separate the identification from the data, and that works for a lot of the simple use cases. For the harder use cases, we have to get better at secure compute capabilities, things like secure multi-party computation; research like that is where we're going to see it. It's going to get to kind of an escrow model for data, where data is stored but not disseminated to everyone, and you have a secure environment to do something with it. That is in conflict with the concept of open source, right? So there's going to be a compromise somewhere. In some areas you will probably have to decide whether you're going to go with privacy or with transparency, and in a way I think it could be an opt-in model. So for something like a self-driving car, I don't see a reason to have private data there; I think you can probably make it anonymous enough to train the decision systems, or it becomes irrelevant enough, because there's so much data about who was there.
Individually, there might be things in that data that compromise individual privacy, but if it's everyone, I think that washes out a little bit. With medical data it gets more difficult: location data is one thing, but medical data, I think, is a more problematic thing. Well, thank you very much.
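The cluster anomaly detection idea described earlier (train a model on a cluster's normal telemetry, then alert on unusual samples instead of maintaining static rules) can be sketched roughly like this. This is a minimal illustration, not how any RATED team actually implements it: the metric names, values, and the choice of scikit-learn's IsolationForest are all assumptions made for the example.

```python
# Sketch: learn "normal" from historical cluster metrics, flag outliers.
# Metric columns are invented stand-ins: [cpu %, mem %, p99 latency ms].
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulated history of a healthy cluster: 500 samples around typical values.
normal = rng.normal(loc=[40.0, 55.0, 120.0],
                    scale=[5.0, 5.0, 15.0],
                    size=(500, 3))

# Fit the detector on normal behavior only; contamination is the assumed
# fraction of outliers in the training data.
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

spike = np.array([[95.0, 98.0, 900.0]])    # resource exhaustion + latency spike
typical = np.array([[41.0, 54.0, 118.0]])  # ordinary sample

print(model.predict(spike))    # -1 means anomaly: alert on this
print(model.predict(typical))  # +1 means normal: stay quiet
```

The point of the sketch is the contrast the talk draws: no threshold rule is written anywhere; the boundary between normal and anomalous is learned from the data, so it adapts as the training window changes.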
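The pseudonymization technique mentioned in the closing answer can be sketched as below: direct identifiers are replaced with keyed hashes before data is shared, so records stay linkable for analysis without exposing who they belong to. The field names and key are invented for illustration, and, as the answer itself cautions (the Netflix example), this alone does not defeat re-identification on rich datasets.

```python
# Sketch: replace identifying fields with keyed (HMAC) hashes before sharing.
import hashlib
import hmac

# In practice this key would be held only by the data steward and rotated.
SECRET_KEY = b"example-key-held-by-the-data-steward"

def pseudonymize(record, id_fields=("user", "email")):
    """Return a copy of `record` with identifying fields hashed."""
    out = dict(record)
    for field in id_fields:
        if field in out:
            digest = hmac.new(SECRET_KEY, out[field].encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]
    return out

raw = {"user": "alice", "email": "alice@example.com", "p99_latency_ms": 420}
shared = pseudonymize(raw)

print(shared["p99_latency_ms"])       # measurements survive: 420
print(shared["user"] != raw["user"])  # identifiers do not: True
```

Because the hash is deterministic under one key, the same person always maps to the same pseudonym, so shared records can still be joined and aggregated; that determinism is also exactly why, on a rich enough dataset, the remaining attributes can still re-identify someone.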