Which is pretty cool. We have a lot of talks today, very short talks actually, because we wanted to fit in as much as possible from all the applications. Hopefully next year we get the full day again, which would be great. So we've got different topics like graph streaming, graph analytics, graph querying. So all things graph, which is cool. And yeah, with that I want to introduce our first speaker, Vincent from Intel, who will talk about scalable cross-platform graph analytics.

My name is Vincent and I work at Intel on HIVE (the name is not an acronym), a scalable cross-platform graph analytics framework in Python. This is work we are doing jointly with Anaconda. A quick outline: what HIVE is, what the architecture is, the interface, how you extend it, and a quick summary.

So what we're trying to do with HIVE is to build a graph analytics API in Python for graph users. And because you're operating in Python, you probably want to integrate with that whole data science ecosystem. The kind of interoperability we want to provide is that it's easy, for instance, if you have containers like NumPy arrays or data frames, to use them and convert them into graphs. This sort of already exists if you think of NetworkX; the issue with it is that it's pretty slow. So one of the things we want to do is really leverage all the fantastic high-performance graph libraries that are out there. For some people it's their bread and butter: they do their research, they maintain them, they make them work. Once you have these APIs and a set of graph frameworks, you need some kind of glue, something that can orchestrate how you go from this Python API down to actually calling into some implementation. And for this, we're going to use Dask.

We want this to be a community-driven effort: we don't own the data science packages, we don't own the high-performance graph libraries, and we can't just tackle everything by ourselves. So we want to provide the framework and the interfaces, and help people contribute. And finally, we want this to be hardware agnostic, so we can plug in as many hardware vendors and as much hardware as possible; we do nothing in the architecture that would lock you into some ecosystem. This is in development right now. Think of this as a kind of teaser of things to come; it's going to be open-sourced sometime in 2020, as a version 0.something, to get feedback from the community.

So let's talk about the basics. If you want to express a graph algorithm, you're going to need data in some representation, and you're going to express your graph algorithm using some paradigm. Think of this: you can express a graph algorithm using an algebra, or you can use a vertex-centric, bulk-synchronous kind of model where you update from your neighbors, and so on. Someone provides you with an API that embodies that paradigm, and then you can implement your graph algorithm. And finally you bring your data, so all of these things are interrelated.

This is an example, just a list of graph frameworks. They're in no way affiliated with this effort; I list them because people are doing great work, and just to illustrate my point. If I take, for instance, shortest path, I'm going to write my graph algorithm using a subset of these graph representations and a certain way of expressing things, and then target a certain back end.
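As an aside for the reader: the HIVE API itself is not shown in the talk, but the kind of container-to-graph interoperability described above already looks like this in NetworkX, the pure-Python baseline the speaker calls slow; the column names here are made up.

    import networkx as nx
    import pandas as pd

    # An edge list sitting in a plain pandas DataFrame, the kind of
    # container a data scientist ends up with after preprocessing.
    edges = pd.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 2, 3]})

    # NetworkX can already turn the container into a graph directly;
    # HIVE aims to keep this convenience while dispatching the actual
    # work to high-performance back ends.
    g = nx.from_pandas_edgelist(edges, source="src", target="dst")
    print(nx.shortest_path(g, source=0, target=3))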
And this is basically just because people are focusing on something; it's not a critique, that's what I mean. But if I switch to Galois, then I might have a different set of things I'm going to use, and GraphIt, and so on. So the question is: if I'm a data scientist sitting in the Python ecosystem and I want to leverage these, how do I do that? If you have a Python interface to one of them, you can use it, but you're in a small ecosystem: you maybe have to format your data so that the graph algorithm can use it. And if you're building a workflow, maybe there's something in one of these frameworks that's not available in the other, and now you have to deal with how to go from one to the other. So this is where we think the HIVE API can contribute.

The goal is to have a high-level set of graph APIs. Think of this as: you want to do community detection, you want to do some graph pattern matching, so you have some graph isomorphism. Very high level, so that if you're a data scientist, you can just take it off the shelf. One thing we want to explore is combining this kind of graph API with Numba. If you're not familiar with Numba, it basically allows you to write a function in Python, annotate it, and have it compiled. One of the simple things we want to try, for instance, is filtering: you stay on the Python side and write a predicate, and then you have the runtime apply that function over the edges or the vertices, and you get a subset out of it (there's a small sketch of this below). And finally, we want to make the interoperability with the data science ecosystem easy. Think of this as: how do I convert across containers? The most popular ones we can support by default, so that you don't have to.

Once you have these APIs, you need glue between the APIs and the frameworks. For this we're going to use the Dask runtime. Think of it as doing lazy evaluation on the APIs: you dynamically build this Dask graph of the things you want done (also sketched below). Now you have the user expressing what he wants to do, and you also have a bunch of algorithm implementations, so you need a broker to do the orchestration. This is both how I schedule the computation on resources, and also this runtime has to handle all the data movement. And finally, we want this to be extensible, so you can just jump in and add your own things.

So let's have a look at the framework interfaces. The centerpiece is this Dask runtime; it's going to do the orchestration. Then you have the user APIs, so again, a catalog of all the things you can do on a graph. And the graph is kind of an opaque data type; it's just a graph. Then you can define your data models: for this abstract type that's a graph, I want a concrete type, which is maybe a data frame that represents an adjacency matrix and lives in the CPU memory subsystem. And I can also have a CSR data type, anything I can imagine. Then you define transformers: how do I go from that data frame on the CPU to a CSR on the CPU, and so on. Some of these can be part of the common library, and some of them are just plugin components I can go and add. And finally you have the graph algorithm back ends. So if you look here, I'm saying I'm exposing Louvain; it's running on this XBLAS framework that I just made up, it's running on the CPU, and it's taking a CSR. Taking all of these together, you actually just build a graph of dependencies.
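To make the Numba idea above concrete: the predicate below is real Numba, while the filtering loop around it just emulates what a runtime could do natively over edges or vertices; the graph-side names are invented.

    import numpy as np
    from numba import njit

    # A user-written predicate, compiled to machine code by Numba, so a
    # back end could evaluate it at native speed instead of in the
    # Python interpreter.
    @njit
    def keep(degree):
        return degree > 2

    # Stand-in for per-vertex data the runtime would hold.
    degrees = np.array([1, 4, 3, 0, 5])

    # Emulate the runtime applying the predicate to select a vertex subset.
    selected = [v for v in range(len(degrees)) if keep(degrees[v])]
    print(selected)  # [1, 2, 4]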
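And a minimal sketch of the lazy-evaluation idea, using the real dask.delayed decorator; build_graph and pagerank are hypothetical stand-ins for HIVE operations, not actual API calls.

    import dask

    # Each decorated call only records a node in a Dask task graph;
    # nothing executes until compute(), which is the point where an
    # orchestrator can plan placement and data movement.
    @dask.delayed
    def build_graph(edges):
        return {"edges": edges}

    @dask.delayed
    def pagerank(graph):
        # Placeholder result standing in for a real back-end call.
        return {v: 0.25 for v in range(4)}

    result = pagerank(build_graph([(0, 1), (1, 2), (2, 3)]))
    print(result.compute())  # runs only here, as one scheduled graph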
So these user APIs have implementations as graph algorithms. They use some data models. The edges between the models are the transformers. Dask can now take this graph and reason about what the user has expressed, so it's really doing graph analytics with the help of graphs.

On the left-hand side you can see a workflow as a task graph. Here I start outside of the HIVE ecosystem: loading data, pre-processing, whatever your workflow is. And then once you cross in, you can say: I want to create a graph, I want to apply an operation, and so on. If you think about that dependency graph we had before, here I might have a data frame coming in. So the system goes: okay, I have a data frame data model; what do I need to convert it into? What is the set of graph operation implementations that I have? Solving this, you come up with basically a schedule. And finally, at some point, you just exit and go back to your regular workflow.

Another cool thing here is the data transformation. Just by transitivity, you might actually support some things that you haven't provisioned for. Here, for instance, I start with a file format; I know how to make it a table, I know how to make that a first graph format, and that one a last one (a toy sketch of this chaining follows below). It would probably run horribly, but what I find pretty cool about this is that now you can think about building some performance monitoring infrastructure. You could look at this graph and figure out: oh, you know, you keep taking this path that is very long and expensive. So the system could tell you: hey, if you actually implement a conversion directly from that file format to that graph format, then you're saving time. Same thing on the task graph: for different graph inputs, you could try to characterize them, and then you could see, when you run on CPU and when you run on GPU with these different inputs, these are the results you get. So you could go into some machine learning framework and basically learn from that.

In terms of extensibility, this is how you would support new hardware. As you see, there are no functional changes to the user API; basically the only thing you might do is add annotations to say "I really want to use that new back end you've provided", but you don't change your code, so it's transparent for the user. And then you only need to implement these pieces: I need to go and describe the data model, the CSR on that XPU I made up, the transformers again, and finally my implementation. Here this is an example where I take a framework and just extend it, but what I find pretty interesting is that you could just take any code off the shelf from the internet, using pthreads or whatever, wrap it in a Python interface, and then you have it advertised in the framework. So it's a great way to basically leverage what's out there. And once you've done that, it's part of the HIVE toolbox: it makes this dependency graph we've seen larger, and then you can use it.

One thing that's pretty cool also is that it allows you to mix hardware architectures: if you have different graph operations, maybe you do one on the CPU and the next one on the GPU. It's also a great way to get portability: if today I'm running on a box that has a GPU, I'm using the GPU algorithm, but if tomorrow I'm not, as long as I have an implementation that covers the CPU, my code will still run.
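A toy sketch of that transitive conversion idea, with made-up model names and string-building stand-ins for real conversion functions so it stays self-contained: the runtime searches the graph of registered transformers for a path between two data models, even one nobody provisioned directly.

    from collections import deque

    # Invented data models ("csv" -> "dataframe" -> "edgelist" -> "csr").
    transformers = {
        ("csv", "dataframe"): lambda x: f"df({x})",
        ("dataframe", "edgelist"): lambda x: f"edges({x})",
        ("edgelist", "csr"): lambda x: f"csr({x})",
    }

    def convert(value, src, dst):
        # Breadth-first search over data models finds the shortest chain
        # of registered transformers between two representations.
        frontier, seen = deque([(src, value)]), {src}
        while frontier:
            model, v = frontier.popleft()
            if model == dst:
                return v
            for (a, b), f in transformers.items():
                if a == model and b not in seen:
                    seen.add(b)
                    frontier.append((b, f(v)))
        raise ValueError(f"no path from {src} to {dst}")

    print(convert("file.csv", "csv", "csr"))  # csr(edges(df(file.csv)))

Logging which paths get taken, as the speaker suggests, is what would let the system recommend implementing the expensive long chains as direct conversions.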
In terms of extensibility for the user APIs: if you want a new user API, it's the same thing, you just have a subset of plugins you need to provide. Here I'm extending the API with triangle counting on the graph, and I just have to provide at least one implementation (here I'm still using the XBLAS example), and then again, it's just going to be part of the API of the runtime toolbox. The goal is that people can just plug in their algorithm and leverage all the work that's been done on the back ends (a sketch of this registration idea follows below).

As a summary, from the stakeholders' point of view: if you're a data scientist, what you gain is a unified API for graph analytics; you have the Python API that allows you to basically plug your graph work into a larger workflow; you have this state-of-the-art back end always available right there, maintained and optimized; you get transparent orchestration, so you don't have to worry about what the actual underlying hardware is; and you get increased workflow portability. If you're a graph framework developer, we hope this would help structure the way you present a graph algorithm in a Python form, so hopefully this would bring the community together. It's also an increased user base: because it's much easier to plug your framework into the Python ecosystem, hopefully that brings you users. And what I really hope we can do is provide performance feedback to the people who are developing, by logging and sharing those data. And finally, if you're a researcher, in the grand scheme of things, we want it to be easy to integrate into your workflows because of the Python ecosystem, and to be easily extensible, so that people can do research at the orchestration level or on the data models, and basically have this performance monitoring and optimization to improve the code.

So that was my presentation about how to bridge between graphs and data science, and I will take any questions.
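To make that plugin mechanism concrete, here is a toy registry under invented names (the real HIVE registration API is not shown in the talk): a new user-facing operation like triangle counting is declared once, and each back end registers a concrete implementation for the data model and hardware it supports.

    # Toy plugin registry: (operation, data model, hardware) -> function.
    REGISTRY = {}

    def register(op, data_model, hardware):
        def wrap(fn):
            REGISTRY[(op, data_model, hardware)] = fn
            return fn
        return wrap

    # A new user API (triangle counting) with one CPU implementation
    # over a dict-of-sets adjacency; all names here are invented.
    @register("triangle_count", data_model="adjacency_sets", hardware="cpu")
    def triangle_count_cpu(adj):
        # Count each triangle once via ordered triples u < v < w.
        return sum(
            1
            for u in adj
            for v in adj[u] if v > u
            for w in adj[v] if w > v and w in adj[u]
        )

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    fn = REGISTRY[("triangle_count", "adjacency_sets", "cpu")]
    print(fn(adj))  # 1 triangle: (0, 1, 2)

In the talk's design, the orchestrator would pick among the registered entries, and insert transformers as needed, rather than the direct dictionary lookup shown here.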
[Audience question, partly inaudible, about avoiding data movement, for example going back and forth from the CPU.] So I think the question is basically whether you could optimize the placement of the tasks so that you don't go back and forth. I think you can definitely do that. Part of the work would also be to instrument the system, so that you learn what the performance characteristics of running something are; when you have this data, you can make decisions. It could very well be that in some cases, even though you ship data back and forth, your accelerator is so much faster that it's worth it. So if you have the data, then you can make a reasoned decision. I think it happens at the Dask level, basically: whatever the orchestration plugin is, that's where you'd plug in your heuristics. I don't know exactly how; most likely these would be policies, but you could also just inject, for some reason, what you want to do.

[Audience question:] One of the conditions that you had was that you could have any hardware, and then run one algorithm on the CPU and another one on the GPU. Who do you expect, in that scenario, to handle the data movement going from CPU to GPU? So the question is who I expect would be doing the data movement between CPUs and GPUs. I think it goes a little bit into how this effort would fit into the ecosystem, because with some frameworks you have one entry point and then the framework internally decides whether it runs on CPUs or GPUs. Ideally, what we want is, as we build that task graph, to have the control, basically to have the most flexibility possible. And for that, you kind of have to be able to tell these frameworks: it's not just that I'm doing Louvain on the graph, it's that I'm doing Louvain on the GPU, or Louvain on the CPU. And at that point, with these data transformers (it goes back to the earlier question), you have the whole picture, and you can say: okay, it would cost that much to do the transfer, and you can actually do something about it, because it's not hidden inside the framework. So that would be the…