Thank you so very much for joining the session today. This talk is A Tiger's Guide to Dungeons and Dragons. I have no idea what expectations you have, and I'm really surprised that so many of you showed up, and pretty glad, actually.

Let's get started with a brief introduction of myself. This slide is what I normally call a short summary of myself in logos. I'm a researcher, and I worked in academia for many years. Now I am in DevRel at Anaconda, and I'm also a fellow of the Software Sustainability Institute. That's the serious part of myself. Then there is the nerdy, geeky part of myself. I am into Python quite a lot. I actually help organize many conferences: EuroSciPy, which by the way is in a couple of weeks, EuroPython, PyCon Italy, PyData events. I also play Magic: The Gathering and Dungeons and Dragons. So this is who I am. I couldn't wear my Darth Vader mask to come here, but I have my bag to compensate. And that's it.

So let's get started, and let's start by saying that we have a D&D problem. We are about to embark on a mission. We are a mage, and we have to fight this green dragon, hidden in this ancient castle. The mission is that we have to find the dragon and slay the dragon, the poor dragon. But we've been empowered with an extra set of skills. We know that we have to cross this path and go through this maze. We have the map. We essentially have to find a way to reach the dragon, which is hiding in the forest, and cast a fireball. Sorry, this is very nerdy language, I know. And in case you haven't noticed, this map is essentially Prague Castle. So we did indeed have a D&D problem in Prague.

Going back to being serious, what we have to do is essentially find a way very efficiently. This is the extra skill we have gained from this ancient potion, Pythonic.
We want to go through this maze in a very efficient way, find a way into the forest, maybe the shortest way to get to the dragon, and cast a spell. This area here is actually the forest, the Royal Garden, I gather from the map. We want to increase the power of our magic, so we want to find exactly the spot from which to cast the spell to increase its effect.

In other terms, talking algorithms now, what we probably want is something like a shortest path to reach the dragon. This is technically referred to as SSSP, single-source shortest path. We probably also want to do some traversal of the maze, to maximize the fireball's effect. This has a very confusing name: historically it's called breadth-first search, even though sometimes it's just moving through the graph, so there's not necessarily any search. And we can think of many, many other algorithms. This is just an excuse to talk about graphs.

The following slides are essentially taken from this wonderful talk by Guy Royse, which I absolutely recommend. Please take a picture, this is really fantastic. The talk is called D&D in graph databases. He's talking about graph databases. I am not, but I have borrowed some of his slides. And in case, when I said graph, this is what you had in mind, you still have time to go to another talk, because this is not what we're going to talk about.

So, talking about graphs: graphs are made of vertices, sometimes called nodes, and these can be connected by edges or arcs. The terminology changes in the literature depending on whether you consider these connections directed or not. I'll get to that. You can have isolated nodes or connected nodes. If every single node is reachable, the graph is said to be connected.
This subgraph here is also referred to as a fully connected graph, because each node is directly linked to every other node. As I was saying, a graph can be undirected or directed, meaning that you have a direction to follow when moving from one node to another.

A couple more things to say. Sometimes we talk about the degree of a node. This is a property of nodes. If the graph is directed, like in this case, you have an inbound degree, the number of edges coming into the node, which in this case is 1. This node has a degree of 3, meaning in this case an outbound degree of 2 and an inbound degree of 1. The outbound degree, of course, is how many edges start from the node. And just to be clear, when you're dealing with graphs, anything can be connected to anything. So you can always have situations like this, and situations can be even more complicated.

Anyway, enough with theory, I promise, also because we still have our D&D problem. We were saying that what we want to do is look for a shortest path or do a breadth-first search. This is just an excuse to talk about these two algorithms specifically.

Now let's talk briefly about graph abstractions in Python. To be clear, this is a very interesting problem, at least by my definition of interesting. There's a lot of theory behind it, and it's certainly not a trivial problem, for many reasons. Pythonically, there is an old page of documentation with a Python pattern for implementing graphs. It's in the legacy section of the documentation, but it's a very interesting read. In the following slides, we'll try to reason together about what we can do in Python to implement a graph, and please feel free to comment on what we say. We can start with adjacency lists. We have a set of nodes, from 0 to 7 in this case.
We then represent our graph as a list of lists, where each inner list holds the nodes that a given node is connected to. This is pretty easy. We can do slightly better in case every connection has a weight, or a value, whatever you want to call it. We can have an adjacency dictionary. It's a list of dictionaries: the entry for each node is a dictionary containing the neighboring node as key and the weight as value.

We can have an even more flexible approach, still based on the basic Python data structures. The previous implementations considered nodes as numbers. We can have labels on each node, so they're not necessarily numbers. This is a more general approach: a dictionary in which each key is the label of a node and the value is the list of corresponding reachable nodes. Even better, we can use the same abstraction, but this time with a set.

What I'm trying to say here is that the implementation, and these are still the very basic ideas we can come up with, can vary depending on what you want to do with the graph. This is the message I'm trying to get across. When you have a set, for example, and you're adding a node, you're absolutely sure that you won't have repetitions. That's a feature you get immediately from the data structure. This is the kind of thing we're trying to think about here.

Another quite popular abstraction for a graph is the adjacency matrix. In this case, we don't bother storing just the information related to every single node. We store everything. And in case there's no connection, we can have a zero, meaning there's no connection from this node to this other node.
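The plain-Python representations described so far can be sketched as follows. This is only an illustration: the four-node graph, its weights, and the castle-themed labels are made up for the example and do not come from the slides.

```python
# Illustrative undirected graph with edges 0-1, 0-2, 1-2, 2-3 (made-up data).

# 1. Adjacency list: position i holds the neighbours of node i.
adj_list = [[1, 2], [0, 2], [0, 1, 3], [2]]

# 2. Adjacency dictionaries: one dict per node, mapping neighbour -> weight.
adj_dicts = [
    {1: 4.0, 2: 1.0},
    {0: 4.0, 2: 2.0},
    {0: 1.0, 1: 2.0, 3: 5.0},
    {2: 5.0},
]

# 3. Labelled nodes: a dict from label to the *set* of reachable labels.
adj_sets = {
    "gate": {"hall", "tower"},
    "hall": {"gate", "tower"},
    "tower": {"gate", "hall", "garden"},
    "garden": {"tower"},
}
# Adding an existing neighbour again is a no-op: sets never hold duplicates.
adj_sets["gate"].add("hall")

# 4. Adjacency matrix: store every pair, with 0.0 meaning "no connection".
n = len(adj_dicts)
adj_matrix = [[0.0] * n for _ in range(n)]
for u, neighbours in enumerate(adj_dicts):
    for v, weight in neighbours.items():
        adj_matrix[u][v] = weight
```

Which one is appropriate depends on the operations needed: the set-based version makes membership tests and duplicate-free insertion cheap, while the matrix gives constant-time edge lookup at the cost of storing every absent edge as well.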
If we have weights, for example, or any information on the connection, we can put numbers directly in the adjacency matrix. So rather than just zero and one to indicate that there is or isn't a connection, we can store actual numbers. And in case the graph is undirected, this matrix is symmetric, so we can store just the triangular matrix and don't need to store everything. In case you're not entirely familiar with this representation, this is a simple graph. We have multiple nodes, and we have weights on the edges. And yes, the matrix will be symmetric. The rows of this matrix represent the outgoing edges, and the columns represent the inbound edges.

It has to be said that, apart from standard Python, we do have alternatives, and you probably already know where I'm going. A graph can be represented as a sparse adjacency matrix, and indeed SciPy has a sparse package. In this particular case, it's even clearer that, depending on what you need to do with your representation, you have to choose carefully which sparse implementation to use. These are all the sparse matrix classes in SciPy. You have multiple formats, multiple strategies to implement sparsity. In fact, the documentation says that if you need to construct a matrix efficiently, you will probably want to use list of lists, which is LIL, by the way. In case you missed it, this is exactly what we were talking about before: it's the list of lists, nothing different. Or you can use the coordinate format, COO. But using NumPy directly on these formats is strongly discouraged; you have to convert them first. In case you need to perform manipulations, such as inversions or algebraic operations, there are other formats: compressed sparse column, CSC, and compressed sparse row, CSR. Just to be clear, this is what scikit-learn supports internally.
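As a small sketch of the "build in one format, compute in another" idea, the snippet below constructs a weighted undirected graph in LIL and converts it to CSR and COO. The edge list is invented for illustration; the calls used (`lil_matrix`, `tocsr`, `tocoo`) are standard `scipy.sparse` API.

```python
import numpy as np
from scipy import sparse

# Made-up weighted edges (u, v, weight) of a small undirected graph.
edges = [(0, 1, 4.0), (0, 2, 1.0), (1, 2, 2.0), (2, 3, 5.0)]

# LIL (list of lists) is convenient for incremental construction.
A = sparse.lil_matrix((4, 4), dtype=float)
for u, v, w in edges:
    A[u, v] = w
    A[v, u] = w  # undirected graph: keep the matrix symmetric

# CSR is the format to use for arithmetic such as matrix-vector products.
A_csr = A.tocsr()
weighted_degree = A_csr @ np.ones(4)  # row sums = sum of incident edge weights

# COO (coordinate format) exposes the stored entries as (row, col, data).
A_coo = A_csr.tocoo()
```

Only the eight stored entries occupy memory here, whereas a dense 4x4 array would hold all sixteen; the gap grows quickly with graph size when most node pairs are not connected.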
Whenever you have sparse data, you can pass scikit-learn matrices in the compressed sparse row (CSR) or column (CSC) format, because these are the formats on which you can use arithmetic operations, as you would in machine learning. And if you have to convert this format into another one, like COO, for example, it takes linear time. So it really depends on what you have to do. No format is always perfect, and you have to choose carefully. When you need to do both things, you have to convert from one format to the other.

And of course, probably the best solution we have in Python is using proper data abstractions, and NetworkX is the package we want to use when we work with graphs. This is the implementation of Graph in NetworkX. There's also DiGraph, which is a directed graph, MultiGraph, MultiDiGraph, and so on. So NetworkX already provides abstractions in Python to work with graphs. Are you all familiar with NetworkX? Fantastic. Judging from the audience, about half of you. When you're dealing with a graph, you have nodes and edges directly available, and you have methods to add nodes and edges. It's very easy to work with, and it's very Pythonic. It's a fantastic abstraction: you don't have to worry about the internals, NetworkX works everything out for you.

What are the pros of NetworkX? It's a reference implementation in Python. It's very well known and popular if you work with graphs. Many algorithms are already provided, it's well documented, and it's nice to read. And it's great for small graphs. What are the cons, though? It's quite slow. It's very slow. In particular, if you compare performance here, and sorry, this plot is probably not entirely readable: the green line is SciPy sparse, the blue one is NetworkX, and the red one is NumPy. SciPy sparse is certainly the fastest as the size of the graph increases. But NetworkX is certainly not the slowest.
NetworkX is slower than NumPy up to some point, after which NumPy blows up, because NumPy doesn't have any notion of sparsity: it stores everything in memory. That's where the difference comes from. Nonetheless, NetworkX is certainly slow, and you can really only use it for small graphs.

So what if we could still use NetworkX and keep all its advantages? NetworkX, again, provides abstractions that are very easy to use from Python, and algorithms already implemented, so we don't have to reinvent any wheel. Could we retain all the pros of NetworkX while having faster sparse algorithms? In particular, what we're looking for is a foundational sparse graph library that is fast, flexible, scalable, and runs on any architecture.

To be completely clear, SciPy also provides a package within the sparse package, which is csgraph. The sparse matrices I told you about earlier come with algorithms already, so they're not just data structures. You can also do graph operations on them. So this notion of using sparse matrices for graph problems is indeed very practical. You have connected components, shortest path, breadth-first order, essentially the same algorithms we're talking about here.

The problem is, and I'll tell you immediately, SciPy sparse is not that library. It's not the library we're looking for. SciPy sparse is still too slow for what we're talking about. It's single-threaded, first off, and it's not expressive enough. We don't have masking operations that work efficiently. We cannot change the operators in matrix multiplication, and I'll come back to what I mean by that in a second. And it's too low-level. You have no integration with NetworkX; there's no way to have the two talking to each other. It also involves what is technically called format gymnastics: you have to juggle different formats depending on what you're doing, CSR rather than CSC.
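For reference, the csgraph algorithms mentioned above (connected components, shortest path, breadth-first order) can be called directly on a sparse matrix. The five-node graph below is invented for the example; the functions are part of `scipy.sparse.csgraph`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import (
    breadth_first_order,
    connected_components,
    dijkstra,
)

# Made-up weighted undirected graph; node 4 is isolated on purpose.
dense = np.array(
    [
        [0, 4, 1, 0, 0],
        [4, 0, 2, 0, 0],
        [1, 2, 0, 5, 0],
        [0, 0, 5, 0, 0],
        [0, 0, 0, 0, 0],
    ],
    dtype=float,
)
graph = csr_matrix(dense)  # zeros are simply not stored

# How many connected pieces, and which piece each node belongs to.
n_components, labels = connected_components(graph, directed=False)

# Single-source shortest path distances from node 0 (inf = unreachable).
dist = dijkstra(graph, directed=False, indices=0)

# Breadth-first traversal order starting from node 0.
order, predecessors = breadth_first_order(graph, i_start=0, directed=False)
```

Here `dist` gives node 1 a distance of 3.0 (through node 2) rather than the direct edge weight 4.0, and the isolated node 4 never appears in `order`.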
And, last but not least, SciPy sparse is not yet hardware or implementation agnostic.

OK, so let's think about this. Graph problems can indeed be expressed as sparse linear algebra problems. And I have probably convinced you already that sparse matrices are actually the way to go for representing a graph, especially when it's sparse. With this in mind, let me introduce a fairly new project I came across recently. It's called GraphBLAS. GraphBLAS is exactly what you'd imagine: it's BLAS for graphs. BLAS stands for Basic Linear Algebra Subprograms, so GraphBLAS is indeed a version of BLAS for sparse matrices. In this case, graphs are represented, of course, as sparse matrices, and matrix multiplication is the foundational tool for graph operations.

In particular, everything is expressed as matrix multiplication, with the caveat that we can customize the operators. I'm not going to go into details here, though I'd be happy to after the talk. But, for example, whenever you have to express single-source shortest path as matrix multiplication, you can use min and plus as the two operators, in place of the usual plus and times. It's very convoluted. I'm not going to go into more detail than this, but I'm happy to later on. Just to give you an example: when you have to process incoming edges, you can use sparse matrix times sparse vector. When you have to process outgoing edges, it would be sparse vector times sparse matrix. And this is exactly what you get.

To be entirely clear, GraphBLAS is not a library. It is a standard, a specification similar to BLAS. You have the hardware architecture here, then BLAS or GraphBLAS, and then you have LAGraph or graph analytics apps building on top. This is the full stack we're talking about. So how do we transition from GraphBLAS to NetworkX? It is very possible. GraphBLAS is the math specification.
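To make the min-plus idea a little more concrete, here is a pure-Python sketch, not GraphBLAS code: a sparse vector-times-matrix product in which `min` replaces addition and `+` replaces multiplication, iterated Bellman-Ford style to get single-source shortest paths. The function names `min_plus_vxm` and `sssp` and the example graph are hypothetical, chosen only for illustration.

```python
import math


def min_plus_vxm(d, A):
    """Sparse vector-matrix product over the (min, +) semiring.

    out[j] = min over i of (d[i] + A[i][j]); only reachable entries are stored.
    """
    out = {}
    for i, d_i in d.items():
        for j, w in A.get(i, {}).items():  # outgoing edges of node i
            candidate = d_i + w
            if candidate < out.get(j, math.inf):
                out[j] = candidate
    return out


def sssp(A, source, n_nodes):
    """Single-source shortest paths by repeated min-plus products."""
    dist = {source: 0.0}
    for _ in range(n_nodes - 1):
        step = min_plus_vxm(dist, A)
        updated = dict(dist)
        for j, d_j in step.items():
            if d_j < updated.get(j, math.inf):
                updated[j] = d_j
        if updated == dist:  # nothing improved: distances have converged
            break
        dist = updated
    return dist


# Made-up directed, weighted graph stored as a sparse dict-of-dicts matrix.
A = {0: {1: 4.0, 2: 1.0}, 1: {3: 1.0}, 2: {1: 2.0, 3: 5.0}}
distances = sssp(A, source=0, n_nodes=4)
```

Each call to `min_plus_vxm` relaxes all outgoing edges of the nodes reached so far, which is the "sparse vector times sparse matrix" pattern from the talk; a GraphBLAS implementation expresses the same step as one semiring matrix product executed by tuned kernels.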
Then there is the C API of GraphBLAS, and then there is the SuiteSparse GraphBLAS C implementation. To be clear, this implementation was created, as far as I know, as a library to support sparse matrix operations in MATLAB. This is where it comes from. The reason I'm presenting this to the EuroPython audience is that you can use this library from Python as well. There are Python GraphBLAS and GraphBLAS Algorithms, which ultimately connect to NetworkX via dispatching. I'll talk about that in a second.

In more detail, having this NetworkX connection, essentially through GraphBLAS Algorithms and Python GraphBLAS, we can even target multiple architectures. You have abstractions working on top, and depending on your architecture, you can have CPU, you can have Dask arrays, you can have GPU. You can also plug in cuGraph, which goes directly to GPU and Dask. NetworkX gets this advantage through the dispatching mechanism.

Looking inside the stack: GraphBLAS is pseudocode, it's just the math specification. The C specification is at this level of detail. And there is an implementation of this specification called SuiteSparse GraphBLAS. As I was saying, MATLAB uses it for sparse matrix multiplication. It's based on OpenMP for parallel operations, and every single format we mentioned before is supported in this implementation.

Python GraphBLAS is an open source project. You can install it via pip, or via conda from conda-forge. It is jointly developed by some of my colleagues, Jim Kitchen at Anaconda and Erik Welch at NVIDIA. It essentially provides a Pythonic implementation of the C specification. This is, in some sense, the low-level Python layer. GraphBLAS Algorithms provides the general algorithms, similar to the NetworkX algorithms. These are the algorithms that are implemented to plug into NetworkX.
More than 80 algorithms are currently implemented. The project is open source, so please open a pull request if there are algorithms that you can't find there and want to see implemented. They will be more than happy about it.

And NetworkX, finally, is on top. You call it as you normally would on your graph, and thanks to the dispatching mechanism, it goes down here using GraphBLAS. I'll show you an example in a second. This is how the dispatching works. Some of the NetworkX functions, for example shortest path, have a decorator, which is nx.dispatch. If a function has this decorator in place, it means that dispatching is available for that algorithm, and NetworkX automatically figures out which library should be called underneath.

Just to give you a concrete example, and sorry, there is a typo here: we're generating a random graph with a very low probability of connectivity, so it's very sparse, with 10,000 nodes. Oh, sorry, this has been updated, I should have replaced this slide. It's 10,000 nodes and more edges. Sorry about that. If we try to list all the shortest paths in that graph the NetworkX way, it takes 32 seconds. We have to slightly adjust the code to use GraphBLAS Algorithms. We just have to import it here. We convert the NetworkX graph to a GraphBLAS Algorithms graph through a utility function included in GraphBLAS Algorithms, and we pass the GraphBLAS graph directly to the NetworkX function. So nothing changes but the input graph. The conversion takes more or less 8,400 milliseconds on my laptop and the whole execution here takes 3.4 seconds, 10 times faster.

There's actually a benchmark varying the sizes, the graphs, and the algorithms, using NetworkX as a baseline. It shows GraphBLAS versus NetworkX, and also the speedup against iGraph.
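The "two extra lines" pattern described above might look roughly like the sketch below. It is not the code from the slide: the graph size, the seed, and the choice of `nx.k_truss` are assumptions made for illustration, and the optional branch assumes that the `graphblas-algorithms` package is installed, that it exposes `ga.Graph.from_networkx`, and that it is compatible with the installed NetworkX version.

```python
import networkx as nx

# Small random graph (made-up size and seed) so the example runs quickly.
G = nx.erdos_renyi_graph(60, 0.2, seed=1)

# Plain NetworkX call: runs the pure-Python reference implementation.
truss_nx = nx.k_truss(G, 3)

# Optional: dispatch the same call to GraphBLAS (assumption: the
# graphblas-algorithms package is installed and version-compatible).
try:
    import graphblas_algorithms as ga
except ImportError:
    ga = None

if ga is not None:
    G_gb = ga.Graph.from_networkx(G)  # extra line 1: convert the graph
    truss_gb = nx.k_truss(G_gb, 3)    # extra line 2: same NetworkX call
```

The call site stays a NetworkX function in both cases; only the type of the graph object changes, and that type is what tells NetworkX which implementation to run.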
If you work with graphs, iGraph is probably one of the most popular libraries available nowadays for network analysis.

To briefly recap what we've been talking about: the GraphBLAS API specification is at the bottom of this stack. SuiteSparse GraphBLAS is a C plus OpenMP implementation. Python SuiteSparse GraphBLAS is what is in the middle, CFFI plus Cython. Python GraphBLAS is pure Python. GraphBLAS Algorithms is pure Python and provides the connection to NetworkX. Multiple formats are supported, there are highly tuned matrix multiplication kernels, you have no import or copy lag, and GPU support is forthcoming.

So, key messages. D&D is very cool. NetworkX is amazing, even if it's slow. GraphBLAS uses sparse linear algebra to solve graph problems, and it's mathematically elegant and blazing fast. If you're interested in all the details, I would be happy to talk about it. NetworkX can become the graph API for Python, similar to what NumPy does, because at the end of the day we have a fantastic API that runs very slowly, so essentially NetworkX says: I can delegate the workload to whoever goes faster than I do. And that's what I had. Thank you very much.

Thanks for the amazing talk, Valerio. If you want to ask any questions, you can use the mic. I know lunch is coming. Go ahead.

Hello. Hi. So, first question: what is your favorite D&D class, and why is it a wizard?

I know. Yeah, I've always been a wizard.

Yeah, that makes sense.

Thanks for the question.

Yeah, but I also have another question. In the example you showcased, you created a dense graph in NetworkX and then converted it to sparse. Is there any way for GraphBLAS, on the Python layer, to load a sparse graph directly from disk or whatever?

So you mean loading the graph in NetworkX from disk?

Not in NetworkX, because NetworkX deals in the dense format by default, but in the GraphBLAS Python library.

Oh, yes. Yeah, yeah.
The example I had was showcasing how you can use NetworkX algorithms directly. This is why I went through NetworkX, but you can use GraphBLAS entirely through its own API. What was happening here? I was just trying to think whether I had examples. I have a few backup slides, but no, not what you had in mind. So in this case, I'm just using this GraphBLAS Algorithms utility for converting from NetworkX, and you can also convert the other way, from GraphBLAS to NetworkX, in case you need to. And this is GraphBLAS Algorithms. Python GraphBLAS, which is what is underneath in the stack, has its own data types, matrices and vectors, so you can use them directly to represent the graph. You can do that. It's just lower level. So it's probably preferable to go through NetworkX because you already have the algorithms, but it's not necessarily the only way. If that's what you meant.

Yeah. Okay. I meant, is it possible to deal entirely in sparse graphs, without having a dense graph as an intermediate step? But yeah.

Yes, exactly. Python GraphBLAS only speaks sparse graphs. And that's a very good question, because if you're not dealing with a sparse graph, GraphBLAS at the end of the day is no different in performance from NetworkX. When you're using this sort of bridge, it's the same performance. You start appreciating the performance when you have sparsity in your graph. Yes, that's a very good question.

One question... Thank you. Are there some cases where iGraph is still better than this one?

We can look at the performance, actually. So, for example, PageRank works better on this particular network. I have no specific clues about this benchmark, but these datasets and these results are public, and the core devs are very lovely people. So if you're very interested in that, there's a whole section.
I seem to remember there's a whole issue in the GraphBLAS Algorithms repository talking about this benchmark, so I can probably send you the link. I should have put the link there. Maybe I can add it when I share the slides.

Thank you.

No worries. Thank you very much.

Hello. Hi. One question on the need for dispatching to NetworkX.

Sure.

I mean, if the algorithms are getting re-implemented, and 80-plus of them are re-implemented in the Python package, then what advantage does it bring us to dispatch them to NetworkX again?

You don't dispatch back to NetworkX. You dispatch from NetworkX. This is what happens. Essentially, GraphBLAS Algorithms has a similar interface, for example single-source shortest path, like NetworkX's. And what you do, from the top level, is just use NetworkX, and this gets dispatched internally to GraphBLAS, because the type of G here is a GraphBLAS Algorithms... sorry, yes, it's a GraphBLAS graph data structure. So NetworkX knows exactly how it works. I had a slide on how the dispatching works internally, and I forgot to add it. It's difficult to explain with that example, but I promise you, these two are interconnected automatically. You don't have to do anything. What changes in the internal implementation is that it uses Python GraphBLAS rather than the NetworkX implementation.

And this is the kind of code we're talking about. This is Python GraphBLAS. For example, for single-source shortest path, even if it's barely readable, sorry about that, this is the whole implementation of the operation. It's very compact. It should be easier to read, but it's very compact, so you really have to understand what's going on. That's why GraphBLAS Algorithms exists on top. This works because of how GraphBLAS works, so using the min-plus semiring and blah, blah, blah. This is very nitty-gritty detail. But this is Python GraphBLAS code, extracted from GraphBLAS Algorithms.
I have one quick follow-up question. Say I'm an end user and I need algorithms A, B, and C on a graph. I can just stop at the GraphBLAS Python layer, right?

You can. Yes, you totally can. The thing is, you can go back and forth with NetworkX. And if your program already runs on NetworkX, you can plug in GraphBLAS with just two lines of code. This is probably a good way to explain it. So what you said is totally true: you can use GraphBLAS Algorithms on their own without going through NetworkX at all. But if you were going through NetworkX, you can speed up your code by just adding these two lines.

Thanks.

Thank you.

I think that's it. Thanks again for the review. Thank you very much, all of you.