Today I want to tell you a bit about an ongoing project we've had in our group. I really have three goals. First, I want to tell you about a cool library that we've been building; I hope it will be of use to you. Second, I want to potentially recruit some of you to pick this up and add to it if you find it useful. And third, I want to emphasize some of the benefits you can get from incorporating Cython directly into the libraries that you build. Many of you may be familiar with Cython; we've been using it extensively in building Zen, and it's been absolutely instrumental. It would have been pretty much impossible to build what we've built without it. So in my lab, I'm at McGill by the way, we work with complex systems. We're interested in modeling and predicting human behavior at very large scales. And so we spend a lot of time looking at things that look like this. This is a network. I'm sure everyone in this room has seen a network at one point or another; they're extraordinarily sexy and they appear everywhere in magazines. The essence of a network, though, is to articulate the relationships among a bunch of nodes, these little circles here, where the relationships are these edges or lines. And networks are being used everywhere. You might be working on social networks. This is a Twitter network, an ego network, where you have a user in the middle, and here are all their friends and the connections between those individuals. So it's mapping the social context of this user. From this you might ask questions about the propagation of information: how do hashtags propagate? How do people build social networks? How does information move through a social system? All of those questions would be asked within a network.
You would have to actually build this data structure and then do the text mining, the data processing, and the simulation on top of that structure. You might be interested in infrastructural networks. This puzzling network is the subway system of Tokyo, which I can guarantee you is just as confusing to ride as it is to look at. Here we have another network: civil engineers will actually study attributes of these systems. There you're interested in throughput, the extent to which you can move populations through these different systems. And again, you're building these networks and analyzing how people move through them, or how resources are shuttled along them, or what their structure is. Another area we've worked in is biology. This is a very abstract representation of something that's happening in your cells right now: the EGFR pathway. It's responsible for cell growth, and also for a lot of cancer growth. But ultimately it's another network: there's a bunch of molecules, the molecules interact with one another, and they produce biological processes. We're interested in understanding the way those work as well. And then finally, I just pulled this off the internet: somebody was studying flavor networks. You can study some pretty curious phenomena; this is the relationship between recipes, I think. In any case, you can build networks out of a lot of things. But when it really comes down to it, there's one truism about networks, particularly in the past couple of years, and that is that they are getting bigger and bigger. As more data becomes available, we incorporate it and fold it into the networks that we build. And so these networks are getting very, very large. This just gives you a sense.
I mean, if you look at any reasonable social network, you can easily be pulling in upwards of a million to 15 million nodes. And that's pretty large. Now, there are network tools out there for working in Python, and some of you may be familiar with them: NetworkX, and a library called igraph, whose core is written in C. We've used those in my lab, but we run into two problems. First, we need network tools that work on big data sets, and if you've ever tried to load something really, really big into NetworkX, you've discovered that that doesn't work. Second, they need to be easy to use. If you've ever used igraph, you've discovered that it is a completely non-Pythonic library, in the sense that it just feels very awkward in the Python context. So what we really wanted was a library that allows you to work directly with a network in a way that feels right for Python, without giving up performance. Zen, the library that I'm going to talk about briefly, is designed to be easy to use from a Python context, following Pythonic conventions, while not sacrificing performance. In some of the benchmarks I'll show you, Zen is probably the fastest Python library for network analysis across the board. So just to give you a brief overview, if you don't work with networks, this may seem like a rather foreign slide; it's simply meant to dazzle you with all of the features that Zen has. You can represent different kinds of graphs, load different kinds of data, lay out networks, visualize them, and do all kinds of analysis. We're also working on interoperability with NetworkX, so if you want to use it, you can move back and forth. I wanted to give a sense for what usability looks like in Zen.
To give you an idea of how you would do this: here's me loading a graph; there's a bunch of data sets actually built into Zen. And there's me printing the degree, basically the number of neighbors, that two different nodes have. There we have two well-known characters; if you watched the Oscars, you'll know them even better. And down there, what I'm computing is the betweenness of these two nodes, which is to say, how important they are to the structure of the network. What we see is Valjean with one score and Javert with a different score, and we can conclude from our network analysis that Javert is an introvert: he's not very central, and he doesn't have very many friends. This is the kind of deep, insightful analysis you can do very quickly in Zen. You can see that you're able to ask questions and get to the very heart of them with one or two function calls. The goal there was really to make it possible to ask any kind of question in just a handful of lines without sacrificing performance. One of the things I've noticed in a lot of network libraries is that they tend to make everything positional arguments, which annoys the heck out of me, because arguments should only be positional if they need to be. So we've placed emphasis on making functions descriptive, in the sense that keyword arguments are keyword arguments, so that you can easily read what you write. You don't end up with a mishmash of random letters and numbers sitting in your function calls. We also have strong conventions across the library: if you want to load data, it doesn't matter what format you want to load, it's the same kind of function; you simply change the package that you're importing from.
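The slide carries the actual Zen calls, so as a stand-in here is a stdlib-only sketch of the two measurements on a toy graph: degree is just the neighbor count, and betweenness (computed here with Brandes' algorithm for unweighted graphs) counts how often a node sits on shortest paths between other nodes. The graph and all names below are invented for illustration; Zen's real API differs.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm: betweenness centrality of an unweighted,
    undirected graph given as an adjacency dict."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                    # BFS from s
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                                # back-propagate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: x / 2.0 for v, x in bc.items()}      # undirected: pairs counted twice

# A hub "A" with leaves B, C and a bridge D to E.
adj = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A", "E"], "E": ["D"]}
deg = {v: len(ns) for v, ns in adj.items()}
bc = betweenness(adj)   # A lies on every path between its leaves
```

A node can have high degree but low betweenness (or vice versa), which is exactly why the slide computes both for Valjean and Javert.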
Remarkably enough, this is not the way that other libraries are built, which is, again, really baffling. But you can all go and discover the usability for yourselves; what I really want to highlight is performance, and that's what I'm going to spend the last bit of my time discussing. First off, everything that you're about to see is generated using a built-in benchmarking system in Zen. Benchmarking is very, very important in Zen: we actually want to know its performance, not just internally but against other libraries. So there's a built-in benchmarking system that allows you to very easily build benchmarks against other libraries. Here you see we're comparing Zen against igraph and against NetworkX. It feels very much like using the unittest framework, so it's a very natural thing to use. It'll even produce these visualizations; smaller is better here, so you can see Zen outperforming igraph and NetworkX. You could create your own benchmarks if you wanted. I've found this tremendously useful, not only in building Zen, but for figuring out how my own code is performing: you can drop in a couple of different implementations and compare them, and it'll tell you a lot about what's happening. Okay, so how is Zen actually built? Zen is implemented almost entirely in Cython, which, for those of you who are not familiar with it, is effectively a dialect of Python that can be compiled directly down to C. So what you'll see if you look at the source code for Zen is some rather strange-looking Python that's actually expressing data structures that are converted directly into C. All of the graph data structures and most of the algorithms compile straight down to C. So here you can see node information.
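I won't reproduce Zen's unittest-style benchmark API here, but the core idea, timing interchangeable implementations on the same input and keeping the best of several repeats, can be sketched with nothing but the standard library (the `bench` helper and the toy workloads are assumptions, not Zen code):

```python
import timeit

def bench(impls, data, number=20):
    """Time each named implementation on the same input; smaller is better.
    Takes the best of 3 repeats to damp scheduler noise."""
    return {name: min(timeit.repeat(lambda f=fn: f(data), number=number, repeat=3))
            for name, fn in impls.items()}

# Toy comparison: two implementations of the same computation.
data = list(range(10_000))
results = bench({
    "genexpr": lambda xs: sum(x * x for x in xs),
    "map":     lambda xs: sum(map(lambda x: x * x, xs)),
}, data)
```

Swapping a new candidate into the `impls` dict is the "drop in a couple of different implementations" workflow described above.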
It's stored in C types: we declare fields as integers, and there you have an int star, so we're actually using pointers that are not native to Python. We've built a number of data structures that give Zen extremely efficient memory management of nodes and edges, so that you can create graphs and go through and modify them. We run a lot of simulations in my lab, where you're creating graphs and changing them many thousands and hundreds of thousands of times, so you want that to be fast, and that's what Zen does. So here we have efficient memory management. What I'm showing you is the node array: if you delete nodes or edges, you're effectively creating holes in memory, and existing libraries more or less just ignore those. They delete them and then continue growing the node or edge vector afterwards, which doesn't make any sense. We actually keep track of those holes, using effectively a malloc-style free list, and we repopulate them, so we make very efficient use of memory. You can delete huge sections of your graph and create whole new sections without increasing the memory footprint of your graph at all. Now, the big questions you'll want to ask of a graph primarily revolve around nodes and edges. You'll want to ask: does an edge exist, and which edges are incident to a given node? Both of these are extremely fast in Zen. The traditional way of doing the first is to hash the node pair into an edge lookup table, which is basically an O(1) lookup. The problem is that this isn't a very efficient use of memory, because you have this whole other dictionary sitting around storing your edge information. The other thing people do is store edge information on every node: for every node, they keep a full list of its edges.
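The hole-reuse idea is easy to sketch in plain Python: keep a free list of deleted slot indices and pop from it before growing the array. (Zen does this at the C level over raw arrays; the class below illustrates the scheme, not Zen's actual code.)

```python
class NodeArray:
    """Slot-reusing node store: deletions leave holes, and new nodes
    fill the holes before the backing array grows."""

    def __init__(self):
        self.slots = []   # node payloads, indexed by node index
        self.free = []    # indices of holes left by deletions

    def add(self, data):
        if self.free:                 # reuse a hole if one exists
            idx = self.free.pop()
            self.slots[idx] = data
        else:                         # otherwise grow the array
            idx = len(self.slots)
            self.slots.append(data)
        return idx

    def remove(self, idx):
        self.slots[idx] = None        # punch a hole...
        self.free.append(idx)         # ...and remember it for reuse

arr = NodeArray()
ids = [arr.add(c) for c in "abc"]     # indices 0, 1, 2
arr.remove(1)
reused = arr.add("d")                 # fills the hole at index 1; array stays length 3
```

Delete-heavy simulation loops are exactly the workload where this keeps the footprint flat instead of growing without bound.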
The problem being that you're duplicating your edge information all over the place, increasing the amount of space you take up. So instead, we've created a merged data structure in which nodes maintain edge lists that are kept in sorted order, which makes them very high performance: you can index into them very quickly. You can test for the existence of an edge, rather than hashing, in basically O(log d), where d is the degree of the node. And degrees of nodes are generally very small, so this is like one or two steps; it's effectively faster than hashing. You can also access a node's edges, because they're stored right in the node's edge list, in O(1) time per edge: you simply iterate straight through the node's list. We have a question: have you considered a bit set for storing the edges, and have you done any measurements against that? Well, how would you imagine doing a bit set? You'd have to number all the nodes and have a vector. The thing is, networks are generally very sparse, so generally you wouldn't want to do that: you'd waste a lot of memory in a full bit-set representation. If the graph was almost complete, then that'd be exactly what you'd want to do, but our focus is almost entirely on natural networks, which are going to be very sparse in general. Thanks for the question, though. So, sure, you're going to want to check for the existence of an edge, and that might cost you something. But in general, what you're going to spend your time doing is walking over your network, and what you see here is that we give effectively O(1) time to access any content in your network, which is as fast as you can possibly get. So algorithms that you write pay basically nothing to walk over the graph structure. Okay.
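A minimal sketch of the sorted-edge-list idea, using the stdlib `bisect` module: edge existence is a binary search in O(log d), and iterating a node's neighbors is just walking the list. (Zen's real structures live in C arrays; this only illustrates the access pattern.)

```python
import bisect

class Node:
    """A node that keeps its neighbor indices in a sorted list."""

    def __init__(self):
        self.nbrs = []                      # sorted neighbor indices

    def add_edge(self, v):
        bisect.insort(self.nbrs, v)         # insert, preserving sort order

    def has_edge(self, v):
        # binary search: O(log d) for a node of degree d
        i = bisect.bisect_left(self.nbrs, v)
        return i < len(self.nbrs) and self.nbrs[i] == v

n = Node()
for v in (7, 2, 9):
    n.add_edge(v)
# neighbor iteration is a plain pass over n.nbrs, no hashing involved
```

Since real-world degrees are tiny, those log d probes usually mean one or two comparisons, which is the "effectively faster than hashing" claim above.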
In NetworkX, and in Python libraries in general, objects are the basis for interacting with the graph. And that's fine; that's as we'd like it for ease of use. But if you want to write high-performance routines, if you want to load Wikipedia into memory, you do not want Python objects representing every page, because you will absolutely blow out your memory. What you need is a more efficient representation. So we give you two ways of talking about nodes and edges: indices and objects. Objects are the unique Python objects that represent a node or an edge. But you can do away with them entirely and use indices, which are direct references into the underlying node and edge arrays. This gives you very high-performance access, because you're indexing straight into the memory structure the graph uses for storage, rather than paying the additional cost of taking an object, hashing it, finding a node index, and then looking that up. Finally, one thing that always bothered me was that when I was writing in Python, my code was always slower. Zen is written in Cython, but you're not always going to be writing in Cython, and the general idea is that we'd like to make Python code both faster to run and faster to write. What you're seeing here is two different routines that simply find the node with the most neighbors in a graph. The one on top is the way you would typically write it in Python: you iterate over the nodes, ask for the degree of each node, and keep a running maximum. Using node indices, which is what's happening below, all you do is add an underscore. It's simply a notational difference, but rather than iterating over objects, you're iterating over indices.
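The two-level API can be illustrated with a toy graph class: the object method pays for a dictionary lookup to translate the object into an index, while the underscore method takes the index and goes straight to the array. The underscore naming mirrors the convention from the slide, but the class itself is an invented stand-in, not Zen's implementation.

```python
class Graph:
    """Toy graph with an object API and an index ("underscore") API."""

    def __init__(self):
        self.idx = {}     # node object -> node index
        self.adj = []     # node index  -> list of neighbor indices

    def add_node(self, obj):
        self.idx[obj] = len(self.adj)
        self.adj.append([])
        return self.idx[obj]

    def add_edge(self, u, v):
        ui, vi = self.idx[u], self.idx[v]
        self.adj[ui].append(vi)
        self.adj[vi].append(ui)

    def degree(self, obj):       # object API: hash the object first...
        return self.degree_(self.idx[obj])

    def degree_(self, nidx):     # index API: ...or skip straight to the array
        return len(self.adj[nidx])

def max_degree_node(G):          # object world
    return max(G.idx, key=G.degree)

def max_degree_node_(G):         # index world: same logic, one underscore away
    return max(range(len(G.adj)), key=G.degree_)

G = Graph()
for n in "ABC":
    G.add_node(n)
G.add_edge("A", "B")
G.add_edge("A", "C")
```

The two `max_degree_node` variants are line-for-line the same; only the level of the API they talk to differs, which is the whole point of the convention.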
And you can see you actually get a two-fold speed improvement simply by adding underscores to your functions. Let me emphasize that: you add underscores to things, and it makes everything faster. In my opinion, that is the way optimization should work: you write your code so that it looks good, and then you add underscores. For many, many things you want to do in Zen, this is precisely how it works, because all you're doing is moving from the object world into the index world. For a lot of algorithms and a lot of analysis you want to write, that'll be all you need. This slide shows some algorithmic speeds: Zen actually outperforms the other libraries available for Python on the standard algorithms. And this is for loading data: if you want to load Wikipedia from the standard edge list format, it will take on the order of about 15 minutes, and that's assuming you have plenty of memory. It'll take on the order of about a minute using Zen's loaders. So you get a tremendous speed improvement if you use some of the input and output functions that Zen supports. We've actually developed our own format to accommodate really large-scale network analysis: something we call a binary edge list. You can read about it on the website, but what's really interesting is that not only is it very compact, it can also be written to database fields very easily. So you can store it in MongoDB or MySQL, and you can write simulation output to file in ways that we haven't been able to do before. So finally, just in closing, because I only have two minutes left, let me encourage you to take a look if this is an area you work in.
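The actual binary edge list format is documented on the Zen website; purely to illustrate the general shape of such a format, here is how integer index pairs can be packed into fixed-width binary records with the stdlib `struct` module, producing a single byte string of the kind you can drop into a MongoDB or MySQL blob field. The `"<II"` layout is an assumption for this sketch, not Zen's specification.

```python
import struct

REC = "<II"  # one edge record: two little-endian unsigned 32-bit node indices

def pack_edges(edges):
    """Serialize (u, v) index pairs into one fixed-width binary blob."""
    return b"".join(struct.pack(REC, u, v) for u, v in edges)

def unpack_edges(blob):
    """Recover the (u, v) pairs from a packed blob."""
    size = struct.calcsize(REC)   # 8 bytes per edge
    return [struct.unpack(REC, blob[i:i + size]) for i in range(0, len(blob), size)]

edges = [(0, 1), (0, 2), (2, 3)]
blob = pack_edges(edges)          # 3 edges * 8 bytes = 24 bytes, no parsing on load
```

Because every record has the same width, a loader can slurp the whole blob and walk it with pointer arithmetic instead of parsing text, which is where the order-of-magnitude loading speedups come from.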
I'd love it if you would take a look and join the community. It's under ongoing development in my group, and of course we're always looking for help. Here are a number of things that I think need some attention. Adding more unit tests; that never hurts. Additional input/output functions: supporting additional file formats, and loading from databases. Network generators, for doing null model testing. Implementing standard algorithms: NetworkX, in all fairness, has a much larger spectrum of algorithms implemented. I don't see us as necessarily competing with them; I'd like to live in a complementary world where we provide very high-performance, very Pythonic implementations, and they provide this huge spectrum of different functions. But I would like to increasingly get those into Zen. Layout and drawing: we do some, and it would be nice to do others. Windows: if you run Windows and would like to help, I would love to talk to you, because getting Windows and Cython to work together... I remain baffled by exactly what's going on there. Binary installers: it's very much a computational scientist's tool right now, but I would love to make it more broadly available. And there's my contact information, so please check out the site, and if you're interested in this or anything else I've talked about, please feel free to send me an email. Thank you very much.