Thanks everyone for joining here. I have to excuse myself: my poetic vein is not as greatly developed as the previous speaker's, so we have to be more practical here. So it's about analyzing Debian packages. I actually don't like this... "Analyzing Debian packages with a graph database".

Why? For me it is now about 12 years since I became a Debian developer, but it's the first Debian conference I'm participating in. For me that's always a reason to get to know people, because it's much better to see them face to face and talk to them; I think everyone knows this. So I thought I'd do a bit of self-introduction here.

By education, I'm a mathematician. I'm doing this kind of stuff; it is called mathematical logic, proof theory, intermediate logics. Well, whatever, you will not hear about it, because I guess you would probably run away. I'm also a Debian developer, mainly around TeX: TeX and font packages, and a few other things that I use.

The main part of my development work is centered around TeX Live. I'm the maintainer and developer of the whole TeX Live infrastructure. Just to give you an idea: Debian is quite happy, because they only have Linux as a basis. We provide binaries and distribution methods, including installer and updates, for about 15 different architecture and operating system combinations, including some strange BSDs, Solaris and Windows, and it all should work in the same way somehow. So that's a bit challenging, I have to say, especially the Windows part.

Since my move to Japan nine years ago, I have also become very involved in the Japanese TeX development community. I will give a talk in two days, I think, a bit about this. And besides this, if I'm bored, I'm also a mountain guide, so I like to carry, pull up and push up, and get people into the mountains. Here in Taiwan, unfortunately,
I don't have more time, because there are so many nice mountains here to climb. So I will have to come back. Anyway, it's my third time, I think. Okay.

Yeah, I forgot my job. I'm working at Accelia. I'm also grateful to my company, which allowed me to come here; what I do there is not related to Debian. I'm in research and development. The company is, well, a small company, but it does CDN, Internet services and security in Japan. I do security, machine learning, and some kind of formal verification, which I carried over from my previous work. So this is what I do for a living, let's put it this way.

A bit of an overview of what I hope to cover today. A quick introduction to graph databases, because I'm not sure how many people have heard about them and know what they are, but only very quickly, because otherwise we cannot get through. Then packages in Debian; this will probably be boring for most of you, but just to be sure what we are talking about, a few things about packages in Debian. Then the Ultimate Debian Database; I will introduce and discuss it. Then I will look at how to represent parts of this information, not all of it, in a graph database, and what the advantages of all this would be. I will discuss the technical parts a bit: how to get the information from the UDD into a format and then into a graph database, in this case Neo4j. And finally I will show some example queries and visualizations.

Okay, so what are graph databases? Graph databases started, I don't know, some 15 years ago. The idea was that tables, as in relational databases, are nice and efficient for some things, but not for everything. Many things in our daily life are based on some kind of relation, and it is often easier to represent them as a graph. So what is a graph? We won't go into the mathematical stuff; think about nodes and edges.
So, nodes and relations between them. Graph databases try to fix, or improve, a few things about relational databases.

One is duplication versus joins. If you try to represent a certain amount of data in a relational database, you have more or less two options. You can completely normalize the database, which from the theoretical side is very good, because there is no duplication of data, but you get into join hell: for every lookup you have to join several tables, and if the tables get long, I mean several million entries, then lookups can slow down. Well, we have indices, and the techniques in RDBs have evolved to get around this, but it's still a problem. On the other hand, to make lookups fast, you can denormalize: you copy all the data that you have and put it all in one table. Then it's fast and very easy, and also nice programming-wise, but you have duplication of data, which is generally not very advisable.

Another thing is the rigidity of database schemata. It's very easy to set up an RDB, but if you want to change something in the representation, and anyone who has done this knows, it's really a pain to convert to a new database schema. It takes energy, server downtime, whatever, because you have to get everything out. This is much easier in a graph database.

And finally, also for lookups, and this is related to indices and also to duplication of data and joins, there is the locality of data.
When stuff that is related to each other can also be looked up easily in the database, I mean by a simple pointer operation or something, then things get very fast. If you look at an RDB and think of it as a huge matrix, you have some blobs of data and the rest is more or less sparse, and a huge sparse matrix is not really the optimal representation, because it is not very fast.

A very simple example: think about your favorite social network, and who is friends with whom, or who follows whom, or whatever, and then ask: who is friends with someone who is friends with this guy? This is very easy to ask, but very nasty with an RDB. If you don't set it up very nicely, it's very bad, because you have a double lookup through all the members in the database, which can be a lot. That's not optimal, while with a graph database it's just following edges, which is very fast. And that's nice. So: double joins, long queries.

So, graph databases: the basic idea is that you represent graphs and have relations as first-class objects. It's like in functional programming, where a function is a first-class object; in a graph database, relations are first-class objects. RDBs have "relational" in the name, but the relation is actually not a first-class object there. It's just a table, a structure, not a first-class object. This is different in a graph database.

There are various graph databases nowadays. The most common model is the labeled property graph: you can attach to each node and each relation a set of key-value pairs, and types. That is very common. It's the usual CRUD system, so you have create, read, update and delete methods, very standard, and transactionality. So you have everything you would expect from a normal database.

Okay, Neo4j. This is not the only one.
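As a rough illustration of the two ideas just described, the labeled property graph and the friend-of-a-friend lookup, here is a minimal in-memory sketch in Python. It is not how Neo4j or any real graph database stores data; the class names `Node` and `Relation`, the `FRIEND` type and the sample people are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str                                  # node type, e.g. "Person"
    props: dict = field(default_factory=dict)   # key-value pairs on the node
    out: list = field(default_factory=list)     # outgoing relations kept on the node


@dataclass(eq=False)
class Relation:
    rel_type: str                               # relation type, e.g. "FRIEND"
    end: Node                                   # node the relation points to
    props: dict = field(default_factory=dict)   # key-value pairs on the relation


def relate(start, rel_type, end, **props):
    """Create a relation object and attach it directly to its start node."""
    rel = Relation(rel_type, end, props)
    start.out.append(rel)
    return rel


def friends_of_friends(person):
    """Follow FRIEND edges twice; no scan over all people is needed."""
    names = set()
    for first in person.out:
        if first.rel_type != "FRIEND":
            continue
        for second in first.end.out:
            if second.rel_type == "FRIEND" and second.end is not person:
                names.add(second.end.props["name"])
    return names


# Hypothetical sample data.
alice = Node("Person", {"name": "Alice"})
bob = Node("Person", {"name": "Bob"})
carol = Node("Person", {"name": "Carol"})
relate(alice, "FRIEND", bob, since=2015)
relate(bob, "FRIEND", carol, since=2017)
relate(bob, "FRIEND", alice, since=2015)

print(friends_of_friends(alice))
```

The lookup only touches the relations hanging off the nodes it visits, which is the point made above about following edges instead of joining tables.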
It's the one I used for this project. There are several other graph databases, and nothing I present here is really tied to Neo4j.

Neo4j was one of the first, and it has native graph storage. What does that mean? There are two parts to a graph database: one is the storage, and the other is the computing engine with which you query the database. The storage can still be an RDB, but that is not very efficient. So the best thing is to have a native graph database, where relations and nodes are stored in such a way that nodes that are close to each other, directly related, are immediately referenced. So this is native graph storage. There are a few others, in all kinds of languages.

And then there is this index-free adjacency. That means that if two nodes are really in a direct relation, you can immediately pick up your neighbor; you don't have to search anywhere. That makes it very easy to pull up nodes, and questions like "who is the friend of a friend of this guy" are just jumping over two relations, which is very fast.

How to query it: the query language. There are many, and there is currently a big effort to standardize a graph query language. One of these graph query languages, the one I used here, is Cypher, which was developed together with Neo4j. Like with SQL, you want to ask the database for some data, and the basic construct is something like this: you have two nodes in a relation. What you see here is a typical example. Here node1, node2 and rel are just variables. They don't have any meaning; they just capture a specific node, nothing more. A typical query would be MATCH node1, relation, node2, RETURN... well, that's very bad, I'm sorry. I have... So, MATCH... It's safe.
Yes, type one. So this would be: a and b here are the variables, and then I say the node should be of a certain type, and this should be a relation here. And what should be returned is RETURN a, b. So what comes out here are two nodes, and they have a relation. That's all. Well... this is a typo.

So one of the basic queries is matching, searching for something. You match, you anchor your search somewhere, and then from there you define relations and return a certain set of nodes, and you get all this stuff back. We will see this later in several other instances with the Debian packages.

So how do you create stuff? In the same way: for example CREATE, and then I say what the type of the node is. I can attach arbitrary tags, as I just told you, for example a name, hello, world or whatever. The relation here has no tags. So we can attach arbitrary tags to these nodes and also to relations, and this actually creates two nodes of type one and one relation. We can select all nodes of a certain type, or we can select all nodes that are in a certain relation. So you have a very expressive language for searching. You can pipe these into each other, like iterated searches. It's a very powerful language. I won't go into more detail on this, because that would fill a whole talk.

Next, a bit about Debian. Everyone here knows this, because it's the Debian conference; just as a reminder: you upload to unstable, then it goes to testing and then to stable, and there are some other suites like experimental and whatever. All the information about these packages is recorded in the UDD, as we will see later. And packages have a lot of... well, we have source packages and binary packages.
Everyone knows: developers upload source packages, and also binary packages, and the binaries for the other architectures are built by the autobuilders, and all of it goes to unstable. I think we know about this. This is from the handbook, just a nice illustration for people who don't know it.

A bit about versions, because this is actually quite nasty when you come down to representing it. We have different versions in sid, testing, stable, unstable, experimental, security, whatever: quite a lot of versions. In addition there are intermediate versions that never made it into a release. For example, for asymptote, a package I maintain, in oldoldstable there is 2.31 with the Debian extension, and so on, and there are many other in-between versions. The full version string has in addition the epoch, which is now "please don't use it", but there are enough packages that still have it, then the upstream version and the Debian release. We have a bit more complicated example here in musixtex, which actually doesn't exist anymore, which has an epoch, a version that is a date string of when it was uploaded, and the Debian release. So all this information is recorded somehow.

The components of a package I'm interested in here, there are many more, but these are the ones I'm interested in representing in a graph, are: the maintainer, so who is responsible for it; uploaders; section and priority; versioning; and dependency declarations. There are a lot more, but I will not discuss them.

Some caveats that you only realize when you rewrite this package database as a graph database: one source package can build many different binary packages. Well, we all know this, but the source package name and the binary package name can be very different, and a binary package can come from different source packages in different versions. Yeah, like what I do with TeX Live.
We often incorporate other packages that were packaged separately before, and then you have some temporary package for the upgrade. So source packages and binary packages are quite different beasts in this sense, and if you want to represent this in a database, you have to represent it faithfully in some way, not just as "package".

Dependencies are also quite complicated. The list is actually always growing, as far as I can see: for source packages there is the build stuff, for binary packages we have these, and then we have various forms of dependencies. There is just a normal package, then versioned packages, alternative packages, and dependencies restricted to some architectures. All of these can also be combined and mixed together in some cases, which makes it quite complicated.

Okay, the UDD, the Ultimate Debian Database, is a very nice database which collects... well, if you have seen "Brazil", it's like the Ministry of Information there: everything from all sources is pulled into a huge PostgreSQL database. Packages, source files, bugs from Debian and from Ubuntu, all the uploads, the complete history. It's impressive. Popularity contest, the whole history, lintian checks, orphaned packages, whatever. So it's quite impressive.

The database schema, if any of you has ever looked at it, is this. Well, okay, this is not very... let me zoom in. Yeah. Here are the tables; I can make it a bit bigger. So here, for wanna-build... all the tables here, and then a few connections between the tables. But generally, information about maintainers, package names and dependencies is repeated all over, many, many times. Look, for example, at Ubuntu package summary and Ubuntu packages: the same information is included several times. Ubuntu bugs, security issues... where are the normal packages here? Packages.
So here is the typical binary package with version and, well, thousands of things, and here the summary again. We see here for example maintainer, name, email; all of this is duplicated. Let me zoom out, just for the fun of it. I think that's the complete schema. You never want to read through all of it.

If you look at this, it's highly denormalized. I think every piece of data appears something like 50 times in different places in this database, which also means... I don't actually know how this is managed, but updating one single field must be a really horrible thing. It's a typical example of something grown over time, I believe. I don't know, but it looks like: we just pull in this, then we pull in that, then pull in more, and just stack everything on top. Lots of duplication without connection.

So for me it was like this: if we want this to be actually used, then if you normalize it, you get into a huge join hell, because you have to collect all this information, and as it is now, you have a huge duplication of data. Neither of these options was really interesting for me. So I thought this is an interesting example of what one can do with graph databases.

Coming back to why I do this: at our company, for some clients, we are planning to use graph databases, so this was a kind of finger exercise to see what is possible and what you can do with graph databases. Yeah, I forgot: it's a pleasure for SQL freaks. If you look at some of the examples that pull data out of the database, very impressive.

Okay, now, as I said, I was interested in seeing what can be done.
Can we put this into a graph database that tries to represent the actual instances of objects, in the sense of a source package, a maintainer, as different entities of the graph, and build connections between them? I will go through, so to say, the generation of the graph database schema, that is, which types of nodes and which types of relations, as I developed it step by step, so that one also sees how one comes up with a graph database schema and can use that on other occasions.

So the first is: a source package builds a binary, right? That's one of the most basic things. As I said already, source and binary are quite different beasts, because one binary can be built from different source packages in different versions. So what we first do is have an unversioned source package, which represents the name in the database. If you look, for example, at packages.debian.org, you can put in a package name or a source package name; this is the most general, the catch-all. And we also have versioned source packages and versioned binaries: these are what you get when you go to packages.debian.org and click on Debian unstable, and you have version 3.9.7 or whatever.

So what I came up with here... this doesn't work, of course... what I now introduce are types of nodes and relations. So, names for the node types: SP is a source package, VSP is a versioned source package, BP is a binary package, and VBP is a versioned binary package. And then you have relations between them. Naturally, a versioned source package is an instance of a source package. Well, a source package, this is a general concept.
We need this later for relations. But a version, say version 3.9-14, is an instance of the source package, and in the same way a versioned binary package is an instance of a binary package. Then we have the connection that a source package builds a binary package. And then I introduced "next", so that we have an increasing relation. Actually this is a tree, and it would be better to represent it in some nicer way, but at the moment I only pull in the information from the released versions, no intermediate ones.

So what you get in this case, if you throw in all the nodes, is for example this. On the right side we have luasseq; this is a binary package. Then the blue ones on the right side are the versioned binaries; at the time I created this, there were four versions of luasseq as versioned binaries. On the left side is the source part. You can already see here that the first two versioned source packages came from a source package luasseq. Later on we incorporated it into texlive-base.
I guess... I cannot read it from here... it is now texlive-base. So the source package name changed here, the versioned source package name, but the binary package that is built remained the same. These are just the builds relations, and here we have some next relations. So the basic steps were source package, versioned source package, binary package and versioned binary package.

Next we want to introduce suites, like oldstable, stable, testing, unstable, experimental. It's very easy: a suite contains a versioned binary package, and the relation is, not very surprisingly, named contains. If you put this in, you get some relations here, with sid, buster, stretch, jessie, and we see which of these versions they contain. Of course I don't show all the other relations, because there are quite a lot; just the few we are using here.

Next, maintainers. What do we do with maintainers? A maintainer will be a node; as we will see later, it has a name and an email address, and it maintains a versioned source package or a versioned binary package. If we add this to the graph from before: before the incorporation of this package into TeX Live, it was maintained by the Debian Science Team, and later on it moved into texlive-base, which is maintained by the Debian TeX maintainers team. So there is the maintains relation for versioned source and binary packages.

Up to now it was easy. Now comes the nasty part, and that's dependencies. Dependencies are very complicated to represent because, as you have seen before, we have very complicated expressions in dependencies. The first is a normal dependency, Depends without a version. So I have to somehow point to a package without a version, and this is something I can do: I can simply depend on a binary package. There is more in this relation.
We can put in a relation type, which means strictly less, less or equal, exactly equal, greater or equal, or strictly greater, so the normal relations, and the relation version. This information has to be recorded, and here we use the fact that we can add key-value pairs to each node and relation; here we add them to a relation. Unversioned dependencies are recorded with a relation type of none, just because something has to be in there.

If you do this, we get a bit more complicated stuff. For example, down there we have build dependencies from the luasseq package onto all kinds of packages, and luasseq itself also depends on all kinds of other packages. Actually, I think I showed all of them here, but that already makes the graph quite full.

There is the problem of how to deal with alternative dependencies, of course. Here we introduce a new type, a new object, and here you also see why it's so nice with a graph database: you can introduce a new object, a new type of node, without disturbing all the other relations. Nothing changes in the rest of the graph. You introduce a new type of node and new relations, but you don't have to record anything in some tables anywhere; you just add these relations and nodes. So I just add a node, the alternative dependency, which records the alternative dependencies as is, and add a relation called "is satisfied by": an alternative dependency can be satisfied by either this or the other package, and that points to a real binary package. That allows us to represent it, and it is necessary: an alternative dependency is structurally something different from a normal dependency, because it can be satisfied by several packages.

Okay, a summary of nodes and relations. For the nodes, we have the maintainer; for the maintainer we record just the name and the email. This is different from the UDD.
There, the email is correct, but the name often changes, so there are many different names for the same email address in the UDD. It seems that uploaders sometimes change the name, especially the names of groups. Then for binary package, source package, suite and alternative dependency there is nothing more than just the name; we want to know what the package is called, right? And for versioned binary packages and versioned source packages we have the name and the version; since it is versioned, we want to know the version.

For the relations and their attributes: all the dependency relations, and there are a lot of them, have just two attributes, as explained before: the relation type, whether it's unversioned or strictly less or whatever, and the respective version. And for builds, contains, is instance of, maintains and next there are no attributes; they are just between the two entities. The relation itself already contains the information: "a builds b" already tells you everything you need.

A summary from when I last pulled this down: we have 28 suites, which is quite a lot; this is not only stable and unstable, but also security and so on. We have 3510 maintainers at this time, so unique maintainer email addresses. Alternative dependencies appear in about 8,000 to 9,000 cases. Source packages at that time were 32 thousand or so. Binary packages... well, anyway, you see it's quite a lot. In total, the nodes of this graph are about half a million, and the relations, well, I won't list them, there are too many; in total it's about 4.5 million relation entries in the graph database. And this is only the information I have extracted so far. It does not contain any information about bugs; the whole bug database is something I want to do later.

Okay, how to get all of this from the UDD into Neo4j, or into a graph database? Well, there is a public mirror of the UDD.
It's a PostgreSQL server. Everyone can access it; the information on the website gives you the username and password. I use a Perl script to access it, which is completely standard. Of course I pulled the whole data only once, into a file or whatever, and then I worked on that.

My first try, and this is now a bit for people who work with graph databases: my first try was to generate a lot of Cypher statements, because that was quite easy, and just feed the Cypher statements into the graph database. That was not really a good idea, let me just say that. I think after a few hours I stopped, and we were something like one per mille through the lines of data. The problem with Cypher statements is that for each one you have to lock the whole database, make the transaction, and then release it. So that's completely impossible to actually carry out.

But there is a Neo4j import tool, where you create a CSV file for each node type and each relation type, and then you feed these in. That took about ten seconds for the five million data points I mentioned before, so that was quite easy. I recommend not even thinking about using Cypher for any import of huge amounts of data.

How do I do this, just for those who are interested: the Perl program that pulls the data from the UDD saves the whole stuff; then it generates a huge hash, reorders everything, and deals with inconsistencies in the database, like different maintainer names with the same email, udeb packages not being treated, all this kind of stuff. Then it generates a unique UUID for each item, which is necessary for the CSV and for the linking, and then it generates the necessary CSV files. It's a bit of a convoluted Perl script, but it's not so bad.

Some sample queries, since we are running out of time. Okay, this is a complex query. You have seen only a bit of Cypher, but you will see it's actually readable.
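The CSV-based bulk import described above can be sketched roughly as follows. The talk's actual converter is a Perl script that is not shown, so this Python version is only an illustration: the function names and sample rows are invented. The `:ID`, `:START_ID` and `:END_ID` header suffixes are the Neo4j import tool's CSV header conventions; node labels and relation types are supplied separately (for example via `:LABEL` and `:TYPE` columns), and the exact command-line form depends on the Neo4j version.

```python
import csv
import io
import uuid


def write_nodes(rows, out, fields):
    """Write one node CSV and return a mapping from each row's key to its new id."""
    writer = csv.writer(out)
    writer.writerow(["uuid:ID"] + fields)      # header: id column plus properties
    ids = {}
    for row in rows:
        key = tuple(row[f] for f in fields)
        if key in ids:                         # duplicates collapse into one node
            continue
        ids[key] = str(uuid.uuid4())
        writer.writerow([ids[key]] + list(key))
    return ids


def write_relations(pairs, out):
    """Write one relation CSV linking previously generated node ids."""
    writer = csv.writer(out)
    writer.writerow([":START_ID", ":END_ID"])
    for start_id, end_id in pairs:
        writer.writerow([start_id, end_id])


# Hypothetical rows, as they might come out of a denormalized dump.
vsp_rows = [{"name": "foo-src", "version": "1.0-1"},
            {"name": "foo-src", "version": "1.0-1"}]   # same entity twice
vbp_rows = [{"name": "foo", "version": "1.0-1"}]

vsp_csv, vbp_csv, builds_csv = io.StringIO(), io.StringIO(), io.StringIO()
vsp_ids = write_nodes(vsp_rows, vsp_csv, ["name", "version"])
vbp_ids = write_nodes(vbp_rows, vbp_csv, ["name", "version"])
write_relations(
    [(vsp_ids[("foo-src", "1.0-1")], vbp_ids[("foo", "1.0-1")])],
    builds_csv,
)

print(vsp_csv.getvalue())
print(builds_csv.getvalue())
```

One file per node type and one per relation type keeps every row a plain record, so the importer never has to open a transaction per statement, which is the bottleneck described above.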
So, if you remember what BP and all this is, what we search for is: find all packages in jessie that build-depend on some version of tex-common. At the end there is s.name is jessie. If you start from the right, at the back here, there is the suite jessie, and jessie should contain a versioned binary package, right? That package is captured in this variable VBP. Then we need a source package that builds this binary package, and you see that when you query this stuff you can have relations in whatever direction you want. And then the source package build-depends on a binary package, and that binary package's name is tex-common. What you get is what you see down there. Here you have tex-common and here you have jessie, and all the packages are in between.

What I'm using here is the Neo4j browser; you can also use a normal web browser for this kind of thing. To give you an idea... okay, yeah, it's JavaScript doing all this. You can rearrange things, and then the location is fixed, and after some time the wiggling stops, until you kick it again. If you click on one of these nodes, you see below the name of the package and the version, or for a relation, this kind of stuff: here is a build-depends, and there is for example the relation type and relation version in the lower part. So all this information is duly recorded in the database.

Another question is the number of packages in sid that build-depend on some package X, ordered by the number of depending packages.
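The slides with the actual queries are not part of this transcript, so the following is only a reconstruction of how the two questions posed above might be written in Cypher, kept as Python string constants. The lowercase labels (`sp`, `vsp`, `bp`, `vbp`, `suite`) and relation names (`builds`, `contains`, `build_depends`) are assumed spellings of the node and relation types named in the talk, and the code in the project repository may differ.

```python
# Assumed schema: (suite)-[:contains]->(vbp), (vsp)-[:builds]->(vbp),
# (vsp)-[:build_depends]->(bp). All names are illustrative spellings.

# Question 1: packages in jessie that build-depend on some version of tex-common.
JESSIE_TEX_COMMON = """
MATCH (BP:bp)<-[:build_depends]-(VSP:vsp)-[:builds]->(VBP:vbp)<-[:contains]-(S:suite)
WHERE BP.name = "tex-common" AND S.name = "jessie"
RETURN BP, VSP, VBP, S
"""

# Question 2: for each package X, how many source packages in sid build-depend
# on it, ordered by that count.
SID_BUILD_DEP_COUNT = """
MATCH (S:suite)-[:contains]->(VBP:vbp)<-[:builds]-(VSP:vsp)-[:build_depends]->(X:bp)
WHERE S.name = "sid"
WITH X.name AS pkg, count(DISTINCT VSP) AS cnt
RETURN pkg, cnt
ORDER BY cnt DESC
"""

for query in (JESSIE_TEX_COMMON, SID_BUILD_DEP_COUNT):
    print(query.strip())
    print()
```

Either string can be pasted into the Neo4j browser. Note that the first pattern uses relation arrows in both directions within a single MATCH, which is what the walkthrough above points out.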
Well, it's a slightly more complicated query, but actually not really: the suite is again sid, and we want a build dependency on something. And then you see that you can also add, like in SQL, some aggregation function, like counting, and then RETURN and ORDER BY. So if you know SQL, besides the MATCH statement it's very similar. What you get: debhelper is, not surprisingly, the most used, and dh-python is the second one, which was quite surprising. I think that's the second... well, I have many more, but I won't show them.

Some conclusions. Finding a good representation as a graph is not easy; it's not straightforward. Converting a traditional RDB into a graph database is actually somewhat of a pain, because you have all this old material. It's a very nice technique if you start fresh: if you want to represent some customer data or whatever that has some resemblance to a graph, then a graph database is probably a better way than an RDB nowadays. Don't use Cypher for importing any reasonable amount of data. And as I said, an old, grown RDB is always a pain.

Visualization is also a bit of a problem. It depends on which version of Chrome or Firefox, and sometimes it's fast, sometimes it's slow; I think it has to do with the blood moon or something. Anyway, visualization works. Neo4j recently shipped a new tool called Bloom. It's only for paying customers, unfortunately. That should help with visualization, but there are also some other libraries; the interface is open, the specification is open. It's all going into the open graph specification language... I don't know what the official name is.
So this is all in the process of being standardized, and there will surely be better tools in the near future.

There are some things I want to do, time allowing. The biggest one is the bug database. That would probably make the graph quite huge, but on the other hand it would be quite nice, because things like in which package version a bug appeared and when it was fixed are quite easy to represent in this graph. It would also be possible to get all the information on intermediate uploads from the UDD; there is a table hidden somewhere, the upload table. So all this information is there somehow, but it needs some parsing. The dependency management could be rewritten; I'm not sure this is the optimal representation of it.

What would be nice, once all this is done, but it's not a one-man show, is to reimplement some of the UDD dashboard or some of the services within Debian by interfacing to the graph database instead of the UDD, just in the hope that it speeds things up or makes the code more interesting. Other things I'm interested in are more graph-theoretic: dependency cycles, connected components, things like how strong the connections between certain packages are. So that's more on the graph-theoretic side.

Okay, the sources and everything are on GitHub. There are also some slides, and, thanks for mentioning it, I should do this: I have put some reports on my blog about this; I will link them here and also put them on GitHub, or at least a Markdown version of them, so that the blog posts are readily available.

Okay, thanks for your attention, and if there are questions, I'm open to everyone.

Good coffee break.