Welcome back, everybody, for the next talk, which is Jimmy Angelakos from the University of Edinburgh, about bridging, bringing, sorry, the Semantic Web closer to reality: PostgreSQL as RDF graph database.

So, I tried to break the record for the longest title, but I think there are people who beat me to it, so I added an extra bit: "or, how to export your data to someone who's expecting RDF". So what is the Semantic Web, and what is RDF? How many of you in here know what RDF is? Oh, okay, that's great, because everyone at the graph devroom knew what RDF was. It's a way to overcome the limitations of HTML and make the web machine-readable. The way to do it, some people think, is to add metadata to everything, and one way to add metadata to everything is RDF, which is a data model that lets you turn everything into a graph. Actually it's a multigraph, because multiple objects can have multiple things pointing to them, and strictly speaking, in graph theory terms, it's a labelled and directed graph, which means that not only do objects have properties, but those properties usually have directions as well. So "person A authored book A" has a direction, whereas you can't go back the other way using the property "authored", because the book obviously hasn't authored the person.

Instead of the entity-attribute-value model that we're used to, we've given the parts fancy names, and now they're called subject, predicate, object. In RDF the subject is the thing we're talking about, the predicate is the property or action taken by the subject, and the object is the recipient of that property or action. As these are three things, we call them RDF triples, and triples usually have URIs encoded in them that let you instantly access the referenced thing. So in this silly example we have an example.org person called Mark Twain, who has the example.org "author" relation as the predicate to link to an example.org book. In a more real-world example, in the style we're going to look at in a minute, it's edina.ac.uk. I work at EDINA at the University of Edinburgh, and we do lots of software development, mostly bibliographic stuff as well as some geospatial stuff, but here we're just going to talk about text. So we have an edina.ac.uk item that has a DC title. DC is Dublin Core, a way to encode metadata for things like records, books, journals and articles. The title here, we can see, is a literal: it doesn't have to be a URI, you can encode a literal in RDF, because how else can you convey the full breadth of information that you need to share?

What we also have are namespaces, which allow us to make this whole thing a bit shorter. We can bind namespaces, for example DC elements, which are all the things in Dublin Core that can be encoded in RDF, to the prefix dc, so we can just say dc:title instead of writing out the whole URL every time.
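To make the triple model concrete, here is a minimal RDFLib sketch of that toy example; the URIs and the book title are made up for illustration rather than taken from the slides:

    # A minimal RDFLib sketch of the toy example above; the URIs and the
    # book title are illustrative, not taken from the actual slides.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import DC

    EX = Namespace("http://example.org/")

    g = Graph()
    g.bind("ex", EX)   # so the output can say ex:MarkTwain instead of the full URL
    g.bind("dc", DC)   # Dublin Core elements, e.g. dc:title

    #       subject       predicate   object
    g.add((EX.MarkTwain,  EX.author,  EX.book))
    g.add((EX.book,       DC.title,   Literal("An example book title")))

    print(g.serialize(format="turtle"))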
Now, to store our triples we need triple stores, and triple stores are what give persistence to our graph: basically a way to put it on disk, index it and query it. For this purpose we went down a route that not many people use. We used RDFLib, which is a Python library, and an extension called rdflib-sqlalchemy. There used to be an extension that connected RDFLib directly to Postgres, but it has been abandoned and the author says "please use SQLAlchemy instead". So it's pretty clear by now that we decided to use PostgreSQL to store all these triples. How can you do that, given that Postgres is a relational database? How can you turn it into a graph? We're going to let RDFLib do this for us. RDFLib also supports querying this database, once you create it, via the query language called SPARQL. SPARQL isn't very pretty to look at, but once you get your head around the strange syntax, which should look strange to everyone in here who knows SQL, it's actually as functional as SQL. The bad thing is that there aren't many SPARQL databases that perform well, but we're going to look into that in a minute.

Let's look at the original data that we wanted to encode into triples. Originally we had some JSON records stored as JSONB in Postgres. We didn't have serial IDs or any other columns, we just had one big column called "data", and our main identifier for those records is the DOI, the Digital Object Identifier. That is a pretty much standard way for academics and other people to identify digital objects, whether the object is a file, an article, a book or a data set; it's basically a pointer that redirects you to the object, so it can be used as a URL, as we can see here. This URL only points at the resolver, which is dx.doi.org, so you just ask the resolver to give you a redirect to the actual object. We also have other metadata like title, created date, index date and so on; I didn't have space on screen to show everything.

So let's start to look at the Python code that lets you make a simple graph on disk inside Postgres. Of course you have to import psycopg2, which is, let's say, the best Postgres connector right now. Unfortunately it's not pure Python, so if you want a pure Python solution you have to go with pg8000 or something like that, and unfortunately those don't perform as well. Why would you want that? Usually because you don't have the Postgres libraries compiled for unusual systems like embedded systems and ARM machines. But anyway, let's carry on in our Wintel world and assume that everyone can use x86 libraries. We import a few things from RDFLib that let us create the graph in memory, and we also import the extension, rdflib-sqlalchemy, which lets us transfer this thing into Postgres. We also have two namespace declarations: we create two namespace objects, and one of them is our own. It's private; I mean, even if you try to access this URL you won't get anything, because it doesn't exist. It's an arbitrary namespace that we use to describe the objects inside our database. PRISM is actually a fairly standard namespace that gives you some bibliographic types of metadata you can encode in RDF. We register the plugins, we give it a database URI, and we say: please use Postgres, please use psycopg2, my user, my password, at localhost, and a database that we created called rdfgraph, which we're going to put our data in. We also have to give this graph a name so that we can find it in the database. This is the context that we're giving it, and we're giving it in the EDINA namespace, so we're going to be creating a graph called edina/rdf_graph. We tell the plugin to open the data store that we described using SQLAlchemy, put everything under edina/rdf_graph, which is our context, and then we just do graph.open with create=True, and that creates the tables in the database. Unfortunately it doesn't actually do it until you run a commit or add something to the database, so at this stage it doesn't have the tables, even though we've passed create=True.
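A rough sketch of that set-up in Python, assuming the rdflib-sqlalchemy package; the namespace URLs and the connection credentials are placeholders rather than the real ones:

    # A rough sketch of the set-up just described, assuming rdflib-sqlalchemy is
    # installed; namespace URLs and connection details are placeholders.
    from rdflib import Graph, Literal, Namespace, URIRef, plugin
    from rdflib.store import Store
    from rdflib_sqlalchemy import registerplugins

    registerplugins()   # makes the "SQLAlchemy" store plugin available to RDFLib

    EDINA = Namespace("http://edina.ac.uk/")   # our own, private namespace
    PRISM = Namespace("http://prismstandard.org/namespaces/basic/2.0/")

    # the named graph ("context") that everything will live under
    context = URIRef(EDINA["rdf_graph"])

    store = plugin.get("SQLAlchemy", Store)(identifier=context)
    gdb = Graph(store, identifier=context)

    # Postgres via psycopg2: user, password, host and the rdfgraph database
    gdb.open(Literal("postgresql+psycopg2://myuser:mypassword@localhost/rdfgraph"),
             create=True)   # the tables only really appear on the first commit/add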
So we do a couple of binds, which write two things to the database: we bind the namespaces to the abbreviations edina and prism, so we can use them easily later on, and we start to create our triples. We make an empty list of triples and start to populate it. We create an edina item, which is just an object in the EDINA namespace, and we give it a URI, item/ followed by a number, which we're going to use as our identifier in our own database; I'm just giving it 1 for now. We create a triple, actually as a Python tuple, whose subject is the item we just created, whose predicate is the RDF type, and whose object is edina item: we're telling the object what it is. We also give it some properties. We create another triple for the same item and say that its DOI is this string here. You may wonder why this is a literal and not a URI, and unfortunately the answer is that many valid DOIs produce invalid URIs that nevertheless get resolved correctly. So that's a problem: your parser, and not just this specific one, will reject the DOI as an invalid URI, but the resolver will still work for some strange reason, so the safest thing is just to encode it as a literal.

And here is what lets us do this at scale, because we have many records to deal with: the function called addN. addN goes through your list of triples in a nice Pythonic loop, does a list operation on it, and adds everything to our database, which is called gdb here. Why don't you add one triple at a time? Because it's much faster to do it as a batch than to add a triple, add a triple, add a triple. Our last step, to export our data and give it to someone else in RDF format, is just a .serialize: we serialize the whole graph in the Turtle format and give it to the person who's waiting for it.
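Continuing the set-up sketch above, this is roughly what the binds, the batched addN() and the export might look like; the example DOI and the use of prism:doi as the predicate are assumptions for illustration:

    # Continues the earlier sketch (gdb, EDINA, PRISM). The DOI value and the
    # choice of prism:doi as the predicate are illustrative assumptions.
    from rdflib import Literal, URIRef
    from rdflib.namespace import RDF

    gdb.bind("edina", EDINA)
    gdb.bind("prism", PRISM)

    triples = []
    item = URIRef(EDINA["item/1"])                 # our own integer-based identifier
    triples.append((item, RDF.type, EDINA.item))   # tell the object what it is
    triples.append((item, PRISM.doi, Literal("10.1000/example.doi")))  # literal, not URI

    # addN() expects (subject, predicate, object, context) quads, so attach the graph
    gdb.addN((s, p, o, gdb) for s, p, o in triples)   # one batch, not one insert per triple
    gdb.commit()

    # export the whole graph for whoever is expecting RDF
    gdb.serialize(destination="export.ttl", format="turtle")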
Here comes the big moment: we've only created two triples, so how does this thing scale, and what is big data? Big data is when you have lots of data. There's nothing more to it; it's just taking the same principles that you applied to your 10 or 10,000 records and making them scale to 10 billion records. So how do you supersize this? First of all you have to loop over the original database contents in our JSONB table. You have to create the triples in RDFLib, in Python. Then you need to find an efficient way to add them to the graph, because this whole interface between RDFLib, SQLAlchemy and Postgres has some significant overhead, and then you need to find a way to serialize the graph efficiently. How do you create the RDF output? Do you create the whole thing in memory? Do you have three terabytes of memory? I don't think so.

"Why do big data without Java?" is a question I've heard, and also "why do graphs without Java?", because everything there is so standard: we could use Gremlin and Blueprints and TinkerPop and Jena and all those things. The reason is that all those things are frameworks, and they add significant processing overheads, and those are things you cannot afford at scale on your own desktop hardware. "Why don't you use an existing triple store database?" was another question. Because I'm a Postgres guy is one easy answer, and because we know and trust Postgres not to lose our data and not to crash, because it's been working reliably for twenty years or so. I've also heard the question "why didn't you use such-and-such a database, because our database is the only graph DB that does such-and-such", but those people don't tell you that they run it on very, very expensive servers with very expensive SSDs and very expensive sticks of RAM, so that argument is invalid. We can just do it in a few lines of code: this whole project was maybe less than 200 lines of code, and 200 lines of code is what it takes just to loop through a graph in Java, without doing any of the other things we're going to do. We also have to bear in mind that this is an unoptimised method. I just used whatever components I could find and had to hack a few libraries to get them to work reliably with a large data set, but there is potential for improvement, and this is all open source. Most people will try to push you towards a closed-source database if you want to do big data and analytics and graph operations and so on.

So let's look at a little bit of code that will be familiar to anyone who's used Postgres from Python. We open a connection and create a cursor, and then we create a sequence that we're going to use to name our objects; we need some integers for our object identifiers in the EDINA namespace, because as we're reading the database we're creating objects in the graph database. In order to read the database efficiently we have to use a named cursor in Postgres, because if you don't use a server-side cursor in this code, what Python, well actually psycopg2, is going to try to do is load the whole database into RAM, and that is something you do not want. So we create a cursor on the server side and give it some large batch sizes to loop through: an iteration size of 50,000 JSON records, and an array size of 10,000 which we use to fetch the data inside Python. Then we execute just a SELECT of a single column from the table that we have. The loop is very simple. I haven't included the batching code, which is trivial, but all you do is loop every 10,000 records, and for each record you find the attributes you're interested in in the JSON data; if they exist, you create a new edina item and, similarly, the triples that refer to this object, as we saw two slides earlier.
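A sketch of the server-side cursor loop just described, assuming a source table called my_table with a single JSONB column called data, as in the talk; the JSON keys pulled out below are only examples:

    # A sketch of the server-side ("named") cursor loop; table, column and JSON
    # key names follow the talk where given and are placeholders otherwise.
    import psycopg2

    conn = psycopg2.connect("dbname=sourcedb user=myuser password=mypassword host=localhost")
    cur = conn.cursor(name="bulk_read")   # a named cursor stays on the server side
    cur.itersize = 50000                  # rows fetched per round trip while iterating
    cur.arraysize = 10000                 # rows returned by each fetchmany() call

    cur.execute("SELECT data FROM my_table")

    while True:
        rows = cur.fetchmany()            # uses arraysize; never loads the whole table into RAM
        if not rows:
            break
        for (record,) in rows:            # psycopg2 hands JSONB back as a Python dict
            doi = record.get("DOI")       # the JSON keys here are examples
            title = record.get("title")
            if doi and title:
                # here you would create the new edina item and its triples,
                # as in the earlier sketch
                pass

    cur.close()
    conn.close()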
So what is our first challenge? Because this all seems very simple up to now. The first challenge is rdflib-sqlalchemy itself. I don't want to slag it off, but it wasn't written with big data in mind; it's just a few lines of code that let you connect RDFLib to an on-disk data store. The version I started with was auto-committing, and that's quite bad: when you added a triple it issued an INSERT INTO, for the next triple it issued another INSERT, and in between it committed as well. That has fortunately been fixed, so it doesn't auto-commit any more, but it still tries to do 10,000 INSERTs for 10,000 records. So I changed the code a little, and now it does the thing we all know and love from Postgres: INSERT INTO table VALUES (...), (...), (...). It also creates lots of indexes, and those should be dropped if you want your inserts to be optimal, because of what you're going to be inserting. Sorry, I haven't talked about the data size yet: the DOIs we're talking about are all the DOIs that have ever been issued, so that's about a hundred million DOIs plus the metadata that comes with each one, and the final database size was around 3.5 billion triples. So those indexes must be dropped if you want to insert efficiently.

Our second challenge: if this thing stops, we have to restart our SELECT and restart inserting, and we would have to get rid of our graph database, because we don't know whether we're going to insert the same things twice. So, in order to deduplicate and be able to continue, we just use an OFFSET on our original SELECT: we do SELECT data FROM my_table OFFSET the number of records we know we've already processed, and we also do some deduplication so that we don't insert the same object twice. If we have a publisher which is the University of Edinburgh, then we want all records published by the University of Edinburgh to point to that specific object, to the edina record for the University of Edinburgh. One way to do that is to cache it in RAM; it's just text, so it doesn't take up much RAM. The way to do it is to write a SPARQL query, which you prepare for execution, and all the SPARQL query does is find organisations in the FOAF namespace. FOAF is Friend of a Friend, a namespace that lets you identify persons, organisations, these persons' online accounts and so on. You just extract the label, the title, of each organisation, and put it all in a nice Python dictionary you can refer to, so you know which object identifiers you've already inserted into the database. So before you resume inserting, you select all the things already in the database that you know are going to be repeated, such as persons, organisations and publishers, and you put them in a cache you can look through. If you then find a "publisher" key in your JSON data, you go through the cache and see whether you've encountered this publisher before: if you have, you use that identifier for the same publisher, and if you haven't, you create a new record and insert this publisher.
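A sketch of that deduplication cache: a prepared SPARQL query pulls every foaf:Organization already in the graph, together with its label, into a plain Python dictionary. Using rdfs:label as the label predicate is an assumption here:

    # Deduplication cache sketch; the label predicate (rdfs:label) is an assumption.
    from rdflib.plugins.sparql import prepareQuery
    from rdflib.namespace import FOAF, RDFS

    org_query = prepareQuery(
        """
        SELECT ?org ?label WHERE {
            ?org a foaf:Organization ;
                 rdfs:label ?label .
        }
        """,
        initNs={"foaf": FOAF, "rdfs": RDFS},
    )

    # gdb is the SQLAlchemy-backed graph from the earlier sketches
    publisher_cache = {str(label): org for org, label in gdb.query(org_query)}

    # later, while looping over the JSON records:
    # uri = publisher_cache.get(publisher_name)   # reuse it if seen before, else create a new record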
So what is the third challenge? The third challenge again has to do with rdflib-sqlalchemy, which tries to select the whole graph into memory when you try to do this. So you hack the code a little: you find the SQLAlchemy code that specifies execution options, and for this query you just set the option stream_results=True. All that does is create a server-side cursor in Postgres and stream the results to Python, instead of trying to load them all into Python's memory. We also added a batching loop which yields every thousand records and then continues the loop; that number is totally arbitrary, a thousand just worked best on my desktop machine. rdflib-sqlalchemy also has a tendency to cache in RAM all the literals it has found up to that moment, and that code was deleted, because it's totally inefficient when you're trying to do massive batch operations like this.

The next step, once you've inserted everything into your database and you have a fully fledged graph database living inside Postgres, is that you want to serialize it. Serializing is a CPU-intensive operation, because RDFLib is transforming one type of data into another, so we did it using multiprocessing, a Python module that lets you fork additional processes. We create a few worker processes, we have a writer process that takes ownership of standard output, so that we can write directly to standard output and pipe it into other things, and we create queues that we send batches of data to, so we can process them in small forked processes that don't have a large memory footprint. We send a hundred records at a time to a worker process; it finishes and then dies, because Python is not good at freeing up memory, so we actually kill the process and spawn a new one. Generally it's much better to deal with the context switching of all this than to have Python processes balloon and never give their memory back.
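A much simplified sketch of that serialization pipeline: short-lived worker processes turn batches of triples into N-Triples-style text, and a single writer process owns standard output so the result can be piped into gzip and split. For brevity the workers here run one at a time (the real code ran several in parallel), the toy triples are invented, and the string formatting stands in for RDFLib's serializer:

    # Simplified multiprocessing sketch; toy data, one worker at a time for brevity.
    import sys
    from multiprocessing import Process, Queue

    BATCH = 100  # records handed to a worker at a time, as in the talk

    def worker(in_q, out_q):
        """Serialize one batch, hand the text over, then exit so its memory is freed."""
        batch = in_q.get()
        out_q.put("".join(f"{s} {p} {o} .\n" for s, p, o in batch))

    def writer(out_q, expected):
        """Single owner of stdout: drain the output queue, flushing as we go."""
        for _ in range(expected):
            sys.stdout.write(out_q.get())
            sys.stdout.flush()

    if __name__ == "__main__":
        triples = [(f"<s{i}>", "<p>", f'"o{i}"') for i in range(1000)]      # toy data
        batches = [triples[i:i + BATCH] for i in range(0, len(triples), BATCH)]

        in_q, out_q = Queue(), Queue()
        wr = Process(target=writer, args=(out_q, len(batches)))
        wr.start()
        for b in batches:
            in_q.put(b)
            w = Process(target=worker, args=(in_q, out_q))   # a fresh process per batch,
            w.start()
            w.join()                                         # gone before the next one starts
        wr.join()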
So all you do is run the Python script that serializes the data and pass it through split: you give it some numeric identifiers for the output files and split the files every four gigabytes, just so you don't have any huge files to load later on. Those RDF files are in N-Triples format, because N-Triples is better here than Turtle, which we used in the initial example. Turtle actually tries to condense all the data for a specific object underneath that object, and it also tries to sort everything, so there is no efficient way to serialize in Turtle: the whole database has to fit in memory to do that. So we just output the simplest possible format, which is N-Triples, and we also compress it on the fly. That caused an unexpected problem which we'll get to, but generally it worked fine. It seemed to utilise most of our CPUs with a little overhead, and this is just a desktop machine with 16 gigs of RAM; the memory footprint was 400 megabytes, so you can do this on pretty much anything.

Here is our fifth challenge, the serialization itself. With all four CPUs we were creating N-Triples at a rate that could not be written to disk fast enough, so it was in essence an undetected memory leak: the process kept getting bigger and bigger, but really the disk was just not fast enough to absorb all those triples. We fixed that by using gzip, which is much lighter CPU-wise than bzip2, and we also changed the code to empty those queues of batches and wait for standard output to flush before continuing with the next batch. That slowed things down by about 5%, but that's not much; 5% is acceptable.

This is what the Postgres tables look like when you've filled the database. rdflib-sqlalchemy creates those ugly names and puts all of your records inside a handful of tables: literal statements go into this table, URIs go into this table, namespace binds go into this table. So you can imagine that we have tables with billions of records in them, and Postgres handled them like a champ, no problems at all. The only problem is that the indexes created for them were B-tree indexes, which were very large. Let's look at one record now: a record has our table ID, it has a subject, which is our edina object, and a predicate that says this object was associated with this agent, which is a person. This is all it needs. You don't need complicated objects in Java, you don't need messy serializations to and from binary; we're just using text here.

More problems: please make sure you don't enter literals in a field that is marked as URI, because Python will complain when you try to serialize a URI that is not a URI, and you're going to have problems. Fortunately our data was relatively well behaved, so we only had about 88 records out of a hundred million with invalid DOIs. Unfortunately you can't convince those people to change them, because their resolvers do tend to resolve those invalid URIs; some of those things failed when you try to URL-encode them, so you're always going to have a few bad apples. Also, we mentioned the indexing issue: your indexes for a data set of this size are going to be huge, so we need to find a way to change this rdflib-sqlalchemy code to make better use of full-text-search-related indexes like GIN, because B-tree doesn't really scale for this sort of thing when you're trying to search for particular strings or expressions.

One other problem we faced was that we had some content fields which contained base64-encoded bytes, some of them bigger than 10 megabytes. RDFLib does not handle this well: it tries to convert everything into text and copies it multiple times, so for a 10 megabyte record your memory footprint goes to maybe two or three gigabytes. It doesn't really scale, so please make sure you don't have unnecessarily long records, or if you do, link to them from outside the graph database. You also need to drop the indexes before the bulk insert, which makes things more complicated if you have to restart, because in order to select all your publishers and persons you have to have an index; your operation will never finish if you don't. So you need to recreate the indexes and spend a couple of hours waiting. pg_dump is your friend here: once all the indexes have been created, just do a pg_dump, which gives you all the DROP and CREATE statements, so you can do it at will. And please make sure you have enough maintenance_work_mem, because if you try to create a hundred-gigabyte index with 64 megabytes of maintenance_work_mem, that will not go well. This is what pg_dump gives you, just a huge number of indexes and primary keys that get created. You don't need all of them, but find out which ones you need for your SELECTs if you want to be able to restart.
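A hedged sketch of recreating one of those indexes after a bulk load, with a session-level bump of maintenance_work_mem as suggested; the table, column and index names are placeholders, and the real DROP and CREATE statements would come from the pg_dump output:

    # Placeholder names throughout; take the real definitions from pg_dump.
    import psycopg2

    conn = psycopg2.connect("dbname=rdfgraph user=myuser password=mypassword host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("SET maintenance_work_mem = '2GB'")   # the 64MB default is far too small here
        cur.execute("""
            CREATE INDEX IF NOT EXISTS literal_statements_subject_idx
                ON literal_statements (subject)
        """)
    conn.close()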
Thank you very much. Our EDINA developer blog is labs.edina.ac.uk, where we've started putting up interesting snippets of code and things we're working on, and the hack will go up on GitHub, on rdflib-sqlalchemy; unfortunately that's a commit I haven't pushed to GitHub yet, but I'll do it very soon. Some people have also asked me for this huge data set to play with. I'll check what we can share from the University, and I'll be sure to share what I can with you. So thank you very much.

Thanks, Jimmy, and we have time for questions, so please wait for the microphone.

So, what kind of schema do you use in Postgres to store these triples, underneath all these layers for RDF?

This thing here: I didn't choose it, it's what rdflib-sqlalchemy creates. All it does is create very simple tables with subject, predicate, object and so on, and fill those tables with all our triples. It's only five tables, and two of them are the really large ones: one that contains the URIs and one that contains the strings.

Okay, so just the triples themselves, basically, directly. And SPARQL, I don't really know it very well, but I think it supports what would in SQL be recursive queries. Does it, and if so, is such a query mapped to the database, or does the software need to load everything into memory to do very advanced SPARQL queries?

I'm sure that RDFLib supports the full SPARQL query set, so these tend to work as fast as your SQL query would work inside Postgres. Another good example is that we can use this to serve an API directly from our Postgres instance that answers SPARQL queries, with the performance we expect from Postgres, and not some Java thing which has a huge memory footprint and needs tens or hundreds of servers to reply.

Do you know if SPARQL is as expressive as SQL, or more expressive?

It doesn't have all the features we're used to in advanced SQL, but you can do many things, and you can also do multi-dimensional queries.

But I meant the other way around: can any SPARQL query be translated to an SQL query?

No, not directly. It maps the things it needs to select from our relational database into relational queries, and that's done by rdflib-sqlalchemy, so the better we hack rdflib-sqlalchemy's code, the better the queries will be.

Okay, thank you. Are there more questions?

Thank you for the talk. I'm still trying to work out the relationship between the graph database and the traditional relational one. I see there's a subject, predicate, object, and then I see three other fields: there's an ID, that's fine, then you've got a context and a term.

The context tells us that this data belongs to our named graph, which is called rdf_graph, and the term column is some term-combination thing that I haven't used and don't know how to use; it's always zero in our data, for some reason.

That seems like a lot of wasted space on a very large data set.

Yes.

The other question I had, and it touches on the last question too: you're going to have these SPARQL queries, and they're going to get mapped to database operations which, as a Postgres DBA, you would understand, including the indexing for that. How much time do you spend thinking about where you need to put your Postgres indexes? You basically have all those indexes to consider.

You don't need all of them. The best way would be to run your query and just analyse it, and see which tables and which columns it's hitting.

Right, but I guess all the indexes come pre-built, so you're not thinking about the extra ones?

You can, of course you can, but it all depends on what rdflib-sqlalchemy is trying to select from your database; it all depends on the generated query, so you have to enable logging to find that statement and analyse it to create your indexes.

Okay, but you haven't been adding extra indexes so far?

I didn't really have a use for extra indexes in this project. I mean, it will need indexes if it's going to be connected to an API which serves things directly from the database, but the only use case I had for the indexes up to now was restarting the insert operation.

Any other questions? If not, then let's thank Jimmy again. Thank you. The next talk will be about the evolution of fault tolerance in Postgres at three o'clock, so we have a 20-minute break.