I'm Daniel from the Free Software Foundation Europe, and I also work on other projects like TEDective, which we want to be an open source solution to make European public tendering data, or public procurement data, more accessible.

I want to do a couple of things in this talk. First, I want to describe why public procurement data is interesting and why we should take a look at it, and I want to discuss some problems with how this data is currently accessible in the EU context. Then I want to show you our project for alleviating some of these problems, TEDective. And then I want to show you how you can actually contribute to the project. It's still very much in the early stages, just getting going. We love the opportunity to show this now, so that you can contribute even in this early phase of the project.

So what's TED? TED is in the name. TED stands for Tenders Electronic Daily, and it's basically a data set that's published by the EU Publications Office. They've been publishing this for a long time, since 2015 actually, and they've been providing it for free on the internet. It's data about, basically, who buys what from whom: which public institutions buy what, for what price, from which organization. So it's really data about the relationship between business and government.

So if, for example, your local school or some ministry in your EU country wants to buy something that's above a certain threshold, it needs to go into TED and it will be in this data set. The thresholds are defined in EU legislation; you can look them up at the link here, and I will upload the slides afterwards. There's at least 670 billion per year in value that's encapsulated in this data, and there are more than 700,000 notices published each year that describe this entire process of public procurement. Very great, that's something you'd want to explore. So you could think: great, it's published, so what's the problem with that?
I mean, the way this data is made accessible is via this UI. One funny thing is this button here: I still haven't found out what it does, what it changes. Maybe somebody from the EU can illuminate. But basically you have to really know what you're searching for in the first place in order to be able to use this kind of interface.

There are also a lot of other problems with accessing this data. For example, you can't really search by organization, which would be interesting. I mean, it's about the relationship between government and business in monetary terms, so why is there no option to search for organizations that I'm interested in? I can only really do a full-text search over these huge XML files, which are really complex. I can do some other stuff, but there's no typo tolerance, for example, none of the really nice search features that we're used to. And most importantly, there's no ability at all to visualize the results that I get. If I type something into the search mask, I get back basically just an HTML list of notices, and I need to understand what a notice is and which of the different types of notices I'm interested in. So it's really hard, because accessibility is really bad with this data.

So why is TEDective needed? In the past there have been a number of attempts to look at this data and transform it into a more manageable, more readily analyzable format. But we weren't really able to identify a single freely available solution, published under a free software license, that allows you to explore this data even if you don't have domain expertise or data science skills. You kind of need both right now to be able to make some sense of this data. And we thought this is interesting, so why isn't this more readily available?
So we applied to last year's EU Datathon with this idea, basically, to make this data more accessible. And this is what we told them. Let's say we have a public servant who wants to find out, within their state, who buys from Microsoft Deutschland and how much they spend on software from this company, and maybe make the case for how much they could save if they used free software instead. Or let's say you're a journalist who wants to investigate recent purchases made by a border authority. You could do that now with the current interface, but it would be very, very difficult, and you'd have to jump a lot of hurdles to get there. So we want TEDective to be an application that lowers the barrier of entry to analyzing this data.

So we thought, let's present this concept to the Publications Office, built with free software and kept very simple. We built something roughly with this architecture. This was very quickly built just for the Datathon, so I'll go through it quickly. We had these XML files. I transformed them to JSON, for whatever reason, which was a very bad idea. Then I parsed that in Python and put it into some ad hoc schema in Postgres. Then I used the Neo4j ETL tool to put the data I was interested in into a Neo4j database: this relational data that shows the relationship between business and government. And then I used NeoDash to visualize that. That actually already gave people some chance to see what might be possible if you open up this data.

So I'll show you a little demo of how that looked. Basically this is just an overview. I parsed data for roughly three years, or two and a half years. This shows you the activity per country. There are just some general overviews, like roughly a million tenders. It's not optimized yet. Then you search, for example, for Microsoft Germany, and you get this graph. You have a geographical distribution of commercial
activity that's related to Microsoft, and you get this nice graph of relationships, with Microsoft Germany here in the center as an entity. The yellow or red ones are tenders, so here they sold something to some institution of the German government, in this case mostly because Microsoft Germany mostly sells to the German government. The red ones are tenders above one million euro. That gives you a very quick overview of the commercial activity and the relationship between government entities and business entities. You get more information here, and you can actually go to the TED website to see the notice that this was taken from.

Yep? Sorry, a short question. You searched now for Microsoft. Usually they work through these service providers that sit in between in the relationship. Is that something that is in there?

Yeah, can we come back to this? I'll get to the challenges that we face and that you can all help with, and that's most certainly one.

So here I do the same with the Polish border authority. Here it's more like: whom does an entity buy from? Over the past two and a half years you can see what kind of fence, weapon and communication stuff they bought. I'll have to hurry through this.

Yeah, this is actually another problem that I'll talk about towards the end of the talk: de-duplication. In TED data, as it's published in these XML files, there's no de-duplication of entities at all. So you can have Microsoft Deutschland GmbH, Microsoft Deutschland, whatever that is, and, like you can see here, Microsoft Ireland. There are all these different variants. I did some very naive de-duplication attempts when I put the data into the Neo4j graph, but there's much more to be done on that front. And it's a very interesting problem, I think, also because there's a policy side to this as well. Is Microsoft Deutschland a different entity from Microsoft Ireland? And if yes, what does that mean for my data
analysis? Should I analyze them together because they're really operating as one entity? There are interesting questions connected to this that are not only technical.

So let's go back to my slides. Here. So yeah, that was obviously limited in scope, because it was really ad hoc. It was quickly made, and there were lots of problems with how I parsed this data and with de-duplication. Now we're at the stage where there's actually a lot of interest in the FSFE doing this. I heard from a lot of people that they would be interested in analyzing this data and being able to explore it.

So what's next, and what's already implemented? There's the Open Contracting Data Standard (OCDS), which is something that actually came after TED was first published in 2015. I think the OCDS started being developed around 2018, 2019, something like that. If you now build any kind of public procurement platform, you use this data standard, because it's just a very nice way to do it. People have put a lot of thought into how we can represent this entire process of public procurement and put it neatly into a data structure. So now we're building TEDective with this data structure at its core, and the first task will be to parse this TED XML jungle into nicely specified OCDS.

So I built a relational database that roughly captures OCDS. You see a lot of JSONB, because some things were many-to-many or many-to-one, but JSONB for now makes it much, much easier; otherwise this table would not have been presentable.

And now the graph side, which is the next question, because I think analyzing public procurement data, analyzing these relationships between business and government, really lends itself to being captured in a graph database. This is really the core of OCDS that would be interesting to model in a graph database like Neo4j. We have this tender. A tender is basically a public entity saying: we want to buy X, in Y amount. And then an organization,
another organization, can apply for that. They say: we can fulfil this tender, we apply for this tender. That's interesting data: who applies for which tender. And then there are awards, which is basically who gets the contract in the end. So that would be a very simple place to start with a graph database: have all the data in OCDS, then take this subset of what's really central, put it in a graph database, and really start exploring it visually. That's what we want to do.

Part of it is already done. We are currently working on parsing this data, this XML. We use the lxml library for that, which is really nice. I push this into a relational database, and I specify the OCDS data schema with SQLModel, which is really cool. The library basically gives you Pydantic models and SQLAlchemy models in one entity. It's really nice to work with. Then I want to be able to import that data into Neo4j, and I'm building the scaffolding around that. And then we also want to build some UI. We are currently researching which framework to use, and I'm also here to find out which one would be the coolest one, so I'll stay around here. There's also react-force-graph. The idea is to have a really nice UI for analyzing public procurement data, backed by these two, the relational database and the Neo4j database, where you choose, depending on the query, which data source you actually use.

I'll skip through the rest, but this is for you if you want to get on board. The documentation is still rough around the edges, and I'll do my best in the next days and weeks to really make the project approachable to developers who find this interesting and want to work with us.

So, some key characteristics that we really want to put a focus on with TEDective. It must be free software, and it's REUSE compliant. That means that every file carries copyright and licensing information, so that you can really easily reuse it. And we want to
make it for the people. So a lot of my work in the next weeks will also include speaking to people who analyze public procurement data and asking them what kind of queries they would like to run, because that's really important for the design of the system. We ask people: how could this be helpful? We have done some of that, but we will do way more of it, especially now that we're starting to build the UI.

All the data that TEDective uses will also be published under a CC BY 4.0 license, and there will be an open API that is publicly available. There may be some limits, but we will think about that when the problem arrives.

We also fundamentally believe that linked data is more interesting, because once you have this data in the OCDS format, and you have a graph database, you can start linking it with other data sources. Things that come to mind: organization data from a public database of corporate entities. OpenSanctions would then allow you to flag people, companies or entities that are sanctioned. And something like the Offshore Leaks database would allow you to highlight links to offshore companies, if that's of interest for your analysis. So this is a future possibility that I'm really excited about, but the first step is obviously to put this data into a nice format and then think about extending it.

One of the challenges is cleaning this TED data, because some of it is quite old. If you look at data that was published in 2015, there are a lot of typos in there, and it's just these huge XML files. Apparently there wasn't much validation on the forms that were used to input this data, so in some places it's very messy. OCDS actually helps a lot with this part of the process, because it's a very well-defined standard, and there are mappings from TED to OCDS that some people have published. So it's pretty cool.
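To make the parsing step concrete, here is a minimal sketch of turning one notice into an OCDS-style release. It rests on several assumptions: the XML sample is a hypothetical, heavily simplified notice and not the real TED schema, the `ocds-example-` prefix is a placeholder rather than a registered OCDS prefix, and the standard library's `xml.etree.ElementTree` stands in for lxml (mentioned in the talk) so the example runs without extra dependencies. Only a handful of OCDS fields (`ocid`, `buyer`, `tender`, `awards`) are filled in.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified notice. Real TED XML is far more complex
# and uses different element names.
SAMPLE_NOTICE = """
<notice id="2023-000001">
  <buyer>Example City School</buyer>
  <title>Classroom furniture</title>
  <award>
    <supplier>Example Furniture GmbH</supplier>
    <value currency="EUR">125000.50</value>
  </award>
</notice>
"""


def notice_to_release(xml_text: str) -> dict:
    """Map the simplified notice above onto a few OCDS release fields."""
    root = ET.fromstring(xml_text.strip())

    awards = []
    for award in root.findall("award"):
        value = award.find("value")
        awards.append(
            {
                "suppliers": [{"name": award.findtext("supplier", default="").strip()}],
                "value": {
                    "amount": float(value.text),
                    "currency": value.get("currency"),
                },
            }
        )

    return {
        # Placeholder identifier; real OCDS ocids use a registered prefix.
        "ocid": f"ocds-example-{root.get('id')}",
        "buyer": {"name": root.findtext("buyer", default="").strip()},
        "tender": {"title": root.findtext("title", default="").strip()},
        "awards": awards,
    }


release = notice_to_release(SAMPLE_NOTICE)
print(release)
```

A real parser would have to cope with many notice types, missing fields and multilingual text, which is where most of the cleaning effort described above goes.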
And then the next big problem that we would like help with is de-duplication. Any input is very welcome. If any of you have good ideas on that, please contribute, because I think that's really central to TEDective being helpful.

So how can you get involved? All the code is on our Git instance. At the moment you can only really contribute via PRs or issues if you make an account. It's free, it's just a couple of clicks, and that's it for now. If nobody manages that, then we'll think about mirroring on GitHub, but let's try this first. Maybe there's federation coming for the Git forge; it's just not there yet, as I understand. There's also a website with documentation, and you can also write an email to reach the maintainers. I'm looking forward to your questions. Thank you very much.

Regarding funding: did you try to contact the official European institutions so that you can get funding for this, and so that it becomes...

So the question was whether we asked the Publications Office for funding for this. Not specifically yet. I know that they are working themselves on a huge reform of the entire ecosystem. They do what they call eForms now, which is supposed to replace what used to be TED. But eForms still isn't OCDS. There are discussions around that which I don't fully understand in all their details. And they're also rebuilding the TED website. We should get in contact with them. I have the contacts, we have some contacts there, and we should make use of that. But I was really busy in the past couple of weeks. This would certainly be very helpful, and this will happen. And we already got some funding.

The data that you republish: is it still in the TED format, or is it in OCDS?

It will all be in OCDS format. Honestly, I don't think anything else makes sense for data that we republish. There's a project called Opentender, which was a funded project and which also does this republishing. But it's not
consistent in how regularly it updates its database. It's not very active.

I've got a question. When you look at the tenders and companies involved, are you also able to extract what the actual tender is about? So is there an underlying structure, like this is about classroom furniture and this is about military, so that you can group by item or by contract product?

Yes. Should I repeat the question? The question was whether there's also data about what is being procured, and details about what's being procured by a public institution. And the answer is yes. There's usually a title that's very descriptive, and a description, sometimes in English. And then there are CPV codes, the Common Procurement Vocabulary, which specify what kind of category it is. But some stuff is excluded by legislation, such as military procurement. So you can't really talk about open procurement in the EU context yet, because there's still lots of sensitive data that's not being included.

Do you plan to host it?

Yes, absolutely. At the moment the API is down because I refactored so many things, but it will be hosted again. Of course it will be publicly available. If everything crashes because there's so much interest in it, we'll have to think about that somehow. So we'll see if there's really that much interest.

What was the biggest challenge you found?

So, what was the biggest challenge in cleaning the data? I would say one is just finding out whether there is an English translation available, and finding it, because it's not really laid out: if a translation exists, where is it, and what does it apply to? Another one was languages that I didn't know the alphabet of. The hard parts, yeah, were just generally bad company names. For a long time they didn't have any validation on what you could put in there, which makes it really hard. It would have been very easy to implement upstream, and now it causes problems downstream.
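The company-name problem can be illustrated with a deliberately naive normalization pass. This is only a sketch under stated assumptions, not the project's actual de-duplication logic: the list of legal-form suffixes is a small illustrative sample, and the function names are made up for this example. It strips accents, case, punctuation and legal forms, then groups names that normalize to the same key.

```python
import re
import unicodedata
from collections import defaultdict

# Illustrative, incomplete list of legal-form suffixes (an assumption).
LEGAL_FORMS = {"gmbh", "ag", "ltd", "limited", "inc", "sa", "srl", "bv"}


def normalize_name(name: str) -> str:
    """Reduce a company name to a crude comparison key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Case-fold and replace punctuation with spaces before tokenizing.
    tokens = re.sub(r"[^\w\s]", " ", without_marks.casefold()).split()
    # Drop tokens that only describe the legal form.
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def group_by_normalized_name(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


names = [
    "Microsoft Deutschland GmbH",
    "MICROSOFT DEUTSCHLAND",
    "Microsoft Ireland Ltd.",
]
groups = group_by_normalized_name(names)
print(groups)
```

Note that this keeps Microsoft Deutschland and Microsoft Ireland apart. Whether they should be merged is the policy question raised earlier in the talk, and typos or names in other alphabets would need fuzzier matching than this.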