Hello, everyone, and welcome to the session. Glad you came. This is my first talk, so please be patient; sometimes I may appear a little nervous, and that's because I am. So let's enjoy it, and hopefully it will be both interesting and educational for you. If you wish to open the site that's up there, the link is in the schedule. I linked it into the schedule details, so you can open it and follow the presentation if you wish. Right now it's going to be just slides, so it's not necessary, but if time permits we're going to get into some R notebooks, and it might be beneficial for you to see the code, because I'm not going to stop there for too long. Just some details about me: I'm Mark. Hello. I'm an associate software engineer at the Red Hat AI Center of Excellence. My colleague Marcel had the presentation before me. As far as my skills and languages are concerned, don't worry about that; time is very precious, so let's get straight to the outline of this talk. There will be two presentations, Open Data and Brno Data. In the Open Data one, I'm going to talk about open data in general, and then I'm going to relate it to the data that is opened and published by the Brno data portal. And then I have prepared some data sets; there are three main data sets, with some others merged into them: Brno municipality units, where we're going to explore some basic Brno population data, traffic accidents, and Brno crime. I'd like this presentation to be somewhat educational for aspiring data scientists, let's say, so if you want to play with open data, you'll have a basic idea of how to do that, what it looks like to work with the data, and how to present it to the public. And hopefully you'll learn some interesting information about Brno along the way. So let's start. The presentations are down there; I have the first one open here. Let's start with open data.
I've always, I won't say hated, but not very much liked the introductory question that a speaker asks, but it somehow feels very natural to ask one here. So: how many of you are pretty sure you know what open data is? OK, that's pretty good; that's about 10% of the people here. The rest probably categorically refuse to raise their hands, or think, what a stupid question. All right, so let's go to the definition itself. It is actually defined, somewhat: open data should be information published such that it allows for remote access, in some machine-readable format, published under an open license, and it should also be registered in an open data catalog. All right, that's the definition, but that's not all there is to it. Here are some specific requirements that should be met in order for the data to be open, and we really, really want these requirements to be met, as we'll see in the rest of this talk. We surely want the data to be available in some machine-readable format. We want it to be accessible; not necessarily registered in the National Open Data Catalog, but that's a good thing, because there comes a time when you want to link data sets together and find data by following those links. We want it to be created and maintained: there will be problems, and you'll want to contact the maintainer, ask questions, and have things documented. Now believe me, when we get into the Brno data portal, those requirements are not always met, and I'd like to say a few words of criticism afterwards; we'll get to that. So the main requirements: availability, reusability, universal participation, and interoperability. Just remember these. Now, what's not open data? I want to emphasize data in PDF format, because it's so common, right? People still publish data this way; even the Brno portal publishes data in PDF format and claims it to be open. How do you process it? It's very difficult.
Also, data in XLSX format is very common, claimed to be open data, and it's not, because the format is proprietary. The same goes for web services: there are times when the publisher just gives you a link to an API, and you do what you can with the data. Sometimes that's beneficial, but for most purposes it's not ideal; you want to have some schema, and that schema might not be part of the API. Then there's something called the five stars of open data, defined by Tim Berners-Lee; you can read more at that link there. We're going to get into more detail in a bit, but basically the first star is that the data is published under an open license. The second star is an open license plus a machine-readable format. The third star is a machine-readable open format, and those are not always the same thing, right? The fourth and fifth stars are just some very good things to have: linked data and references to other data sets. Now, good. OK, so here we go. These are some details about it, and note that we can consider data to be open, for us as developers, from the third star onward. Usually this means data published in, say, CSV format under an MIT license. OK, so what are the costs and benefits of having these kinds of data? With one-star data, me as a consumer, I can look at it, I can print it, I can change the data as I wish, because there is an open license, right? And I can share that data with anyone. As a publisher it's simple: just dump it in there, never mind, never care. I don't have to maintain it, I don't do anything with it. It's pretty simple, and that's largely what the Brno portal does. Sorry, if there's someone from the Brno portal here, I hope they are not waiting outside to punch me or something, but sometimes it's better than that. OK, so as far as two stars are concerned, we can directly process the data because it is in a machine-readable format.
And we can export it into another format. Usually two-star data is not in an open format; it could be something like an Esri shapefile, for example, which is very common for geospatial data. And as a publisher, it's still quite simple to publish. As far as three stars are concerned, I can manipulate the data in any way I like, and as a publisher there are some consequences to this: I might need converters, because if I use a proprietary format, I need to convert it into an open machine-readable one. Now, four stars are a little bit more interesting. Of course, we can do everything we could at three stars, but on top of that we can link the data from another place, we can bookmark it, and we can reuse parts of the data, because the data are linked and there are URIs defined. It's pretty cool, but it's not always necessary for common use cases. Now, there is a consequence for the consumer when it comes to this: we need to understand the structure of the RDF graph of the data, which is not always an easy thing. And as a publisher, we need to do some additional work, like investing time in slicing and dicing the data: you need to assign URIs to the data items and fix existing URIs if some were assigned wrongly. And five stars, this is very precious to have; I've never seen it in the wild yet. You can do all sorts of stuff; if you want to dive deeper into this, just knock yourself out. But most importantly, you can follow arbitrary links between data sets. A word of caution, though: you need to deal with those 404-like errors, because those links are very often broken. It's hard to maintain for the publisher; you need to invest resources to maintain those links to other data sets and other web services, to repair them, et cetera, et cetera. This cost comes up twice. And what's the meaning of it? Why do we want to have open data anyway? Well, transparency. This might be open government, for example: we want to make decisions that are transparent.
And we want to provide the underlying data behind those decisions, because we want the public to trust us, right? Then infrastructure: it's important to remember that data might not be just a data set that comes from some service or monitoring; it might come from IoT devices, and there are plenty of them out there, so we might use open data to build whole infrastructure on top. Normalization, right? Normalization of the process, of the life cycle. It comes very naturally: we get some data, we put it into some standard, we fulfill, let's say, three stars of open data, we publish it, and we open it up for feedback. We can normalize this process, and that's good; we can compare the data with reality, and we can compare ourselves to other publishers. Also, there's the economic aspect. It's very convenient to provide the data in an open machine-readable format, because this format is usually targeted at developers, rather than at people unfamiliar with or untouched by data processing. If you target those other people, I mean the general public, you might need to invest in the creation of PDF files and in the design of the data, so that the data looks picturesque, likeable. So it's basically very economical to provide data this way, and you'll be popular among developers, because we all like open data. And there are many, many more reasons; for example, it's beneficial for education. Me, I like open data mostly for this part: I like going to the Brno portal, taking a new data set, playing with it, and checking what's new in Brno, what's changed, what's worse, what's better, et cetera, et cetera. You can do it, too. All right, so my call to arms for this presentation: be a curious consumer of the data, and be a smart architect, a responsible and aware publisher, a forthcoming maintainer. And I hope that you no longer think open data is just a buzzword you don't really understand. I hope that you no longer think open data is just any data published under an open license.
I hope you don't like PDFs anymore, and that you won't name your columns with whole sentences; that's something I'm going to explain in the next presentation. I'd like to take time for questions now, just one or two if you want, so that you don't forget what you wanted to ask. Please do so if you want. Cool, all right. So let's go to the next presentation: Brno data, hashtag #brno2050. This is something pretty recent, actually; it hasn't been very long since Brno started publishing data at the Brno data portal and opened it up. Now, it is an open platform for sharing data, and it's designed to be used by the public, including citizens, entrepreneurs, students, et cetera, et cetera. And if you open this portal for the first time, you'll see that very clearly, because it's not just a catalog where you search for data; it's actually a whole dashboard of applications built upon that data that you can explore visually. Pretty nice, actually. Now, why? "Data is our new urban wealth, and we need to use it to full capacity." I like this sentence very much. Data will, for certain, be the driving force, at least for this century, and I think most companies are very aware of this. It's high time that cities start to be aware of this as well, as well as the citizens. Now, where does the data come from? Mostly it's collected by the city itself, either from the smart city project, for example, or by companies and other providers. Now, what do we have here? There's plenty of data; quite a lot, actually, though not enough. There's economy, health and environment, transport, people and housing, education, infrastructure, safety, and city data. And you can find not only data sets there, but also some useful apps and articles. So, what deserves praise? It's a step in the right direction, OK? I think we can all agree on this. The dashboards are pretty cool; this is something you'd see when you open the Brno data portal.
You'll see plenty of those dashboards. When you open them, and I'm going to say it right now, they are mostly built with ArcGIS apps, or I think it's called something like that. This is proprietary software that is not freely accessible; I think it's not even accessible for students for free, which is very bad, and it makes the dashboards very hard to reproduce. In a bit I'm going to get to the words of criticism, where I'll say all of that. Responsible and helpful administrators, that's right: at least from my perspective, when I was working with that data I hit three problems along the road, and yes, in one case it took the guy a whole week to respond. But he did, and that's a good thing; it's not common, actually. And he did respond and helped me out with the problem I had. By the way, it was a problem that shouldn't have been there in the first place. But anyway. And now the words of criticism. I've always been better at these, like criticizing things; that's my thing. So there are a few more slides here. The five-star standards are not always met. And I don't expect the Brno portal to meet four stars, five stars, but those three stars, this is pretty basic, and even this is a problem sometimes. Data distribution could be better; take this example. Most of the data in the Brno portal is in Czech, and it's going to stay that way on the slides as well. Unfortunately, I don't know how many English-speaking, non-Czech, non-Slovak people are here, but I see a few, so I'm going to have to translate a bit. This, for example, is the data set I have for the criminality. The first four files there are actually columns: the data is separated such that each file holds just one column of the schema, very relational-database style. And I don't like this kind of thing when it comes to publishing a simple data set. It could be just one CSV, or in this case XLSX, file.
But no, they had to split it into five data sets, so that we have to invest our time googling how to merge data sets. So, next thing: the column names. These are very tricky for developers to deal with. If you're familiar with R, this is the same data set, actually, the criminality; I just printed out the column names. And what you can see on line 17 is a whole sentence, actually: it states "the object the offender was interested in, or his relationship to the victim" dash "text 1", dash "text 2", dash "text". Very good; slow clapping. Try to address columns like these. If you're working with an IDE, some of them can handle these things, but can you imagine that line, and there could be 30 such columns? You don't want to rename all of these every time; it's very annoying. So, still on this topic, here's an example of how you can address such a column in R, with dat$ and the column name in backticks, and I at least stripped the spaces. The next thing is the poor orthogonality, or schema inconsistency, of the data. Remember what I just said: it's "the object the offender was interested in, or his relationship to the victim". Can you imagine working with data like this? In one column you have airbags, you have an ATM here, you have a husband here, you have this "nic" thing, which is Czech for "nothing". Very good. Or at row 261 there are ethnic categories mixed in: Indians, Asians, Australian Aborigines. And then there's still not enough data. Sometimes you just want some data, like the population data I'm going to show you: I couldn't find recent population per municipal unit in a machine-readable format. I had to open the yearly report, which is a PDF, search that 60-page file to find those ten numbers, and write them down so that I could use them. There are still plenty of PDFs there. And sometimes, it's too user-friendly. I'm a developer; I want to work with the data as a developer.
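As an aside for anyone trying this at home: the talk demonstrates the cleanup in R, but the idea of normalizing sentence-long column names works in any language. Here is a minimal sketch in Python; the column names below are invented English stand-ins for the real Czech headers.

```python
import re

# Invented stand-ins for the sentence-long headers in the
# Brno criminality spreadsheet.
raw_columns = [
    "ID",
    "Object the offender was interested in - text 1",
    "Relationship to the victim - text 2",
]

def clean_name(name: str) -> str:
    """Lowercase, trim, and collapse runs of spaces/dashes to one underscore."""
    name = name.strip().lower()
    return re.sub(r"[\s\-]+", "_", name)

columns = [clean_name(c) for c in raw_columns]
print(columns)
# ['id', 'object_the_offender_was_interested_in_text_1',
#  'relationship_to_the_victim_text_2']
```

Once the names are predictable identifiers, you can address columns in code without backticks or copy-pasting whole sentences.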
And sometimes there are cool data sets, like very cool, but all you get is an image. So unless you're damn good at image processing and can pull the values out of a picture, you're stuck. One of the last points is that it's built on proprietary software. Esri is a company specialized in geospatial data; they have pretty good software, like very good, but unfortunately it's not open, and if you want to reproduce this, you can't. And most of the data is in the Czech language. Oh, yeah, one more irritating thing: when you open an app that is related to some data set, there's no link to that data set; there's just the app. You'd want to have a link there. OK, so my call to arms for this part: take a look at the apps, see what data is offered, and take your time to explore and play with the data. And of course, give them your feedback, right? There's this whole red button called feedback on their site; just hit them. Perfect. So that's it for Brno data. If you have any questions, hit me. Yes, please. [Question from the audience.] So the question was how many of the data sets published by Brno are actually five stars, four stars, three stars, et cetera. I couldn't possibly answer that precisely; I have no statistics on that data. But from my experience, what I can say is that I have encountered four- and five-star data; in most cases, by which I mean 60-70%, the data sets were three stars, and the rest, some 20-25%, were two stars. I haven't encountered one-star data. Everything that is there is obviously under an open license, unless there is no link from the application to the data set, in which case I don't know which license it is. OK, cool. Oh, yeah, sorry. [Another question.] So the question was whether I saved the files in a Microsoft format. Oh, the picture that I showed you, you mean the criminality and the XLSX files? This is not something that I produced; this is something that I downloaded from the Brno data portal.
So I haven't done anything with it. Obviously, it would be much better to process it and dump it into CSV or something like that; or open document format, sure, you can do anything you want with this data. But the answer to the question is that I just downloaded it from the portal as-is, at least as far as this picture is concerned. All right, so hopefully I haven't overlooked anyone else, and I think we can now get into the data sets themselves. Now, I was thinking about putting these into the presentations as well, but then I figured that most of you here are developers; you might want to see the code, to see what it actually looks like to work with this data. So I decided to embed the whole R notebooks into the website, and you can see them right there. First, we're going to look at the Brno municipality data, which is the Brno population data. I'm not going to explain the code very much. I have submitted a few proposals for R workshops as well, because I'm sort of an R advocate; I think it's a very overlooked language, and it's pretty neat. Actually, I made all of this, the website included, using R and Hugo static site generation, by the way, and the visualizations that you'll see were made similarly. For those of you who are not familiar with R, it's a language quite similar to Python, with some aspects of Haskell. And there is one difference from Python that is very significant, and that's the incredibly bad, terse naming, like incredibly: standard functions go by abbreviations like st_ or sf. Yeah, that's one thing, but once you get used to it, it's pretty good. OK, so all three data sets that I have here were actually picked so that I could focus on geospatial data, which is what the sf package handles, because I wanted to show you that you can work with geospatial data using open standards as well, and you can do that on your own.
And I've got 10 minutes, so I'm going to have to speed up a little bit. So the first thing you usually do is look at the distribution of the data somehow. You dump out the humble box plot, which is sometimes very clever, by the way, and a very easy visualization. And this is what the population distribution across municipal units looks like; the median is somewhere around 10,000 people. By municipal units, I mean the city districts of Brno. Now, we want to be more specific; we usually proceed from general information to more specific information, and in this case we want to see how many people there actually are in each unit. A good way to do this is to produce a bar chart. From this, we can immediately see that Brno-Center, the very tall one, has the highest population. Now, when we want to present these kinds of charts to the public, we want to color them, right? And we want to figure out the best way to do that. Unless there is some underlying factor or feature that I can color the data by, what I usually do is split the data into a few categories using the Jenks natural breaks approach; in this case, you can see the split here, and we color by it. And for Brno locals, it's not surprising that Brno-Center has the highest population; then there is Brno-North, Brno-Líšeň, et cetera. Now, for me the interesting part is plotting things on the map, so we get the Brno base map. Going back to the Brno data portal and open data: this is not an easy thing, because Brno does not publish the shapefiles for those maps, and it's kind of hard to get them. But with some effort, you can get the shapefiles for the whole Czech Republic, and then you can filter the Brno districts out. You can color them based on the population, you can make the colors more beautiful and add a legend, and you can add labels.
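To sketch that category-splitting step: the talk uses Jenks natural breaks in R, but the idea, cutting a numeric column into a handful of bins and coloring by bin, can be illustrated with plain tercile cut points. A minimal Python sketch; the population figures below are invented round numbers, not the real ones from the Brno yearly report.

```python
from statistics import quantiles

# Invented population counts per municipal unit (illustrative only).
population = {
    "Brno-střed": 64000, "Brno-sever": 47000, "Líšeň": 26000,
    "Bystrc": 24000, "Komárov": 8000, "Ořešín": 600,
}

# Tercile cut points are a simple stand-in for Jenks natural breaks;
# both yield a small number of bins you can map to colors.
q1, q2 = quantiles(population.values(), n=3)

def bin_of(value):
    if value <= q1:
        return "low"
    if value <= q2:
        return "mid"
    return "high"

# Brno-střed and Brno-sever land in "high", Ořešín in "low".
colors = {unit: pop and bin_of(pop) for unit, pop in population.items()}
print(colors)
```

Proper Jenks breaks minimize in-bin variance rather than equalizing counts, so on skewed data they usually place the cuts more sensibly than terciles; the mapping-to-colors step is the same either way.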
And what I usually do when I'm adding the labels is make them a little bit symbolic as well: I use the alpha channel to emphasize the most valuable labels, in this case the population. So that would be the municipality visualization. If you're into other colors, use them, right? No problem. Now, area size might also be interesting, and we can visualize this on the map as well, along with population density. And the reason I'm spending so much time here is because I still have the time, and this is the least interesting data set. But what is interesting, and I want to point it out, is Kepler.gl. For anyone doing anything with spatial data, this is an incredible tool, and it's incredibly easy to use. I would show you a demo if I had time, but I don't. You can plot data in this pseudo-3D, and it's very good; use it, check out the demo. You'll produce visualizations like this very easily from CSV files, JSON files, GeoJSON files. Pretty neat. All right, then there are the traffic accidents. I'm going to skip through these, because I have four minutes left, and go to what is, at least in my opinion, the most interesting visualization: Brno criminality, because I think that's something we can all relate to, whether you're from Brno or not. So I'm just going to slide through this. This is Leaflet, by the way; if you've ever used it in JavaScript, you can use it in R as well, and it makes pretty neat visualizations. In this case, it provided us a new view of those traffic accidents: we would expect the highest population, in the Brno center, to correlate with the highest number of traffic accidents, but it appears that there are actually other parts of Brno that are, let's say, more dangerous. You can figure this out in the notebook, so if you want to, try it yourself, check it out. And now the Brno crime. All right.
This is an interesting one; I like it very much. Although the data is very unpolished, and you've seen the examples, they came from exactly this data set, it holds interesting information. It's published by, I think, the Czech Republic Police, or just the Brno police department, one of those. This gets a thumbs up: it's very good that public authorities publish their data, and it's interesting data. This is, by the way, a data set from 2016, I think, and the records are dated by when the crimes were committed. So in 2016 they apparently found out that a crime had been committed before 1990. Good job. Now, if we zoom in, what I find interesting here, for example, are those spikes at the beginning of each year. It's like criminals make some New Year's resolutions or something. OK, let's color them by seasons, for example. I like colors, you could tell, so why not color these by season? What we can see here, for example, is that there are higher spikes in winter, and more spikes in spring. Both are near the beginning of the year, so maybe there's a correlation there. Now, what I think is interesting is the crime by category. I had to hand-pick these categories; the data set doesn't carry them explicitly, so I had to do some text processing to pick them out. So I picked these, and most of the crimes are robberies or burglaries; you can see that this is a pretty high amount. Then there are the blue ones: drugs, vandalism, for example, and other criminal activities. Now, what's good is that in Brno we don't have much violent crime. You can see there wasn't even a single murder, I think, which is pretty cool. Now, if we slide through the damage caused by crime incidents and do some outlier analysis, we can see that there are some incredible outliers there: damage of 12.5 million crowns. That's quite something.
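That hand-picking of categories boils down to keyword matching over free-text descriptions. Here is a toy sketch in Python; the keywords and the sample records are invented English stand-ins, since the real data set is in Czech.

```python
# Map each category to keywords that signal it in a description.
# Both the categories and the keywords are illustrative inventions.
KEYWORDS = {
    "burglary": ["burglary", "break-in"],
    "robbery": ["robbery", "theft", "stolen"],
    "drugs": ["drug", "narcotic"],
    "vandalism": ["vandalism", "graffiti"],
}

def categorize(description: str) -> str:
    """Return the first category whose keyword appears in the text."""
    text = description.lower()
    for category, words in KEYWORDS.items():
        if any(w in text for w in words):
            return category
    return "other"

records = [
    "Bicycle stolen from cellar",
    "Break-in at a shop on the square",
    "Graffiti on a tram stop",
]
print([categorize(r) for r in records])
# ['robbery', 'burglary', 'vandalism']
```

First-match-wins means the dictionary order doubles as a priority order, which is usually what you want when a description could plausibly match two categories.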
So I wanted to visualize these, so I took a quantile and got rid of the outliers, and you can see that most of the incidents involve damage up to 25,000 crowns, which is about $1,000. And for those who are from Brno, it's not very unexpected that the highest criminality and the highest density of crimes are in Komárov and Židenice; Brno-Center is also doing pretty "well" here. But what I found interesting was that when I plot the damage caused by crime incidents against the number of crime incidents, which is the comparison right here, the thieves turn out to be very crafty: in Brno-Center there is high damage but a lower number of crimes, so the criminals there do more damage with fewer incidents. More crafty, I guess. OK, so what are the stolen goods? I'm running out of time, but we're near the end of this presentation already, so I hope it's going to be fine. Among the stolen-goods crimes, robbery or burglary, fraud, and vandalism are the most common in this case. What about the municipality units? That's what I was talking about: the left one is Komárov, the middle is Brno-Center, and then Židenice. All of them have a high number of crimes, most of them robberies and frauds, plus some financial and economic criminal acts. And we can plot these as well; take your time to explore it. Let me go to the end of this presentation, which is my favorite visualization: the word cloud. Although there are people who think this is pretty old school, I like it very much. There were about 240 kinds of objects that thieves were most interested in, and if I had used bar charts, it would be difficult to explore them all, unless you use some sort of interactive visualization; but word clouds do a pretty good job here. And you can see that, obviously, the thieves are most interested in money; that's the "peníze" word, Czech for money. Then bags, iPhones, credit cards, personal items, mobile phones, computers, and that kind of stuff.
And if we color it by the damage done, money dominates; that comes from the financial frauds, where there were some huge outliers in the millions. Then bags and iPhones. And what we can do, and I find it particularly interesting, is play with the shape of these word clouds. What I did is I applied some rotation to the words and shaped the cloud into the Brno city emblem. So you cannot say this is old school, right? This is pretty neat. OK. So that's it. I'm sorry that I couldn't finish all of the data sets. We can take a few minutes for questions and answers. Yes, please. [Comment from the audience.] Oh, yeah, that's true. So, regarding the criminality data: I said that in 2016 they apparently discovered a crime committed before 1990, and the comment was that it's because they introduced a different storage system at that point. So that's how it's depicted in this particular data set: not because they discovered it then, but because they archived it that way. Yeah, thank you; might well be the case, of course. [Another comment: an audience member has found a data set on the portal that claims to contain data, but the downloads are empty.] Data Brno? What am I looking for? So there is a data set claiming data, but if you go to the downloads, there's no data at all. Oh. But at least the license is open. Great input, thank you. Yeah, I don't know what it states in that long description; I guess it doesn't state that the data is missing. So yeah, that happens, and quite a lot; I don't know why. Just to repeat for everyone: there is no data there. This sometimes happens when you go from an application dashboard to the data set it points to, if it points to anything, and you don't find the data there. So yeah, thank you for the input; you can see that it's pretty easy to find a data set with no data linked. You can file it; there's that feedback button I was talking about, so just report this data set, and the first one too. And that's another thing: you cannot request the format in which the data should be. Oh, OK.
Oh yes, this is particularly interesting. A lot of data sets on the Brno data portal are linked to the ArcGIS portal, and this, I don't know why, links to a description of the data, not to the data itself. Sometimes those links up there work and give you something like this, for example; oh, this is the same thing in JSON, OK. But what I usually did was go to something called opendata.arcgis.com, or similar, and find the data set there. So they just point you to the description of the data set, not to the data itself, and you have to hope you'll be lucky and find the data on the ArcGIS open data portal. And sometimes you don't, and that's because there is some link between the ArcGIS portal and the Brno portal; they collaborate together somehow. That is what happened to me as well: I requested the transport data from Brno, because they have an application for it, visualized; they said it's from ArcGIS or some other company, and they refused. So that can happen too: they just publish the visualization of the data, but they do not give you the data itself. That sometimes happens. Yeah, there are plenty of bugs like this, and that's what motivated me to give this presentation on open data. You know, it's a step in the right direction, but it's a very small step, and sometimes you fall along the path. Can I take one last question? OK. [Question: is there a normalized language for the data to be open?] I don't think the five-star definition says anything about language, and I can't remember the license doing so either; no, a license does not imply a language. So the question was whether there is a normalized language for the data to be open. I don't think there is, but I'd say use common sense, right? If you want to target Czech people, OK.
But sometimes it happens that even Czech people want to use that data and distribute it to foreign countries and non-Czech-speaking people. So it's always a good idea, even if you want to target Czech people, to publish both: publish Czech and publish English. English is, of course, generally the language of choice. Or at least give us some way, a script or something, to process the Czech data and translate it into English; that can be a simple dictionary-like structure, for example. I've seen that before, and it's better than nothing. OK. Thank you very much.