Thanks, Mike. Thank you very much. Well, I appreciate everybody coming out this morning. We're going to spend a while talking about migrating to PostgreSQL. There are a lot of interesting things to consider, both technical and non-technical, that come into play as you undertake a migration. I have some case studies from the wild to share with you about migrations that I have been part of, and then we'll talk about some of the lessons that we've learned.

So, a quick introduction. My name is Jennifer Sherrell. I've been working as an independent consultant building custom data systems for 20 years. I started out as an Oracle application developer and pretty quickly found PostgreSQL. I've spent 15 years working pretty intensely on Microsoft SQL Server systems, and I've also worked with SQLite and MySQL. So, you know, on the team I joke that I'm a database chameleon: I can work with any of them and move between them pretty happily, though I much prefer to be working in PostgreSQL whenever possible.

So, thinking about migrating to PostgreSQL, the first piece of that is considering the various options that are available for a relational database management system. Here on the slide, the first four we have are PostgreSQL, Microsoft SQL Server, Oracle and MySQL. Those are the top four relational database management systems, and they've been the top four for a long time. I included SQLite because it's increasingly being used as the database behind web apps in particular, and one of the case studies I'm going to share with you is a migration from SQLite over to PostgreSQL. So it doesn't make the top rankings of systems in use, but it's becoming more and more common in our world today.

When choosing which software you'd like to use for your database, there are a number of factors to consider. There's, of course, the cost of the software itself that you need to pay, and I believe for all of the commercial software these are now typically per-core, per-year prices that you're locking yourself into; well, not necessarily locking, but committing to. Oh, yes, sorry, I meant to say: for anybody not familiar with these, PostgreSQL, MySQL and SQLite are the open source options here, and Oracle and Microsoft SQL Server are commercial paid software. So there's the cost, and there's performance. There's really quite a wide variety of performance that you're going to get out of these different platforms depending on what you are trying to solve and what your project is doing. There's also the ability to integrate with external systems. This is something that has changed over 20 years: we used to build systems much more in a silo. They had to do their job, they had to do it well, they had to give their piece out, and how people would interact with other systems, you kind of left that to the people. That's really not true anymore. Now we expect our systems to interact and to be able to give more comprehensive solutions out to the users, and so integration has really become a much bigger piece of choosing your software, and potentially of choosing to migrate over to PostgreSQL. And then there's the variety of tools. If you were just over in the PostgreSQL 14 talk, you were hearing about some of the new tools that came out in that release.
PostgreSQL has a lot of really great tools, and we're not going to be able to dive into all of them, but I am going to talk about some of the ones that are most relevant for migration. So again, why would a team choose to migrate to PostgreSQL? There's a lot of comfort in doing what you did yesterday, presuming nobody yelled at you and you didn't have to stay late and nothing broke. But there are various things that can give you hints that the system is not keeping up over time, that a migration may be in your future, and that it may be the right thing for your team or your project. Security obviously is a really big one. Data security is huge, and having the right tool to keep your data secure is important. Reliability: are you sure when you wake up and step into your office on Monday morning that everything is going to be running how you left it on Friday? Automation: how well you can set the system up to just run, where your team keeps an eye on things but no interventions are required. Performance, right? So many databases today are behind a web application, and users sit down, they log into that web page, they want their data given to them fast, regardless of the size of the actual data behind it. Then integration tools: as I said, on the systems that I've been working on, the requirement to integrate across and pull data or push data between other servers and systems is growing every year, and it's something we now do all the time; it's just part of our daily world. And then cost. Some groups have a lot of money to put towards their database systems. Not everyone does, and not everyone is looking to spend their money on the software. Perhaps you would better spend your money on people or other things that would allow your team to be more effective.

So in thinking about how these five database management software tools compare, I put together this little slide. This is mine; this is not any official ranking of them. If we look first at the x-axis, that's the expense. We've got our three open source options stacked on that bar because they are free. The next more expensive one is Microsoft SQL Server, and then of course Oracle is kind of legendary for its expense. If you haven't looked at the prices, they're pretty amazing. And then looking across on the y-axis: how advanced is the software? How reliable and how performant is it? Postgres is very much on par with Oracle as far as all of those features. MySQL is below them; again, this is my slide, but it's less reliable, there are things that can hit you. And then SQLite is really at the bottom; I don't think anybody would argue with that. SQLite doesn't allow concurrent use and it can't scale very well, so it's definitely the lowest of those options.

And the other thing to think about is that if you are right now considering migrating a system or a database to Postgres, you are not alone. Postgres has been increasingly popular. I pulled this screen grab off of the DB-Engines ranking; they rank all of the database engines, not just the relational ones. Postgres has won their database management system of the year award for three of the last five years, and it's the first software to ever win three times. So there really is a strong upswell of people talking about and using Postgres more and more.
And again, Postgres, well, I don't think I said this yet, but Postgres has been around for a very long time. It is very well-established software with a very well-established community. So about five years ago, I was undertaking a big migration project and I came upon this white paper; these are a couple of screen grabs from the website of Dimitri Fontaine. He's one of the core contributors to the Postgres project, and he wrote this white paper to share his experience and advice for undertaking a migration. And you can go, is my cursor coming up? Okay, so here, this is the website. You can go to pgloader.io if you want to download the white paper. He just asks that you give your email address, and then you get sent the link to download the paper in your email. It's such a useful, really well-thought-out, well-written, easy white paper to read. The idea behind it builds on the DevOps tools that we've all been adopting over the last 10 years or so: building on the ideas of continuous integration and continuous delivery, this is continuous migration. It has made my life so much better, I will say that, and we're going to get into some of the case studies where I'll explain how that has worked. I cannot recommend enough that you take on a migration project with this continuous migration approach. The framework of it is really just fantastic.

So you can see, this is from the white paper, the core ideas: you set up your new environment, you get your new hardware up and going, and you build the tools that migrate your database to the new environment. We're going to talk about some tools that are available in Postgres that make that a lot easier than it might be otherwise. Once you have your pipeline together that takes your current production data and writes it across into your new Postgres database, you then rerun that every day until you do your final cutover. So in the same way that with continuous integration and continuous delivery you're exercising your product and publishing to production on a daily basis, this is exercising your migration on a daily basis. And that might sound like overkill sometimes, right? Do we really need to do it every day? I would really say yes. In my experience, doing it every day has been brilliant, because if you run this once and then you sit back and you're like, okay, we got it, we built the pipeline, it worked, and then you do some of the other team pieces and you schedule the migration and you do the rest of it, and you come back and you run it, and it doesn't work. And I promise this will happen. Sometimes it can be that some data came into your production system while you were doing those other things and broke the migration pipeline you built, and you're not seeing it because you haven't been rerunning it each day to know that it still works, right? When you have a living system that you're migrating across, it's really critical to do this from this continuous migration perspective.

So let me quickly introduce the case studies that I'm going to give you some examples from. There are three different, well, technically four different websites, but two of them are kind of a pair. The first website we're going to talk about is called CEDARS.
So all three, or four, depending on how you count them, of these projects are projects that I've done as a consultant to the California Public Utilities Commission. I work for their energy division processing their energy efficiency data. So how many people in the room have ever paid a power bill in California? Okay, you fund all of the work that is done in the CPUC Energy Division. They authorize a budget of a billion dollars every year that goes towards programs to save energy in the state of California. This has been happening since the 1970s. California is considered the world leader in energy efficiency. And it's not because California thought it would be fun to do; it's because California ran out of energy first. So they have been the front runner of energy efficiency since anyone really started doing it at any sort of scale.

So CEDARS, the first system we're going to talk about, is a big system. It's big data. It's where all of that data for that billion dollars a year of investment in energy efficiency gets reported. It's where the groups that are running those programs apply for their budget and say what they want to do in the future year. And it's where they report every single action: every single light bulb or HVAC system that's been upgraded in this state gets reported back into this system. It's been online since 2015, and we receive about a million records a year just of the activities that happen in this state. So this is a really big system. It started out in Microsoft SQL Server, and we migrated it over to Postgres.

The advice letter site is another site for the CPUC, but it's very different. It's very simple comparatively, and a much smaller project. Advice letters are documents that the utilities like PG&E and SoCal Edison have to submit when they're setting the rates they're going to charge customers for energy: they have to send a PDF document, kind of a legal letter, to the commission that outlines what their rate change will be, why they're making it, and how it was authorized that they are allowed to do this. These documents are submitted to this advice letter website. So it's a Postgres, well, now it's a Postgres database. This project is the one that started out with SQLite. You have a database backing a Django front end, and it's really an electronic document store. We even use an open-source electronic document management package for it.

And the last one we're going to talk about is actually a project that I'm working on a migration for right now. They call it the Energy Data Web, and this is two additional sites that are places where they manage documents and projects and provide public-facing resources to be transparent about the work that's being done at the commission. So again, we're going to talk about the data problems, what was going on, what motivated the migration, sort of the non-technical side of undertaking a migration. Then we'll dig more into the actual tools and the technical side of the migration, and then talk a little bit about the lessons.

So really briefly, this is just a screen grab of CEDARS. It's a website that presents tons of tables; the summary table that's on the screen there is summarizing several million records out of about eight different tables. And so this is a case where we really needed strong performance. We need to be able to give these snappy web summary tables off of real, very large data sets. And it's a really big project.
So lots of money, giant servers. It's been continuously developed for eight years now. This is just a quick screen grab of the advice letter site. Again, it's very simple. It's a PDF document store with a database backend. The database, if you're wondering, like, wait, you're putting the documents in the database? No, of course we're not doing that. The database manages all of the metadata that describes the documents and that powers all of the searchability and findability of everything. The tool that we use for this site also has optical character recognition, OCR, so it reads the PDF and allows users to search across the content of the PDF as well. Energy Data Web: these are really quite simple sites. They're older, they run different web platforms. They are right now both using Microsoft SQL Server, and we're looking to migrate them to Postgres. So this is one where we haven't done the migration yet. We're still in the initial steps of agreeing that the migration needs to be done and planning for exactly how and when it will happen.

So as I said, I've been part of a number of migrations, big and small, and regardless of the complexity and the size of the tool that needs to be migrated, there's a set script for what you need to do to undertake a migration. The first piece is to get everyone to agree, the decision makers in particular, that the migration needs to happen. That can honestly be the biggest barrier, and what I have seen in real life is that it's really these first two steps together: agreeing to undertake the migration, and then setting aside the time and budget and the people necessary to undertake it. People often want to kind of cheat that process a little bit, which I'll talk about in a moment. You sometimes can cheat that process a bit, but you're usually going to pay for it. Once you've done those two non-technical pieces, you have the consensus and you're going forward. Your next step is to plan the new server and the architecture: choose what sort of cores you need, your memory, your SSDs, all those pieces; plan out your new hardware. Then you build out your new servers. At this point you set up and run your continuous migration pipeline. You get the data moving from your current system into your new system. And again, once that's done the first time, you just set it up so it runs again every night. I mean, it doesn't have to be at night, but typically that's the optimal time to do it, since few users will be on the system.

Once you have the continuous migration running, you want to make sure that you set up your backups and automate the process for recovering your new system. It's really strategic to do that before the cutover, because you can take the system down, rebuild it from your backup, and exercise that several times to make sure you have it perfect before it becomes your production system. Once it is the production system, being able to exercise your restore from backup becomes a much trickier process. At that point you go back to the non-technical: you just have to work with the decision makers and the team to schedule when your cutover is going to be. When you get to your cutover, you run that migration one last time. Because you've run it every day for the last while, it just works. There are no surprises in the 11th hour. The advice letter site that I talked about, we just did our final cutover like six weeks ago, something like that. Flawless. It went exactly according to script.
That is 100% because we followed the practice of continuous migration. At that point, you're just cleaning up. You want to retire your old servers in whatever way is appropriate for the team and the group that you're working with. Then you're done. You're in this new space that's improved.

This is kind of a storytelling slide for us. When you decide to migrate, what I have seen is that there are two camps. There are people who have experienced an incident, one that typically becomes known as "the incident" to that team, that proves that they need to do something differently. I had an experience as a consultant where I was on a team with a Microsoft SQL Server database. We had probably 20 contractors working on that database server doing production work. The group had just missed their deadline for having the work they were working on done. Everyone was working around the clock trying to get things completed as quickly as possible. The server crashed. We all showed up the next day and there's no machine to connect to. It took five days to bring the server back. I was one of those 20 people waiting for the server to come back so I could finish my work that was a piece of the larger deliverable. It was agonizing. It was really rough. That shook people, and it made everyone start thinking about what we were going to do to get into a different situation where we're not going to have something like this happen again.

The Energy Data Web project that I just mentioned: we can see our incident in front of us, but it hasn't happened yet. We have an authorization system for those websites that was built by a different consulting group that built those sites 10 years ago. They put their proprietary authorization system in front of both of the sites, so they both authenticate through this same system. That group closed down during COVID. They were a consultant to my client, and my client asked me if I would please inherit their sites because they were closing. I said okay. In the transfer they said: this is our proprietary system, it's going to break at some point, we are no longer maintaining it, we are closing our doors, and that's going to fail on you. So we know this is ahead of us. We don't know how long we have, but we have to move the system off before the authentication system fails. We've also had some actual hardware problems with those sites, where twice now one of the servers has entirely filled its hard drive. It's generally a good idea not to set your servers up where they can fill their own hard drive, like the OS hard drive. The server falls over. In one case it knocked out three websites; in one case it only knocked out one website. That's the sort of thing where, if you knew it was coming, it would be very easy to avoid and to mitigate. It would be five minutes of work to make the drive a little bit bigger. Once the drive is completely full and you can't SSH to it because there's not enough room on it to write your key, then it's days of work to rebuild those machines, get the data off, and get the new one going. So that's happened to us. I've been managing those sites for about 18 months now, and that's happened to us twice. So this is really just, you know, the scary stuff, right? You want to undertake a migration and finish it before something bad happens. That's really the point to make.

The last moment to talk about these non-technical steps that get you to actually executing a migration is the piece where you dedicate resources.
And I alluded to this a little bit before: this can be really tricky to get people to accept. Often a motivating factor for undertaking a migration may not be that you want better performance or better reliability; it may be that you want to lower your cost. And you've done the work, you've run the numbers, you can say: if we undertake this migration, our maintenance costs on the new system will be half what we're paying each year to maintain our current system. We're going to see a really nice cost drop every year going forward if we just do this thing. This is a really compelling story to the decision makers, right? What they don't like is this little brick on top, where you need to actually dedicate resources to building out your new system and running your migration, on top of your expensive current maintenance. So when you get to that moment where you're like, okay, how are we going to plan this? We're freeing up time for our team to be doing other things and not just maintenance, and we're lowering our cost, but to get there we actually need a couple of extra people and a little more money to build out the system that we need. And what people often want to do, you can probably guess looking at this, is say: well, if we're going to be done with the old system, let's just drop that build of the new system into the maintenance budget for our current system, and we'll just stop maintaining the old one. It probably won't die before we get the new system up, right? I've seen that go sideways once. The system went down in 2018. They had stopped maintaining it. It was hacked. They couldn't get it back up. It is still sitting in a server room with tape around it, and that web resource that a lot of people used for day-to-day work vanished overnight. So just to really encourage people: it hurts a little bit, it's hard. You can't just create people on your team, typically. Worst case scenario, you have to hire a consultant, somebody like me, to come in and help for a few months to give you the bandwidth, because you don't necessarily want to hire an FTE if you think you only need the help for three months. But this is really one of the things that I see groups struggle with: even once they've decided they want to migrate, and they've decided they want to migrate to Postgres, and they kind of know what they want to do, figuring out how to allocate resources to the migration can be a really tricky piece.

So setting aside now these non-technical pieces, let's get into the fun stuff: actually how we can use Postgres, and why Postgres is such a great tool to migrate onto. So again, these are just our generic migration steps. Oh, sorry, I reorganized these this morning; I thought we were in a different spot. This is just about the server expense. Oracle is kind of famously expensive. I googled "how much does Oracle cost" and this little blurb came up that called it perhaps the most expensive piece of software on the planet. They give a case here where a four-processor, quad-core server would cost $2.4 million. This is of course the premium, premium, premium Oracle; you could do an Oracle database for less than that. A lot of the systems I'm running are 32 cores, and all of the licensing now for Oracle and for Microsoft SQL Server is done per core.
So if you've got a big machine because you've got big data, you're going to pay big dollars. So Oracle, it's funny, you can kind of laugh at it and wonder who has the budget to decide to run Oracle. There are lots of people. And again, it's great software, very comparable to Postgres. But Microsoft SQL Server is a very common solution as well. I'm from Seattle, so it's very common in my part of the world because Microsoft is in our backyard. But right now, I went onto the AWS calculator and pulled up the instance that I just spun up a couple of new databases on for a client, 32-core systems. I could run Postgres on Linux on that exact instance, and I could run almost four of them, not quite four, it's like 3.7, for the cost of running one Windows Microsoft SQL Server machine on the exact same hardware. I actually have a client where that cost saving has allowed us to improve our security, because we were able to run multiple machines, put things in the right places, distribute things, and keep confidential data away from where more people have access, all because of the cost saving, which was not something I had originally anticipated. So the money you save can actually translate into other features or benefits as you undertake a migration.

So, reliability. The advice letter site that I told you about: this is a site that went live in 2014, and the total budget for creating the site and paying for the cloud server for the first year was $3,500. And it works fine. It had very few users; people are uploading their PDFs. They didn't really maintain the site. They just put it up, started it running, and kept it running. Eight years later they called me back and said we need to upgrade it. They had done a pen test on the system, a cross-site scripting vulnerability had come up, and they wanted to address that. In addressing that security concern we were able to actually undertake the migration and do this whole piece in one big chunk of work, to level them up to a new platform that they would be able to take forward. We suggested five years; that fits better with the open source model and the lifespan of software releases. And they fully intend to just, we call it coasting: their intent is to coast on this platform for the next five years. And again, I shared the story with you earlier about the Windows servers that crashed and took five days to be restored at a very critical development time. Those systems have both moved on to Postgres. The Microsoft SQL Server one has been on Postgres now since 2016? Yeah, 2015. And we have had zero days of downtime. I don't want to jinx myself, but that same tool that was running on Microsoft SQL Server and went down for five days has been running at a much bigger scale with more users since 2015, and it has had zero downtime.

And then the integration tools. This is my favorite, favorite Postgres integration tool, and it's foreign data wrappers. When you are migrating to Postgres, you set up your Postgres server, and then from that new Postgres database you can go to your existing, the current or soon-to-be-old database server, grab data, and pull it into your new server using a foreign data wrapper. I put a little snippet here. This is the last foreign data wrapper I made; I just grabbed a random one.
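The snippet from the slide isn't reproduced in the transcript, so here is a minimal sketch of what a postgres_fdw setup like the one being described tends to look like. The server name, host, schema, table name, and credentials below are placeholders for illustration, not the actual California Technical Forum connection details.

```sql
-- Minimal postgres_fdw sketch (placeholder host, database, and table names;
-- not the real connection details from the talk).
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER caltf_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'example.caltf.org', port '5432', dbname 'public_data');

-- Map your local role to whatever read-only credentials the remote side provides.
CREATE USER MAPPING FOR CURRENT_USER
    SERVER caltf_server
    OPTIONS (user 'readonly_user', password 'readonly_password');

-- Pull in just the table(s) you need as foreign tables in a local schema.
CREATE SCHEMA IF NOT EXISTS caltf;
IMPORT FOREIGN SCHEMA public
    LIMIT TO (measures)
    FROM SERVER caltf_server
    INTO caltf;

-- Now caltf.measures can be queried as if it were a local table.
SELECT count(*) FROM caltf.measures;
```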
This is public, read-only data access, so you could build this foreign data wrapper in your own database and it would connect to the project run by the California Technical Forum and pull their data into your local database. And so in migration, I use foreign data wrappers both for the migration itself and to do integration from my Postgres servers with other external servers. We talked about how that's generally expected now: people don't want to have to go over to this website and grab these things and then come to that website and grab those things. They just want you to grab it for them and let them interact directly in one place. Foreign data wrappers enable that in just a really lovely way. This particular example, when I ran it, took about eight seconds. And in those eight seconds, it went to the foreign server, queried all of the particular data elements that I wanted, pulled them over to my local database, and materialized them into a view. So I have the data element that I wanted in my local server that I can work against with really great performance.

Another thing that I think is amazing about foreign data wrappers is that it's not just Postgres to Postgres: you can do a foreign data wrapper to Microsoft SQL Server, to Oracle, to MySQL. There are extensions that allow you to wrap all different kinds of foreign servers, which is just hugely great. Here we were doing a table at a time. Yeah, tables that had about a million records, and the reference system that we're pulling out of doesn't really embrace the relational structure: they've put all of their stuff into JSON and then they make it available as one giant flat table that has about 150 columns. So I was pulling data out of a table that has about a million records and about 150 columns. Do I have any Django developers in here? Anyway, okay. I work with a lot of Django developers; we do a lot of Django front ends for our databases. I ran this as a proof of concept, to get the data into my system once for some testing I was doing, and then I asked my Django guy, I said, okay, we want to go grab this field out of that database. And he was like, oh, can we do it the way we usually do? I said, yes, fine. They went and did that. It took them 10 hours to pull that data through the ORM. Eight seconds versus 10 hours. So these foreign data wrappers are snappy and powerful. You can grab huge pieces of data really fast.

Performance. With Postgres, you get such strong performance. And if you're doing upgrades, your performance is going to improve every year, because every year the new release of Postgres has better performance. So there's a ton of performance that you gain simply by doing the migration. One of the things that I've used a lot in the bigger system, CEDARS, that I've told you about, where we're at about a terabyte of data, is this: we have multiple tables that have lots of columns and many millions of records, and we have to join them all together and synthesize summary statistics, like the table that I flashed up earlier. The way we're doing that, getting these aggregated data and serving them to our users in the sort of real-time, I-click-the-button-show-me-the-numbers way, is through materialized views. We set up materialized views that take the underlying data and aggregate it up to exactly the summary that we need.
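The actual CEDARS tables aren't shown in the transcript, so this is a small sketch of the materialized-view pattern being described, with made-up table and column names standing in for the real ones.

```sql
-- Sketch of the pre-aggregated summary pattern described above
-- (the claims table and its columns are hypothetical, for illustration only).
CREATE MATERIALIZED VIEW summary_by_program AS
SELECT program_id,
       date_trunc('quarter', install_date) AS quarter,
       count(*)                            AS measure_count,
       sum(energy_savings_kwh)             AS total_kwh_savings
FROM   claims
GROUP  BY program_id, date_trunc('quarter', install_date);

-- A unique index lets the web app hit the summary quickly and is required
-- for a concurrent refresh (one that doesn't block readers).
CREATE UNIQUE INDEX ON summary_by_program (program_id, quarter);

-- Rebuild the summary after new data lands; this is the statement a trigger
-- or scheduled job would run.
REFRESH MATERIALIZED VIEW CONCURRENTLY summary_by_program;
```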
And then we just refresh them. We set triggers so that they refresh when they need to. And it's been amazing; the performance on that has just been fantastic. So it's not just that the database itself performs really well; it's that you can leverage things like materialized views to bring performance even when you're trying to do something really fast on a really large data set.

Oh, and this is also such a cool tool. This is another one that may look familiar, because this is the same person who wrote the continuous migration white paper, Dimitri Fontaine. He created a tool for Postgres called pgloader. I've got the references there, and I am going to share these slides so you'll be able to get them from SCALE as well. I really recommend looking at this. It's a tool where you effectively write one line. Well, first you have to build out your new Postgres database, your new hardware, the new software that you're going to migrate onto, and you've got your existing system. Then you take pgloader and, in one line, you point it to the existing system, point it to the new system, and say go. It reads the existing database, builds out the schema for it in your new database, and then moves all of the data across for you. It's not necessarily going to work perfectly on every use case; I know some database systems have a ton of really weird stored procedures and things like that, and it's probably not going to be able to handle something like that. There's a case study on his website where someone hit him up and said, hey, thanks so much for this, I migrated a one-terabyte MySQL database to Postgres: I just set up pgloader, hit go, left for the weekend, and came back and it was done. So it's really just a super cool tool.

Yeah. Oh, to do the daily continuous migration versus the... Yeah. You don't need to do it again. I don't know. No, I stuck a couple of my business cards at the back, and if anybody wants to grab one, I'd be happy to try and find that answer. I don't know; I'm not sure about stored procedures. So I was trying to keep all of this... Let me check. Can I follow up with you after this? Yeah. I did use this for the SQLite migration. Because I am much more used to Postgres than SQLite, SQLite is really weird to me, so I was super happy I could point this at it and it just worked.
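The exact invocation from the slides isn't in the transcript. As a rough sketch, the one-line form being described looks something like the following, with placeholder file paths and connection strings; pgloader also accepts a small load-command file when you want finer control over what gets migrated.

```sh
# Roughly the one-line form described (placeholder paths and connection strings):
pgloader ./legacy_app.sqlite postgresql://app_user@new-server/app_db

# The same idea works for other sources, for example a MySQL database:
pgloader mysql://app_user@old-server/app_db postgresql://app_user@new-server/app_db
```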
So again, just going back to the reasons we choose these tools: we want security and performance. Postgres is really hard to beat when it comes to choosing which has the strongest tools, the best performance, the lowest cost. I think anybody would have a really hard case choosing another. Others get chosen all the time; I'm not naive about that. But when I see those decisions, they're not made for reasons like cost and performance and how well the platform will grow with you as you move forward. They're made for other reasons. People choose Oracle because the company they're in chose Oracle previously, and they're continuing on a common platform. That's a real thing. It can be really confusing and difficult for a group to be told: you all know Oracle and you're all Oracle developers, but I'm going to build you this Postgres platform, you're going to inherit it when I'm done, and you'll be fine. That's going to be a hard sell. Yeah. Yeah.

So I have always used the open source product. I haven't, I mean, EDB is great, but effectively I've always used open source Postgres. And honestly, I've been a member of the Slack community; Postgres has a Slack community, for those of you who Slack. I think when I first found it there were about 2,000 people in there, and now there's something like 12,000 people in the Slack community. There are a number of core developers who answer questions in there, so that's a great place to talk. There are a bunch of directed channels as well, if you're looking to talk; I think there's actually one on foreign data wrappers, and various things like that. And I typically maintain the systems that I've built, and that's one of the reasons I picked Postgres: because I spend very little time maintaining them. If you set the system up right at the beginning, and you don't do things like, oh, let's put something else on the database server alongside the database, you know, these kinds of silly things people will do, it just keeps working. If you take care of it and you set it up right, it just keeps working, and the administration is so close to zero. It's amazing. I spend very little time doing administration, and that's by design. That's not the part of it I like. I like being in there manipulating data and building out tools and things like that. The administration part is not what I show up for. So I love being able to have that really light touch on the database. And then the Postgres documentation: it is so well done and so comprehensive. If I don't remember something, I just Google it, Postgres and whatever the words are that I need. Well, I've trained my Google to go straight to the documentation pretty much and get the answer. It's really fantastic. I rarely get stuck.

So again, just briefly, the non-technical hurdles: getting everybody to accept and agree that it's important to do this and it's important to do it now; maintaining the current system while you build out the new one and not just letting it fall by the wayside; and then executing the switchover. The advice letter project that I told you about was actually the first project that my team did as a continuous migration, and continuous migration totally saved us on that project, because we got to where we were done. We'd run the continuous migration all the time. We'd done everything. We said we're ready to cut over. And so we said, okay, we're going to check with IT. Anybody want to guess how long it was? Three years later, they told us we could go. Had we not taken that project from a continuous migration framework, we would have been in a lot of trouble. We would have been completely back at square one.

So, in conclusion, Postgres for the win. Every time I have moved people onto Postgres, they've been incredibly happy. They've had reliable systems. They've had zero downtime. And it's opened up doors to do smarter integration and more of the things that they had wanted to do but that had been out of reach for them before. With that, I will say thank you, and I'm happy to answer any questions if you have a minute. That's a very good question. And I do work with, oh, quite a few, there are more front-end people on my team than back-end. In some cases it has been that we switched out the database and we kept the same front end, and it was seamless. The advice letter site is the example of that. Oh, yeah. Yes, we really did. It was pretty great. Okay. Yeah.
We haven't been doing very much with workload reporting. Many of the systems that I work on have really episodic use. So, like the CEDARS system, which is really big: people will submit a million records to the system in one batch, and then they won't come back for three months, right? So, daily monitoring... One of the funniest things about working as a consultant to a regulatory agency is that most of the things that happen in our systems are decided by administrative law judges and issued by ruling. An administrative law judge writes something like "CEDARS will blah, blah, blah," and I find out about it when they publish the ruling. And then they turn to me and they're like, so you're going to make CEDARS do that, and you have to do it by that day. And I'm like, can we ever get talked to before they write that stuff down? When data are submitted to us is also regulatory-driven. The people who submit data to us are required to, you know, here's when they get their money to run these programs; PG&E gets $400,000 a year to run these programs, and when they accept that money they are then obliged to keep reporting data into these systems on these exact intervals. There are set days; they have all this stuff. So our daily workload is just so inconsistent that monitoring it at that level hasn't been particularly useful for us. Mark, do you do more monitoring? Do you have a tool that you like? I'm trying to remember. So, I'm part of the Seattle Postgres User Group, and the person who runs that is at the Fred Hutch Cancer Research Center, and he monitors his servers constantly. He's got six monitors that each show statistics, the whole thing wired up really lovely. Right now I can't remember what he uses, but I'd be happy to look it up and tell you, because I've played with that one a little bit, and I know he exercises it every day and is really happy with it. And it was an open source tool. Lloyd Albin is his name, if you wanted to just Google him. He puts all of his talks online, so you'd be able to see what tool he used. He's done a number of talks on it. All right, well thank you very much. I appreciate you guys coming out.
So hello, everyone. My name is Stephen Yaz... Justy. I've been playing with digital data since the late 1970s, actually, and I've been using SQL since 1997. I switched from SCO Unix to Slackware Linux for my Internet web servers back in 1996, so the idea is, I've been doing this for a little while. I currently work as a consultant and programmer for a database server provider. I program daily in C and Java currently; my favorite computer languages are C# and D, so, like most of us, I'm not working with my favorite languages.

So what this is about: this open source business schema is like a quick start for any application or enterprise project. It already has 79 tables, 40 views, 102 Postgres functions, and 26 normalized tables for web sessions alone. Just doing a web session record is 26 tables. But not to worry: I have pre-written functions for doing the inserts, and lots of views. For example, the 26 tables for a web log. Normally people have one table for a web log, but here, all those same fields you would normally put in a single table, you just pass to the function, and the function takes care of all the inserting into the normalized tables. So it looks daunting at first, but I try to make it as simple as possible.

Why create an open source schema? Turns out, currently this is the only one. There's a guy before me who I've actually talked to, and he gave up after years, so I think I'm still the only one. I got tired of reinventing the wheel for every data modeling contract; every time I started a new contract, I started over. Without exception, they were unwilling to pay much for the schema portion, but they were more than happy to pay a virtually unlimited amount for the code to deal with the resulting data model. So I got tired of this. My solution was to spend my own time working on a carefully constructed, MIT-licensed data model for the most common business operations. Then, using that as a baseline for the project, the customer would only need to pay for their business-specific tables and views. That way they didn't have to keep paying for, you know, recording people and email addresses and postal addresses; all that stuff I'd already done. So it's like: okay, what is your business view that is different from anyone else's? Sometimes that's surprising, maybe another hundred tables.

I don't expect you to be able to read this, but those are the 26 tables for just doing the web log. It gives you an idea of how normalized the schema is. But extracting all the possible information on a single page hit, up front, allows for instant reporting on information that is normally extracted in a post-processing of web logs. This scheme works well for structured and fully understood information operations. What I mean by that is, there's no mystery within a web log, right? And so data mining, things like that, don't really buy you much. We already know the structure completely.
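The actual 26-table web session schema isn't reproduced in the transcript. As a deliberately tiny sketch of the pattern being described, here the application passes the flat fields to one function and the function handles the normalized inserts; the table and function names are hypothetical, and only two dimension tables stand in for the real 26.

```sql
-- Tiny sketch of the "pass flat fields to a function" idea (hypothetical names;
-- the real schema normalizes a web hit across 26 tables, not these two).
CREATE TABLE user_agent (id bigserial PRIMARY KEY, agent text UNIQUE NOT NULL);
CREATE TABLE url_path  (id bigserial PRIMARY KEY, path  text UNIQUE NOT NULL);
CREATE TABLE web_hit (
    id         bigserial PRIMARY KEY,
    hit_at     timestamptz NOT NULL DEFAULT now(),
    user_agent bigint NOT NULL REFERENCES user_agent,   -- column named after the table it references
    url_path   bigint NOT NULL REFERENCES url_path
);

CREATE FUNCTION web_hit_insert(p_agent text, p_path text)
RETURNS bigint LANGUAGE sql AS $$
    WITH ua AS (
        -- get-or-create the user agent and return its id either way
        INSERT INTO user_agent (agent) VALUES (p_agent)
        ON CONFLICT (agent) DO UPDATE SET agent = EXCLUDED.agent
        RETURNING id
    ), up AS (
        INSERT INTO url_path (path) VALUES (p_path)
        ON CONFLICT (path) DO UPDATE SET path = EXCLUDED.path
        RETURNING id
    )
    INSERT INTO web_hit (user_agent, url_path)
    SELECT ua.id, up.id FROM ua, up
    RETURNING id;
$$;

-- The application just calls the function with the flat values:
SELECT web_hit_insert('Mozilla/5.0 ...', '/index.html');
```

This also previews an idea that comes up later in the talk: inserting data is the same operation as asking for its unique ID.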
And so for operations like that, I went ahead and constructed these normalized tables, because there's no mystery, right? So, any questions about that? Like, you can't see it.

So how did this all start, right? I gave a little bit of a clue. Back in 2002, I was recruited to work on my 20-year class reunion. I was asked to organize the yearbook photos and classmate contact information, so I decided to use my then database of choice, Postgres 7. And I heard in an earlier talk you guys are on 14 now. Yeah, I know. I think I've got 14 installed on this machine. But anyway, creating a class reunion database turned out to be the excuse I needed to do an extremely normalized schema. I was able to decide for myself how much was too much, right? I must admit I did get carried away a little bit; maybe it was a little too much. For example, once people turn 18 or get married, they like to change their name. Also, after 20 years, almost no one still lives at the same address they did at the time of graduation; maybe a couple of people did. So I needed a way to record the name and address changes. I needed to record their reunion attendance under their current name and then put their graduation name on their badges. So already I had to record at least two names. And with that, the very first version of this schema was born, back in 2002. But I'd been dealing with databases for almost 30 years before that.

So we'll start with People and Entities. The main idea behind People and Entities is a single table column that can link to either a person or a company. We have invoice.customer; it's nice if that customer can be either a person or a company. It doesn't have to be, oh, you have a contact: you can point to an individual, or you can point to a company, with the same column. And I also want to be able to track name changes over time. You always show the latest, but access to the historic names is really important. This is the first place the schema diverges from other CRUD-based data models. Also, the idea of storing historical information allows past invoices to show past names and addresses based on the date of the invoice. Could this show wrong, and later corrected, information? Absolutely. That's a feature. It's important to fix business document information in time. Once it's been lifted from the database onto a piece of paper or a communication, and that person calls you and says it's wrong, you can see what they saw, even if it has since been corrected. Once you've generated a document, you really don't want that information to change, either for your use or for someone else to change it later. That invoice needs to stay fixed. Anyway, that's an important concept in this data model.

Here's my people view. Here I've selected from the people view; I've got three people here. The people view presents the programmer with a simplified look at the people in the database. No matter how much history is in the system, it will show one row per person. For example, Candy Fish, who is the person in the middle, has changed her name several times over the years; this only shows the latest name. Also, the full name column is a combination of seven other columns. Just to display a person's name takes seven other fields; you can think about prefixes and suffixes and all kinds of crazy stuff. Anyway, there's that. Here's the person's history.
You can see the history of Candy Fish. This query result shows the name changes of Candy Fish; only the last row of the name changes is the name the current view shows. I don't know how well you can see that, but the first record shows Candice Hall's name change in 1953, then Candice started going by Candy and that changed in '51, then in '78 she changed her last name, and then changed it again, and then yet again. It makes it really easy to track name changes, which again is really important when you're doing things like reunions.

Also notice, I don't know if you can see it, but the schema has a convention where columns that reference other tables have the same name as the table they refer to. For example, if you read the query, it says join name on name.id equals individual.name, so if you see individual.name you know there's a name table out there with an id in it. That's something I like to do. For example, it joins given on given.id equals name.given, then family again on family.id equals name.family. So it makes it easy to generate these queries without really knowing the structure; for example, myself, I've been using it for 20 years, and that's very important to me. And even when I left join in here, I actually left join given as goes_by on goes_by.id equals name.goes_by, so it's the same kind of thing. Anyway, that's how that part works, even though, like I said, on the previous slide Candy Fish had just one entry; it's listed just once for a reason. One more thing to notice is that at the very beginning it says select individual.id, and that supplies the person.id column; that's really important in the next couple of slides.

Okay, now we'll go to the entity view. The entity view is much like the people view but simpler, because company names aren't broken into seven different pieces. The name, the goes_by, a lot of things are the same. But if you look at the history of Oracle: Oracle has changed their name several times over history. They've gone by Software Development Laboratories, Relational Software Incorporated, Oracle Systems Corporation, and now they're just called Oracle, and those are the only times they've changed their name. And then notice also that individual.id makes up the entity's id, and that is how invoice.customer can point to either a person or an entity: because I use the same individual for both. Under the covers there's an individual table that has transactions in it, and those transactions are people or entities, companies.

So, moving on from that, we're going to double-entry accounting. The idea here is the need to record amounts, adjust amounts from multiple sources, show end balances or totals, or report on related balances. Using double-entry accounting techniques, it is possible to store and track values and how they are composed without using a CRUD update on a single column value. Of course, when you do that, you lose the value of the history; that's what CRUD does. It's very handy having a small accounting system incorporated into a business data model. It also allows accounting transactions to be extracted or loaded with very little conversion, if any, since the schema works with transactional double-entry data. So if you need to import double-entry accounting data or get it back out, it already does double-entry accounting, it does journals and things like that. You can tell I really like double-entry accounting. So if you go to Wikipedia now:
Debits and credits, oh my. Here's a screenshot of the Wikipedia definition of debits and credits. I think this is interesting, and I admit kind of dry, but even so, I've tried to duplicate the simple explanation from the Wikipedia page in my schema design, using procedures. So you have this, and if you zoom in on the words, here they are. It's a simple explanation, but in short: you pay rent, you receive cash for a sale, you buy some equipment with cash, you borrow cash as a loan, and you pay a salary with cash. That's their example. So in the schema here I created a book procedure. If you want to book a sale, for example, the book procedure does simple journal entries using a single value, so the programmer doesn't have to ask: double-entry accounting, what do I do? And so you have rent $100, sale $50, equipment $500, loan $1,100, salary $5,000. The book procedure also has a little more complicated functionality: you can actually book to something like commission sales, and it will take the amount and split it among cash sales and commission sales. It does that under the covers, but as far as the programmer is concerned, they only care about the one value; they don't care what accounts it goes to. I made that as simple as possible. But anyway, these are the same five steps. You see here I zoomed in on the actual table that Wikipedia has, and you can see that after doing those five procedures I duplicated that table with a very simple query. Notice that the Wikipedia article is slightly different: their account column actually has a multi-value field, which I personally don't like, so in my report there are two columns; in this case it's like rent, expense and cash, asset, so the account type is actually a separate column. You can make decisions based on that column, and you can call it whatever you want. Anyway, that gives you an idea of how I was able to completely duplicate it with these five simple steps, and you can do double-entry accounting in your application.
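His actual account tables and book procedure aren't shown in the transcript. As a minimal sketch under assumed names, booking the five entries from the slide with one call each might look roughly like this, with each booking recording two legs that sum to zero.

```sql
-- Minimal double-entry sketch (assumed table and function names, not the schema's real ones).
CREATE TABLE account (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE,
    kind text NOT NULL                    -- asset, expense, income, liability, ...
);

CREATE TABLE journal_line (
    id        bigserial PRIMARY KEY,
    booked_at timestamptz NOT NULL DEFAULT now(),
    memo      text,
    account   integer NOT NULL REFERENCES account,
    amount    numeric(14,2) NOT NULL      -- positive = debit leg, negative = credit leg
);

INSERT INTO account (name, kind) VALUES
    ('Cash', 'asset'), ('Sales', 'income'), ('Rent expense', 'expense'),
    ('Equipment', 'asset'), ('Loan payable', 'liability'), ('Salary expense', 'expense');

-- One value in, two legs recorded; the caller never writes debits and credits.
CREATE FUNCTION book(p_memo text, p_debit text, p_credit text, p_amount numeric)
RETURNS void LANGUAGE sql AS $$
    INSERT INTO journal_line (memo, account, amount)
    SELECT p_memo, id,  p_amount FROM account WHERE name = p_debit;
    INSERT INTO journal_line (memo, account, amount)
    SELECT p_memo, id, -p_amount FROM account WHERE name = p_credit;
$$;

-- The five example entries from the slide, one call each:
SELECT book('Pay rent',        'Rent expense',   'Cash',          100);
SELECT book('Cash sale',       'Cash',           'Sales',          50);
SELECT book('Buy equipment',   'Equipment',      'Cash',          500);
SELECT book('Take out a loan', 'Cash',           'Loan payable', 1100);
SELECT book('Pay salary',      'Salary expense', 'Cash',         5000);

-- Account balances, similar to the duplicated Wikipedia table,
-- with the account type as its own column:
SELECT a.name, a.kind, sum(l.amount) AS balance
FROM   journal_line l JOIN account a ON a.id = l.account
GROUP  BY a.name, a.kind;
```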
All right, so another one: translations. This is somewhere else where I kind of deviate from other CRUD-based data models. My idea is that you store the translations in the central database; that way all applications have the same translations and they benefit from a single translation effort. Translations are stored as ideas, so they're either words, sentences or paragraphs, and each idea is given an ID. This comes in a little handy; for example, if you have SQLite and you have an app, you can actually store all the translations with the app in the schema, and the person just selects what culture they want and it displays everything in that culture. So the idea is to store words, sentences and paragraphs in the database, partitioned by culture, and every database connection will get a view result in the selected culture. So here we have the previous accounting information shown as a general ledger; this time it's shorter, I'm just showing the actual balances of the accounts, not the separate transactions, but it's the same information. One session said that they wanted the culture to be Spanish, the other one said French, and they got two different results: one got it in Spanish, the other one got it in French. The culture IDs are the ones that Microsoft uses; these are the exact numbers, like 1033 for example, and you can look those up. But they also get es-MX and fr-FR, so I tried to make it the option with the most flexibility; you can actually put in all the different forms of Spanish that you want in here, and each is probably a little different. This came in really handy when I did an app for a restaurant, where in the back they usually selected Spanish for all the displays, but the manager, using the exact same data, because it was the same, had it displayed in English. So it made it really nice, and the data was exactly the same. And that's built in from scratch.
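The translation tables themselves aren't shown in the transcript. A minimal sketch of the idea, under assumed names, is that every phrase gets an idea ID, translations are stored per culture, and a view filters by the culture the session selected; the custom setting name app.culture below is an assumption for illustration.

```sql
-- Minimal sketch of per-culture translations (assumed names, not the real schema).
CREATE TABLE idea (id serial PRIMARY KEY);

CREATE TABLE translation (
    idea    integer NOT NULL REFERENCES idea,
    culture text    NOT NULL,            -- e.g. 'en-US', 'es-MX', 'fr-FR'
    phrase  text    NOT NULL,
    PRIMARY KEY (idea, culture)
);

-- Each connection declares the culture it wants, for example via a custom setting:
--   SELECT set_config('app.culture', 'es-MX', false);
CREATE VIEW translated AS
SELECT idea, phrase
FROM   translation
WHERE  culture = current_setting('app.culture', true);

-- Two sessions that set different cultures get different phrases back
-- from the same view over the same data.
```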
A case in point: the email list is a good way to illustrate the No CRUD scheme, since almost everyone has subscribed to or unsubscribed from an email list at some point. Here, if I can read it, it's much smaller on my screen, we have somebody, zippy@add.net, who subscribes to a clothing list with sub-lists for hats and socks, and also to clown jobs and maybe Santa jobs. Once you've run those procedures, you can just do this select, and there it is: he gets sent the emails on clothing, on hats, and on any clown or Santa job.

Something else this shows, another divergence from other CRUD-based data models, is the idea that data seems to have always existed in the database. For example, on the very first line, where it says clothing, hats, zippy@add.net: there is no list called clothing, there is no sub-list called hats, and there is no zippy@add.net in the system. This is a fresh database I did the insert on, and it takes care of everything under the covers, so it looks as if all of that always existed. Inserting data into the system is the same operation as asking for the unique ID of that data, and this is true for all the primary tables in the system. You can imagine how much that simplifies the code: you never have to look things up or worry about duplicates; the underlying procedures do all that work for you and just give you back the ID. The second time you do it, clothing and Zippy already exist but the new sub-list doesn't, so it's really only one insert, even though you're passing the whole line again. That's how the whole thing works; again, you can see how lazy I am, once I've written this stuff I just don't want to worry about it any more.

So here we go: Zippy is going to unsubscribe from a single list. I was debating whether Zippy wanted to unsubscribe from clowns or Santa; he ended up unsubscribing from the clown list. Now, if you do that query again, it shows the three remaining lists. As you know by now, to make this happen I updated a NULL to a non-NULL. Then, say he's just not happy at all, maybe he's down on his job as Santa or whatever, and he wants out of everything, so you do one more insert that says: I want to unsubscribe from all your lists, I don't care how many I've subscribed to. Now the query returns nothing. And here's the classic: years go by, maybe you got fired as Santa, I know it's unimaginable but it happens, and when you resubscribe, all your previous subscriptions come back just as they were. This was a real hit for me when I was dealing with these kinds of lists: a site may have twenty lists and I may have subscribed to three of them, and I never remember which, so it was really nice to know what they were. You can show the person greyed-out checkboxes or whatever you want to do, but in this case you just do the select and the three come back, in the state they were in when you originally unsubscribed.

So here's the individual_list table: the transactions for subscribing to and unsubscribing from an email list. All the operations in this example are accomplished only by inserting rows or by filling in the unlist column. Unfortunately I can't point at anything, but on the third row down you can see there's an unlist timestamp; that's what removed that subscription. And the very last row, when it was originally inserted, did not have the unlist value, but it had a NULL for the list ID.
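Here is a rough, hypothetical sketch of that pattern; it is not the speaker's DDL, just an insert-only transaction table plus a view that decides what "currently subscribed" means, including the all-lists rule described above.

    CREATE TABLE list_transaction (
        tx_id       bigserial PRIMARY KEY,
        person      text NOT NULL,           -- e.g. 'zippy@add.net'
        list_name   text,                    -- NULL means "all lists"
        listed_at   timestamptz NOT NULL DEFAULT now(),
        unlisted_at timestamptz              -- the one column that may go from NULL to a value
    );

    CREATE VIEW active_subscription AS
        SELECT t.person, t.list_name
        FROM list_transaction t
        WHERE t.list_name IS NOT NULL
          AND t.unlisted_at IS NULL
          AND NOT EXISTS (                   -- an open "all lists" row hides everything for that person
              SELECT 1
              FROM list_transaction a
              WHERE a.person = t.person
                AND a.list_name IS NULL
                AND a.unlisted_at IS NULL);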
If there's a NULL in the list ID, that means unsubscribe from all lists, I don't care which. And as soon as I set unlist on that one row, they all came back; that row effectively disappears as far as the view is concerned. Remember, this is all done with a view. Anyway, hopefully, and I can feel some questions coming, you get the idea: CRUD operations look easy, but you can actually build a transaction-based system with views instead. The table by itself is not very meaningful; you look at a row with an individual and an ID and wonder what that means, but the view understands all of that. So I'm open to questions about this. You guys have any questions at all? Yeah, go ahead.

Postgres is my preferred database as well. I've been doing databases since the late 70s; I found Postgres back in the early 90s and never really looked back. I've done other projects, though. My favorite paid database is probably DB2, and I've done DB2 projects, and I've been forced to work with Oracle before. In my job I actually have to deal with everybody, Oracle, MySQL, and Postgres especially, so whatever it is, I sit down and start writing. It's always okay. Anything else?

I've got a couple of bonus things. Here is the view; this is the full-name view, and the full_name column you can see there is what puts together the person's full name, for example. Obviously you probably want to use this and not write it yourself. Another thing, since this is getting bigger and bigger all the time, is that I use something called FitNesse, with DBFit, to test the database whenever I do a release. It's a cool little thing: every time I do a release I just hit this button and it runs all my tests. And I'm a big TDD person, so sometimes I'll write what I want the procedure to do here first, and then keep hitting the button until it works. That's another cute thing. Going back to the last slide, any other questions at all?

Ah, the time periods: no, I didn't think that was going to fit in today's schedule, but obviously it probably would have. I can't really show it here, but if you go to nocrud.com there's documentation and a GitHub link. The idea is that I really didn't like having people enter every future Thanksgiving into the database; that's really hard. So I wrote a time-period feature where you describe the rule for Thanksgiving, and then you can just do a select and "Thanksgiving" will be true on Thanksgiving Day: you'll see a one pop up on Thanksgiving, or your birthday, or whatever. The database has a lot of the major holidays coded in, and you can use the exact same mechanism for something like an ad campaign: the ad will only show up in a certain time period, and the period can be in minutes, hours, days, months, whatever. It's all in there. I wish I could cover it better; that one probably takes half an hour just by itself, which is why I left it out, but all the examples are in the documentation. Very handy. Anything else? Well, hey, thank you so much, I appreciate you all showing up.

Testing, one, two, three. You guys hear me? Is that better? Is that better?
Okay, thank you. Thank you, Michael. Hi everyone, thank you for joining our talk. It's called Explaining Explain, an introduction to PostgreSQL explain plans, and I hope it will be a very useful talk for you. I've found this feature of Postgres to be very useful for DBAs, developers, sysadmins, basically anyone who uses Postgres as a database.

Just to get started, a little bit about myself: I work at EDB as a support engineer. (Does that sound... that's from outside, it's not from my microphone; okay, I just wanted to make sure.) Before I worked at EDB I was a DBA and a web developer, and I've been using Postgres since version 7.4. What I want to share about this feature basically comes out of my experience as a support engineer. The most common question I get from our customers is: why is my query slow? The answer is actually very difficult to pinpoint; it could be a lot of factors: a slow network, high concurrency on your operating system, bad kernel tuning, or just a lot of resource contention on your disks and so on. The way to figure out whether Postgres itself is doing something that makes the query slow is this feature called EXPLAIN, and it has proven to be a very powerful tool: it gives you a live look at what's going on in the database.

During this talk we're going to cover, quickly, what EXPLAIN is, what it does, how it works, how you get EXPLAIN output and how you interpret it, and then finally we'll go over some non-trivial, real-world examples of how EXPLAIN can help you as you use Postgres.

To start off: what does EXPLAIN do? EXPLAIN tells you what Postgres plans to do for a query. When you type EXPLAIN and then the query, it prints out a query plan: all the things it chooses to do before it gives you the results. It can also tell you what Postgres actually did for a query, which is what EXPLAIN ANALYZE does: if you type EXPLAIN ANALYZE and then your query, you get statistics about what the query executor encountered during the execution of that query.

Now, that's pretty cool, but before we get carried away with how this is going to improve our experience, we need to know what EXPLAIN doesn't do. It won't explain why the query planner made a choice; it only tells you what choice the planner made, not why. It won't tell you about your query's performance being affected by another session: if there's some locking going on from another session, it won't say, hey, this other query is blocking you and making you slow. It won't tell you about things happening outside the database: say your operating system is flushing its cache or running an antivirus scan and the disk is therefore slow, it won't tell you that. And it won't tell you anything about the network, like that there's a lot of network latency. Those are the things EXPLAIN won't do.

Going back to the top point, EXPLAIN has everything to do with the query planner, so here's a quick overview of how Postgres executes a query: when you type in a query and press enter, that query gets sent to the postmaster, which runs the query planner and generates a query plan for you.
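As a minimal illustration of those two forms, using an arbitrary pgbench-style query since that's what the rest of the talk uses:

    EXPLAIN SELECT * FROM pgbench_accounts WHERE aid < 100000;           -- plan only; nothing is executed
    EXPLAIN ANALYZE SELECT * FROM pgbench_accounts WHERE aid < 100000;   -- runs the query and reports what the executor saw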
Now, how does that query planner work? Query planning in Postgres is cost based: based on some configuration in your conf file, it assigns a cost to certain actions, a sequential scan, a hash join, and so on, and that calculated cost is taken into account for the decisions it needs to make. It also uses statistics, and those statistics are stored in the pg_statistic catalog. That catalog is not human readable, so don't bother looking there; if you're curious, look at pg_stats, which is a processed view of pg_statistic, though there's not a lot of documentation on how to interpret it, so I won't get into those details here. Those statistics are refreshed every time you run ANALYZE, and I don't want you to get confused here: EXPLAIN ANALYZE is a two-word keyword that tells EXPLAIN to actually execute the query and report on it, whereas ANALYZE by itself, sometimes used together with VACUUM, scans a table, collects statistics about the values in its columns, and stores them in pg_statistic.

Once again, this is all driven by configuration. The relevant configurables are the ones that begin with enable_ or end with _cost. Here's the list of all the configurables involving the query planner in postgresql.conf; you can generate this listing with the query shown on the screen. There's a lot of stuff here and it can be intimidating to look at, but for the most part, unless you're getting really deep into some query-tuning issue, you won't need to tune all of these. The ones you're most likely to touch are these two: seq_page_cost and random_page_cost.

Okay, let's go into some examples. I'm using pgbench with no extra flags, just a plain pgbench load, so if you want to follow along with these slides later, you can quickly spin up a pgbench instance. EXPLAIN, like I said, tells you what Postgres plans to do to get the data to you. When I type EXPLAIN SELECT * FROM pgbench_accounts joined to pgbench_branches, getting only the rows where aid is less than 100,000, it prints out this query plan. You can see that it filters pgbench_accounts at the bottom, keeping only the rows where aid is less than 100,000, it does a sequential scan of pgbench_branches, and after those two scans it joins them together on the equality condition using a nested loop.

You might wonder what these costs are: is that how much time it takes, or how many bytes? The way it works is by using, like I said, the parameters in the conf file. Let's focus on the number 2890: how did the planner calculate that? Through this formula: the cost is the number of blocks times seq_page_cost, which comes from your conf file, plus the number of records times cpu_tuple_cost, which is also in the conf file. You can plug the numbers in: there are 100,000 records in the accounts table, the block size is typically 8 kilobytes on a standard OS, and the number of blocks, the relation size divided by the block size, is 1640. Put it all together and at the bottom you get 2890.
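Spelling that arithmetic out with the stock defaults (seq_page_cost = 1.0, cpu_tuple_cost = 0.01), plus the cpu_operator_cost = 0.0025 term that the aid < 100000 filter contributes; that last term is my reconstruction rather than part of the formula as quoted above, but it's what makes the numbers land on 2890:

      1640 blocks * 1.0    (seq_page_cost)     = 1640
    100000 rows   * 0.01   (cpu_tuple_cost)    = 1000
    100000 rows   * 0.0025 (cpu_operator_cost) =  250
                                         total = 2890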
So as you can see, these costs are based entirely on whatever you find in the conf file. If you change those numbers, you're telling Postgres it's more or less expensive to grab a tuple from disk, and by doing that you can actually influence the choices the query planner makes; I'll get into that a little more later.

Okay, so that was just EXPLAIN, just the plan. It didn't execute anything, so you don't know how long the query actually takes to run. If you run EXPLAIN ANALYZE on the same query, you get this result. It looks very similar to the one you saw earlier, but now there's a second set of numbers: actual time, actual rows, actual loops. That's what the query executor encountered as it processed the plan. The actual time to run it was 61 milliseconds, and the bulk of that was the sequential scan of the accounts table at 25.7 milliseconds. Note that the 56 milliseconds for the nested loop includes the time it took to do the scans underneath it, so to find the time spent in the nested loop itself you have to do the subtraction. (Sorry, I didn't advance the slides while I was explaining that.)

In addition to ANALYZE there's a bunch of other keywords you can pass to EXPLAIN; they're in the documentation. The one I'll share here is BUFFERS: EXPLAIN with BUFFERS and ANALYZE will not only analyze the query, it will also show you the buffer usage of the nodes in the execution plan. In this output, during the sequential scan of pgbench_accounts it hit 2 buffers and read 1638, meaning it found two blocks already in the shared buffers cache and had to read 1638 from disk. So this was a read-heavy operation that pulled data from disk into memory. If you run the query several times you may hit more buffers in memory and get a lower execution time, but this first run pulled a lot off disk. If you did an update or an insert, you might see buffers dirtied or buffers written, which can also be useful to know. And if you're seeing a lot of unexpected reads off disk, you might need to increase shared_buffers, because data is getting evicted from the cache between executions and the next run has to pull it from disk all over again; maybe shared buffers just isn't big enough for you. The other keywords you can pass to EXPLAIN include per-node timing and output formats like YAML or JSON; you can get those examples from the documentation as well.
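The parenthesized option list is how those extra keywords are passed. I'm assuming the usual pgbench join on bid here, since the talk doesn't spell the join condition out, and FORMAT JSON or FORMAT YAML work the same way:

    EXPLAIN (ANALYZE, BUFFERS)
        SELECT *
        FROM pgbench_accounts a
        JOIN pgbench_branches b ON a.bid = b.bid
        WHERE a.aid < 100000;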
Okay, moving on. As you can see, much of query planning involves deciding what kinds of joins and scans should be performed. So far we've only seen the nested loop, so I'm going to show a few other examples and go a little deeper. Just to create a little more chaos, or non-determinism, we're going to insert more data into the branches table: I insert another 99,099 rows and then run the same EXPLAIN ANALYZE where aid is less than 100,000, and this is the output I get. Notice that at this point it has decided to use a hash join instead of a nested loop, and that's because the branches table is now bigger, so a nested loop would not be efficient. Previously there was just one row in branches and 100,000 rows in accounts, so you take that one branches row, scan the 100,000 account rows, and you've got your join. But now there are 100,000 rows in branches and 100,000 in accounts, and you don't want to scan accounts once for every row in branches; that would take too long. What you want instead is to build a hash of one of the two tables and then probe it with the values from the other table.

By the way, something I didn't get to mention earlier: the first number in the cost, and the first number in the actual time, is the cost or time to produce the first tuple; the second number is the cost or time to produce the full result set. So here 16.76 for the hash join is the setup cost: you have to build the hash and probe it, and by that time you have your first row to print to the screen. And 48.38 is the cost to get the full set.

So the hash join costs 48.38, and the reason the planner chose this join instead of the nested loop is that it's less expensive. I can actually force the planner's hand and say: I don't want a hash join. If you tell it that, it has to think about what's next. Here's what I did: SET enable_hashjoin TO off, then run the same query again, and this time it says nested loop, with a cost of 58.39. So now you can see the choice the planner had, nested loop versus hash join, 58.39 versus 48.38; the hash join is clearly cheaper, so it goes with that. The query planner is using cost to pick the most efficient execution plan for you, and if you flip the enable values on or off, or move the cost values up and down, you can steer, or trick, the planner. You may or may not want to do that, depending on what kind of data you're working with.

Question from the audience: if you set enable_hashjoin to off, does it stay that way for as long as the server is up? The SET command for these enable and cost values applies only to your session, so if you turn hash joins off it won't affect anyone else; everybody else can keep using hash joins. But if you change it in your conf file and then restart or reload the database, it's applied globally, and you don't want to casually change those values there, because you can affect other queries and other users. For your own purposes, you can do it just in your session.
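A sketch of that per-session experiment, using the same query shape as above; RESET puts the setting back afterwards:

    SET enable_hashjoin = off;               -- affects only the current session
    EXPLAIN ANALYZE
        SELECT *
        FROM pgbench_accounts a
        JOIN pgbench_branches b ON a.bid = b.bid
        WHERE a.aid < 100000;                -- the planner now has to pick another join method
    RESET enable_hashjoin;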
That way you can see what the query planner would otherwise be doing. Okay, now let's go the other direction and ask for fewer rows: I look for values where aid is less than 100, so that's going to be, what, 99 values. What happens now? Now it chooses a merge join, because it decides that costs only 16 or so; I didn't print out all the cost numbers, but you can see it.

So here's a quick review of joins: the merge join, which we see here, the nested loop, and the hash join we saw earlier. The nested loop, like I said, takes the smaller set and, for each value in the smaller set, loops through the bigger set. (My memory fails me here; I think the order the tables are listed in the query might matter, so A JOIN B versus B JOIN A might produce a different nested-loop decision. Note to self: check that for my next talk.) The merge join takes two data sets that are sorted and matches them up like a zipper: one and one, okay; two and three, no; three and three, okay; and joins them together. That's good for large tables, but it requires the inputs to be sorted, so if you have a large table that isn't already sorted, that's extra processing to do. The hash join is the final one: you build a hash of all the values of one table, then take your equality search value, look it up in that hash, and find the right entry to join with. There's even more startup cost here, because you have to build the hash, but once that's done the execution is very quick, because you can find the matching value almost immediately. Note that this is only valid for equality operators; you cannot do a hash join with a less-than or a greater-than. So those are the three join types, and they all have their own purposes.

Let's move on. Now I'm going to create a little more unpredictability. In the accounts table, once you load pgbench, all 100,000 rows have distinct aids, but the bid is 1 for every row, so if you say WHERE bid = 1 you get rows starting from the very first one you hit. What I'm going to do is set bid equal to aid, and the reason is that aid is indexed but bid is not; we're going to explore indexes. So I update all the bids so they're all distinct values, and then I do EXPLAIN ANALYZE SELECT * FROM pgbench_accounts WHERE bid = 1. What you get is a sequential scan that filters out 99,999 rows, because only one row has bid = 1, and it takes 45.1 milliseconds. That isn't long in real-world human time, but in database and software time it's actually quite a long time, and it's usually unacceptable in a large production environment.
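A sketch of that setup, assuming a standard pgbench schema; the exact timings will of course differ on your machine:

    UPDATE pgbench_accounts SET bid = aid;   -- make bid distinct; aid is indexed, bid is not

    EXPLAIN ANALYZE
        SELECT * FROM pgbench_accounts WHERE bid = 1;
    -- expect a Seq Scan with something like "Rows Removed by Filter: 99999" at this point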
So what we do is create an index. An index maps out all the values of bid and records where each actual row lives, so it's kind of like a shortcut to the rows we want. I create the index, run the same query again, and now I'm down to less than a millisecond, about a tenth of a millisecond. As you can see, having an index helps a lot.

Now, the way an index scan works is this: when you ask for the row where bid = 1, it scans the index, finds the entry for bid = 1, which gives it the address on disk, and then it goes to the disk and pulls the row out. So you're still pulling data off the disk, or out of your cache, and returning it to the user. That can be improved, because there's an even faster scan in Postgres: the index-only scan. Look at the bottom query on the screen: when I select just aid from pgbench_accounts where aid is less than 1000, the execution time drops from 0.8 milliseconds to 0.2 milliseconds. Why? With SELECT *, it scans the aid primary-key index, grabs a thousand rows from the disk, and returns them, because I asked for everything in each row. That can be wasteful: say I only need the aid, because I just need it for some lookup in another query. If I select only aid, it looks at the index for all the entries less than a thousand and then never has to go to the disk at all, because the query only asked for aid; it can send that back right away, and that saves about 0.6 milliseconds.

The point of all this is to encourage you, if you're a developer or you're responsible for writing reports, to be very specific about which columns you return and which indexes you create. Here I have an index on aid and I only want aid, so I can do a quick index-only scan and send that straight to the client.

As I mentioned earlier, these are costs you can tune: random_page_cost is what affects the choice of an index scan, because it's the cost of fetching a random block off the disk. If you have an SSD you can tune it way down, because it doesn't cost much to pull random blocks off an SSD; if you have an old spinning drive you'll want to keep it higher, and depending on the RPMs of that disk you can tune it up or down. And this is just to illustrate what happens when the cost gets too high: when the cost gets too high, because the index is really big or for some other reason, the query planner will eventually choose a sequential scan instead of the index scan. I mention this because a lot of our customers come to us and say, I have an index but the query planner isn't using it, and when we look closely we realize the planner isn't choosing that index because the query is trying to get too much data out of the table. For example, if the table has 10,000 rows and you ask for aid less than 10,000, you're going to end up reading essentially the whole table anyway, so you might as well do a sequential scan rather than an index scan plus a random page fetch for every row. Just knowing the size of your tables and what your query is trying to do will help you understand why the query planner made a certain choice.
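The follow-up steps, again as a sketch; the index-only scan works because every column the query asks for is available in the index, which is why selecting just aid changes the plan:

    CREATE INDEX ON pgbench_accounts (bid);

    EXPLAIN ANALYZE SELECT *   FROM pgbench_accounts WHERE bid = 1;     -- index scan: sub-millisecond in the demo
    EXPLAIN ANALYZE SELECT *   FROM pgbench_accounts WHERE aid < 1000;  -- index scan: still visits the table for every row
    EXPLAIN ANALYZE SELECT aid FROM pgbench_accounts WHERE aid < 1000;  -- index-only scan: the table is (mostly) skipped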
So, a quick summary of all the scan types we went through. Sequential scan: you scan the whole table, and the planner chooses it when you're going to retrieve most of the table anyway, or there's nothing better. Index scan: you scan the index and then look up the rows on disk, which causes random seeks, so you want to be aware of that cost. Index-only scan: it returns only the columns that are in the index and doesn't touch the table at all. And finally there's one scan we didn't go through, the bitmap heap scan: it scans the index and builds a bitmap of all the pages it needs to pull, so rather than going index, fetch from disk, index, fetch from disk, it works out that, say, five of these ten rows come from this part of the disk and two come from that part, and fetches them more efficiently.

All right, now for the next part: explaining the unexpected, some of the real-world, non-trivial things EXPLAIN can do for you. One reason you might have a slow query is bad statistics. In this example, when I do EXPLAIN ANALYZE on a query, the estimate is 35,360 rows, but when it actually runs I only get 175. What happened? Remember, all query plans are generated from the statistics stored in the pg_statistic catalog, and whatever is stored in there right now is not very accurate, because in reality there are only about 175 rows. So what you do is run the ANALYZE command and then run the query again, and now the estimate is a lot closer to the actual number of rows. Again, this is an estimation: what's stored in the statistics is always an approximation, because you can't keep every value. If you have a really large table with billions of rows, you can't store all of that in the statistics; otherwise you'd just have the whole table all over again. So it uses heuristics, histograms built from samples, and stores those. If you run into this situation where your statistics are way off, that's usually a clue that you haven't vacuumed, or autovacuum is poorly tuned, or it's not running at all, so you do want to make sure autovacuum is set up properly. That's beyond the scope of this talk, but please do google how to tune autovacuum; it will definitely help you a lot.

There are also times when the query planner doesn't choose an index you expected it to use, and sometimes that's caused by columns that are correlated in a way the database doesn't know about. For example, if you have a table with cities and another with states and you want to do some kind of join, Postgres doesn't know that a city is inside a state, so it wouldn't know to take advantage of that relationship. If you need to create that correlation, Postgres has a feature called CREATE STATISTICS, which builds an additional set of statistics for the query planner to use. That's also outside the scope of this talk, but please bear in mind that it exists: if you do need the planner to understand correlations like that in your queries, this is the way to do it.
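Sketches of the two remedies just mentioned; the table and column names here are invented for illustration:

    -- refresh the per-column statistics the planner reads from pg_statistic
    ANALYZE addresses;

    -- teach the planner that two columns move together (extended statistics, PostgreSQL 10+)
    CREATE STATISTICS addresses_city_state (dependencies) ON city, state FROM addresses;
    ANALYZE addresses;   -- extended statistics are filled in by the next ANALYZE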
Another real-world example of slow queries that EXPLAIN can help you with is insufficient memory allocation. In this top example, the query took about 2.3 seconds to run, and if you look at the plan it says external merge, disk, 2664 kilobytes; so it created a file of a couple of megabytes on disk to do some sorting for you. Sorting on disk is very slow, and it usually indicates that you haven't allocated work_mem properly. work_mem is a configurable that gives each operation space in memory to do joins, sorts, and things like that. As you can see, once I've increased work_mem and run the exact same query again, it does the sorting in memory and the query is faster; it has dropped down to 2.1 seconds. And like the enable_hashjoin example I showed earlier, work_mem can be set per session; changing it in your postgresql.conf can be dangerous, because if you set it too high you can actually cause out-of-memory errors, so keep that in mind.

One more example: an index definition mismatch. Here's an index I created: I'm indexing the aid column together with a substring of the filler column, its first character. Now, if you look at the following two queries, where I attempt to use that index, the query planner doesn't use it, and the reason is that I've defined the index one way but I'm performing an operation that isn't written the way the index expression is defined. You can see the two wrong examples, and then the final example, where I write the expression exactly the same way, does use the index. This is useful because, if you're wondering why your query is slow, EXPLAIN ANALYZE shows you that it didn't use an index scan, or used a bitmap scan instead of an index scan, and the reason may simply be that the query wasn't written in a way that tells the planner it can use that index. The same goes for things like JSON values: if you want to create multiple indexes for the different ways you'll query a column, that's fine, but bear in mind that you incur a higher write cost; every insert and every update has to maintain those indexes, so it can actually reduce your performance in the long run.

A few other situations. Prepared statements: we have seen cases where, if you prepare a statement and call EXECUTE on it many times, you can end up with decreasing performance. The way it works is that the first five executions of a prepared statement use a custom plan: Postgres takes the value you pass in, generates a fresh plan for it, and runs against that value. Planning takes time, and if you're going to execute the statement many times you don't want to keep paying that planning cost, so on the sixth execution Postgres creates a generic plan and uses that from then on; but the generic plan isn't always the best-performing one. In version 12 and later you can adjust plan_cache_mode and tell it to always use a custom plan, meaning I don't mind the cost of query planning, always generate a custom plan, which can perform a lot better. Join order can also affect whether an index gets chosen. And then there's the just-in-time (JIT) compilation feature, which is very useful for ETL but has bitten a lot of our customers, because it does additional work that isn't necessary for their application; so you might want to turn JIT off and see if that improves your performance.
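Both of those are knobs you can flip per session to test the effect; plan_cache_mode exists in PostgreSQL 12 and later, and jit as a setting in 11 and later, so treat this as a version-dependent sketch:

    SET plan_cache_mode = force_custom_plan;  -- re-plan every EXECUTE instead of switching to a generic plan
    SET jit = off;                            -- skip JIT compilation for this session's queries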
from_collapse_limit and join_collapse_limit also affect your performance. from_collapse_limit defaults to 8: if you have SELECT * FROM table1, table2, and so on up to eight tables, the query planner will take those eight tables and plan them together, but beyond that, for the ninth table and onward, it stops folding them into one planning problem and just does its best to guess, and that guess is sometimes not very efficient. The same goes for joins and join_collapse_limit: table1 JOIN table2 JOIN table3 and so on. If you join too many tables together, the planner stops considering everything, falls back on estimates, and you can end up with a really fudged-up query plan.

The final one I want to share with you is ORMs. If you're using Django, or one of the other frameworks, I forget the name, some applications use an ORM, and the way the ORM constructs its queries is very opaque: you don't know what it's doing under the hood. In this example, one of our customers had a simple update statement that was taking about 40 milliseconds; they run it something like a thousand times, so all the updates on all the rows take over a minute. But when they ran the same update manually it only took about 0.25 milliseconds, which is really fast, so it was genuinely puzzling why the ORM's update was so slow. We were able to figure out what the problem was, and the EXPLAIN ANALYZE output showed us: there was a data type mismatch. The column was defined as numeric, but the ORM was casting the value it sent to double precision, and once it was cast to double precision the query could not use the index; the manual update used the index, but the ORM's updates were doing sequential scans.
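A hedged reconstruction of that mismatch; the table and values are invented, but the pattern (a numeric column compared against a double precision parameter) is the one described:

    CREATE TABLE payments (id bigint PRIMARY KEY, amount numeric);
    CREATE INDEX ON payments (amount);

    -- roughly what the ORM sent: the float8 parameter pushes the comparison away
    -- from the numeric index, so the plan falls back to a sequential scan
    EXPLAIN ANALYZE UPDATE payments SET amount = 0 WHERE amount = 100.0::double precision;

    -- the same statement with a matching numeric parameter can use the index
    EXPLAIN ANALYZE UPDATE payments SET amount = 0 WHERE amount = 100.0::numeric;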
So how did we see that? Because when the application is doing thousands of queries a second, you can't step in and run EXPLAIN ANALYZE by hand. The way we do it is with a module called auto_explain. It's an extension that ships with Postgres, and you have to load it into your database. What it does is print EXPLAIN or EXPLAIN ANALYZE output to your log file: you turn it on, run your script or workload, and it writes out the explain output for everything, according to the settings you configure. It can even do EXPLAIN ANALYZE on nested statements: if you have a function that runs a SELECT inside it, auto_explain will log the plan of that SELECT within the function, which you can't get with regular EXPLAIN by hand. So it's very, very useful. In addition to the customer example I mentioned, I recently had a customer with a row-level data security issue: they had a function that was supposed to set a session variable, but the variable wasn't being set properly, and as a result their clients were seeing rows they shouldn't have been seeing. auto_explain is what gave us the visibility to find that.

Question: can you turn it on, and will it affect performance? Yes, it does; it creates a lot of additional I/O, so you only want to turn it on when you're actively looking for a problem. You can turn it on per session with LOAD 'auto_explain' in your psql session, or from your application, or you can turn it on at the conf level, which is system-wide. If you have to, you should, but if you can keep it off, you'll save yourself the performance hit. The best approach is to identify the problematic query, turn it on, capture the EXPLAIN ANALYZE output, and turn it off again.

That was it for my talk; thank you for attending, I hope it was useful. You have a question: how do you determine whether the statistics for a query are inaccurate? That's a good question, and there isn't a quick answer. Usually you look at the EXPLAIN ANALYZE output and compare the estimated rows with the actual rows. Sometimes, even after you run ANALYZE, it's still inaccurate, and that's usually because the table is so big that the histogram it builds isn't representative. There are parameters you can set to affect the size of that histogram and hopefully make those estimates more accurate; if you want to know which ones, I can try to find that for you after the talk.

Okay, next question: is there a way to tell whether an index is being used, or, put another way, are there ways to identify indexes that aren't being used? There's a catalog view called pg_stat_user_indexes: if you do SELECT * FROM pg_stat_user_indexes, it lists all the indexes you've created and shows how many times each one has been scanned. If you find one that has been scanned zero times, it's not being used at all, and you might want to get rid of it. There's also pg_stat_user_tables, which tells you things like the number of tuples fetched and the number of dead tuples. And like I said, you do want to identify unused indexes, because every insert, update, and delete has to write to the index, and that's a performance cost at large scale.
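Hedged sketches of the two things mentioned above: loading auto_explain for a single session so the plans land in the log, and asking pg_stat_user_indexes how often each index has been scanned (the parameter values are only illustrative):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;        -- log the plan of every statement in this session
    SET auto_explain.log_analyze = on;            -- include EXPLAIN ANALYZE detail, not just the plan
    SET auto_explain.log_nested_statements = on;  -- also statements run inside functions

    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    ORDER BY idx_scan;                            -- idx_scan = 0 suggests an index that is never used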
So I guess the question is: how do you determine whether a table needs an index? If you see yourself getting lots of sequential scans and the query performance is just not what you want, you can try creating an index and see if that improves things; if it does, then yes, you needed that index. Okay, I think that's it for the questions, so thank you once again for joining the talk, and I hope you enjoy the rest of the weekend.

Hello, hello, is it okay? Hello, hello. Hello everybody, thank you, and welcome. Today we'll talk about PostgreSQL and artificial intelligence, so it's a little bit of a different topic from the pure database talks: it's a mix of artificial intelligence and Postgres. We are talking about databases, but specifically PostgreSQL. My name is Ibrahim Ahmad; I've been working with Postgres for 15 years, I've been in the software industry for 20-some years now, and I have multiple books on PostgreSQL.

The main agenda: we'll briefly discuss what artificial intelligence is, and then we'll discuss how we can use artificial intelligence in Postgres and Postgres in artificial intelligence. It goes both ways: in artificial intelligence we can use a database, and in the database we can also use AI, so we'll talk about both aspects, and then do a little demonstration of how we can use the database in artificial intelligence.

So, what is artificial intelligence? It's the mimicking of cognitive functions that humans associate with the human mind, such as learning and problem solving. Think about a baby: what does he do? He starts out knowing nothing, he starts learning, and then he starts solving problems using his knowledge. Artificial intelligence mimics exactly that: we try to teach our computer, the computer learns from the data, and then it starts solving problems. The human mind learns and then solves problems; the computer can do the same kind of thing, but it's artificial intelligence, not real intelligence.

A few quotes. Albert Einstein: artificial intelligence is no match for natural stupidity. A machine with strong AI is able to think and act just like a human and is able to learn from experience; humans also learn from their experience, from what they have observed, and that's how the human brain works, it learns and then it can start solving problems it hasn't seen before. Stephen Hawking: the development of full artificial intelligence could spell the end of the human race. And: the one who becomes the leader in this sphere will be the ruler of the world.

Now, types of artificial intelligence; there are three. The first is ANI, artificial narrow intelligence. Narrow intelligence is when we teach a computer one specific task, and the computer does that task and only that task. The computer is not doing everything for you; we teach it a single task, and it starts doing that task perfectly. That's narrow intelligence.
Then we have artificial general intelligence. Narrow intelligence is still behind the human; with general intelligence we try to match computer intelligence to human intelligence, so that the computer can do whatever a human can do. We have not achieved that goal yet; we have achieved narrow intelligence, where we teach a computer to do something and it can do it, and scientists are working toward general intelligence, where the computer can do whatever a human can do.

The third is artificial superintelligence. Superintelligence is when the computer can do much more than the human: when you compare the computer and the human mind, the computer does more. How can a computer do more than the human? A human can learn, read, and start solving problems, but a human cannot hold a truly huge amount of data; if you have billions of records, a human cannot memorize billions of records, but a computer can. The computer can have the capabilities of a human plus all that information stored inside it, which a human cannot keep in their brain, so in that case the computer can work much better than the human. It has learning capacity, it has problem-solving capacity, and it has one more thing: it can accommodate a huge amount of data. That's the computer's edge over the human mind. So we have three types: narrow intelligence, which is a single task; general intelligence, which mimics more or less the whole human brain; and superintelligence, which is beyond the human mind simply because it can hold so much data.

Those were the types of artificial intelligence, but there are also subfields of artificial intelligence. If we just say "artificial intelligence" it's only a phrase, artificial plus intelligence; when we talk about how to actually achieve it in the real world, the subfields are machine learning, deep learning, neural networks, cognitive computing, NLP, and computer vision. Depending on what kind of problem you have, you choose one of these technologies or subfields.

So, machine learning: the machine learns from some data, and once it has learned, we give it problems to solve. Machine learning actually has three types: supervised learning, unsupervised learning, and reinforcement learning. What is supervised learning? In supervised learning you have data, and you have the result for that data, the label. You have the data and the label, and when you train your model you compare the model's output on the training data with the labelled result and adjust the model accordingly.
For example, suppose we want to train a model to say whether something is a car or a scooter, a two-wheeler. One set of properties is: four wheels, four doors, a windscreen; these are the features of one thing, and the label says it's a car, a four-wheeler. Another has two wheels and no windscreen, and the label says it's a scooter. We have the data and we have the output, what that data means, and we teach our model to extract that information, compare it with the labelled value, and learn from it. Once the model has learned, whenever it gets data describing four wheels and a windscreen it will say: that's a car, a four-wheeler. So with data plus labels we train the model, and we use this for classification and regression. In classification we can classify: this is a car, this is a bike; give it the characteristics of a bike and it says bike, give it the characteristics of a car and it says car.

What about unsupervised learning? In unsupervised learning you only have the characteristics; you don't have the resulting value telling you what each thing is. So what's the benefit? It's used for association and for clustering. What is association? When you go to the supermarket, you'll have noticed that wherever the bread is, the eggs are there too, and the butter is there. Why? Because of association: they have a model showing that whenever a customer buys bread, he usually also buys eggs and butter, so they place those things together to increase their sales and make it convenient for the customer. You feed the model the purchases, this customer bought bread and eggs, this one bought bread and butter, and the trained model tells you what goes together, so you put all those items in one area; similarly with soap and the other cleaning stuff. And then there's clustering: you can cluster that this is a chair, this is a bed; it just segregates the information into groups using the characteristics, these are one kind of thing, those are another. That's unsupervised learning.

The third one is reinforcement learning. Reinforcement learning is basically trial and error, and it's used in robotics. What do I mean by trial and error? Put a robot in a room: it starts going to the right, not knowing whether there's a door there; it touches the wall and sees that this is a wall. It comes back, goes to the left, touches that, also a wall. It goes back, tries again, and eventually finds: oh, there is a door. It returns to its starting place, and now the robot knows that four steps to the right is a wall, and four steps to the left is a wall.
But twenty steps straight ahead, there's a door. So the next time it wants to cross the room, it won't go left or right; it will go straight to the door. That's reinforcement learning: the robot roams around the room and finds where the window is, where the door is, where the switch is, and after it has spent enough time in the room, it knows everything about it.

So how does machine learning work in practice? You have data, and you divide it into two parts: training data and test data. You prepare your data, you train your model on the training portion, and then you test it. If the results are correct, fine; if they're not, you improve it. It's a continuous process: you have the features and you have the expected results, so you compare whether your result matches the actual result. If it's not the same but the delta is acceptable, okay; if it's not acceptable, retrain. The more data you have, the better trained the model will be.

So that's machine learning; now, deep learning. So far we put data into the computer along with the features, and extracted information from that. But what if you don't know the features either? You want to cluster things and you can't say "it's a four-wheeler" or anything like that; you don't know anything about it, and you need the system to extract the features too. This is where we really start mimicking the human brain, because a human isn't handed the features; you just observe. The human brain is made of cells called neurons, and that's where the names come from: a neuron, and a neural network. A neural network has an input layer, where we put the input values, then hidden layers (I'm not going to go into detail about how the neurons work or how many hidden layers you need; that's a big topic of its own), and then an output. If the output is correct, fine; if it's not correct, it goes back and readjusts the weights, which is usually called back propagation, until the output it produces matches the output it should produce. I'll give an example on the next slide. And when we have a network with many such layers, that's called deep learning; you can think of it as an extended form of a neural network.

So how can you use that? In this picture we have a lot of photos: some black and white, some color, people with different hair colors. We want to identify which picture is which, to segregate them. We have not told it that a face has two eyes; we have not provided any features at all, we just put the pictures in. You can see that the first layer of the neural network starts extracting things like edges and eyebrows, the next layer starts recognizing eyes, then you can see it starting to recognize whole faces, and ultimately, at the end, it can say: here are the groups, and these are the pictures in each one.
So this kind of information can be segregated using the deep learning process. Now, what is the difference between deep learning and machine learning? In machine learning, as we discussed, we provide the features; in supervised and unsupervised learning we provide the features, and in some cases the result as well, and the model works from that. In deep learning we are not extracting the features ourselves: feature extraction is part of the deep learning process. In the first picture, machine learning, the input is a car and a human has to extract the features of that input, that it has four wheels, it has a windscreen, and so on, and feed that in; then the model says car or not a car. In deep learning we just give it the input and it works out whether it's a car or not. How? We give it pictures of cars and pictures of motorbikes, and it categorizes them as two separate things. It doesn't know the words "car" or "motorbike", but it knows these are two different things, because its own feature extraction found different features, and then we label the groups: this one is a car, this one is a bike.

So now we have a very brief idea of what artificial intelligence, machine learning and deep learning are, and it's time to go to the database. Here is an example. Does an integer have a non-leading zero? If you look, 31903 has a zero; 82392 doesn't have a zero. There are mathematical models that can do this, and you could even just do a string search, but we're going to do it with machine learning, with a neural network. We could do it another way because it's a simple example; I didn't want to give a very complex example where a mathematical model can't work, so here a mathematical model would easily work, but we're taking it as an example. We have to train our model so that it will detect whether a number has a zero in it: we'll give it a random number, and it has to predict whether that number contains a zero.

PostgreSQL has support for different procedural languages: there's PL/Perl, PL/Python, PL/Java, and I used Perl here. You could also use Python; Python is arguably easier. Thanks here go to Bruce, who was actually here earlier, he's my friend, because I took his PL/Perl example, so thanks to him. It's available on Bruce Momjian's site; all the examples are there, you can download them and try them on your own computer. A tensor, by the way, is something you can think of as a vector; it's used to store the data, it's just the name for that.

So now we input our data. We don't have huge data, say 100 random numbers, and for each one we have the result: does this number have a zero or not. We usually split the data 80/20, or 70/30, so we give 80 percent of the data to the model to train it.
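This is not the PL/Perl from the talk; it's just a plain-SQL sketch of the data-preparation step described above, an 80/20 split of random integers labelled with whether their text form contains a zero (table and column names are invented):

    CREATE TABLE training_set (value integer, has_zero boolean);
    INSERT INTO training_set
        SELECT v, v::text LIKE '%0%'
        FROM (SELECT (random() * 100000)::int AS v FROM generate_series(1, 80)) AS g;

    CREATE TABLE testing_set (value integer);     -- no label column: the model has to predict it
    INSERT INTO testing_set
        SELECT (random() * 100000)::int FROM generate_series(1, 20);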
Then the remaining 20 to 30% is used to test whether the model is correct. For the 70 to 80% of training data we already know whether each number has a zero, and we feed that to the computer. For the remaining 20 or 30% we only provide the number, not whether it has a zero, and then we compare: when we give it that test data, does it predict correctly or not? So here is the code. I am not going to go through the PL/pgSQL in detail; you can always try it on your laptop, and it would take much more time than we have here. I created a training_set table, and we insert 70 or 80% of the actual data into it, and we generated some random integer values as the input. You can see that our training set here is very small. For example, the number 28762748 does not have a zero, so the result is false; we have the value and we have the result. The second number has a zero, so it is true, and the third and fourth are the same: for each number we also have its result, because it is a training set. When we generate the testing set, it has only the values; the training set has both columns, and the testing set has only the first column, not the output. Now we start training. We generate some weights for the network; I am not going to walk through that, you can just copy the code and run it, so I will show you the result. Okay, now we will test the number 100. We have already trained the model: I showed you the data and the resulting value of whether each number has a zero, and we trained the model using the code I pointed you to; you can download it and train your own model. So the model is trained, and I want to test it: given the input 100, does it have a zero or not? You can see the result: 0.22193865. The model judges that there is only about a 22% probability that 100 contains a zero, and about 77% that it does not. Why? The problem is that we provided very little training data, only around 100 values, so our model is not well trained. If you provide millions of values, the accuracy will start increasing; if you provide very few values, your accuracy stays low. You can set a threshold: say, if the accuracy reaches 90%, or 80%, or 99%, you consider the model trained. Right now our model is not properly trained; it is only at 22%, simply because we have not supplied sufficient data. It is similar to a child of two or three years old: he does not have that much data in his mind yet. If you take him out to a mountain and ask whether it is a mountain, and he has never seen a mountain in his life, he cannot say what it is. But a child born in a mountain area will always recognize a mountain. So the amount of data you provide to your model determines the maximum accuracy you will get. So let's test 101.
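For orientation, the single-value test just described could look roughly like this; predict_has_zero is a hypothetical name standing in for the trained model's function, which in the real example lives in the downloadable PL/pgSQL code.

    -- ask the (hypothetical) trained model about a single value;
    -- with so little training data it judged only about 0.22 here
    SELECT predict_has_zero(100) AS probability_of_zero;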
So here the output is 0.11, only 11%, even though this number does contain a zero; the probability has decreased further, and again that is a problem of how the model was trained. You can also see a case where the number does not have a zero and the model still says there is a 68% chance that it does, because the model is not properly trained. You have to train it with more data to get better accuracy. Then, using common table expressions, we test with thousands of values at once, and the final accuracy is 15%, which is not good. The point is that you do not judge the model one value at a time: one value may come out at 80% and another at 2%. You train your model on the training data, and then on the whole testing data you measure the accuracy for each element and average it out; that tells you how accurate the model is overall. And you can set a threshold: we will only use this model if it reaches 90%, or 80%, or 70%, whatever your desired target is. So I have used a database for this AI demonstration, but why use a database at all? Well, machine learning requires a lot of data; I gave you the example that if you do not have enough data, you do not get a properly trained model. And most of your data is in your database; you are not storing it in CSV files. Yet 90%, I would even say 99%, of the libraries work with CSV or XML files, not with the database. So what do people do? They have the data in their database, they export it to a CSV file, run the machine learning or AI algorithm on that, and then load the results back. Why? Why not use the database for that? That is the main point of this presentation. Do not export and re-import your data through CSV files just because your library happens to support CSV. I have shown you that you can use machine learning, artificial intelligence, NLP, everything, within your database. Use your data where it is; do not export it into other formats. Why not do machine learning where the data is? It is in the database, so keep the algorithm in the database and use the data that is already there. So that is the first point: the database supports AI, because AI needs data and the database provides it. But there is a second point: within the database, you need AI; the database needs AI. So let me discuss that second direction, starting with using previous activity as training data. Now the AI is using the data that is already in the database. Say you have user logins, or you have phone call data, and some malicious activity is going on over the phone. Your model is trained on what kind of activity each phone number has done, so it can immediately raise an alarm that this looks like malicious activity.
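A minimal sketch of that kind of immediate alarm, wired in as a trigger; the calls table, the suspicion_score function, and the 0.9 cutoff are all assumptions for illustration, not part of the talk.

    -- flag suspicious rows as they arrive, using a hypothetical
    -- suspicion_score() that wraps the trained model
    CREATE OR REPLACE FUNCTION flag_malicious() RETURNS trigger AS $$
    BEGIN
        IF suspicion_score(NEW.phone_number, NEW.duration) > 0.9 THEN
            RAISE EXCEPTION 'call from % flagged as likely malicious', NEW.phone_number;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER check_malicious
        BEFORE INSERT ON calls
        FOR EACH ROW EXECUTE FUNCTION flag_malicious();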
So you do not have to export your data to a CSV file, train your model, and come back; it is continuously learning. You have seamless access to all of your current data: you insert a row and it can go straight into the machine learning pipeline, or maybe you set a threshold, say, once the data reaches one million rows, retrain the model. Either way you are not exporting and importing, exporting and importing, over and over. You can also take immediate action with AI; if you are exporting and importing, you cannot act immediately. Take committing a transaction: commit only if it is likely not fraudulent. If your model says this does not look like fraudulent activity, commit it; if your model says it is fraudulent, do not commit it, and raise an alarm. So you can use the model continuously and take immediate action without interrupting the user, and you do not need a human mind to decide whether an activity is malicious: you have put that mind into your database, and it can predict it automatically. AI can also benefit from database transactions, concurrency, and backups. Other benefits include complex data types, full-text search, GIS, and indexing. If you are running a machine learning algorithm on a CSV file, all you have is the data; but when you run it inside the database, you have a full-fledged database server with full-text search, GIS support, and indexing, and your AI algorithm can take advantage of all of that. PostgreSQL can even do GPU-based computation inside the database, and you can use that for your machine learning algorithms too. Beyond that, there is artificial intelligence used by the database itself and by user applications: performance adjustment using a trained model, optimizer plans, index creation and destruction. This is the second aspect. The first aspect was using the database for your AI application; now you are using AI for your database, the other way around. How? For index creation and destruction, you can have an algorithm that predicts that after some time your table will become slow and an index should be created; it can predict that you will need that index in three months and create it then. It can also detect that an index is hardly ever used and drop it. So now the database is getting help from AI. The same goes for database settings. Usually people tune settings from the hardware capabilities: if you have 100 GB of RAM, you set shared_buffers in PostgreSQL to 30 or 40 GB. That only uses the hardware information: a human looks at what the hardware is capable of and picks a setting. But what if it also depends on your data and your workload? With an AI algorithm, the system can learn that you need more shared memory to get better query response times and increase it, or it can generate an alarm that you need bigger hardware because your database performance is slowing down.
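For the point about dropping indexes that are never used, the raw material such a model would learn from is already in PostgreSQL's statistics views; a simple starting point is sketched below, just to show where the usage data comes from.

    -- how often has each index actually been scanned?
    -- an "AI for the database" layer could learn from this over time
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC
    LIMIT 10;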
So AI can predict that. The same goes for resource usage. Sometimes it happens that your database shuts down because there is not enough disk space available, and that is critical. If you have an AI algorithm, it can start predicting from your previous data that your disk will be full after, say, one year, three months, and two days, so you are warned that you have to buy or add another disk to the system before then. Naively, people think: this month 10 GB, next month 20 GB, the third month 30 GB. No, it depends on the data. Maybe there is a holiday period when the system is barely used. Maybe it is a situation like COVID: people predicted their database would be full after three months, and then it barely grew, because nobody came to the restaurant that was generating the data. An AI model takes these patterns into account. If you see that you are using 1 GB per day, a person thinks seven days means 7 GB; but if Saturday and Sunday are holidays with no data generation, it may only be 1 MB on those days. The model takes care of the holidays. It learns that the first of July is a holiday in the US and there will be no data that day, because it has ten years of history and sees no data on that date every year, so it predicts there will be no data next time either. The model is trained on the previous data, and from that it predicts when your disk will actually be full. This is just database activity. It can even catch malicious activity on your database. The model learns that a certain IP address usually comes from a certain area, say northern or central Europe. Then one day that IP address is coming from America. That should not be possible, so it is probably malicious activity; maybe someone is spoofing the address. The model is trained on the traffic itself: you are not manually entering "this IP address is from Europe"; it extracts that feature from the data, that this address has always come from Europe and this time it is coming from the US. It is similar to what Google does: when you use Tor or something else that hides your IP address, Google sometimes decides you are a bot, because first your search comes from France and then from Germany, and you cannot really switch locations that quickly, so it asks you to click that you are not a robot. The algorithm has learned where your searches usually come from and aggregates that. And resource exhaustion, as I already said: it can predict when a resource will be exhausted, whether memory, disk, or CPU. So, that's it. Do you have any questions?
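For that kind of disk-growth prediction, the model needs a history of sizes to learn from. A minimal sketch of collecting it inside PostgreSQL follows; the size_history table and the idea of running the insert once a day are assumptions for illustration.

    -- record the current database size once a day (e.g. from cron or pg_cron)
    CREATE TABLE IF NOT EXISTS size_history (
        measured_on date DEFAULT current_date,
        bytes       bigint
    );

    INSERT INTO size_history (bytes)
    SELECT pg_database_size(current_database());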
Yeah, so the question is about how the algorithm actually runs. Normally, for AI algorithms, we use Python, right? Using C is a separate question, and I will come to that. In the database you use PL/Python, which actually calls the Python libraries underneath but gives you a SQL interface, an easy interface. If you do not want to use that, you can also use the database directly and write a C application against it. Reading from the database is definitely a bit slower than reading a CSV file if you are reading the entire table, just as reading the whole file is. But if you are querying a specific slice of the data, the picture changes. Say your CSV file has ten years of data and you only want the data for 1991: with a C program on the file, you have to read the whole file, but in the database you have indexing, which gives you the 1991 data quickly. That is where the database's indexing comes in; with a CSV file you always have to read the whole thing. Query performance? Training your model is definitely not efficient. If you are training on a hundred years of data, it may take a month: you start the run and come back after a month. But once the model is trained, using it is really fast. Training is a slow process, and usually you do not train and query at the same time. Training is normally an offline process: you train the model in the background, say over the weekend, and during the week, while you are doing interactive work, you only use the model, you do not train it then. People usually do training on GPUs rather than CPUs, because it is such a heavy process. Yeah, absolutely, that is a really good point: you can do the training on a read replica and the actual predictions on the primary. Yes, absolutely, you can do that. You have to wrap the Python code in SQL. You write CREATE PROCEDURE or CREATE FUNCTION, put the actual Python code in the body, and at the end you specify LANGUAGE plpython. When you run that function or procedure, it actually runs your Python code. Yes, I think PL/Python is automatically available, as far as I know; it is a common PostgreSQL extension. It ships with PostgreSQL: when you build PostgreSQL it comes as part of the distribution, and PL/Python is one of those languages; PL/pgSQL is another. For that last question, sorry, I do not have experience with that; you can use it, but I have not tried it myself. Thank you.
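To make that wrapping answer concrete, here is a minimal sketch; it assumes the plpython3u extension is available, and the function name and body are just for illustration, not the talk's actual model.

    -- ordinary Python runs inside the function body;
    -- a real model call could replace this simple check
    CREATE EXTENSION IF NOT EXISTS plpython3u;

    CREATE OR REPLACE FUNCTION has_zero_py(n integer) RETURNS boolean AS $$
        return '0' in str(n)
    $$ LANGUAGE plpython3u;

    SELECT has_zero_py(100);   -- true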
So, if you are looking, we are looking for you: join us at Percona. We are hiring; you can go to the site and find the many open positions, we are looking for people. Thanks, everybody.