Thank you all for being here today. My name is Parfait Gasana. I'm a data analyst at Winston & Strawn LLP in Chicago, Illinois, and for this virtual useR! 2020 talk, I'll be presenting PostgreSQL as a data science database.

Just to get started here real quick: many people are well aware of the challenges in the data science space, whether specifically in an R environment or generally across the industry. By sheer orders of magnitude, big data — the volume, variety, velocity, and veracity of data — is a big challenge. There's the challenge of data munging and wrangling; we hear a lot of anecdotes that sixty to eighty percent of an analyst's time goes to just getting data into a nice, manageable format for end-use needs. Documentation is a big challenge: keeping track of codebooks, data dictionaries, and all the auxiliary files associated with the data — the metadata, essentially — for citation and annotation needs. There's reproducibility: being able to replicate peers' and co-workers' solutions and confirm results is critical to the scientific method. And then there are hardware resources, owing to the big data challenge as well as the data intensity of data science algorithms, modeling, and reporting needs.

Here is where I believe a relational database can help with some of these challenges. Certainly it's not the end-all, be-all solution, but it can really ameliorate some of these hurdles. Relational databases have historically been used for desktop and web applications, but I believe they really do have a place dedicated specifically to data science and even data analytics teams.

A relational database's primary objective is data persistence: storing data, from historical to real-time. It's an added piece of infrastructure that brings durability and security — with replication and backups set up, you can rest assured that data loss is minimal to none. There's storage efficiency: you store less on disk using the relational entity model, rather than all the redundancy seen in CSVs, text files, or even spreadsheets. There's centralization: you foster a multi-user environment where your whole team is set up on one centralized platform for access needs. You no longer have to zip, email, or point to different network file system shares or cloud access points for data and all the associated metadata; everything resides in one central location for remote and local access. And there's scalability: you are no longer beholden to, or limited by, your client machine.
You can now leverage a back-end server database solution. And while there are many relational databases out there, both commercial and open source, I believe PostgreSQL is the best, almost optimal, fit for an R environment. Both are open source tools; both are extensible tools. PostgreSQL is an advanced, object-oriented relational database system: you can build very sophisticated solutions, define your own types, and they all work seamlessly with R. You can even run R inside Postgres for stored functions, so there's even more adaptability, and that can be very conducive for data scientists.

To showcase and really show the power of Postgres as a back end to an R environment, I set up an environment database last year. The central objective was to study the human impact on the global and local biosphere and climate — the physical, natural environment. This database is available as a Docker container on Docker Hub; the link in the upper right corner gives access to it. The database is pretty comprehensive. It covers multiple domains: demographics, economics, biology (both animal and plant species lists), arable land, the National Snow and Ice Data Center's Arctic and Antarctic sea ice data, NOAA's ocean data, and atmospheric data from NASA — really, all these metrics, to see how they interplay on the premise of human population and activity via GDP.

With that in mind, the first step in the pipeline for a database connection is essentially connecting to that back end. One great aspect of R and Postgres is that there are multiple ways of connecting, so you're not forced into one standard or one package. You can connect with traditional ODBC, a very popular, industry-standard open database connectivity technology, originally developed by Microsoft but adopted across the industry; you do that with the odbc package, which runs off the DBI family of database packages in R — download the Postgres ODBC driver beforehand, then point to your server, database, and credentials for access. Another mode is JDBC, the Java database connectivity standard: you download the driver in advance and use it as the middle layer between R, the client, and Postgres, the server. And Postgres does a great job of keeping up to date with all the versions and even the JDBC and ODBC standards, so you can rest assured these are sound, well-maintained solutions. Or you can use a direct connector, without a middle layer, with the RPostgreSQL API, another DBI-family package for database connections: you just point to all your credentials and expose a connection for query needs.
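As a rough sketch of what those three connection modes look like in R — the host, database name, and credentials below are placeholders, not the demo database's actual settings, and in practice you'd pick just one mode:

```r
library(DBI)

# 1) ODBC: install the PostgreSQL ODBC driver beforehand, then point to
#    your server, database, and credentials (all values are placeholders)
con <- dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL Unicode",   # driver name as registered locally
  Server   = "db.example.com",
  Database = "environment_db",
  UID      = "analyst",
  PWD      = "secret",
  Port     = 5432
)

# 2) JDBC: the PostgreSQL JDBC jar (downloaded in advance) acts as the
#    middle layer between R, the client, and Postgres, the server
library(RJDBC)
drv <- JDBC(driverClass = "org.postgresql.Driver",
            classPath   = "/path/to/postgresql-jdbc.jar")
con <- dbConnect(drv,
                 "jdbc:postgresql://db.example.com:5432/environment_db",
                 user = "analyst", password = "secret")

# 3) Direct connector, no middle layer, via the DBI-family RPostgreSQL
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(),
                 host = "db.example.com", dbname = "environment_db",
                 user = "analyst", password = "secret", port = 5432)
```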
Once you have your setup across your various environments, you can now interact with the data. One great aspect of using PostgreSQL is that you can start migrating your data today. Once you set up the schema — for this example, the global temperature data from NASA — you can migrate your data directly from CSV. The same source files you're already using in R can populate your tables, adhering, of course, to the data types.

And I want to stress again the usefulness of COPY. COPY is a PostgreSQL command that's part of its SQL language; other databases require a separate executable utility just to get data into the database, but here you don't need to jump across different software just to load yours. Postgres has a great facility in that aspect: a very fast bulk loader.

For documentation, Postgres has a comments feature at both the table and the column level, and this helps with the documentation challenge: annotating the originating organization and the corresponding URLs the data derived from. And guess what — it's queryable. You can run a query that outputs the comments from all 21 tables, pointing to their sources and URLs. So for everyday and future needs — a one-off request, a search for something — you no longer have to jump to those metadata files and navigate through file systems. Everything sits with the data, so it moves with the data; you no longer have to worry about the challenge of tracking all those metadata files. Everything is set up in one central access point.

Data analysis is really the next step: once you have your connection, your schema, and your populated data, you can start using it. One great aspect of having a back-end database engine is that you now have access to SQL, a special-purpose declarative language adopted across the relational database industry — and Postgres keeps true to the standard. As you can see, it's a very compact statement that does a lot under the hood: here we're joining three different data sources and then running aggregations on them. The counterpart in R for the same aggregation is to import three different data frames, run a chained or iterative merge, and then run multiple aggregations, possibly also in a chain, just to get that year- and month-level analysis. With SQL, it's one compact statement that yields a nice data frame.

Or you can stay at the granular level, without aggregation: here we're just tracking date and month across those same metrics — carbon ppm by month, US energy consumption, and US CO2 emissions — and I leave R to do the final step of data visualization, specifically a time series plot of those three metrics together across time, from the 1970s to the 2010s. If you noticed what I just did: I let Postgres do the heavy lifting on the data and had R do the analytical solution, with both tools working in their specializations. That's one great aspect of connecting these two tools together. Here is yet another example of a separate aggregation — renewable energy production and consumption — so you can change your plotting needs accordingly (a sketch of this overall load-document-query-plot pattern follows below).
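A minimal sketch of that flow, assuming hypothetical table and column names (global_temperature, carbon_ppm, us_energy, us_emissions) standing in for the demo database's actual schema:

```r
library(DBI)

# Bulk-load a CSV with COPY (server-side file path; psql's \copy is the
# client-side analogue if the file lives on your own machine)
dbExecute(con, "
  COPY global_temperature (year, month, avg_temp)
  FROM '/data/nasa_global_temp.csv' CSV HEADER
")

# Document the table: annotate the source organization and URL
dbExecute(con, "
  COMMENT ON TABLE global_temperature IS
  'NASA GISS Surface Temperature Analysis: https://data.giss.nasa.gov/gistemp/'
")

# Comments are queryable from the system catalogs
dbGetQuery(con, "
  SELECT c.relname AS table_name,
         obj_description(c.oid, 'pg_class') AS source_comment
  FROM pg_class c
  WHERE c.relkind = 'r'
    AND obj_description(c.oid, 'pg_class') IS NOT NULL
")

# One compact statement: join three sources, aggregate by year and month
agg_df <- dbGetQuery(con, "
  SELECT c.year, c.month,
         AVG(c.ppm)         AS avg_carbon_ppm,
         SUM(e.consumption) AS us_energy_consumption,
         SUM(m.co2)         AS us_co2_emissions
  FROM carbon_ppm c
  JOIN us_energy e    ON e.year = c.year AND e.month = c.month
  JOIN us_emissions m ON m.year = c.year AND m.month = c.month
  GROUP BY c.year, c.month
  ORDER BY c.year, c.month
")

# R handles the final step: a time series plot of the three metrics
library(ggplot2)
library(tidyr)
agg_long <- pivot_longer(agg_df, cols = -c(year, month),
                         names_to = "metric", values_to = "value")
ggplot(agg_long, aes(x = year + (month - 1) / 12, y = value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y", ncol = 1) +
  labs(x = "Year", y = NULL)
```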
That renewable energy query now drives an iterative, faceted type of time series, sliced and diced by energy type — biofuels, geothermal, hydroelectric, solar, waste, wind, and wood energy, plus total biomass and total renewable energy — across, again, production and consumption over time, from the 1970s through the 2010s.

And you can get more sophisticated. This is now a dual aggregation using a common table expression, where we run two different aggregations and then join them together: US primary energy consumption by decade and sector and, likewise, US CO2 emissions by decade and sector. We get a nice imported data frame and can then render the final solution where R works best: US primary energy consumption by sector, 1940s to 2010s — the commercial, electric power, industrial, residential, and transportation sectors — and similarly for US CO2 emissions.

So as you can see, you can be very sophisticated while still working in tandem with a back-end resource. Parameterized queries are a very popular industry process for sending data from the client, R, into the back end for a more dynamic setup. Here is a prepared statement where I pass literals — the transportation, residential, and commercial sector parameters — to render a nice long data frame: a more dynamic approach than repetitively rewriting the same SQL. With that, you're able to get a nice data frame for your plotting needs. Here again is another faceted, grouped bar plot by sector — US CO2 emissions, 1970s to 2010s — but this time it plots fuel types, to see how they compare with each other over time.

Databases can also help you broaden your lens for more comprehensive coverage. This time we're running a multi-way join of up to seven sources, still in one compact statement, using a common table expression. This pairs human population with many of the other metrics in the database, all joined together by year, and we're then able to render scatter plots of world population against all the other global metrics: world population with arable land; world population with animal counts from the IUCN Red List threat-level data; human population with plant counts from Botanic Gardens Conservation International's plant assessments list; human population with Arctic and Antarctic sea ice from the National Snow and Ice Data Center; NOAA's ocean data, total carbon dioxide and the pH scale, trying to measure the acidification of the oceans; and global temperature from NASA — trying to see whether there are any discernible patterns, positive or negative, associated with population. And naturally we do see them, as the literature suggests. (A sketch of the CTE and prepared-statement patterns follows below.)
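A minimal sketch of those two patterns — the CTE dual aggregation and the prepared statement — again with hypothetical table and column names; the $1 placeholder style shown here is supported by the RPostgres driver, a DBI-family alternative to RPostgreSQL:

```r
library(DBI)

# Dual aggregation via a common table expression: two GROUP BYs joined
# by decade and sector in one compact statement
decade_df <- dbGetQuery(con, "
  WITH energy AS (
    SELECT (year / 10) * 10 AS decade, sector,
           SUM(consumption) AS total_energy
    FROM us_energy
    GROUP BY 1, 2
  ), emissions AS (
    SELECT (year / 10) * 10 AS decade, sector,
           SUM(co2) AS total_co2
    FROM us_emissions
    GROUP BY 1, 2
  )
  SELECT decade, sector, e.total_energy, m.total_co2
  FROM energy e
  JOIN emissions m USING (decade, sector)
  ORDER BY decade, sector
")

# Prepared statement: bind sector literals one at a time instead of
# rewriting the same SQL for each sector
stmt <- dbSendQuery(con, "
  SELECT year, sector, fuel_type, SUM(co2) AS total_co2
  FROM us_emissions
  WHERE sector = $1
  GROUP BY year, sector, fuel_type
")
long_df <- do.call(rbind, lapply(
  c("transportation", "residential", "commercial"),
  function(s) { dbBind(stmt, list(s)); dbFetch(stmt) }
))
dbClearResult(stmt)
```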
And then one great aspect is that you can seamlessly move solutions: you can run a Shiny solution powered simply by your database. So you can set up a very sophisticated analytical solution and move seamlessly across environments with a backing database, running the same queries and the same setup; you no longer have to drag along all those data files and metadata — everything is intact in one place. You can even work in other environments like Python, with this Flask application powered by the same backing database.

So you can really leverage and expand your whole solution if you use Postgres as a tool in your pipeline: it's a very robust big data solution; it facilitates access and data management; you leverage a query engine with the power and expressiveness of SQL; and you support a reproducible process overall, keeping things intact for replication and reproducibility. Overall, with Postgres and R, you're working in a very streamlined open-source ecosystem, one you can even set up with a Docker container. I hope I was able to demonstrate, and really advocate for many of you, the move to a relational database solution. Thank you so much for your time.