 Hi there, it's Monica Wahee giving you my usual unsolicited advice about combining public health and data science. One focus I have is trying to teach people who analyze data about how applications are built and their structure. Nowadays, there is this expectation that people who analyze data can just figure out how to analyze data that originate in applications like web applications and apps on your phone that you can use. But actually, it can be very hard for people trained in biostatistics and public health to figure out how to analyze data originating in an app. That's why you will see I have a lot of educational content on my YouTube channel and my blog explaining applications to people in public health. I feel like if you study public health and you don't study business, you can easily miss some very basic concepts when it comes to applications. Okay, now we get to another basic concept that relates to applications and that has to do with the structure used to store the data in the backend. There are basically two structures, normal form, which is also called relational, and flat. Since normal form is popular now, let's start by talking about normal form. Normal form started getting popular in about the year 2000, and this was mainly due to technological advancement. Structured query language, known as SQL or SQL, had already been contemplated and invented, but it didn't work very well before 2000. That's because it is based on a structure with a lot of different tables connected by keys or indexes. If you want to join a bunch of tables when you execute a query, you are going to need a lot of technological power, and we just didn't have that power until about 2000. So now, let's briefly look at the right side of the slide and talk about our so-called legacy structure, which is the flat structure. I say legacy because it is old, but also because some of these flat databases are still running, and we are still using them in our daily work. But new databases set up today are never flat anymore, because it is just too hard to run a flat data system today. I mean, they are awesome in some ways, and that they are extremely fast, even today, and they are extremely stable. They literally never crash. So why don't we make all of the databases flat if they are so fast, even today, and they never crash? Well, like I said, it's because it's just too expensive. It is really expensive to keep the old ones running, and eventually, they will probably have to be replaced. The main problem with flat databases is it is a huge deal to modify them, even in minor ways. If you want to add another field to your database anywhere, it's like a Broadway production. The way I was taught to think of a flat database is like a big endless Excel spreadsheet, a big Excel sheet in the sky. Now imagine certain cells are cordoned off to hold a table of data, like maybe a list of providers at a clinic. We decide we're going to allocate a set of cells consisting of so many rows and so many columns, and that patch of cells needs to hold all the columns and rows we will ever need to express the providers at this clinic, ever in the history of the database. And all the other sections of this huge theoretical spreadsheet are cordoned off and designated for different data. So I thought I'd give you an example of the same data in two different formats so you could better visualize what I mean when I say normal data versus flat data. In one of my videos, I mentioned a ride sharing app. I take ride sharing trips in Boston, and the drivers told me that they can download their data from their trips using an app. So that's what's on the left side of the slide, my concept of what the ride sharing drivers might have access to as far as data from their trips. Now what the ride sharing company has in its data stores is a table with a list of drivers who are operating. Each of those drivers makes multiple trips, and each trip has multiple attributes, like time of trip, customer taking the trip, and the beginning and ending locations. But the driver is the same driver each time. Let's say the driver's name is Mariel, and her driver ID is 1234. Let's say Mariel drives a Black Honda Accord. So for each trip where the driver is Mariel, it's going to be driver 1234, and it's going to be a Black Honda Accord. Which is why that information about Mariel the driver is kept separately from the trip information in the normal setting on the left side of the slide. It's because Mariel's information about her is always the same and does not need to be repeated. But if you look at her trips, each trip has unique attributes. So each trip needs its own row in the ride sharing trips database. And how we link them is through a field called an index. Another name for this field is a key. So each of these databases has a primary key, which is a unique value for each row. I think for Mariel, in the ride sharing drivers database, her primary key is 1234, because that's her ID. And then let's look at the ride sharing trips database again. So that has its own primary key, which is unique for each row. But it also has a foreign key, meaning a field linking each row to the correct driver in the ride sharing drivers database. So for all of Mariel's trips in the ride sharing trips database, it has a foreign key field that says it's for driver 1234. Okay, so that's how the database is structured if it is normalized. If it is flat on the right side of the slide, what happens is that we basically start with the larger database, the ride sharing trips database. That's larger because there are obviously more trips than drivers. So we have that whole table, and then we tack on a few fields from the ride sharing drivers database. We already have the primary key, which is the foreign key in the ride sharing trips database, and for Mariel, that's 1234. But we can add other fields, like we can add her name, Mariel, and that she drives a black Honda Accord. But since there is only one big flat table, those values will be repeated in each row for Mariel. And that leads us to the challenge of dealing with data from databases that were not properly normalized. So you can't tell if you are dealing with duplicate data or what's going on. Or it also brings up the challenge of dealing with data that come from multiple relational tables, and it's hard to figure out how to put it together. These are data science challenges. If you are into health data analytics and you want to tackle data science challenges, consider taking my online applications basics course. This is one of the first courses you will take if you join my online group mentoring program aimed at people in health data analytics who want to go into data science. If you have a background in public health or health care and want to take your career in a more data science direction, then check out my mentoring program. I'll put links to the course and information on the mentoring program in the description to this video. Thank you for watching, and I hope you learned a little thing or two about using data coming from applications today. If you did, please hit the like button and leave a comment about what you learned. Thanks and have a great day. Thank you for watching this video, which is part of the Public Health to Data Science rebrand program. If you are interested in joining the program, please sign up for a 30 minute Zoom interview using the link in the description.