Hello. Okay. Hello everyone, and welcome to my talk on the Lost Art of Database Design. So how many people here are DBAs? One, one, two, two-ish? Okay, how many people are developers? Yay, developers. Okay, so we're going to talk to you about that. How many people have heard of Percona before? Oh, okay. That's half. That's pretty decent. So I am the head of open source strategy at Percona. You can find me at MYonkovit. We have a podcast called The HOSS Talks FOSS. I'm the HOSS, head of open source strategy, and it's Talks FOSS, for open source software. It rhymes. It's awesome branding. So you can drop me a mail if you want. But Percona is an open source database provider where we provide services, software, and tooling around MySQL, Postgres, and MongoDB. And so what I'm here to talk to you about is something that many of you may like, others may hate, but let's talk about databases. Databases for a lot of folks are uncool. Now I don't know how you feel about this. I see you taking a deep breath like, oh wait, this is wrong. But a lot of people think databases are boring, especially when you talk from a development perspective, because they're the necessary evil. When you talk to people who are designing new applications, the databases tend to get thought of last, right? Developers and managers, you know, executives, they often think about code and the end product. They don't often think about the infrastructure side of things. Who cares about database design here? Does anybody really care about database design? Okay, a couple of people. Well, I'm going to try and make the rest of you care. So you know who knows that databases are uncool? This guy, that's right. He knows that databases are uncool. There are a lot of vendors out in the database space who know that it's uncool and are trying to convince you that they're not a database and that their product is, therefore, cool.
So you might hear a lot of marketing fluff around databases and database design, things like schema-less databases, or, oh, you can use our ORM and you have to write zero code for the database. You can just use our API. You can just store it as native JSON, right? The database-as-a-service journey, where it's just fully managed. You click a button and it just works. All of these are trying to sell to that preference where developers want to think less about the database. They want to think more about their application. And so there's a lot of buzz around new terms and technologies that try and solve this uncool factor. Okay? Welcome. Welcome. So database design is kind of like... oh, yeah, evidently it locks if you shut it. So isn't that awesome? Okay. So when we talk about database design topics, we often talk about how database design is like the plumbing for your house, right? So think about whenever you bought a new house: have any of you really spent a lot of time looking at the plumbing? Like other than maybe, you know, you use the toilet or the sink or something, but do you really crawl into the crawl space or into the basement and look at the plumbing? Not many people do. And in fact, a lot of people who build their houses think of this as just this, you know, magic thing that happens underneath the hood. It only works until it comes back and bites you. And then that's going to cause you significant issues. It's kind of like that house that has great curb appeal, but when it has a bad foundation, it eventually can fall down, right? And so we want to avoid that from happening. And so this is why we need to focus on database design that is, you know, optimal for your application, because not only will you get a better foundation for your application, you're going to get better performance, use less space, have lower costs, have better security, you're going to have easier migrations and release cycles and a better user experience.
All these things are really important. So let's talk about some of those database design topics and the things that most people overlook and that we should really consider. Now I'm going to start with Captain Obvious here. If you have been designing applications and have databases at their core forever, then you probably already know some of these. So you might be like, oh, that's so obvious. But you would be surprised how many people miss these. This is something that happens over and over again, right? And so let's start with some of these obvious ones, okay? Schema and design, okay? Now if you are using a relational database, if you are using a non-relational database, thinking about the data structures is incredibly important and it is the foundation for everything else. There are different databases that have pros and cons that you can work with. Each workload that you might deploy might benefit from one database over another, but it will also benefit from the design in your schema. If you are running something that's aggregating a lot of data, having things or buckets or tables that can aggregate that data for you, pre-populated or running materialized views, things like that, that can really help there. If you have the right data types and the right schema, you're going to have more space efficiency. And I'm going to talk about that in a second. And then if you have the right indexes, you're going to have better performance. But when we talk about schema design, remember how I said that the marketing folks like to tell you schemas don't matter anymore, schema-less. I'm going to say that schema-less doesn't exist. Maybe some people out here in the audience are running schema-less databases, but here's the funny thing about schema-less databases. It's really about unstructured data, and everybody likes unstructured data because they hate waiting for structures to change. 
So when you have to do a schema migration and you've got to wait forever for the table to alter or to move data around, it's a pain in the rear. So people look at the schema-less side as a way to break themselves from it. But the flexibility that comes with schema-less comes at a cost. And all of the vendors that you look at, like, for instance, MongoDB, they'll tell you that. They'll say, you shouldn't have to worry about your schema, but you should validate your schema afterwards. And so whether you're paying the price to validate what's in your tables and in your structures in the database or outside of the database, you're still doing some validation. It's still a best practice. And so when we talk about this, when you don't have this and just let it kind of grow, things slow down exponentially very, very quickly. Things get slow, they get a little over bloated, they cause all kinds of issues. And so you have to be careful when you start to look at these unstructured setups because there are downsides to them. And one of the big downsides I mentioned is space. And space is obviously the final frontier, but it's also the first frontier for performance, right? So when you think about a database design or structure, a lot of it comes down to how much space is being consumed within the database. And it's not just your queries that are potentially slow. It's also things like backups or your replicas, your dev test environments. There can be a substantial increase in the overall size of your environment if you're not careful. Because space is cheap, but it is not free. And that's a very important thing to realize. Laziness does cost money in this case. And I have seen where we've seen like just from a cost perspective, even though disk is cheap, we're able to do some like schema tuning. And have a massive decrease in the amount of costs per month, right? 
So when you look at an AWS environment and you get half off your bill just by optimizing your schema, that's a pretty significant savings. But unfortunately, we're all also data hoarders, right? We are. We love to hoard data. I don't think that anyone out here has ever heard from any sort of management, "No, no, you can get rid of that data." I mean, if you have, you're one of the few. Because everybody wants data kept for longer, kept for an infinite amount of time, and more of it kept, right? You just want more and more and more. And so when you look at that mentality of storing everything, you really have to understand what that long-term impact is. Now I'm going to give you an example that's out of the MySQL space. We worked for a social media company, and they used your email as the driver for their entire website. So when you log in, you set up your email, they'll go ahead and they'll connect you with other people who are in your email box. They'll find all kinds of things out about you. And every query, every access pattern for this particular social media company was using an email for their primary key. Now, in MySQL, there happens to be an interesting idiosyncrasy with using an email primary key. Primary keys in MySQL are stored in every subsequent secondary key. So the larger that field is, the larger every other index gets, because every index you add also includes it. So in the case of something that is a VARCHAR of, let's say, 200, you've got a very, very wide column. And even if you have another column that is, let's say, a true-false that you're trying to index, it will include that email address as well. And so you can see massive, massive amounts of wasted space. You also see this with a lot of ORMs where ORMs are using UUIDs, right? So if you're using a UUID for your ORM as the primary key, it's generally a VARCHAR(32), so a 32-character field. That's really large and that causes a similar issue.
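To put rough numbers on the wide-primary-key point, here is a minimal back-of-the-envelope sketch. Every figure in it is an assumption chosen for illustration (100 million rows, a 30-byte average email, an 8-byte integer key, a 1-byte flag column), and it ignores page overhead, fill factor, and per-record overhead, so treat it as an order-of-magnitude estimate rather than what any real MySQL instance would report.

```python
def secondary_index_bytes(rows: int, indexed_col_bytes: int, pk_bytes: int) -> int:
    """Very rough payload size of one secondary index.

    Assumes each index entry holds the indexed column plus a copy of the
    primary key, and ignores page headers, fill factor, and record overhead.
    """
    return rows * (indexed_col_bytes + pk_bytes)


ROWS = 100_000_000     # hypothetical table size
FLAG_BYTES = 1         # e.g. the true/false column being indexed
EMAIL_PK_BYTES = 30    # assumed average stored email length
BIGINT_PK_BYTES = 8    # an auto-increment 8-byte integer key

wide = secondary_index_bytes(ROWS, FLAG_BYTES, EMAIL_PK_BYTES)
narrow = secondary_index_bytes(ROWS, FLAG_BYTES, BIGINT_PK_BYTES)

print(f"email primary key:   {wide / 2**30:.2f} GiB per secondary index")
print(f"integer primary key: {narrow / 2**30:.2f} GiB per secondary index")
```

The gap repeats for every secondary index on the table, which is why the width of the primary key matters more than it first appears.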
So in the case of the social media company, what we ended up doing was converting them to an auto increment. We made the email address a unique key. Then we also used a numeric hash, because every time you have that primary key and you need to search for it, it's also in memory, right? So if we use a numeric hash for it, that's actually a smaller footprint in memory. So we could fit more in memory, we could fit more on disk, and we actually saved them, it was like $10 million a year on hosting. It was ridiculous how much money and performance they got back from that effort. And when you look at that example from MySQL, understanding how this looks, you've got the VARCHAR(32) at the top. And let's just say you're storing the string 123456, which is numeric, but people store that as text more frequently than I would like to admit. Storing 100 million data points is two gigs worth of space, whereas if you store that as an integer, it's only 381 megabytes. Now, you might say two gigs, that's cheap, but multiply this by hundreds, if not thousands, of columns in your database, and this really, really adds up. And so this is something you have to be very careful of. Similarly, each database has its own data types that are specific to its environment, so you can have other data types that can benefit from this as well. So for instance, if you were going to use, let's say, a country name, you can store that as a VARCHAR, and that will work. But as a VARCHAR, that's a very large field. It's something that's going to take up a lot of space. There are ways that you could then maybe use a lookup table and have an integer lookup, and then you're going to store the integer. Or you could use an enum field. And so this is where it's really important to understand that each database, whether it is MySQL or Postgres or Oracle or SQL Server, is going to have special data types for you to use.
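The integer figure lines up with simple arithmetic: 100 million rows times a 4-byte integer. The sketch below checks that, and also shows one way a numeric hash of an email could be computed. CRC32 is purely an assumption here, since the talk does not say which hash was used, and the two-gig figure for the text column is not reproduced because it depends on character set and row overhead.

```python
import zlib

ROWS = 100_000_000
INT_BYTES = 4  # a 4-byte integer column, e.g. MySQL INT

int_mib = ROWS * INT_BYTES / 2**20
print(f"{ROWS:,} integers: about {int_mib:.0f} MiB")


def email_hash(email: str) -> int:
    """32-bit numeric hash of a normalized email address.

    CRC32 is an illustrative choice, not the hash from the talk. Collisions
    are possible, so a lookup by hash should still compare the full email.
    """
    normalized = email.strip().lower()
    return zlib.crc32(normalized.encode("utf-8"))


print(email_hash("someone@example.com"))
```

A 32-bit hash fits in the same 4 bytes as an integer column, so an index on it stays small while the full email lives once in its own unique-keyed column.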
And it is really, really important that you understand where you can use them. For instance, like the inet data type. That's specifically for IP addresses. Because a lot of people, hey, they need to store IP addresses. But IP addresses aren't natively a thing. In Postgres, there's a UUID type, or an integer type, or a big integer type. Again, all of these are things that you can use to help reduce the cost of things. Do I hear a band? Wow, okay. I guess, thank you for coming here instead of the party room. But okay, so moving right along. So as we think about these different data types and how data is going to be accessed, you also have to think about the different access patterns that you have. And so each of the different access patterns that are out there, they may do similar things. But in the end, how you're accessing and manipulating data is critical to building out that schema and the design and figuring out where you need to add things. So that leads you to a discussion on indexing, right? So understanding that you shouldn't index everything. That's one of those, if anybody attended my 10 deadly talks, don't overindex or underindex, you have to get it just right. But you have to understand how your data is going to be accessed on a regular basis and optimize for that access pattern. Look for those common search patterns and make sure that you're adding indexes in a way that makes sense. And realize it's going to change. So just because you released something today that's working, doesn't mean tomorrow it will. I mean, how often are people releasing nowadays? It seems like some people do continuous deployment, right? They're releasing every day. Other people, they're still releasing weekly, monthly, something like that. But every change that you make will change your access patterns. And you have to understand that.
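Going back to the IP address point, here is a small sketch of why a dedicated type is smaller than text. It uses Python's standard `ipaddress` module only to illustrate the sizes involved; a database type such as Postgres `inet` handles this for you, and the helper names below are made up for the example.

```python
import ipaddress


def ipv4_to_int(text: str) -> int:
    """Convert dotted-quad text to the 32-bit integer it represents."""
    return int(ipaddress.IPv4Address(text))


def int_to_ipv4(value: int) -> str:
    """Convert a 32-bit integer back to dotted-quad text."""
    return str(ipaddress.IPv4Address(value))


addr = "192.168.1.10"
as_int = ipv4_to_int(addr)
packed = ipaddress.IPv4Address(addr).packed  # raw 4-byte form

print(as_int, int_to_ipv4(as_int))
print(len(addr), "characters as text vs", len(packed), "bytes packed")
```

Beyond the space saving, a native address type or an integer also sorts and range-scans in address order, which text does not.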
Now, I want to mention that probably one of the best advice that I can give when we talk about designing schemas, designing databases is not to be an idiot. So, captain obvious, right? But keep things simple. And I have over designed things like just crazy over design. And I think that a lot of people overthink how they want to implement something because they think that in the future it is possible that something is going to be used a little differently or they're pre-thinking future use cases. I used to work at this company while I was doing consulting for this company who did voting. And so they were doing like online voting. So when you would watch a presidential election, you would then vote on whether you liked the person or you didn't like the answer and so they would track all this. And so they were having some significant issues. And the designer, the architect for this application thought, I don't know what sort of demographics I might have about these people. So he decided to use a bit mask for all of his demographic information in a Varchar field, okay? So you pull out this really long string, okay? And if it's in the first character, a one is male, a zero is female. If it's in the second character, it's going to tell you whether they are Republican or Democrat. And so it went on so on so forth. And so he broke that up that way. Well, there are ways to create either columns or create different ways to get that level of flexibility. But by putting it in a bit mask, which is a very design development or application centric thought process, it actually caused more issues than you would think because that can't be effectively indexed. So picking out just the 54th character of this to see if it's a zero or a one, that's not an efficient pattern that the database can handle. So understanding what you can do in terms of handling or passing down to the database and letting the database handle it is incredibly important. 
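To see why a bit mask packed into a string is hard on the database, here is a small sketch that asks a query planner how it would run each style of query. It uses SQLite from Python's standard library as a stand-in (the system in the story was not SQLite), and the table and column names are hypothetical.

```python
import sqlite3


def plan(conn, sql):
    """Return the query-plan detail strings SQLite reports for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Demographics packed into one string, one character per attribute.
    CREATE TABLE voters_flags (id INTEGER PRIMARY KEY, flags TEXT);
    CREATE INDEX idx_flags ON voters_flags (flags);

    -- The same information as ordinary columns.
    CREATE TABLE voters_cols (id INTEGER PRIMARY KEY, gender TEXT, party TEXT);
    CREATE INDEX idx_party ON voters_cols (party);
""")

# Picking one character out of the string: the index on flags cannot be
# used to seek, so every entry has to be examined.
flag_plan = plan(conn, "SELECT id FROM voters_flags WHERE substr(flags, 2, 1) = '1'")

# Filtering on a real column: the planner can seek straight into the index.
col_plan = plan(conn, "SELECT id FROM voters_cols WHERE party = 'R'")

print("bit mask in a string:", flag_plan)
print("separate column:     ", col_plan)
```

The first plan should report a scan and the second a search. The exact wording of the plan text varies between SQLite versions, and other databases have their own planners, but the underlying limitation is the same.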
And so that's why you have to remember, okay? That database companies have been spending lots of money, lots of effort. The communities have been spending lots of time and effort to create these database features that are designed to solve these problems. And overthinking this and thinking that you can do a better job than a lot of what's out there already pre-built in the database is one of those things that makes me scratch my head. Now I see this often. In fact, there was another company who had this awesome idea to make a social network for your music. So every time you listen to music, it would find other people listening to the same music and back and forth. But they didn't trust that the database could join records and join two lists together. So they used the database to upload a list, your playlist. And then they had a list already off to the side of all the music that was out there in the world. And then they loaded it into a Java app and they looped through it. And then every time they found one, then they went out and they pulled out another table and then they looped through that. And it was all done in Java. Now, this was a brilliant application, I'm being sarcastic. But the brilliant application that this was would take a list of playlists every night, move them over, and then it would process them. Because of how they had built the infrastructure for this and because they decided to rewrite joins, to process a thousand playlists took seven days. So that means that if they have a thousand users, they're already seven days behind when the next day happens, right? And so it just is a perpetual issue that continually causes problems. So realize that there has been a lot of time, money, and effort by many people in the database ecosystem to develop things like the right indexes, like the right data types, like joins, like encryption features. And so while you can write your own, the question is should you? 
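What those nested loops were doing by hand is what a join does. As a minimal sketch, again using SQLite from Python's standard library as a stand-in and a hypothetical one-table schema, finding other listeners who share tracks with a given user is a single self-join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE playlist_items (
        user_id  INTEGER NOT NULL,
        track_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, track_id)
    );
    INSERT INTO playlist_items VALUES (10, 1), (10, 2), (20, 2), (20, 3), (30, 3);
""")


def listeners_in_common(conn, user_id):
    """Other users sharing at least one track with user_id, with overlap counts."""
    sql = """
        SELECT other.user_id, COUNT(*) AS shared_tracks
        FROM playlist_items AS mine
        JOIN playlist_items AS other
          ON other.track_id = mine.track_id
         AND other.user_id <> mine.user_id
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        ORDER BY shared_tracks DESC, other.user_id
    """
    return conn.execute(sql, (user_id,)).fetchall()


print(listeners_in_common(conn, 20))
```

The database gets to pick the join strategy and use the primary key index, and only the matches travel back to the application instead of both full lists.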
And it is incredibly important to have that answer. And if the database doesn't have this feature that you really need, the question is, are you using the right database for what you're trying to do? There are a lot of different database choices out there, right? So you can use MySQL, you can use Postgres, you can use Mongo, you can use Oracle, you can use Cassandra, you can use, I mean, so many, and all of them have unique benefits and features. And so if there's something that you're trying to do that's better as a graph database, and you're trying to do it in a relational database, ask yourself why, right? So don't force an application into something that doesn't fit, and don't try and make these design decisions to overcome those limitations if you can avoid it. So when we talk about that access pattern again, there's a couple things that keep on coming up, all right? "Just in case" thinking is very poor design, okay? I don't know if anybody has thought about this or has done this, but let's take SELECT FOR UPDATE, right? So SELECT FOR UPDATE locks the rows you selected, it locks the data, because it thinks you're going to update it. But what happens when you don't update it? It stays locked until you release it. And so a lot of people actually are doing a SELECT FOR UPDATE. This is actually one of the biggest issues we see from our managed service team: oh, the database is locked. It's slow because people selected all this data for update and then never decided to do anything with it. That's a wasted cycle, okay? Now, I've also seen where people are doing SELECT * needlessly. This happens a lot with ORMs, where ORMs want to return everything. So I was working with a big online auction house several years ago. And their ORM, which was in Ruby on Rails at the time, did a SELECT * for everything.
And what was great was, not great, they had this notification pop-up, where every time the price changed, they wanted to update, like, somebody, and just pop up a little value. Well, the actual callback to the database selected the entire auction listing, returned it back to the system, threw everything away but the price, and then threw the price up. So they actually saturated the bandwidth between the application servers and the database servers with all this wasted space. So "just in case" or "just because" design is really bad. And so understanding that, but also thinking about what's the common usage that's going to happen, is important as well, right? So if you are going to be using this for certain features, there are ways to get data that is pre-aggregated. You can normalize some data. You can make sure that your access pattern is optimized for what you're returning. And you can also look at using external components, but realize that there are external components and events that happen that are going to impact your system and could change how your data access works. So for instance, if you work for a company that does accounting, during tax season you're way busier than any other time of the year, right? And so that means you're going to have an event. If you work for a company that is really active during Super Bowl Sunday, Super Bowl Sunday is a huge event. Black Friday, right? Shopping season, Singles' Day in China. These are days where events happen and anything that you planned for goes out the window really quickly. And so think first about those access patterns and make sure that you have designed for them. Now, before you decide to start coding, okay? Realize that any decision you make before code is written into a terminal screen or a, you know, editor, that is going to have the biggest impact on your system. So thinking up front, it's going to have a monumental impact on the performance, the scalability, the security of that system.
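The auction price pop-up is easy to sketch. The example below uses SQLite from Python's standard library as a stand-in, with a hypothetical `listings` table whose description is deliberately large, and compares fetching the whole row against fetching only the column the notification needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id INTEGER PRIMARY KEY, title TEXT, description TEXT, price_cents INTEGER)"
)
# A long description stands in for everything else on an auction listing.
conn.execute(
    "INSERT INTO listings VALUES (1, 'Vintage lamp', ?, 4500)",
    ("x" * 50_000,),
)


def whole_listing(conn, listing_id):
    """What the ORM did: pull every column, then discard most of them."""
    return conn.execute(
        "SELECT * FROM listings WHERE id = ?", (listing_id,)
    ).fetchone()


def current_price(conn, listing_id):
    """What the pop-up needed: one integer."""
    row = conn.execute(
        "SELECT price_cents FROM listings WHERE id = ?", (listing_id,)
    ).fetchone()
    return row[0] if row else None


wide_row = whole_listing(conn, 1)
print("characters pulled by SELECT *:", sum(len(str(col)) for col in wide_row))
print("price only:", current_price(conn, 1))
```

Most ORMs offer some way to select specific columns; the point is to ask for only what the feature uses, especially on a code path that fires on every price change.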
Because after all, think of it like this: if a drive fails, you can pull the drive out and replace it. If you need a bigger instance, we all have the capability to upgrade to the next instance size. But if you design a piece of crap application, you have to redesign the whole thing, right? It's not so easy to band-aid design decisions. So keep in mind that the schema side is important. Those access patterns are important. But there is more to think about, right? Thinking beyond just the schema, beyond just, you know, the code side of things, because it's all connected, right? What you choose to do, how you choose to interact and build your application, it's going to matter. And it starts with deciding and talking about the stack that you're going to use and understanding all the idiosyncrasies and all the interactions between different components, okay? You know, sometimes you don't have a choice. Sometimes you're forced into, you know, deciding on a stack. You're going to, you know, use, you know, this database, this, you know, framework, this application. But what you are storing and how you build your application is going to matter. Now, let's use this example. I like to use a gaming app as an example. So when you start to design on paper, right? You're like, okay, I'm going to build a new mobile app game on my phone or, you know, for the iPhone, I'm going to have some application servers. Maybe I'll have some cache later on. But, you know, this is kind of the simplistic model that a lot of people think of and start with, when realistically it looks more like this, where there are lots of external services, there's lots of different databases and there's lots of different components that are going to impact the performance and the overall scalability of the system. And how many people play like, you know, games? Like game on their PC or Xbox? Yeah, yeah, fair number.
Okay, how many people have ever, you know, got the game like first week and there's nothing but problems, right? Yeah, and how many people love that experience? They live for that experience. Okay, the fun thing is this is the number one issue that causes game problems, like game launch problems. So week one, when you launch a new game, what tends to happen is you have, you know, the database, this core really worked out well. So this is battle tested. This is 100%. You've gone through this process a thousand times. You have, you know, 10,000 beta testers who are on this and it just works. But then you add in all this other stuff, right? So matchmaking, oh, we forgot to benchmark that. We forgot to test that. That's this extra, you know, service off to the side. Or the leaderboard, you know. These are the things that cause most of those outages. It's crazy how it's a different process than most people think. And so when you talk about, like from a, you know, database design perspective, you know, and you talk about the data that's in your core database, you also have to think about all these ancillary systems and how they're going to have to be used or accessed in order to get your application working. Now, as management changes, you're also going to start to see that there are different people who come in and will change that stack. So I don't know if anybody's experienced this. How many people have gotten a new boss and they come in and they go, we're going to switch cloud providers or we're going to switch, you know, oh, I hear laughs. Or we're going to switch programming languages. Like, I'm a node guy. No, I'm a go guy, you know. And they do. They come in, they make those changes. And so you have to realize that even if you're setting this up right away correctly, things are going to evolve and change. And so you have to avoid some of those stack politics. But as those things change, realize that even the small things can matter. 
So I did a benchmark study on different versions of Python and different MySQL client libraries. Right? And so you would think that Python versions, like who cares about Python versions? But when you look at the difference between the number of users that could be handled in MySQL for Python 3.10 versus 3.9.7, certain drivers saw a massive regression in overall, you know, scalability. And one driver didn't. So if you just happen to be using the driver that is your unlucky driver, or you decided to, you know, set up a new system and it was using it, and you were using 3.10, you would have a potentially pretty hefty penalty for that. Right? And so small things matter. And so you have to realize that the interconnectivity of all of these different components could have an overall impact on the scalability of your systems. Now, we all try to put our trust in technology, but are we putting too much trust in technology? Because a lot of people aren't really thinking about the implications of what they're deploying, especially on the database side. So I've had several conversations this week at our booth where people come by and it'll be like, you know, "So what database do you use?" "Well, I use SQL." "Oh, you use SQL Server?" "No, I use SQL." "SQL Server?" "No." "MySQL?" "No." "PostgreSQL?" "No. SQL." And it's like, well, what do you mean? And it's like, well, I just click the button. It starts up. And it's like, but so you don't really understand what's underneath the hood. And, you know, we often put that trust into the different technology components without understanding what happens when we deploy and use those technologies. Right? And here's the fun thing. You can make anything work, right, if you have enough time and effort. Right? Any database, any cloud provider, any libraries, any programming languages, you can make it do all kinds of crazy, unholy things. Doesn't necessarily mean that you should. Right?
And the success or failure is generally not predicated on our technology choices. It's how we design these systems and how we build them to use those components that's going to matter the most. And it's very, you know, important to get that infrastructure right. And right now, that's even more true in the database space because there is a database out there for almost every workload. Okay? So if you have, you know, the, you know, the need for analytics, if you have the need for time series, if you have the need for logging, there are specialized databases for each one of these. And if you use them for something that it's not intended for, it could work. But, you know, that's really sometimes more like, you know, putting the square peg in the round hole. It can work, but doesn't necessarily, you know, work well. And so if you haven't taken care of that engine and you haven't set that up correctly, then that awesome car that you built, because you might have this great application framework, you might have this great UI, you might have this great idea, but that doesn't necessarily mean that it's going to perform well. And because you're putting a really tiny motor in a sports car, it's just not going to work. So trust but verify when we talk about those technology stacks, right? Make sure that you are understanding the components and really understand who's responsible for that type of work. Because who manages the infrastructure and who is building the applications and who's interacting. You know, that is incredibly overlooked, right? So who's responsible for those types of components? Now this is not necessarily a design decision, but understanding how the systems work together just like that big picture that I drew of all the different components, that's going to be your key to success. 
And so, you know, we say like, oh well what about the cloud because, you know, again, there's these people who are out there saying, hey, we know databases are uncool, so shouldn't you just deploy in the, you know, the cloud and then it's just all magic. Well, when we talk about that, you have to understand that there's a shared responsibility model for all cloud providers. Which is a nice way of saying it's on you except for certain components that we agree are on us. And there is a fine line. So you are responsible for that architecture design of your systems, okay? You are ultimately the person in charge of that. You are in charge of configuring and tuning. You're in charge of the optimization. The cloud providers make sure that your hardware's there, that the systems are provisioned. But all of the hard work, all of that design work, that's on you. Now, how many people are running their databases in their cloud native deployments yet? Anybody doing Kubernetes databases? Only one, two. Okay. So, you know, everybody's running cloud native, right? And so everybody's doing microservices. And the fun thing is every microservice wants its own database. So that means that, you know, if you used to have one monolithic database, now you have 10,000 monolithic databases because they're all forgotten into the ether. But each of the databases that you might deploy are going to have a different level of scalability and usability if you're deploying via Kubernetes. And you're trying to deploy in a cloud native environment. So, you know, MySQL and Postgres, they're great at certain things, but they're not necessarily great at, you know, that scaling and that sharding or that capability to extend and expand. But MySQL and Postgres are very battle tested and they're used, you know, for a lot of applications where the data reliability is absolutely a requirement. So, keep in mind, if you're going to be deploying cloud native databases, there's some things to consider. 
Now, how many people have heard the term NewSQL? Oh, okay, one, two, okay. So, you know, NewSQL is, you know, the idea that you can scale your databases automatically and it will automatically handle everything behind the scenes. So, remember I talked about space, right? So, typically, the more you store in a database, the slower things get, okay? The more memory that's going to be used, the more disk that's going to be consumed, the more CPU cycles you're going to use. So, bigger databases equal slower databases. So, when we talk about NewSQL, what they're really looking at doing is taking the concept of sharding, okay, which is taking your dataset and breaking it into smaller datasets and distributing it throughout different database servers and different quote-unquote shards and then presenting that as one unified database. And so, they're trying to do this automatically, but sharding's been around for, phew, gosh, you know, 20 years, 30 years. I mean, sharding's a concept that's been around since I've been doing stuff. Anybody sharding their data? Okay, as you get bigger, it's something that's often looked at. But you've got a couple different NewSQL, you know, players out there: TiDB, Cockroach, Yugabyte. They offer great extendability, you know, because they're handling everything behind the scenes and trying to mask the complexity from you. But there are blind spots when it comes to application workload. Just like any of the other databases or any of the other designs, if you write things the wrong way or you're using it for the wrong use case, it can definitely slow down. And then there are other technologies like Vitess or Citus, if you're using MySQL or Postgres, that are designed specifically to be a framework on top of your database that does sharding over what your common database technologies are. So if anybody's heard of PlanetScale, that's built on Vitess, and Microsoft owns Citus now, so there are big companies behind them.
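The core idea of sharding, breaking one dataset into smaller ones and routing each request to the right server, can be sketched in a few lines. This is the simplest hash-modulo form, written only to illustrate the concept; it is not how any of the products named above route data, and the shard names are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number using a stable hash (hash-modulo routing)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical


def route(user_id: str) -> str:
    """Pick the server that holds this user's rows."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


for uid in ("user-1", "user-2", "user-3"):
    print(uid, "->", route(uid))
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps routing consistent across application servers. The weakness of plain modulo routing is that changing the number of shards remaps most keys, which is one reason real systems tend to use key ranges or consistent hashing plus a rebalancing process.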
But as your data gets larger, you're going to have to look at how you handle that. And so these are effective ways or tools that are already out there. So finally, I want to leave you with this. Don't over-engineer things. The number one issue that I hear over and over again from people who come to us for help in the performance space is that we are over-engineering systems and that is causing the slowdowns. We're overthinking the complexities that need to be there. And that's my presentation for today. Questions. Happy to take questions. You're looking or you're thinking of questions, but you don't have any. That's okay. That's okay. We don't have to take questions. Oh, questions. Oh, okay. So the question is, because I'm supposed to repeat it: is there any sort of criteria for deciding on which is the right database to use? Number one, what's the skill set of your developers and what are they already comfortable with? Okay. So introducing a new technology to developers who are already fans of Postgres, MariaDB, or MySQL and trying to get them to learn something brand new generally has disastrous consequences because you just don't get the buy-in and there's all kinds of weird things that happen. In fact, we worked with a large Fortune 500 company and they're like, thou shalt move off of Oracle to MySQL. And you know what all the Oracle developers did? They go, no. And the management said, well, we're going to make all of the other managers' bonuses directly related to how much they move off. And you know what happened? They got no bonuses, right? Because they couldn't win the hearts and minds of those who had to do it. Now, when you look at other features, you know, MySQL and Postgres have very similar features. So they are both good at similar workloads. There are a few things that one can do over the other.
MariaDB is going to follow very closely with MySQL, although it does have some add-ons and some bolt-ons if you're willing to pay or use some of their extended licenses. So you can kind of categorize those as the relational. And so if you have a need for the relational, I go with what is the most comfortable for the customer or client. And then if there's a specialized need, for instance, if they need to use, you know, Microsoft's cloud instances, and they need sharding, well, then Citus is probably a better choice, and the managed service for that is in Azure. If they're going to use MySQL and they need to use, you know, something else, then I might go with PlanetScale for that or Vitess, so I'll deploy there. But then when you look at, like, document DBs or graph databases, those are very specific to what you're trying to store and how you're trying to access it. Huh? Yeah, yeah. So, okay, so the question was, are there any tools that we can use to look at the efficiency of your design? So there's a few, actually. So if you're using Postgres, for instance, there are, you know, index statistics views that you can use to look at what indexes are being used and which ones aren't and how often they're being used. In MySQL, there are similar statistics. There are tools, so Percona, for instance, has Percona Toolkit, which runs on MySQL, that can look for inefficiencies around that as well. How many people use Postgres here? Okay, MySQL? Okay, so mostly Postgres. Okay, so, you know, you might be familiar with, like, pg_stat_statements to collect statement information on the queries and look at query access patterns. That is helpful. We just released pg_stat_monitor, which extends that even further and allows you to store it in buckets, you know, time slices. So those are tools that can really help identify things. We also built a tool for query analytics that's out there in the open source space called PMM, so Percona Monitoring and Management.
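As a rough illustration of the index statistics idea mentioned above, here is a small Python sketch. The query reads Postgres's `pg_stat_user_indexes` view, whose `idx_scan` column counts how often each index has been scanned. The helper name `unused_indexes`, the threshold parameter and the sample rows are all assumptions made for this example, and the rows are hard-coded so that the sketch runs without a database connection.

```python
# Query against Postgres's index statistics view. How you run it
# (psql, a driver, a monitoring tool) is up to you.
INDEX_USAGE_QUERY = """
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
"""


def unused_indexes(rows, max_scans=0):
    """Return the rows whose index was scanned at most max_scans times.

    rows is a list of dicts shaped like the query result above
    (hypothetical helper, not part of any library).
    """
    return [row for row in rows if row["idx_scan"] <= max_scans]


# Made-up sample rows standing in for a real query result.
sample_rows = [
    {"schemaname": "public", "relname": "orders",
     "indexrelname": "orders_pkey", "idx_scan": 18234},
    {"schemaname": "public", "relname": "orders",
     "indexrelname": "orders_legacy_status_idx", "idx_scan": 0},
    {"schemaname": "public", "relname": "customers",
     "indexrelname": "customers_email_idx", "idx_scan": 412},
]

for row in unused_indexes(sample_rows):
    print(f"{row['relname']}.{row['indexrelname']} has never been scanned")
```

An index that is never scanned still costs disk space and slows down writes, so it is a candidate for review. The counters only cover the period since statistics were last reset, so a zero is a hint to investigate, not proof that the index is safe to drop.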
It's got a query analytics function in it. So it can help find, you know, slow queries and what might be missing an index as well. So, yeah, I mean, there's a wide variety of tools. When it comes to actual schema design and choosing the right data types, that's a little less, you know, that's more in the iffy space for most. Ah, agnostic design, yes, yes. So this is where there's the trade-offs, right? So there are specific, you know, features of each database and if you're worried about that level of lock-in, it's a bit trickier, right? Because, oh, um, so, okay, so agnostic design is where you want it to work on almost any database so you can have more portability, right? So you can migrate from one, you know, to the next. Um, and so, you know, for instance, in Postgres, there's Babelfish now that will help convert SQL Server to, you know, Postgres. But what's funny is a lot of the tools that are out there to do those types of things, they're kind of one-way migrations, right? So they want to get you in and then keep you in. It's the Hotel California model. Um, yeah, yeah. You can check in, but you can't leave. So I think that it's a little more difficult to take that agnostic approach. Are you trying to do that also for, like, cloud providers and other things? Yeah, so it's hard because you have to lock into a technology or two and then once you start to get into the high availability structures and the backup structures, they differ so much between how you would back up, you know, let's say, MySQL or Postgres or SQL Server that you end up getting some lock-in even if you don't want to. And that's where I think the better choice is to look at choosing an open source tool that doesn't require a vendor underneath. So Postgres is a great example, right? You can get support and services for Postgres from pretty much anybody, right? So you want to use it in any of the clouds. You can have portability between clouds. You want to use it on your own.
You can use it on your own. You want to get commercial support. The EDB folks were here. ScaleGrid, um, you know, so all kinds of other stuff. Okay? Cool. All right, thank you.

Hey, it's time. I propose we start on time and honor the clock. Yeah. I'm from not far outside Boulder, Colorado, where the National Institute of Standards keeps their clocks. Let's start on time. I'm Mark. And I'm Alyssa. Welcome, and we're here to share our experiences with you about expanding open source in Africa. So at its heart, what's the problem we're discussing today? And the problem is best summarized by two numbers. 16. So over 16% of the world's population lives in Africa. That's one out of six inhabitants of the world living in Africa. But two, that other number, is the percentage of GitHub active users that are coming from the African continent. So it's more like one in 40 active users on GitHub that are coming from the African continent. So Africa is underrepresented in open source. We know that there are currently 1.4 billion people living in Africa. So that number is comparable in size to India. The Indian subcontinent has a population of about 1.4 billion as well. So what this means is that there is potential for Africa to be a big time contributor, much like India. From the data, the open source world needs Africa's contribution.
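The gap between those two numbers can be put into a single figure. The short Python sketch below uses only the rounded values quoted above (16% of world population, 2% of active GitHub users); the function name `representation_ratio` is made up for this illustration.

```python
def representation_ratio(population_share: float, user_share: float) -> float:
    """Share of users divided by share of population.

    1.0 would mean a region contributes users in proportion to its
    population; below 1.0 means it is underrepresented.
    """
    return user_share / population_share


# Rounded figures quoted in the talk, both in percent.
africa_population_share = 16.0
africa_github_share = 2.0

ratio = representation_ratio(africa_population_share, africa_github_share)
growth_needed = 1 / ratio

print(f"representation ratio: {ratio}")
print(f"growth needed to reach parity: {growth_needed}x")
```

With these rounded inputs the ratio is 0.125, so the number of active users from Africa would have to grow about eightfold to match the continent's share of world population.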
What you see here on the graph is that on the topmost bar, North America has about one-third of GitHub active users. Asia has another roughly one-third. And Europe is slightly less than one-third. And now if we continue down the chart, second from the bottom is Africa at 2.3%. So there's a long way to go for African contributors per capita to be anywhere near parity with North America, Asia, or Europe. We've got lots to do to reach out to people in Africa who can help with open source. Now Africa really is a great place to do open source. It's a series of developing economies. The economies are growing there. We see lots of technological growth in [unclear] and Nigeria. Lots and lots of strong technology in South Africa. We see increasing education happening amongst the people who are in Africa. And software skills are matching that education. We're delighted to see it. It's good to see that kind of growth along with improving network access as their access to the internet continues to steadily improve. So Google Summer of Code started 17 years ago. It's a worldwide program. It's a three month internship offered to college students to work in open source. Over 18,000 students have participated in the program. Over 700 mentoring organizations, aka open source organizations, have supported and participated in this program, with over 17,000 mentors involved. It is a three month funded internship where Google pays both the mentoring organization and the interns. Now Google Summer of Code is a little bit complicated for African contributors because it's a worldwide program, which means as they apply to Google Summer of Code they are competing with top contributors from universities in India, top contributors from universities in the United States and in Europe. And so it's more challenging to get into Google Summer of Code because of its nature as a worldwide program.
Google Season of Docs has a slightly different focus. Google Season of Docs is attempting to bring skilled technical writers to open source projects to contribute to the projects' documentation. Whereas Google Summer of Code is really looking at how do we help brand new contributors arrive, what Google Season of Docs is doing is saying let's take skilled technical writers and have them help these projects. We've used Google Season of Docs as well, but it's a different focus than Google Summer of Code. So Outreachy, what is Outreachy? It's a worldwide internship that focuses on open source and open science. It provides internships to people who are subject to systemic bias and impacted by underrepresentation in the technical industry where they are living. Interns receive a $7,000 stipend for a three month internship where they work remotely one on one with a mentor. Historically, Outreachy has been around for a while. They've been in operation for 12 years. They have served over 1,000 interns and the internship completion rate is quite high. 19 out of 20 of those interns successfully complete the program. Now, the success is also due in part to the collaboration of mentors, coordinators, sponsors and projects, and all these combined help to make the program successful, and it has been quite successful. She Code Africa: now where Outreachy, Google Season of Docs and Google Summer of Code are all three worldwide programs, She Code Africa is a different program in the sense that it is intentionally and specifically focused on Africa. What happens in She Code Africa is that once a year we do Contributhon, an open source boot camp that focuses on involving women from Africa in open source. It's a one month project. Each of the women selected for this project receives a $500 US stipend paid to them in the country where they are. In every one of these cases the components look the same. We need sponsors, right, in any one of these programs.
There are funds that need to be provided in order to sponsor the organization. We need open source organizations that are willing to host the effort. We need projects within those open source organizations that are willing to do the work, and mentors or skilled trainers who are willing to coach these new contributors as they come on to the project to become contributors. So remember that sponsors, projects, mentors, and organizations are crucial for the success of these. Without any one of those four, it just won't work. Now how does this effort help open source projects? So I'm going to take off my hat of lobbying for why we should be helping Africa more and talk about how the Jenkins project, in my specific case, has benefited from our interactions with women contributors and other contributors from Africa. So the question here is what benefit does Jenkins receive because it's involved in these outreach programs. We've seen strong contributions to Jenkins core and to Jenkins plugins because we've received more contributors from these outreach programs and they've solved truly valuable problems for us. We're also delighted that we've learned better how to do group mentoring. We've come to understand that there are certain things that will help and other things that are less helpful as we work with these new contributors. We've discovered problems in our onboarding process, in our what-does-it-take-to-become-a-developer process. All those things, thanks to being involved in these kinds of outreach activities. So the Jenkins project has participated in two Contributhons, six Google Summers of Code and one Outreachy. We've mentored over 45 new contributors to date. Now what this means is, as Mark mentioned earlier, there have been improvements and additions to our documentation, to our API, to our plugins, to our core and so on.
And Mark also mentioned that there were lessons that we learned and things that these programs have taught us. These programs also came with some challenges, and Mark will talk us through that in later slides. Now one of the incentives for Jenkins as a project and for other open source projects is the real contributions. Let's look at some specific examples here. Jenkins users want to manage their systems with configuration as code. They want to track in a repository the exact configuration. That thing listed there called the plugin installation manager is critical to that activity. It lets me track Jenkins plugin versions precisely. That's a Google Summer of Code project from multiple years ago. Provided by a student, it's now used in tens of thousands of Jenkins installations. That student's work lives on and is still actively maintained. Likewise, many of us are GitHub users. We want to have a strong connection between Jenkins and the GitHub web interface so that when I look at the GitHub page I can see the results of my Jenkins job. The GitHub Checks API work from a Google Summer of Code project is the thing that does that. Again, we've had students, first-time contributors, doing this work in ways that have helped large numbers of Jenkins users. We see the same thing with the GitLab branch source. It's a project that allows us to do multi-branch Jenkins pipelines with GitLab thanks to, again, a student contributor doing that work. So we're very grateful for the results of these outreach programs and glad to see how they succeed. Now, how does it work? What actually happens in the project as we do these things? What we see is that we start in the top left here with organizers. These are, for example, the people at Google who run Google Summer of Code or the women who run She Code Africa Contributhon. As organizers, they start the whole process by proposing an outreach program.
This outreach program that they propose needs funding and they don't have funding initially. It needs projects. It needs people to mentor and it needs contributors who are willing to be mentored. This is the beginning. The organizers then go looking for sponsors. These are commonly companies or foundations who are willing to put cash out to allow the contributors to be paid. It's crucial in this environment that these brand new contributors be paid. We cannot rely on them donating their time when they're fresh in their career, expecting that they'll just give it away for free and we mentor them. The crucial nature of the funding from sponsors arriving all the way to the contributors is part of what makes these programs a success. Now that third piece there, the mentors: without it none of the rest of the pieces are relevant. We have to have skilled mentors from inside the project to coach these people, otherwise they are left floundering, not knowing what to do or how to do it. So remember these are the items. Organizers, sponsors, mentors and contributors. So Google Summer of Code this year has four projects. And with that came four college age students who are driving the efforts and working within these four projects. We currently have 10 mentors for these four students and five out of those 10 are alumni. They're former GSoC students who came back and wanted to pay it forward and become mentors. So five of those mentors are former GSoC students and five mentors are longtime contributors in the Jenkins project. Now this is the best return rate we've seen to date for previously successful students coming back to act as mentors. It had been more typical that we get one or two. This year we're really pleased that half of those were previously students who had done a successful project. So Outreachy: the Jenkins project has been involved in one Outreachy project. We had two students, one mentor for those two students.
The internship was worth three months of work and they worked on an audit logging plugin for Jenkins. The She Code Africa Contributhon we've done for two years. This year we took on two projects, each with two contributors on that project. So gives us six contributors with four mentors. These new contributors to Jenkins spent a month working on things that the Jenkins project needed. They were paid $500 at the end of their successful completion. We were really pleased with it. Let's talk about who these people are, because it helps you see a little bit, knowing who they are. In the top left, that's Afi. She's from Ghana. Top right, Catherine from Kenya. Middle right, Peace Okafor. So Afi is a Java developer. Catherine's a documentation person. Peace, on the center right, is a front-end developer. Nafisa in the bottom left was our project manager. So She Code Africa provided us a project manager and contributors to help support it. And each of them received that stipend to make sure that they were funded for their month of work. Soma and Sophia, the two at the bottom, are both front-end developers, if I remember correctly, and did excellent work for us in the one month that we worked together with them. Good connections and good contributions. There were four of us mentors. One from France, one from the Boston area, one from North Carolina and me from Colorado. And we dealt with time zones. We dealt with complications related to networks. We guided and tutored these new contributors. We had to teach them some things, right? For many of them, GitHub was a little bit surprising. They hadn't dealt with it and they had certainly never contributed to an open source project. We had to be the ones who evaluated their results, who coached them on how to investigate problems in their configuration, and who reviewed their changes. All things that helped them and helped our project. And the four sponsors for this program were CD Foundation, the At Company, CloudBees and DeployHub.
They provided the cash to fund these efforts. So basically the funds were used to pay the contributors their stipend. It also supports the organizers. And in addition to sponsors, of course, Mark mentioned that we need a place for these folks to work, and that work is in open source. So this year's projects were Jenkins, Layer5 and DeployHub. They brought people in to propose ideas. The contributors work on the ideas. And then the project provides mentors. So the mentor then in turn provides their expertise, their guidance, feedback and evaluations. So we wanted to share with you that it's not all roses and happiness in terms of dealing with these kinds of outreach projects. There are plenty of challenges, plenty of compromises and complications that we have to handle. For the contributors, this is typically their first experience with an open source project. They've never done this kind of remote thing before. They may be fresh out of university or still in university. That experience alone means they are accustomed to a classroom environment where the people are experts telling them what to do. They arrive on an open source project where they're expected to be much more self-powered, self-motivated and ready to ask questions. So part of the challenge is we have to be sure we're coaching them on how to communicate and encouraging them to ask questions. In addition to that, time zones can be really challenging. For those of us who are on the US West Coast, it's a long way to Ghana. It's multiple time zones to reach out to Ghana. Likewise to Nigeria. The communication channels to these African countries are oftentimes not as reliable as those of us who are in the United States are accustomed to dealing with. In North America we've been working on the internet for a long time and the networks are just reliable. In Africa, not necessarily. Many times these project participants dial in from their cell phone.
They join a Zoom call from their cell phone because that's what's working. Or other times we have to switch to use Google Meet instead of Zoom because for whatever reason on that particular day Meet was working better than Zoom. Adapting to communication problems is just part of the challenge. The other is that they're on a short time scale and they're trying to accelerate their learning. So for them it's daunting, they have to learn a lot very, very fast. Now the mentors have those challenges plus they've got to learn to adapt to coaching people with widely varying levels of skills. We had documentation writers, we had front-end developers with no Java experience, we had back-end developers with hardly any JavaScript experience, and in each case some specific coaching was needed to help them reach their best potential in the time that we had. For the organizers, okay, this one, my heart goes out to the organizers. So my friend Zainab is one of the organizers, she's from Nigeria, and her challenge is to find open source projects that are willing to participate with her and find sponsors who will pay funds to allow them to run the operation. Then they have to do all the processing work: how do they distribute the funds out to the participants across all the African nations that they're serving? All sorts of things that I'm grateful as an open source project not to have to do. I have no idea how I would pay someone funds in Nigeria that we would like to get to them. I just don't know how to do it, but Outreachy, Google Summer of Code and Google Season of Docs all know how to do that. Now as a project, what we've found as the Jenkins project is that one of the most daunting things is finding mentors. It's a pretty common behavior that open source project participants participate because they're interested in something that benefits them. It's a good thing, guided self-interest is really helpful. However, that does not naturally lead them to be mentors.
So we have to sort of beg, cajole, sometimes plead: could you come help us mentor in your already existing passion for the thing that you're doing? Mentoring is complicated. It's not a natural thing for most developers to find a way to mentor. The other challenge is we've got to find good projects that fit within the time scale and the skill level of these new contributors. Project definition can be challenging as well: how do you find something that you can do successfully in a month on an open source project and that is useful? So let's talk about the results that we've gained from this. So for Google Summer of Code, the results we received were plugin installation improvements, GitHub performance improvements, GitHub checks were added, and multi-branch GitLab pipelines were supported. Google Season of Docs wrote the Jenkins installation documentation for Kubernetes. Zainab was actually our writer for that. And she taught us all sorts of interesting things about how non-portable our documentation build system was. We thought we'd created a portable documentation system and she proved conclusively that we had not. All the dockerization we did, all of the things we thought were making it so portable ultimately surprised us when she tried to build on Windows. She is still involved in the project. We're delighted with that project and that process, and it educated us. It removed one of our points of kidding ourselves, our points of delusion, where we thought we had a portable documentation system and it turns out it wasn't as portable as we thought. So for She Code Africa Contributhon, we received over 400 applications from women from 10 different countries. The sad thing is that only one out of 10 of those applications is accepted. And this is due to the lack of sponsors, due to the lack of projects and mentors. However, for the folks that are in the program, 97% of them are happily satisfied with the program and 87% of the mentees do successfully complete the program.
For Outreachy, 1% of the alumni continue to contribute to open source, and 44 of them are employed and continue to contribute to open source as part of their job. Now because Outreachy supports STEM, 22% of past interns are STEM students and 61% of those interns have found jobs in large enterprises like Intel, Facebook, Google, Microsoft and so on. So what's next? Ask your company to sponsor. We mentioned earlier that cash donation is crucial. It helps fund these programs and it helps to keep the lights on for these folks. And if you have D&I programs in your company, be a part of that. Ask them to join your company. Give them a platform so they can speak to bring awareness and visibility. And then donation-wise, what I typically do is always include the sponsorship as part of my budget planning the year before. So I can always make sure that we cover them in terms of sponsorship in the next year. So this one for me was quite an instructive experience. How do I ask my employer to get really involved in putting cash into funding one of these outreach programs? I'm finding it is specific to that employer. Some organizations do their donations through an entirely independent foundation. Others have an organization that's dedicated to diversity and you need to get on their queue. Finding who it is in your company and how you ask the question is already an important and valuable step. Once you've found that, then there's the piece where you've got to lobby them on why this organization is better than that organization, and they'll make their trade-offs and compromises. So ask your company to sponsor. Help them make diversity and inclusion real. All too often we get organizations that talk an awful lot about diversity and inclusion, and this is a chance to put cash into a diversity initiative.
Now the other part of this is, cash funding is great but we need projects, we need ideas on which people can work, and so my suggestion here is propose ideas to your favorite open source project. Now mine is Jenkins, therefore I put the Jenkins logo. You pick your favorite project, whatever that is, and suggest to them small projects that could be done by new contributors supported by mentors. Test automation is a good one. Ask yourself, does my open source project have enough tests? For most of the projects I've seen the answer is no. I know that the answer is no with Jenkins. We do not have nearly enough automated tests. It's a great task for a new contributor. Help them write tests. Help us get the benefit of their having written tests. Likewise online help, documentation and specific small features are good fits for these kinds of projects. Next, be a mentor. Offer your time and your skill in your project to coach someone else. I admit it, it's not the same as writing code. You're trying to help somebody else as they write code. And if your love is pure and simple code writing, this may not be the choice for you. But boy, we need mentors. Yes, indeed we do. And that covers the talk that we had for today. Do you have any questions you'd like to ask before we close? So we're actually continuing with all four. Let's bring them up here. So we'll continue to use Google Summer of Code. Absolutely. This one is very active but it's worldwide. All right. We did one Google Season of Docs. We probably won't do a second Google Season of Docs because the engagement level they expect for Google Season of Docs does not fit real well with the Jenkins project. For others it may, but for us, they're expecting a higher engagement level than we're ready to commit to right now. Outreachy, we will likely reengage with them in the future. And She Code Africa, yes, absolutely. We did it last year. We've done it again this year. We expect to do it again next year.
Good question. So the question was, is there anything on the Jenkins site that guides these sorts of efforts? And there is. So what you'll see on the Jenkins website, jenkins.io, is a segment on outreach. And in that segment on outreach, each of these four is actually listed, and under each of them there are typically new contributor friendly issues that they're encouraged to assist with as a way to get started. And again, that's another place where a contribution is needed. It's difficult to identify which issues are well suited to new contributors. You need to be aware of the project well enough that you can say, ah, that's an easy one. Oh no, that's a hard one. Good point. So you note that in the FreeBSD project and in many others there are tags or labels that are applied to bugs to say that this is a candidate for an early contributor. And we have those exactly. And so you query our bug tracker and you see, oh, here are the first time contributor friendly issues. Absolutely. And we like that. That's a good way for them to find something interesting to them and work on the thing that's interesting to them. Right, right. So the good point is that it's much cheaper, or it's healthier for the project, if we identify good first issues as we're creating the issues or triaging them. Yes, very good. Other questions? Yes, please. Good question. So is there a size requirement, or what are the requirements or expectations of projects that participate? Did I state your question correctly? So if we look at She Code Africa, several of their projects are actually quite small. So the Layer5 project and DeployHub are actually relatively smaller projects. Not enormous. There's certainly nothing on the scale of the Jenkins project or the FreeBSD project or any of the Linux distributions. And you can work well with them so long as you're willing to dedicate someone to help them.
Now, others, for instance, I would not embark on Google Summer of Code unless I had, well, in the Jenkins organization we had three organization admins running this year to support these four projects, and the work is substantial because it's very competitive. What's happening is we're taking cash from Google effectively and paying contributors to work on Jenkins, and other projects have exactly the same desire. They want Google's cash to help pay contributors to their project. So it's more challenging to do an effective Google Summer of Code project than it is to do She Code Africa Contributhon, for example. Right. Yeah, so Alyssa is one of our org admins. I should be quiet and let you speak up on this one, Alyssa. Well, so for Google Summer of Code, as a mentoring organization, you do have to apply, just like the students themselves have to go through an application process. So we do the same thing. They have a form and questionnaires that we need to fill out, and then they choose. They go away for a couple of weeks and then they come back and they make their announcement. I don't know how they choose their mentoring organizations, but I have seen organizations that weren't accepted. But we've been fortunate enough that we've been in the program for six years now. So that kind of buys us some kudos or, yeah, been there. Right. So did that address your question? And I think if I can also add, the other thing that's really helpful is that there's a lot of prep work that needs to be done by the project prior to the application for the mentoring organization. And I think Google looks into all of that as well. Are your web pages up to date with regards to your proposal ideas? Your communication channels? How are you prepping students, potential students, right? So they look holistically at all that and they just want to make sure, and I do, too, that when we get into this program we are successful as well as the students. Yes.
So your question is, have we considered sending someone physically into Africa to act as a mentor, rather than being in a time zone here locally? I have not, in part because we've found it so difficult to locate effective mentors already. I can't imagine telling them, yes, you'll need to relocate to Africa for a month. My company would not fund me, for example. They happily fund me mentoring for a month because it's two to four hours a week. If I instead said, oh, I need to fly into Africa and have accommodations there that are safe, reliable, etc., they would not know how to do it, because at least my company does not actively sell into these portions of Africa, and therefore we have no contact with anyone in these parts of Africa right now. I am certainly open to that. Let's connect by e-mail and have the conversation.
Let's look at the map here to get a hint of what She Code Africa has done in terms of their country coverage. They've accepted applications from 10 African nations, both West Africa and Southern Africa, and into the United States. So they are able to accept applications and provide funding for women in these countries. Okay, the darker color there marks the ones, I believe, where they actually had contributors accepted and paid funds out. The others are where they had applicants and did not accept anyone into the program at that point. But they're ready to pay out in those locations, and they are open to other potential countries, as far as I understand. They've certainly shared with me that the banking challenges are real. Transferring funds from one country to another in Africa is every bit as complicated as transferring funds from one country to another here. So international transfers are complicated, but that's part of what they provide as an organizer. Any other questions? So, you're okay if I restate your question for the recording?
I think what you're asking is, are there steps that open source projects should take to better adapt to people in Africa who may have specific challenges with internet connectivity or with internet bandwidth? Is that a fair way to say it? And I think there are, and I think there are two or three hints. One is that we discovered, yes, we did have some high expectations for bandwidth, because you run a Maven compile and it downloads the internet in order to populate its cache. And that was, okay, we're not sure how to fix that one. Then the next learning phase was, oh, and now we've got a build system that requires Docker images. They're nice and portable in that sense, but now you're going to download hundreds of megabytes of Docker images. So we had to forewarn these participants to start this download ahead of time, and we had to stop assuming that during our meetings they could download an artifact and have it ready. It was a flawed assumption, because of course I'm spoiled here in the U.S., in North America: I want it and it's already on my desktop. And it was, oh, you want that 90 megabyte Jenkins WAR file that has all its dependencies in it? 90 megabytes coming into the country is going to take a while. Right. Exactly. So that was one piece.
Now the other piece is one that we're still evolving, and that's to have preconfigured development environments that are available and hosted remotely for them. In our case we've been using Gitpod, a facility that allows us to define, with a little bit of YAML, what the IDE and development environment for a given project is. They press one button that says "Open in Gitpod", and what it does is spin up a machine on Gitpod's servers, and they've got a running IDE, in this case Visual Studio Code, and they can do development right there without having downloaded anything except the interaction between the web page and them. So yeah, good question. Any other questions?
So how much of what they code is relevant to them in country, is that what you're asking? So what we found, and again this is specific to the Jenkins project, was that most of them had no experience with an automation server whatsoever, so no experience with continuous integration. The concept was completely foreign to them. We were introducing them to a brand new concept. Here's this Jenkins controller. This Jenkins controller lets you automate things. The month that we spent together with them on the project educated them about a facility they didn't know existed, and that was beneficial to them, right? They will take that into their employment now. They'll take that into their university work. Did it match any specific in-country needs they had? I didn't ever detect that. All I saw was, and I'm a firm believer in this, that they will benefit professionally by knowing that there is such a thing as continuous integration, and as they use it they will be better for their employer and better for their own career growth. Right.
And we certainly did, for instance with She Code Africa, host sessions where we gave tutorials, and so I did a tutorial: this is how you use Git and how we use the project. I was astonished at the number of these women who said thank you, thank you, thank you for doing this. I thought I was saying things that everybody knew, telling them something that of course they'd already experienced, and their answer was no, they had not experienced it. It was brand new to them, and they were happy to have someone introduce them to the concept of a pull request, the concept of an issue on GitHub, and how you interact with other people. So, very positive. Did that answer your question? Any other questions?
All right, well, thank you very much. Thanks for being at SCaLE 19x. Thanks again, have a great trip home, we'll see you.