Today, we're talking about hybrid databases. Databases have changed a lot in the past 40 years, and especially in the open-source arena there have been a lot of developments that are rather interesting. Now, I want to start off with a quick history lesson for those of you without gray hair or who aren't prematurely bald. The first computers, and we're talking about electronic computers here, not the people who actually did computations and linear algebra, really only had memory. Later came punch cards, paper tape, real tape reels, and all the other fun bits. They're long, long, long gone. And disk drives, if you could find them, were rather expensive, very slow, and physically very, very large. With tape, you had to read everything from beginning to end. It's like a book you couldn't open in the middle: if you wanted to know something, you had to start at page one and go to page n. So if you were processing some data, you went all the way from A to Z every time. Really fun. Well, somewhere along the way, people decided, I just don't want to do that. I want to be able to skip around in the data. There's got to be some way to do that. So the first step was to force all records to have a uniform length. Because if everything's a uniform length, say 100 bits long, you know that at bit 101 the second record starts. So how do you get to the records you want in a set of data? People started figuring out ways to put hashes or keys into each record, and to use that index as a map to the records they wanted. And in the mid-'50s and early '60s, they came up with an idea called the Indexed Sequential Access Method, ISAM. This is where the records were made into logical blocks, all the same size. That was originally done so you could skip ahead on tape drives. Anyone here actually use a tape drive? A couple of you. Big, bulky devices. 
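That uniform-length trick is easy to sketch. Here's a toy illustration in Python (not ISAM itself, just the offset arithmetic it rests on): with every record padded to the same size, record n starts at byte n times the record length, so you can seek straight to it instead of scanning from page one.

```python
import io

RECORD_LEN = 100  # every record padded to a uniform length

def write_records(buf, records):
    # Pad each record to RECORD_LEN bytes so offsets are predictable.
    for rec in records:
        buf.write(rec.ljust(RECORD_LEN).encode("ascii"))

def read_record(buf, n):
    # Record n starts at byte n * RECORD_LEN -- no scanning required.
    buf.seek(n * RECORD_LEN)
    return buf.read(RECORD_LEN).decode("ascii").rstrip()

tape = io.BytesIO()
write_records(tape, ["alice", "bob", "carol"])
print(read_record(tape, 2))  # jumps straight to the third record
```

The same arithmetic is why variable-length records were such a problem: without uniform lengths you can't compute an offset, you can only scan.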
If you were really lucky, you might have 2,400 bits per inch on the tape. Not really that dense. So the great thing about ISAM was that you didn't have to read in everything to find a record. You could skip around. But it was limited in size. It usually had a single key. Any sort of data relationship between different tables or different records was very, very messy. And you were very much limited by the memory on the machine. This gentleman, Edgar Codd, was an IBM researcher who was asked to find a better way to go out and get the data. The main concerns were not duplicating stuff, because disk drives were so expensive, and finding a way to break data into little component bits to make it easier to find small pieces of information. So he came up with the relational model, which gave us the structured query language. Ironically for him, IBM had another database product, one that wasn't SQL, that they wanted to sell. So they kind of tried to stifle what he was doing. And the only reason we have SQL today is because he kept talking to customers. Everyone here talks to your customers, right? So if your bosses are telling you no and you want it anyway, talk to your customers. As I mentioned earlier, his main goal was minimum duplication of data. So as people started moving away from ISAM, they were looking for better performance. They were looking for things that would scale. And the main thing everyone wanted at the time was transactions: being able to make sure things like bank account transfers went through properly and weren't aborted in the middle. Part of that came with the idea of normalizing your data. If you're going to take someone's address, you break it down into the same format for everybody. So you'd have a first name, last name, street address one, maybe street address two, city name, postal code, country code, all that. Normalizing data is very rigorous. It's very hard to do at a large scale. 
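To make "normalizing" concrete, here's a minimal sketch in Python: whatever shape the input arrives in, every record gets forced into the same named fields. The field and key names here are invented for illustration, not from any real schema.

```python
from dataclasses import dataclass

@dataclass
class Address:
    # One uniform set of columns for every customer, no exceptions.
    first_name: str
    last_name: str
    street1: str
    street2: str
    city: str
    postal_code: str
    country_code: str

def normalize(raw):
    # Hypothetical input: a loosely structured dict from a signup form.
    return Address(
        first_name=raw.get("first", "").strip().title(),
        last_name=raw.get("last", "").strip().title(),
        street1=raw.get("street", "").strip(),
        street2=raw.get("street2", "").strip(),
        city=raw.get("city", "").strip().title(),
        postal_code=raw.get("zip", "").strip(),
        country_code=raw.get("country", "US").upper(),
    )

addr = normalize({"first": " edgar ", "last": "codd", "street": "1 Main St",
                  "city": "san jose", "zip": "95101"})
```

The rigor the speaker mentions is exactly this: deciding up front which fields exist, and cleaning every incoming record to fit them.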
That's why a lot of folks don't do it properly. But if you really want performance out of a SQL database, that's the way to do it, because you have to force your data into these little relations to be able to tie things together. Now, all this was driven by the cost model. This was something Oracle developed in the early years. The idea was: if you know what it costs to go out and get this piece of information, and there are multiple ways to do that, take the cheapest way. Cheapness here being defined as disk reads. This is all based on past statistics. And like all past statistics, they can be misleading. Much like a GPS that knows route 8123 was open yesterday, and it was the fast way to get back to your house, but today there's some sort of protest there, and the optimizer doesn't know about it. So the cost model isn't always the best. Now, most people like me from database companies, coming around talking to folks like you, were telling you to go out and normalize your data. Experts like me have been preaching this for upwards of 40 years, telling you that you really need to package your data properly, and that you have to set up complex relations between your tables. Usually this required a DBA to come in, set up the relations, set up the tables, set up the indexes, set up the permissions to write the data, and get things going. So you had to go through the DBA as a choke point. Well, about 10 years ago, the NoSQL world came along, frustrated with things like MySQL's inability to quickly change the schemas of their tables. And of course, any new technology that comes out is always marketed as the end of the last technology. So for many years I was hearing, oh, relational databases are going to be gone, NoSQL is going to take over everything. 
NoSQL databases are really good, but until recently they lacked a lot of functions. ACID compliance for transactions is just now getting implemented. Rigor for your data is something you can't really enforce with schema-less data. A lot of NoSQL databases don't have good logical operators; exclusive ORs or ANDs can trip them up. But the good news was you didn't need a DBA; you didn't have that choke point anymore. You could literally listen to your boss who said, hey, don't worry about the data, go out and start coding, we'll tell you about the product later. I have this little check mark on my Gantt chart. Go ahead and code, we'll tell you what you're doing later. Anyone working in a company like that? And the other great thing about NoSQL is you had mutable data, which means you can change it on the fly. You have an IoT sensor that suddenly starts giving you a new bit of information? You can instantly throw it in the database without having to rely on the DBA to set up the column and the index and all the other stuff and get the queries right. So in the past couple of years, the NoSQL vendors have added things like indexes and more logical operators, and they're adding transactions; MongoDB is having some teething issues there. Meanwhile, most of the SQL vendors have added JSON data types: Oracle, Microsoft, MySQL, Postgres. Postgres being Postgres, of course, added two JSON data types. And the relational vendors are looking for ways to leverage this feature. About a year and a half ago, I was talking to one of our engineers, Dr. Charles Bell, and I asked, how do you use the NoSQL features in MySQL and other databases to your advantage? That was kind of the embryonic start of this talk. Well, that's the end of the history lesson. Let's go on to the fun stuff. 
Two and a half, three years ago, MySQL introduced the JSON data type. This is basically a one-gigabyte payload per column. MongoDB, for reference, is about 16 megabytes. We have 24 to 30 supporting functions, depending on the release you're on. And it's probably the main reason more people have upgraded from MySQL 5.0, 5.1, 5.5, and 5.6 to 5.7, and later to 8. On top of that, it's also the heart of our new protocol, which we cleverly call the X Protocol, and our new API, which we cleverly call the X DevAPI. It gives you the ability to use MySQL without the SQL. So let's make up a totally spurious fantasy business where I'm going to stop doing what I'm doing now and start making guitars. But being a bithead, I want to set up all my own behind-the-scenes infrastructure. So for my guitar shop, I'm going to have all sorts of databases for accounting, employees, customers, parts suppliers, and products. But we're going to narrow this down to just a product database for the guitars. Now, guitars have been around for a couple hundred years, and everyone knows this familiar shape. You might even have one sitting around in your closet gathering dust; hopefully not, hopefully you're playing it on a regular basis. Now, there are different types of guitars for different needs. Some people like the classic Martin acoustic. Other people like electric guitars. Some even like electric guitars with two necks. Maybe you don't like two necks, but you want the extra strings. Maybe you have other issues. I just love the expression on his face. Or maybe you have some other quality or quirk in your personality that needs to be satisfied by having five necks on your guitar. Now, if you're making custom guitars, you are going to have a paying customer, usually with cash, who's going to come up and say, I want something like that. Anyone recognize this guitar, or the hands of the person holding it? 
This is a 60-year-old Martin N-20 classical guitar. The owner bought it because his previous guitar got stepped on. Now, you may not be able to see it from where you're sitting, but this guitar has a sound hole, which you'd expect. It also has another hole there. That's from 50 years of playing with guitar picks, actually wearing the wood away. This is Willie Nelson's guitar, called Trigger. There's actually a gentleman in Austin, Texas, who every few weeks has to go through and maintain this guitar. So for those of you who bitch about having a legacy code base, please remember this picture. So, time to get to work. I've decided that my database is okay. I'm going to go out, get my wood, and actually start carving it up to make guitars. But I want to be able to record the information on each guitar. I'm going to want some sort of serial number; most manufacturers have a serial number with an embedded date. The serial number might note whether it's an acoustic or electric, a certain body style, materials, parts. There are a lot of parts that go into a guitar: the tuning pegs that tighten the strings, the strings themselves, the frets, the glue, the screws, the wiring and switches for the pickups, the strap holders, and all the other stuff. There are just a lot of parts. So, naturally, we'd want to go fully normalize all our data, which, when you realize all the subcategories, all the subparts, and all the ways you can break down a guitar, suddenly becomes a whole bunch of stuff. And somebody's going to call you up and say, tell me about my new guitar. Here's my serial number. Please tell me all about my guitar. And a simple thing like the body might be basically a wooden box. It might not even be wooden. It could be aluminum. It could be plastic. It could be any sort of material. 
By the way, the Gibson Guitar Company, three or four years ago, got contacted by the US government, which said: all that Brazilian rosewood, all that ebony you have, prove to us that you have the proper licenses for those woods. So you actually have to keep track of all that information. That almost broke the Gibson Guitar Company; they couldn't prove where they had gotten all their stock, due to sloppy bookkeeping. Also, if you have an electric guitar, you might have one or more pickups. You might make them in-house. You can actually go out and buy the various parts and put them together. You can buy them from any number of vendors. They have different qualifications and assembly features. They come in different colors and wiring options, and it gets real messy. So as you can see, we quickly go down the rabbit hole of normalized data. We start off with our table for the guitar. We link to the body. We link to the neck construction. We have another link to the tuners. The fretboard might be a different material, or the same material. There might be multiple necks. There might be different frets. So we've gone through all this, and we haven't even talked about the colors or the inlays. I mean, there's a whole bunch of variable options, and if you're going to sit there and normalize this data, you are going to be there a long time. And if I'm making guitars, I don't want to sit there playing at making tables. I want to actually go out and make guitars. Now, in engineering terms, they call this finite element analysis: you keep digging down and digging down until you get to the nth degree. If you're a small business, you can't afford to do this. If you're a small developer working in a small shop, you can't go down this rabbit hole for your customers every time, or you're just going to be stuck in the database forever. Now, in the database world this is known as a many-to-many join. 
Each time you do this, you have to dive into the index table, and from the index table dive into the data table. So here we dive into one index, to one table, into a second index, into another table, which might actually be several records, and yet again into a third one. So we have at least six dives into indexes and data. Not exactly a cheap operation. So, what I'm going to do is use the JSON data type. I'm going to come up with a minimal normalized set of data, and anything else, kind of a grab bag, I'm going to throw into that JSON column. I have a gigabyte of information I can store in there. If I need more than that, I can add another column of the same data type. Now, the great thing about the JSON is that it's mutable. If someone comes in and says, hey, change the color from DuPont red number whatever to purple number 15, I can do that on the fly. I don't have to go back and worry about hitting the relational tables. I can do it on the fly in the JSON data, and if I want to pull it out, I can later. So since the JSON data is mutable, I can add all this stuff. If they come in and say, I want abalone inlays someplace on there, it's very easy to just plug that in. The data, once again, is mutable. So if I later need to normalize some piece of information, like the production data: the date started, the location, when it was roughly finished, when the paint started, when the paint finished, the final assembly, and, most importantly for someone with a small business, when it shipped to the customer. I can normalize that as needed. Now, since I'm here from MySQL, I've got to talk about our MySQL Document Store. This is a schema-less JSON NoSQL database. The good news for most of you is that you no longer have to embed strings of SQL in your code. Yeah, got one woo. We also use a very modern style for the API. I hope you're going to like it. 
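The pattern described above, a few normalized columns plus a mutable JSON grab bag, can be sketched with nothing but Python's json module. The column and field names here are invented for illustration, not from any real schema:

```python
import json

# Hybrid row sketch: a couple of normalized columns plus a JSON "grab bag".
row = {
    "serial_number": "DS-2019-0042",   # normalized column
    "model": "acoustic",               # normalized column
    "attributes": json.dumps({         # everything else lives in the JSON
        "body": {"material": "mahogany", "color": "DuPont red"},
        "necks": 1,
        "pickups": [],
    }),
}

# The JSON is mutable: change the color on the fly, no ALTER TABLE needed.
attrs = json.loads(row["attributes"])
attrs["body"]["color"] = "purple #15"
attrs["inlays"] = "abalone"            # and new fields can appear at any time
row["attributes"] = json.dumps(attrs)
```

If a field in the grab bag later turns out to matter, like the ship date, you promote it to a real column and index it; until then it costs you nothing in schema work.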
And the big option is that you can get to the same data, at the same time, from the NoSQL side and the SQL side. So it gives you the best of both worlds. Now, this is what our architecture looks like. On the left-hand side, your application talks to a MySQL connector, which for all of you is hopefully either mysqli or PDO. If you're using the old mysql extension and you're on PHP 5.5 or earlier, please bow your head in disgrace and leave the conference. Actually, I have a customer near me that's running MySQL 4.1 with PHP 5.3, an application I wrote 15 years ago; it only does one thing, and they know it's bad, and they won't let me upgrade it. On the right-hand side, we have our new shell with our new CRUD- and SQL-supporting protocol. By the way, our standard protocol is having its aging pains. Our new protocol is built on Google Protocol Buffers, so it's now tightly typed, which is a lot better and is going to help reduce SQL injection. It has a lot of other features for managing things like InnoDB Cluster. Using our new shell, if you're running MySQL 5.7, you run the upgrade checker, put in your host parameters, and it will tell you what you need to change before you do a 5.7-to-8.0 upgrade. But the most important thing for what we're talking about today is that the new protocol lets you talk to JSON collections. I'm not sure how readable this is in the back; hopefully you can see it on the side screens. Here's our new shell. By the way, the new shell has three modes, JavaScript, Python, and SQL, with built-in interpreters for the first two. A lot of neat features in there. So I come in here. I create a collection called guitar. dave.guitar.add() puts in the first record. Away we go. You can use it to prototype data to get started. By the way, the API is pretty much the same in Python, JavaScript, PHP, C++, whatever language you want. And if I do dave.guitar.find(), it will come back and give me all the information. 
I can also index the data, either from within the document store or from the SQL side, so we can index the data for faster searches. On the back end, behind the protocol, the optimizer does basically the same thing to your data: it makes it very efficient to get information out. Now, with the new API, I can also ask for only certain fields if I want. I can say I just want the type field. And if I don't specify, as I didn't here, that the type equals 'Les Paul', I can just say: find me all the records, give me the type, and sort by the type. The guys who designed our API were very thorough. They looked at what's out there, and they also looked at what we support on the SQL side. This is just one railroad diagram; you can see that for find we have all the standard things you're used to. Search conditions: age over 25. Fields: I want the age and postal code. Group by, having, sort, limits, offsets, and locks. So this does support transactions. For those of you who are tired of having your IDE blow up on strings of SQL because it can't help you, our new API, I think, is a little easier to read, a little easier to comprehend. Rather than doing a SELECT from people WHERE job LIKE $job AND age > $age, we can say find with job like the bind variable job and age greater than the bind variable age, bind those variables in, and execute. By the way, this also works with tables. If you want to use the new API with your existing tables, or new tables you set up, it's very similar. Instead of find you now have the keyword select: we're going to select name and district from a table called city where the district is like a variable named district, plug the value Texas into that variable, take only the first 25 records, and do the fetch-all. 
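The bind-variable idea here is the same one the classic SQL APIs use. As a self-contained sketch, here it is with Python's stdlib sqlite3 standing in for MySQL (so it runs anywhere): the driver fills :job and :age in safely, which is what makes this style resistant to injection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, job TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                 [("Ann", "luthier", 41), ("Bo", "drummer", 19)])

# Named bind variables: the driver substitutes the values itself, the same
# idea as find('job like :job and age > :age').bind(...) in the X DevAPI.
rows = conn.execute(
    "SELECT name FROM people WHERE job LIKE :job AND age > :age",
    {"job": "luth%", "age": 25},
).fetchall()
print(rows)  # -> [('Ann',)]
```

The point is that the values never get concatenated into the query string, whether you spell it as SQL with placeholders or as a find() chain.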
Here is something you can do with our NoSQL software: very complex analytics. This is something you can't do in Mongo or the other NoSQL databases. This is actually a very, very interesting piece of code. JSON_TABLE is a function that lets you take this unstructured NoSQL JSON data and temporarily turn it into a relational table. Once you have a relational table, you can use all the other relational tools out there for analytics. In this case, we're going to rank a whole bunch of grades for restaurants in New York. We're going to rank them by cuisine, with the highest average score at the top of the order, and then we're going to feed that into a common table expression. CTEs are a fast way to write a subquery. Subqueries are a pain in the butt to write; CTEs are easier to write, easier to comprehend, and easier to modify. Here we're taking our NoSQL data, pulling out the average grades for the restaurants, then ranking them by average score and giving the highest one per cuisine. If you want to know the best Indian restaurant in the five boroughs of New York, or the best Italian, this will get it for you. And if you don't want to rewrite your old queries but you want to use the new protocol, you can plug your old queries in and take advantage of the new protocol. The new shell comes with three modes: JavaScript, Python, and SQL. I was in the Dominican Republic last weekend working with a Python group starting a new conference there, and there were some things I couldn't have done that they were able to do rather easily with Python. It's all there on your screen, so if you want to somehow ETL your data or change it on the fly, you can write ad hoc reports. Also, we have a bulk loader for JSON. By the way, this bulk loader now understands BSON, so if you have stuff you're importing from Mongo, we know how to handle that. It's very fast: in this case, in just over two seconds, it read in 25,000 records. Very quick, very neat. 
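To show the shape of that pipeline outside the server, here's the same logic in plain Python over a few made-up restaurant documents (the names and scores are invented): flatten each JSON document into rows, average the grades, then keep the top scorer per cuisine, which is what the JSON_TABLE-plus-CTE query does in SQL.

```python
from statistics import mean

# Made-up documents shaped like the New York restaurant-grades data set.
restaurants = [
    {"name": "Curry Corner", "cuisine": "Indian",
     "grades": [{"score": 9}, {"score": 7}]},
    {"name": "Taj Table", "cuisine": "Indian",
     "grades": [{"score": 6}, {"score": 6}]},
    {"name": "Nonna's", "cuisine": "Italian",
     "grades": [{"score": 8}, {"score": 10}]},
]

# Step 1 (the JSON_TABLE part): flatten each document to (cuisine, name, avg).
flat = [(r["cuisine"], r["name"], mean(g["score"] for g in r["grades"]))
        for r in restaurants]

# Step 2 (the CTE + ranking part): keep the best average score per cuisine.
best = {}
for cuisine, name, avg in flat:
    if cuisine not in best or avg > best[cuisine][1]:
        best[cuisine] = (name, avg)

print(best)  # the top-rated restaurant for each cuisine
```

Inside MySQL, of course, both steps run in one statement against the stored JSON, with the optimizer doing the flattening and the window function doing the ranking.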
So, hybrid databases: good idea or not? Well, it's not really a panacea. For the best performance out of a database, when you have hundreds of millions of records, you're probably going to want third, fourth, or fifth normal form. You want everything fully normalized, everything indexed properly, and very carefully developed queries, because you have millions of transactions going on at any one time. Unfortunately, that's not the business case for most of you these days anymore. So if you work in a world where you have that many transactions and nothing really changes, you don't really need mutable data, and non-hybrid databases are great. But if you're in an environment where you don't know the data structures, or they're changing all the time, you don't have a big need for volume (you're talking hundreds of thousands of records, not millions or billions), and your data is in flux most of the time, something new coming in, something old going out, things changing all the time, you might want to start looking at a hybrid database. By the way, MySQL 8 has been out for about a year now. The big features were covered yesterday by Gaby in her talk, and I won't go into any of those details except for resource groups. You can actually pin certain CPUs on your system, hopefully a multi-CPU system, to a batch group or a fast-payment group or something like that, and in your query, in a comment, you note that it goes to this resource group. The optimizer will dedicate those queries to just those CPUs, so you can actually dedicate part of your server. By the way, if you're dealing with JSON in MySQL, please buy my book. I think I gave away my last copy yesterday. It's 120-some-odd pages and very concise. If you've read the MySQL manuals on the JSON data type, they're not exactly the clearest things in the world, and I have some very interesting and very concise examples. 
Also, if you're looking for more information on the PECL extension for the X DevAPI for PHP, I have a link up there for the manual on how to get it running. There's another link for the MySQL Document Store, and a link to the X DevAPI user guide. Also, two weeks ago in Miami, I gave an X DevAPI tutorial; the links are out there, and the information is on my GitHub account under PHP XDEV API. And with that, I've run through my spiel. What questions do you have? Yes, sir. [Question:] You mentioned that there is a small impact on performance if you're not normalizing data and just dumping it into a JSON-type string. How much would it be, for example, and can you index the data that is inside the JSON? [Answer:] You can index the data within the JSON. You can also use a generated column to pull that data out of the JSON and put it into a materialized column within the relational tables. Performance-wise, I'm not seeing anything more than 3-4%. Unfortunately, Oracle doesn't want me giving benchmarks, as I'm not the benchmark guy, and the benchmark guy keeps telling me, yeah, I'll get it to you next week, since the middle of February. So I don't have any hard numbers I can give you, but my gut feeling is 3-4%. Thank you. Any other questions? 
It's a quiet group. [Question:] For full-text search, which way is preferred: still using the old normalized tables, or searching through the JSON? And performance-wise, for full-text search, which is better, the new protocol or the old protocol? [Answer:] Right now I'd say the old protocol, but give us a few months; we're working on other things. By the way, we just released 8.0.15 two weeks ago. 8.0.16 will be out in April, and for those of you who've always wanted check constraints, which is something we've accepted but never actually implemented, that's going to be out there in 8.0.16. But for now, if you're doing full-text search, especially in InnoDB, use the old protocol. You can actually read that data in as JSON, pull it out with a generated column, and do the full-text search on that, which is probably the way I'd recommend doing it. Okay. [Question:] PHP and MySQL have a wonderful love story, but PHP isn't in the console, only Python and JS? [Answer:] Actually, I got the question last week, why isn't Perl in there? I don't know. I know a lot of the analytics folks are doing a lot of stuff with Python, and of course, as a PHP programmer, that means you only know about 18 JavaScript frameworks. It's one of those things; I don't know why PHP isn't in there. I think PHP is a little more heavyweight than either Python or JavaScript right now. I can talk to the guys who are in charge of that; I'm not sure why they didn't put PHP in. But basically, my career for the past 25 years has been thanks to Monty Widenius, David Axmark, and Rasmus Lerdorf, because I've basically been doing those two things for the past 25 years.