Okay, so the next talk will be a tour of the PostgreSQL data types by Andreas Scherbaum, and he will introduce himself, I guess.

So, can you hear me? Does that work? Okay, welcome. Yeah, let's talk about some of the Postgres data types we have. A short note about myself: my name is Andreas. I work for a company called Pivotal; before that I worked for EMC, and Pivotal is actually a subsidiary of EMC, so I moved between EMC and Pivotal and so on. And I work on the Greenplum and Postgres projects.

Okay, how many of you are using Postgres? Okay, the other way around: how many of you have never used Postgres? That's good. Anyone know more than, like, 10 data types in Postgres? Okay, that's like 10 hands, that's good. This talk is for you. Anyone who is here to get a seat for Simon's talk? Unfortunately, that talk is cancelled, I'm sorry.

Okay, first question: how many data types do we have in Postgres? I take answers. 40? 63? Any more? Only one. Only one, that must be varchar. Yeah, if I look into pg_type, in my database I get 361 right now. Unfortunately, this also includes any table types we have, because they go into the same catalog table, and it includes, for every type, the reference to its array type. If I filter those out, I get 82 data types in my database. Still a lot. That's a long list. We don't want to look at all of them, because about half of them we don't care about, like, what's a tsrange, or what's an SQL identifier, and so on. But we have some very interesting data types here we want to have a look at.

Luckily, the Postgres documentation, which by the way is very good, provides a list of the data types we want to use, and this table in Postgres 9.6 comes with 41 data types. Okay, well, we asked this question, how many data types are you using, and most of you are not even using 10. Let's see what we have for you. So we have text types in Postgres, we have numeric types, dates, and so on. We even have some special types handling JSON, Boolean, or binary data. And we start with text types.
How many do we have? Any idea? Two? Which are? Well, in the end we have varchar, we have char, we have text, but internally they are all the same. It's one data type internally; it's just the external representation that differs. And by the way, unlike some other databases, all text types in Postgres are case-sensitive. So if you use LIKE and you want a case-insensitive comparison, you need to use ILIKE, not LIKE.

If you use varchar or text, you can store strings up to roughly one gigabyte of text. If you specify a char without any length specification, it's actually just one character. Probably not what you want to do. So with char, you want to specify your length. With varchar, the length is optional, and varchar does not pad with white space at the end.

The other interesting case, if you look at white spaces at all: I have five white spaces here and I cast them to varchar and char of length five and ten. And you see the lengths here: with my varchar(5) and my varchar(10), they both end up with five white spaces. If I cast to char, I end up having ten characters, because with char, if I specify a length, Postgres will actually pad with white space up to the specified length.

Okay, how does Postgres store any text type, or for that matter, any variable-length type? Back in the good old days, we could only store strings up to roughly eight kilobytes. We have so-called pages, every page is eight kilobytes in size, and Postgres aims to store about four rows minimum in every page. So it's roughly 2K we can have per row in the table. This also defines how many columns we can have in a table; it's about one and a half thousand columns per table, because every row has to fit in a page. So how can we store a gigabyte of text in eight kilobytes? Well, we can try to compress it. We can increase the block size to a gigabyte. Or we can just have the text column be a pointer into yet another table, and store multiple rows for every text value. That's what Postgres is doing.
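The padding rules just described can be sketched in Python. This is a rough model, not Postgres itself; the helper names `cast_char` and `cast_varchar` are invented for illustration, and real Postgres raises an error on non-space overflow instead of silently truncating:

```python
def cast_char(value, length):
    # char(n): truncate to n characters, then pad with spaces up to n
    return value[:length].ljust(length)

def cast_varchar(value, length=None):
    # varchar(n): truncate to n, never pad; varchar without a length is unbounded
    return value if length is None else value[:length]

spaces = " " * 5
len5_char = len(cast_char(spaces, 5))          # stays 5 characters
len10_char = len(cast_char(spaces, 10))        # padded to 10 characters
len10_varchar = len(cast_varchar(spaces, 10))  # stays 5, no padding
```

The point of the model: only char pads; varchar and text store exactly what you give them.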
It's totally transparent for you, but it means you have two tables now. If you do a vacuum in a database, occasionally you see it's not only vacuuming your table, but it's also vacuuming so-called TOAST tables. You see them flying by if you vacuum with the verbose option. This is how Postgres avoids increasing the page size to one gigabyte: the column holds a pointer into yet another table, which is also using eight-kilobyte pages, but it uses as many rows as needed until you have your gigabyte, or whatever data you want to store. This also gives you variable lengths for varchar, text, numeric, and so on, and Postgres handles all of this for us.

So what about this varchar(255)? Who has ever run into a table definition having a varchar(255)? That's about 20 hands. Why? Any ideas? I have heard Oracle before. I have heard Informix before. I have heard SQL Server before, because back in the 90s they could only store 255 bytes in one column. This problem doesn't apply to Postgres. If you run into a table with char or varchar of 255, go and ask why someone defined the table this way. It probably doesn't make sense. It might come from the fact that 255 is actually 2 to the power of 8 minus 1, so it fits in 8 bits. And MySQL has an interesting limit: an index key can be 3 times 255 bytes, not more. Not like in Postgres. MySQL reserves 3 bytes per UTF-8 character, and this limits the key to 765 bytes. Yet another reason why they have 255 here.

Okay, let's move on to numeric types. How many do we have here? So we have integers, we have floating points, and we have numeric. This can be kind of confusing, because if you come from Oracle, most people on Oracle use a numeric type to store any kind of integers and such, but you don't want to do this in Postgres; most likely you want an integer, not a numeric type. On the other hand, I've seen customers storing any kind of money data in a floating point.
Which is kind of funny if you want to do auditing later on, because floating point is always rounded up or down; it's never exact.

Okay, we have smallint, which requires 2 bytes and can go from minus 32K to plus 32K. Most likely you have an integer, which goes from minus 2 billion up to plus 2 billion; it's always a signed integer. And we have a ridiculously large bigint, which uses 8 bytes internally. If you have tables with multiple smallint columns, you want to look at how you align the table. If you have smallint, integer, smallint, integer, you would expect 12 bytes; it's actually 16 bytes, because an integer is always aligned to 4 bytes. You cannot start an integer at a 2-byte offset, so Postgres is actually losing 2 padding bytes here and 2 padding bytes there. If you just rearrange your table to smallint, smallint and then integer, integer, you have 12 bytes. If you have a million rows, okay, it's a few megabytes. If you have a billion rows, we're talking gigabytes here.

Then we have floating point numbers. We have real, it's 4 bytes, which has a precision of up to 6 digits. And we have double precision, which goes up to 15 digits. It always means it's a rounded number, but it can store almost everything, as you can see here. So we have 6 digits here, and okay, it shows us the correct number. Once I add one more digit, you see it's rounded. This will really hurt you: I have 7 digits here, and it gives me back a totally different number. If you try to store any kind of money or such in a floating point number, your numbers will be wrong. Don't do this. The same for double precision: 15 digits is okay, it gives me back the same number. Once I go to 16 digits, it starts rounding my number. Same problem: I have 16 digits here, and it starts giving me back another number; it's rounded here.

So what can I do if I have to store something like money? I have to use numeric. Numeric has a precision up to...
Oh, let me go back, sorry. We also have a money type in Postgres. You don't want to use this one at all. There are like two users in the world who use this in total. I think it was declared deprecated two or three times, then resurrected and patched again. It uses 8 bytes of storage, and it can store up to 92 quadrillion. For comparison, that's still about 4 digits more than the current US debt, so you still have a way to go. If you want to store money, you really want to use numeric.

With the numeric data type, you specify your precision by saying: that's my total number of digits, and that's my number of digits after the decimal point. So I have 10 here; you can count 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. It's a bit confusing, but you get used to it. On the other hand, numeric is handled in software, so there's no hardware support like you have with integers. An integer can be handled directly in the CPU. That's not the case for numeric, so it will be slower, but you get your precision.

Okay, anyone know this guy? No? Not in person? Yeah, it's been a while. Anyone know this puzzle? How much is it? Minus 1. Yes, it's 2 to the power of 64 minus 1. If you try this with double precision, it gives us this very inaccurate number. If you try the same with numeric, 2 to the power of 64 using numeric, we get an exact number here, and it's still only using 20 digits. We have 980 left. So if you ever find a use case for all of the thousand digits numeric gives you, please let me know.

Okay, let's move on. We also have sequences. In other databases, you might know this as auto increment or so on. If you define a column in Postgres as serial, Postgres will automatically create a sequence for you and set it as the default value for this column. It will also attach the sequence to the table, so if you drop the table, it will also drop the sequence for you. It doesn't have to be attached; it can be.
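The chessboard number makes the rounding concrete. Python's `float` is the same IEEE 754 double as Postgres's double precision, and `decimal.Decimal` behaves like numeric, so the behavior shown on the slide can be reproduced outside the database (a sketch of the arithmetic, not of Postgres internals):

```python
from decimal import Decimal, getcontext

chessboard = 2**64 - 1       # 18446744073709551615, twenty digits

# double precision keeps only about 15-16 significant digits,
# so the twenty-digit number gets rounded to the nearest double
rounded = float(chessboard)

# numeric/Decimal keeps every digit, given enough precision
getcontext().prec = 50
exact = Decimal(2)**64 - 1
```

`rounded` is no longer the original number; `exact` is, and it still uses only 20 of the available digits.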
Sequences by default start with 1 and increase by 1, but you can change this. So you can have a sequence which does steps of 5, or goes backwards, or starts with a thousand, whatever. Some people use this if they want to count values in a query: they just create a sequence in a transaction, count something, and then drop it. Works this way.

A sequence is not transactional. Any time you receive a value from a sequence, it's gone; it's never rolled back. So you will have gaps. If you roll back a transaction and you received 50 or 100 sequence values, they are gone. There are a few functions for sequences. You can use currval to get the current value, but it will only work if you already consumed a value in your current session; if not, this will return an error. You can consume the next value of a sequence with nextval: this will just increase by 1 and give you back the value, and afterwards you can use currval to retrieve the same value over and over again. And you can set a sequence to a new value by using setval. Some people like to get rid of the gaps they have from the sequence, like when they had to roll back a huge transaction and are left with a big gap. You can easily do this by saying: setval, with the maximum value from my table.

Never rely on a sequence name. Usually it's table name, column name, and a suffix of seq, but it doesn't have to be. There's a function, pg_get_serial_sequence, which you can use: you specify a table name and a column name, and Postgres will return the sequence name to you. And then you can go and say select star from my sequence, because it acts like a table. It gives you all kinds of information, like: what's my current value, and can my sequence actually cycle over? Like, if I have a smallserial, I can go up to 32K. What happens if I go over this? By default, Postgres will just reject it. You can change this and say: okay, I want to wrap around and start from the beginning.
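The non-transactional behavior can be mimicked with a plain counter: once a value is handed out, nothing puts it back. This is a Python analogy for the gap behavior, not the Postgres implementation:

```python
import itertools

seq = itertools.count(1)   # like CREATE SEQUENCE ... START 1 INCREMENT 1

a = next(seq)              # nextval(): hands out 1
b = next(seq)              # nextval(): hands out 2

# simulate a rolled-back transaction: the fetched values are simply discarded
rolled_back = (a, b)

c = next(seq)              # the sequence does not reuse 1 and 2: a gap remains
```

This is why sequence-backed ID columns have holes after rollbacks: the counter only ever moves forward.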
All right, at any point, if you have questions, please just ask. Where's my mic? Can you please repeat the question? Yes, I will do. The question was about pg_get_serial_sequence: it gets you the sequence name. If you want to see which sequence is used, you have to look at the default value for this column. So you have two things: you have a sequence with a sequence name, and you have a default value. And in Postgres, if you actually specify a value for this ID column, it will not use the sequence. So if you want to see which sequence is used in a column, you have to look at the default value. You can also have a sequence used in several tables, like a global counter. This works as well; it's not limited to one table.

Okay, what's this? Any ideas? How did you figure out this one? 90 degrees south latitude: it's the South Pole. Okay, what's the time there, right now? That's a valid answer, yes: all of them. Actually, that's warm. UTC is warm. Okay, let's find out.

We have several date, datetime, and interval data types in Postgres. We have a timestamp with and without time zone; usually people use the shortcuts here, timestamp and timestamptz. We have a time with and without time zone, obviously without a date. And we have a date, obviously without a time zone, because it's just a date. Never use a varchar or a char to store a timestamp. Please don't. If you use a timestamp with a time zone, Postgres will actually handle any kind of summer time and winter time changes for you; I'll show you a few examples. It will not handle leap seconds for you, because the operating system has no idea about leap seconds. They are not stored anywhere. They happen occasionally, like last New Year's Eve, but they are not stored anywhere.

Okay, here we have a simple timestamp. Actually, I use a date here and cast it to a timestamp, so you see I get back the midnight value for the same day.
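The same cast behavior, a date turning into midnight of that day, is easy to check with Python's datetime, as an analogy for the Postgres date-to-timestamp cast:

```python
from datetime import date, datetime, time

d = date(2016, 8, 15)

# like casting a date to a timestamp: the time part becomes 00:00:00
ts = datetime.combine(d, time.min)
```

The date part survives unchanged and the time is filled in as midnight.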
I can even use this silly format they use in the United States, with the month first and then the day; Postgres can handle it. Now I use timestamptz here, with a time zone. I specify the time zone UTC, and you see there's a difference. First of all, I used PM here: I specified 3:25 PM, and it ends up giving me back 5:25 PM. That's because the time is always returned in the time zone of your database server. Not where your client is, but the time zone your database server is running in. So if you want to have less trouble, set all of your database servers to the same time zone; if not, it might be confusing. Right here it's a two hour difference: you can see it's August, it's summer time, so I get two hours difference. Here I use a date in winter, in December, and you see it's one hour difference, so Postgres gets the daylight saving time right.

This works with any time zone name we have in our operating system's time zone database. I can just specify a time zone by name, and it knows how to handle daylight saving time for this zone, including all the changes they had. Like Russia, which a few years ago went to summer time and never went back; ever since, they just stay on summer time all year, which is nice, except they have no summer. If you just specify a number for a time zone, obviously there cannot be any daylight saving time handling.

If I want to return a date in a specific time zone, like here, it's always returned in my own time zone where my server is running. I can set a new time zone by just saying set time zone; then it's returned in the time zone I specified. Postgres is doing all the math for us; we don't have to do it. Very handy. But it's quite cumbersome to set a time zone every time, so I can just say: give me this date back at this specific time zone. This can be part of the query; it couldn't be any easier.
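The summer and winter offsets described above can be reproduced with Python's zoneinfo module, which reads the same operating-system time zone database Postgres uses. Europe/Berlin is an assumed server time zone for the example:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

berlin = ZoneInfo("Europe/Berlin")

# the same wall-clock instant, given in UTC, converted to the server zone
summer = datetime(2016, 8, 15, 15, 25, tzinfo=timezone.utc).astimezone(berlin)
winter = datetime(2016, 12, 15, 15, 25, tzinfo=timezone.utc).astimezone(berlin)

summer_offset_hours = summer.utcoffset().total_seconds() / 3600  # CEST, UTC+2
winter_offset_hours = winter.utcoffset().total_seconds() / 3600  # CET, UTC+1
```

In August the offset is two hours and 3:25 PM UTC displays as 5:25 PM; in December it drops to one hour, exactly the daylight saving behavior the talk demonstrates.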
I can also do math with dates and times. Here I have two timestamps, and I get back the difference, which is an interval: I have four days difference between these two timestamps. Or, like here, I just turn them around and I get minus three days, because this one is smaller than that one. It works with times as well, not only days: I can have my difference in days and times between two timestamps. If you want to know how long ago someone inserted something into the database, just do now minus timestamp, and you get back the time since this happened.

Okay, another interesting use case. Here I have February 28, and I want to know if the year was a leap year or not. I just add an interval of one day: if it's a leap year, I get back February 29; if it's not a leap year, I get back the first of March. (It was a leap year. You can divide it by 400: it was. Are you questioning Postgres? Get out.) Okay, we have 2001 January 1st minus 2000 January 1st, so I get back 366 days, which means 2000 was a leap year. Again, 2002 January 1st minus 2001 January 1st gives me back 365, so 2001 is not a leap year.

Okay, coming back to this question: what time is it at the South Pole?
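Before the answer: the leap-year arithmetic just shown maps one-to-one onto Python's date type, so the same checks can be run outside the database:

```python
from datetime import date, timedelta

# add one day to February 28: leap years land on Feb 29, other years on March 1
leap = date(2016, 2, 28) + timedelta(days=1)
not_leap = date(2015, 2, 28) + timedelta(days=1)

# subtract consecutive January 1sts to count the days in a year
days_2000 = (date(2001, 1, 1) - date(2000, 1, 1)).days
days_2001 = (date(2002, 1, 1) - date(2001, 1, 1)).days
```

2000 comes out at 366 days (divisible by 400, so a leap year despite being divisible by 100), 2001 at 365.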
The South Pole station, Amundsen-Scott, actually belongs to the US, but it's supplied from New Zealand, so for convenience they just use the same time zone as New Zealand, which is NZ. By now we have learned how to query the time for New Zealand: we select now() at time zone NZ. But this is not good enough, at least for us, because the South Pole got its own time zone. So we can actually query select now() at time zone Antarctica/South_Pole. And if you look into the operating system's time zone database, you see that we have like 9 or 10 different time zones defined under Antarctica. Every station down there got its own time zone, usually aligned to the country operating that station.

There was a question. The question was whether I can use this construct at table creation time, to always show a date in a given time zone. I'm not sure, but I don't think so, because Postgres is storing times in UTC; it's always a question of transformation once you retrieve a value. But feel free to try it out.

Okay, let's move on to some hipster stuff here. Well, XML is no longer hipster, we have JSON for that now. XML is actually something from the 90s, where you could define a document and describe its structure. Postgres allows you, if it was compiled with the XML library, to store XML natively in your database and validate whether it has a valid structure or not. It will validate the syntax, like whether your opening and closing tags are correct; it will not validate against any schema definition or DTD. You can store a whole document, or you can just store fragments of an XML document; it works both ways, I'll show you in a moment. Unlike in the standard, Postgres always assumes the encoding you specified as your client encoding. These days most people use UTF-8, not everyone, so it assumes that the document you transfer is encoded in the client encoding you have specified for your database. Except if your client encoding is actually SQL_ASCII; then it uses whatever is specified in the XML document itself. And
you cannot search in XML data types; you can just retrieve parts, but not search in them. I have a few examples for you. We use XMLPARSE to store a document here: we have a full document using XML, and you see it returns my full document. I can also store just a fragment of an XML document. And if I want to retrieve it and use it somewhere else, I can use XMLSERIALIZE on my XML content, and I get back a string which I can use in my application.

But now for the real hipster stuff: we have JSON in Postgres as well. It's used in many web frameworks these days. I think every Internet of Things device which transfers data is using JSON. All of these NoSQL databases, like Cassandra and MongoDB and so on, are JSON-native, and Postgres can do it as well, why not. We have two different data types here. We have jsonb, which will decompose whatever you store in it into a binary structure with path access. And we have the older json data type, which will store exactly what you write into it and will parse and validate it again whenever you work with it. And unlike NoSQL databases, everything you do with json and jsonb in Postgres is transactional, so if you roll it back, it's gone.

A very simple example here: I have my string, and because it's JSON I need to quote it for JSON and just cast it to my json type, and I get my data type back. I can also have arrays: I specify my array here with a number of values, and I get the full array back with all of my values. I can also use key-value pairs, which is what JSON databases, or NoSQL databases, usually do: key value, key value, key value, and I get my result back as key-value pairs.

Okay, that's easy, but I can also look into my structure. I can select, let's say, my key two, which belongs to the string "number two", so I only retrieve this element out of my JSON data type. Or I want to compare: is this value somewhere in the JSON I have in my database? Of course I can use this containment operator and just look it up. Is this here? Yes,
it returns true. Or I can see if this value or this key is somewhere in my data: df? Yes, it's included here, I get my true back. I can also look up whether a key is in my data or not, because we know JSON is all about not having a fixed structure, so I don't really know what is in my data. If I want to figure out whether a specific key or a specific value is in my data, I have to look it up and verify it. With a structured database like Postgres, I know what columns I have in my table; it doesn't work this way in NoSQL.

Good, one of the more advanced features we have: we can actually use an index on jsonb types. Someone should take a picture of this one. So we have a GIN index, a generalized inverted index, in Postgres, but it only works on jsonb, not on json. What I can do: I create an index on my table and specify the values I want to index; here I just access this name field, using GIN. And then whenever I query my table with this name field in it, it will use this index to speed up my query. Which other database, which NoSQL database, can use an index on JSON?

Good, let's move on, we have a few more. We have a real boolean type in Postgres. If you have anything which is supposed to store true/false, please use a boolean, not a smallint. I've seen this before a few times: we have a so-called boolean in a table, we want to migrate it, and we end up with values like 257. What does that mean for a boolean? You can use values like 1 and 0, or yes and no, to store boolean values, but Postgres will always return either true or false, one of these two.

And now one of the more interesting use cases here: I can have a partial index on a boolean column. Well, I can have a partial index on any column, but here it really makes sense. Let's say I have a log table, hundreds of millions of entries, and occasionally there is an error; I have a boolean flag here. Most likely I am not interested in the 99% where there is no error, but
I really want to find all my errors very quickly. So I define myself an index on... I forgot my index definition. Microsoft PowerPoint has encountered a problem, I'm sorry. Okay, I just generate my data here; I have like 2% error values in my table. And if I have a conditional index on my column, where error is true, I can use this index to find all the rows where my error flag is true. In comparison, let's look at the size: the full index on this table is roughly 3000 pages, and my conditional index is 57 pages. And out of those roughly 3000 pages, 2700 I will never ever use, because that's my 98% of non-errors.

Good. We also have a bit data type. It makes sense if you want to store config values, like multiple true/false flags, and don't want to spend one boolean per flag; you can have multiple bits in one field. We have bit, which is a fixed length, and we have bit varying, which is up to the length you specify. If you don't specify a length, it's always bit(1), which is like a boolean. And you can use all the logical operators, like AND, OR, and so on, on these values.

So I have a table with my bit field here, and I can actually use the bit syntax in Postgres to specify bits, so I don't have to do the math with decimal values; Postgres stores this for me. Like here, I have my data and I use an AND: obviously I only get back the bits which are set in both my data and my query value. If I use an OR, I get the bits set in my data or set in my query. I can have XOR; I can do all the bit operations we know from C. Like a shift left here, which basically means multiplication by 2: it's 21 here, I shift it by 1, and I get 42 as the result. Everything inside the database; I don't even have to pull out my data for this. And I can search for specific bits in my field, because that's what we do with databases, we search in them. So I want to find this bit, and I get back the rows where this bit is set.
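These bit operations behave exactly like C's, and Python integers use the same operators, so the examples are easy to verify. The value 21 is from the shift example above; the mask is made up for illustration:

```python
data = 0b10101   # 21, the value from the shift-left example
mask = 0b00110   # an arbitrary query value, invented for this sketch

and_result = data & mask   # AND: bits set in both values
or_result = data | mask    # OR: bits set in either value
xor_result = data ^ mask   # XOR: bits set in exactly one value
shifted = data << 1        # shift left: multiply by 2, so 21 becomes 42

bit_is_set = (data & 0b00100) != 0  # test whether one specific bit is set
```

The last line is the search pattern the talk mentions: AND with a mask and check the result is non-zero.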
Obviously I get all the other bits as well, but I don't care about them.

Okay, do you ever store binary data in Postgres, or any database for that matter? Why are you storing binary data? Images? I have to do this. Postgres has a binary data type for this; it's called bytea. If you just try to store binary data in a text type, there's a problem: Postgres is written in C, and in C a string ends with a binary zero, no bit set. If you try to send binary data as text to the database, you get a parse error at some point, because the string ends there. All the programming languages we have, like Python, Perl, Java, whatsoever, they all have functions to handle binary data. I will not show any samples here for how to transfer binary data to the database; please look up the documentation for your programming language. There are functions for how to store and retrieve binary data from a bytea column. Please don't try to write your own parser; I have seen this.

Postgres has two different formats for presenting binary data. The old format is escape: if you set bytea_output to escape, you see we get escaping here, but it's very hard to parse, because an escaped value can be anything from one to three characters. So a while ago they changed the default to the hex format: every byte is a two-digit hex value here, which is very easy to parse, and it's quicker, by the way.

We also have network address types, three of them. We have inet, which stores a host and an optional network. We have cidr, which only stores a network definition. And we have a macaddr field, which can store a MAC address in 6 bytes. You don't want to use a char or varchar or anything like that to store IP addresses, because then you have to deal with IPv4 and IPv6 addresses yourself. Postgres can do it for you; we have the right data types. Any kind of log parsing you do, any kind of network data you want to store, please use one of these types. Like here, I store an IP address with a network in inet, or I just store an IP address without a network definition,
so I get back my slash 32. And here I just store part of a network, and Postgres makes an assumption about what kind of network I'm looking at, so it's a slash 24, 24 bits for the network. And again, if you do log parsing or such, I can use an index on such a column. So I have my inet table here, I can create an index on it, and I store a number of values. Now I just want to find out which of these values belong to a certain range: I can say where the value is contained in my cidr range here, and you see I only get 2 out of the 3 values back. And this query is supported by an index, so it should be very fast; that's my index scan.

Good. Last but not least, we have arrays. An array is not really a data type on its own, it's built on top of other types: we can store multiple values of the same type in an array. It can have any dimension, it can be multi-dimensional, it's very flexible. Think about it like JSON in one column; before we had JSON in Postgres, people used arrays to handle this.

So here we have a two-dimensional array, my first dimension and my second dimension here, and if I select this as an array using the array constructor, you see I get my two values, and my second dimension here. I can also ask Postgres how many dimensions I have, so it tells me this is 2. I can ask for any element in my array, which makes it very easy to access whatever I have: I just want this first element here, or I want my elements 2 up to 3, so it will return these two. It tells me how many dimensions I have, it tells me what my length is; in this case I ask for the total length of the first dimension, it's 2, so I have 2 values in my array. I can loop over whatever I have in my array, and I can search in my array: if I ask for the position of this element, 'def', it returns 2, because it's the second element in my array. I can prepend and append to an array; there are many more operations, please look them up in the documentation, but these are the more interesting ones. It's a bit confusing, because prepend actually has this parameter before the
array, and append has it afterwards. If you think about which one it is, it makes sense. You can create a string out of an array: if you want to use the array in your application, just use array_to_string. Or you can create a table, a number of rows, out of your array by just using unnest.

Good, last but not least: have you ever created your own data type? Well, I did. We have a few options here. We have composite types, which are a composition of different types. We have enum types. And of course you can write your very own data type, but you have to write some C code for it. And you can use a few extensions we have, like PostGIS and so on.

Let's quickly look into this one. I create my own composite type here; you can see it's a type, and I actually store four different data types in my type. Then I use that type in my table. It's a very simple example: I have two different currencies, I have a date and a value, and I just store my conversion rate between euro and US dollar. For every day I can have one row in my table, holding all the data together. And of course I can select it; it returns a row. I can also just pick what I want out of my composite type.

And last but not least, enum. An enumeration is really a list of values, and you should consider carefully whether you need an enum or not, because this list should never change. Let me give you a good example: how many different gender values do you know? How many? 17? Any more?
Well, Facebook defines more. So if you wanted to create a gender type as an enum, think again. You can create a type like here, as an enum; I have my four different values here. Then I can use this type as a data type in my table. I can insert by using the values into my color column here, and I can also select by using the color, so it really does the mapping between label and internal value in the background for me.

Good, that's it from here. We have a few more; you've seen we have like 40 more data types. We did not look into all of them, but this is an overview of the most commonly used ones you will find in Postgres. Thank you.

Okay, so we have time for one or two quick questions. Please speak into the mic.

Thank you. Right at the beginning you said you can lose space to the alignment requirements of the columns in a table. Now, I seem to remember that years ago the physical layout of a table was decoupled from the logical order of the columns, so I thought this problem was solved. Wasn't it?

No. You have this in databases which have columnar storage, but in Postgres it's just aligned the way you created it; it's all in one page.

One more question.

Hi, thank you for the great talk. One question: will it ever be possible to use indexes on arrays, or use other tables' primary keys as foreign keys in arrays, in the same way as with normal database fields?

I think many people just moved on to JSON instead of arrays, because it's functionally much better than arrays, and you can use all the fancy index stuff on JSON which you cannot use on arrays.

Well, we are out of time anyway, so thanks, Andreas, again. Thanks.