Hello, everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled "Putting Complex Data Types to Work." I'm Jeff Healy. I lead Vertica Marketing, and I'll be the host for this breakout session. Joining me is Deepak Majetti, Technical Lead from Vertica Engineering. But before we begin, I encourage you to submit questions and comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forum at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forum to keep the conversation going. Also, as a reminder, you can maximize your screen by clicking the double-arrow button in the lower-right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Deepak.

Thanks, Jeff. I'm excited to talk about the complex data types work we've been doing at Vertica R&D. Without further delay, let's see why and how we should put complex data types to work in your data analytics.

So this is the outline of my talk today. First, I'm going to talk about what complex data types are and some of their use cases. I will then quickly cover some file formats that support these complex data types. I will then take a deep dive into the current support for complex data types in Vertica. Finally, I'll conclude with some usage considerations, what is coming in our 10.0 release, and the future roadmap and directions for this project.

So what are complex data types? Complex data types are basically data structures composed of primitive types. Primitive types are nothing but your basic types: int, float, string, bool, binary, et cetera. Some examples of complex data types include struct (also called row), array, list, set, map, and union. Complex data types can also be built by composing other complex data types. Complex data types are very useful for handling sparse data; there are some examples in this presentation of that use case. They also help simplify analysis.

So let's look at some examples of complex data types. In the first example on the left, you can see a simple customer type, which is a struct with two fields, namely a field called name of type string and a field called id of type integer. Structs are nothing but a group of fields, and each field has a type of its own. That type can be primitive or another complex type. On the right, we have some example data for this simple customer complex type. It's basically two fields, of type string and integer, so in this case you have two rows, where the first row has name Alex and ID 1001 and the second row has name Mary with ID 2002.

The second complex type on the left is phone_numbers, of type array. The array has element type string. An array is nothing but a collection of elements. The elements could again be a primitive type or another complex type. In this example, the collection is of type string, which is a primitive type. And on the right, you have some example data for this array collection called phone_numbers. Basically, each row has a list or collection of phone numbers: on the first row you have two phone numbers, and on the second you have a single phone number in that array.
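To make these two examples concrete, here is a minimal sketch of how they could be declared in Vertica's inline syntax, written as a Parquet external table since that is where row types land first; the table name and path are illustrative, not from the slides:

```sql
-- customer is a struct (row) of two fields; phone_numbers is a
-- one-dimensional collection of strings.
CREATE EXTERNAL TABLE customers_sketch (
    customer ROW(name VARCHAR, id INT),
    phone_numbers ARRAY[VARCHAR]
) AS COPY FROM '/data/customers/*.parquet' PARQUET;
```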
The third type on the slide is the map data type. A map is nothing but a collection of key-value pairs: each element is a key and a value, and you have a collection of such elements. The key is usually a primitive type; however, the value can be a primitive or a complex type. In this example, both the key and value are of type string. If you look on the right side of the slide, you have some sample data. Here we have HTTP request headers, where the key is the header type and the value is the header value. For instance, on the first row we have key Pragma with value no-cache and key Host with some host name as the value, and similarly on the second row you have a key called Accept with a value of text/html. Because they actually hold collections of elements, arrays and maps are commonly called collections in many documents.

So we saw examples of one-level complex types. On this slide, we have nested complex data types. On the right, we have the root complex type, called web_events, of type struct. The struct has four fields: session_id of type integer, session_duration of type timestamp, and then the third and fourth fields, customer and http_request, which are further complex types themselves. Customer is again a complex type, a struct with three fields, where the first two fields, name and id, are primitive types; however, the third field is the phone_numbers complex type that we just saw on the previous slide. Similarly, http_request is the same map type that we just saw.

In this example, each complex type is independent, and you can reuse a complex type inside other complex types. For example, you can build another type called orders and simply reuse the customer type. Keeping complex types independent has that advantage of reuse; however, the complication is that in a practical implementation you have to deal with complexities involving security, ownership, and life-cycle dependencies.

On this slide, we have another style of declaring a nested complex type, called an inline complex data type. We have the same web_events struct type; however, if you look at the complex types, they are embedded into the parent type definition. The customer and http_request definitions are embedded inline into the parent struct. The advantage of this is that you won't have to deal with the security and other life-cycle dependency issues, with the downside being that you can't reuse them. So it's a trade-off between these two styles.

So let's now see some use cases of these complex types. The first use case, or benefit, of using complex types is that you'll be able to express analysis more naturally. Complex data types simplify the expression of analysis logic, thereby simplifying the data pipelines. In SQL, it feels as if you have tables inside a table. Let's look at an example: say you want to list all the customers with more than 1,000 website events. If you have complex types, you can simply create a table called web_events with one column of type web_events, which is the complex type we just saw, with its four fields: session ID, session duration, customer, and HTTP request. So you can basically have the entire schema in one type.
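As a sketch, that one-column schema could look like the following in inline syntax. It's written as a Parquet external table since nested rows land there first, and the map field is approximated as an array of key/value rows because a native map column is not declarable at this point; all names and the path are illustrative:

```sql
CREATE EXTERNAL TABLE web_events (
    event ROW(
        session_id INT,
        session_duration TIMESTAMP,
        customer ROW(name VARCHAR, id INT, phone_numbers ARRAY[VARCHAR]),
        http_request ARRAY[ROW(key VARCHAR, value VARCHAR)]  -- stand-in for a map
    )
) AS COPY FROM '/data/web_events/*.parquet' PARQUET;
```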
If you don't have complex types, you'll have to create four tables, essentially one for each complex type, and then you have to establish primary key/foreign key relationships across these tables. Now, if you want to achieve your goal of listing all the customers with more than 1,000 web requests: if you have complex types, you can simply use dot notation to extract the name and the contact, and use some special functions for maps that will give you the count of all the HTTP requests, filtering for counts greater than 1,000. However, if you don't have complex types, you'll have to join each table individually, extract the results from a subquery, join it again in the outer query, and finally apply a predicate of total requests greater than 1,000 to get your final results. So complex types basically simplify the query-writing path.

The execution itself is also simplified. You don't have to have joins if you have complex types: you can simply have a load step to load the map type, and then you can apply the function on top of it directly. However, if you have separate tables, you have to join all this data, apply the filter step, and then do another join to get your results.
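Here is a sketch of the two query styles just contrasted; all table and column names are illustrative, and APPLY_COUNT stands in for the map-count helper mentioned above:

```sql
-- With complex types: dot notation plus a collection-count helper.
SELECT event.customer.name
FROM web_events
WHERE APPLY_COUNT(event.http_request) > 1000;

-- Without complex types: a subquery plus joins across normalized tables.
SELECT c.name
FROM customers c
JOIN (
    SELECT customer_id, COUNT(*) AS total_requests
    FROM http_requests
    GROUP BY customer_id
) r ON r.customer_id = c.id
WHERE r.total_requests > 1000;
```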
The other advantage of complex types is that you can process semi-structured data very efficiently. For example, if you have data from clickstreams or page views, the data is often sparse, and maps are very well suited for such data. Maps are semi-structured by nature, and with this support you can now actually have semi-structured data represented alongside structured columns in the database. Maps have this nice ability to encapsulate sparse data. As an example, the common fields of clickstream or page-view data are Pragma, Host, and Accept. If you don't have map types, you will end up creating a column for each of these header or field types. However, if you have maps, you can basically embed all the data as key-value pairs. On the left of the slide, you can see an example where, if you have a separate column for each field, you end up with a lot of nulls; the data is basically sparse. However, if you embed them in a map, you can put them into a single column and get better efficiency and a better representation of the sparse data. Imagine you have thousands of fields in a clickstream or page view: you would need thousands of columns to represent the data if you don't have a map type.

So given that these are the most commonly used complex types, let's see which file formats actually support these complex data types. The most popular file formats support complex data types; however, they have different variations. For instance, JSON supports arrays and objects, which are complex data types. However, JSON data is schemaless, it is row oriented, and it is text based. Because it is schemaless, it has to store redundant keys on every row. The second file format is Avro, and Avro has records, enums, arrays, maps, unions, and a fixed type. Avro has a schema, it is row oriented, and it is binary and compressed. The third category is basically the Parquet and ORC style of file formats, which are columnar. Parquet and ORC have support for arrays, maps, and structs. They have a schema, they are column oriented (unlike Avro, which is row oriented), they are binary and compressed, and they support very nice compression and encoding types.

The main difference between Parquet and ORC is only in how they represent complex types. Parquet encodes the complex type hierarchy as repetition and definition levels. However, ORC uses a separate column at every parent of the complex type to basically represent the null-ness. Apart from that difference in how they represent complex types, Parquet and ORC have similar capabilities in terms of optimizations and compression techniques. So to summarize: JSON has no schema, has no binary format, and is not columnar. Avro has a schema and a binary format; however, it is not columnar. And Parquet and ORC have a schema, have a binary format, and are columnar.

So let's see how we can query these different kinds of complex types, in the different file formats they can be present in, and how we can query these different variations in Vertica. In Vertica, we have this feature called flex tables, where you can load complex data types and analyze them. Flex tables use a binary format called VMap to store data as key-value pairs. Flex tables are schemaless and weakly typed, and they trade performance for flexibility. What I mean by schemaless is basically that the keys provide the field names, and each row can potentially have different keys. And they are weakly typed because there is no type information at the column level; the data is stored in text format. We will see some examples of this weak typing in the following slides. Because of the weakly typed and schemaless nature of flex tables, you can implement some use cases very easily: you can trivially implement needs like schema evolution, or keep the complex types fluid. If that is your use case, then the weakly typed and schemaless nature of flex tables gives you a lot of flexibility. However, because they are weakly typed, the downside is that you don't get the best possible performance.

If your use case is to get the best possible performance, you can use a new feature, the strongly typed complex types that we have started to introduce in Vertica. Complex types here are strongly typed: they have a schema, and they give you the best possible performance, because the optimizer now has enough information from the schema and the types to implement optimizations such as column selection. All the nice techniques that Vertica employs to give you the best possible columnar performance can now be supported even for complex types. We'll see examples of these two approaches in the following slides.

So let's use some simple restaurant data as a running example throughout the following slides to see all the different variations of flex and complex types. On this slide, you have some sample data with four fields and essentially two rows, if you separate them out. The four fields are name, cuisine, locations, and menu. Name and cuisine are of type varchar, locations is an array, and menu is an array of rows with two fields, item and price. If the data is in JSON, there is no schema and there is no type information, so how do we process that in Vertica? In Vertica, you can simply create a flex table called restaurants, copy the restaurants.json file into Vertica, and you can now start analyzing the data.
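A minimal sketch of that flow, assuming the file path shown:

```sql
-- Flex tables need no column definitions; the JSON parser stores
-- each row's key-value pairs in the VMap-backed raw column.
CREATE FLEX TABLE restaurants();
COPY restaurants FROM '/data/restaurants.json' PARSER fjsonparser();
```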
So if you do a SELECT * FROM restaurants, you'll see that all the data is actually in one column, called __raw__, and you also have another column called __identity__, which gives you a unique row ID. The __raw__ column basically encapsulates all the data that is in the restaurants.json file. This column is nothing but the VMap format. The VMap format is a binary format that encodes the data as key-value pairs, and the __raw__ column is backed by the long varbinary column type in Vertica. Each key gives you the field name, and the value is the field value; however, the values are in their text representation.

Say now you want to get better performance from this JSON data. Flex tables have some nice functions to analyze your data and try to extract some schema and type information from it. If you execute COMPUTE_FLEXTABLE_KEYS on the restaurants table, you'll see a new table called public.restaurants_keys, which gives you some information about your JSON data. It was able to automatically infer that your data has four fields, namely name, cuisine, locations, and menu, and could also infer that name and cuisine are varchar. However, since locations and menu are complex types themselves (one is an array and one is an array of rows), it keeps using the VMap format as-is to process them. So it has four columns: two primitive, of type varchar, and two that are VMaps themselves. Now you can materialize these columns by altering the table definition and adding columns of the particular inferred types, and then you get better performance from these materialized columns. The data is not in a single column anymore: you have four columns for the prior restaurant data, and you get column selection and the other optimizations that Vertica provides.

All right, so flex tables are basically helpful if you don't have a schema and you don't have any type information. However, we saw earlier that some file formats like Parquet and Avro have a schema and some type information. In those cases, you don't have to do the first step of inferring the types: you can directly create an external table definition with the right types, target it at the Parquet file, and load the data via an external table in Vertica. So for the same restaurants.json, if you convert it to a restaurants.parquet format, you can get the primitive fields directly; however, locations and menu are still in the VMap format.

The VMap format also allows you to explode the data, and it has some nice functions to extract the fields from the VMap format. You have this MAPITEMS function: for the same restaurant data, if you want to explode it and apply predicates on the fields of the arrays and the arrays of rows, you can use MAPITEMS to explode your data, and then you can apply predicates on a particular field in the complex-type data. This slide basically shows how you can explode the entire data, the menu items as well as the locations, and it gives you the elements of each of these complex types.
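Here is a sketch of the inference and explode steps just described; MAPITEMS is a transform function, so it is called with an OVER clause, and the key/type column names follow the inference table:

```sql
-- Infer keys and type guesses, then materialize the inferred columns.
SELECT COMPUTE_FLEXTABLE_KEYS('restaurants');
SELECT key_name, data_type_guess FROM public.restaurants_keys;
SELECT MATERIALIZE_FLEXTABLE_COLUMNS('restaurants');

-- Explode the VMap fields of each restaurant into key-value rows.
SELECT name, MAPITEMS(locations) OVER (PARTITION BY name)
FROM restaurants;
```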
As I mentioned, if you go back to the previous slide, the locations and menu items are still in the long varbinary, or VMap, format. So the question is: if you want to get better performance on the VMap data, what do you do? For primitive types, you could materialize them into their primitive types. However, if it's an array or an array of rows, we need some first-class complex type constructs, and that is what has now been added in Vertica.

Vertica has started to introduce complex data types where each complex type is strongly typed. On this slide, you have an example of a row complex type: we create an external table called customers with a row type with two fields, name and id. The complex type is basically inlined into the column definition in the table. In the second example, you can see CREATE EXTERNAL TABLE items, which is a nested row type: it has an item of type row, which itself has two fields, name and properties, where properties is again another nested row type with two fields, quantity and label. These are strongly typed complex types, and the optimizer can now give you better performance compared to the VMap by using this strongly typed information in your queries. So we have support for pure rows and nested rows in external tables for Parquet. We have support for arrays and nested arrays as well for external tables in Parquet: you can declare an external table called contacts with phone_numbers as an array of integers, and similarly you can have a nested array of items of type integer. We can declare a column with that strongly typed complex type.

The other complex type support that we are adding in the 10.0 release is support for optimized one-dimensional arrays and sets, for both ROS and Parquet external tables. So you can create an internal table called phone_numbers with a one-dimensional array: here you have phone_numbers as an array of type int. You can have one-dimensional arrays, and you can have sets as well, which are also sort of one-dimensional arrays, but sets are optimized for fast lookups: they have unique elements and they are ordered, so you can get fast lookups using sets. If that is your use case, then sets will give you very quick lookups for elements. We also implemented some functions to support arrays and sets. You have APPLY_MIN and APPLY_MAX, which are scalar functions that you can apply on top of an array and get the minimum element, and so on. So you have support for these additional functions as well.

The other feature that is coming in 10.0 is the explode-arrays functionality. We have implemented a UDx that allows you, similar to the MAPITEMS example you saw earlier, to extract elements from these arrays and apply different predicates or analyses to the elements. For example, if you have this restaurant table with the column name of type varchar, locations, which is an array of varchar, and menu, again an array of varchar, you can insert values using the array constructor into these columns. Here we are inserting three rows: Lily's Pizzeria, with locations Cambridge and Pittsburgh and menu items cheese and pepperoni; another row with restaurant name Bob's Tacos, location Houston, and menu items tortillas and salsa; and Taj Mahal as the third example. So now you can explode both arrays and extract the elements from them: you can explode the locations array and extract the location elements, which are basically Houston, Cambridge, Pittsburgh, and New Jersey, and you can also explode the menu items, extract the individual elements, and then apply other predicates to the exploded data.
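A sketch of that flow, assuming the UDx is exposed as EXPLODE with an OVER clause; the table name is illustrative and one sample row is shown:

```sql
-- One-dimensional arrays in a regular (ROS) table, per the 10.0 support.
CREATE TABLE restaurants_arr (
    name VARCHAR,
    locations ARRAY[VARCHAR],
    menu ARRAY[VARCHAR]
);
INSERT INTO restaurants_arr
    VALUES ('Bob''s Tacos', ARRAY['Houston'], ARRAY['tortillas', 'salsa']);

-- Explode the locations array into one row per element.
SELECT name, EXPLODE(locations) OVER (PARTITION BY name)
FROM restaurants_arr;
```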
All right, so let's see some usage considerations for these complex data types. Complex data types, as we saw earlier, are nice if you have sparse data: if your data is clickstream or page-view data, then maps are very nice to have, and you can represent sparse data efficiently, space-wise, using map types. And complex data types, as we saw earlier with the web-request count query, help you simplify the analysis as well: you don't have to have joins, and your queries become simpler.

As I just mentioned, if your use case calls for fast lookups, then you can use the set type. Arrays are nice, but they maintain ordering; if your primary use case is to just look up certain elements, then you can use the set type (see the sketch at the end of this section).

Also, you can use the VMap or flex functionality that we have in Vertica if you want flexibility in your complex data type schema. Like I mentioned earlier, you can trivially implement needs like schema evolution, or even keep the complex types fluid. If you have multiple iterations of your data analysis, and in each iteration you are changing the fields because you're just exploring the data, then VMap and flex give you that nice ease of changing the fields within the complex type or across files, and you can easily load complex types with fluid fields, basically different fields in different rows, into VMap and flex tables. However, once you have iterated over your data and figured out which fields and complex types you really need, you can use the strongly typed complex data types that we have started to introduce in Vertica: the array type, the struct type, and the map type for your data analysis.

So those are, at a high level, the use cases for complex types in Vertica. It depends a lot on where you are in your data analysis. If you're early, then your data is usually still fluid, and you might want to use VMaps and flex to explore it. Once you finalize your schema, you can use the strongly typed complex data types to get the best possible performance.
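Going back to the fast-lookup consideration, here is a minimal sketch of a set column; the table, the SET constructor usage, and the CONTAINS helper are assumptions for illustration:

```sql
-- SET columns store unique, ordered elements for fast membership tests.
CREATE TABLE sessions (id INT, tags SET[VARCHAR]);
INSERT INTO sessions VALUES (1, SET['mobile', 'checkout']);

-- Count sessions whose tag set contains 'checkout'.
SELECT COUNT(*) FROM sessions WHERE CONTAINS(tags, 'checkout');
```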
So what's coming in the following releases of Vertica? In 10.0, the next release of Vertica, coming out shortly, we're adding support for loading Parquet complex data types into the VMap format. Parquet is a strongly typed file format: it has the schema, and it also has the type information for each of the complex types. However, if you're exploring your data, you might have different Parquet files with different schemas, so you can load them into the VMap format first, analyze your data, and then switch to the strongly typed complex types. We're also adding optimized one-dimensional arrays and sets, in ROS and for Parquet. So the complex types are not just limited to Parquet; you can also store them in ROS. However, right now we only support one-dimensional arrays and sets in ROS. We're also adding the explode UDx for one-dimensional arrays in this release, so, as you saw in the previous example, you can explode the data in arrays and apply predicates to the individual elements of the array data types. This also applies to sets: you can cast them trivially to arrays, so you can explode sets as well.

So what are the plans past the 10.0 release? We are going to continue building out support for strongly typed complex data types. In the 10.0 release, we won't have support for all the combinations of complex types: we only have support for nested pure arrays and nested pure rows, and some features are limited to the Parquet file format. So we will continue to add more support for subqueries and nested complex types in the following releases.

We're also planning to add a VMap data type. You saw in the example that the VMap data format is currently backed by the long varbinary column type. Because of this, the optimizer really cannot distinguish which data is actually long varbinary and which is actually data in the VMap format. The idea is to add a type called VMap; the optimizer can then implement optimizations, or even syntax such as dot notation. And if your data is columnar, such as Parquet, then you can implement optimizations like key pushdown, where you push down the keys that you are actually querying in your analysis, and only those keys would be loaded from Parquet and built into the VMap format. That way you get the columnar-selection optimization for complex types as well, which is something you can only achieve if you have a distinct type for the VMap format. So that's on the roadmap as well.

Then, an unnest join is another nice-to-have feature. Right now, if you want to explode and join the array elements, you have to explode in a subquery and then join the data in the outer query. An unnest join would allow you to explode as well as join the data in the same query, doing both on the fly.

And finally, we are also adding support for a new feature called UD vector; that's on the plan too. All our work for complex types is essentially changing the fundamental way Vertica executes functions and expressions. Right now, all expressions in Vertica can return only a single column, except in some cases like UD transforms and so on; a scalar function, for instance a UD scalar, can give you only one column out. However, there are use cases where you want multiple computations on the same input data. Say you have input data of two integers, and you want to compute both the addition and the multiplication of those two columns. This is just an example, but many machine-learning use cases have a similar pattern. If you want to do both of these computations on the data at the same time, then in the current approach you have to have one function for the addition and one function for the multiplication, and both of them have to load the data, basically loading the data twice to get both computations done. However, with UD vector support, you can perform both computations in the same function and return two columns out, essentially saving you from loading these columns twice: you load the data once and get both results out. That's what we are planning to implement with all the changes that we're doing to support complex data types in Vertica. And you won't have to use an OVER clause as with a UD transform; like UD scalars, you can have a UD vector and have multiple columns returned from your computations.

So that concludes my talk. Thank you for listening to my presentation. Now we are ready for Q&A.