Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "The Shortest Path to Vertica: Best Practices for Data Warehouse Migration and ETL." I'm Jeff Healy, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me today are Marco Gessner and Maurizio Felici, Vertica product engineers joining us from the EMEA region. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait: type a question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to in that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started. Over to you, Marco and Maurizio.

Hello, everybody. This is Marco speaking. I'm a sales engineer from EMEA, as Jeff said. I'll just keep going. This is the agenda: part one will be done by me, part two by Maurizio. As you can see, part one covers big bang versus piece-by-piece migration, migration of the DDL, migration of the physical data model, migration of ETL and BI functionality, what to do with stored procedures, what to do with any existing user-defined functions, and migration of the data. Part two will be by Maurizio. Do you want to talk about it, Maurizio, or should I do it?

Yeah. Hello, everybody. My name is Maurizio Felici, and I'm a Vertica corporate pre-sales engineer like Marco. I'm going to talk about how to optimize the data warehouse using some specific Vertica techniques like table flattening and live aggregate projections.

So let me start with a quick overview of the data warehouse migration process we are going to talk about today. We normally suggest starting by migrating the current data warehouse, the old data warehouse, as is, with limited or minimal changes to its architecture. Clearly we will have to port the DDL and redirect the data access tools and the ETL platform, but we should minimize the amount of changes in this initial phase in order to go live as soon as possible; this is something that we also suggest. In the second phase, we can start optimizing the data warehouse, again with no or minimal changes to the architecture as such. During this optimization phase we can create, for example, live aggregate projections for some specific queries, optimize the encoding, or tune some of the resource pools. This is something that we normally do if and when needed. And finally, again if and when needed, we redesign the architecture of the data warehouse using the full set of Vertica techniques in order to take advantage of all the features we have in the database. This is normally an iterative approach: we go back and tune some specific feature before moving back to the architecture design. We will go through this process in the next few slides. Okay.
In order to encourage everyone to keep using their common sense when migrating to a new database management system (people are often afraid of it), it is often useful to use the analogy of a house move. In your old home, you might have developed solutions for your everyday life that make perfect sense there. For example, if your old Saint Bernard dog can't walk anymore, you might be using a forklift to heave him through your window in the old home. Well, in the new home, consider the elevator, and don't complain that the window is too small to fit the dog through.

This is very much in the same spirit as what Maurizio said about making the transition gentle. Again, I love to remain in my analogy of the house move: picture your new house as your new holiday home. Begin by installing everything you miss and everything you like from your old home. Once you have everything you need in your new house, you can shut down and sell the old one. So move piece by piece and go for quick wins to make your audience happy. You do a big bang only if they are going to retire the platform you are sitting on, if you are really on a sinking ship. Otherwise, again, identify quick wins, implement and publish them quickly in Vertica, reap the benefits, enjoy the applause, use the gained reputation for further funding, and when you find that nobody is using the old platform anymore, you can shut it down. If you really have to migrate in one go, you can still do a big bang, but only if you absolutely have to. Otherwise, migrate by subject area, user group, or a similar clear division.

Right. Having said that, you start off by migrating objects in the database; that's one of the very first steps. It consists of first migrating the places you can put the other objects into, that is, owners and locations, which usually means users and schemas. Once you have those, you extract the tables and views, convert the object definitions, and deploy them to Vertica. And keep in mind that you shouldn't do it manually: never type what you can generate, automate whatever you can.

Users and roles: usually there is a system table in the old database that contains all the roles. You can export those to a file, reformat them, and then you have CREATE ROLE and CREATE USER scripts that you can apply to Vertica. If LDAP or Active Directory was used for authentication in the old database, Vertica supports anything within the LDAP standard.

Catalogs and schemas should be relatively straightforward, with maybe sometimes a difference: Vertica does not restrict you by defining a schema as the collection of all objects owned by a user, but it emulates that behavior for old times' sake. Vertica does not need the catalog; if the old tools you use absolutely need a catalog, in Vertica it is always set to the name of the database.

Having the schemas, the catalogs, and the users and roles in place, you move the data definition language of the objects out of the source. If you are allowed to, it is best to use a tool that translates the data types in the generated DDL. You will hear odb, a tool written by Maurizio by the way, mentioned several times in this presentation; we are very happy to have it. It can export the old database's table definitions because it works with ODBC: it takes what the old database's ODBC driver reports and uses internal translation tables for several target DBMS flavors, the most important of which is obviously Vertica.
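As an example of "never type what you can generate", here is a minimal sketch, assuming an Oracle source, of generating Vertica column definitions straight from the source catalog. The schema and table names are hypothetical and only a few type mappings are shown; in practice, odb's translation tables cover this for you.

    -- Run against the Oracle source: emit one Vertica column definition per line.
    -- Only a handful of type mappings shown; anything else is left for review.
    SELECT column_name || ' ' ||
           CASE data_type
             WHEN 'NUMBER'   THEN CASE WHEN data_scale = 0 THEN 'INT'
                                       ELSE 'NUMERIC(' || data_precision || ',' || data_scale || ')'
                                  END
             WHEN 'VARCHAR2' THEN 'VARCHAR(' || data_length || ')'
             WHEN 'DATE'     THEN 'TIMESTAMP'
             ELSE data_type   -- unmapped types: review manually
           END || ','
    FROM   all_tab_columns
    WHERE  owner = 'SALES'
    AND    table_name = 'CUSTOMERS'
    ORDER  BY column_id;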
If you have to use something else, there are always tools like SQL*Plus in Oracle, the SHOW TABLE command in Teradata, and so on. Each DBMS has a set of tools to extract the object definitions to be deployed in another instance of the same DBMS. For views, you usually have the view definition in the old database catalog as well. One thing that might need a bit of special care: synonyms. A synonym is an alias pointing to the view or table to be referred to, and Vertica emulates it in different ways depending on the specific needs. Vertica has something that other databases often don't have: the search path. It works very much like the PATH environment variable in Windows or Linux. When you specify an object name without the schema name, Vertica searches first in the first entry of the search path, then in the second, and then in the third, which makes synonyms usually completely unneeded.

When you generate your DDL, to remain in the analogy of moving house, dust and clean your stuff before placing it in the new house. If you see a table like the one here at the bottom, it is usually the corpse of a bad migration in the past already: an ID is usually an integer and not an oversized floating-point data type, a first name hardly ever has 256 characters, and if a column is called hire_dt it is not necessarily needed to store the second when somebody was hired. So take good care while you are moving: dust off your stuff and use better data types.

The same applies especially to strings. How many bytes does a string contain that holds four euro signs? It's not four, it's actually twelve, in UTF-8, the way that Vertica encodes strings. An ASCII character is one byte, but the euro sign takes three. That means that when the source has a single-byte character set, you have to pay attention: oversize the strings first, because otherwise data gets rejected or truncated, and then very carefully check what their best size is. The most promising approach is to initially dimension strings in multiples of their original length; odb, with the option you see there (-iu2,4), will double the length of what was a single-byte character column and multiply by four the length of columns that are wide characters in traditional databases. Then load a representative sample of your source data, profile it using the tools that we personally use, to find the actually longest values, and then make the columns shorter. Maurizio may talk later about the issues of having too long and too big data types.

On projection design: we live and die with our projections, as you might know. Remember the rules by which default projections come to exist. What we do initially is, just like for the profiling, load a representative sample of the data, collect a representative set of already known queries, and run the Vertica Database Designer. You don't have to decide immediately; you can always amend things later. Otherwise, follow the laws of physics: avoid moving data back and forth across your nodes and avoid heavy I/O if you design your projections initially by hand.

Encoding matters. You know that the Database Designer is a very tight-fisted thing; it optimizes to use as little space as possible. You will have to keep in mind that if you compress very well, you might end up spending more time reading the data back.
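To make the hand-tuning point concrete before the benchmark on the next slide, here is a minimal sketch of a projection with explicit encodings; the schema, table, and column names are hypothetical:

    -- RLE pays off on low-cardinality columns that lead the sort order.
    CREATE PROJECTION sales.fact_sales_rle
    (
        order_id   ENCODING AUTO,
        sale_date  ENCODING RLE,
        region     ENCODING RLE,
        amount
    )
    AS SELECT order_id, sale_date, region, amount
       FROM sales.fact_sales
       ORDER BY sale_date, region          -- RLE needs the sort to be effective
       SEGMENTED BY HASH(order_id) ALL NODES;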
This is a test where we read a column once with each of several encoding types, and you see that RLE, run-length encoding, if sorted, is not even visible on the chart, while the others are considerably slower. You can get these slides and look at them in detail; I won't go into detail on it here.

BI migrations: usually you can expect 80% of everything to work, to be lifted and shifted. You don't need most of the pre-aggregated tables, because we have live aggregate projections. Many BI tools have specialized query objects for the dimensions and the facts, and we have the possibility to use flattened tables, which will be talked about later; you might have to rewrite those objects by hand. You will be able to switch off caching, because Vertica speeds up everything with live aggregate projections. If you have worked with MOLAP cubes before, you very probably won't need them at all.

ETL tools: if you work row by row against the old database, consider changing everything to very big transactions, and if you use INSERT statements with parameter markers, consider writing to named pipes and using Vertica's COPY command. Mass inserts, yes: the COPY command, that's what I have here.

As to custom functionality, you can see on this slide that Vertica has the biggest number of functions in the database, by far, compared to any other database; we compare them regularly. You might find that many of the functions you have written won't be needed on the new database. So look at the Vertica catalog instead of trying to migrate a function that you don't need. Stored procedures are very often used in the old database to overcome shortcomings that Vertica doesn't have. Very rarely will you actually have to write a procedure that involves a loop; it's really, in our experience, very, very rare. Usually you can just switch to standard scripting.

This is basically repeating what Maurizio said, so in the interest of time I will skip it. Look at this one here: most data warehouse migration tasks can be automated. You can automate DDL migration using odb, which is crucial. Data profiling is not crucial, but it is game changing, and the encoding is the same thing; you can automate it using our Database Designer. The physical data model optimization in general is game changing, and you have the Database Designer for that. For user provisioning, use the old platform's tools to generate the SQL; you have no objects without their owner, so this is crucial. And as to functions and procedures, they are only crucial if they embody the company's intellectual property; otherwise you can almost always replace them with something else. That's it from me for now.

Thank you, Marco. So we will now continue our presentation talking about some of the Vertica data warehouse optimization techniques that we can implement in order to improve the general efficiency of the data warehouse. Let me start with a few simple messages. The first one is that you are supposed to optimize only if and when it is needed. In most cases, just a lift and shift from the old data warehouse to Vertica will provide you with the performance that you were looking for, or even better, so in that case it is probably not really needed to optimize anything. In case you want to optimize, or you need to optimize, then keep in mind some of the Vertica peculiarities. For example, implement delete and update in the Vertica way.
Use live aggregate projections in order to avoid, or better, to limit the GROUP BY executions at query time; use table flattening in order to avoid or limit joins. And then you can also implement in Vertica some specific Vertica extensions like, for example, time series analysis or machine learning on top of your data.

We will now start by reviewing the first of these bullets: optimize if and when needed. Well, if, when you migrate from the old data warehouse to Vertica without any optimization, the performance level is okay, then you probably don't need to optimize anything. But if this is not the case, one very easy optimization technique is to ask Vertica itself to optimize the physical data model using the Vertica Database Designer. How? Well, DBD, which is the Vertica Database Designer, has several interfaces. Here I'm going to use what we call the DBD programmatic API, so basically SQL functions. With other databases, you might need to hire experts to look at your data, your data warehouse, your table definitions, creating indexes or whatever; in Vertica, all you need is to run something as simple as six SQL statements to get a very well optimized physical data model. You see that we start by creating a new design. Then we add tables and queries to this new design, the queries that we want to optimize. We set our target: in this case, we are tuning the physical data model in order to maximize query performance, which is why we use the query objective in our statement; other possible alternatives would be to tune in order to reduce storage, or a mix between tuning storage and tuning queries. And finally, we ask Vertica to produce and deploy this optimized design. It's a matter of literally minutes, and in a few minutes what you get is a fully optimized physical data model. This is something very, very easy to implement.

Keep in mind some of the Vertica peculiarities. Vertica is very well tuned for load and query operations. Vertica writes ROS containers to disk files (a ROS container is a group of files), and we will never, ever change the content of these files. The fact that the ROS container files are never modified is one of the peculiarities, and this approach lets us use minimal locks. We can have multiple load operations in parallel against the very same table, assuming we don't have a primary or unique key constraint enforced on the target table, in parallel, as I said, because they will end up in different ROS containers. A SELECT in read committed requires no lock at all and runs concurrently with INSERT...SELECT, because the SELECT works on a snapshot of the catalog taken when the transaction starts; this is what we call snapshot isolation. Backup and recovery, because we never change our ROS files, are very simple and robust. So we get a huge number of advantages from the fact that we never change the content of the ROS files in the ROS containers. But on the other side, deletes and updates require a little attention.

So what about delete? First, when you delete in Vertica, you basically create a new object, a delete vector. It can be a delete vector either in the ROS or in memory, and this delete vector points to the data being deleted, so that when a query is executed, Vertica will just ignore the rows listed in the delete vector. And it's not just about delete: an update in Vertica consists of two operations, a delete and an insert.
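Before we look more closely at deletes, here is a sketch of the six DBD statements just described. This is a minimal sketch: the design name, the query file, and the output paths are examples, and the optional arguments may vary slightly between Vertica versions.

    SELECT DESIGNER_CREATE_DESIGN('my_design');
    SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.*');
    SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/tmp/queries.sql', TRUE);
    SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('my_design', 'QUERY');  -- or 'LOAD' / 'BALANCED'
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
               '/tmp/my_design_projections.sql',   -- design script written here
               '/tmp/my_design_deploy.sql',        -- deployment script written here
               TRUE,                               -- analyze statistics first
               TRUE);                              -- deploy the design
    SELECT DESIGNER_DROP_DESIGN('my_design');      -- clean up the workspace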
So: a merge consists of either an insert or an update, and an update in turn is made of a delete and an insert. So basically, if we tune how deletes work, we will also have tuned updates and merges. What should we do in order to optimize deletes? Well, remember what we said: every time we delete, we actually create a new object, a delete vector. So avoiding committing deletes and updates too often will reduce the work for the Tuple Mover mergeout activities that run afterwards. And be sure that all the interested projections contain the columns used in the delete predicate; this lets Vertica access the projection directly, without having to go through the super projection, in order to create the delete vector, and the delete will be much, much faster. And finally, another very interesting optimization technique is to segregate the update and delete operations from the query and insert workload in order to reduce lock contention. This is something we obtain using partition operations, and it is exactly what I want to talk about now.

Here you have a typical data warehouse architecture: we have data arriving in a landing zone, where the data is loaded as is from the data sources; then we have a transformation layer writing into a staging area, which in turn feeds partitioned blocks of data into the green data structures we have at the end. Those green data structures at the end are the ones used by the data access tools when they run their queries. Sometimes we might need to change old data, for example because we have late-arriving records, or maybe because we want to fix some errors that originated in the data sources. So what we do in this case is copy the partition we want to change or adjust from the green area at the end back to the staging area; we have a very fast operation for this, copy partition. Then we run our updates, our adjustment procedures, or whatever we need in order to fix the errors in the data, in the staging area. At the very same time, people continue to use the green data structures at the end, so we will never have contention between the two operations. When the update in the staging area is completed, all we have to do is run a swap partition between tables in order to move the data we just finished adjusting from the staging zone to the query area, the green one at the end. This swap partition is very fast, it is an atomic operation, and basically all that happens is that we exchange the pointers to the data. This is a very, very effective technique, and a lot of customers use it.

So why flattened tables and live aggregate projections? Well, basically we use flattened tables to minimize or avoid joins, that is what flattened tables are for, and live aggregate projections to minimize or avoid GROUP BYs. Now, compared to traditional data warehouses, Vertica can store, process, aggregate, and join orders of magnitude more data. Vertica is a true columnar database; joins and GROUP BYs are normally not a problem at all, and they run faster than in any traditional data warehouse. That said, there are still scenarios where data sets are so big, and we are talking about petabytes of data, and growing so quickly, that we need something in order to boost GROUP BY and join performance.
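Before moving on, here is the partition maintenance flow just described, as a minimal sketch: the table names, the partition-key range, and the correction itself are hypothetical, while COPY_PARTITIONS_TO_TABLE and SWAP_PARTITIONS_BETWEEN_TABLES are Vertica built-ins.

    -- 1. Copy the partition to adjust from the query table into staging:
    SELECT COPY_PARTITIONS_TO_TABLE('dw.sales_fact', '2020-03-01', '2020-03-31',
                                    'dw.sales_fact_staging');
    -- 2. Fix the data in staging while queries keep running on dw.sales_fact:
    UPDATE dw.sales_fact_staging SET amount = 0 WHERE amount < 0;
    COMMIT;
    -- 3. Atomically swap the corrected partition back into the query table:
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES('dw.sales_fact_staging',
                                          '2020-03-01', '2020-03-31',
                                          'dw.sales_fact');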
And this is why Vertica introduced live aggregate projections, to perform aggregations at load time and limit the need for GROUP BYs at query time, and flattened tables, to combine information from different tables at load time and, again, avoid running joins at query time.

So, live aggregate projections. At this point in time, we can build live aggregate projections using four built-in aggregate functions: SUM, MIN, MAX, and COUNT. Let's see how this works. Suppose that you have a normal table. In this case, we have a table unit_sold with three columns, pid, date_time, and quantity, which has been segmented in a given way. On top of this base table, which we call the anchor table, we create a projection. You see that we create the projection using the SELECT that will aggregate the data: we take pid, the date portion of date_time, and the sum of quantity from the base table, grouping on the first two columns, so pid and the date portion of date_time.

What happens in this case? All we have to do is load data into the base table. When we load data into the base table, we will fill, of course, the regular projections; assuming we are running with K-safety 1, we will have two of them, and we will load into those two projections all the detailed data we load into the table, so pid, date_time, and quantity. But at the very same time, without having to do anything, without any particular operation and without having to run any ETL procedure, we will also get, automatically, in the live aggregate projection, the data pre-aggregated by pid and the date portion of date_time, with the sum of quantity in the column named total_quantity. It is something that we get for free, without having to run any specific procedure, and this is very, very efficient. So the key concept is that during the load operation, which from the ETL point of view is executed against the base table, we do not explicitly aggregate data and we don't have any ETL procedure. The aggregation is automatic and will bring the data to the live aggregate projection every time we load into the base table.

You see the two SELECTs we have on the left side of this slide, and those two SELECTs produce exactly the same result: running SELECT pid, the date portion of date_time, and SUM(quantity) against the base table, or running a SELECT * from the live aggregate projection, results in exactly the same data. This is, of course, very useful, but what is much more useful, and we can observe this if we run an EXPLAIN, is that if we run the SELECT against the base table asking for the grouped data, behind the scenes Vertica itself recognizes that there is a live aggregate projection with the data already aggregated during the load phase, and rewrites your query to use the live aggregate projection. This happens automatically. You see, this is a query that runs a GROUP BY against unit_sold, and Vertica decides to rewrite it as something that is executed against the live aggregate projection, because it costs less. This will save a huge amount of time and effort during the ETL cycle. Okay? And it's not just limited to the exact aggregation you defined. For example, another query like SELECT COUNT(DISTINCT ...), and you may know that COUNT DISTINCT is basically a GROUP BY, will also take advantage of the live aggregate projection, and again, this is something that happens automatically.
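As a recap, here is a minimal sketch of the anchor table and the live aggregate projection from this example; the column types and the KSAFE setting are assumptions:

    CREATE TABLE public.unit_sold (
        pid       INT,
        date_time TIMESTAMP,
        quantity  INT
    );

    -- Aggregated at load time; the optimizer can rewrite queries to use it.
    CREATE PROJECTION public.unit_sold_lap AS
    SELECT pid,
           date_time::DATE AS sale_date,
           SUM(quantity)   AS total_quantity
    FROM public.unit_sold
    GROUP BY pid, date_time::DATE
    KSAFE 1;

With this in place, SELECT pid, date_time::DATE, SUM(quantity) FROM public.unit_sold GROUP BY 1, 2 and SELECT * FROM public.unit_sold_lap return the same rows, and the optimizer rewrites the first query to read the projection.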
You don't have to do anything to get this, okay? One thing that we have to keep very, very clear in mind: what we store in the live aggregate projection is partially aggregated data. So, in this example, we have two inserts, okay? You see that we have the first insert, which inserts four rows, and the second insert, which inserts five rows. Well, for each of these inserts, we will have a partial aggregation. Vertica will never know, after the first insert, that there will be a second one, so it will calculate the aggregation of the data every time you run an insert. This is a key concept. And this also means that you can maximize the effectiveness of this technique by inserting large chunks of data, okay? If you insert data row by row, this technique, the live aggregate projection, is not very useful, because for every row that you insert, you will have an aggregation; so, basically, the live aggregate projection will end up containing the same number of rows that you have in the base table. But if you insert large chunks of data every time, the number of aggregate rows that you will have in the live aggregate projection is much smaller than in the base data. So, this is a key concept. You can see how this works by counting the number of rows that you have in the live aggregate projection. You see that if you run the SELECT COUNT(*) from the live aggregate projection, the query on the left side, you will get four rows. But actually, if you EXPLAIN this query, you will see that it was reading six rows. This is because each of those two inserts that we ran previously put three rows into the live aggregate projection. So, this is the key concept: a live aggregate projection keeps partially aggregated data, and the final aggregation will always happen at query time.

Another structure, which is very similar to the live aggregate projection, is what we call a Top-K projection. We do not actually aggregate anything in a Top-K projection; we just keep the first or the last rows, limiting the amount of rows that we keep using the LIMIT ... OVER (PARTITION BY ... ORDER BY ...) clause. In this case, we create on top of the base table two Top-K projections, one to keep the last quantity that has been sold and the other one to keep the max quantity. In both cases, it's just a matter of ordering the data, in the first case using the date_time column, in the second case using quantity; in both cases, we fill the projection with just the top rows. And again, this is something that we do when we insert data into the base table, and it happens automatically. If we now run our SELECT against either the max-quantity or the last-quantity Top-K projection, we will get exactly the rows we kept, and you see that we have far fewer rows in the Top-K projections.

Okay, we said at the beginning that, basically, we can use four built-in functions; you might remember, MIN, MAX, SUM, and COUNT. What if I want to create my own specific aggregation on top of the data? Some of our customers have very specific needs in terms of live aggregate projections. Well, in this case, you can code your own live aggregate projection user-defined functions. You can create a user-defined transform function (UDTF) to implement any sort of complex aggregation while loading data. After you have implemented this UDTF, you can deploy it using either the pre-pass approach, which basically means that the data is aggregated at load time, during the data ingestion, or the batch approach.
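Going back for a moment to Top-K projections, here is a minimal sketch of the two projections just described, reusing the unit_sold example; the projection names are assumptions:

    -- Keep, per product, the most recent sale:
    CREATE PROJECTION public.last_quantity_topk AS
    SELECT pid, date_time, quantity
    FROM public.unit_sold
    LIMIT 1 OVER (PARTITION BY pid ORDER BY date_time DESC);

    -- Keep, per product, the largest quantity ever sold:
    CREATE PROJECTION public.max_quantity_topk AS
    SELECT pid, date_time, quantity
    FROM public.unit_sold
    LIMIT 1 OVER (PARTITION BY pid ORDER BY quantity DESC);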
With the batch approach, the data is aggregated when the Tuple Mover runs on top of it. Things to remember about live aggregate projections: they are limited to the built-in functions SUM, MAX, MIN, and COUNT, but you can code your own UDTFs, so you can do whatever you want. They can reference only one table. And in versions before 9.3, it was impossible to update or delete on the anchor table; this limit has been removed in 9.3, so you can now update and delete data in the anchor table. A live aggregate projection will follow the segmentation of the GROUP BY expression. And in some cases, the optimizer can decide whether or not to use the live aggregate projection, depending on whether the aggregation is convenient or not. Remember that if we insert and commit every single row into the anchor table, then we will end up with a live aggregate projection that contains exactly the same number of rows; in this case, reading the live aggregate projection or the base table would be the same.

So this is one of the two fantastic techniques that we can implement in Vertica: live aggregate projections, basically to avoid or limit GROUP BYs. The other, which we are going to talk about now, is flattened tables, and this is used in order to avoid the need for joins. Remember, Vertica is very fast at running joins, but when we scale up to petabytes of data, we need a boost, and this is what we have in order to fix these problems regardless of the amount of data we are dealing with.

So what about flattened tables? Let me start with normalized schemas. Everybody knows what a normalized schema is; there is no Vertica-related stuff in this slide. The main scope of a normalized schema is to reduce data redundancy. The fact that we reduce data redundancy is a good thing, because we will obtain faster writes: we will have to write smaller chunks of data into the right tables. The problem with these normalized schemas is that when you run your queries, you have to put together the information that arrives from different tables, and this requires running joins. Again, Vertica normally is very good at running joins, but sometimes the amount of data makes it not easy to deal with joins, and joins sometimes are not easy to tune.

What happens in normal, let's say traditional, data warehouses is that we denormalize the schemas, either manually or using an ETL tool. Basically, we have on one side, on the left side of this slide, the normalized schemas, where we can get very fast writes; on the other side, we have the wide table, where we run all the pre-joins and pre-aggregations in order to prepare the data for the queries. So we will have fast writes on the left, fast reads on the right, and the problem is in the middle, because we push all the complexity into the middle, into the ETL that has to transform the normalized schema into the wide table. And the way we normally implement this, either manually, using procedures that we code, or using an ETL tool, this is what happens in traditional data warehouses, is that we have to code an ETL layer in order to run the INSERT...SELECT that will read from the normalized schema and write into the wide table at the end, the one that is used by the data access tools we are going to use to run our queries.
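As a sketch, with hypothetical table and column names, this is the kind of hand-coded batch we are talking about, the one a flattened table makes unnecessary:

    -- Traditional approach: a scheduled job denormalizes into the wide table.
    INSERT INTO dw.orders_wide (oid, cid, o_name, o_city, total)
    SELECT f.oid, f.cid, c.name, c.city, f.total
    FROM   dw.orders_fact  f
    JOIN   dw.customer_dim c ON c.cid = f.cid;
    COMMIT;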
So this approach is costly, because of course someone will have to code this ETL. It is slow, because someone will have to execute those batches, normally overnight after loading the data, and maybe someone will have to check the following morning that everything was okay with the batch. It is resource-intensive, of course, and it is also human-intensive, because of the people who will have to code it and check the results. It is error-prone, because it can fail. And it introduces latency, because there is a gap on the time axis between the time t0, when you load the data into the normalized schema, and the time t1, when you finally get the data ready to be queried.

So what we did in Vertica to facilitate this process is to create flattened tables. With a flattened table, first, you avoid data redundancy, because you don't need both the wide table and the normalized schema on the left side. Second, it is fully automatic: you don't have to do anything, you just insert the data, and the INSERT...SELECT that you would have coded in the ETL is done by Vertica automatically. It is robust, and the latency is zero: as soon as you load the data, you will get all the joins executed for you, extremely fast.

So let's have a look at how it works. In this case, we have the table we are going to flatten, and basically we have to focus on two different clauses. You see that there is one column here, dimension value, which can be defined either with DEFAULT and then a SELECT, or with SET USING. The difference between DEFAULT and SET USING is when the data is populated: if we use DEFAULT, the data is populated as soon as we load the data into the base table; if we use SET USING, we will have to refresh. But everything is there: you don't need any ETL, you don't need to code any transformation, because everything is in the table definition itself, and it comes for free and, with DEFAULT, at zero latency. As soon as you load the other columns, you will have the dimension value as well.

Let's see an example here. Here we have a dimension table, a customer dimension, on the left side, and we have a fact table on the right. You see that the fact table uses columns like o_name and o_city, which are basically the result of a SELECT on top of the customer dimension; so this is where the join is executed. As soon as I load data into the fact table, directly into the fact table, without of course loading the data that arrives from the dimension, all the data from the dimension will be populated automatically. So suppose that we are running this insert: as you can see, we are running the insert directly into the fact table, and we are loading oid, customer id, and total. We are loading neither name nor city. Those, name and city, will be automatically populated by Vertica for you, because of the definition of the flattened table. You see, this is all you need in order to have your wide table, your flattened table, built for you, and this means that at runtime you won't need any join between this fact table and the customer dimension that we used in order to obtain name and city, because the data is already there.

This was using DEFAULT; the other option is using SET USING. The concept is absolutely the same. You see that in this case, on the right side, we have basically replaced o_name DEFAULT with o_name SET USING, and the same is true for city.
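Here is a minimal sketch putting the two variants side by side, mirroring the slide's example; the column types are assumptions. o_name uses DEFAULT, so it is populated at load time, while o_city uses SET USING, so it is populated by the refresh we will discuss next.

    CREATE TABLE public.customer_dim (
        cid  INT PRIMARY KEY,
        name VARCHAR(50),
        city VARCHAR(50)
    );

    CREATE TABLE public.orders_fact (
        oid    INT,
        cid    INT,
        o_name VARCHAR(50) DEFAULT   (SELECT name FROM public.customer_dim
                                      WHERE customer_dim.cid = orders_fact.cid),
        o_city VARCHAR(50) SET USING (SELECT city FROM public.customer_dim
                                      WHERE customer_dim.cid = orders_fact.cid),
        total  NUMERIC(12,2)
    );

    INSERT INTO public.orders_fact (oid, cid, total) VALUES (1, 100, 99.95);
    -- o_name was filled at load time; o_city is filled on refresh:
    SELECT REFRESH_COLUMNS('public.orders_fact', 'o_city');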
The concept, as I said, is the same, but in the case where we use SET USING, we will have to refresh. You see that we have to run SELECT REFRESH_COLUMNS and then the name of the table; in this case, all the columns will be refreshed, or you can specify only certain columns, and this will bring in the values for name and city, reading from the customer dimension. So this technique is extremely useful. The difference between DEFAULT and SET USING, just to summarize the most important point: DEFAULT will populate your target column when you load, SET USING when you refresh. And in some cases you might need to use them both. In this example here, you see that we define o_name using both DEFAULT and SET USING, and this means that we will have the data populated either when we load the data into the base table or when we run the refresh.

This is the summary of the techniques that we can implement in Vertica in order to make our data warehouses even more efficient. And well, basically, this is the end of our presentation. Thank you for listening, and now we are ready for the Q&A session.