Well, good morning. My name is Corey Huinker. I'm a programmer at Moat at the moment, and today I'm going to be talking to you about ETL: some hacks and benchmarks I've done over the years, and recurring problems I've seen in my work.

A bit about me. I've been a database programmer for 20 years. The very first database benchmark I did was TPC-C on an AS/400; I'm not sure they even make those anymore. My work has included Postgres, Oracle, DB2, SQL Server, Sybase, Informix, SQLite, DB2/400, and, regrettably, MySQL. When not doing this stuff I have dabbled in improv theater and being a roller derby referee. I somehow survived.

A bit about Moat to start off with. They're an advertising analytics company. They measure viewability, reach, and fraud detection in ads, and that takes a lot of data crunching. They collect tens of billions of ad events per day, which they distill down into a mere 500 million rows of per-ad data per day, which roughly translates to 50 million ads tracked. We want to see how well those ads were viewed and what percentage of the views were done by actual humans. We are hiring. Our tech office is in New York, but we have offices in London, San Francisco, Los Angeles, Singapore, Sydney, Cincinnati, and Miami; check with me later in the day, we might have opened another one. One good thing about working at Moat is that we are usually not at work this early.

Most of what I'm talking about with ETL today has to do with data warehouses. Show of hands: how many people here are familiar with the difference between a data warehouse situation and an OLTP database? Okay, pretty good, I can shave a little time off this talk.

OLTP is online transaction processing, also known as short-request processing. Its characteristics: very high concurrency, with many connections at one time. Most queries are very small, fetching only a row or two, and perhaps updating a row or two, your standard CRUD operations. The new data coming in is, for all intents and purposes, random; there's no real way to predict when it's going to come in or what it will contain. You're at the mercy of your users and your data events. Your primary issues are locking: keeping tables available for others to use, and watching for deadlocks. The key challenge with an OLTP database is keeping your latency low; you want quick response times. This talk is not about that.

Data warehouse workloads: you're usually working with derivative data, often pulled from that very same OLTP system, and you're often bringing in data from other sources. They can be publicly available; they can be flat files from somewhere else, somebody else's extract. Ultimately, someone else had a database, and what you have is the version of it that they're willing to share with you. Your latency tolerances are much higher. Your data isn't really up to the minute; it may not even be up to today. You're often using yesterday's, last month's, last year's data. The data is imported in batches, and those batches are usually subdivided by something in common: they came from the same timeframe, the same day; they came from different systems within your organization; or they were externally sourced, in which case you're just dealing with somebody else's partitioning of the data. Individual query performance comes second to overall performance of the ETL.
If one job ends up running very, very slowly to make way for a lot of other jobs, that's fine; you don't need to be fair to your batches. The data is often stored in ways that are highly optimized for reading, not writing. That means you can violate the common normal forms of design; you can flat-out duplicate data. You're often storing the data multiple times, in ways that make the access paths sensible for different types of questions. As a result, this data is not well suited to being updated in a timely fashion, if at all.

ETL: extract, transform, and load. The extract phase: get the data out of the system where it originally resided. This could be a query spooled to a CSV. This could be a curl to a website, or some other publicly available data. This could be a series of FTP files in a landing zone, plus the semaphore files that tell you each FTP file has finished loading. Maybe you control the format of these things; maybe you don't. T, transform: you scan the data for correct formatting, valid values, things like that, and decide whether to discard bad data or fix it on the fly. You then reformat the data to fit the shape of the tables you wish to load. And finally, load: get the data into the warehouse as fast as possible.

The first tool set everybody starts off with is writing their own: homegrown data file reading, one record at a time. Open file, read file, parse values, organize that into rows. Hmm, this seems like something I've done before. You get bored of this and start to wonder, isn't there a faster way to load these things? Your DBIs, your Perl DBDs and the like, will offer you convenience routines like executemany. In the case of Postgres, these are all lies: they are just doing per-row inserts under the covers. The only driver I know of that has a legitimate insert-many is the one for Oracle.

Then there are tool sets. You have several command-line external tools: pgloader, pg_bulkload. These are analogous to SQL*Loader on Oracle and SQL Server's BCP: simple command-line tools. Name the table you want to load into, name the database that table is in, here's the file and some indications about its format, and away it goes. Those are very limited. They only load one particular table at a time, and they have no workflow to them.

Then come a bunch of GUI tools, which attempt to be all things to all data load types. The one most commonly used with Postgres is Kettle; Microsoft has SSIS, IBM has DataStage, and by far the biggest one is Informatica. All of these tools are graphical. All of them simulate a flow chart, and it's been my experience that, given enough time trying to make the flow chart look attractive, everything becomes either a seashell or a spaceship.

Nice things about those tools: they're custom-designed with many common data problems already solved. I think the ultimate goal of these tools is to let your most junior programmer get several stages of ETL done. You will never need to parse CSV in code again. They're usually aware of the bulk-loading facilities available on a variety of databases, though the fewer databases an ETL tool covers, the more likely it is to know those optimizations. Be aware of vendor bias here.
For instance, SSIS has bulk-loading tools for Oracle, but they're really slow, and there's absolutely no incentive to make them faster, because if you come to Microsoft and complain, they'll try to sell you a SQL Server database. But simple things are made simple.

The cons: you can store a lot of application logic in these flow-chart structures, but you can't get your whole application's logic into them, so you're left with part of your load logic in one place and part of it in the same code you had before. Simple things are simple, but non-simple things can be nearly impossible, and often the only way to make them possible inside those structures is to write your own plugin for SSIS or DataStage or whatever, at which point you're writing code outside your core competency, in C++ or something like that, and spending a lot of time building into a framework that ultimately has nothing to do with your business. These graphical interfaces are extremely good at concealing application logic. If there is a bug, you have a very hard time telling what filters are on a particular load step without clicking open that step and searching through a variety of menus, options, and submenus. The source code generated is ungreppable: when you do get a dump of it, it's usually XML or binary, which means it resists diffing, and you couldn't meaningfully patch it in a source control system if you wanted to. There is no way to hand-edit this stuff; you're left with an XML file that is essentially read-only.

And business logic that has to interact with the state of the database must pull that data down from the database. For instance, suppose you have a data file of addresses, including a country column, and you're not sure whether your database has ever seen a particular country before, and you wish to assign a synthetic key to it. If that data were already in your database, you would attempt to insert it, and on failure ignore the error, because you've already seen that country: a standard insert-on-conflict situation. Here you don't have that, and the solution third-party tools tend to use is to pull the entire country table down, make an in-memory hash of it, and then test every row in your data file against that hash. There are several problems with this. First, that table could be really big, and you could exhaust memory; in 32-bit SSIS, you will. Second, you've now got a copy of this data: if another process is running doing the same thing, it could also have seen Zimbabwe for the first time, and you're going to assign different synthetic keys to it. Now you've got a race condition. This is all really sad, because what you have here is a left outer join, and it's a totally solved problem if you're already in the database.
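To make that concrete, here's a minimal sketch of the insert-on-conflict approach, with hypothetical table and column names:

    -- Dimension table of countries with synthetic keys (hypothetical).
    CREATE TABLE IF NOT EXISTS country (
        country_id serial PRIMARY KEY,
        name       text NOT NULL UNIQUE
    );
    -- Insert any never-before-seen countries; ones we already know are
    -- silently skipped rather than raising an error. The ORDER BY keeps
    -- concurrent loaders inserting in the same order, so they block on
    -- each other instead of deadlocking.
    INSERT INTO country (name)
    SELECT DISTINCT w.country
    FROM work_table w
    ORDER BY 1
    ON CONFLICT (name) DO NOTHING;
    -- And the lookup the GUI tools rebuild as an in-memory hash is just
    -- a left outer join:
    SELECT w.*, c.country_id
    FROM work_table w
    LEFT JOIN country c ON c.name = w.country;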
So what people tend to do more often is switch the letters around: extract, load, and transform. Extract, same as before. Load: in this case, you're picking table shapes that exactly match your input data; you're bending to the data, not bending the data to you. Then, once you have it loaded, you use regular SQL tools to transform it into what you wanted. Filtering: removing rows you know you don't want. Validating: testing for correctness. Is this a birth date of a person who could theoretically be alive? If not, why are they in our healthcare system? Classifying: the data might be destined for multiple tables with key linkages; you may have a denormalized row that you need to renormalize in your system. Encoding: date strings become dates, things like that. De-referencing: like I said before with countries, maybe you have a star schema, a Kimball-style model with facts and dimensions, and you need to take those dimensions, which are essentially overblown enumeration tables, and fill in the enumeration values. Once you've encoded those values, you can finally dereference them in the primary fact table, which holds only the foreign key values; that is ultimately the table that pulls the most weight in your database. Then insert that data into the tables that are actually shaped for appropriate customer queries.

So can you just think of this as ETL where the T also happens in the database? Yes.

Pros of this: your transformation logic is usually written in SQL statements, and you already know SQL. It is easily worked into your existing source control. Referencing your existing data is trivial and transactional; it's right there. If you have multiple processes trying to load new enumeration values into a table, the trick is to just sort them first. That way, if two different processes encounter a new country at the same time, they will have gone through those countries in order, one will block on the other, and you will avoid deadlocks. Whereas if they were inserting in random order, it would be very easy for one process to find Zimbabwe and then Albania, and the other to find Albania and then Zimbabwe. You can reuse a lot of your existing data logic. You can do things like applying the same check constraint that you ultimately want on your real data table to your work table; there's no surer way to know that the data will conform than to use the same check constraint. If you get tricky, you can write programs that extract the check constraints off one table and apply them to another, and only in the cases where one of those check constraints fails do you go into more rigorous per-row analysis of why the data is not conformant.
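Here's a rough sketch of that constraint-copying trick, using the real pg_constraint catalog but hypothetical table names:

    -- Generate DDL that applies every CHECK constraint on the target
    -- table to the work table; run the generated statements, and only
    -- fall back to per-row analysis when one of them fails on load.
    SELECT format('ALTER TABLE work_table ADD %s',
                  pg_get_constraintdef(oid)) AS ddl
    FROM pg_constraint
    WHERE conrelid = 'ad_facts'::regclass
      AND contype = 'c';   -- 'c' means CHECK constraints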
The cons of doing it this way: well, you're writing to the database twice, so if you have an I/O-loaded data warehouse, you're increasing the workload there. You're loading some data you don't want: some of these rows are going to get filtered out, so you will want to find ways not to pay a big I/O price for something you're ultimately going to discard. You will need additional disk space for this transformation work, at least for a short period of time. There's additional CPU burden during the ETL; that's pretty straightforward, and it's not as big a deal now that people are increasingly having one master database with all clients reading from read replicas, which is much more feasible with data warehouses, because by definition the customer isn't editing any of this data. Some of your data validation rules may still rely on external factors: does this DNS host name resolve right now? That might be something you want to filter on, or looking things up in external systems, your Redis or something like that. If your data files are very, very large, you may still have to split them up. And if your data is in a large number of files, you may have to use multiple workers, and the workload management of having tens, hundreds, thousands of workers loading into an ETL system is itself an OLTP-ish task, where each worker needs to block off certain regions of the database it's loading into.

The first case we're going to look at is copying a big file. In all the cases I'm showing today, the file we're using was a .csv.gz: 40 megabytes compressed, uncompressing to 1.1 gigabytes, so really good compression. It had 5.2 million rows, including a header row. It represented the data for one particular customer's ad loads for one particular day. The customer name and the date of the data were encoded in the file name; they are not present in the file. What we instead have are eight taxonomy columns: the customer's classification of what they thought of this ad, whose department it was for, what ad campaign it was in. We don't control these fields, and as a result they can be quite large; more often than not I see our users putting in the URL of the page they intended the ad to load on. The number of metrics in the table? I don't know; it's going to be over 100, might be 120-something, might be 150-something. The format has changed over time: as we have discovered new things we want to collect about ads, we have added more columns, and the process for rendering these CSVs often gets reapplied to data that, say, got resampled or fixed at a later date, using the modern format rather than the format that was modern on the day the original data was created. So we cannot always rely on the same number of columns being there for any one given day. We expect an overflow of columns, columns that are expected in the end-result data but are not present in the CSV; generally speaking we don't get an underflow, where there are columns in the CSV that are not present in the end database.

So what are some tricks for actually discovering the format? We generally know that any dates and taxonomy names are going to be towards the front, but we can't rely on it; as a matter of fact, we used to have dates encoded in there and presently don't, so that's a difference we have to work with. Any column with an unfamiliar name can safely be assumed to be a new measurement. You can do a trick with COPY: use COPY FROM PROGRAM to pipe just the header line into a single-text-field table, then use a regex to split that out and get your column names. I didn't think that was very clean, so I hacked it a little and made a Python program to do essentially the same thing. One other issue with COPY FROM PROGRAM is that the pipe to head kills the zcat, which generates a false-positive non-zero return code; wanting to avoid that, I instead wrote a Python wrapper to give me back the column names of an arbitrary text file as an array. Then, with those columns at hand, you can generate a column list based on our simple rules. If the column's name was load_date, which we don't have anymore, create a load date column. If the column name was one of the taxonomy columns (those aren't their real names; I had to scrub some of our column names for this presentation), call those text. And anything else, keep the name it's got and call it a bigint. string_agg that together, and lo and behold, you've just created a temporary table that is exactly shaped like this CSV.
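Here's a minimal sketch of the whole trick in plain SQL, with a hypothetical file path and simplified naming rules standing in for our scrubbed ones (and note the caveat above: head closing the pipe can make zcat exit non-zero):

    -- Grab just the header line into a one-column table.
    CREATE TEMP TABLE hdr (line text);
    COPY hdr FROM PROGRAM 'zcat /data/cust_20170101.csv.gz | head -n 1';
    -- Split the header on commas and build the matching DDL string.
    SELECT format('CREATE TEMP TABLE work_table (%s)',
               string_agg(
                   CASE WHEN col = 'load_date'    THEN 'load_date date'
                        WHEN col LIKE 'taxonomy%' THEN quote_ident(col) || ' text'
                        ELSE quote_ident(col) || ' bigint'
                   END, ', ')) AS ddl
    FROM (SELECT regexp_split_to_table(line, ',') AS col FROM hdr) cols;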
The test machine we used for this, and I probably should have mentioned it earlier: these tests were all run on an AWS EC2 4-core box, running a Postgres 10 pre-alpha build, because I was doing a lot of branch work and didn't want to risk my development accidentally linking against the 9.6 production libraries; the easiest way to make sure of that was to not have Postgres installed at all. The machine these loads actually run on, when we're doing ETL in anger, is a 32-core, 250 GB RAM EC2 i3.8xlarge. We run about 5,000 of these data files per day. Most of them are a lot smaller than this one, but some are as much as two orders of magnitude larger, so the load processes we use have to be able to scale big and play small ball. AWS machines have a reputation for being I/O-starved. Some of the work I did helping with Peter Geoghegan's index optimizations involved timing index creations, and his first reaction was how quickly things got I/O-bound on AWS machines. That has definitely informed our ETL processes at Moat: we try to skew things as heavily away from I/O as possible, and that will become relevant later.

For the second case, we're going to look at load times by various methods. To simplify things, we're working with one data file; I now know it has 123 bigints in it. And if you wonder why we're not using a column store, it's because we usually query most of those bigints most of the time, which is absolute anathema to a column store database.

The first method we tried was pgloader. It's an external program; you write a little config file for it. You can do it with command-line switches, but I do not recommend it, because you're going to try to say "fields optionally enclosed by" single quote, double quote, single quote, and there's no clean way to do that in bash that I could finish in time for this presentation. So I just went with the config file. Timing pgloader, that file loaded in 3 minutes 33 seconds. So I said, hey, AWS is I/O-starved, what happens if we load straight from the gzip file, using parallel gzip, pigz? Well, it got slower, a lot slower. It turns out that in this case pigz is not able to do much in parallel, so you're getting the CPU bound of one process feeding into the same COPY you had before. So while it's considerably slower, it's much lighter on I/O, so it plays nicer in the ETL.

Then we tried pg_bulkload. Now, pgloader and pg_bulkload both ultimately use Postgres COPY internally, and that shows much better here: pg_bulkload loaded the file in 25 seconds, and the pigz version loaded in 37 seconds. Okay, pretty good. So then I tried just doing it with a regular COPY. (COPY versus backslash-copy is not strictly relevant in this case; the test machine and the test database were the same.) Just doing it by hand loaded it in 24 seconds. The pigz version also loaded in 24 seconds, interestingly enough, and then I tried COPY FROM PROGRAM and got essentially the same time. So 24, 25, and 25 seconds.

One more thing you can do when you're doing ETL: you're loading into a work table, and that work table is ultimately data you don't care about. If this process died, or the database crashed and you started over, you'd just reload the table. So there's no point in making it a logged table. It's probably going to be a temporary table, but unlogged tables are the same in that they do not write to the WAL, so you can save a lot of I/O for data you're ultimately not going to care about.
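As a minimal sketch, again with hypothetical names:

    -- An unlogged table writes no WAL; if the server crashes, the table
    -- comes back truncated, which is fine for a disposable work table.
    CREATE UNLOGGED TABLE work_table (taxonomy1 text, metric1 bigint);
    -- COPY FROM PROGRAM runs on the server (and needs superuser rights),
    -- so we can decompress on the fly without an intermediate file.
    COPY work_table FROM PROGRAM 'zcat /data/cust_20170101.csv.gz'
        WITH (FORMAT csv, HEADER);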
Going to unlogged tables, the times dropped to 19, 20, and 20 seconds. And for the pretty graph here, just to sum that up: obviously pgloader was the hands-down loser. It also managed to occupy all four CPUs at nearly 100% the entire time, so the additional tests in this presentation won't be using pgloader. Everybody else came in about the same, because everybody else is essentially using COPY in the same way. Takeaways from this: COPY FROM PROGRAM performs the same as external pipes; if you look in the source code, that's what it's doing, launching another program and reading from a pipe. pigz, like I said, was only slightly better than single-process gzip; the overhead is only slight, and even though it performs a little slower, it plays better. pg_bulkload was ineffective in this situation, in that it was an additional step that gave us no performance gains. And pgloader just swamped the system.

Next case. We just got done copying into a work table, but we're not going to keep that work table, so all the writes we did to it are ultimately wasted, and we would like to avoid bouncing data through this table. The standard pattern is to create a temp table like we just did, copy into it, analyze it, and then do some transformation on it. In the case of Moat, that's very heavily geared towards aggregating the data, and aggregating it several times. So you're paying for reading the file, writing the temp table, reading it right back out, and discarding the temp table, burning an OID in the process. It would be nice to avoid some of those reads and writes.

One thing you can do is create a temporary foreign table. You do not want to create a permanent foreign table, because you'd have to construct unique names for all of these, and like I said, we've got 5,000 of these processes that need to run per day, with generally about 30 running at the same time. So creating temporary foreign tables made more sense. There is no CREATE TEMPORARY FOREIGN TABLE command; however, if you create a regular foreign table in the pg_temp schema, it is effectively temporary. Once you've created it, pointing at the file you wish to load, you can do an INSERT into the table you actually care about, selecting directly from that foreign table, and that essentially does all the steps you had before. It should eliminate the unnecessary writes. It should halve the number of disk reads. It allows for filtration in the WHERE clause and in the column list; we should get some performance boost from selecting only the columns we need, because the ones we haven't requested never get turned into datums and put in a tuple store for any length of time. We still need to look out for invalid data. We are still burning an OID for the table created in pg_temp. And we're doing something slightly ugly, in that we're creating the equivalent of an as-yet-nonexistent command, CREATE TEMPORARY FOREIGN TABLE.
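A minimal sketch of that, assuming file_fdw is installed, with hypothetical names:

    -- One-time setup:
    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER csv_server FOREIGN DATA WRAPPER file_fdw;
    -- A foreign table created in pg_temp vanishes with the session,
    -- which is as close to CREATE TEMPORARY FOREIGN TABLE as we can get.
    CREATE FOREIGN TABLE pg_temp.incoming (
        taxonomy1 text,
        taxonomy2 text,
        metric1   bigint,
        metric2   bigint
    ) SERVER csv_server
      OPTIONS (filename '/data/cust_20170101.csv', format 'csv', header 'true');
    -- One pass: read, filter, and aggregate straight into the real table.
    INSERT INTO ad_facts (taxonomy1, views)
    SELECT taxonomy1, sum(metric1)
    FROM pg_temp.incoming
    WHERE metric2 > 0
    GROUP BY taxonomy1;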
So, actually trying those times: selecting straight from the foreign table, summing five columns took just under 10 seconds; upping it to 20 columns, just under 12 seconds. Okay, but what's that measured against? Copying into an unlogged table took 17 seconds. That's quite a lot longer; wow, it should have been the same amount of data. Analyzing it took under half a second. And the big payoff is that once you actually have it in a table, it takes only 1.3 or 2 seconds to sum five and 20 columns. Going to the graph, you can see that there is a big cost to actually copying the data in, but once you have it in, you get a significant query bonus. Basically, if you're only going to read from this table once, a temporary foreign table actually makes sense; if you're going to read from it a second time, you're better off doing it the conventional way. The timings for file_fdw get a lot closer as the number of columns you're reading approaches the total number of columns in the table; COPY doesn't have that ability to filter, so it's paying the price for all of them.

file_fdw can use the program option starting in version 10; file_fdw is itself just a front end to the COPY command wrapped in a foreign data wrapper. The issue there is that the estimate of how many rows a given file_fdw table will produce is based on the file size, and it gets a pretty good row estimate from that, given the number of columns you've declared and the file size. If you're reading from a program, there is no file size to check. The row estimates it gives you will be artificially low, and that can lead to some really bad plans, especially if you do nothing but aggregates, which is exactly what Moat does. So we had some problems there and ultimately had to abandon that method of loading files, at least for the large ones.

I had a bit of a hackers mailing list misadventure trying to get a set-returning function called copy_srf in. The goal was, instead of doing the work of creating an on-the-fly file_fdw table and reading from it, to just have a set-returning function. This would save us burning the OID, and with the number of these things we run concurrently, we did notice a fair amount of pg_class and pg_attribute bloat. There are some problems with a set-returning function like that. It consumes all the columns whether you're going to use them or not, and puts them all in a tuple store sitting out in memory. It is essentially an optimization barrier: there is no way for the query to see into the function and say, you don't need these four columns, and you don't need these rows, and so on. It is very, very analogous to a CTE optimization barrier. You still have the bad row-estimate problem. And the thing that ultimately got me to abandon it is that I wanted to be able to do a copy_srf from standard input, so you could do copies much like you do with backslash-copy in psql. The problem is that the wire protocol is such that the query says, here's the start of the result set, and then you start the copy, which the protocol then takes as the end of that result set. It finishes the copy, is all set to put the rows somewhere, but the result set it was going to put them in has already been closed. So, short of a change to the wire protocol, we are not going to be able to have a copy_srf-like function.

Last case: roll-up tables. Oracle has a really neat feature that I would like to bring to Postgres someday, where if you have a very large table, you can create materialized views on it.
All of your queries can be written against the very large table, and the query planner can notice cases where you are doing an aggregation or a filter of that table that happens to already have been done inside a materialized view, and it will simply rewrite the query to use that materialized view, either because it's pre-aggregated, and much, much smaller, or pre-filtered. That doesn't exist yet, so right now we have to do it in our own application logic.

We have a variety of use cases, and after sanitizing the names for your consumption, those use cases are below. There's keeping all eight taxonomy columns; there's discarding the fourth; then the third and fourth; then the second, third, and fourth; then keeping just the fifth through eighth. On the other side, there's discarding the last four, and then slowly rolling your way up to a final empty roll-up, which by definition has exactly one row per customer per day; that particular table happens to be the very first one a customer lands on for the day, so it's nice that every day is exactly one row for them.

One way to go about this is dumb inserts: you have a just-copied work table, you insert into your first destination table with the full aggregate, then insert into your next destination table with a slightly more rolled-up aggregate, and so on through all the tables, each time hitting the same just-copied work table. That's a lot of I/O you don't need. You can get cute and do chained inserts: insert into the most granular table first, then produce the next table by re-querying the table you just wrote into. Now, I've oversimplified there, in that the table in this case, D9, probably already had data in it, so whatever filter you use when querying D9 has to reflect the dates and other restrictions on this load, so that you don't accidentally aggregate in data that wasn't part of it. You can avoid that with a CTE: do an insert into the first destination table, with a RETURNING clause handing back the rows you generated, then feed those returned rows into the next, slightly less granular destination table, and so on. Now, we have to reference D9 twice, because our aggregations aren't a perfect hierarchy; they're split into two branches.

Or you can do GROUP BY GROUPING SETS, in which case you just list all of your roll-ups there, and after you've created this temporary table, you straight-insert the rows out of the one temp table you've just generated. GROUP BY GROUPING SETS is really, really fast, and takes advantage of the fact that the data has already been sorted for one aggregation to reuse that sort in the next aggregation; so that was what we thought should be the most efficient of the bunch. You can take that a little further and do the inserts all from the same CTE; then, just so you have row counts to get out, you return null from each, and at the very bottom you select counts from all those temporary tables to verify that the number of rows inserted is what you expected.

And the last case I decided to test was a roll-up where you create one table and use version 10 native partitioning. You accept that your roll-up tables are going to have certain columns in them that aren't needed, because they aren't relevant to that grain; but now that your tables are all shaped the same, they can be partitions of one master table. You can do the same GROUP BY GROUPING SETS roll-up and then insert straight into it.
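A minimal sketch of the grouping-sets version, with hypothetical taxonomy columns t1 through t4 and a single metric; grouping() yields a bitmask identifying which roll-up each row belongs to:

    -- All the roll-ups in one pass; sorts get reused between grains.
    CREATE TEMP TABLE rollup_work AS
    SELECT t1, t2, t3, t4, sum(views) AS views,
           grouping(t1, t2, t3, t4) AS grain   -- 0, 1, 3, 7, or 15
    FROM work_table
    GROUP BY GROUPING SETS ((t1, t2, t3, t4), (t1, t2, t3), (t1, t2), (t1), ());
    -- Then route each grain to its destination table...
    INSERT INTO d9 SELECT t1, t2, t3, t4, views FROM rollup_work WHERE grain = 0;
    INSERT INTO d8 SELECT t1, t2, t3, views     FROM rollup_work WHERE grain = 1;
    -- ...or, with version 10 partitioning, make the destination tables
    -- partitions of one master keyed on grain, and let automatic tuple
    -- routing do the distribution in a single insert:
    INSERT INTO rollups
    SELECT t1, t2, t3, t4, sum(views), grouping(t1, t2, t3, t4)
    FROM work_table
    GROUP BY GROUPING SETS ((t1, t2, t3, t4), (t1, t2, t3), (t1, t2), (t1), ());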
Think for a moment: which one do you think is going to be fastest? We're a little short on time, so I won't keep you in suspense. The dumb insert was, of course, the slowest, and that's not surprising, because it is endlessly re-reading from a potentially very large table on disk, and going over rows it has already processed several times. The chained insert did fairly well, but I think that's a bit of a false economy, because there would have been additional filtration to avoid accidentally reading in rows from a previous ETL. Chained CTE insert and chained CTE roll-up both performed basically the same; the planner was able to see through those situations and make about the same use of them. The dumb roll-up on 9.6 ended up performing the best, and I was a little surprised; I really thought it was going to be the chained CTE. It seems that GROUP BY GROUPING SETS is clearly faster as a means of doing the aggregations, and then scanning over that table linearly, with no additional aggregation steps, is quite fast. And the ultimate winner was automatic tuple routing into a partitioned table, and I'm really excited about that, because it's something we're all going to be able to use in the upcoming release, version 10.

So, that's some of what I've learned from the trenches of ETL. Do you have any questions for me? Sir?

[Question about indexes on the destination tables.] Oh yes, I left that out just because, you know, we were short on time as it is. Yes, they are indexed to the nines, and I could do a whole separate presentation on methods of disabling those indexes, doing a big insert, and then re-enabling them, and techniques for remembering what those indexes are. I would like to add a feature to Postgres in the near future where you can just disable indexes, because the best place to remember what indexes you need on a table is the data dictionary. So that's something I'll be looking forward to doing, but there just wasn't time in this talk.
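The rough idea, using the real pg_indexes view with a hypothetical table and index:

    -- Remember the definitions of the indexes before dropping them...
    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE tablename = 'ad_facts';
    -- ...drop them, do the big load, then replay each saved indexdef:
    DROP INDEX ad_facts_t1_idx;
    -- (big INSERT ... SELECT happens here)
    CREATE INDEX ad_facts_t1_idx ON ad_facts (t1);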
Sir? [Question: can we create a temporary table to avoid writing to the WAL?] Yes: temporary and unlogged tables both do not write to WAL. [Follow-up: if we have replication, say with pglogical, does that table replicate?] It does not exist there. Well, if it's a temporary table, it doesn't exist in the other session, period; it's simply not there. And if it's merely an unlogged table, then it's just not replicated. [So you're loading from flat file to the temp table, and from the temp table to the target table?] Right, and the target tables must be logged; you're not going to get far if they aren't, and besides, you actually care about that data. Did that answer your question? [Since step two was missing from the WAL, how does the destination database know what to write?] The table does exist, so the data coming out of that unlogged table is present; it just becomes the result set that gets pushed in.

Any other questions? Go on, one. [Question about catalog bloat or wraparound from creating so many tables.] Yes, we were creating a lot of tables. I wouldn't say wraparound was our problem; our big problem was bloat in pg_class and pg_attribute. Finding that we had a one-terabyte pg_attribute table was a big problem, and then we had to take the database down and do a very, very quick VACUUM FULL on pg_attribute; it was quick, but it was nevertheless downtime, and something we wanted to avoid. So yes, anytime you're creating work tables, there is some necessary burn there. That was the reason copy_srf was initially so attractive to me. Is that it? Thank you very much.