Okay, we'll get started with the next talk, welcoming Adrian, who will be talking about an open-source data lake at scale in the cloud.

Hello, good morning. My name is Adrian Woodhead. I work for Expedia Group, as you can see, as a principal engineer on the big data platform, and I also work on all things open source. So yes, I'm going to be talking about building an open-source data lake at scale in the cloud. A lot of the components we use in our data lake are open source, and one of the things we feel very passionately about is that a lot of companies just take, take, take and build stuff on top of open source without giving much back. So we've tried very hard, wherever we found gaps in the open-source ecosystem, to build tools that fill those gaps and then open-source them. I'm going to be talking about a few of those today, in the hope that other people can use them and improve them, and to help keep the whole ecosystem sustainable.

The agenda for today: I'll give a little background about Expedia Group and how we're structured, because that has an impact on how we built our data lake. We'll talk about what we consider its foundation, which is how we store our data and the metadata. We'll cover some options for high availability, disaster recovery, redundancy, those kinds of things. We'll look at federating access to your data, which can be useful if you operate at a certain scale. And then we'll look at how we enable event-based data processing, along with a concrete use case we have for it.

Before everyone runs out of the room screaming in terror, I promise this isn't a marketing slide; it genuinely matters for how we've had to structure our data lake. Expedia Group consists of a number of different companies that we've either developed or acquired over the past 20 years or so. We have online travel agencies, flights, hotel bookings, vacation rentals, car rentals, all kinds of different businesses. They operate at different scales and have different requirements. They all generate a lot of data, and some have datasets they're only interested in for their own usage, but as a business we also want a holistic view across all of it. Another challenge is that, as we've acquired companies over the years, some are more integrated into our platform than others and some run their own technology platforms, so getting a single view of everything is actually quite complicated. To make it even more complicated, the scale at which we operate across all of this is really huge. We have literally billions of events coming in every single day from all of these different companies, via streams, batch processes, massive file dumps, you name it. And then we have thousands, possibly tens of thousands, of data processing jobs, ad hoc queries and reports running, joining all of this data together and producing even more datasets.

Our data platform today wasn't built from scratch in a vacuum over the last few years; in some cases we have 20 years of legacy to deal with. Initially our data warehouse started off as what you would now probably call a traditional data warehouse: a lot of data stored in relational databases, with most of the processing and querying done in SQL. Then, about ten years ago, things changed.
That was the rise of Hadoop and big data, so we built an on-premise Hadoop cluster, six or seven hundred nodes with a distributed file system, and we put Hive on top of that as a metadata platform. This was very useful because Hive provides metadata services, but, given our legacy with SQL, it also provides a SQL query engine on top of big data, so a lot of our data processing written in SQL could be moved over more or less as-is and run via Hive.

What we found is that running all of this on-premise in our own data centre was very hard. Meeting peak demand became a challenge: we had to send people in to rack and stack machines, which gets expensive, and the whole upgrade path for the software and the hardware was very painful and also expensive. Around that time the cloud vendors came along and started offering big data capabilities, so we began migrating our primary data lake into the cloud. In our case we use Amazon Web Services: S3, EMR and so on. That's our primary cloud provider, so that's the terminology I'll use in this talk, but most of the concepts apply just the same if you're using Google, Azure or something else.

At its most basic, what does a cloud data lake mean to us? We obviously have a lot of data, so we store it in a distributed file system; in Amazon's case that's S3. Wherever possible we use efficient compressed binary formats like Avro, Parquet, ORC and so on. Then we need to store metadata about the data, and that's where we use Hive's metastore service: the schemas live in there, so all the fields, the types and so on. What's also quite important is that we generally don't allow users to go directly to the file system to access the data. There are a number of reasons for that, which I'll touch on later, but a lot of it is down to the eventually consistent nature of cloud file systems. They're not really POSIX file systems, they're object stores, so often you can't tell when data is complete, and there are no atomic move operations and things like that. So we use the Hive metastore as the way to register when data is available. We direct all our users to the metastore; they find the dataset they're looking for, get the data locations, and then access the data on S3. The Hive metastore has a backing relational database, but that's mainly an implementation detail you don't need to worry about too much. The smiley face on the slide represents all our users, data processing jobs and so on. They don't always smile, but let's pretend they do.
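To make that lookup pattern a bit more concrete, here's a minimal sketch of how a consumer might read a registered dataset. It assumes PySpark with Hive support; the metastore URI, database and table names are purely illustrative, not our actual endpoints.

```python
from pyspark.sql import SparkSession

# Consumers don't list S3 paths themselves: they ask the Hive metastore where a
# dataset lives and let the engine read only the files that have been registered.
spark = (SparkSession.builder
         .appName("read-via-metastore")
         .config("hive.metastore.uris", "thrift://metastore.example.com:9083")  # illustrative URI
         .enableHiveSupport()
         .getOrCreate())

# Spark resolves the schema and the S3 locations from the metastore, so a query
# only ever sees data that has been registered, i.e. data that is complete.
bookings = spark.table("bookings.hotel_bookings")  # hypothetical database.table
bookings.filter("local_date = '2019-10-01'").groupBy("brand").count().show()
```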
Making this setup highly available is actually fairly straightforward. You put a load balancer in front of the Hive metastore service and set up some kind of auto-scaling strategy so the nodes scale out and in as demand changes. For the backing database you can use something like Amazon's RDS, which handles that for you. And the cloud providers' distributed file systems generally scale very well on their own, so you don't need to do much there yourself.

One reason we've gone with such a simple setup: we could have chosen some very fancy, specific data warehousing or data lake technology, but, coming back to that first slide with so many different users on so many different technology platforms, we've gone for a lowest-common-denominator approach. What we've found is that if you have the data in the file system and the metadata in Hive, most upstream data warehousing technologies, things like Spark, Flink, legacy technologies like MapReduce and Cascading, and all kinds of other tooling like Tableau, can interact with it, using JDBC and ODBC where needed. So we enable a really wide variety of use cases on top of all this.

Now we've made it highly available, which is great, but being highly available in one geographic region or data centre is still a bit risky. You want redundancy, and ideally you want to be able to run your entire setup in multiple regions. To do that, you can simply spin up the whole setup in another region and then decide whether to run it active-active or just keep a bare minimum in the other region and fail over to it if and when a disaster happens. The key thing from a data lake point of view is that if you start operating in another region, your data and your metadata have to be available there. One option would be to have all your data producers write to both regions, but then every producer carries the burden of synchronising, and what happens if the write to one region fails and the other succeeds? We decided not to put that burden on our data producers and instead made replication a core feature of the platform: the platform is responsible for replicating data into the other region.

The key thing is that this has to happen in a very coordinated fashion, because you want users querying data in different regions to have a consistent view of it that is always as correct as possible. What that really means is that when you're replicating data or metadata into another region, users in that region must not get partial reads of data whilst it's in transit. So what we generally do is only advertise data in the Hive metastore once it has been completely replicated over. That adds latency getting the data into the other region, but we'd rather have late, complete data than incomplete or incorrect data.

To do this we built a tool we call Circus Train. The name comes from the idea of taking the Hadoop elephant and all the other crazy circus animals in the Hadoop ecosystem, putting them on a train and moving them from one place to another. One of the reasons we built it is that there were no replication tools out there that would only advertise the availability of the data in the metastore after it had been copied.
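As a rough illustration of that ordering, copy everything first and only then register it in the destination metastore, here is a small Python sketch. This is not how Circus Train is implemented; the bucket names, partition layout and the use of boto3 plus Spark SQL are assumptions made for the example.

```python
import boto3
from pyspark.sql import SparkSession

# Spark session pointed at the *destination* region's Hive metastore (illustrative URI).
spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://metastore.us-east-1.example.com:9083")
         .enableHiveSupport()
         .getOrCreate())

s3 = boto3.client("s3")

def replicate_partition(src_bucket, dst_bucket, prefix, table, partition_spec):
    """Copy a partition's files across regions, then (and only then) advertise it."""
    # 1. Copy every object under the partition prefix. copy_object keeps the sketch
    #    simple; a real tool would use multipart copies, checksums and retries.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(Bucket=dst_bucket, Key=obj["Key"],
                           CopySource={"Bucket": src_bucket, "Key": obj["Key"]})

    # 2. Register the partition in the destination metastore only after the copy
    #    has fully succeeded, so readers in that region never see it half-copied.
    spark.sql(f"""
        ALTER TABLE {table}
        ADD IF NOT EXISTS PARTITION ({partition_spec})
        LOCATION 's3://{dst_bucket}/{prefix}'
    """)

replicate_partition(
    src_bucket="lake-us-west-2", dst_bucket="lake-us-east-1",
    prefix="bookings/local_date=2019-10-01/",
    table="bookings.hotel_bookings", partition_spec="local_date='2019-10-01'",
)
```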
With existing tools there was always the possibility of those partial reads, so avoiding them, copy first and advertise second, is one of the core features of Circus Train. It supports various distributed file systems: HDFS, S3, Google Cloud Storage and so on. By default it uses Hadoop's standard DistCp mechanism for copying massive datasets at scale, and we've also written optimised copiers for S3 that take advantage of specific characteristics of that store. It has a plug-in architecture, so you can write your own custom versions of various parts of it. And it has some quite advanced features that analyse the data on both sides and make sure it only replicates the bare minimum, which is quite important when you're potentially replicating terabytes of data every day; you want to keep that to a minimum.
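The "only replicate the bare minimum" idea can be illustrated with a sketch that compares cheap per-partition fingerprints on both sides and copies only the partitions that differ. This is not Circus Train's actual mechanism, just the general shape of the technique; the bucket and prefix names are made up.

```python
import boto3

s3 = boto3.client("s3")

def partition_fingerprint(bucket, prefix):
    """A cheap fingerprint of a partition: the set of (relative key, etag, size) tuples.
    ETags are not reliable content hashes for multipart uploads; real tools compare
    proper checksums or replication metadata instead."""
    paginator = s3.get_paginator("list_objects_v2")
    fingerprint = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            fingerprint.add((obj["Key"][len(prefix):], obj["ETag"], obj["Size"]))
    return fingerprint

def partitions_needing_replication(src_bucket, dst_bucket, partition_prefixes):
    """Return only the partitions whose contents differ between source and replica."""
    return [p for p in partition_prefixes
            if partition_fingerprint(src_bucket, p) != partition_fingerprint(dst_bucket, p)]

todo = partitions_needing_replication(
    "lake-us-west-2", "lake-us-east-1",
    ["bookings/local_date=2019-09-30/", "bookings/local_date=2019-10-01/"],
)
print(f"{len(todo)} partition(s) changed and need copying")
```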
Initially we built Circus Train with the idea that it would run on a time-based schedule, once a day, once every four hours and so on, which works well for certain use cases. But we found that if you really want to scale this out and minimise the latency between data becoming available and it being replicated, you want the replications to be triggered by events. So we built a layer on top of it called Shunting Yard, which monitors the Hive metastore for changes; when data changes, it triggers the relevant replications automatically in the background.

What we then found, as a somewhat unexpected side effect of this big cloud migration across the whole company, is that the different business units moved to the cloud at different speeds, which is good for them, but they started building their own data lakes in their own Amazon accounts, sometimes even in different regions. So we basically ended up with data silos. When we were on-premise, with one Hadoop cluster and one Hive metastore, most of the data was in one place and people could very easily do joins and queries across all of it. Now, and here I'm just showing three of the business units, we've essentially got three separate data lakes, and Hive was never designed to federate queries in this fashion. We'd built data silos, and that wasn't going to be acceptable to our end users.

So we thought about how to break down those silos. One option would be to tell all of the business units to move everything back into one big central data lake in the cloud. If you're operating at a smaller scale, that's probably quite a good approach, and quite a bit simpler, but we expected scalability issues: there are limits to how much you can do in one or two Amazon accounts, and we were also concerned about the blast radius of having the entire company and all of its users operating in a single account. If something went wrong, it could take down the whole company, which wouldn't be good. Another option we considered: we have a great replication tool, so we could look at all the shared datasets and replicate them into each of these data lakes, so everyone has their own copy of the data. If you only share a few datasets, that's also quite a reasonable approach, but we had thousands, possibly tens of thousands, and the idea of setting up and maintaining thousands of replication jobs, plus all the transfer costs and increased storage costs, made that a no-go for us too.

So instead we looked at how we could federate the Hive metastore, and we built an open-source tool called Waggle Dance. The name comes from the dance bees do to show other bees where to find pollen or food. The Hive metastore exposes a Thrift API for anyone who wants to get hold of its metadata, so we built a proxy that exposes the exact same API. You configure that proxy with the different downstream Hive metastores; people then query the proxy, the proxy goes out to all the federated metastores, gathers the results, aggregates them and presents them back as if it were a single metastore. That covers metadata access; for the actual data on S3 you need to set up the corresponding access permissions, which is generally fairly straightforward. As an end user, all you change in your client application is a URL in a configuration setting: instead of pointing at your local metastore you point at Waggle Dance, and you get this federated behaviour. In practice, in this example we set up a Waggle Dance service with what's considered a primary metastore, which has write access, and two external metastores configured in read-only mode. A user query coming in can then see and join across all three metastores as if they were just one.
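From the client's perspective the change really is just the metastore URI. Here's a hedged sketch using PySpark; the endpoint, port and database names are illustrative, and how the federated databases appear (for example with a prefix) depends on how Waggle Dance is configured.

```python
from pyspark.sql import SparkSession

# Point the client at the Waggle Dance proxy instead of a local metastore.
spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://waggledance.example.com:48869")  # illustrative
         .enableHiveSupport()
         .getOrCreate())

# One query can now join a table owned by the primary (writable) metastore with a
# table served by a federated, read-only metastore, as if they shared one catalog.
spark.sql("""
    SELECT b.brand, count(*) AS bookings
    FROM bookings.hotel_bookings b      -- database in the primary metastore
    JOIN vrbo_listings.properties p     -- database in a federated metastore
      ON b.property_id = p.property_id
    GROUP BY b.brand
""").show()
```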
So what does this look like in practice for us? This example shows three of our business units, Hotels.com, Vrbo and Expedia, operating in three separate Amazon accounts; those are the vertical dotted lines. We're running in two geographic regions, us-west-2 and us-east-1 in this example. Within a region, all the data is pretty much co-located, so latency and data transfer costs are low and we can federate data access across the accounts. We replicate data into the other region, and in that region we federate again.

Some of the best practices we've learned from operating all of this over the past few years: wherever possible we expose read-only endpoints to end users, because you don't want an ad hoc query writing into some unexpected place. Similarly, wherever there's critical-path infrastructure, ETL or streaming jobs that really need to run with close to one hundred percent reliability, you should separate that infrastructure from people doing ad hoc queries. And whenever you need data access within a region, you federate; when you need the data in another region, you set up one job to replicate it into that region and then federate again there. What we always want to avoid is federating across a region boundary, because then you're transferring data between regions, which is slow and expensive.

There are other alternatives out there if you need this kind of federated query mechanism. There's Presto, which is also a distributed SQL query engine for big data; it can federate Hive, and it can also do MySQL, Postgres and many others. The big problem is that there's been a huge disagreement in the community and a fork: there are now two versions of Presto with the same name and different foundations behind them, so good luck picking the winner.

As you can imagine, there's quite a bit of configuration needed to set up and maintain all of this, so we built an umbrella project called Apiary where we aim to put all of this, componentised as much as possible, so you can pick and choose the bits you want. We have Docker images for all the services, Terraform deployment scripts for setting up the networking, load balancing and other infrastructure, and a range of authorisation and other optional extensions.

I'm going to give an example of one of those optional extensions: a metadata event framework that we built. One of the reasons we built it is that when you're operating thousands of data processing jobs at scale, driving everything on a time basis with cron and relying on when data should be there is very painful. We prefer to drive everything from metadata events. What this framework does is emit events whenever there's a change to the data; they go into Kafka or SNS, and downstream consumers can subscribe to them.
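A downstream consumer of those events might look roughly like the following sketch, which assumes the events are delivered to an SQS queue subscribed to an SNS topic. The queue URL and the event field names are illustrative; the real event schema our framework emits will differ in its details.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/metastore-events"  # hypothetical

def handle(event):
    # React to partition-level changes instead of polling on a cron schedule.
    if event.get("eventType") in ("ADD_PARTITION", "ALTER_PARTITION"):
        table = f"{event.get('dbName')}.{event.get('tableName')}"
        print(f"{event['eventType']} on {table}: trigger replication / housekeeping")

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # When SNS delivers to SQS it wraps the original event in a "Message" field.
        handle(json.loads(body.get("Message", "{}")))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```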
One of the big problems we have is rewriting data at scale. We generally partition our data by some notion of when it arrived, for example when a hotel booking was made. Ideally those partitions would be immutable, but that's very rarely the case: you get late-arriving data, and sometimes you have updates. And with massive datasets and massive partitions, even just doing an update takes a lot of time. So, to ensure we have read isolation for queries running across those partitions, when updates come in we write an entire new version of the partition, and when that write has finished we repoint the Hive metastore to the new location. But then you have the problem of what to do with the old versions of those partitions. Because it's such a big distributed system, we don't actually know when people have finished reading that data, so we end up with these orphaned datasets sitting around. How do you expire them?

What we did is write a tool called Beekeeper. It sits downstream of that metadata event framework, watches all the time, and detects when one of these update operations has happened that potentially leaves orphaned data behind. All the data owners need to do is put a Hive table parameter on their table; with Beekeeper plugged into the event listener, it finds the rewrites and schedules the old data for deletion in the future. We use a time window of three days, just to be completely sure everyone has stopped using that data, and it then performs the deletions. This is one of the places where having a central platform makes life a lot easier for our end users, because they don't have to do all of this housekeeping themselves.
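Putting those pieces together, here is a hedged sketch of the write path: write the new version of a partition to a fresh location, repoint the metastore so readers switch over, and mark the table so the housekeeping service may later remove the orphaned files. The table property name shown is the kind of flag Beekeeper looks for, but treat it and everything else here as illustrative and check the project documentation for the exact parameter.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://metastore.example.com:9083")  # illustrative
         .enableHiveSupport()
         .getOrCreate())

# One-off: flag the table so the housekeeping service knows it may delete data that
# is no longer referenced by the metastore (property name is illustrative).
spark.sql("""
    ALTER TABLE bookings.hotel_bookings
    SET TBLPROPERTIES ('beekeeper.remove.unreferenced.data' = 'true')
""")

# Rewrite: write the corrected partition to a *new* location, so queries that are
# already running keep reading the old files undisturbed.
updated = spark.table("staging.hotel_bookings_corrections")  # hypothetical source of updates
new_location = "s3://lake-us-west-2/bookings/local_date=2019-10-01/v2/"
updated.write.mode("overwrite").parquet(new_location)

# Repoint: new readers now see the new version; the old files become orphans that
# the housekeeping job will delete after its safety window.
spark.sql(f"""
    ALTER TABLE bookings.hotel_bookings
    PARTITION (local_date = '2019-10-01')
    SET LOCATION '{new_location}'
""")
```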
There are other alternatives if you want to be able to do these kinds of updates and have consistency in your platform. Newer versions of Hive have the ACID semantics that relational databases exhibit; there are Iceberg and Delta Lake, which both use metadata files on a distributed file system instead of a database; and there's Hudi, another Apache incubating project that does something similar. They're all under very active development, so your mileage may vary with any of them.

We also find it's very important to test your data processing jobs, and there are a number of unit testing frameworks we've contributed to or open-sourced. There's HiveRunner, which you can use to test Hive SQL. We built a layer on top of that called Mutant Swarm, which gives you code coverage of your SQL so you can find the parts of your SQL code that are missing tests. And we have another project called BeeJU, which spins up a Thrift Hive metastore service or a HiveServer2 service in memory so you can write unit tests against it.

As for what we're looking at next: what we really want is to have our entire data platform built on open-source solutions that we can run wherever we want, based on performance, cost and scalability. Hybrid cloud would be very nice, running things on-premise or at a cloud provider, and so would multi-cloud, using different cloud providers for their strengths, but both of those come with the cost of increased complexity. What we're really hoping for is that the combination of Docker, Terraform and Kubernetes will let us take this entire platform and deploy it at scale, without too much effort, either on-premise or on one of the cloud providers' Kubernetes engines.

These are some of the projects I've talked about, with links if you're interested. Have a look, they've all got mailing lists, and we'd love to hear from you. That is the end of my whirlwind tour of data lakes; I think I've got three or four minutes for questions.

Please stay seated while we do questions so we can actually hear the question. Any questions for Adrian?

Hi, thank you for the presentation. You mentioned a distributed file system within the Amazon cloud; which components do you use for storage, S3 or a bundled Hadoop cluster within Amazon? Thank you.

Sure. For the data lake we use S3 for long-term storage. Whilst we're running data processing jobs we use EMR, and we also use Qubole. A lot of those spin up a temporary HDFS cluster for writes whilst the job is running, but when the job completes they write everything back to S3. So the foundational piece really is S3.

Any more questions? Can I throw it?

Hi, I'm wondering whether your users have low-latency needs, in that they don't want to wait for the replication to finish before they can query the data.

Yes, that's a good question about users with low-latency needs. Generally we recommend that the people who really need immediate access to the data run their jobs in the region where the data is being produced, so most of those replicas are just for disaster recovery, and those use cases can generally handle higher latency. The other thing I didn't really talk about here, as I was focusing on the data lake aspect, is that we're trying to move as much as possible to a stream-first approach. A lot of the data that comes into the data lake actually arrives in real time onto a big streaming platform based on Kafka. So if you have really low-latency needs, you write a streaming application that pulls the data off Kafka, and that covers the low-latency use case. All the data that arrives on Kafka then gets put onto the data lake, but the latency there is usually minutes, possibly an hour.

Okay, maybe one short question, a very short one.

Hello, do you include data catalogs in your process?

So the question is about how we catalog data. Yes, that's a good question. Hive, again, is our basic data catalog, because it has the schemas and everything is organised into databases and tables, and we have views on top of this. We use a product called Alation, which can spider the Hive metastore and pull all that metadata in; we can also register relational databases with it and get a single view over everything. There are a lot of other tools out there that do this, and they all have pros and cons; I haven't seen one I'm massively happy with. Amazon have something called Glue which does this for you, but then there's vendor lock-in. So there's no one beautiful open-source solution for this at the moment that I'm aware of.

Okay, unfortunately we are out of time. Thank you very much, Adrian, for this talk.