Yes, I have a question for everybody first: why is this conference in English? When Paradigma called me and asked, "Hey, do you want to do a presentation at Big Data Spain?" I said, "Awesome, the first time in my career that I get to do a technology conference speaking in my own language. It's going to be good." And now it's in English. I said, damn you, Paradigma. Damn you guys.

So, moving on. Thank you very much for coming. My name is Mariano. I'm here with my friend, co-worker and director of data development for Expedia, Anselmo. He's going to take over the presentation at the point where it becomes super boring. Then he takes over and it's going to be awesome. I'm the CTO for Expedia Partner Solutions.

So I'm going to tell you a little bit first about who we are. Hopefully you know us. We are the world's travel platform, right? We are the Expedia Group. You can surely guess that Expedia Group has the Expedia brand: Expedia.com, Expedia.es here in Spain. We also have Hotels.com; you probably know that as well. But there are many other really big brands that are also part of our travel platform and our travel group. A very good example, of course: we have trivago. Probably not many of you know that trivago is one of the Expedia Group companies. We have Travelocity.
We have Orbitz, but then, interestingly, HomeAway as well. HomeAway is one of the largest vacation rental platforms in the world, and it's part of the Expedia Group. That is actually a very interesting synergy that is very relevant to the rest of my talk.

So, a little bit of information about Expedia, which hopefully makes what we're going to show you next a little more relevant. We're quite big, right? We have 600 million visits, we operate in more than 75 countries, and we have more than 20,000 employees. I think last year we did 90 billion USD in sales, which is probably the biggest travel platform in the world. We have more than 10,000 affiliates of different types. So, as you can see, we have a very interesting scale problem and a very interesting scale opportunity.

There is a secret, and I'm putting our secret here at the beginning of the PowerPoint. We're about travel. Everybody at Expedia loves travel, and you could say we're a travel company. But if you ask any Expedian (and there are some here in this room; probably half of you are from Expedia, or, well, I see one right now, it's just a few), we're a technology company. We're an engineering company, right? Everything at Expedia is about technology. It's about how we use technology, and what we do with technology to bring people closer and bring the world within reach for those people. In every decision there is always a technology team and technology people in the room, right? So this is not the typical model of traditional companies, where there is a business side of the company that makes decisions about the strategy, and then they just dump all of that on the technology guys and say, "Figure it out." Well, Expedia is a little bit like that too.
Sometimes. So we embrace technology, we do technology for a living, and of course data and data science are also part of technology. We saw very early on the power that they can have in our products and in the experience that we give to our partners and our customers.

So within Expedia, I work for Expedia Group Partner Solutions, which is the B2B side of Expedia. We power more than 1,000 companies: we have more than 1,000 partners that get all the Expedia inventory and products through our systems. Of course, our flagship digital product is our API, which I'm quite certain is the biggest public API in the travel industry right now. We do more than 600 million searches a day. We have been growing massively every year because we're taking all those awesome Expedia products and giving them to many, many other companies. And, a little more relevant to why we're here today, we process around 10 terabytes of data per day. So it's a very interesting data challenge and data opportunity.

We're going to talk about a few of our partners. I have to say, these are the guys who make us great. As you can see, these are a lot of companies that you probably know worldwide, not just in Europe or in the US, and they all get a lot of inventory and a lot of really awesome products from Expedia.

So, interesting things that are happening in the travel industry that create opportunities and challenges for us. Augmented reality and VR. If you haven't watched that movie, don't watch it; read the book first.
It's a lot better, and it's an awesome book, probably one of my favorite ones. Then watch the movie. But basically, we're already working very hard on augmented reality and VR. If you've been on holiday, you probably know this: you buy a holiday from Expedia or from anywhere else, and if you get a bad flight, you have a few painful hours, but after that it's OK, I'm out of here. But if you get a bad hotel, or a hotel that you're not happy with, that is not catered for you, that is not the best for you, your whole holiday can be ruined, right? So it's very important, and people spend a lot of time looking at the content that we provide for the hotels. And we have the best content in the industry.

But this is about to change, right? Many of you probably have a VR headset already, or augmented reality on your phones. We have to completely rebuild our data structure and data strategy around content, which is again one of the most important things about selling a holiday, around augmented reality and VR. You want to give the customer the ability to really go into a hotel and see every aspect of it before they make a final decision. For some families, the holiday they're going to take this year, the amount of money they're going to spend on it, is the most important financial decision they're going to make in the whole year. So we want to assist them and give them the best possible opportunity to do that in the best possible way. But from a technical perspective (most of you are probably technologists) you realize it's a complete overhaul of our content systems, which are one of the most important things we have.

All these opportunities and challenges are quite connected to each other. So another one is that the market is growing for us.
There is consolidation. Expedia has added hundreds of thousands of new properties this year into our systems. So every day you go to expedia.com and search for something, you're going to see a new property. I mentioned HomeAway before. We're integrating all the HomeAway properties into Expedia and the whole Expedia Group landscape, which means you don't have just the hotels anymore. You have all the vacation rentals, right? So before, if you wanted to go to Paris, you had a thousand hotels there. Now you're going to have a thousand hotels plus another thousand, two thousand, whatever the number is, of vacation rental flats there. And if you've used a vacation rental before, you know that sometimes it's as good as a hotel, or even better, right?

So our data is growing and our supply is growing massively, which is good for the customer, but it creates a lot of challenges for us as well. For every new hotel or property that we add, we have a whole new set of content. We have, of course, the property data, but then we also have millions or billions of fare and price combinations, right? So our pricing engine is growing exponentially, because it's billions and billions more prices to calculate every day. I mentioned before that we do 600 million searches a day. Imagine that multiplied by a hundred hotels, 200 hotels, 100,000 hotels, 200,000 hotels. So basically this growth is fantastic for the company and fantastic for our users, but for us as technologists it brings a very, very important challenge: not only to manage that data, but also to be able to give our customers what they want.

That is connected to the third point, another travel industry challenge and opportunity, which is voice and personalization. So, over the past 20 years,
I've been lucky to spend almost 18 years in the travel industry. It was one of the first industries that was properly disrupted by e-commerce and by the internet. Interestingly, of course, a lot of people have been shifting their buying habits towards online, and right now online is bigger in travel than retail. But the retail industry is still there; actually, we power many retail partners. And there is still a massive difference between going to Expedia, Travelocity, HomeAway or Orbitz and searching there, and going to a travel agent and talking to somebody who, if you go there often, knows you. They know what you want. They know what type of hotel you want. They know which holidays you took. So there is a personalization component that travel agents, the retail side of the industry, have and that no online company has taken on yet. That is a very interesting data challenge as well, and it's something where all the technologies we're talking about in this conference, data, machine learning, AI, can help.

Voice search is a little bit of that, because personalization becomes a very critical factor in voice. Basically, you go to Expedia right now and search for a hotel in Paris, and you get a thousand results. We add all the vacation rentals, and you get 2,000 results. If you're in front of a computer and you have patience (I think the average customer spends at least 17 or 18 hours browsing through the content), you can look at the pictures. But suppose you want to search through Alexa. You say, "OK Alexa, tell me what hotels I can use in Paris." "OK, result number one of 2,000: this hotel," and it describes it. You're not going to sit there for two hours just listening to Alexa tell you everything that's available.
So that's our job, and it's one of the very strong use cases we have for data science: there's only room for one or two options, maybe three. So from all these hundreds of thousands of properties and billions of potential fare combinations, we need to be able to select those few specific results that this customer, he or she, wants. That is a fantastic data challenge and a fantastic opportunity. But if you don't do it right, you end up with a horrible user experience, as I think Oscar was talking about before.

So, some of our data science use cases, some of how we're spending our time and our effort. Sorting, of course, is very relevant to what I said and to what we're seeing in the industry. There is a massive difference in conversion if you sort the right property to the top, and that sorting is not static. The best sorting for me is not the best one for you, and it's not the best sorting for somebody in Hong Kong. So it's about combining that personalization with sorting, with what type of traveler you are, with what we know about you. This is one of the biggest use cases we have right now, and we're using plenty of different tools, which Anselmo is going to talk about later, to enable and empower that. And again, with 600 million searches a day, this is a huge amount of data we consume every second to try to do this.

Image classification. Would you go to that hotel over there?
It's actually a good hotel. I've never been, but I heard it is. There is a fantastic presentation from our director of data science, Nuno, about how we did this. But basically, the hero image of a hotel in search results is super important, right? The first image you see is incredibly important to drive conversion, and some hotels have 20 or 30 images. So for us to select the best image for each hotel, especially when we add hundreds of thousands of hotels every year, is not that easy. It's not that simple, and sometimes even what the hotel thinks is the best image is not really the best for conversion.

This case is actually interesting because we use Mechanical Turk, an AWS service. I don't know if you've heard about it, but basically you use people as a service. We asked them to sort a sample of all the images we have: "OK, which one of all these images from the hotel is your preferred one?" We took all that data from the actual manual sorting by real people, thousands of people, and we used it to train our algorithms: this is what most people selected. Then we applied that to our entire data set, all the images of all the hundreds of thousands of hotels, to re-sort and reshuffle the best order for the images of one specific hotel once you're already there, and of course to pick the best hero image. After doing that, we A/B tested, or MVT tested, multiple different algorithms and multiple results, right? We can see in real time what the customer reaction is, and I can already tell you the results were amazing.

Forecasting and anomaly detection. This is an interesting case. We put it up here because it's not just customer facing. It's not just about giving the customer better products; it's also about running our business better. Look at the numbers I showed before. Say most of our core systems are down for a minute at Expedia.
We lose thousands of dollars, for one minute. Add that up over many minutes a year and it's a significant amount of business. These are the problems that you have at scale. So basically, we developed our own internal systems to try to forecast, but also to try to detect anomalies. If the whole system is down, it's kind of easy to detect, because there are people shouting all over the place. But sometimes it's this point of sale, for this particular destination, for this particular hotel, that is not selling when it's supposed to be selling. That's where the data is too big to consume for our operations team or for any person. And that's where our system kicks in and detects that this very specific use case is failing. At the scale of Expedia, those things eventually add up to a significant amount of revenue that we could lose.

Voice and bots. I talked about voice already. You've probably heard a lot about chatbots in this conference over the last few days, so I'm not going to talk too much about it. But basically, part of our customer experience is not just the sell part but also the support side, right? We want to make that support the best possible. And many customers, like myself, don't want to call somebody and talk. They just want to chat. That's where we have a massive opportunity to reduce the cost and therefore make the service even better for everyone. This is a very straightforward use case.

And again, very related to what I said before, and very important for us: many of you have probably been on Amazon before, and one of the interesting features on Amazon is that you're buying one thing and it says, "Hey, other people like this as well. We recommend these other products." That doesn't happen in travel very often. This is the personalization component, where if you like this hotel, then based on your history we can
recommend all these other properties as well, so we can cross-sell another property. So that's another very interesting use case of how we use data science to actually improve our product and therefore improve our business.

So finally, as a CTO, my job is to bring problems to my team, and I'm very good at that. Then I can relax and let the team sort it out. And that's what Anselmo and his team do. So what are the challenges we have? This is kind of a summary of what I've been talking about.

We have a huge amount of data with completely different meanings, right? We have partner data. We have supply data. We have customer data. We have very different types of data that we need to bring together to create the right context for our machine learning and artificial intelligence to make the right decisions.

Supply size. I mentioned it many times. It's big and it's growing, and it becomes a completely different problem when you reach a certain scale.

Partner size. We have more than a thousand partners all around the world, and not all the partners want the same thing or have the same problem. So we need to cater for all of them.

Content size. More than a million images, as I mentioned before, that we need to sort out and use in the best possible way.

And 10 terabytes per day. That's the amount of data that we generate right now. We're growing double digits every year at Expedia Partner Solutions.
Therefore, you know, it's becoming bigger and bigger. So we actually need our data architecture and our data services to be able to scale at that pace or even faster, because we don't want to rebuild everything we have every year, right? A couple of years ago Expedia announced that we're moving everything to AWS, and that gave Anselmo and me the opportunity to start rebuilding most of the data lake and data warehouse products that we have. We rebuilt them with that AWS mindset, and therefore with the scalability that we need, and that we know all these new technologies and new customer use cases are going to require for us to keep giving a good service.

So, my name is Anselmo. As Mariano was saying, I run the data technology at Expedia Partner Solutions. When we started to migrate our data platform to the cloud, we wanted to leverage the capabilities that the cloud provides, not only to benefit our business but also to benefit our users and our partners. Our scale is immense and increasing every single day. The only way we can actually operate effectively at that scale is to leverage and exploit machine learning, artificial intelligence and data science.

With the opportunity of building a data platform in the cloud, we defined a couple of guiding principles that we follow, essentially to lower the barrier to produce data, to enforce governance, quality and security, and, most importantly, to facilitate the way people consume data. Let's go through them.

First of all, the cloud migration plan. There were three main motivators. First, we wanted to follow the data producers that moved to the cloud. Those 10 terabytes of data suddenly stopped being produced on-prem and started being produced in the cloud, so it was important for us to be efficient and to be close to them. The second one was to improve our security, scalability and resilience. That translates to faster queries, to just being faster at delivering things. And last, we wanted to
promote technology innovation. In our previous state, working in a shared data center, we had a lot of stress on our Hadoop cluster and it was hard to promote innovation. Being in the cloud, it's just easier and much faster to integrate new technologies.

One of the challenges we faced during this transition was to maintain both environments in the same state. How could we provide the same data in our on-premise data center and also in the cloud? We wanted to start moving our users and our services to the cloud, and with that, one of the main challenges was to ensure the data was on both sides. We used the power of Expedia being such a large group, and we developed an internal tool named Circus Train that basically replicates Hive tables between clusters. It replicates not only the table data but also the metadata associated with it. It's quite light-touch, and it can copy unpartitioned or partitioned data. It's not event-driven; it's still a batch process. But it allowed us to start moving our users and our processes to the cloud. This is a quick example of Circus Train: a YAML file, quite a simple process, where you specify the source catalog, the replica catalog and the table replication options.

With all those brands that Mariano presented in the beginning moving to the cloud, the challenge we faced was that we moved from a centralized world, where the data was in just one place, to a decentralized world. Of course we wanted to promote innovation, scalability, all those good parts, but it also has its bad parts. Suddenly we started to create dozens of data silos across the organization. And because we also use shared services across Expedia Group, it was becoming hard to reach other groups' data or other companies' data. For that purpose, one thing we decided on was thinking about data lakes as federated data lakes, in a way that data lakes can be connected
with each other. It shouldn't be any different for me to access data from the Hcom Hive metastore or from the EPS Hive metastore. With this federated access, we could easily facilitate and, again, accelerate access to our data. For that, we also developed at Expedia Group a tool named Waggle Dance. Both Circus Train and Waggle Dance are open source; you can check them out on GitHub. Waggle Dance is basically a request router for the Hive metastore: it proxies requests to the right metastore for whatever data lake you're trying to access. Again, it's a simple YAML configuration. We have it running across all of the numerous AWS accounts that we have, and we use Waggle Dance to federate all the information.

Another focus, and part of the solid foundation, is the data quality framework. From the beginning we wanted to have trust, and for that we wanted to manage our data assets like we manage any other product. We wanted to promote instrumentation, observability and so on. Ultimately, we wanted to be the first to know. Just as we want to be the first to know when one of our services is down for any reason, we wanted to apply the same principles to data assets. If some process fails, we want to be the first to know. We want to know whether the data is accessible, and whether it's fresh enough, complete, accurate, enriched or integrated.

The second principle that follows the solid foundation is that we wanted it to be easy to produce data. With the explosion of microservices and the explosion of serverless, we suddenly moved from monolithic architectures to an enormous number of services that are created in the cloud every week, every day. So we wanted it to be extremely easy, from when you start your first development or your first code commit, to start producing data to our data lake. We moved away from the four near-real-time services we had internally to just one. That was a massive move for us, but we
wanted to consolidate: we wanted to find one way to produce data. We obviously chose Kafka as our main ingestion engine. That gives us the scalability, performance and latency we need for the 10 terabytes of data we ingest every day, and that is increasing every day. We wanted simplified schema management and support on all the environments, as I mentioned, from labs to staging to production. And in the end, we want to strive for a fully hands-off service.

Making it easier to produce data also brought some responsibilities for the producers. We wanted to change the mindset of our producers. In the past, our services, our monolithic services, were all producing data without worrying about the data they were creating, or even about the downstream services that were consuming that data. So we defined this contract, which basically tells our producers, when they onboard onto our platform, that they own the data schema they are managing and they own the data they are producing. They should stream events in real time. They should obfuscate sensitive information. They should document and exemplify those data assets so people can consume them easily. And they should monitor the data that's being produced, just like they monitor their service. Obviously, we provide simple ways to produce data.

That brings us to the third guideline of the platform: we want it to be easy to consume data. Easy to produce, easy to consume. Being in the cloud, it's just much, much easier.
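As a rough illustration of the producer contract described above (own the schema, obfuscate sensitive fields, emit one event per record), here is a minimal Python sketch. It is not the platform's actual code: the schema, the field names, the `booking.v1` label and the hashing choice are all assumptions made for the example, and the step of actually sending the bytes to Kafka is left out.

```python
import hashlib
import json
import time

# Hypothetical schema owned by the producing team:
# field name -> (expected type, is the field sensitive?)
BOOKING_SCHEMA = {
    "booking_id": (str, False),
    "hotel_id": (int, False),
    "customer_email": (str, True),  # sensitive: obfuscated before it leaves the service
    "price_usd": (float, False),
}


def obfuscate(value):
    """One-way hash, so consumers can still join on the value without seeing it."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def build_event(record, schema=BOOKING_SCHEMA, schema_name="booking.v1"):
    """Validate a record against the schema, obfuscate sensitive fields, serialize it."""
    missing = set(schema) - set(record)
    unknown = set(record) - set(schema)
    if missing or unknown:
        raise ValueError(
            f"schema mismatch: missing={sorted(missing)} unknown={sorted(unknown)}"
        )
    payload = {}
    for name, (expected_type, sensitive) in schema.items():
        value = record[name]
        if not isinstance(value, expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")
        payload[name] = obfuscate(value) if sensitive else value
    envelope = {"schema": schema_name, "produced_at": time.time(), "payload": payload}
    # These bytes are what a producer would hand to its Kafka client.
    return json.dumps(envelope, sort_keys=True).encode("utf-8")
```

Rejecting records that do not match the declared schema at the producer is one way to make "you own the data you produce" concrete: bad data fails in the owning service rather than in a downstream consumer.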
We wanted to benefit from that. So even if those bullet points sound big, we're actually just leveraging the capabilities of being in the cloud. We easily provide Hive, Presto and Spark in almost any version people want. We use EMR for data processing, as the native solution on AWS. We use Databricks heavily on the data science side. We use Qubole for querying across the business, and we use Athena for our operational support. Alongside that, we also built an analytics API that allows programmatic access to any analytics data, with granular access control (ACLs) at the dataset, column and row level. We allow people or services to consume analytics data, to search for data, to break it down and filter it, and to build time series, comparisons and forecasts on key data sets. And we allow that in a simple and fast way, with sub-second response times.

The fourth pillar is that data science pushes the envelope, and it pushes the envelope on many fronts. It pushes the envelope on the development cycle, from model tuning and algorithm training to model storage. On the feature pipelines that support that development cycle. On batch execution, where we are able to run backtesting on any new models being developed or new algorithms being trained. And also, obviously, with that comes performance evaluation and observability.

Obviously, the goal is for those machine learning models not to live only in an offline world but to move to the online world. For that we have a machine learning service that we call the decision engine, where we deploy those models using continuous integration and continuous delivery. This is the point where software engineering patterns and best practices come together with the science world. Obviously, we have an online feature store and model serialization.
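To make the idea of dataset, column and row level access control on the analytics API a little more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the real API: the `GRANTS` table, the caller and dataset names, and the `query` function are all hypothetical.

```python
# Hypothetical grants: (caller, dataset) -> allowed columns plus a row predicate.
GRANTS = {
    ("partner_123", "bookings"): {
        "columns": {"hotel_id", "check_in", "price_usd"},
        "row_filter": lambda row: row["partner_id"] == "partner_123",
    },
}


def query(caller, dataset, rows, columns):
    """Return only the rows and columns this caller is allowed to see."""
    grant = GRANTS.get((caller, dataset))
    if grant is None:
        # Dataset-level check: no grant at all for this caller and dataset.
        raise PermissionError(f"{caller} has no access to {dataset}")
    denied = set(columns) - grant["columns"]
    if denied:
        # Column-level check: refuse the query instead of silently dropping columns.
        raise PermissionError(f"columns not allowed: {sorted(denied)}")
    # Row-level check: keep only the rows the caller's predicate accepts.
    return [
        {column: row[column] for column in columns}
        for row in rows
        if grant["row_filter"](row)
    ]
```

A caller asking for a column outside its grant gets an error rather than a silently narrowed result, which keeps the behaviour predictable for services that consume the API programmatically.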
We have model serving. We serve different types of models, from MLlib, TensorFlow and PMML. In the end, as was mentioned, we want to deliver the best hotel for a customer at the right time and at the best price. That's why we have all this massive architecture behind it.

I like to say internally that to achieve all that, it takes a village. The scale of our operation, the scale of our data and the challenges that are put to us define the full picture. Just to finalize on the meaning of "it takes a village": I also like to say that this is more than just software development. It's important to have cross-functional teams coming together, to have a solid platform to build on top of, and to promote the best engineering practices. And this is very important, especially when working with a person like Mariano: having a critical execution path, being really nimble to achieve something quickly and deliver value, and then of course measuring the operational cost to see the impact.

That's it. Thank you. If you have any questions, please ask. If not, we'll be available outside for any conversation if needed. If there are questions upstairs, just shout. It's fine, we can hear you. You can ask in Spanish too. Much better in Spanish or Portuguese. Obrigado.

Well done, both. Thanks a lot for the presentation. My question is: how long did it take to build this whole thing?
We are still building it. It is still evolving. I think that is why it's important for us to have that solid foundation and those principles, and they need to be clear to everyone, so that when a new component comes along it makes sense in the full picture. So we are still building and still evolving. We have data scientists pushing us for new algorithms to be deployed, and we have to find a solution. But the principles and the guidelines are there, and they make sense.

Yeah. And I think, well, we started our cloud migration around two years ago, and that's when we decided to build the new EPS data cloud. So we've been at it actively for one and a half years. I don't see it ending any time soon, because data keeps evolving and our systems keep evolving. I think you do have an advantage if you're doing a project like this the way we did: we had all the old systems still in place while we built the new systems, right? So you create some data replication processes, and basically you don't have to interrupt your business while you keep building the new thing. At some point you need to switch over from the old data warehouse to the new data warehouse, so you have that flexibility. But other than that, I think we definitely have a few more years ahead of us before we can rest and take a holiday to Portugal.

Thank you. On the slides I saw that you use Databricks notebooks, you use Hive, Athena. What exactly do you use Qubole for?
Qubole. Qubole is basically our Hive interface. They also have notebooks, and basically we use it to run Hive and to run Spark. We use both: Databricks is more popular with data science, while Qubole is used more across the other units and functions in the business.

OK, and also, if you have too much money in your pocket and it's burning a hole, just get Qubole and it goes away very quickly. It's a good way.

The benefit of being in the cloud and having the right interfaces in place is that it's just easy to plug things in and easy to try a new vendor or a new solution. When you are in an on-premise data center, it's much, much harder.

OK, since I still have the mic: how do you organize or orchestrate all the jobs, all the pipelines? What do you use for that?

That's a good point. Orchestration: we had Informatica, we had Azkaban. We moved them to Airflow. We also have data scientists using Luigi. But with all of those solutions we have to pay a lot. The one that excites me most is actually starting to use serverless. For example, in AWS we have Lambdas, and we can leverage the power of Step Functions. We just pay for the execution. We don't pay for a full cluster to keep the orchestrator running. And, most importantly, you have observability, you have monitoring, and you can be sure that thing won't go down, because it's serverless and it's maintained by AWS. So that's the thing that excites me now for orchestrating our jobs.

Thank you. Thank you.
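As a rough sketch of the serverless orchestration mentioned in that last answer, the Python below builds a small AWS Step Functions state machine definition in Amazon States Language and checks that its steps chain together. The pipeline steps, state names and Lambda ARNs are hypothetical placeholders; the talk does not describe the actual jobs. Only the definition is built here; nothing is deployed or executed.

```python
import json


def lambda_task(function_name, next_state=None):
    """Build one Task state that invokes a (hypothetical) Lambda function."""
    state = {
        "Type": "Task",
        # Placeholder ARN: REGION and ACCOUNT_ID are not real values.
        "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
        # Retry failed tasks twice, doubling the wait between attempts.
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# A hypothetical three-step batch pipeline.
PIPELINE = {
    "Comment": "Illustrative pipeline, not an actual production job",
    "StartAt": "ExtractSearchLogs",
    "States": {
        "ExtractSearchLogs": lambda_task("extract-search-logs", "BuildFeatures"),
        "BuildFeatures": lambda_task("build-features", "PublishToDataLake"),
        "PublishToDataLake": lambda_task("publish-to-data-lake"),
    },
}


def execution_order(definition):
    """Walk the chain from StartAt to the End state, failing on dangling or cyclic links."""
    states = definition["States"]
    order, name = [], definition["StartAt"]
    while True:
        if name not in states:
            raise ValueError(f"unknown state: {name}")
        if name in order:
            raise ValueError(f"cycle detected at: {name}")
        order.append(name)
        if states[name].get("End"):
            return order
        name = states[name]["Next"]


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(json.dumps(PIPELINE, indent=2))
```

The JSON printed at the end is the kind of document Step Functions accepts as a state machine definition. Because each step is a Lambda invocation, cost follows executions rather than a cluster that stays up to run the scheduler, which is the trade-off described in the answer.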