Hello, sorry for the delay. There seems to have been some kind of technical issue, so I'll try to make up for it. So: data lakes for financial entities. Why data lakes? There are many types of big data architectures, such as lambda architectures or kappa architectures, as you may know. But even though those are used in banking, I believe that for the main information repository, the main informational system of a bank, data lakes are probably the best kind of architecture. The reason is that, for one, banks have heterogeneous information, which is basically common in big data, and banks have a vast amount of information, both in terms of volume and attributes. There are also a lot of regulatory requirements that are constantly and rapidly changing. But overall, the thing about financial institutions is that you don't need all of the data at once; you just need some of it sometimes. And when you do need some kind of data, you don't need it only from that point forward: you need to load as much historic information as you can. That makes banks great candidates for data lake solutions.

The thing is that data lakes are probably one of the most complex architectures to implement in any kind of entity. We did some research early this year in the United States, where we met most if not all of the tier one institutions there (think of the big bank names of the United States). Out of 36 entities, we found that only seven had implemented a data lake solution, and 18 (or 16, sorry) were in the process of implementing one. What really shocked us is that when we tried to find a success case, we found zero. We didn't find a single data lake running at a financial entity that they were happy about. It's extremely hard to implement a data lake.

We tried to narrow down the main reasons for failing at implementing a data lake, and we found four. The first is focusing on the ingestion processes and getting all the data into the lake; we will speak about this later. The second is trying to curate all of the data, to validate and cleanse everything by moving it on to the next layers, so that everything that comes in is moved to the curated or process layer. The third is not paying enough attention to metadata; metadata is critical in data lakes, and we will see this later as well. And the last one is approaching it as a big bang: thinking, okay, this is the final picture I want, how can I get to that big picture? Instead of narrowing down the steps of the process.

So let's go to the architectures. Well, one of the things about data lakes is that there is no common architecture where you can say: this is the standard architecture for every data lake.
It really changes from one to another, and I've tried to keep it simple; we will add more layers later, as we will see. But I think we could all agree that the basic layers within a data lake are the following. The raw layer is where data arrives; it's stored in a raw, untreated format, just as it looks in the source systems. The second layer is the process layer, and here you store all the data that is actually used. We don't have all the data from the raw layer here, just the things that are being used, or that should be moved to the process layer. It may be cleansed data, it may be calculated data, and this data is, as I was saying, ready to be used. The third layer is the reporting layer. Here data is aggregated in order to be consumed, and it is where it's consumed by the user.

I think we could more or less agree that those are the three main layers that define a data lake in general. I know we could split them into more layers. For instance, we could split the raw layer into a landing layer, where the data actually arrives untreated (the landing layer being more of a temporary stage), and a raw layer where you actually store the data. And we could split the process layer into two parts: a trusted layer, holding data that you have cleansed from the staging layer, and a business layer, holding data that you have calculated, derived from the trusted data. But since we are going to add more complex layers on top, let's keep it simple. And please keep in mind that this is just one point of view; there are a lot of ways to define a data lake, but in the end, more or less, there are three layers: a raw layer, a process layer and a reporting layer. A minimal sketch of that three-layer flow follows.
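To make the flow concrete, here is a minimal sketch in PySpark (the engine most of the stacks discussed later run on), assuming a hypothetical loans feed; the bucket paths and column names are illustrative, not part of any prescribed layout:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("three-layer-flow").getOrCreate()

# Raw layer: data exactly as it arrived from the source system.
raw = spark.read.option("header", "true").csv("s3://datalake/raw/core_banking/loans/")

# Process layer: only what is actually used, cleansed and typed.
process = (raw
    .filter(F.col("contract_id").isNotNull())
    .withColumn("principal", F.col("principal").cast("decimal(18,2)")))
process.write.mode("overwrite").parquet("s3://datalake/process/loans/")

# Reporting layer: aggregated, ready to be consumed.
report = process.groupBy("status").agg(F.sum("principal").alias("total_principal"))
report.write.mode("overwrite").parquet("s3://datalake/reporting/loans_by_status/")
```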
Now, going back to the reasons for failure. I think that when we see a data lake, we want to get there as fast as possible. It's a great architecture; working with it is amazing, and we will see the best ways to take advantage of these kinds of solutions. But when we see the full picture, we want to get to that picture as fast as possible, and since data arrives in the raw layer, moves to the process layer and then goes to reporting, we usually try to approach building it the same way: okay, let's go to the raw layer and feed everything into it, then move everything to the process layer, then we'll do the reports, and maybe in two or three years we can use it. That's not the point. You want to use it as fast as possible; one of the great things about data lakes is the time to market you get with these kinds of solutions.

The right way to approach it is not bottom-up but top-down. The way it's done is basically this: you always have a business need; you never have a data need, because data without uses is nothing. So you should focus on what you are going to do with the data lake. Take any project, take any business area, take whatever you want, and start with the reporting. Analyze what the data will be used for; analyze the business itself, the reporting layer. From there, go down to the process layer and say: okay, in order to generate these reports, whether they're for management or regulation or whatever, I will need this data. There I define the set of data I need in the process layer, whether it's derived data with additional attributes or data that exists in my current source systems. Once that's done, we can move to the raw layer and look at which source systems hold the information I need. And here's the point: when I go to a source system, I bring everything in, not just the attributes I need but everything I can get out of that system. Still, I will only move to the process layer what I'm going to use, and I will only move to the reporting layer what the reports and the business area actually need.

Now we're going to add some additional layers, and the first one, for me the most important one besides these three, is the metadata layer. For those of you who aren't used to metadata: it might sound scary, but it's just data about the data. It's data about the fields, the attributes, how you calculate them and so on. This is critical, and we are seeing a lot of data lakes end up as data swamps, most of the time because they lack a good metadata layer.

There are two main kinds of metadata; there are other kinds, but these are the main two. The first is technical metadata. By technical we mean everything related to the technical side of the attributes or fields: the format, the length, the naming across the different stages, the name in the operational source system, and, if they have a domain of values, what that domain is. Everything that's not business related. The second kind is functional metadata. Here, if we have a field that is calculated, we express how it's calculated; if we have a field that has a meaning, we try to define the meaning of that attribute, so that when anyone searches the metadata for some kind of attribute or information, they will have the necessary data about what they need. The technical metadata is generated in the raw layer: when data comes in, it should come in with as much technical metadata as needed. Where is it coming from? What's its name in the source system? Where is it being stored, and under what name? And so on. The functional metadata is generated in the process layer: if I cleanse something that comes from the raw layer, I state the definition of the attribute. Is it an amount? What kind of amount: a principal amount, an interest amount, a default amount? And if it's calculated: which attributes does it come from, how is it calculated, what's the frequency of the calculation, and so on. A sketch of what such metadata records might look like follows.
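Purely as an illustration, here is one way those two kinds of metadata records could be modelled; the field names are assumptions, not a standard:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TechnicalMetadata:           # generated at the raw layer, on ingestion
    source_system: str             # where the data comes from
    source_name: str               # field name in the operational source system
    storage_path: str              # where it is stored in the lake
    data_type: str                 # format, e.g. "decimal(18,2)"
    length: Optional[int] = None
    value_domain: List[str] = field(default_factory=list)

@dataclass
class FunctionalMetadata:          # generated at the process layer
    definition: str                # business meaning, e.g. "principal amount"
    derived_from: List[str] = field(default_factory=list)  # source attributes
    calculation: Optional[str] = None  # how it is calculated, if derived
    frequency: Optional[str] = None    # how often it is recalculated

@dataclass
class AttributeMetadata:           # one searchable entry per attribute
    name: str
    technical: TechnicalMetadata
    functional: Optional[FunctionalMetadata] = None
```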
There are more kinds of metadata, and you can add as many as you want, but probably the most common two besides technical and functional are these. Security metadata is very useful: who can access which attributes, which kind of data, which domain of data. For instance, if you have a multi-entity data lake, which entities can access what. Another one that's not so common but is very useful is reporting metadata. In reporting you are usually aggregating data, and aggregating financial data is not always straightforward. For some fields you just sum the amounts, but for others you need to do some calculation at the moment of aggregation; for instance, if you are aggregating market data, you may need to weight it by the duration. Having this kind of information can be quite useful.

Now for the more fun layers. We have external engines; this is very common in banking. You certainly have engines for risk, for MIS, or for whatever your needs are, and they should be fed from the data lake. It really makes sense: you have cleansed data, you have the data right there. Even better, if you can return the results to the data lake, that's great; you will have everything in, and that's the goal in the end. If you're bringing in data from an external engine, it always comes in through the raw layer. Basically, everything that comes from outside the data lake should always enter through the raw layer. These engines can include third-party products, in-house products and so on; as I was saying, an engine is fed from the process layer, and if its output comes back, it comes back through the raw layer.

The next one is machine learning, and this is great. It's probably one of the most important points of big data in the end: big data is not about doing things faster or with more data, it's mainly about solving paradigms that cannot be solved otherwise with traditional technologies. Here we're thinking about AI, about deep learning, graph-oriented databases and so on. Again, it's fed from the process layer, and this is probably one of the few cases where results come back directly to the process layer, because the machine learning part is an integral part of the architecture. It's not outside information, it's not an outside source system; it's a core part of the whole architecture. Machine learning has a lot of applications within financial institutions. We've recently built a neural network system for credit scoring based on behaviour modelling, which is great, and it can be extremely useful if you have a data lake solution. A simplified sketch of that kind of flow follows.
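As a deliberately simplified stand-in for that scoring flow (the real system used a neural network; this sketch uses a plain logistic regression from Spark MLlib, with hypothetical behavioural features), the point is only that the model reads from, and writes back to, the process layer:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("credit-scoring-sketch").getOrCreate()

# Behavioural features prepared in the process layer (illustrative path/columns).
clients = spark.read.parquet("s3://datalake/process/client_behaviour/")

assembler = VectorAssembler(
    inputCols=["avg_balance", "num_overdrafts", "months_active"],  # assumptions
    outputCol="features")
features = assembler.transform(clients)

# "defaulted" is an assumed 0/1 label column.
model = LogisticRegression(labelCol="defaulted").fit(features)

# Scores return to the process layer: the model is part of the architecture,
# not an outside source system.
scored = model.transform(features)
scored.select("client_id", "probability") \
      .write.mode("overwrite").parquet("s3://datalake/process/client_scores/")
```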
On top of that we can add a speed layer. You can have real-time information within a data lake; it's not just for batch processing. You can have a layer with real-time information: it should go through the raw layer and then move to the process layer and to reporting as needed, and you can do that in real time. Moving data from layer to layer doesn't imply a batch process; you can do it on the fly.

The last layer I'm adding to this common data lake architecture is manual entries. There's a lot of information, whether for internal processes, for reporting or for many other needs, that does not exist in any source system and cannot be derived from existing data, so you will need to input it somehow. The only thing about manual entries is that they should always come in through the raw layer; they should not go straight to the process layer.

Okay, let's move to technology. I'll try to go over some of the most interesting technologies for each of the layers, and please forgive me if I leave something out, because I will leave something out; there are a lot of choices and a lot of ways to do this. Starting with the raw layer, I'm splitting this into two parts: open source solutions and private solutions, which I think are evolving nicely as well. As for open source, it's basically HDFS. That's the way to go; there are other options, but none are as widely used or even close to taking its place. As for private technologies (and we are comparing Amazon and Google here, just to be clear; sorry for leaving out Microsoft, but it's Microsoft), from Amazon we have S3, which is great, and on Google Cloud we have Google Cloud Storage. Those are basically the two options you can go to.

Now for moving data between layers. This does not mean only from the raw layer to the process layer; as we will see, it can also mean from external sources into the raw layer. In open source we have MapReduce, which is not as used as it was when it came out; everyone is moving, or has already moved, to Spark for these processes. You can also go with alternatives such as Sqoop, which is great for moving data from relational databases to HDFS systems. And of course, if you want to move data on the fly, in real time, Flink is the way to go. For private technologies, again we have options within Amazon and within Google. Within Amazon we have Glue; if you don't know Glue, it's a technology that moves data, again, from relational databases to HDFS. The cool thing about it is that if you create a new attribute or a new table, or there's a change in the schema, it will import that part automatically; you don't need to reconfigure it or alter it when the source system changes. We also have EMR, which is basically a managed framework for big data architectures, a managed cluster, so beneath it you can run Spark or Hadoop or Flink. And we have DMS from Amazon as well, the Database Migration Service, again for migrating from databases to HDFS, or to S3 in this case. For Google, you can create jobs within BigQuery to move data from one layer to another if you want to go serverless, and Google also has Dataflow. About serverless, the thing is that a lot of technologies are coming out that are not server-based; you can rely on them without managing a server. They are usually more expensive, depending on the amount of information you have and how often you access it, but you could build a whole data mart on a serverless infrastructure. I won't go into detail, because it would take way more time than I have, but you could build a solution based entirely on serverless architectures. As an illustration of the movement part, a small ingestion sketch follows.
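For instance, the Sqoop-style movement of relational data into the raw layer can also be done with Spark's JDBC reader; the connection details and table name below are placeholders, and Glue, DMS or Sqoop itself are alternatives:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-ingestion").getOrCreate()

# Read a whole table from a core banking database (needs the JDBC driver
# on the classpath; URL and credentials are placeholders).
accounts = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://core-banking:5432/bank")
    .option("dbtable", "accounts")
    .option("user", "reader")
    .option("password", "...")
    .load())

# Land everything the source offers in the raw layer, untreated; only
# what is actually used will later move on to the process layer.
accounts.write.mode("append").parquet("s3://datalake/raw/core_banking/accounts/")
```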
Now for the process layer. The process layer, in the end, uses the same storage as the raw layer, so again we have HDFS, and again we have S3 and Google Cloud Storage. The thing is that you may want to lay out the data in HDFS in a different way, maybe in a column-oriented way or a row-oriented way. For those cases you have Avro and Parquet, which are great options for modelling data in different ways, as sketched below.
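A small sketch of laying out the same process-layer data both ways: Parquet is column-oriented (good for analytical scans), Avro is row-oriented (good for record-at-a-time access). The paths are illustrative, and the Avro writer needs the external spark-avro package:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-formats").getOrCreate()
loans = spark.read.parquet("s3://datalake/process/loans/")  # illustrative path

# Column-oriented layout, the usual choice for analytical queries.
loans.write.mode("overwrite").parquet("s3://datalake/process/loans_columnar/")

# Row-oriented layout via Avro (requires the spark-avro package).
loans.write.mode("overwrite").format("avro").save("s3://datalake/process/loans_row/")
```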
Now for the reporting layer. As for open source, you can go to a relational database, and if you do, we always try to go to dimensional models based on Kimball architectures. Kimball architectures, even though they come from older, traditional technologies, work great for reporting; relational databases are quick, and dimensional modelling is an amazing way to model them for reporting. As for private technologies, we have BigQuery from Google and Redshift from Amazon. As you can see, some technologies sit directly on the process layer, while others are mainly databases that live in a physically separate layer. One more thing about reporting: besides these technologies, you can add another kind of soft layer on top, which is the actual tool you use to access the information. In open source you mainly have Pentaho; you don't have that many options, and new ones are coming out, which hopefully will do great. But in the end you can go to solutions such as Qlik or Tableau, which from our point of view are far more powerful than the open source ones, or even any other competitors. Again, I'm leaving out Microsoft, perhaps not fairly, because they do have good reporting tools for this layer.

Now for machine learning. In open source, as you probably all know, you again have Hadoop and Spark, with MLlib and Mahout. And for private technologies, again, Google and Amazon each have their own managed machine learning offerings.

Coming to the speed layer, I've gone with just one technology, and that's Flink; it's probably the most important, or most representative, true streaming technology there is right now. You could take different approaches: you can go with Kafka and Spark if you choose a micro-batch option, and if you already have a messaging broker within your organization, such as RabbitMQ or Tibco or whatever, you should really leverage it. You can use it to get information into the system in real time; there's no need to build a whole new infrastructure if you have something that already works and is already delivering information. A minimal micro-batch sketch follows.
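Here is a minimal sketch of that Kafka-plus-Spark micro-batch option; the broker address, topic and paths are placeholders, and the job needs the spark-sql-kafka connector package. Note that events land in the raw layer first, exactly as batch data does:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer-sketch").getOrCreate()

# Subscribe to a hypothetical payments topic (placeholders throughout).
events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payments")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

# Each micro-batch is appended to the raw layer; moving on to the process
# and reporting layers can then happen on the fly as well.
query = (events.writeStream
    .format("parquet")
    .option("path", "s3://datalake/raw/streaming/payments/")
    .option("checkpointLocation", "s3://datalake/checkpoints/payments/")
    .start())

query.awaitTermination()  # block while the stream runs
```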
And with that, from our initial three layers we've arrived at a much more complex architecture. As I was saying, this is not the only way to build a data lake architecture; there are many ways to approach it. Just keep in mind that among the technologies there isn't a single right one; it really depends on what you're going to do with it, your needs and the way you want to approach it. This is just an overview of the different options.

For the next part, we're moving to methodologies. I believe that data lakes, and the best way of approaching them, wouldn't exist without two things, besides big data in general. The first is cloud computing: big data in general wouldn't exist, or wouldn't be as strong as it is today, without cloud computing, whether it's an external public cloud or a private one. The second is agile methodologies, and the thing is that they apply extremely well to big data and to data lakes. Agile methodologies really push the time-to-market premise I was talking about before: if you want to build a data lake, you need to be extremely fast in delivering data and getting it up and ready, and in the end the only way to do that is with an agile methodology. You cannot go with a full waterfall and say: okay, I'm going to build a data lake within three years; the first year I'll define what I want, the next year and a half I'll develop it, and the last half year I'll test it. It won't work, you shouldn't go that way, and you will burn a lot of money. You should go with agile methodologies.

Now, it's easy to talk about, but banks and agile methodologies are not yet best friends, so it's really hard for a financial institution to adopt an agile methodology. They need to have everything budgeted; they need to know what the timelines are. You cannot say: I don't know exactly how much it's going to cost, but I know it will cost less and be faster; that won't make sense to them. So we found two approaches to agile within financial institutions that might help you out.

The first is that we've created our own agile methodology, which is based on micro-waterfalls. I won't go into much detail about our methodology (you probably don't want to hear me speak about that), but what I can tell you is that it is based on Larissa Moss's methodologies; you can check them out in William Inmon's Data Warehouse 2.0 book, or in some of her papers and so on. As I was saying, it's really important to know where we come from in terms of business intelligence, traditional technologies, data warehouses, data marts and their methodologies, because there's a lot of useful material there. Ours is basically a methodology based on quick definition, quick prototyping, getting to the testing phase as fast as you can, and iterating. The way we do it is with micro-waterfalls that last between two and three months. I know two or three months is a lot of time compared to the two or three weeks of a Scrum sprint, but it's better to have two- or three-month deliverables than a huge waterfall that lasts two or three years. That's how we iterate, if you want to check out the methodology behind our micro-waterfalls.

The second approach, the one we've seen suit financial institutions especially well on the way to agile methodologies, is this: find those first reports, those first business needs, that you want to start your data lake with, and just use a waterfall for an MVP. It shouldn't be a big project; it should be something small. Start with something like a tracer bullet approach; I won't get deep into that kind of methodology, but the idea is to go from end to end of the architecture with the most minimal piece of information, just to prove that everything works and is laid out correctly. So build an MVP with a waterfall: set a date, set a budget, set a timeline. That will be much easier for a financial institution to adopt than trying to go to an agile methodology straight away. But as soon as you have that, move to Scrum for product evolution, have your own Scrum team, and use agile methodologies for continuous development.

The next thing is priorities within the bank, and we should really start with whatever is going to be used. Don't start a data lake just for the sake of it, for technology reasons. Find a user, find a business area that wants a data lake, and start with them. It doesn't matter if it's big or small; just start with something that is going to be used by a user, because that's the most important part. If you do that, you will start growing, and you will have something in place that is easy to grow, instead of a monster that no one actually uses. Once you have that, the second point is general ledger information. If you've worked in banking, or in any other financial institution, on any regulatory report or requirement, you know that if data is not reconciled with the GL, it's useless. So if that first project is not GL-related, try to get the GL in as soon as possible; a sketch of that reconciliation idea comes after this section. Once you have the GL, the next thing you need is management information, whether it's for management information systems or for asset and liability management. In the end, the board of a bank speaks in terms of profitability, revenue and so on, data that is not necessarily as closely related to the GL, but that is management information; and you need to be able to speak in those terms in any part of the bank, GL terms for regulation and management information for management. The fourth thing would be client information, mainly for credit risk; credit risk is mostly based on client information rather than on accounts or balance information, so that should be the next priority. Of course, if your first project is already related to any of those (and it should be), that's fine; but these are basically the priorities we are seeing.
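A sketch of that GL reconciliation idea: aggregate what's in the lake, compare it against the ledger balances, and flag the breaks. The table layouts, the join key and the tolerance are all assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gl-reconciliation").getOrCreate()

# Process-layer detail and the general ledger (illustrative paths/columns).
loans = spark.read.parquet("s3://datalake/process/loans/")
gl = spark.read.parquet("s3://datalake/process/general_ledger/")

# Aggregate lake detail up to GL account level.
lake_totals = (loans.groupBy("gl_account")
    .agg(F.sum("principal").alias("lake_amount")))

# Anything reported from unreconciled data is useless, so flag breaks early.
breaks = (lake_totals.join(gl, "gl_account")
    .withColumn("difference", F.col("lake_amount") - F.col("gl_balance"))
    .filter(F.abs(F.col("difference")) > 0.01))  # tolerance is illustrative

breaks.show()
```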
And lastly, go to the non-core information: that's marketing, that's CRM, that's whatever you want to do with information that is not directly business related. You could start with that, but it's not the way we would recommend going. If you start with a big data project for human resources (and I really love those kinds of projects, don't get me wrong), it won't be as effective as going to a business area that is actually trying to improve and that will use the data lake for its business needs.

Okay, for the last point, and I'm closing the presentation with this slide: I've tried to go a bit quicker through the details on the technical side, so hopefully we'll have some time for any questions you may have. It's about taking advantage of the solution. Let's say we have the data lake in place, and by in place I don't mean the perfect data lake; that does not exist, a data lake is constantly evolving. What's the best way to take advantage of it? In the end, there are two sides: the business side and the project team.

From the business side, what we really recommend is to have a data scientist within the business area. They will be able to access the raw layer and go through all of the information in there; they will be able to create new reports, and if a new regulation comes out, they will be quick to deliver on it. And once they have the information and have sent it to the user (because at that point they are part of the user side, using the information), writing requirements for the project team becomes extremely easy. It's just a matter of: this is what I've created, and I want it delivered on a monthly basis, or a daily basis, or whatever.

Now for the project team: this is a project that should be owned and sponsored at C-level, so to speak. There's already a figure for that, a figure that has been a bit confusing, or not well defined, at least for as long as I've known it, which is the CDO, the chief data officer. They tend to do things no one really knows much about: data lineage, understanding the data of the entity and so on. They are the ones to own this project: they should be the product owner for the Scrum team and the sponsor of the project. That's the way you should go; they should own this project, review this project, and be accountable for its delivery.

Basically, I think that going this way, data lakes are an extremely good architecture for banking, and you can really get a lot out of them if you end up with a great data lake. And that's basically it; hopefully you've enjoyed it. I don't know if we have any questions from the audience? I don't know if we can get a microphone up there. Please don't be too hard on me.
Q: Hello, thank you for your presentation, and thank you very much for coming. I would just like to know how unstructured data fits into your architecture, because I haven't seen it mentioned. Within the whole architecture, is everything columnar or not? There is a lot of data within a financial institution that is unstructured, like a contract, for example, in which some of the fields are not as structured as we would like them to be.

A: Yeah. In the end, as for the raw layer, and what I was saying about the process layer, about going with a column-oriented or row-oriented layout for both the raw data and the process data: you can lay it out however you want. If you want to go with unstructured storage, you can; it probably makes more sense in the raw layer, and then to go with a more structured layout in the process layer. But in reality, those are the two layers where you can lay out information as you want. If you'd like to go through how we model each of the layers with people more technical than me, we have a booth here, so feel free to come by and we'll get into more detail. But in the end, for modelling each of those layers, it really doesn't matter: you can structure them as you want, whether the data is structured or unstructured.

There's another question here in the front.

Q: Hello, I have two questions. Could you give a further explanation of one of the reasons for failure? You called it something like focusing on ingestion.

A: Sure, I'll expand on that. The ingestion process means getting the data into the data lake, into the raw layer. Among the entities we've spoken with, some have tried to approach it as: let's go to all of the source systems and get all of the data into the data lake. There are some exceptions where you could do that; if you're basing it on, let's say, Glue from Amazon, you can be extremely quick about it, and it could be all right. But for most cases it's the wrong approach. Mainly because it will take you a lot of time to get all the data from all of your source systems; keep in mind that core source systems in banking are very heterogeneous, hold a lot of information, and really differ from one another, so it's very hard to get the data out of everything, and you will spend a lot of time there. And once you have the data, it will be very hard to understand what's underneath, because if you start that way, without any business requirement, you won't know what to do with the data; and once you do have a business requirement, it will be very hard to find the data you're actually looking for. But that's the second point; the first and main one is time to market. You should get to the reporting layer and have the data lake up and running as quickly as possible, not spend so much time in other layers that don't really produce results.

Q: Okay, and the second question: you didn't mention any technology for the metadata.
A: I guess you're talking about something like a schema? Yeah, I left that out on purpose. The thing is that we've really tried to find a good solution for metadata; even for this presentation, we reviewed the landscape just yesterday to see if there was something out there we were missing. But beyond some technologies that include reasonable metadata features (for instance, Glue, which I was talking about, has a fairly complete solution for technical metadata), we haven't found a single solution that is good for the global use of metadata. What we end up doing is in-house, on-demand projects to create a metadata database, and we have approached it in very different ways; you can do it with different technologies and so on. But in the end, every place where we've done it has ended up with a self-developed, in-house solution for metadata.

Q: Okay, thank you.

A: Thank you. So, I'm not seeing any more hands. Thank you all for coming; hope you enjoyed it.