Hello everyone, my name is Chris. I'm a Data Infrastructure Lead for the Data Analytics Platform at ING. Earlier in my career I had some episodes in high-frequency trading, and I also had quite some fun migrating the data infrastructure part of Spotify to the cloud.

Coming from ING, a large multinational financial institution, you can understand there will be some challenges on the journey of building something connected with data. Data is closed off in silos, and how to leverage all that data stored everywhere is a real challenge in that kind of company. We heard a lot today about embracing open source and the power of community, and there were also some talks about data platforms. How about I combine all those talks together and share a bit of our journey at ING, how we got to this point?

The ultimate goal is to become a data-driven company: to really leverage all the data inside the organization and build products based on that data. But how do you put that in numbers? If you have that ambitious goal and you build a product, what does it really mean? What do you want to achieve? When we started the vision of the Data Analytics Platform, we wanted 50% of all our employees making use of the data. That doesn't mean 50% of ING employees would now become data scientists, because that's impossible. But if they start using the outcomes of what the Data Analytics Platform provides, the reports, the different products based on the computation done on the platform, that's going to be a success. That's the high-level goal, the North Star metric that we have for our platform.

But what does it take to get people onto that platform? That's the human aspect, the human centricity that we added. We use the term "democratized" here, but we give it our own meaning. We want to make data and analytics accessible to everyone in the company, but that doesn't really mean the data should be open to everyone. We have regulatory constraints and internal regulations; data shouldn't be available to everyone at all costs or any time they want. Security is a major concern in the financial industry, so we need to make sure that the data owners sharing their data with the platform can trust that it is being used securely by our users. You also need to bring the necessary tools to the platform so users can actually leverage the access to the data they have. And most of all, you need the computational power to tackle the amount of data you start to store on that platform.

The three layers we created in our vision are data democratization, then analytics democratization, then machine learning democratization. No single layer can work without the foundational layer below it being done first. Data democratization is access to all the data sets stored within the company; as I mentioned before, not open to everyone, and data needs to pass some checks before it can be used or shared with anyone.
Analytics democratization is when you can start building your models and sharing data and dashboards with users. This is also where the common taxonomy, the common dictionary we call ING Esperanto, is leveraged, so we can have a common understanding between data sources coming from different regions, different banks across Europe. In this layer we combine all that terminology, so everyone accessing the data, whether it comes from the Netherlands or from Turkey, knows what kind of data they are talking about.

And finally there is machine learning democratization, the cherry on the cake. Only when all these things are delivered can you start leveraging machine learning, because you need to trust the data and understand the data; only then can you start applying machine learning that makes use of the data in other products.

Focusing on data democratization, we leverage the self-service aspect here. If you want to scale to the point where 50% of your organization is using the analytics platform, you need to make sure users are able to do as much as they can on their own. That is the key factor for success here. We started with a centralized team for data ingestion that we call the data liberation squad. This is how we bring data to our platform from the different data lakes; we have a federated data lakes initiative at ING, but we can also bring data from systems of record, although preferably the data lake route is the one we embrace. With the generic ingestion framework we prepared, the vision is that in the future users will be able to bring data onto the platform on their own. They can liberate and democratize their own data, become data owners themselves, and bring data in the same way it was done before with this ingestion framework.

What is being done in that framework? We check for the proper schema and apply it, we apply the taxonomy rules from the common dictionary, we make sure data lineage is embedded in the framework so integrity is there, and we make sure data quality metrics are delivered and exposed to the users. So they know they can trust the data, they know where it came from, and they can use it in their reports, their systems and their products. I'll show a small sketch of what such an ingestion step could look like in a moment. And then there is the discovery part, which I'll come back to later: how do you make this data easy to find? If new employees come to ING with a genuine idea that needs some kind of data, how can they find that data in your organization? This layer should be able to answer those questions too.

There is some open source technology we use under the hood. One of these, Rokku, is probably unknown to you; we open-sourced that technology ourselves, and it creates an authorization layer towards storage.
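To make the ingestion framework a bit more concrete, here is a minimal sketch of what one step of such a generic framework could look like. The column names, the taxonomy mapping and the function itself are hypothetical illustrations, not ING's actual code, and a Spark-based stack is assumed here simply because Spark comes up later in the talk.

```python
# Hypothetical sketch of one step of a generic ingestion framework:
# apply the common taxonomy, enforce the schema, embed lineage, and
# expose simple quality metrics. Illustrative only, not ING's code.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generic-ingest").getOrCreate()

# Common taxonomy ("ING Esperanto"): map source-local column names to
# shared company-wide terms. The entries are made up for illustration.
ESPERANTO_MAPPING = {"klant_id": "customer_id", "musteri_no": "customer_id"}
EXPECTED_COLUMNS = {"customer_id", "booking_date", "amount"}

def ingest(path: str, source: str) -> DataFrame:
    df = spark.read.parquet(path)

    # 1. Apply the taxonomy so every region speaks the same language.
    for local, common in ESPERANTO_MAPPING.items():
        if local in df.columns:
            df = df.withColumnRenamed(local, common)

    # 2. Enforce the expected schema before the data is shared.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{source}: missing columns {missing}")

    # 3. Embed lineage as metadata on every row.
    df = (df.withColumn("_source_system", F.lit(source))
            .withColumn("_ingested_at", F.current_timestamp()))

    # 4. Expose simple data quality metrics so users can trust the data.
    total = df.count()
    nulls = df.filter(F.col("customer_id").isNull()).count()
    print(f"quality[{source}]: rows={total}, null_customer_id={nulls}")
    return df
```

The specific checks matter less than the principle: every data set enters through the same gate, so schema, lineage and quality metadata are always attached before anyone consumes it.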
Analytics democratization is where we see roughly 80% of the analytics being done. Machine learning maybe accounts for 20% or even less; it may be the more powerful outcome, but 80% of the work happens exactly here. When you do data cleansing, sorting and merging, and then prepare visualizations, this is exactly the place where it happens. You have BI-like tools, like Apache Superset from the open source world, and you can also attach commercial tools like Power BI; all of that sits in this layer. "Data science in a box" is our concept for bringing machine learning tools and frameworks to data scientists: they can specify the environment and all the dependencies they need to work, then just start it on our compute cluster and get access to the data.

Finally, machine learning democratization: you build machine learning pipelines, you reuse the pipeline definitions that we have, and from there, when you have trained a model and stored it in the model registry, you can bring that model into production. We have a framework that helps you bring that model into execution. In exploration you are free to test your different hypotheses, free to mingle with the data, but from that point, once you have the definition of your model, you can put the model into production.

Okay, three different layers, but conceptually, what do we want to achieve on the human side? How do we make it easy for those 50% of employees to get onto the platform? Well, the simplest answer is a search bar; basically the Google-style answer to "what do I want to do today?". The journey starts when you go looking for the data set you want to use in your organization. That single, powerful way of accessing data in your company is what should be available to our users; this is how we see it.

But there are other drivers for the platform, obviously, not only the search bar. Containerization allows us to scale out; it relates to what was mentioned about commodity in the keynote. We embrace containerization as a way to commoditize infrastructure, making sure everything is packaged in a single deployment artifact and then used on the platform. Open source is at its core, open for developers so they can speed things up. Cloud native in the sense that we envision this platform being extended into the cloud; that's actually something we are doing right now. We started with on-premise technology, but we're thinking about how to extend the platform into the cloud. And integrated data lineage end to end: as I mentioned before, the data needs to be guaranteed, users need to trust that data. These are the key drivers of success that we see for the platform.

Now our humble beginnings: how was it possible in our organization that a product like this could emerge? The product was started in a tribe that thrives on diversity. We are a product tribe of 27 nationalities spread across four countries and nine product squads. In that diversity of different ideas and different ways of thinking, I come from Poland, for example, we have quite a presence in Romania, here in London, and obviously in Amsterdam, that combination of different minds helps us build these different products. Conceptually, we are a product tribe.
So it's not only the Data Analytics Platform that started in that tribe. At first the platform only supported products for the wholesale banking part of ING, and now it's a global platform, but other products start here as well. That product design thinking is really important here.

Maybe a brief bit of history of the journey. The concept started around 2018. We had a kind of platform before, based on Hadoop technology, but conceptually we started to think of something bigger, something more towards the future, in 2018. We assembled a small team that built an MVP, and just one year later we did a rollout to 15 countries and started with 500 users. We have been constantly improving the product and adding new features, and right now we are at 1,700 users across 17 countries, and we currently support 240 products in ING. Those projects are at different stages of maturity; most are in exploration, testing their ideas, and sometimes they fail, but that is the inevitable reality of trying things out in today's world.

So how do we build products at our tribe at ING? It starts with design thinking; this is the human-centered part of the platform that I'd like to share here. It starts with empathizing with the users. You step outside your biases, and with the help of a designer you go and talk to your users, trying to get an idea of how they work, what their problems are, what challenges they have. With all that experience combined, we gather as a team and really define the problem: what do we actually want to solve in this particular product? Once we have defined the problem, we can start generating ideas. This is the place where you use your imagination: how would you solve that problem? Is there a better idea? Maybe a bad idea, maybe different examples from the market. When this part is done, you have the prototyping phase. We build prototypes together as a team and evaluate them ourselves, and then we start the testing phase, where we test those prototypes with pilot users or with a bigger group of users and gather feedback. The circle goes on with each new feature; everything we build follows the same principle. We have designers and user researchers helping the engineers build the right product to answer the right needs. Most of the time we spend on the question: are we tackling the right problem? Are we going to spend too much time chasing goals or something that is not going to be used within the company? That's the way we iterate.

We also identified different personas on the platform. You have different types of users using your platform; this is what we identified in our case, and it might be different in your company. They use different layers of the platform. Leadership, for example, only makes use of the dashboards created in the analytics democratization layer. But data scientists obviously need access to the latest tools, the latest libraries. They need their Python notebooks with the dependencies that they built on their own or bring in from the internet.
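To give an idea of what "data science in a box" could mean in practice, here is a small hypothetical sketch of an environment definition being rendered into a container recipe. The spec format and all the names are my own illustration; the actual DAP environment builder is not public.

```python
# Illustrative sketch of a "data science in a box" environment spec.
# A data scientist declares dependencies; an environment builder turns
# the spec into a container image that runs on the compute cluster.
# The spec format and names are assumptions, not DAP's real ones.
from textwrap import dedent

environment = {
    "base_image": "python:3.10-slim",  # assumed base image
    "pip": ["pandas==2.1.0", "scikit-learn==1.3.0", "jupyterlab"],
}

def to_dockerfile(env: dict) -> str:
    """Render the spec as a Dockerfile for the CI/CD pipeline to
    build, scan and ship onto the platform."""
    pkgs = " ".join(env["pip"])
    return dedent(f"""\
        FROM {env['base_image']}
        RUN pip install --no-cache-dir {pkgs}
        CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser"]
    """)

print(to_dockerfile(environment))
```

Because the output is just a container image, the same artifact runs unchanged on premise or in the cloud, which is the point of the containerization driver mentioned earlier.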
We also need to make sure we can deliver those products to them, but in a secure way.

Human centricity, once again: this is something we believe is key to our success. We do the user research and we do comprehensive user support. We have our Slack group where we give support to our users, and it's something they really embrace on their journey with our analytics platform. That was something I was a little bit afraid of: would it scale beyond 1,000 users? But then all of a sudden I realized that I had stopped supporting users, because the users support themselves. The first users, the key enablers, start using a feature, they know how to use it, and they help others. All of a sudden we had built an internal community. Maybe that's not a surprise to you, but it helped tremendously to give support to our users, to welcome them, to help them with their first steps on the platform. This is also the place where we get feedback and ideas for new features. When we see that problems keep recurring, we can tackle them, or maybe we get new ideas about how to help and can identify potential areas. Then you can again do research, empathize with the users, think with the users, and build new features on the platform. Testing, refinement, everything done with the users. One of the testimonials we always like to share with our key stakeholders in the company: when people are asked what they would do if the platform stopped existing, they say they wouldn't be able to work. That's one of the key testimonials from them.

Obviously there's a technology part of the stack that you cannot forget. There are lots of open source tools in this picture. You can take a platform from a vendor, or you can build it yourself; that's what we decided to do in this case. You can cherry-pick the components you like, because if you embrace containerization, with the mindset that you can bring any tool onto the platform as long as it's containerized and supports a container framework like Kubernetes, then you can move this platform anywhere. That's the beauty of this stack: you can replace the storage layer. Say you want to go with Google; then you can replace the object storage that uses the S3 API with the GCS API. You can go to AWS, you can go to Azure. This layer is completely replaceable; I'll show a small sketch of that idea in a moment. The same goes for compute: you can have your Spark cluster on Kubernetes, you can have your Presto clusters on Kubernetes, or why not use BigQuery, something from Azure, or AWS Athena? That's up to you. One thing that is key, in our opinion, is the front-end layer. That is the user journey we identified and prepared for the users, and we want to keep and maintain it regardless of whether our infrastructure is in the cloud or on premise. The layer the users are already familiar with is something we would not like to change.
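Here is that sketch of the replaceable storage layer. One generic way to keep analytics code independent of the object store is to program against a filesystem abstraction such as fsspec and pick the backend by URL scheme alone; this illustrates the portability idea and is not necessarily how DAP actually does it. It assumes the relevant backend packages (s3fs, gcsfs) are installed.

```python
# Keeping analytics code independent of the object store: program
# against fsspec and let the URL scheme choose the backend. A sketch
# of the portability idea, not necessarily DAP's implementation.
import fsspec

def read_bytes(url: str) -> bytes:
    """Read one object from whatever store the URL points at:
    s3://bucket/key, gs://bucket/key, abfs://container/path, ..."""
    with fsspec.open(url, "rb") as f:
        return f.read()

# The calling code is identical on premise (S3-compatible storage,
# e.g. behind an authorization proxy like Rokku) and on Google Cloud:
#   read_bytes("s3://analytics/datasets/customers.parquet")
#   read_bytes("gs://analytics/datasets/customers.parquet")
```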
And this logo you may not recognize: it's Amundsen, something I mentioned before, our data discovery portal. This is the search bar you use when you go searching for data. It looks through all the metadata of our data and helps users find relevant information. It helps you find who the key users of a data set are, which can help you understand it better, and it gives you a lot of insight into the data sets we already have. Data discovery is something that's emerging at different tech-oriented companies; Spotify has its Lexikon, other companies have their own data discovery tools. But this one is open source, and we contribute to the project. That's something I also encourage you to check out.

Diving a bit more into the details of the user journey here: everything is built around CI/CD. You've got the code repository, obviously; the definition of all of our code is in the code repository. You package everything into a container; this is the environment builder. If you specify your environment, "I want this Python dependency, Java of this version", it is then containerized and shipped onto the platform. It co-exists with the different tool sets on our platform; everything, as I mentioned, is containerized and gets access to the object storage, the storage layer, and that's the space we operate in. And the whole layer to the right, the containerized platform, can be shifted anywhere. Every component is basically replaceable, and we welcome other components. We constantly keep checking whether there is a better tool we can bring onto the platform, as long as it meets our criterion of being containerized, because we are a small team. For example, the infrastructure team of DAP is a team of five that supports the whole machinery. That's why we make use of the commodity view. And with that, that will be all, but I'm open to any questions. Yeah, go ahead.

Q: On one of your earlier slides you talked about product squads and platform squads. Which squads built which parts of the solution?

A: Okay, I mean the product squads here. You can always think in layers, yes? You have the infrastructure layer, you have the platform layer, and then, for example, Domino is one of the products that was also created within the tribe but then leveraged DAP, leveraged the analytics done within the analytics platform, for their own product. They started building their own analyses, they built their own models, and they feed their product with them. So Domino is one of those 240 products we have on the platform. It's also dogfooding: we eat our own food, whatever we prepared. For example, the dashboards tracking user activity on the platform, the adoption rate, are also built on the same platform. That's why we keep checking, and I've caught myself thinking, yeah, this is maybe suboptimal, we have to do it better, because I'm using the platform myself for my own purposes. That's also something I encourage you to do: dogfood, try to use your own tools, whatever you built, for your own purposes, and you'll see where you're missing something.

Q: So that's the democratization of data and analytics. One outcome could be that you get a lot of mess, because everybody is just diving in and doing whatever they want.

A: And that is fine. That is fine because if you're in exploration, checking your ideas, trying new ideas, it is still fine to produce all sorts of results. The only point is that from there you won't be able to move it into production, into the execution environments. As I mentioned, you've got lineage up to the point where we guarantee the health of the data set.
The generic ingestion framework brings the data from systems of record, through the data lake, to this part, and only from that part can data go into execution. If you start mingling with the data, you do that in your own project environment; you've got your own database where you can create things, bring your own data, add extra information there. But that information you won't be able to put into execution; it stays in the exploration phase.

Q: So the control point is when something moves into production?

A: Exactly, yeah. To guarantee lineage and to guarantee integrity, you just need to define what the source of your data is going to be in the production model. You define your model, bring it into production, pinpoint the input data sets, and that's how we guarantee integrity. The split between exploration and execution is how we divided it conceptually; it can still all be done on a single platform, it's just conceptual. It has this concept of a golden data set that cannot be touched, cannot be modified, because we need to guarantee integrity.

Q: As you built out this productized data platform, were you at the same time decommissioning and killing the legacy systems? Or did you build this up completely as a new thing on the side?

A: We had a cluster before, a pure Hadoop cluster, that this platform replaces. Obviously this is a bigger concept, and that legacy stack is being put aside. We definitely killed some other small analytics initiatives in the company, but mostly we believe in encouraging users to onboard onto the platform. That has happened organically; people choose it because it starts with the data. If you have all the data of the company prepared in a format that is ready for analytics, and the power of computation in one place, everyone will start using it, as long as you also allow them to use the tools they really want, the open source tools.

Q: [inaudible question, about structured versus raw data access]

A: Both, yes, we support both, but the standard way here is schema-on-write: we put the data into tables and people read from the tables, in tabular format. That covers most of it. But in particular use cases people can get access to the raw data, for example when there is doubt whether we parsed the SEPA or SWIFT formats correctly; in those cases we allow access to the raw data so they can apply their own schema.

Q: Because people could just keep adding data, data, data, forever, right?

A: There are different drivers to that one. For storage, we definitely keep an eye on it. We started with a vision of how big the cluster should be for a couple of years, and we are still good, but obviously as we approach capacity we have to keep this in mind and increase it. And for storage it's easy, because you calculate how much the different projects use and then you can divide the cost. For computation we have a simple trick, because it's exploration: we share the resources evenly among the users. If the cluster is idle and your job comes in, you basically get the power of the whole cluster. When a second party comes in, you divide the cluster 50-50, and so on; you share the cluster resources.
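That even-sharing policy is essentially max-min fair scheduling, the same idea behind schedulers like YARN's Fair Scheduler. As a hypothetical sketch of the arithmetic, not DAP's actual scheduler code:

```python
# The even-sharing policy as arithmetic: an idle cluster gives one job
# 100% of capacity; a second job splits it 50-50; jobs that want less
# than their equal split keep their demand and the rest is re-divided.
# This is classic max-min fairness, sketched here for illustration.
def fair_shares(capacity: float, demands: list[float]) -> list[float]:
    shares = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active:
        equal = remaining / len(active)
        satisfied = {i for i in active if demands[i] <= equal}
        if not satisfied:
            for i in active:       # everyone left takes the equal split
                shares[i] = equal
            break
        for i in satisfied:        # small jobs keep their full demand
            shares[i] = demands[i]
            remaining -= demands[i]
        active -= satisfied
    return shares

print(fair_shares(100, [100]))       # [100.0]      one job, idle cluster
print(fair_shares(100, [100, 100]))  # [50.0, 50.0] split 50-50
print(fair_shares(100, [20, 100]))   # [20.0, 80.0] capped by demand
```

The nice property is exactly what is described in the talk: an idle cluster gives one user everything, and capacity degrades gracefully as more users arrive.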
For the cloud, that's a different story, because that's something we still have in mind: how to tackle the cost properly and make sure some model is not going crazy. There we should really put some quotas in place.

Q: Tell us about bringing the data from the systems of record. How are you going to manage duplication, or avoid duplication, on your platform?

A: You've got different use cases, but the golden data sets are always there. Most of the data originates from the data lakes of ING; the platform takes the data from the data lakes, so it's already there. And on the platform we only have a single version of each data source available, and it is read-only. Projects get read-only access to the data set; they can create subsequent data sets based on it, but they cannot modify the existing one. And then when they go into execution, they pinpoint it: I want my model to use the data from data source one. Everyone is using data source one for their models, and that's how we guarantee there's no duplication. There's only a single version of truth.

Okay, there are no further questions, I guess. That will be it. Thank you.