So, hi everyone, this is Praveen Poddar, and I am working as a specialist BD for analytics at AWS. Today I am going to discuss a bit about modern data strategy and how AWS can help you achieve that strategy using AWS services. We are going to talk about modern data strategy, we are going to see how you can build a data architecture using AWS services, and at the end I am going to show you one reference architecture. So, let's start.

When we speak about modern data strategy at AWS, we generally say it is based on three pillars: modernize, unify and innovate. Let's go through these pillars one by one, starting with modernize.

When we say modernize, the problem is that when you are storing data or creating a data platform, most of the time we think more about infrastructure than about the business problem. Suddenly a business problem comes along that you want to analyze. Let's take an example. Say you want to see what kind of traffic is coming to your website, or what exactly people are searching for there. There are a lot of e-commerce websites which check things like: if a certain person is buying an iPhone or any other phone, he might buy a phone case as well. They try to improve their sales based on the buying pattern of a user. To analyze this, these companies track what exactly you are clicking on their website, gather that data, perform some analytics on top of it, and then decide: if a person is buying a mobile phone, later on I can show him a mobile phone case as well.

Now, to have this kind of data, just imagine you have billions of users, or more conservatively assume you have millions of users, and every second you are capturing that clickstream data. Imagine how big it can get. What happens in the normal case is that you start thinking: how am I going to store this kind of data, can my infrastructure scale or not, what kind of license do I need to buy, can my traditional relational database store this kind of data, can I analyze it fast enough? Instead of concentrating on the business problem, people concentrate more on their current infrastructure. So when you are building a modern data strategy, or you want to build a modern data platform, instead of thinking about your existing infrastructure, you have to think more about your business problem. To do that, you have to choose a data platform which is easily scalable, trusted and secure. Rather than focusing on infrastructure, focus on the business problem.

The second pillar is unify. In an enterprise or in any agency, you have multiple applications: it can be an ERP, it can be a CRM; if you talk about the ed-tech industry, it can be an LMS; if you go to retail, there can be different kinds of applications in an enterprise. Now if you want to see a combined report, you have to unify all this data.
And the problem with unify is that data can reside anywhere and in any format. Some data can reside on the cloud, some data can reside on premises, some data can reside on a different private cloud. Some data can be relational, some non-relational; some data comes as batch data, some as real-time data. How are you going to unify it? There can be different types of data in different places. To unify this data, you need services which can easily take data in any format, from anywhere, and push it into a data lake or a data warehouse.

And lastly, innovate. When we say innovate, we mean: once you have the data, what do you want to do with it? You might want to create a really good dashboard. You might want to expose an API for that data for other departments, or you might even want to sell that data. Or it is possible that you want to build some kind of AI/ML model with it. It all depends on your requirement, how exactly you want to consume that data.

So, these three pillars: first, do not think too much about the infrastructure; choose a platform which is easily scalable and trusted. Second, choose a data platform where you can easily unify all your data in one place. And third, choose a platform where you can easily consume your data, be it as a dashboard, an API, an AI model or a predictive model; it is up to you, or up to your business requirement.

Now we will discuss a bit about data architecture: how you can build a data architecture, and how AWS can help you achieve this modern data architecture. At a very high level, in a modern data architecture you can have different types of data sources: streaming data sources, batch data sources, structured data, unstructured data, and you need some service by which you can ingest it. Once you are ingesting the data, you have to clean it: you store it in a landing zone, do some kind of processing on top of it, and then the curated data will be there. Once your curated data is there, you need to do some kind of cataloging, and of course that data can be consumed using some kind of consumption service; and last but not least, you need to secure and govern this data.

So if we look at the overview of the architecture, this modern data architecture generally has six stages, or you can say six layers: a data ingestion layer, a data storage layer, a data cataloging layer, a data processing layer, a data consumption layer, and a security and governance layer. I will go through these layers one by one, and we will start with data ingestion: how you can ingest your data into your data platform.

Now, when we talk about this data ingestion layer, there can be different types of data sources. Your data source can be a relational database or a non-relational database. Your data source can be a file: it can be a plain CSV file or an Excel file, sometimes EDI files, or it can be a different type of file, such as an XML file.
Then there can be some data which you are getting from third-party data products: SAP can provide you some data, Salesforce can provide you some data, so there can be different third-party data products providing you data. And then custom data sources are also there: you have some custom application from which you want to get data. And these days we all know there are IoT applications; in manufacturing it's quite relevant, in retail it's quite relevant. In multiple places you can see IoT applications, and that data can be streaming. Clickstream is a very good example of this: clickstream data is streaming into your data platform quite regularly. So how are you going to ingest all these data sources into one platform?

We will start with the database data sources. I have presented two particular services here by which you can move the data into the data platform. Please note that I have mentioned the destination as S3 and Redshift. We will discuss Redshift later on: Redshift is our data warehouse, and when I am talking about S3, please think of it as a data lake; S3 is the storage for our data lake. So just think that I am talking here about a data lake and a data warehouse.

Let's say you have a relational database and you want to move that data into the data platform, which means moving it either to a data lake or to a data warehouse. How are we going to move it? The first service available is AWS Database Migration Service, which we call AWS DMS. Whether your database is residing on premises or in any cloud, AWS DMS can read the data and move it to S3 and Redshift. This service can even be used for replicating your data: for example, if you have a SQL Server on the source side, DMS can also move this data to a SQL Server instance on the AWS side. But in this session we are talking more about building a data platform, an analytical platform, so I will concentrate on moving the data to a data lake or a data warehouse. DMS can help you do that. It is a true CDC (change data capture) kind of service: the first time it will do a full load, and later on it moves the deltas. And it is a near real-time service, which means any time a change happens in your database, it is moved to your destination almost immediately; when I say immediately, there will be a small gap, some 2, 3 or 5 seconds.

Now let's say you don't have a requirement to move the data from your database to your data lake or data warehouse immediately, or you want to do some kind of batch movement of data. Then there are different services available, and I will talk a little bit about AWS Lake Formation blueprints. I am going to speak about Lake Formation later in my presentation, but there is a feature in the Lake Formation service which we call an AWS Lake Formation blueprint.
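To make the DMS part a bit more concrete, here is a minimal boto3 sketch of creating a full-load-plus-CDC replication task. This is an illustration, not something from the slides: it assumes the replication instance and the source/target endpoints already exist, and the ARNs and the "sales" schema name are placeholders.

```python
import boto3

dms = boto3.client("dms")

# Hypothetical ARNs -- in practice you create the replication instance and
# endpoints first and plug their ARNs in here.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="orders-full-load-and-cdc",
    SourceEndpointArn="arn:aws:dms:ap-south-1:123456789012:endpoint:source-db",
    TargetEndpointArn="arn:aws:dms:ap-south-1:123456789012:endpoint:target-s3",
    ReplicationInstanceArn="arn:aws:dms:ap-south-1:123456789012:rep:instance",
    # 'full-load-and-cdc' = initial full copy, then ongoing change data
    # capture -- the "full load first, deltas later" behavior described above.
    MigrationType="full-load-and-cdc",
    TableMappings="""{
      "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales-schema",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include"
      }]
    }""",
)
print(response["ReplicationTask"]["Status"])
```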
Now, what does this blueprint do? If your database is MySQL or Postgres, it can read the data and move it into the data lake. It cannot move it to a data warehouse, but of course if you want to create a data lake, Lake Formation can help you there. The good thing is that it is a batch kind of service: let's say you have 100 tables in a database; behind the scenes it will create 100 jobs, and at a configured interval it will move all the data into S3. Of course, in S3, in the data lake, you can decide in which format you want to store the data, be it Parquet format, ORC format, or a Hudi catalog, or anything; you can decide that here, and this Lake Formation feature can help you with it.

So one service helps you with real-time migration and another helps you with batch migration, but please understand that with both of these services you move the entire database. What if I say you don't want to move the complete database, but rather only certain tables? For that we have a service called Glue. Glue is a serverless service; behind the scenes a Spark cluster is running, and it can help you with ETL. There is a visual layer on top of that Spark engine, built by AWS (Glue Studio), which you could call low-code: you can drag and drop things and do your ETL very easily. Then, if you are a business analyst and don't want to write any code, we have another feature in Glue which we call Glue DataBrew, with which you can prepare your data and do basic and even some advanced-level ETL. And of course with Glue you can also do data replication, because behind the scenes it's a Spark cluster; when you use the Glue visual tool, you are dragging and dropping those ETL steps, and behind the scenes it generates PySpark code for you. So whatever you can do with Spark, you can of course do here: data replication or your ETL processing. In fact, the Lake Formation blueprints I just talked about are actually creating Glue jobs behind the scenes. And your destination can be a data lake, a data warehouse, or even a normal database; Glue can help you achieve all of these. In this presentation you are going to hear this word Glue multiple times, because for us Glue is not just an ETL tool; there are other features in Glue which I am going to talk about later in this presentation.
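As a rough illustration of the kind of PySpark that the Glue visual editor generates behind the scenes, here is a minimal sketch of a Glue job that reads a single table (not the whole database) from the Glue Data Catalog and writes it to S3 as Parquet. The database name, table name and bucket path are hypothetical.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read just one table via the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="salesdb",     # hypothetical catalog database
    table_name="orders",    # hypothetical table
)

# Write the curated output to the data lake in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/orders/"},
    format="parquet",
)

job.commit()
```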
Moving forward to another data source, which can be a file. Your files can be residing on your own premises: let's assume there are some files available in your data center and you want to move them to the cloud. How can you move them? We have our DataSync service, where we generally install a DataSync agent on your premises; this agent can copy your data and move it to S3. There are Snow devices available as well: if you have a large chunk of data, you have some bandwidth issue, and you want to move the data to the AWS cloud faster, we generally ship a Snow device to you, you copy your data onto it, and once we receive the device back, we copy the data into S3 or EFS or wherever you want. It is not immediate; the whole process can take around 10 to 15 days. Please note that Snow devices are generally used only when you have a huge chunk of data that would otherwise take months, or at least a couple of months, to transfer.

Now, moving on to SaaS application data. We know that in an enterprise there can be different ERP systems: it can be SAP, it can be Dynamics, it can be Oracle, it can be anything. If you want to move that data into AWS, we have a no-code solution available, which is Amazon AppFlow. It's a no-code solution, so you just have to configure your connection, provided that your SaaS application has connectivity with this particular service. It will read the data and can move it to S3 or Redshift; of course there can be other destinations as well, but today I am only talking about S3 and Redshift. There are 100-plus connectors currently available in AppFlow, and the number is growing every year. So if, say, you have data in SAP HANA, AppFlow has an SAP HANA connector which can read that data and move it to your data platform.

Now, this one is very important: third-party data sources. Let's take the example of a retail store. Say that on a Sunday you suddenly see that the footfall in a mall is very low. Generally we have seen that on weekends footfall tends to be higher, because people have a holiday and they want to go shopping or to a movie with their family. But suppose on a certain Sunday, or a certain Saturday, or a certain holiday, you don't have that much footfall; in fact it has actually decreased. Why? If you want to analyze this kind of data and understand the pattern, you might need third-party data: it may be possible that it was raining heavily that day, and that's why people did not come out to visit the mall. How will you get that third-party data? We have a service called AWS Data Exchange. Data Exchange is a kind of marketplace where you can buy data, and even if you have data that you want to publish and sell, you can use this service as well. Say, for example, you want to buy some weather data: with this service you can push it directly to Redshift or to a data lake, and there are also APIs by which you can consume the data. Data Exchange is integrated with our data warehouse and data platform, so you can use it there.

Now, the last source for this ingestion layer is streaming data. Think about clickstream data or IoT data: how are you going to ingest it into your data platform? We have different services catering to different requirements for streaming data sources. We have our own IoT platform, IoT Core: if you have IoT data, you can use IoT Core. We have Kinesis: if you have video streams, you can use Kinesis Video Streams, and if you have textual data streaming, like clickstream data, then you can use the Kinesis data services.
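For instance, a clickstream producer can push events into a Kinesis data stream with just a few lines of boto3. This is only a sketch: the stream name and the event payload are made up for illustration.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# A hypothetical clickstream event captured from the website.
event = {
    "user_id": "u-10293",
    "action": "add_to_cart",
    "item": "phone-case",
    "ts": "2024-01-15T10:32:04Z",
}

kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    # Partitioning by user keeps one user's events ordered on one shard.
    PartitionKey=event["user_id"],
)
```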
And of course there is DMS, as we discussed in the previous slides: for real-time ingestion or real-time migration of your relational or database data, you can use DMS, and DMS also supports some non-relational databases like MongoDB. Then we have Amazon MSK and MSK Connect: MSK is managed Apache Kafka, so if you have some application which is pushing or publishing data into Kafka, you can use this service as well. With these services you can ingest data into Redshift or S3. There is also the OpenSearch Service: if you want to create some kind of search engine, or do some log aggregation, this service can be used, and these days it is also used as a vector DB. So you can use these services for ingesting data into S3 or Redshift, and when I say S3, as I said, it is really about the data lake.

So now the first layer is done; we will move to data storage. As I said, the data storage can be your database: let's say you have data in MySQL residing on premises and you want to move it to the AWS cloud; of course you can use DMS or Glue, either way you can move the data. But today, as I said, we are discussing a data platform from an analytics perspective, so we will talk about the data lake and the data warehouse. In my previous diagram I showed you that first there will be a landing zone, then you will be processing the data, and then there will be a curated zone. That curated zone can either be a data lake, or a data warehouse, or a combination of both; these days we call that a lakehouse architecture, with some data in a data warehouse and some data in a data lake. So your storage layer can be a data lake or Redshift if you are building a data platform.

Now let's discuss a bit about Redshift, which is our managed offering for a data warehouse. Redshift is essentially a columnar database. All those relational databases you have heard about so far, like MySQL, Postgres, SQL Server, Oracle, are actually transactional, row-based databases. Redshift is a column-based database, and it's actually used for analytical purposes. Just to highlight the difference for everyone: a row-based database is designed for a transactional system, where data records change quite frequently; there, a row-based database generally works better. But if you want an analytical system, meaning you are analyzing historical data and the data changes less frequently (that's why we've seen people want to move data in batches), then a columnar database, big data, or a data lake comes into the picture. Redshift is actually storing the data in a columnar format, which means aggregations and analytical queries will run faster in Redshift.

Now, there are a few features I want to highlight for Redshift. The first is Spectrum; you can see it here. Just a few minutes ago I talked about a lakehouse architecture.
So let's say you have some data stored in a data lake and some data stored in a data warehouse, and you want to execute a query which gets some data from Redshift and some data from your data lake. How can you do it? Ultimately, your analytics app or your BI app will execute a query on Redshift; inside Redshift there is some configuration for external tables that you have to do. Redshift will get the data from its own tables, then create a Spectrum query which executes on your data lake, and it can give you a combined result. What I mean to say here is that even if your data is residing in S3 and Redshift combined, you can have a single query by which you can read both.

Now, which data should you move to Redshift and which to the data lake? There can be different design patterns for this. Generally, for example, if you have ten years of data and you frequently use only two years of it, that frequently used data you can move to Redshift, and the non-frequent data you can move to S3, because the storage cost of S3 is generally lower, and of course you can use different S3 tiers to save even more cost. This way you can decide which data you want to move to the data lake and which to the data warehouse; either way it will be fine.

Now let's discuss something about federated query. Say you don't want to move your data: it is still residing in your traditional database, which can be SQL Server, Postgres or anything, and you don't want to, or can't, move it. A data warehouse system is generally meant for historical analytics: if you want a report on data committed to the system 30 or 40 minutes before, Redshift will give you a better result. But if you want to do some kind of real-time analytics, or you can't move the data, then you can run a federated query using Redshift: the query is fired in Redshift, but underneath, Redshift queries your operational database.

Then, of course, there is another feature I want to highlight: ML and analytics. There is a feature called Redshift ML available in Redshift. You can create ML models by writing SQL statements, so you might not need Python skills to create those models: you just write something like a CREATE MODEL statement, and it creates an analytics model for you. There is a certain set of algorithms available in Redshift on which you can base those models, and because your data is already residing in Redshift or the data lake, it can train on that Redshift or data lake data. And once the model is trained, the inference part, meaning the prediction part, can be deployed on your Redshift cluster itself, saving you cost: you don't need a separate cluster for that. Of course that decision is yours to take, but this is also possible with Redshift.
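To tie the Spectrum idea to something concrete, here is a small boto3 sketch using the Redshift Data API to run one query that combines a local Redshift table with an external Spectrum table over S3. Everything here is hypothetical: the cluster name, the schemas and tables, and it assumes the external schema was already set up beforehand with CREATE EXTERNAL SCHEMA.

```python
import boto3

# The Redshift Data API lets you run SQL without managing connections.
rsd = boto3.client("redshift-data")

# One query spanning both stores: recent orders live in Redshift,
# historical orders live in S3 as Parquet behind a Spectrum schema.
sql = """
SELECT region, COUNT(*) AS order_count
FROM (
    SELECT region FROM sales.orders_recent          -- data in Redshift
    UNION ALL
    SELECT region FROM spectrum_schema.orders_hist  -- data in the S3 data lake
) AS all_orders
GROUP BY region;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    Database="dev",
    DbUser="analyst",
    Sql=sql,
)
# Statement id; poll describe_statement / get_statement_result for output.
print(resp["Id"])
```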
There are other features coming to Redshift as well. One is the zero-ETL feature: for example, if your data is already in Aurora (Aurora is our managed database cluster, compatible with Postgres and MySQL) and you want to move it to Redshift, going forward you might not need a separate ETL service like Glue or DMS for that; you can move it directly to Redshift. We are generally heading toward this zero-ETL kind of experience. Currently this zero-ETL feature is available for Aurora PostgreSQL and Aurora MySQL; I need to check, but I believe it is not available in India yet; sooner or later it will be.

So now the data storage layer is done. What about data cataloging? The third part is data cataloging. Data cataloging is nothing but how to make sense of the data. For example, our minds are trained to see data in a relational format. Most people don't know how Oracle, or Postgres, or SQL Server internally store the data; what we generally do is query it with a SQL statement, and then we see the data in a tabular format. Behind the scenes, whether they store it in one file format or another, most people don't know. So when we say cataloging, we mean how to visualize the data in a certain format, typically a relational format; our minds are generally trained like that. For Redshift, you create the table structure first and then ingest the data into those tables; those schemas you can consider the data catalog. But what about the data lake? You have data in S3, but how do you use it? You need to have a relational structure on top of it. How are you going to create that?

Remember I told you I was going to use this word Glue again and again. Here, again, we build the data catalog on top of Glue. Glue is not just about ETL; it also has a feature for a data catalog. By default this data catalog is built on Hive. Hive-style tables are immutable, so you cannot change the data in place, but the Glue Data Catalog also supports Hudi and Iceberg, those table formats which can have ACID transactions as well. So using Glue you can build a data catalog on top of your data lake. And generally, if you are building a data lake, you are storing data in S3, so your storage cost will be very low; and if you then build this data catalog using Glue, it generally costs only a few dollars per million objects per month. We can definitely check the price, but of course it's quite low. So cataloging your data doesn't add much cost on AWS, and you can choose your format: whether you want Hive, Iceberg or Hudi, you can use this Glue Data Catalog feature.

So, on one side of the spectrum is actually Redshift, which is our data warehouse. The same thing you can build as a data lake: again the data can be stored in a columnar format (columnar formats are also possible in a data lake, such as Parquet or ORC), and on top of that you can build an ACID data lake, which can be an Iceberg data lake or a Hudi data lake, using Glue.
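As an illustration of how a catalog table actually gets created over files in S3, here is a minimal boto3 sketch that sets up a Glue crawler; the crawler name, bucket path, database name and IAM role are all placeholders.

```python
import boto3

glue = boto3.client("glue")

# A crawler scans the S3 prefix, infers the schema, and registers a table
# in the Glue Data Catalog, so engines like Athena, Redshift Spectrum and
# EMR can query the files as if they were a relational table.
glue.create_crawler(
    Name="orders-crawler",  # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="salesdb",  # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/curated/orders/"}]},
)

glue.start_crawler(Name="orders-crawler")
```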
Now, the data processing layer. Of course, you have ingested the data, and now you want to do some kind of cleaning. How can you do that? We have services called EMR and Glue. I have already discussed Glue a lot: it can be used as an ETL service, it can be used as a catalog, and apart from that it has its own visual tool, so if you don't want to write code, you can drag and drop things and create your ETL pipelines. But if you are a geek and you want to write some PySpark code yourself, you can also use EMR. EMR is nothing but managed Hadoop: in managed Hadoop you can install Spark, run your Spark code on top of it, and process your data. And the destination can be anything: it can be S3, it can be your data warehouse, it can even be a normal database, and you can even send the data to OpenSearch.

Moving ahead. So now you already have an ETL tool, you already know how to ingest the data, how to catalog it, and how to process it. But while processing, there can be a business problem as well: say you have 20 or 30 tables and you want to move the data in a certain sequence, or say you are ingesting data into a table, then you want to do some kind of work, and only after that should another job run. How are you going to orchestrate this? There are a couple of services provided by AWS. The first one is Amazon Managed Workflows for Apache Airflow (MWAA), which is just a managed cluster for Apache Airflow; if you want to go with an open source tool, you can go with this one, and of course there will be a cost associated with it. The other one is Step Functions: if you are using AWS native services and you want to create some kind of workflow, you can create it using AWS Step Functions.
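As a small sketch of the MWAA route, here is a hypothetical Airflow DAG that runs two Glue jobs in sequence (clean first, then aggregate). The job names are made up, and it assumes a recent Airflow 2.x with the Amazon provider package installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Orchestrate two Glue jobs so the aggregate job only runs
# after the cleaning job has finished successfully.
with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    clean = GlueJobOperator(
        task_id="clean_orders",
        job_name="clean-orders-job",      # hypothetical Glue job
    )
    aggregate = GlueJobOperator(
        task_id="aggregate_orders",
        job_name="aggregate-orders-job",  # hypothetical Glue job
    )

    clean >> aggregate  # enforce the sequence
```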
Now we'll discuss data consumption. When we talk about data consumption, of course you can consume your data via ML models, via APIs, and via your BI tool. I will discuss a bit about our BI tool, which is QuickSight. Let's say you have a business problem where you want to embed your reports in your application; or you have a multi-tenant solution and you want to expose datasets so that your tenants can create the reports themselves; or you want to run an NLQ (natural language query), just a plain English question. Our QuickSight solution provides all of these things in one service, and it is priced per user. There is a price for an author, who creates the report, and a price for a reader, who reads the report, and the good thing is that reader pricing is session-based. Say the NLQ feature is not enabled and you have a reader: it is priced at around 30 cents per session, capped at $5 per month. So if a user reads the report three or four times, say three times, only 90 cents of cost will apply; but if a frequent user opens the report 10 times every day, 300, 400 or even 1,000 times in a month, the maximum charge will be $5 per month. That is for a named user, a user who is registered with QuickSight. But it is possible that you want to build a public dashboard, where you don't know who the users will be; a capacity-based pricing is also available with QuickSight. With QuickSight you can do 100% white labeling, it's a drag-and-drop kind of tool, and it has in-memory computation available as well (SPICE): you can pull up to 1 TB of data into QuickSight's memory, and the in-memory calculations will be very fast. All this is possible with QuickSight.

So now the data consumption layer is complete. I am not going to talk much about ML; of course we have different services for ML as well. I am not going to talk about APIs either; of course you can create some kind of microservices or a GraphQL API on top of your data layer. But yes, there can be different means by which you consume your data.

Lastly, security and governance. I told you I was going to come back to this Lake Formation service. Lake Formation is a very important service for us, and with it you can create a data mesh architecture. That means you have different departments, and each department wants to share data with the data platform, but wants to put some kind of restriction on what they share; and there will be consumers consuming that data, and you don't want certain consumers to access certain data. That kind of restriction you can put using Lake Formation: which data will be accessible to whom, and which data will be seen by which department. You can also share your data with different accounts using Lake Formation, and you can put your security rules there. And of course Lake Formation has the blueprint feature I mentioned, by which you can ingest data. There are other services as well by which you can govern the data: AWS Backup is there if you want to do some kind of business continuity, or back up data at certain intervals; if you want to do auditing, CloudTrail is there; if you want to see error logs, CloudWatch is there; and if you want to secure or hide sensitive data, Macie is there. So, as I said, there are different services in AWS which can help you with security, governance and compliance for your data.
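To make the Lake Formation part a bit concrete, here is a minimal boto3 sketch granting a consumer read access to just one catalog table. The account ID, role name, database and table names are placeholders, not values from the presentation.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant a consumer role SELECT on a single table in the catalog.
# Anything not granted stays invisible to that principal -- the
# "which data is seen by whom" restriction described above.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalArn": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "Table": {
            "DatabaseName": "salesdb",  # hypothetical catalog database
            "Name": "orders",           # hypothetical table
        }
    },
    Permissions=["SELECT"],
)
```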
We'll skip this slide. Now, the last thing is the reference architecture. If you look at this architecture, you can have multiple data sources; you can use DMS, DataSync, the Kinesis services, Kafka, IoT Core or AppFlow to ingest the data into S3, and of course you can use Lake Formation to govern the data. Once the data is there, you can process it using Glue or Glue DataBrew, and you can also put an EMR cluster here. Once the data is processed, the destination can be Redshift, or even an S3 data lake. You can consume it via QuickSight, you can move it to the OpenSearch Service if you want to create some kind of search engine, and of course you can consume it via ML models as well. You can do all these things using AWS services. This is just a reference architecture; of course you can have a different architecture, depending on your business case or its complexity, but in a nutshell this is the overall picture for building a data architecture on AWS. With this, I'd like to finish this presentation. Thank you very much.