I am going to take you through some details so you can get your hands dirty with it. My name is Amit, I work for Autodesk, and if you have seen the re:Invent videos, Autodesk is one of the first customers already using Glue in production. Of course I am going to use some diagrams, and of course I am not violating any copyrights, so that is my disclaimer.

The talk today covers three things. First, ETL itself. Maybe you are using ETL just for dashboarding, or for machine learning, or for some other analytical purpose; whatever you are doing, you are going to use ETL: extract, transform, and load. From all your production data applications, whatever the source of the data, you bring it in and do some analytics: you do the extraction, you transform it, and you load it into the destination where you want to run your analytics. Glue is a managed service from Amazon, and it takes away a lot of the worries around schema changes and so on; I am going to cover that. The second thing I am going to cover is the Glue components, because that is how you are going to try your hands at it. And the third is that it is serverless. I think you all know what serverless means: it saves cost by spinning up a cluster and destroying it after use. I will cover that too, but I won't dwell on it.

This diagram will look familiar from many of your ETL and analytics pipelines; it is very general. On the left-hand side you see the data sources, and over the last decade the formats of data have changed drastically. You have new formats, so the people already doing ETL have to write scripts to understand the schema and put the data into a target data source. You can see relational databases, semi-structured data like JSON and CSV, and structured data as well. What you do today is take these data sources, run your ETL jobs or tools, put the data into some staging area, and do your transformations. Glue is going to help you on the ETL portion only.

You may ask: I already have ETL tools available, Informatica is one of the leaders, so why can't I use those tools? Of course you can; Glue is not going to replace them. The point is that there is a lot of heavy lifting when you want to extract data from a data source. You have to understand the schema, you have to write scripts to parse it, and on the target side you have to create schemas, new tables, new databases. And think of the days when you have a change in your source schema: all the downstream systems are impacted. I am going to show that. This heavy lifting is like the courier delivery boy: handing over the packet is a two-minute task, but he has to travel to the house or the office or wherever, and that is what takes the time. That heavy lifting is around 70% of what you invest in ETL today. What Glue does is help with exactly that. And as I already said, if there is a schema change, your whole pipeline, every component inside it, is forced to change as well.
The next thing is that you have to maintain the infrastructure. We always forget about this in the big data world. We say: we have Hadoop, we have Spark, we can already run the job very fast. Of course you can, and the problem is gone for a while, but the real problems, the schema changes and the heavy lifting, are still there. And anyone already working on Spark knows they have to trigger the jobs, schedule the jobs, and maintain the different components. You submit the job and Oozie and all those things kick in. Then who is going to provide you the fleet of infrastructure? You have to maintain it yourself, or maybe you are using EMR, which runs the job and then terminates the cluster, but you are still the one configuring which components you want. There is a lot of heavy lifting there again.

This is Glue, and what you see here is going to help you out, especially the "discover" part I highlighted; that is the beauty of this product. So how does it work? You just point Glue at the data source and it scans it and creates a catalog; Glue maintains its own catalog. Even before you write any schema, it goes to your source data, scans the tables and schemas, and creates tables for you in the Glue catalog. Using that, you can explore the data right away. You may have some unformatted data, maybe JSON files with hundreds of thousands of records or entities; without these kinds of tools it is very difficult to explore. Data that is sitting there but not being used for analytics is called dark data. Glue helps with this: it scans very quickly, and the Glue component that does it is the crawler. I will show you how it works.

The next thing is hand-coded ETL jobs, and of course nobody can take those away. You have to do complex data integration, joins, and many transformations; that is what an ETL developer does every day, and Glue is not taking that away. Glue gives you more features, flexible code, and most of the common transformations built in.

The third one is serverless. When you trigger the crawler, which scans the schema in the data source, or when you start a Glue job, it spins up a cluster behind the scenes running Spark, and once the job is completed, or the crawler has scanned and cataloged the data, the cluster goes away and is terminated, so you don't get charged. So: the heavy lifting, automated by Glue through components like the catalog and the crawler; second, how you do the job authoring; and third, the infrastructure. Those are the three things.

Now let's talk about the first one. In the diagram on the screen, on the left is your data source, which may be anything; for the sake of this presentation I have S3, which may hold JSON, CSV, Parquet, and other formats. Next is the Glue crawler, and to the right of that is the Data Catalog.
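If you would rather script this than click through the console, configuring and running a crawler looks roughly like this with boto3. This is a minimal sketch; the crawler name, role, database, and S3 path are all placeholders I made up for illustration:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point a crawler at an S3 prefix; the crawler infers the schema and
# writes tables into the Glue Data Catalog database named below.
# "GlueServiceRole", "sales_raw", and the S3 path are placeholders.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="GlueServiceRole",
    DatabaseName="sales_raw",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/sales/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",     # pick up schema changes
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",  # don't silently drop tables
    },
)

# Run it on demand; it spins up, scans, catalogs, and shuts down.
glue.start_crawler(Name="sales-raw-crawler")
```

On a successful run the crawler writes or updates tables in that catalog database, and those tables are what you query from Athena next.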
What the crawler does is scan the data source and create the schema in the catalog, and right after that you can explore the data using Athena, or if you have Zeppelin you can even do visualization. And you have not written a single line of code yet; the crawler has done it. I have hidden most of the things on this slide, but see the rectangle on the left-hand side: that is the database component, where the crawler keeps all the schemas, and next to it you see the crawler itself. This screen is from after the crawler has already scanned the data source, inferred the schema, and created the tables for you. If you look at the red rectangle on the right side, those are the classifications. You can see a list there: RDS, NoSQL databases, other Amazon services, and formats like CSV and Parquet. The crawler crawls the data source and creates these tables, and right after that you can query any of them using Athena.

It understands most things, like partitions: if your data is already partitioned there, it understands that well. And if you are already using Hive, which is at the center of today's data warehousing and big data stacks, the catalog is Hive-compatible; you can migrate from one to the other. That is how powerful it is.

But you might say: you're kidding, this only works if the data is already very well organized, like DynamoDB or something. So what I tried here, as you can see, is an array. The people doing ETL work understand how arrays cause problems; how many times has Hive told you "no schema definition" or something like that? I made it even more complex: I used a struct whose internal items include an array. Then I ran the crawler. Did I do anything else? No, I didn't write a single line of code. It automatically identified the JSON format, understood the schema, and resolved it. You can see on the right-hand side that the lower diagram is flattened: the crawler flattened the structure and created the table for you, and right after that you can explore it. That is what Glue is, and that is the 70% of the work. I haven't seen a crawler take more than two minutes. The last thing I did was crawl my SQL Server ODS data warehouse; the ODS component is on-premises, and I ran the crawler from the AWS cloud. It created almost 1,500 tables and it hardly took three minutes. The crawler is that fast. And if you want to do something else, like development or visualizations, just create a development endpoint to Glue and you can do all of that.

If that still does not satisfy your need, say you have an even more complex, customized data type that you created for your organization, Glue provides classifiers: you can supply Grok patterns, which are regular-expression-based, to tell the crawler "this is what the data is, and this is how you should scan it and create the table back in the catalog."
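Scripted, a custom Grok classifier looks roughly like this; the classifier name, log pattern, role, and paths below are all made-up examples, not a real format:

```python
import boto3

glue = boto3.client("glue")

# A custom Grok classifier for a hypothetical application log format.
glue.create_classifier(
    GrokClassifier={
        "Name": "app-log-classifier",
        "Classification": "app_logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)

# Attach it to a crawler so it is tried before the built-in classifiers.
glue.create_crawler(
    Name="app-log-crawler",
    Role="GlueServiceRole",          # placeholder role name
    DatabaseName="app_logs_db",      # placeholder database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/logs/"}]},
    Classifiers=["app-log-classifier"],
)
```

And going back to the nested-data point for a second: inside a Glue job you can also flatten structs and arrays yourself with relationalize on a DynamicFrame. A minimal sketch, assuming the crawler already cataloged a table (the database, table, and staging path are placeholders):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Load the catalog table the crawler created.
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_raw", table_name="orders"
)

# relationalize() flattens structs and pivots arrays out into child
# tables, keyed back to the root table.
flattened = dyf.relationalize("orders_root", "s3://my-bucket/tmp/")
for name in flattened.keys():
    print(name, flattened.select(name).count())
```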
So that is how the crawler scans and makes the data available for you, so that no data is sitting there as dark data. You can do analytics right away, you can do visualizations and all this stuff. Up to this point you have not hand-coded anything; you just configured a crawler, and now you have the tables.

The table over there has some properties Glue already created. But another problem we see in the ETL world is: hey, my table changed in the production application database. It added a new column, it deleted a column, it changed a data type from string to something else. What that does today is break the pipeline. The crawler, though, understands schema changes at the source, it is very configurable, and it creates a new version of the table. You can see over there that two versions of the table got created, and if you want to see what changed, you can see that a new column was added. It is up to you which version of the table you want to use. That is how it handles changes, and you can understand that in today's world, if something changes in a table, it impacts everything across the pipeline.

The next thing is the ETL job, the job authoring. Glue does not take it away, but gives you flexibility, built-in functions, and all the transformations. Once you configure your job, it generates PySpark code in the background, and you can change it. For development you can use Zeppelin, and there are also PyCharm integrations, so you can do whatever development you want. These are all the built-in functions; I am just running through them in the interest of time.

The last thing is how your job gets triggered. You have schedulers, and if you are familiar with cron expressions you can use those. You can set dependencies, like: only if my first job succeeds, trigger the second job. You can do that very easily, and you can also run ad hoc. And this is very much the beauty of it, I would say, the Glue triggers: if you are putting data into S3, you don't need a schedule; as soon as the data lands in S3, it can trigger the job and the job just does its work. But the problem is that S3 is eventually consistent. For those who know, you may see the data immediately or you may not; it takes time. Glue has a feature called the job bookmark. When the event triggers the job, it only processes the data that is available, those partitions or that part of the data, and on the next run, tracked through the job bookmark, it processes the rest. That is one of the really amazing things.

On the serverless side, not much more to say: people know that when you start a job it automatically spins up the cluster and destroys it after execution, and you are charged only for the time your job is actually running. But for each job you can see how many DPUs it uses; a DPU, a Data Processing Unit, is Glue's unit of compute capacity for these intensive workloads. So you can configure it. You might say: OK, the cluster is serverless, but what if my data is just one gigabyte and it spins up 20 or 100 or 1,000 DPUs behind the scenes? You can configure that very well.
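To make the bookmark point concrete, here is a minimal sketch of what a Glue PySpark job script looks like. The real generated code will differ, and the database, table, column mapping, and output path here are placeholders; the important parts are job.init/job.commit and the transformation_ctx tag, which are what let the bookmark track what has already been processed:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext.getOrCreate())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)  # begins bookmark tracking for this run

# Source: a table the crawler put in the catalog (placeholder names).
# transformation_ctx tags this node so the bookmark knows what was read.
src = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_raw", table_name="orders", transformation_ctx="src"
)

# A simple column mapping; the console generates this part for you.
mapped = ApplyMapping.apply(
    frame=src,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")],
)

# Target: write Parquet to S3 (placeholder path).
glue_ctx.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)

job.commit()  # advances the bookmark so the next run skips processed data
```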
And this gives you a feel for production: it's not that you use the console screens in production. You are going to script it, the APIs are available to you, but you configure it one time. What happens if the data is not as per your best estimate? Based on your best estimate you say 20 DPUs are well enough for my workload, but what about when it isn't? I hope AWS provides some way to derive these settings from your workload, the data volume, or something. How we did it is very normal programming: when there is no data, don't spin up the cluster. That's fine, but we feared more than that. So what we have done is use the API that returns the job runs. It gives you the statistics on how many times the job has previously run: when it started, how many DPUs were used, when it completed, whether it is still running. So you can put in your own logic, like: one instance is already running, so wait, don't start my next job run. For the people from a Java background, the object you see next is a map, exactly a Java map; for those from a Python background, it's a dictionary. You can read this object; you get it from the AWS API. So what we have done is create a kind of framework that senses this, and before triggering the actual ETL job, we decide how many DPUs we want. We made it more flexible: based on the workload, the data volume, we size our cluster. Serverless is one thing, but we did something on top of it, because we run very data-intensive, high-volume workloads. I'll show a rough sketch of this idea at the end.

So I hope I have given you all this information, and this is just ETL; there is a lot more to cover on Glue. I think I have already eaten into the other speakers' time, I talk a lot, but just two minutes more.

Here is what we are doing. First, create your database; just one line, give it a name. Now it's time to populate your tables. How do we do that? Let's tell the crawler: hey crawler, this is your name, and what I want you to do is read the data sitting in S3. Oh, you want to add another source? Yes, just add RDS or whatever, or S3; here, all my sources are S3. At the end it asks: if I encounter a schema change, what do I do? I say: OK, update it. And then at the end it asks when to run: any time. And it runs. After it ran, it populated all the tables, and you can see here it populated eight tables. It is that fast. You just click on any table in the first column on the left and go explore the data. It is that simple.

The last thing is a job. When you create the job, you have to have a source, you have to have a target, and in between you have the transformations. The screen in front of you is setting the source; now let's tell it the target database. This shows you the mapping, and you can change that mapping. At the end, it generates the PySpark code; you make whatever changes you want, set how many DPUs you want to run, and once it succeeds, it moves the data and applies the transformations you provided. After the crawler, or after the job, you go to the table there, choose that option, and you jump to Athena. There you can query it as well; Athena is also Hive-compatible and all.
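Coming back to the sensing framework I mentioned: here is roughly what the idea looks like with boto3. This is a sketch, not our production framework; the job name and DPU numbers are made up, and note that AllocatedCapacity is the older name for the per-run DPU count (newer API versions use MaxCapacity or worker settings):

```python
import boto3

glue = boto3.client("glue")
JOB_NAME = "orders-etl"  # placeholder job name

def start_if_idle(dpu_for_volume: int) -> None:
    """Skip this trigger if a run is already active; otherwise start a
    run with a DPU count chosen from the expected data volume."""
    runs = glue.get_job_runs(JobName=JOB_NAME, MaxResults=10)["JobRuns"]
    if any(r["JobRunState"] in ("STARTING", "RUNNING", "STOPPING") for r in runs):
        print("previous run still active; skipping this trigger")
        return
    # AllocatedCapacity is the DPU count for this run.
    glue.start_job_run(JobName=JOB_NAME, AllocatedCapacity=dpu_for_volume)

# e.g. a small nightly delta gets a small cluster
start_if_idle(dpu_for_volume=5)
```

The same JobRuns objects also carry StartedOn, CompletedOn, and the capacity used, which is what lets you size the next run from the history of previous ones.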
So I did not write anything else for these screens, just a couple of clicks and navigations, and my job is done. The crawler is that powerful, and there is a lot more around Glue; I covered the minimum. I hope this helps you try your hands at it. Thank you. I think we can take one question from here.